6. References  
[1] O. J. Räsänen, “Computational modeling of phonetic and lexical learning in early language acquisition: Existing models and future directions,” Speech Commun., vol. 54, pp. 975–997, 2012.
[2] T. Schatz and N. H. Feldman, “Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception,” in Proc. CCN, 2018.
[3] C. Shain and M. Elsner, “Measuring the perceptual availability of phonological features during language acquisition using unsupervised binary stochastic autoencoders,” in Proc. HLT-NAACL, 2019.
[4] M. Versteegh, X. Anguera, A. Jansen, and E. Dupoux, “The Zero Resource Speech Challenge 2015: Proposed approaches and results,” in Proc. SLTU, 2016.
[5] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, “The Zero Resource Speech Challenge 2017,” in Proc. ASRU, 2017.
[6] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X.-N. Cao, L. Miskic, C. Dugrain, L. Ondel, A. W. Black et al., “The Zero Resource Speech Challenge 2019: TTS without T,” in Proc. Interspeech, 2019.
[7] A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech synthesis,” in Proc. ICASSP, 1998.
[8] J.-c. Chou, C.-c. Yeh, H.-y. Lee, and L.-s. Lee, “Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations,” in Proc. Interspeech, 2018.
[9] N. Zeghidour, G. Synnaeve, N. Usunier, and E. Dupoux, “Joint learning of speaker and phonetic similarities with Siamese networks,” in Proc. Interspeech, 2016.
[10] M. Heck, S. Sakti, and S. Nakamura, “Learning supervised feature transformations on zero resources for improved acoustic unit discovery,” IEICE T. Inf. Syst., vol. 101, no. 1, pp. 205–214, 2018.
[11] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, “An unsupervised autoregressive model for speech representation learning,” in Proc. Interspeech, 2019.
[12] W. Wang, Q. Tang, and K. Livescu, “Unsupervised pre-training of bidirectional speech encoders via masked reconstruction,” in Proc. ICASSP, 2020.
[13] P.-J. Last, H. A. Engelbrecht, and H. Kamper, “Unsupervised feature learning for speech using correspondence and Siamese networks,” IEEE Signal Proc. Let., vol. 27, pp. 421–425, 2020.
[14] J. L. Flanagan, Speech Analysis Synthesis and Perception. Springer Science & Business Media, 2013, vol. 3.
[15] B. Varadarajan, S. Khudanpur, and E. Dupoux, “Unsupervised learning of acoustic sub-word units,” in Proc. ACL, 2008.
[16] C.-y. Lee and J. R. Glass, “A nonparametric Bayesian approach to acoustic model discovery,” in Proc. ACL, 2012.
[17] M.-H. Siu, H. Gish, A. Chan, W. Belfield, and S. Lowe, “Unsupervised training of an HMM-based self-organizing unit recognizer with applications to topic classification and keyword discovery,” Comput. Speech Lang., vol. 28, no. 1, pp. 210–223, 2014.
[18] C.-y. Lee, T. O’Donnell, and J. R. Glass, “Unsupervised lexicon discovery from acoustic input,” Trans. ACL, vol. 3, pp. 389–403, 2015.
[19] L. Ondel, L. Burget, and J. Černocký, “Variational inference for acoustic unit discovery,” Procedia Comput. Sci., vol. 81, pp. 80–86, 2016.
[20] L. Badino, A. Mereta, and L. Rosasco, “Discovering discrete subword units with binarized autoencoders and hidden-Markov-model encoders,” in Proc. Interspeech, 2015.
[21] R. Eloff, A. Nortje, B. L. van Niekerk, A. Govender, L. Nortje, A. Pretorius, E. van Biljon, E. van der Westhuizen, L. van Staden, and H. Kamper, “Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks,” in Proc. Interspeech, 2019.
[22] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, “Unsupervised speech representation learning using WaveNet autoencoders,” IEEE Trans. Audio, Speech, Language Process., vol. 27, no. 12, pp. 2041–2053, 2019.
[23] A. Tjandra, B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, “VQVAE unsupervised unit discovery and multi-scale code2spec inverter for ZeroSpeech Challenge 2019,” in Proc. Interspeech, 2019.
[24] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Proc. NeurIPS, 2017.
[25] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[26] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
[27] J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, R. Barra-Chicote, A. Moinet, and V. Aggarwal, “Towards achieving robust universal neural vocoding,” in Proc. Interspeech, 2019.
[28] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh et al., “Mixed precision training,” in Proc. ICLR, 2018.
[29] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
[30] L. Kaiser, S. Bengio, A. Roy, A. Vaswani, N. Parmar, J. Uszkoreit, and N. Shazeer, “Fast decoding in sequence models using discrete latent variables,” in Proc. ICML, 2018.
[31] A. Baevski, S. Schneider, and M. Auli, “vq-wav2vec: Self-supervised learning of discrete speech representations,” in Proc. ICLR, 2020.
[32] A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[33] M. Rivière, A. Joulin, P.-E. Mazaré, and E. Dupoux, “Unsupervised pretraining transfers well across languages,” in Proc. ICASSP, 2020.
[34] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen et al., “Libri-light: A benchmark for ASR with limited or no supervision,” in Proc. ICASSP, 2020.
[35] S. Sakti, R. Maia, S. Sakai, T. Shimizu, and S. Nakamura, “Development of HMM-based Indonesian speech synthesis,” in Proc. O-COCOSDA, 2008.
[36] S. Sakti, E. Kelana, H. Riza, S. Sakai, K. Markov, and S. Nakamura, “Development of Indonesian large vocabulary continuous speech recognition system within A-STAR project,” in Proc. TCAST, 2008.
[37] T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, “Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline,” in Proc. Interspeech, 2013.
[38] Z. Wu, O. Watts, and S. King, “Merlin: An open source neural network speech synthesis system,” in Proc. SSW, 2016.
[39] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, “FFTNet: A real-time speaker-dependent neural vocoder,” in Proc. ICASSP, 2018.
[40] M. Chen and T. Hain, “Unsupervised acoustic unit representation learning for voice conversion using WaveNet auto-encoders,” submitted to Interspeech, 2020.
[41] D. Harwath, W.-N. Hsu, and J. Glass, “Learning hierarchical discrete linguistic units from visually-grounded speech,” in Proc. ICLR, 2020.