6. References
[1] O. J. Räsänen, "Computational modeling of phonetic and lexical learning in early language acquisition: Existing models and future directions," Speech Commun., vol. 54, pp. 975–997, 2012.
[2] T. Schatz and N. H. Feldman, "Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception," in Proc. CCN, 2018.
[3] C. Shain and M. Elsner, "Measuring the perceptual availability of phonological features during language acquisition using unsupervised binary stochastic autoencoders," in Proc. HLT-NAACL, 2019.
[4] M. Versteegh, X. Anguera, A. Jansen, and E. Dupoux, "The Zero Resource Speech Challenge 2015: Proposed approaches and results," in Proc. SLTU, 2016.
[5] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, "The Zero Resource Speech Challenge 2017," in Proc. ASRU, 2017.
[6] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X.-N. Cao, L. Miskic, C. Dugrain, L. Ondel, A. W. Black et al., "The Zero Resource Speech Challenge 2019: TTS without T," in Proc. Interspeech, 2019.
[7] A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," in Proc. ICASSP, 1998.
[8] J.-c. Chou, C.-c. Yeh, H.-y. Lee, and L.-s. Lee, "Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations," in Proc. Interspeech, 2018.
[9] N. Zeghidour, G. Synnaeve, N. Usunier, and E. Dupoux, "Joint learning of speaker and phonetic similarities with Siamese networks," in Proc. Interspeech, 2016.
[10] M. Heck, S. Sakti, and S. Nakamura, "Learning supervised feature transformations on zero resources for improved acoustic unit discovery," IEICE T. Inf. Syst., vol. 101, no. 1, pp. 205–214, 2018.
[11] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, "An unsupervised autoregressive model for speech representation learning," in Proc. Interspeech, 2019.
[12] W. Wang, Q. Tang, and K. Livescu, "Unsupervised pre-training of bidirectional speech encoders via masked reconstruction," in Proc. ICASSP, 2020.
[13] P.-J. Last, H. A. Engelbrecht, and H. Kamper, "Unsupervised feature learning for speech using correspondence and Siamese networks," IEEE Signal Proc. Let., vol. 27, pp. 421–425, 2020.
[14] J. L. Flanagan, Speech Analysis Synthesis and Perception. Springer Science & Business Media, 2013, vol. 3.
[15] B. Varadarajan, S. Khudanpur, and E. Dupoux, "Unsupervised learning of acoustic sub-word units," in Proc. ACL, 2008.
[16] C.-y. Lee and J. R. Glass, "A nonparametric Bayesian approach to acoustic model discovery," in Proc. ACL, 2012.
[17] M.-H. Siu, H. Gish, A. Chan, W. Belfield, and S. Lowe, "Unsupervised training of an HMM-based self-organizing unit recognizer with applications to topic classification and keyword discovery," Comput. Speech Lang., vol. 28, no. 1, pp. 210–223, 2014.
[18] C.-y. Lee, T. O'Donnell, and J. R. Glass, "Unsupervised lexicon discovery from acoustic input," Trans. ACL, vol. 3, pp. 389–403, 2015.
[19] L. Ondel, L. Burget, and J. Černocký, "Variational inference for acoustic unit discovery," Procedia Comput. Sci., vol. 81, pp. 80–86, 2016.
[20] L. Badino, A. Mereta, and L. Rosasco, "Discovering discrete subword units with binarized autoencoders and hidden-Markov-model encoders," in Proc. Interspeech, 2015.
[21] R. Eloff, A. Nortje, B. L. van Niekerk, A. Govender, L. Nortje, A. Pretorius, E. van Biljon, E. van der Westhuizen, L. van Staden, and H. Kamper, "Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks," in Proc. Interspeech, 2019.
[22] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, "Unsupervised speech representation learning using WaveNet autoencoders," IEEE Trans. Audio, Speech, Language Process., vol. 27, no. 12, pp. 2041–2053, 2019.
[23] A. Tjandra, B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, "VQVAE unsupervised unit discovery and multi-scale code2spec inverter for Zerospeech Challenge 2019," in Proc. Interspeech, 2019.
[24] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," in Proc. NeurIPS, 2017.
[25] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[26] Y. Bengio, N. Léonard, and A. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," arXiv preprint arXiv:1308.3432, 2013.
[27] J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, R. Barra-Chicote, A. Moinet, and V. Aggarwal, "Towards achieving robust universal neural vocoding," in Proc. Interspeech, 2019.
[28] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh et al., "Mixed precision training," in Proc. ICLR, 2018.
[29] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. ICLR, 2015.
[30] Ł. Kaiser, S. Bengio, A. Roy, A. Vaswani, N. Parmar, J. Uszkoreit, and N. Shazeer, "Fast decoding in sequence models using discrete latent variables," in Proc. ICML, 2018.
[31] A. Baevski, S. Schneider, and M. Auli, "vq-wav2vec: Self-supervised learning of discrete speech representations," in Proc. ICLR, 2020.
[32] A. van den Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[33] M. Rivière, A. Joulin, P.-E. Mazaré, and E. Dupoux, "Unsupervised pretraining transfers well across languages," in Proc. ICASSP, 2020.
[34] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen et al., "Libri-light: A benchmark for ASR with limited or no supervision," in Proc. ICASSP, 2020.
[35] S. Sakti, R. Maia, S. Sakai, T. Shimizu, and S. Nakamura, "Development of HMM-based Indonesian speech synthesis," in Proc. O-COCOSDA, 2008.
[36] S. Sakti, E. Kelana, H. Riza, S. Sakai, K. Markov, and S. Nakamura, "Development of Indonesian large vocabulary continuous speech recognition system within A-STAR project," in Proc. TCAST, 2008.
[37] T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, "Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline," in Proc. Interspeech, 2013.
[38] Z. Wu, O. Watts, and S. King, "Merlin: An open source neural network speech synthesis system," in Proc. SSW, 2016.
[39] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, "FFTNet: A real-time speaker-dependent neural vocoder," in Proc. ICASSP, 2018.
[40] M. Chen and T. Hain, "Unsupervised acoustic unit representation learning for voice conversion using WaveNet auto-encoders," submitted to Interspeech, 2020.
[41] D. Harwath, W.-N. Hsu, and J. Glass, "Learning hierarchical discrete linguistic units from visually-grounded speech," in Proc. ICLR, 2020.