6. References  
[1] O. J. Räsänen, “Computational modeling of phonetic and lexical learning in early language acquisition: Existing models and future directions,” Speech Commun., vol. 54, pp. 975–997, 2012.
[2] T. Schatz and N. H. Feldman, “Neural network vs. HMM speech recognition systems as models of human cross-linguistic phonetic perception,” in Proc. CCN, 2018.
[3] C. Shain and M. Elsner, “Measuring the perceptual availability of phonological features during language acquisition using unsupervised binary stochastic autoencoders,” in Proc. HLT-NAACL, 2019.
[4] M. Versteegh, X. Anguera, A. Jansen, and E. Dupoux, “The Zero Resource Speech Challenge 2015: Proposed approaches and results,” in Proc. SLTU, 2016.
[5] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, “The Zero Resource Speech Challenge 2017,” in Proc. ASRU, 2017.
[6] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X.-N. Cao, L. Miskic, C. Dugrain, L. Ondel, A. W. Black et al., “The Zero Resource Speech Challenge 2019: TTS without T,” in Proc. Interspeech, 2019.
[7] A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech synthesis,” in Proc. ICASSP, 1998.
[8] J.-c. Chou, C.-c. Yeh, H.-y. Lee, and L.-s. Lee, “Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations,” in Proc. Interspeech, 2018.
[9] N. Zeghidour, G. Synnaeve, N. Usunier, and E. Dupoux, “Joint learning of speaker and phonetic similarities with Siamese networks,” in Proc. Interspeech, 2016.
[10] M. Heck, S. Sakti, and S. Nakamura, “Learning supervised feature transformations on zero resources for improved acoustic unit discovery,” IEICE T. Inf. Syst., vol. 101, no. 1, pp. 205–214, 2018.
[11] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, “An unsupervised autoregressive model for speech representation learning,” in Proc. Interspeech, 2019.
[12] W. Wang, Q. Tang, and K. Livescu, “Unsupervised pre-training of bidirectional speech encoders via masked reconstruction,” in Proc. ICASSP, 2020.
[13] P.-J. Last, H. A. Engelbrecht, and H. Kamper, “Unsupervised feature learning for speech using correspondence and Siamese networks,” IEEE Signal Proc. Let., vol. 27, pp. 421–425, 2020.
[14] J. L. Flanagan, Speech Analysis Synthesis and Perception. Springer Science & Business Media, 2013, vol. 3.
[15] B. Varadarajan, S. Khudanpur, and E. Dupoux, “Unsupervised learning of acoustic sub-word units,” in Proc. ACL, 2008.
[16] C.-y. Lee and J. R. Glass, “A nonparametric Bayesian approach to acoustic model discovery,” in Proc. ACL, 2012.
[17] M.-H. Siu, H. Gish, A. Chan, W. Belfield, and S. Lowe, “Unsupervised training of an HMM-based self-organizing unit recognizer with applications to topic classification and keyword discovery,” Comput. Speech Lang., vol. 28, no. 1, pp. 210–223, 2014.
[18] C.-y. Lee, T. O’Donnell, and J. R. Glass, “Unsupervised lexicon discovery from acoustic input,” Trans. ACL, vol. 3, pp. 389–403, 2015.
[19] L. Ondel, L. Burget, and J. Černocký, “Variational inference for acoustic unit discovery,” Procedia Comput. Sci., vol. 81, pp. 80–86, 2016.
[20] L. Badino, A. Mereta, and L. Rosasco, “Discovering discrete subword units with binarized autoencoders and hidden-Markov-model encoders,” in Proc. Interspeech, 2015.
[21] R. Eloff, A. Nortje, B. L. van Niekerk, A. Govender, L. Nortje, A. Pretorius, E. van Biljon, E. van der Westhuizen, L. van Staden, and H. Kamper, “Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks,” in Proc. Interspeech, 2019.
[22] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, “Unsupervised speech representation learning using WaveNet autoencoders,” IEEE Trans. Audio, Speech, Language Process., vol. 27, no. 12, pp. 2041–2053, 2019.
[23] A. Tjandra, B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, “VQVAE unsupervised unit discovery and multi-scale code2spec inverter for ZeroSpeech Challenge 2019,” in Proc. Interspeech, 2019.
[24] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Proc. NeurIPS, 2017.
[25] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[26] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
[27] J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, R. Barra-Chicote, A. Moinet, and V. Aggarwal, “Towards achieving robust universal neural vocoding,” in Proc. Interspeech, 2019.
[28] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh et al., “Mixed precision training,” in Proc. ICLR, 2018.
[29] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
[30] L. Kaiser, S. Bengio, A. Roy, A. Vaswani, N. Parmar, J. Uszkoreit, and N. Shazeer, “Fast decoding in sequence models using discrete latent variables,” in Proc. ICML, 2018.
[31] A. Baevski, S. Schneider, and M. Auli, “vq-wav2vec: Self-supervised learning of discrete speech representations,” in Proc. ICLR, 2020.
[32] A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[33] M. Rivière, A. Joulin, P.-E. Mazaré, and E. Dupoux, “Unsupervised pretraining transfers well across languages,” in Proc. ICASSP, 2020.
[34] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen et al., “Libri-light: A benchmark for ASR with limited or no supervision,” in Proc. ICASSP, 2020.
[35] S. Sakti, R. Maia, S. Sakai, T. Shimizu, and S. Nakamura, “Development of HMM-based Indonesian speech synthesis,” in Proc. O-COCOSDA, 2008.
[36] S. Sakti, E. Kelana, H. Riza, S. Sakai, K. Markov, and S. Nakamura, “Development of Indonesian large vocabulary continuous speech recognition system within A-STAR project,” in Proc. TCAST, 2008.
[37] T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, “Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline,” in Proc. Interspeech, 2013.
[38] Z. Wu, O. Watts, and S. King, “Merlin: An open source neural network speech synthesis system,” in Proc. SSW, 2016.
[39] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, “FFTNet: A real-time speaker-dependent neural vocoder,” in Proc. ICASSP, 2018.
[40] M. Chen and T. Hain, “Unsupervised acoustic unit representation learning for voice conversion using WaveNet auto-encoders,” submitted to Interspeech, 2020.
[41] D. Harwath, W.-N. Hsu, and J. Glass, “Learning hierarchical discrete linguistic units from visually-grounded speech,” in Proc. ICLR, 2020.