year: 2021
paper: https://arxiv.org/pdf/2107.10394.pdf
website: https://starganv2-vc.github.io/
code: https://github.com/MaxWolf-01/real-time-vc
connections: GAN, style-transfer, cycle-consistency, voice conversion
Takeaways
Style Encoder
Learns the style of a target speaker and encodes this style information in a “style-vector”.
Mapping Network
The mapping network can be used to create the style vector instead of deriving it from a reference mel-spectrogram via the Style Encoder. Its input latent code is sampled from a Gaussian distribution (see the sketch below).
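A rough sketch of how this sampling path could look in PyTorch. The layer sizes, module names, and number of domains are illustrative assumptions, not taken from the paper or the official code; the key idea is a shared trunk followed by one projection head per speaker domain.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps a random latent code z to a domain-specific style vector (illustrative sizes)."""
    def __init__(self, latent_dim=16, style_dim=64, num_domains=20):
        super().__init__()
        # Shared trunk across all domains.
        self.shared = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
        )
        # One projection head per speaker domain.
        self.heads = nn.ModuleList(
            [nn.Linear(512, style_dim) for _ in range(num_domains)]
        )

    def forward(self, z, y):
        h = self.shared(z)                                           # (B, 512)
        out = torch.stack([head(h) for head in self.heads], dim=1)   # (B, num_domains, style_dim)
        return out[torch.arange(out.size(0)), y]                     # style vector for each target domain

M = MappingNetwork()
z = torch.randn(4, 16)               # latent codes sampled from N(0, I)
y_trg = torch.randint(0, 20, (4,))   # target speaker / domain indices
style = M(z, y_trg)                  # one style vector per sample
```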
Generator Encoder (Content encoder)
Filters the spectrogram for only the linguistic / speech content
Generator Decoder
Turns the latent representation, the features of the F0 network, and the style vector into output mel-spectrograms (see the AdaIN sketch below).
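The style vector is injected into the decoder via adaptive instance normalization (AdaIN), as noted in the Figure 1 caption further down. A minimal, hypothetical sketch of one AdaIN step in PyTorch; the shapes and block structure are assumptions, not the paper's exact decoder:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: normalize per channel, then scale/shift with the style."""
    def __init__(self, style_dim, num_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.fc = nn.Linear(style_dim, num_channels * 2)  # predicts per-channel gamma and beta

    def forward(self, x, s):
        gamma, beta = self.fc(s).chunk(2, dim=1)          # (B, C) each
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta

# Toy decoder step: content + F0 features modulated by the style vector.
adain = AdaIN(style_dim=64, num_channels=256)
h = torch.randn(4, 256, 20, 40)   # concatenated h_src / h_f0 feature map (illustrative shape)
s = torch.randn(4, 64)            # style vector from the mapping network or style encoder
out = adain(h, s)
```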
Losses
Training objectives
The discriminator (source classifier) tries to identify which source domain the generated spectrogram originally belonged to. It does not care about the target domain.
The generator converts spectrograms into a chosen target domain. It is successful if the source classifier believes the generated samples come from the target domain, even though they were actually converted from a different source domain (see the source-classifier sketch under the adversarial source classifier loss below).
2. Method (in LaTeX)
2.1 StarGANv2-VC
StarGAN v2 [10] uses a single discriminator and generator to generate diverse images in each domain with domain-specific style vectors from either the style encoder or the mapping network. We have adapted the same architecture for voice conversion, treated each speaker as an individual domain, and added a pre-trained joint detection and classification (JDC) F0 extraction network [13] to achieve F0-consistent conversion. An overview of our framework is shown in Figure 1.

Generator. The generator $G$ converts an input mel-spectrogram $X_{src}$ into $G(X_{src}, h_{sty}, h_{f0})$, which reflects the style in $h_{sty}$, given either by the mapping network $M$ or the style encoder $S$, and the fundamental frequency in $h_{f0}$, provided by the convolutional layers of the F0 extraction network $F$.

F0 network. The F0 extraction network $F$ is a pre-trained JDC network [13] that extracts the fundamental frequency from an input mel-spectrogram. The JDC network has convolutional layers followed by BLSTM units. We only use the convolutional output $F_{conv}(X)$ as the input features $h_{f0}$.

Mapping network. The mapping network $M$ generates a style vector $s = M(z, y)$ from a random latent code $z \in \mathcal{Z}$ in a domain $y \in \mathcal{Y}$. The latent code is sampled from a Gaussian distribution to provide diverse style representations in all domains. The style vector representation is shared for all domains until the last layer, where a domain-specific projection is applied to the shared representation.

Style encoder. Given a reference mel-spectrogram $X_{ref}$, the style encoder $S$ extracts the style code $s = S(X_{ref}, y)$ in the domain $y$. Similar to the mapping network $M$, $S$ first processes an input through layers shared across all domains. A domain-specific projection then maps the shared features into a domain-specific style code.

Discriminators. The discriminator $D$ in [10] has shared layers that learn the common features of real and fake samples in all domains, followed by a domain-specific binary classifier that decides whether a sample is real in each domain $y \in \mathcal{Y}$. However, since the domain-specific classifier consists of only one convolutional layer, it may fail to capture important domain-specific features such as the pronunciations of a speaker. To address this problem, we introduce an additional classifier $C$ with the same architecture as $D$ that learns the original domain of converted samples. By learning which features still give away the original domain even after conversion, the classifier $C$ can provide feedback about features that survive conversion yet are characteristic of the original domain, upon which the generator should improve to produce samples more similar to the target domain. A more detailed illustration is given in Figure 2.
2.2 Training Objectives
The aim of StarGANv2-VC is to learn a mapping $G$ that converts a sample $X_{src}$ from the source domain $y_{src} \in \mathcal{Y}$ into a sample in the target domain $y_{trg} \in \mathcal{Y}$ without parallel data.
During training, we randomly sample a target domain $y_{trg} \in \mathcal{Y}$ and a style code $s$, either via the mapping network, $s = M(z, y_{trg})$ with a latent code $z \in \mathcal{Z}$, or via the style encoder, $s = S(X_{ref}, y_{trg})$ with a reference input $X_{ref}$. Given a mel-spectrogram $X$, the source domain $y_{src} \in \mathcal{Y}$ and the target domain $y_{trg} \in \mathcal{Y}$, we train our model with the following loss functions:
Adversarial loss[^1]
The generator takes an input mel-spectrogram $X$ and a style vector $s$ and learns to generate a new mel-spectrogram $G(X, s)$ via the adversarial loss
$$\mathcal{L}_{adv} = \mathbb{E}_{X, y_{src}}\left[\log D(X, y_{src})\right] + \mathbb{E}_{X, y_{trg}, s}\left[\log\left(1 - D(G(X, s), y_{trg})\right)\right]$$
where $D(\cdot, y)$ denotes the output of the real/fake classifier for the domain $y$.
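A hedged PyTorch sketch of this loss, assuming a discriminator `D(x, y)` that returns a real/fake logit for domain `y` and a generator `G(x, s)`. The BCE-with-logits form is used here for numerical stability, which may differ in detail from the official implementation:

```python
import torch
import torch.nn.functional as F

def d_adv_loss(D, G, x_real, y_src, y_trg, s):
    """Discriminator side: real samples in their own domain vs. fakes in the target domain."""
    real_logit = D(x_real, y_src)
    with torch.no_grad():
        x_fake = G(x_real, s)           # generator is not updated in this phase
    fake_logit = D(x_fake, y_trg)
    loss_real = F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
    loss_fake = F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    return loss_real + loss_fake

def g_adv_loss(D, G, x_real, y_trg, s):
    """Generator side: fool D into scoring the converted sample as real in the target domain."""
    fake_logit = D(G(x_real, s), y_trg)
    return F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))
```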
Adversarial source classifier loss
We use an additional adversarial loss function with the source classifier $C$ (see Figure 2)
$$\mathcal{L}_{advcls} = \mathbb{E}_{X, y_{trg}, s}\left[\mathrm{CE}\!\left(C(G(X, s)), y_{trg}\right)\right]$$
where $\mathrm{CE}(\cdot)$ denotes the cross-entropy loss function.
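A sketch of the two sides of this loss, assuming a classifier `C` that returns per-domain logits: when updating C the label is the true source domain, and when updating G the label is the target domain (see Figure 2 below). Function names are mine, not the official code's:

```python
import torch.nn.functional as F

def c_source_cls_loss(C, G, x_real, y_src, s):
    """Classifier update: recover the ORIGINAL domain of a converted sample."""
    x_fake = G(x_real, s).detach()      # generator weights are frozen in this phase
    return F.cross_entropy(C(x_fake), y_src)

def g_source_cls_loss(C, G, x_real, y_trg, s):
    """Generator update: make C believe the converted sample comes from the TARGET domain."""
    return F.cross_entropy(C(G(x_real, s)), y_trg)
```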
Style reconstruction loss
We use the style reconstruction loss to ensure that the style code can be reconstructed from the generated samples
$$\mathcal{L}_{sty} = \mathbb{E}_{X, y_{trg}, s}\left[\left\lVert s - S(G(X, s), y_{trg}) \right\rVert_1\right]$$
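A minimal sketch, assuming a style encoder `S(x, y)` that returns the style code of `x` in domain `y`:

```python
import torch

def style_reconstruction_loss(S, G, x_real, y_trg, s):
    """L1 distance between the sampled style code and the style re-extracted from the fake."""
    x_fake = G(x_real, s)
    s_rec = S(x_fake, y_trg)
    return torch.mean(torch.abs(s - s_rec))
```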
Style diversification loss
The style diversification loss is maximized to enforce the generator to generate different samples with different style codes. In addition to maximizing the mean absolute error (MAE) between generated samples, we also maximize the MAE of the F0 features between samples generated with different style codes
$$\mathcal{L}_{ds} = \mathbb{E}_{X, s_1, s_2, y_{trg}}\left[\left\lVert G(X, s_1) - G(X, s_2) \right\rVert_1\right] + \mathbb{E}_{X, s_1, s_2, y_{trg}}\left[\left\lVert F_{conv}(G(X, s_1)) - F_{conv}(G(X, s_2)) \right\rVert_1\right]$$
where $s_1, s_2$ are two randomly sampled style codes from domain $y_{trg}$ and $F_{conv}(\cdot)$ is the output of the convolutional layers of the F0 network $F$.
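A sketch of this term, assuming `F_conv` returns the F0 network's convolutional features; note that this quantity is maximized, so it is subtracted in the full generator objective:

```python
import torch

def style_diversification_loss(G, F_conv, x_real, s1, s2):
    """MAE between two conversions of the same input with different style codes,
    plus MAE between their F0 convolutional features (this term is MAXIMIZED)."""
    fake1, fake2 = G(x_real, s1), G(x_real, s2)
    mel_term = torch.mean(torch.abs(fake1 - fake2))
    f0_term = torch.mean(torch.abs(F_conv(fake1) - F_conv(fake2)))
    return mel_term + f0_term
```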
F0 Consistency Loss
To produce F0-consistent results, we add an F0-consistency loss with the normalized F0 curve provided by the F0 network $F$. For an input mel-spectrogram $X$, $F(X)$ provides the absolute F0 value in Hertz for each frame of $X$. Since male and female speakers have different average F0, we normalize the absolute F0 values by their temporal mean, denoted by $\hat{F}(X)$. The F0 consistency loss is thus
$$\mathcal{L}_{f0} = \mathbb{E}_{X, s}\left[\left\lVert \hat{F}(X) - \hat{F}(G(X, s)) \right\rVert_1\right]$$
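A sketch, assuming `F_net(x)` returns a per-frame F0 curve in Hz, with normalization by the temporal mean as described above (the epsilon guard is my addition, not from the paper):

```python
import torch

def f0_consistency_loss(F_net, G, x_real, s, eps=1e-8):
    """L1 distance between mean-normalized F0 curves of source and converted speech."""
    def norm_f0(x):
        f0 = F_net(x)                                    # (B, T) absolute F0 in Hz
        return f0 / (f0.mean(dim=-1, keepdim=True) + eps)
    return torch.mean(torch.abs(norm_f0(x_real) - norm_f0(G(x_real, s))))
```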
Speech consistency loss
To ensure that the converted speech has the same linguistic content as the source, we employ a speech consistency loss using convolutional features from a pre-trained joint CTC-attention VGG-BLSTM network [14] given in the ESPnet toolkit [15]. Similar to [16], we use the output of the intermediate layer before the LSTM layers as the linguistic feature, denoted by $h_{asr}(\cdot)$. The speech consistency loss is defined as
$$\mathcal{L}_{asr} = \mathbb{E}_{X, s}\left[\left\lVert h_{asr}(X) - h_{asr}(G(X, s)) \right\rVert_1\right]$$
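A sketch, assuming `h_asr(x)` returns the intermediate convolutional features of the frozen, pre-trained ASR network:

```python
import torch

def speech_consistency_loss(h_asr, G, x_real, s):
    """L1 distance between ASR linguistic features of source and converted speech."""
    with torch.no_grad():
        feat_src = h_asr(x_real)        # ASR network is frozen; no gradient through the target
    feat_conv = h_asr(G(x_real, s))     # gradients flow into G through this branch
    return torch.mean(torch.abs(feat_src - feat_conv))
```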
Norm consistency loss
We use the norm consistency loss to preserve the speech/silence intervals of generated samples. We use the absolute column-sum norm for a mel-spectrogram $X \in \mathbb{R}^{N \times T}$ with $N$ mels and $T$ frames at the $t$-th frame, defined as $\lVert X_{\cdot,t} \rVert = \sum_{n=1}^{N} \lvert X_{n,t} \rvert$, where $t \in \{1, \dots, T\}$ is the frame index. The norm consistency loss is given by
$$\mathcal{L}_{norm} = \mathbb{E}_{X, s}\left[\frac{1}{T}\sum_{t=1}^{T}\left|\, \lVert X_{\cdot,t} \rVert - \lVert G(X, s)_{\cdot,t} \rVert \,\right|\right]$$
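A sketch, assuming mel-spectrogram tensors of shape `(B, N_mels, T)`:

```python
import torch

def norm_consistency_loss(G, x_real, s):
    """Match per-frame absolute column-sum norms (frame energy) to preserve speech/silence intervals."""
    norm_src = x_real.abs().sum(dim=1)        # (B, T) per-frame norm of the source
    norm_conv = G(x_real, s).abs().sum(dim=1) # (B, T) per-frame norm of the conversion
    return torch.mean(torch.abs(norm_src - norm_conv))
```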
Cycle consistency loss[^2]
Lastly, we employ the cycle consistency loss to preserve all other features of the input
$$\mathcal{L}_{cyc} = \mathbb{E}_{X, y_{src}, y_{trg}, s}\left[\left\lVert X - G(G(X, s), \tilde{s}) \right\rVert_1\right]$$
where $\tilde{s} = S(X, y_{src})$ is the estimated style code of the input in the source domain $y_{src}$.
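A sketch of the round trip, assuming `G` and `S` as above:

```python
import torch

def cycle_consistency_loss(G, S, x_real, y_src, s_trg):
    """Convert to the target style, then back with the source's own style code; compare to the input."""
    x_fake = G(x_real, s_trg)
    s_src = S(x_real, y_src)           # estimated style code of the input in its source domain
    x_rec = G(x_fake, s_src)
    return torch.mean(torch.abs(x_real - x_rec))
```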
Full objective
Our full generator objective can be summarized as follows:
$$\min_{G, S, M}\; \mathcal{L}_{adv} + \lambda_{advcls}\,\mathcal{L}_{advcls} + \lambda_{sty}\,\mathcal{L}_{sty} - \lambda_{ds}\,\mathcal{L}_{ds} + \lambda_{f0}\,\mathcal{L}_{f0} + \lambda_{asr}\,\mathcal{L}_{asr} + \lambda_{norm}\,\mathcal{L}_{norm} + \lambda_{cyc}\,\mathcal{L}_{cyc}$$
where $\lambda_{advcls}, \lambda_{sty}, \lambda_{ds}, \lambda_{f0}, \lambda_{asr}, \lambda_{norm}$ and $\lambda_{cyc}$ are hyperparameters for each term.
Our full discriminator objective is given by:
$$\min_{C, D}\; -\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls}$$
where $\lambda_{cls}$ is the hyperparameter for the source classifier loss $\mathcal{L}_{cls}$, which is given by
$$\mathcal{L}_{cls} = \mathbb{E}_{X, y_{src}, s}\left[\mathrm{CE}\!\left(C(G(X, s)), y_{src}\right)\right]$$
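A sketch of how the generator-side terms might be combined in a training step. The weights here are placeholders only; the actual λ values come from the paper / training config, not from this snippet:

```python
def generator_objective(losses, lam):
    """losses / lam: dicts keyed by adv, advcls, sty, ds, f0, asr, norm, cyc (scalar tensors / floats)."""
    return (losses["adv"]
            + lam["advcls"] * losses["advcls"]
            + lam["sty"]    * losses["sty"]
            - lam["ds"]     * losses["ds"]     # diversification term is maximized, hence subtracted
            + lam["f0"]     * losses["f0"]
            + lam["asr"]    * losses["asr"]
            + lam["norm"]   * losses["norm"]
            + lam["cyc"]    * losses["cyc"])

# Placeholder weights only, NOT the paper's hyperparameters.
lam = {k: 1.0 for k in ["advcls", "sty", "ds", "f0", "asr", "norm", "cyc"]}
```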
Figures
Figure 1:
StarGANv2-VC framework with style encoder. $X_{src}$ is the source input, $X_{ref}$ is the reference input that contains the style information, and $X_{conv}$ represents the converted mel-spectrogram. $h_{src}$, $h_{f0}$ and $h_{sty}$ denote the latent feature of the source, the F0 feature from the convolutional layers of the source, and the style code of the reference in the target domain, respectively. $h_{src}$ and $h_{f0}$ are concatenated by channel as the input to the decoder, and $h_{sty}$ is injected into the decoder by adaptive instance normalization (AdaIN) [12]. Two classifiers form the discriminators that determine whether a generated sample is real or fake and who the source speaker of $X_{conv}$ is. In another scheme where the style encoder is replaced with the mapping network, the reference mel-spectrogram $X_{ref}$ is not needed.
Figure 2:
Training schemes of the adversarial source classifier $C$ for a domain $y_k \in \mathcal{Y}$. The case $y_{src} = y_{trg}$ is omitted to prevent amplification of artifacts from the source classifier. (a) When training the discriminators, the weights of the generator are fixed, and the source classifier is trained to determine the original domain $y_k$ of the converted samples, regardless of the target domains. (b) When training the generator, the weights of the source classifier are fixed, and the generator is trained to make $C$ classify all generated samples as being converted from the target domain $y_{trg}$, regardless of the actual domains of the source.
Additional resources
These issues will be valuable for improving / debugging GAN training later:
https://github.com/yl4579/StarGANv2-VC/issues/21 # Inference with noisy source
https://github.com/yl4579/StarGANv2-VC/issues/6 # “Many to many doubts” / improve zero-shot by adding more F0 ResBlocks
https://github.com/yl4579/StarGANv2-VC/issues/68 # adding more discriminators
A previous paper on improving StarGAN-V2: “Towards Low-Resource StarGAN Voice Conversion Using Weight Adaptive Instance Normalization”.
Korean only… a video that at least has diagrams for the architecture (I have a half-assed transcript made with Whisper now).
References
GAN
Joint Detection and Classification of Singing Voice Melody Using Convolutional Recurrent Neural Networks
Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks (→ big part of this paper)
Footnotes
[^1]: [[Generative Adverserial Nets#Minmax loss [src1]|Minmax loss]]
[^2]: Cycle Consistency: The idea of using transitivity as a way to regularize structured data has a long history. In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades. In the language domain, verifying and improving translations via “back translation and reconciliation” is a technique used by human translators.