Paper: arxiv

Abstract

Singing voice conversion (SVC) aims to convert the voice of one singer to that of other singers while keeping the singing content and melody. Building on recent voice conversion work, we propose a novel model that converts songs stably while preserving their naturalness and intonation. We build an end-to-end architecture that takes phonetic posteriorgrams (PPGs) as input and generates mel spectrograms. Specifically, we implement two separate encoders: one encodes PPGs as content, and the other compresses mel spectrograms to supply acoustic and musical information. To improve performance on timbre and melody, we design an adversarial singer confusion module and a mel-regressive representation learning module for the model. Objective and subjective experiments are conducted on our private Chinese singing corpus. Compared with the baselines, our method significantly improves conversion performance in terms of naturalness, melody, and voice similarity. Moreover, our PPG-based method proves robust to noisy sources.

Audio Samples

We present audio samples generated by the baseline models and our proposed model.

Target Singers

The following audio samples are from the target female and male singers.

Target Samples
Female
Male

Source Samples

Three female and three male singing clips are presented as the source samples. Each clip comes from a different singer.

  Samples
F1
F2
F3
M1
M2
M3

Converted Samples

Target Female

  F1 F2 F3 M1 M2 M3
BASE1
BASE2
BASE3
Proposed

Target Male

  F1 F2 F3 M1 M2 M3
BASE1
BASE2
BASE3
Proposed

Ablation Tests

Target Female

  F1 F2 M1 M2
BASE3
+ ME
+ SC
+ MS (Proposed)

Target Male

  F1 F2 M1 M2
BASE3
+ ME
+ SC
+ MS (Proposed)

Noise Robustness Tests

source SNR = 15.30

  F1 F2 M1 M2
Source
Target Female
Target Male

source SNR = 8.18

  F1 F2 M1 M2
Source
Target Female
Target Male
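
The source SNR values quoted above (15.30 and 8.18) can be related to a waveform with a short sketch. This is an illustration only, not the procedure used to prepare these samples: it assumes the values are in decibels (the page does not state a unit) and that noise is additive. The function names `snr_db` and `mix_at_snr`, the 440 Hz test tone, and the 16 kHz sample rate are hypothetical choices made for the example.

```python
import math
import random


def snr_db(clean, noisy):
    """SNR in dB of `noisy` relative to `clean` (equal-length sample lists)."""
    signal_power = sum(x * x for x in clean) / len(clean)
    # Treat the difference between the two signals as the noise component.
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)


def mix_at_snr(clean, noise, target_snr_db):
    """Scale `noise` and add it to `clean` so the mix has the requested SNR."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Gain that brings the noise power down to signal_power / 10^(SNR/10).
    gain = math.sqrt(signal_power / (noise_power * 10.0 ** (target_snr_db / 10.0)))
    return [x + gain * n for x, n in zip(clean, noise)]


# Toy signal (assumption): one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
clean = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]

# Gaussian noise with a fixed seed so the example is repeatable.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

noisy = mix_at_snr(clean, noise, 15.30)
print(round(snr_db(clean, noisy), 2))
```

Under these assumptions, a lower SNR such as 8.18 corresponds to a larger noise gain, so the second set of sources is the noisier of the two.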