This demo resulted from playing around with kNN-VC, a new and much-simplified voice conversion method introduced by Baas et al. (2023).

A benefit of their approach is that the self-supervised feature vectors always inhabit a shared latent space, regardless of which target speaker we are generating for. As a result, simply by taking weighted combinations of the feature vectors returned by the k-nearest neighbors matching step for different target speakers, we can interpolate (or extrapolate) between their voices, and even morph continuously between any number of target speakers within a single utterance. Doing this produces the results shown below.
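To make that concrete, here is a minimal PyTorch sketch of what the blending boils down to. It assumes `source_feats`, `pool_a`, and `pool_b` are (frames × dim) WavLM feature matrices of the kind kNN-VC extracts for the source utterance and each target speaker's reference audio; the helper names are my own illustration rather than the actual kNN-VC API, and the real generation code lives in the Colab notebook linked below. Extrapolation (the 150% / −50% mixes in the last example) is the same weighted combination with weights outside [0, 1].

```python
import torch
import torch.nn.functional as F

def knn_average(source_feats, pool, k=4):
    """For each source frame, average its k nearest neighbors (by cosine
    distance) in a target speaker's feature pool -- the kNN-VC matching step."""
    src = F.normalize(source_feats, dim=-1)
    tgt = F.normalize(pool, dim=-1)
    dists = 1 - src @ tgt.T                        # (n_src, n_pool)
    idx = dists.topk(k, largest=False).indices     # (n_src, k)
    return pool[idx].mean(dim=1)                   # (n_src, dim)

def blend_speakers(source_feats, pool_a, pool_b, weight_b):
    """Blend two target speakers: weight_b in [0, 1] interpolates, values
    outside [0, 1] (e.g. 1.5 / -0.5) extrapolate. Passing a per-frame
    tensor of weights morphs continuously within a single utterance."""
    matched_a = knn_average(source_feats, pool_a)
    matched_b = knn_average(source_feats, pool_b)
    if not torch.is_tensor(weight_b):
        weight_b = torch.tensor(float(weight_b))
    w = weight_b.reshape(-1, 1)                    # broadcast over feature dim
    return (1 - w) * matched_a + w * matched_b

# Morph from speaker A to speaker B over the course of the utterance,
# then pass the blended features to the vocoder as usual:
# ramp = torch.linspace(0, 1, source_feats.shape[0])
# blended = blend_speakers(source_feats, pool_a, pool_b, ramp)
```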

While I’m not entirely sure how novel these capabilities are (speaker interpolation, at least, is certainly possible with prior models), I was impressed by how well this seems to work out of the box with kNN-VC. At the very least, it’s a cool effect to hear one voice seamlessly blending into another!

Everything here was produced using a very slight fork of kNN-VC; see the Colab notebook for the generation process.

Interpolation example 1

Source
Speaker A¹
Speaker B²
50% A + 50% B
Morph A -> B

Interpolation example 2

Source
Speaker C³
Speaker B
50% C + 50% B
Morph C -> B

Extrapolation example

Source
Speaker D⁴
Speaker B
150% D - 50% B
150% B - 50% D
1. Speaker from LJSpeech.
2. Speaker ID 587 from the LibriSpeech train_clean_100 split.
3. Speaker ID 1334 from the LibriSpeech train_clean_100 split.
4. Speaker ID 1235 from the LibriSpeech train_clean_100 split.