Morphing between voices using kNN-VC
This demo resulted from my playing around with kNN-VC, a new simplified voice conversion method introduced by Baas et al. (2023).
A benefit of their approach is that the self-supervised feature vectors always inhabit a shared latent space, regardless of which target speaker we are generating for. As a result, simply by interpolating the feature vectors returned from the k-nearest neighbors matching step, we can easily interpolate (or extrapolate) between the voices of different target speakers. In fact, we can even morph continuously between any number of target speakers within a single utterance. Doing this produces the results shown below.
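The blending step can be sketched in a few lines. This is an illustrative numpy sketch, not the actual kNN-VC code (which operates on WavLM features in PyTorch and feeds the result to a HiFi-GAN vocoder); the function names, Euclidean distance, and linear morph schedule here are my own simplifications.

```python
import numpy as np

def knn_match(query, pool, k=4):
    """kNN matching step: for each source frame, average its k nearest
    frames from the target speaker's feature pool.
    (Illustrative; uses Euclidean distance for simplicity.)"""
    # pairwise squared distances, shape (T, N)
    d2 = ((query[:, None, :] - pool[None, :, :]) ** 2).sum(-1)
    idx = np.argsort(d2, axis=1)[:, :k]
    return pool[idx].mean(axis=1)

def morph(source_feats, pool_a, pool_b, k=4):
    """Morph from speaker A to speaker B across the utterance by
    blending the two matched feature sequences with a per-frame
    weight that ramps linearly from 0 to 1."""
    a = knn_match(source_feats, pool_a, k)
    b = knn_match(source_feats, pool_b, k)
    alpha = np.linspace(0.0, 1.0, len(source_feats))[:, None]
    return (1 - alpha) * a + alpha * b
```

A static 50/50 interpolation is the same blend with a constant `alpha = 0.5`, and extrapolation simply uses weights outside [0, 1], e.g. `1.5 * a - 0.5 * b`. The blended features are then vocoded as usual.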
While I’m not entirely sure how novel these capabilities are (speaker interpolation, at least, is certainly possible with prior models), I was impressed by how well this seems to work out of the box with kNN-VC. At the very least, it’s a cool effect to hear one voice seamlessly blending into another!
Everything here was produced using a very slight fork of kNN-VC; see the Colab notebook for the generation process.
Interpolation example 1
| Sample | Audio |
| --- | --- |
| Source | *(audio)* |
| Speaker A¹ | *(audio)* |
| Speaker B² | *(audio)* |
| 50% A + 50% B | *(audio)* |
| Morph A → B | *(audio)* |
Interpolation example 2
| Sample | Audio |
| --- | --- |
| Source | *(audio)* |
| Speaker C³ | *(audio)* |
| Speaker B | *(audio)* |
| 50% C + 50% B | *(audio)* |
| Morph C → B | *(audio)* |
Extrapolation example
| Sample | Audio |
| --- | --- |
| Source | *(audio)* |
| Speaker D⁴ | *(audio)* |
| Speaker B | *(audio)* |
| 150% D − 50% B | *(audio)* |
| 150% B − 50% D | *(audio)* |
1. speaker ID 587 from the LibriSpeech `train_clean_100` split ↩
2. speaker ID 1334 from the LibriSpeech `train_clean_100` split ↩
3. speaker ID 1235 from the LibriSpeech `train_clean_100` split ↩