Why do people expect a Vocaloid/UTAU voice to sound the same once it's been remade into AI?
It's already been made clear that the voice providers sound different from the old synthesizers.
Why hold that expectation at all? Vflower's voice is clearly machine-sounding, and all of its traits come from the compressed nature of the samples + the Vocaloid3/4 engine warping them.
Of course the voice provider (VP) can't sound the same. I see people remark that UTAU Ritsu sounds better than the AI Ritsus, when it's clear the UTAU voicebank samples were cherry-picked to sound consistently shouty to an inhuman degree. Nobody sings like that. Canon doesn't scream when she sings; it would strain her voice too much. Of course the UTAU is always going to sound powerful when the samples were recorded and picked to sound that way.
I think it's unreasonable for people to expect the AI to be a simple "automatic tuning generator" when the AI is actually a recreation of the voice provider's singing.
Training an AI voice directly on UTAU-rendered audio would make it sound low-quality, since you'd be passing one synthesizer's output through another.
I get that we all have nostalgia for the old, chunky sound of a previous-generation synth, but AI is more welcoming to producers who don't want to deal with all the manual tuning. It's simply easier to use and sounds better by default.
And we can still keep our copies of the previous-gen synths anyway.