

WacoWacko39

New Fan
Jul 22, 2025
10
I'll do my best to explain!

As context for anyone who doesn't know, traditional vocal synths are made using the 'concatenative' method. A singer is hired to sing all the individual speech sounds of a language, called phonemes, at various pitches. These short recordings of vowels and consonants are downloaded onto your computer when you buy a concatenative Vocaloid, and the software strings the phoneme recordings together to make the words you type into the program. It is a simple method that lets a program say any word without anyone needing to record every word in the dictionary.
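To picture it in code: here's a minimal sketch of the idea, assuming the phoneme recordings are plain audio arrays (the bank and function names are made up for illustration, not the actual VOCALOID internals). Synthesis is just a dictionary lookup followed by stitching, with a short cross-fade at each joint so the seams gel:

```python
import numpy as np

SAMPLE_RATE = 44100

# Stand-ins for the singer's recordings (here: 100 ms of noise each).
PHONEME_BANK = {
    "k": np.random.randn(4410),
    "a": np.random.randn(4410),
    "t": np.random.randn(4410),
}

def synthesize(phonemes, fade=220):
    """Look up each phoneme's recording and stitch them end to end,
    cross-fading roughly 5 ms at each joint."""
    out = PHONEME_BANK[phonemes[0]].copy()
    for p in phonemes[1:]:
        clip = PHONEME_BANK[p]
        ramp = np.linspace(0.0, 1.0, fade)
        out[-fade:] = out[-fade:] * (1 - ramp) + clip[:fade] * ramp
        out = np.concatenate([out, clip[fade:]])
    return out

audio = synthesize(["k", "a", "t"])  # "cat"
```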

AI vocal synths use machine learning to create the voicebank. Instead of recording phonemes, the singer sings whole songs. Then you label where each phoneme is in each song, so the software knows how to read them. After labelling, the AI software learns the sonic qualities that give the singer their unique timbre and pronunciation. These rules are saved into a much smaller file, then applied dynamically by the software, making the output sound like the singer. Rather than simply playing a pre-recorded sound, the software calculates how a word should be sung by applying the rules it has learnt from 'listening' to the singer; rather like a robotic impressionist. There's a lot more complex stuff happening under the hood. For example, you can feed the AI data from multiple singers to increase the range of one voice, or make a voice sing fluently in multiple languages by mixing the 'rules' for English with the 'rules' for a non-native singer's timbre.
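For the curious, here's a toy sketch of what the 'learning' step could look like, assuming you already have the labelled (phoneme, audio) segments cut out of the songs. Real systems fit neural networks to spectral features; this just averages a magnitude spectrum per phoneme, to show how hours of recordings get distilled into a small table of 'rules':

```python
import numpy as np

def train(labelled_segments):
    """labelled_segments: list of (phoneme, audio array) pairs."""
    spectra = {}
    for phoneme, audio in labelled_segments:
        mag = np.abs(np.fft.rfft(audio, n=2048))  # one 'fingerprint' per take
        spectra.setdefault(phoneme, []).append(mag)
    # The 'voicebank' is now a few KB of statistics, not hours of audio.
    return {p: np.mean(mags, axis=0) for p, mags in spectra.items()}

# Many takes of each phoneme, cut and labelled from whole songs:
segments = [(p, np.random.randn(2048)) for p in "kakataatt"]
voicebank = train(segments)
```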

However, AI does not sacrifice your ability to make a voice sound unique. In fact, I believe AI voices are easier to edit and can achieve greater variety in performance compared to concatenative voices, because you're not restricted to the singer's recordings. If you push their settings, you can make one AI voice sound like multiple completely different people! The reason AI usage online can be samey is a by-product of convenience. AI voices sound natural out of the box, so producers feel less of a need to edit the results. Consequently, a lot of producers use AI voices at their default settings.

Most concatenative voicebanks sound a lot less polished by default and require more work to sound natural. When you tune a concatenative voice, you can hear a big improvement in the results, which motivates producers to edit them by hand. This forces more users to develop their own unique style.
Thank you for what I'm certain is a vast oversimplification; I'm not well educated on AI. I'm not against it as a whole: that would be small-minded. But I'm hesitant to be excited either; I'll carry on with all the human emotions that are appropriate when dealing with something new, and hold my opinions proper for when her new voicebank comes out. The only comment I believe I have is regarding the ability to sound more "human" that you mentioned: I am fond of the somewhat robotic sound of Vocaloids and UTAUs. But as long as that ability is preserved, I see little reason to complain.
 

Vector

Passionate Fan
Mar 6, 2022
193
The reason AI usage online can be samey is a by-product of convenience. AI voices sound natural out of the box, so producers feel less of a need to edit the results. Consequently, a lot of producers use AI voices at their default settings.
I've found from playing around with GUMI V6 that I get better results by putting in a phrase, selecting all of the notes, and then going to the inspector and dragging down all of the pitch drift and vibrato settings until she sounds like a robot (like a concatenative bank with pitch snap on), then slowly reintroducing those parameters until it sounds natural. Otherwise V6 draws wild pitch curves and adds weird vibrato, so GUMI ends up sounding very "strained" and thin.

To expand on how the machine learning works (I work in software engineering, currently adjacent to some ML applications... not training models, but setting up infrastructure for some small ones): it's basically a bundle of statistical math. Instead of storing hundreds of short recordings, you're producing a distilled essence of what each sound the voicebank can produce "looks sort of like", e.g. a "ka" sound has one kind of waveform overall, and when it runs into a "t" sound it looks kind of like such and such. The "training" process is just crunching a lot of numbers until you have a glorified spreadsheet that you can plug a desired word into and get a mathematical curve out (and that's all a sound is: a compound sine wave traveling through air).

If you record someone saying "cat" several times, the results will all look very similar, and are hypothetically predictable mathematically. That's very hard to do by hand, but there are now computing tools that can automate it.
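A quick toy demonstration of that predictability, treating each take of the word as the same underlying curve plus a little random variation (all numbers invented for illustration):

```python
import numpy as np

t = np.linspace(0, 0.5, 22050)        # half a second of samples
base = np.sin(2 * np.pi * 220 * t)    # the 'true' curve behind the word
takes = [base + 0.05 * np.random.randn(t.size) for _ in range(5)]

template = np.mean(takes, axis=0)     # the distilled essence
new_take = base + 0.05 * np.random.randn(t.size)
rms = np.sqrt(np.mean((template - new_take) ** 2))
print(f"RMS error predicting an unseen take: {rms:.3f}")  # small, ~0.05
```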

So a V2-V4 voicebank is: the user typed "cat", so look up a sound file for "ka" and one for "t" and stitch them together, and hopefully they will gel instead of sounding bad.

An "AI" voicebank is: the user typed "cat" so let's input those into a statistical pipeline that will predict what the waveform those make will look like.
 

lIlI

Staff member
Administrator
Apr 6, 2018
1,070
The Lightning Strike
I've found from playing around with GUMI V6 that I get better results by putting in a phrase, selecting all of the notes, and then going to the inspector and dragging down all of the pitch drift and vibrato settings until she sounds like a robot (like a concatenative bank with pitch snap on), then slowly reintroducing those parameters until it sounds natural. Otherwise V6 draws wild pitch curves and adds weird vibrato, so GUMI ends up sounding very "strained" and thin.
I probably should have excluded V6 from the "AI sounds natural by default" statement haha. :kyo_ani_lili: It's definitely among the least advanced applications of the technology.
 

MagicalMiku

♡Miku♡
Apr 13, 2018
2,231
Sapporo
In Episode 2, "Music x Technology", of the Crypton 30th Anniversary special interview (which I translated here), there is a detailed explanation about "Hatsune Miku V6 AI": what they are aiming for and how development is going together with YAMAHA:
--By the way, I'm sure there are people reading this who are wondering, so I'll ask you... How is the progress of "Hatsune Miku V6 AI" going?

Sasaki :
Regarding the "challenges that are currently difficult to overcome as we proceed with development," which we reported on the SONICWIRE blog in December 2024, we are now seeing a path to solving them with the cooperation of YAMAHA. We are proceeding with development so that we can make the new features available to try in some form by 2025. We apologize to everyone who is looking forward to it, but we would appreciate it if you could wait a little while longer.
Also, President Ito of Crypton Future Media confirmed a nice thing:
Ito : When it comes to new initiatives such as "NT" and "V6 AI," "Hatsune Miku" always takes the lead, but of course we are also thinking about other virtual singers. It's taking time, but we hope you will continue to watch over the developments going forward.
I recommend reading the full Episode 2 of the interview; it has so many interesting details and talks about many things :meiko_ani_lili: For example:
- As mentioned earlier, there are more and more situations where AI technology is being used these days, such as "Hatsune Miku V6 AI". Some creators are uneasy about AI, for example, because they are worried that generative AI may learn their work without their permission. Could you take this opportunity to tell us what your company thinks about AI?

Ito :
Because the image of generative AI is prevalent, AI is often seen as the enemy of creators due to problems such as unauthorized learning, but AI is not just generative AI. For example, if you combine AI with music understanding technology, it becomes possible to have AI find music that perfectly suits your tastes. If we can provide a music discovery service using such AI technology, it will not only be convenient for listeners, but will also create opportunities for creators' music to be heard.

I think of AI as a new pen tool, and I think it is a technology that can expand the possibilities of music depending on how it is used.

Sasaki : I think the pen tool is a good expression, being an analogy to an everyday tool used by craftsmen and students alike. Although the product name does not have "AI" in it, the singing voice synthesis software "Hatsune Miku NT (Ver.2)" that we updated in March this year actually incorporates AI. Many people have the image that "AI-based singing synthesis software = software that can make a singer sound just like a human," but we want to value "singing voices that retain the characteristics of a virtual singer," different from humans. So rather than using AI to reproduce realistic singing voices, we incorporate AI as a tool to make creators' work more efficient.
 
