I brought this up a while ago in a different thread, but it kinda got lost in the larger conversation; so I thought why not make it its own thread?
So I know enough about vocal synthesis technology, linguistics, and the English language in general to know that synthesizing English is really difficult, and making it sound good is even harder. From what I can tell, English voice banks usually have to pick between being intelligible and sounding convincingly human, and managing to do both is almost like turning lead into gold. But recently, in large part thanks to AI, we've been getting more English banks that don't have to pick, which is great! Eleanor, Solaria, Anri, that male SynthV that Dreamtonics teased; hell, I'll even vouch for Lucy EmVoice sounding decent, clownish dev antics aside.
My only real complaint is that these recent vocalists have all had somewhat similar, i.e. soft, voice types, and I'm curious as to why exactly that is. I have a few theories of my own, but if anyone here is more knowledgeable about the technology, or even better has actual hands-on experience with these or similar vocals, and is able to share what they know, I'd love to hear it! For now, here are my uneducated hypotheses:
THEORY 1: It's just plain hard to synthesize powerful vocals, English ones even more so.
Vocal synthesis technology just isn't advanced enough at this point in time to reproduce strong vocals, or at least not very well. This is theory number one because I think it's the weakest. As in, I know it's false, because I've heard plenty of powerful vocal synths, so it's obviously possible... with the caveat that they're mostly Japanese or Mandarin voice banks, not English ones. So maybe it could just be that English, being the awful, complex, and inconsistent Frankenstein's monster of a language that it is, presents more difficulties than other languages do, difficulties that make it unfeasible, if not outright impossible, to produce more powerful voices... But I know that isn't true either! Lola, Sweet Ann, Prima, Big Al, and Tonio all have pretty strong voices! Then again, those are all rather old Vocaloid/Vocaloid 2 banks by now, and the technology behind them is quite outdated. Which leads into my second theory:

THEORY 2: Powerful English vocals are possible, but not with modern voice synthesis methods.
All the vocalists cited in my second paragraph (except Lucy) are Synthesizer V vocalists, and SynthV seems to be the king of vocal synth engines at the moment, Vocaloid having lost that crown years ago. And for very good reason! It's just a great engine and, most relevant to this discussion, seems to be leading the charge for high-quality English synthesis. So why then do they all sound so similar, and so soft? Japanese and Mandarin SynthVs don't have that problem. Off the top of my head, Genbu, Chiyu, and Saki come to mind as having pretty strong voices, and I'm sure I could find other examples if I looked. So I guess it could be that whatever works for Japanese and Mandarin voice banks just doesn't work for English ones, whatever that entails?

These first two theories pertain to technological limitations, but what if the lack of stronger English voices has nothing to do with that after all? I have two more ideas, then:
THEORY 3: It is entirely possible to create more varied English voice banks with modern techniques; it's just that the only people who have lent their voices so far all happen to have similar voice types.

This would be the best-case scenario in my opinion, because it means it's just a matter of finding and convincing someone with a stronger voice to lend theirs. Not to say that's necessarily an easy task, but it'd at least be nice to know that it could be done any day. And I find it easy to believe, since I imagine most people who choose to become voice providers are likely hobbyists, enthusiasts, and assorted tech/music nerds rather than experienced vocalists, and untrained singers are much more likely to have weaker voices. I know Vocaloid struggled early on to find willing voice providers because some singers were afraid that they might be "replaced" by their Vocaloid counterparts, or were otherwise uncomfortable with losing control over their voice in some way. Finally, one last theory:
THEORY 4: No one's put in the effort because no one thinks it'd sell enough.
The saddest of them all in my eyes, and basically the same excuse given time and time again in defense of so many other tiresome trends. Male vocals don't sell, so the vast majority of synths are female vocals. Many of the female vocals that do get made are fairly generic, high-voiced, pop-oriented teens and young adults. Character often sells the software more than the actual voice itself, so more effort is put into that than into making an interesting, quality voice. Miku made all the money in the world by drawing from idol and moe culture, so now almost all character-based vocal synths are going to do the exact same thing in an attempt to get a piece of that pie. This "It's just what sells!" stock phrase gets doled out any time someone expresses a desire for more diverse voices and/or characters, and I hate it. My only solace is that as time goes on, it's only going to get easier and cheaper to develop new voice synthesis software, and this tired excuse for aggressive homogeneity will, each day, inch closer and closer to a well-deserved grave.

Anyway, those are my thoughts! What about y'all? Oh yeah, you may have noticed I didn't mention Maki, even though she recently got an English SynthV. Honestly, I just haven't listened to her much at all. I'm operating under the assumption that her English bank was primarily made to expand her "brand", and not because they were interested in creating a quality voicebank, so I've kinda just... ignored it. But hey, if I'm totally off the mark with that and she's actually great and I'm really missing out, please feel free to correct me!