
Why all the soft English vocals?

Tortoiseshel

Aspiring Fan
Aug 23, 2021
54
I brought this up a while ago in a different thread, but it kinda got lost in the larger conversation, so I thought: why not make it its own thread?

So I know enough about vocal synthesis technology, linguistics, and the English language in general to know that synthesizing English is really difficult, and making it sound good is even harder. From what I can tell, English voice banks usually have to pick between being intelligible and sounding convincingly human, and managing to do both is almost like turning lead into gold. But recently, in large part thanks to AI, we've been getting more English banks that don't have to pick, which is great! Eleanor, Solaria, Anri, that male SynthV that Dreamtonics teased; hell, I'll even vouch for Lucy EmVoice sounding decent, clownish dev antics aside.

My only real complaint is that these recent vocalists have all had somewhat similar, i.e. soft, voice types, and I'm curious as to why exactly that is. I have a few theories of my own, but if anyone here is more knowledgeable about the technology (or even better, has actual hands-on experience with these or similar vocals) and is able to share what they know, I'd love to hear it! For now, here are my uneducated hypotheses:

THEORY 1: It's just plain hard to synthesize powerful vocals, English ones even more so.
Vocal synthesis technology just isn't advanced enough at this point in time to reproduce strong vocals, or at least not very well. This is theory number one because I think it's the weakest. As in, I know it's false because I've heard plenty of powerful vocal synths, so it's obviously possible... with the caveat that they're mostly Japanese or Mandarin voicebanks, not English ones. So maybe it could just be that English (being the awful, complex, and inconsistent Frankenstein's monster of a language that it is) presents difficulties that other languages don't, making it unfeasible, if not outright impossible, to produce more powerful voices... But I know that isn't true either! Lola, Sweet Ann, Prima, Big Al, and Tonio all have pretty strong voices! Then again, those are all rather old Vocaloid/Vocaloid 2 banks by now, and the technology behind them is quite outdated. Which leads into my second theory:

THEORY 2: Powerful English vocals are possible, but not with modern voice synthesis methods.
All the vocalists cited in my second paragraph (except Lucy) are Synthesizer V vocalists, and SynthV seems to be the king of vocal synth engines at the moment, Vocaloid having lost that crown years ago. And for very good reason! It's just a great engine and, most relevant to this discussion, seems to be leading the charge for high-quality English synthesis. So why then do they all sound so similar, and so soft? Japanese and Mandarin SynthV voicebanks don't have that problem: off the top of my head, Genbu, Chiyu, and Saki all have pretty strong voices, and I'm sure I could find other examples if I looked. So I guess it could be that whatever works for Japanese and Mandarin voicebanks just doesn't work for English ones, whatever that entails?

These first two theories pertain to technological limitations, but what if the lack of stronger English voices has nothing to do with that after all? I have two more ideas for that, then:

THEORY 3: It is entirely possible to create more varied English voice banks with modern techniques; it's just that the only people who have lent their voices so far all happen to have similar voice types.
This would be the best case scenario in my opinion, because it means it's just a matter of finding/convincing someone with a stronger voice to lend theirs. Not to say that's necessarily an easy task, but it'd at least be nice to know that it could be done any day. And I find it easy to believe, since I imagine most people who choose to become voice providers are likely hobbyists, enthusiasts, and assorted tech/music nerds rather than experienced vocalists; and untrained singers are much more likely to have weaker voices. I know Vocaloid struggled early on to find willing voice providers because some singers were afraid that they might be "replaced" by their Vocaloid counterparts, or were otherwise uncomfortable with losing control over their voice in some way. Finally, one last theory:

THEORY 4: No one's put in the effort because no one thinks it'd sell enough.
The saddest of them all in my eyes, and basically the same excuse given time and time again in defense of so many other tiresome trends. Male vocals don't sell, so the vast majority of synths are female vocals. Many of the female vocals that do get made are fairly generic, high-voiced, pop-oriented teens and young adults. Character often sells the software more than the actual voice itself, so more effort is put into that than making an interesting/quality voice. Miku made all the money in the world by drawing from idol and moe culture, so now almost all character-based vocal synths are going to do the exact same thing in an attempt to get a piece of that pie. This "It's just what sells!" stock phrase gets doled out any time someone expresses a desire for more diverse voices and/or characters, and I hate it. My only solace is that as time goes on, it's only going to get easier and cheaper to develop new voice synthesis software, and this tired excuse for aggressive homogeneity will, each day, inch closer and closer to a well-deserved grave.

Anyway, those are my thoughts! What about y'all? Oh yeah, you may have noticed I didn't mention Maki, even though she recently got an English SynthV. Honestly, I just haven't listened to her much at all. I'm operating under the assumption that her English bank was primarily made to expand her "brand", and not because they were interested in creating a quality voicebank, so I've kinda just... ignored it. But hey, if I'm totally off the mark with that and she's actually great and I'm really missing out, please feel free to correct me!
 

Nokone Miku

Aspiring Lyricist/Producer
Jul 14, 2021
76
www.youtube.com
Sometimes it isn't the software or the tech that's at fault. The problem is that people aren't utilizing the software's full capabilities. No one is making that sort of content (or they are, but they aren't getting noticed). I feel like it's only in recent years that more people have been using Miku's different appends. Some are even going back and exploring her V3 appends, which went largely unused.

The whole reason I started trying to produce Vocaloid/vocal synth music is that I felt like no one was pushing the English voicebanks to their full potential yet. I kept expecting people to start working toward it, but it hasn't been happening, so I resolved to do it myself. So far it's taking a long time because I'm learning audio mixing as I go. I'm also wasting time experimenting with everything and anything. (Recently I've been playing around with a vocal-fry shouting effect: layering V4 Miku English with a fully devoiced version and a differently tuned V3 version, running them through different compressors, and then recombining them. It's sounding okay, but I need to mess with it some more to get it where I want it.)
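In case the layering bit is hard to picture, here's a rough Python sketch of just the recombining step, using numpy and soundfile. The file names, gains, and compressor numbers are placeholders for illustration, not my actual settings, and the "compressor" here is a crude per-sample one rather than anything you'd really mix with:

```python
# Rough sketch of the layer-and-recombine step; all settings are placeholders.
# Assumes the three renders were exported as WAVs at the same sample rate and
# length. Needs numpy and soundfile (pip install numpy soundfile).
import numpy as np
import soundfile as sf

def compress(audio, threshold_db=-18.0, ratio=4.0):
    """Crude per-sample gain reduction above a threshold (no attack/release,
    so it's closer to waveshaping than a real studio compressor)."""
    eps = 1e-10
    level_db = 20.0 * np.log10(np.abs(audio) + eps)
    over_db = np.maximum(level_db - threshold_db, 0.0)
    gain_db = -over_db * (1.0 - 1.0 / ratio)
    return audio * (10.0 ** (gain_db / 20.0))

# Placeholder file names for three renders of the same vocal line.
main_take, sr = sf.read("miku_v4_english.wav")
devoiced, _ = sf.read("miku_v4_english_devoiced.wav")
v3_take, _ = sf.read("miku_v3_english_retuned.wav")

# Different compressor settings per layer, then sum and normalize the mix.
mix = (
    0.8 * compress(main_take, threshold_db=-18.0, ratio=4.0)
    + 0.5 * compress(devoiced, threshold_db=-30.0, ratio=8.0)
    + 0.4 * compress(v3_take, threshold_db=-24.0, ratio=6.0)
)
mix /= max(np.max(np.abs(mix)), 1e-10)
sf.write("vocal_fry_shout_layer.wav", mix, sr)
```

In my actual project I do the compression with plugins in the DAW; the point of the sketch is just the idea of giving each layer its own compression before summing them back together.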

I bought Avanna and downloaded Eleanor Forte lite, but at this rate it will be a while before I get to them. I probably won't mess with GUMI English, because that's a voicebank I already kinda see people experimenting with.

It feels similar to how certain genres of music are underrepresented in Vocaloid. One person on Reddit was asking if any of the English voicebanks could be used for country music. I hope they try experimenting with that (but the cynic in me is guessing they won't).

I could be wrong, and maybe people have already explored their full capabilities and found it impossible to go further? I don't know. I found that getting Miku English to sound more powerful required a lot of reverse-sloped vibratos and bumping the Clearness up to 12 across the whole track. Is that something new, or have people already been doing that? There's no way to know, because I don't see many tuning examples, and the few I do see are with Japanese voicebanks.
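In case "reverse-sloped vibrato" is unclear: what I mean is a vibrato whose depth ramps up over the note instead of tapering off, which I end up drawing into the pitch lane by hand. Here's a throwaway numpy sketch of that shape, with made-up rate and depth numbers, just to illustrate the curve; it isn't a preset from any editor:

```python
# Sketch of a "reverse-sloped" vibrato curve: depth grows over the note
# instead of decaying. The output is the pitch offset in cents over time,
# i.e. the shape I'd redraw by hand in the pitch-bend lane.
import numpy as np

def reverse_sloped_vibrato(duration_s=2.0, rate_hz=5.5, max_depth_cents=60.0, steps=200):
    t = np.linspace(0.0, duration_s, steps)
    depth = max_depth_cents * (t / duration_s)   # ramp the depth up toward the end of the note
    return t, depth * np.sin(2.0 * np.pi * rate_hz * t)

t, cents = reverse_sloped_vibrato()
# cents[i] is the pitch offset at time t[i]: flat at the start, a wide wobble by the end.
```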
 

lIlI

Staff member
Moderator
Apr 6, 2018
843
The Lightning Strike
I've been thinking about this too. I believe it's potentially a mix of the first three reasons. Powerful voices have a strong character to them, and they're harder for singers to sustain consistently. This means it's reasonable to assume they're a bit trickier to synthesise, and that skilled voicers for them are somewhat rarer to find.

The one thing I can rule out is public demand: I went back to check the poll, and only 7% of people said they preferred soft voices. The fandom is definitely ravenous for power banks, and I'm hoping companies aren't ignorant of that. vFlower's success surely sent a message!

Perhaps the biggest reason is just timing. We have two new soft voicebanks coming out, but two isn't enough to define a trend just yet. Maybe, just maybe, we have a bunch of power banks coming down the line.
 
Apr 9, 2018
2
I'm leaning towards option 3. If recording a commercial English synth is anything like recording an English UTAU voicebank, then as an untrained singer I find it's tough to maintain a consistently strong tone for the amount of recording English needs. But other people have managed to do it. UTAU isn't AI, tho.

IIRC, Solaria's creators are attempting to achieve a strong tone with her, so hopefully we'll soon see whether it's an engine problem or a voice provider problem.

edit: Solaria is strong 🥳
 
Last edited:
