CeVIO Deep-learning-based CeVIO further in development

uncreepy · Oct 11, 2019

tl;dr CeVIO's updated engine that uses deep learning has DAW integration, balances quality/render time better than before (went from taking 10 hours for a 5 min song to now running on a laptop), will probably be called "NeoCeVIO", and the vocals appear to be getting Neo names (ex: "Neo Sasara").

Note:
For those of you out of the loop, check out the English/Japanese/Chinese demos from December 2018 here: Reproducing high-quality singing voice
For the most recent demo using English/Japanese, check out this full length song ("Itsuka Kanarazu"): New singing synthesis demo from CeVIO developer Techno-Speech

On October 9th, Techno-Speech teased the upcoming deep-learning-based "NeoCeVIO" (temporary name) at Meiji Kinenkan (a historic venue mostly used for parties/weddings). The event was free to the public and featured posters and a software demonstration.
Kazuhiro Nakamura is a researcher at Techno-Speech who ran the booth.

On the page linked, 総務省｜「ICTイノベーションフォーラム2019」の開催 (Ministry of Internal Affairs and Communications | Holding of "ICT Innovation Forum 2019"),
it explains that the event was part of "ICT Innovation Forum 2019", hosted by the Ministry of Internal Affairs and Communications to show off technology research and development in the sphere of telecommunications.

(Eji is a person who collects a lot of information from Miku/Crypton-related events.)
Eji says that other potential names for the new CeVIO are: CeVIO AI (Techno-Speech apparently used this name before), CeVIO Pro (a user called PSGOZ heard this name), and NeoCeVIO (kM4osM (pronounced "kurosu") heard this name at the event). (I'm going to call it NeoCeVIO, because that's what was heard during this event a few days ago.)

The goal of the update is to give the voices more diverse expression. The other points Eji noted are too advanced for my understanding of vocal synthesis to rephrase.

(Chiteico is a person related to the Synth V sphere and frequently talks to Amano Kei about it.)
Chiteico thinks that while VOCALOID:AI is aimed at pros, NeoCeVIO is aimed at "DTMers" (the Japanese term for Desktop Musicians who make MIDI music)/Vocaloid producers.

KM4osM says that NeoCeVIO's good point is that it balances the quality of the voices and the speed of synthesis. If the balance is changed in one direction, it makes the product change greatly (ex: high quality, very slow vs low quality, very fast).

With this tweet, we will move on to KM4osM's blog post. They appear to be the only member of the vocal synth tinfoil hat brigade who actually went to the event and took pictures to share on Twitter. Kazuhiro Nakamura said many people showed up, but I could not find any other tweets than the ones I shared.

NeoCeVIO（仮）見てきた｜くろ州の合成音声備忘録

(This is a summary of it, not word for word because of time constraints.)

In 2018, Techno-Speech showed their deep-learning-based vocal synthesis that sounded human.
Upon seeing Kazuhiro Nakamura's tweet, KM4osM felt an obligation to go to the event.
Meiji Kinenkan had a regal air about it, which made sense on account of it being a Ministry of Internal Affairs and Communications event.
At the venue, this was the booth (Nakamura was there):

IT HAS DAW INTEGRATION!!!!!!!!!!!!!!!!! (#1 impression) (It was being used with REAPER at the event.)
It seems that NeoCeVIO is closer to being a full product that can be used with a DAW (and maybe standalone?).
On top of that, NeoCeVIO was even running on a laptop!
Last year around March, it took 10 hours to render a high quality 5 minute song ("Itsuka Kanarazu" linked at the start of this post). But now the render time has been successfully reduced.
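
For a sense of scale, rendering speed is often expressed as a real-time factor (render time divided by audio length). This is only a back-of-the-envelope sketch using the two numbers quoted above. The function name is my own and has nothing to do with Techno-Speech's software.

```python
def real_time_factor(render_seconds, audio_seconds):
    """Return how many seconds of rendering are needed per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio length must be positive")
    return render_seconds / audio_seconds


# Figures quoted in the post: 10 hours of rendering for a 5 minute song.
old_rtf = real_time_factor(10 * 60 * 60, 5 * 60)
print(old_rtf)  # 120.0
```

A factor of 1.0 or lower would mean the synthesis keeps up with playback. No figure for the new version was given at the event, so only the old number is computed here.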

[The main question]
From what KM4osM could see, NeoCeVIO (temporary name) seemed to be quite complete. (It sounds like even if you don't tune, the singing sounds good so you can compose late at night and it feels like "the future is here".)
The parameters it had were volume, pitch, timing (duration), vibrato, so it isn't very different from the current CeVIO. KM4osM wasn't sure if they will add more parameters/adjust them due to this being a temporary version or not.
At the booth, "Neo Sasara" (temporary name) sang a cover of "Ai o Komete Hanataba o" by Superfly.

^ This is not the Neo Sasara version, this is the real human version.

It seemed like Neo Sasara could sing with a calm tone of voice, and you could feel the expression in her singing. The pitch changed, and there was subtle "shakuri" (a Japanese singing technique where you sing slightly lower than the note's pitch and ease into the "correct" pitch).

[When will it become a product?]
We don't know. The engine seems fine, but there were things that needed work with the GUI.

Scarlet Illusion · Oct 11, 2019

Sounds pretty interesting! I'll have to check out the tweets and articles when I've got the chance. It seems like all the popular vocal synths (except for UTAU, maybe) are really working to improve what their software can do!

Thanks for sharing the information, Uncreepy!

xuu · Oct 11, 2019

YES YES YESS FINALLY NEWS ITS BEEN SO LONG

I will go back to being a full time CeVIO shill on the condition they release the Superfly cover as a demo song, it's literally one of my favourite songs.

lIlI · Oct 11, 2019

This is sooner than I expected!

Chiteico thinks that while VOCALOID:AI is aimed at pros, NeoCeVIO is aimed at "DTMers" (the Japanese term for Desktop Musicians who make MIDI music)/Vocaloid producers.

Chiteico, my life is in your hands buddy :miriam_lili:

Cluemily · Oct 12, 2019

I'm crying the fact they said it could run on a laptop brings me hope. V5 is so unstable on my poor laptop compared to past engines, so knowing NeoCeVIO would probably run pretty well is so good. The voices still sound great as well. ;;

YOYo_MAMA · Oct 13, 2019

After a hundred years!! They finally gave us an update.
I was starting to lose hope on this project. I just hope the audio of NeoCeVIO will be released somehow. I really want to hear how the new software sounds so far.

Prism · Oct 13, 2019

I don't know if it's just me and coming from an animation background where it takes hours to render a frame but for it to take 5 hours for that quality I think it's well worth the time

uncreepy · Oct 13, 2019

Prism said:
I don't know if it's just me and coming from an animation background where it takes hours to render a frame but for it to take 5 hours for that quality I think it's well worth the time

I wonder if they had to just export it and hope it sounded good, or if they could tune it and THEN export it? They said they had to fix up the errors with Melodyne, so part of me wonders if the results of "Itsuka Kanarazu" were not able to be controlled. With NeoCeVIO, KM4osM said there were the same old tuning parameters, so why would they have to fix the errors with the pre-commercial/pre-NeoCeVIO version from April through Melodyne? Doesn't it almost seem as if the way the software ran before NeoCeVIO was somehow completely different?

That being said, I wouldn't mind having to wait 5 hours for very human-sounding results. Even if it took that many hours, that would be about the same time it takes to tune a song by hand. But if there was no way to preview the song before a 5-hour export, and you then had to fix it in Melodyne, that would make it slightly less appealing.

uncreepy · Oct 18, 2019

Eji contacted me on Twitter about this thread regarding the upcoming CeVIO. He wrote to me in English, but asked me to rewrite what he wrote before sharing it with everyone. So, I will put a quote box around Eji's information, even though they are rephrased quotes and not direct quotes.

kM4osM called it "NeoCeVIO" on his own. Eji contacted a CeVIO team member through DM and they said they do not have any official name.

Dang it, I really liked the sound of "NeoCeVIO" and "Neo Sasara". I was all on board for that being true.
:ring_ani_lili:

PSGOZ did not go to the event on October 9th at Meiji Kinenkan. But he tweeted "CeVIO Pro" on December 14th, 2018 when he saw this news release:

The DTM Station article is called "Revolution in singing voice synthesis! AI singing synthesis system sings just like a human through deep learning developed by Nagoya Institute of Technology and Techno Speech". In the article, they compare the new CeVIO to other vocal synths, mainly arguing that VOCALOID sounds inhuman, while CeVIO will revolutionize the vocal synth world by sounding human (but isn't ready to be sold). They also talk about Microsoft's virtual singer, Rinna. The new CeVIO wav files used to be monotonous, but are now more dynamic and expressive.

Old CeVIO:

New CeVIO (Well... 2018 new. Might not reflect the version that is now closer to commercial sales):

Note: I did not see CeVIO being called "CeVIO Pro" by DTM Station anywhere in the article.

For this tweet, I said I wasn't able to translate the technical stuff, but Eji explained for me:

The goal of the CeVIO update is to accelerate the very slow synthesis of the CeVIO AI that was public in December, 2018. The following article (in English) details that CeVIO was based on CNNs (convolutional neural networks). The new CeVIO uses DNN (deep neural networks) to improve the naturalness of the synthesized singing voice (the new CeVIO based on a DNN model can synthesize natural sounding singing): [1904.06868] Singing voice synthesis based on convolutional neural networks

Objective: CeVIO wants the singing voice to have diverse expression, but also wants it to be synthesized quickly. So they model the time-related structure with a CNN and do not use a recursive structure. (Recursive means it loops/chains itself.)

The previous quote is a rough description of a paper from the Acoustical Society of Japan's 2019 Autumn Meeting. The document is a PDF you can pay for if you are not a member of ASJ. (I won't link to it, because no one who doesn't know Japanese/Japanese vocal synth lingo will be able to read it.)

So we do not have the details that those professors have. We only know that it is much faster, can reach real-time just like the current CeVIO that already exists on the market, and can run on a notebook PC.
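
To illustrate why dropping the recursive structure can help with speed, here is a toy sketch in plain Python. It is not the CeVIO model (we don't have those details), and the function names, weights, and frame values are made up for illustration. In the convolution, each output frame only looks at a fixed window of input frames, so every frame could be computed independently. In the recurrent version, each output needs the previous output first, so the frames have to be processed one after another.

```python
def conv1d_same(frames, kernel):
    """Centered 1-D convolution over time with zero padding at the edges.

    Each output depends only on a fixed window of inputs, so the outputs
    could be computed in any order (or in parallel).
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(frames)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            i = t + k - half
            if 0 <= i < len(frames):
                acc += weight * frames[i]
        out.append(acc)
    return out


def recurrent(frames, w_in, w_state):
    """Simple recurrence: each output feeds into the next one.

    The loop is inherently sequential because step t needs the state
    produced at step t - 1.
    """
    state = 0.0
    out = []
    for x in frames:
        state = w_in * x + w_state * state
        out.append(state)
    return out


# Hypothetical input frames, chosen only to make the arithmetic easy to follow.
frames = [1.0, 2.0, 3.0, 4.0]
conv_out = conv1d_same(frames, [0.25, 0.5, 0.25])
rec_out = recurrent(frames, 0.5, 0.5)
print(conv_out)  # [1.0, 2.0, 3.0, 2.75]
print(rec_out)   # [0.5, 1.25, 2.125, 3.0625]
```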

But it wasn't quite the same as the one made public in December 2018, which had two kinds of engines, shown below:

It had a CNN+V version and a CNN+W version.
V means "traditional vocoder" and W means "WaveNet vocoder". The latter produces a result almost identical to a human voice, with a mean opinion score (MOS) of 4.23.
CNN+V was shown at ASJ in 2019/9 and shown to the public in 2019/10.
It is a version of CeVIO with significantly increased quality, but not the same quality as shown on the 2018/12 website (these demos: Reproducing high-quality singing voice).

We don't know how good CNN+V was, since it wasn't compared directly with the CNN+W from the website demos, but since it used a traditional vocoder, it cannot break the barrier of MOS 4.
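
For anyone unfamiliar with MOS: a mean opinion score is the average of listener ratings on a 1 (bad) to 5 (excellent) scale. Here is a minimal sketch of that arithmetic. The ratings below are invented purely for illustration and are not from the paper or the demos.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of listener ratings on the usual 1-5 MOS scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)


# Hypothetical ratings from eight listeners (not real data).
hypothetical_ratings = [5, 4, 4, 5, 3, 4, 5, 4]
mos = mean_opinion_score(hypothetical_ratings)
print(mos)  # 4.25
```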

At its default values, it may be better than songs known as 「神調教」(godly-tuned) in the Japanese vocal synth community, since it is machine learning trained on a human singer's voice.
But it is still within reach of hand-tuned VOCALOID songs, since it uses the same vocoder-type synthesis.
Theoretically.

And it is all automatic and does not need human tuning. The live demo (from the 9th) showed the software running on a notebook PC, and it was not edited while being operated at the show. (Or rather, the people who went, like kM4osM, said "We do not know if it can do that since we didn't see it", so we can assume that it can do that.)

So, to sum it all up, the "NeoCeVIO" name is not official (RIP). The software behind the new CeVIO has changed a bit since the old demos from last year. But it is now fast and even a notebook PC can run it. And it will sound good even without tuning it due to the DNN.

uncreepy · Nov 27, 2019

Adding the info on here cause I dunno where else to put it.

Techno-Speech has established a new twitter for CeVIO/vocal synth technology: テクノスピーチ（Techno-Speech） (@techno_speech) | Twitter

Their icon is by chie_rico. Both the artist's description and Techno-Speech's description say this is for their Twitter avatar, but it seems weird to put in that much effort to design a character simply for a tiny Twitter icon. (Please be a new vocal synth character!)

Her hair clip says TS and she has a pink/blue futuristic design.
Edit: Her pink/blue are the colors from the CeVIO icon.

So far, Techno-Speech has tweeted about vocal synths in general, such as the NHK Special about VOCALOID:AI, and has said that Techno-Speech was created 10 years ago in 2009 and that other Japanese Vocaloids like Luka, Kiyoteru, Yuki, miki, and GUMI are their classmates (dunno why the emphasis is on AHS lol).

Ulysses · Nov 27, 2019

As an old CeVIO fan, I'd predict that this character already has a voice db (Talk, not sure for Song).
And it's likely that she will show up in CeVIO.

lIlI · Nov 28, 2019

Maybe she's like FL-Chan, a mascot for the whole software? Of course, that doesn't rule out a voicebank.

Prism · Nov 28, 2019

I like the look of her maybe she's going to be a bundled in voice or like vy1

xuu · Nov 30, 2019

I think she's likely to be a new flagship character, either alongside or replacing Sasara as the mascot of the software. Wouldn't be surprised if they revamp their sales model and include her like a default DB with CeVIO's new update. Would be cool if she got a Song Voice that represents the advancements they've made. Maybe a bit VY1 to Sasara's Miku?

Trevor · Dec 1, 2019

They might not have been able to rerecord Sasara's vocals so they can meet the new quality precedent CeVIO is trying to set. Like, she may well be included or even purchasable just for fans, but CeVIO has and is trying to keep up with a modern user base. EDIT: after researching Sasara's VP, it appears her career took off two years later and she now has a singing career. Scheduling a recording session more grueling than the last would be expensive and difficult, even if the VP was still interested in a product that might compete against her. Especially for a free voice.

uncreepy · Dec 1, 2019

They actually used Sasara in the demo at the Ministry of Internal Affairs and Communications event (first post in the thread) and said her voice became much more expressive/human. I don't doubt that they would make a more modern voicebank with a new character to help launch the engine, but it seems that even old voicebanks can benefit from the new features of new CeVIO's engine. Plus, for the English/Japanese/Chinese demos back in December 2018, they even used IA's English voice with the new engine: Reproducing high-quality singing voice (Now that I think of it, they used both Sasara and IA in the world's first deep-learning-produced CD that featured the song "Itsuka Kanarazu".)

I'm curious how the Color Voice Series would work with the new CeVIO.

Prism · Dec 1, 2019

Everyone seems to forget about them I wonder how the enka singers ones will sound also the green one is best girl

uncreepy · Dec 3, 2019

Techno-Speech is having another demo/presentation about new CeVIO.

It will be on December 5th from 1:30 PM ~ 5 PM in Nagoya. The presentation will be by Kazuhiro Nakamura, who will give an oral presentation followed by a poster presentation. The contents are basically the same as the presentation in October (see 1st post on this thread).

Hopefully more people see this and share pictures/explanations of what they saw.

xuu · Dec 5, 2019

CeVIO CS7 is coming. Whether this is the deep-learning update in full or not, I'm unsure. CeVIO will become a 64-bit application, and there will be support for Ruby text and splitting/combining accent (?) in Talk. For Song, what I *believe* it's saying is that the "voice quality" or gender parameter will be usable on the time axis like other parameters. For English Song voices, words will be automatically split into their syllables. Beta testing will open in a few days.

uncreepy · Dec 5, 2019

My translation of the notice:

[Notice]
Development of "CeVIO CS7", the next big version update, is now coming to a close.

Things like becoming a 64-bit application, splitting/joining of pitch accents in the Talk section, ruby assignment for lines of speech, detailed tuning of singing voice quality along the time axis, automatic splitting of English lyrics into syllables, etc. are planned.

Beta testing is planned to start soon, so please look forward to it!!
