Is it my imagination, or do their voices (featuring Miku, Meiko, and Luka) sound quite different from their past NT voicebank demos? Luka's voicebank, in particular, reminds me of her V2 voicebank. However, I'm sorry to say that after listening several more times, I still feel that Luka's and Meiko's voices in this video are somewhat blended with Miku's (or CFM did something else to achieve this effect, idk), because the three of them sound quite similar (I can still tell them apart, but they're not as distinctive as their V3 or V4x voicebanks). Nevertheless, their voices have at least become much better compared to the previous NT voices (which were truly unbearable). Since the BGM is quite loud, I can't deduce much more unless I can listen to their voices in a raw state.
I agree that they all, Luka most especially, sound much more like themselves in this demo. Regarding the Miku comparison, you're not the only person who has raised this point, both in the past and now, though I think there's a fundamental misunderstanding behind it.
I do not believe that what we're hearing in Luka and Meiko is Miku; it's simply that we're hearing NT in Luka and Meiko. Think about it: for a long time, Miku was the only NT voicebank we had heard. As a result, I think there's a slightly skewed bias towards the engine recreating Miku, rather than Miku being a product of the engine. People usually point to consonants or vowel shapes, but if you compare Miku NT to her V4x, she also sounds very different in the way her vowels sound, the articulation of her consonants, etc., and this is because NT and Vocaloid2/3/4/5 operate on a fundamental difference in their technology.
NT is more akin to Vocaloid1, or to DeepVocal if we want to compare it to something more modern. While these voicebanks do have their WAV files within the actual compressed library, the WAVs are not what the engine uses; instead, it uses an additional component in the library: models. By contrast, in Vocaloid2/3/4/5, when you type in the lyrics "ka ba", the engine pulls the recorded WAV files directly:
[k a] [a]* [a b] [b a] [a]* [a Sil]
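That unit selection can be sketched in a few lines of Python. This is purely my own illustration of the general concatenative approach, not CFM's actual code; the vowel set and unit-name formatting are assumptions for demonstration:

```python
# Toy sketch of how a concatenative engine (Vocaloid2/3/4/5-style) might map
# a phoneme sequence to the recorded diphone/stationary units it pulls.
# Vowel set and naming are illustrative assumptions, not engine internals.
VOWELS = {"a", "i", "u", "e", "o"}

def lyric_to_units(phonemes):
    """Map phonemes (e.g. ['k','a','b','a'] for 'ka ba') to unit names."""
    units = []
    seq = phonemes + ["Sil"]                  # append trailing silence
    for cur, nxt in zip(seq, seq[1:]):
        if cur in VOWELS and units:           # held vowel -> stationary unit
            units.append(f"[{cur}]*")
        units.append(f"[{cur} {nxt}]")        # transition -> recorded diphone
    return units

print(" ".join(lyric_to_units(["k", "a", "b", "a"])))
# -> [k a] [a]* [a b] [b a] [a]* [a Sil]
```

The engine would then concatenate the audio stored for each of those units, which is why V2-V5 voices stay so close to the raw recordings.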
However, Vocaloid1 and NT do not pull the recorded WAV files; they pull from a recreated model of the WAV file. The actual recorded WAV file never plays directly in the editor; everything we are hearing has been rebuilt by the engine based on several factors. The benefit of this is that you can (in theory) have a wider array of editability and manipulation (like tension). On the downside (and this is why Yamaha ditched this technology in the crossover to V2, outside of formant models for stationaries), depending on the complexity of the model, it can heavily impact both the sound and the size of the voicebank. It's also important to note that currently, unless something changed for NT2 (if it is NT2), they are using SPTK, an open-source toolkit created in the late '90s/early '00s, as the basis of their synthesis technology, which likely plays a huge role in the quality.
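To make that distinction concrete, here's a toy Python/NumPy sketch of the model-based idea. This is my own crude illustration, not SPTK's or CFM's actual pipeline: the "model" here is a truncated cepstrum standing in for proper mel-cepstral coefficients, and the excitation is a bare pulse train. The point is only that the audio you hear is rebuilt from stored coefficients, never played back from the WAV:

```python
# Toy model-based synthesis sketch (illustrative only, not SPTK/NT internals):
# analyze() compresses a recorded frame into a small spectral model, and
# resynthesize() rebuilds audio from that model instead of playing the WAV.
import numpy as np

def analyze(frame, order=24):
    """Reduce a frame to a compact 'model': a truncated real cepstrum
    (a crude stand-in for SPTK-style mel-cepstral coefficients)."""
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-9   # avoid log(0)
    cepstrum = np.fft.irfft(np.log(spectrum))
    return cepstrum[:order]                        # keep only the envelope

def resynthesize(model, f0, sr=16000, n=512):
    """Rebuild audio from the model: filter a pulse train (the 'voice
    source') with the spectral envelope recovered from the coefficients."""
    env = np.exp(np.fft.rfft(model, n))            # approx. spectral envelope
    pulses = np.zeros(n)
    pulses[:: int(sr / f0)] = 1.0                  # simple glottal pulse train
    excitation = np.fft.rfft(pulses)
    return np.fft.irfft(excitation * env, n)       # rebuilt, never the WAV
```

Because the output is generated from coefficients, parameters like pitch or (in a real engine) tension can be changed freely before synthesis, but any loss in the model directly colors the result, which is the trade-off described above.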
So, this is all to say that while in some parts it might sound like "Hey, this is eerily similar to how Miku NT would sound," it's more so that "Oh, this is just how NT is causing voices to sound."
* These would be the stationaries. While formant synthesis is used to ensure they sound natural when long and extended, I'm not 100% certain of the implementation, since I naturally only know what I've found through research over the years. It's possible these stationaries use a formant-synthesis model in conjunction with recorded diphones/triphones, but it's also possible the engine merely uses it for reference and still pulls the recorded WAV.