All of this is highly speculative, but I think I might have an idea of what's going on.
They showed how they can detect the F0 and the formants, and both samples sounded like some of the formants were additionally attenuated, and therefore quieter than in the original audio. Also, this doesn't only happen where sample transitions might be, but throughout the audio clips.
Together, this suggests to me that their engine is "pitch-adaptive", meaning the data batches it processes each represent a single F0 period, i.e. a single vocal cord vibration, instead of a fixed amount of time. (They might also be processing a fixed number of vocal cord vibrations at a time, but that wouldn't really change things.)
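To make the idea concrete, here's a rough sketch in Python/NumPy of what pitch-synchronous segmentation looks like. This is purely my own illustration (the function name, the unvoiced fallback, everything), not anything Maghni has shown: given an F0 estimate, you cut the signal into chunks that each span exactly one glottal period, so the chunk length changes whenever the pitch does.

```python
import numpy as np

def pitch_synchronous_frames(signal, f0_track, sr):
    """Split `signal` into frames that each cover one F0 period.

    f0_track: F0 estimate in Hz for every sample (0 where unvoiced).
    Returns a list of variable-length frames, one per period.
    Illustration only, not any real engine's code.
    """
    frames = []
    pos = 0
    while pos < len(signal):
        f0 = f0_track[pos]
        if f0 <= 0:                      # unvoiced: fall back to a fixed 5 ms hop
            period = int(0.005 * sr)
        else:                            # voiced: one frame = one glottal period
            period = int(round(sr / f0))
        frames.append(signal[pos:pos + period])
        pos += period
    return frames

# Example: a 220 Hz tone at 48 kHz gives frames of ~218 samples each
sr = 48000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)
frames = pitch_synchronous_frames(tone, np.full(len(tone), 220.0), sr)
print(len(frames), len(frames[0]))       # roughly 220 frames, ~218 samples each
```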
This has several advantages for the math running in the background. I'm not sure how to explain the exact reasons concisely, but the gist is that when every batch is exactly one pitch period, changing the pitch mostly comes down to re-spacing those batches, while the formants stay where they are. The end result is that pitch-shifting samples, and a few other things like Gender Factor modifications, are far easier for them than if their engine were using a fixed tick rate. So I wouldn't be too worried about the sample having the same pitch as the original recording right now; it's unlikely they'll have problems with pitch changes, and parameters like Gender Factor and Breathiness should work well too. (Btw, most commercial concatenative vocal synths also use pitch-adaptive synthesis.)
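Here's roughly why that re-spacing trick works, sketched as a very stripped-down TD-PSOLA-style shift. Again, this is my own illustration of the general technique, not a claim about how Maghni's engine actually does it: the frames themselves are untouched (so the formants keep their positions), only the spacing at which they're overlap-added changes.

```python
import numpy as np

def psola_shift(frames, sr, f0_old, f0_new):
    """Very rough TD-PSOLA-style pitch shift (illustration only).

    frames: pitch-synchronous frames, one period each (see previous sketch).
    The frames are reused as-is; only their spacing changes, which moves
    the perceived pitch from f0_old to f0_new without moving the formants.
    """
    old_hop = int(round(sr / f0_old))
    new_hop = int(round(sr / f0_new))
    out = np.zeros(new_hop * len(frames) + 2 * old_hop)
    pos = 0
    for frame in frames:
        win = np.hanning(len(frame))      # taper to hide frame edges
        out[pos:pos + len(frame)] += frame * win
        pos += new_hop                    # new spacing = new pitch
    return out

# Shifting the 220 Hz tone from the previous sketch up a fifth (~330 Hz):
# shifted = psola_shift(frames, sr, 220.0, 330.0)
```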
However, the biggest disadvantage of pitch-adaptive engines is that the data batches each contain a different amount of data. This makes constructing the transitions between them a lot harder, and if it isn't done well enough, the flawed transitions become audible as selective attenuation of formants, like what can be heard in the sample. Mitigating this effect is an extremely complicated task; over a dozen research papers have been written on the subject, each proposing a different method for dealing with the problem.
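One way to picture where that selective attenuation could come from (my interpretation, not something Maghni has confirmed): when frames are overlap-added, their windows need to sum to a roughly constant envelope. With a fixed hop you can pick a window/hop pair that satisfies this once and forget about it; with variable-length, pitch-synchronous frames the window sum ripples, and that ripple acts like an amplitude modulation that eats into parts of the spectrum unless it's compensated. A quick sketch of the envelope and the compensation:

```python
import numpy as np

def window_sum(frame_lengths, hops, total_len):
    """Sum of Hann windows laid out at the given (variable) hops.

    With a fixed hop and a suitable window this sum is flat (the classic
    constant-overlap-add condition). With pitch-synchronous frames it
    ripples, and that ripple is audible unless it's divided out.
    """
    env = np.zeros(total_len)
    pos = 0
    for length, hop in zip(frame_lengths, hops):
        env[pos:pos + length] += np.hanning(length)
        pos += hop
    return env

# Illustrative fix: after overlap-adding the frames into `out`,
# divide by the envelope wherever it is nonzero:
#   env = window_sum(lengths, hops, len(out))
#   out[env > 1e-8] /= env[env > 1e-8]
```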
But I think Maghni AI is using its own algorithm for this task instead of one from a paper. During their original Q&A, I asked several questions about their synthesis algorithm, which they understandably didn't want to answer in full. However, they did say that their engine isn't using Fourier transforms, but rather a custom, similar algorithm. One of the many variants of the Fourier transform is the so-called "inverse short-time Fourier transform" (ISTFT), which is what non-pitch-adaptive engines use to compute the transitions between data batches. So that quote can be interpreted as them adapting this kind of algorithm to their pitch-adaptive synth.
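For reference, this is what standard ISTFT resynthesis looks like in a fixed-hop engine: inverse-FFT each spectral frame, window it, and overlap-add at a constant hop, so the transitions between frames fall out of the overlapping window tails. My guess, and it is only a guess, is that their custom "Fourier-like" transform has to do the same job with a different frame length and hop for every frame.

```python
import numpy as np

def istft_overlap_add(spectra, frame_len, hop):
    """Minimal fixed-hop inverse STFT via windowed overlap-add.

    spectra: list of one-sided spectra (e.g. from np.fft.rfft of Hann-windowed
    frames). A pitch-adaptive engine would need the equivalent of this with a
    different frame_len/hop per frame, which is where the transition problem
    comes from.
    """
    n_frames = len(spectra)
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    norm = np.zeros_like(out)
    win = np.hanning(frame_len)
    for i, spec in enumerate(spectra):
        frame = np.fft.irfft(spec, n=frame_len)    # back to the time domain
        start = i * hop
        out[start:start + frame_len] += frame * win
        norm[start:start + frame_len] += win ** 2  # squared-window normalization
    out[norm > 1e-8] /= norm[norm > 1e-8]
    return out
```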
Assuming any of this is correct, which I wouldn't be too sure about, it would mean they've been working on the part responsible for most of the deviation between the synthesized and the original sound since way before the first Q&A, and probably still are. It's very impressive that they took on such a task at all, but it's still kind of a gamble: if they get it wrong, there's almost no way to work around it, but if they get it right, it drastically improves the quality of the engine in other areas.
Please keep in mind that this really is speculation. I only know the same audio samples as you, and I've based a lot of assumptions on just that (and a question from an old Q&A).