I don't think they take advantage of any special hardware acceleration. (Too bad, because Apple ships ML accelerator hardware on their systems.) They're very fast to render, though. Usually when I edit a note, the little render progress bar zooms by in about half a second to a second and it's ready to play.
Typically you need beefy hardware to train models, but just running inference on them is fairly resource-light.
I'm actually really curious about how the V6 engine works under the hood, and I wish they'd publish a paper or something like they did for the original Vocaloid. The existence of vocalo-changer and the relative file size of the editor seem like a hint... I wonder if there's a set of internal "carrier" samples that the editor uses, like a traditional sample-based bank, and it then alters the timbre with the voice bank's ML model. Or if the editor's phoneme/pitch signals just prompt the model to emit whatever it's been trained to produce as an "ah" sound or whatever.
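Just to make the two guesses concrete: here's a toy sketch of what each architecture might look like. To be clear, this is pure speculation on my part, not anything from Yamaha; every function name and signal here is made up, and the "model" steps are stand-in math, not actual neural networks.

```python
import math

SAMPLE_RATE = 16000

def carrier_sample(pitch_hz, n=160):
    """Hypothesis 1, step A: the editor holds internal 'carrier' samples,
    like a traditional sample bank. (Stand-in: a plain sine at the pitch.)"""
    return [math.sin(2 * math.pi * pitch_hz * i / SAMPLE_RATE)
            for i in range(n)]

def apply_voicebank_timbre(carrier, brightness=0.5):
    """Hypothesis 1, step B: the voice bank's ML model only reshapes the
    carrier's timbre, vocalo-changer style. (Stand-in: a waveshaping term
    that adds harmonic content.)"""
    return [x + brightness * x * abs(x) for x in carrier]

def direct_model_synthesis(phoneme, pitch_hz, n=160):
    """Hypothesis 2: phoneme/pitch control signals go straight into the
    model, which emits whatever it learned an 'ah' sounds like.
    (Stand-in: a phoneme-dependent mix of harmonics.)"""
    harmonic = 2 if phoneme == "ah" else 3
    return [math.sin(2 * math.pi * pitch_hz * i / SAMPLE_RATE)
            + 0.3 * math.sin(2 * math.pi * harmonic * pitch_hz * i / SAMPLE_RATE)
            for i in range(n)]

# Either path ends in a frame of audio samples for the same note:
frame_1 = apply_voicebank_timbre(carrier_sample(220.0))
frame_2 = direct_model_synthesis("ah", 220.0)
```

The practical difference would be in what ships in the editor vs. the voice bank: hypothesis 1 needs a shared carrier library in the editor (which could explain its file size), while hypothesis 2 puts essentially everything in the per-voice model.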