Quote of the day—Microsoft researchers

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker’s emotion and acoustic environment of the acoustic prompt in synthesis.

Chengyi Wang, Sanyuan Chen, Yu Wu*, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei
January 5, 2023
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
[Emphasis added.

They have multiple samples you can listen to.

It does a scary good job. What could possibly go wrong?

Skynet smiles. This will be used in the terminators.—Joe]


7 thoughts on "Quote of the day—Microsoft researchers"

  1. “If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model.”

    Yeah, I’m sure that would be 100% effective. And I’m sure the government would never abuse such a system, right?

    I mean, why bother with getting undercover audio recordings of suspects when you can just take a voice sample and then “record” them saying whatever you want them to say? Especially when it’s just one of those evil white supremacist homophobic transphobic misogynist Reich wingers. Maybe we didn’t actually record him saying these terrible things, but I’m sure he was thinking it…fake but accurate right?

    • As soon as you “get one”, all you need is three seconds of JEJ’s voice and you’re good to go.

      • Would you have to pay Carrie Underwood for using her voice?
        Or some other celebrity?

  2. I want to combine this with ChatGPT and have it answer spam phone calls to waste their time.

  3. Looks like you solved Rolf’s audiobook problem, Joe!
    And when granny said trust only half of what you see, and none of what you hear?

