You know Parler-TTS. You've experienced its groundbreaking approach to text-to-speech synthesis: the dual tokenizer architecture, the natural language descriptions, the impressive 45K hours of training data. But what if we told you there was more to be discovered?
ParlerVoice represents the next chapter in this story. Born from 650 hours of meticulously curated audio data and refined through advanced preprocessing techniques, it's not just another fine-tuned model; it's a reimagining of what's possible with expressive speech synthesis.
Parler-TTS Mini v1.1 gave us a solid foundation: 938M parameters, 34 speakers, and a robust architecture. But we saw potential. We saw opportunities to push the boundaries of what expressive TTS could achieve.
What if we could create a model that didn't just generate speech, but truly understood the nuances of human expression? What if we could offer unprecedented control over every aspect of voice generation while maintaining the naturalness that made Parler-TTS revolutionary?
Our journey began with 650 hours of audio: not just any audio, but carefully curated proprietary data. Each sample was chosen for its quality, consistency, and expressive potential.
Here's where our story takes an innovative turn. We didn't just use the data as-is; we transformed it through a comprehensive preprocessing pipeline: