ParlerVoice

What is ParlerVoice?

You know Parler-TTS. You've experienced its groundbreaking approach to text-to-speech synthesis: the dual tokenizer architecture, the natural language descriptions, the impressive 45K hours of training data. But what if we told you there was more to be discovered?

ParlerVoice represents the next chapter in this story. Born from 650 hours of meticulously curated audio data and refined through advanced preprocessing techniques, it's not just another fine-tuned model; it's a reimagining of what's possible with expressive speech synthesis.

🚀 The Transformation: From Good to Exceptional

The Foundation We Built Upon

Parler-TTS Mini v1.1 gave us a solid foundation: 938M parameters, 34 speakers, and a robust architecture. But we saw potential. We saw opportunities to push the boundaries of what expressive TTS could achieve.

Our Vision

What if we could create a model that didn't just generate speech, but truly understood the nuances of human expression? What if we could offer unprecedented control over every aspect of voice generation while maintaining the naturalness that made Parler-TTS revolutionary?

🎯 The Journey: Training & Refinement

The Data That Changed Everything

Our journey began with 650 hours of audio: not just any audio, but carefully curated proprietary data. Each sample was chosen for its quality, consistency, and expressive potential.

The Training Process

10 epochs of intensive training on NVIDIA A100 hardware
Same architecture as Parler-TTS Mini v1.1 (938M parameters)
Specialized fine-tuning approach that preserved the base model's strengths while enhancing its capabilities
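
To put those numbers in perspective, here is a minimal back-of-the-envelope sketch in Python. It only uses figures quoted in this page (650 hours, 10 epochs, 45K hours for the base model); the class and field names are hypothetical and are not part of any ParlerVoice or Parler-TTS code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneBudget:
    """Illustrative container for the figures quoted above (hypothetical name)."""

    dataset_hours: float  # hours of curated audio (650 in the text)
    epochs: int           # full passes over the dataset (10 in the text)
    base_hours: float     # hours used to train the base model (45K in the text)

    @property
    def audio_hours_seen(self) -> float:
        """Total hours of audio processed across all epochs."""
        return self.dataset_hours * self.epochs

    @property
    def unique_data_ratio(self) -> float:
        """Fine-tuning data as a fraction of the base model's training data."""
        return self.dataset_hours / self.base_hours


budget = FineTuneBudget(dataset_hours=650, epochs=10, base_hours=45_000)
print(budget.audio_hours_seen)            # 6500
print(f"{budget.unique_data_ratio:.1%}")  # 1.4%
```

In other words, the fine-tuning set is a small, focused slice compared with the base model's corpus, which fits the goal of refining the base model rather than retraining it.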

The Preprocessing Revolution

Here's where our story takes an innovative turn. We didn't just use the data as-is—we transformed it through a comprehensive preprocessing pipeline:

Accent & Gender Detection: We employed advanced models to automatically detect accents and gender characteristics, creating a comprehensive understanding of each speaker's unique vocal identity.