CoquiTTS is a library for advanced Text-to-Speech generation. Built on the latest research, it is designed to achieve the best trade-off among ease of training, speed, and quality.

It comes with pretrained models and tools for measuring dataset quality, and it is already used in 20+ languages for products and research projects.

CoquiTTS is written in Python, and it can be a handy tool for video game development, post-production, dubbing, and creating educational videos.

The CoquiTTS developers are now working on Coqui Studio, which will offer a simple, user-friendly interface for cloning voices and creating text-to-speech audio in MP3 format.
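
As a quick taste of the Python API, the sketch below loads a pretrained model and synthesizes a sentence to a WAV file. The model name is an assumption taken from the released English models; list the models available in your installation with tts --list_models.

    # Minimal synthesis sketch; the model name below is an assumption,
    # check `tts --list_models` for the models shipped with your install.
    from TTS.api import TTS

    # Downloads the model on first use, then loads it.
    tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

    # Synthesize a sentence and write the audio to disk.
    tts.tts_to_file(text="Hello from CoquiTTS!", file_path="output.wav")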

Features

  • High-performance Deep Learning models for Text2Speech tasks.
  • Text2Spec models (Tacotron, Tacotron2, Glow-TTS, SpeedySpeech).
  • Speaker Encoder to compute speaker embeddings efficiently.
  • Vocoder models (MelGAN, Multiband-MelGAN, GAN-TTS, ParallelWaveGAN, WaveGrad, WaveRNN).
  • Fast and efficient model training.
  • Detailed training logs on the terminal and TensorBoard.
  • Support for Multi-speaker TTS (see the sketch after this list).
  • Efficient, flexible, lightweight, but feature-complete Trainer API.
  • Released and ready-to-use models.
  • Tools to curate Text2Speech datasets under dataset_analysis.
  • Utilities to use and test your models.
  • Modular (but not too much) code base enabling easy implementation of new ideas.
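
Multi-speaker synthesis works through the same TTS.api entry point. The sketch below is hedged: the model name is an assumption, and the use of the speakers attribute assumes a released multi-speaker model.

    # Hedged multi-speaker sketch; the model name is an assumption,
    # substitute any released multi-speaker model.
    from TTS.api import TTS

    tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

    # Multi-speaker models expose their available speaker ids.
    print(tts.speakers)

    # Choose the voice (and language, for multilingual models) per call.
    tts.tts_to_file(
        text="The same model, a different voice.",
        speaker=tts.speakers[0],
        language="en",
        file_path="speaker0.wav",
    )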

Implemented Models

Spectrogram models

  • Tacotron
  • Tacotron2
  • Glow-TTS
  • SpeedySpeech

End-to-End Models

Attention Methods

  • Guided Attention: paper
  • Forward Backward Decoding: paper
  • Graves Attention: paper
  • Double Decoder Consistency: blog
  • Dynamic Convolutional Attention: paper
  • Alignment Network: paper

Speaker Encoder

Vocoders

  • MelGAN
  • Multiband-MelGAN
  • GAN-TTS
  • ParallelWaveGAN
  • WaveGrad
  • WaveRNN
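
A released vocoder can be paired with a Text2Spec model at synthesis time. The command below is a sketch of the tts command-line tool; the model and vocoder names are assumptions, so check tts --list_models for the exact identifiers.

    # Pair a spectrogram model with a vocoder on the command line.
    # Both names are assumptions; verify them with `tts --list_models`.
    tts --text "Text for speech synthesis." \
        --model_name "tts_models/en/ljspeech/glow-tts" \
        --vocoder_name "vocoder_models/en/ljspeech/multiband-melgan" \
        --out_path output.wav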

License

The project is released under the MPL-2.0 License.

Resources

  1. Source code
  2. Website