My Top 5 Open Source Text to Speech Softwares Starting off in 2024

Jarods Journey
11 Jan 2024 08:36

TLDR: In this video, the creator reviews the top five open-source text-to-speech tools, sharing audio samples to demonstrate their capabilities. The tools discussed include Suno's Bark, VALL-E X, Whisper X, StyleTTS 2, Coqui's XTTS, and the favored Tortoise TTS. The creator highlights the strengths and limitations of each, noting that while some produce good output, none is perfect. Tortoise TTS stands out for its quality and its potential to improve with additional training and techniques like RVC, which give a more natural and accurate voice.

Takeaways

  • 🔊 The video discusses the top five open-source text-to-speech (TTS) tools the narrator encountered over the past year.
  • 🎤 The first TTS tool mentioned is Suno's Bark, which gained attention but hasn't been updated recently.
  • 🗣️ VALL-E X is another TTS tool capable of voice cloning, but it's not considered the best due to its robotic tone.
  • 🌟 The narrator tests various TTS tools, including with a preset voice from Elden Ring and a Japanese voice.
  • 🎧 StyleTTS 2 is highlighted for its fantastic voice quality and active community support.
  • 📝 Coqui TTS's XTTS is noted for its speed and quality, using HiFi-GAN and a different training approach.
  • 🇯🇵 The narrator demonstrates the TTS software's ability to handle Japanese, though pitch accuracy is a challenge.
  • 🥇 Tortoise TTS is the narrator's favorite, offering the best quality and experience, especially when combined with RVC (Retrieval-based Voice Conversion).
  • 📚 The video includes a demonstration of generating an audiobook using Tortoise TTS with RVC.
  • 📌 The narrator emphasizes the importance of voice training for better TTS output quality.
  • 📢 The video concludes with a recommendation for Tortoise TTS and a call to action for viewers to share their thoughts in the comments.

Q & A

  • What is the first open-source text-to-speech software mentioned in the transcript?

    -The first open-source text-to-speech software mentioned is Suno's Bark.

  • What feature of Suno's Bark caught attention when it was released?

    -Suno's Bark caught attention due to its curated example of a British-accented voice.

  • How does the speaker describe the performance of Suno's Bark?

    -The speaker describes Suno's Bark as good but not the best, and mentions that it hasn't been updated much since the developers' focus shifted to their Chirp feature.

  • What is VALL-E X's capability in relation to text-to-speech?

    -VALL-E X can do voice cloning and produces decent text-to-speech output, though it is not always accurate.

  • How does the speaker evaluate the quality of the synthesized voice from Whisper X?

    -The speaker finds the synthesized voice from Whisper X to be robotic and not very accurate.

  • What is the speaker's opinion on StyleTTS 2?

    -The speaker hasn't trained any voices with StyleTTS 2 but finds the out-of-the-box voice to be fantastic.

  • What makes XTTS stand out according to the speaker?

    -XTTS is considered better than Suno's Bark and VALL-E X, with the speaker noting its quality and the fact that it uses HiFi-GAN instead of diffusion for the mel spectrogram.

  • Why does the speaker prefer Tortoise TTS?

    -The speaker prefers Tortoise TTS due to their experience with it, its high quality, and the ability to maintain good intonation when used with RVC.

  • How does the speaker enhance the quality of Tortoise TTS?

    -The speaker enhances the quality of Tortoise TTS by using it in conjunction with RVC, which improves the voice's naturalness and intonation.

  • What is the speaker's overall recommendation for the best open-source text-to-speech software?

    -The speaker's overall recommendation is Tortoise TTS, especially when combined with RVC for voice cloning.

Outlines

00:00

🗣️ Open-Source Text-to-Speech Software Overview

This paragraph introduces five top open-source text-to-speech (TTS) tools encountered over the past year. It begins with a comparison of a British-accented voice with the author's actual voice. The first tool discussed is Suno's Bark, which gained attention but hasn't seen updates as the developers focus on a feature called Chirp. The author also mentions VALL-E X, which can clone voices but is not the best option. The discussion includes audio samples from different TTS tools, including a comparison between pre-made and cloned voices, with a focus on the quality and accuracy of the output.

05:02

🎤 Detailed Reviews and Voice Cloning

This paragraph delves into the specifics of each TTS tool. It starts with a brief mention of Whisper X and its robotic output. The author then discusses StyleTTS 2, highlighting a fantastic voice sample provided by a Discord member. The paragraph continues with Coqui TTS's XTTS, which the author has fine-tuned, and Tortoise TTS, which is the author's favorite due to their extensive experience with it. The author emphasizes the quality of Tortoise TTS, especially when combined with RVC (Retrieval-based Voice Conversion), and shares their enthusiasm for AI voice cloning. The paragraph concludes with a comparison of the native output and the improved output after RVC processing.

Keywords

💡Open-source text-to-speech software

Refers to software applications that convert written text into spoken words, which are freely available for use and often allow users to modify and distribute the source code. In the context of the video, the narrator reviews various open-source options, highlighting their features and performance. Examples include Suno's Bark, VALL-E X, and Tortoise TTS, which are discussed for their voice cloning capabilities and quality of speech synthesis.

💡Voice cloning

The process of creating a synthetic voice that mimics the characteristics of a specific person's voice. In the video, the narrator experiments with software that can clone voices, such as using a preset voice from Elden Ring or cloning the narrator's own voice. This technology is showcased as a significant feature of the reviewed text-to-speech software.

💡Hugging Face

A platform that provides a space for developers to share and collaborate on machine learning models, including those for text-to-speech applications. In the video, the narrator uses Hugging Face to generate audio samples with different speakers, demonstrating the versatility of the platform in conjunction with text-to-speech software.

💡StyleTTS 2

A specific text-to-speech software mentioned in the video, known for its high-quality voice output. The narrator discusses the fantastic voice quality produced by this software, indicating that it's a result of the active development and community engagement around the project.

💡XTTS

Another text-to-speech software discussed in the video, which the narrator has fine-tuned for better performance. XTTS is noted for its speed and quality, with the narrator expressing a preference for it over other options like Suno's Bark and VALL-E X.

💡Tortoise TTS

The narrator's favorite text-to-speech software, which they have the most experience with. Tortoise TTS is praised for its high-quality output, especially when combined with RVC (Retrieval-based Voice Conversion), which enhances the naturalness and intonation of the synthesized voice.

💡RVC (Retrieval-based Voice Conversion)

A technique used to improve the quality of synthesized voices by converting already-generated speech so that it matches a specific person's voice characteristics. The narrator uses RVC with Tortoise TTS to create a more natural-sounding voice clone, which is a key aspect of their positive review of the software.

💡HiFi-GAN

A generative adversarial network (GAN) architecture used in text-to-speech software to produce high-fidelity audio. The narrator mentions that XTTS uses HiFi-GAN, which contributes to the software's ability to generate high-quality speech.

💡Diffusion

A machine learning technique used in the generation of audio, particularly in the context of text-to-speech. The narrator notes that XTTS uses HiFi-GAN rather than diffusion for the mel spectrogram, which affects the speed and characteristics of the synthesized voice.

💡Audiobook generation

The process of creating an audiobook by converting written text into spoken words. The narrator demonstrates this by using Tortoise TTS with RVC to generate sentences for an audiobook, showcasing the practical application of the text-to-speech software.
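A sentence-by-sentence pipeline like the one described can be sketched roughly as follows. This is only an illustration, not the narrator's actual setup: `tts` and `convert` are hypothetical callables standing in for a Tortoise TTS call and an RVC pass, the sample rate and pause length are assumed values, and the sentence splitter is deliberately naive.

```python
import re
from typing import Callable, List

# Hypothetical audio type: a flat list of mono samples in [-1.0, 1.0].
Audio = List[float]


def split_sentences(text: str) -> List[str]:
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def make_audiobook(
    text: str,
    tts: Callable[[str], Audio],        # assumed stand-in for a TTS model call
    convert: Callable[[Audio], Audio],  # assumed stand-in for a voice-conversion pass
    sample_rate: int = 24000,           # assumed output sample rate
    pause_s: float = 0.5,               # assumed pause between sentences, in seconds
) -> Audio:
    """Synthesize each sentence, convert its voice, and join with short pauses."""
    silence = [0.0] * int(sample_rate * pause_s)
    book: Audio = []
    for sentence in split_sentences(text):
        raw = tts(sentence)         # step 1: text to speech
        book.extend(convert(raw))   # step 2: voice conversion on the raw audio
        book.extend(silence)        # step 3: gap before the next sentence
    return book
```

Working one sentence at a time keeps each generation short, so a bad take can be regenerated without redoing the whole chapter.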

Highlights

Suno's Bark is an open-source text-to-speech software that gained attention for its quality and capabilities.

Suno's Bark has not been updated recently, as the focus shifted to the Chirp feature.

VALL-E X is another text-to-speech software capable of voice cloning, but it's not the best and can be robotic.

The speaker tested VALL-E X with their own voice and found it decent but not perfect.

The speaker used a preset voice from Elden Ring, Melina, for a sample.

A Japanese voice preset was tested, and the speaker found it close to the original voice.

Style TTS 2 was praised for its fantastic voice quality, even without personal training.

The speaker plans to train their own voice with Style TTS 2 and share the results.

Coqui TTS's XTTS was tested and found to be faster and of good quality.

XTTS is considered better than Suno's Bark and VALL-E X, and it uses HiFi-GAN technology.

Tortoise TTS is the speaker's favorite, offering the best quality and experience.

Tortoise TTS, when pipelined with RVC, produces high-quality voice cloning.

The speaker has extensive experience with Tortoise TTS and recommends it for its intonation and quality.

The speaker used Tortoise TTS to create an audiobook with their voice clone.

The speaker also tested a Lex Fridman voice clone with Tortoise TTS, showcasing its versatility.

The speaker concludes that Tortoise TTS is the best among the tested open-source text-to-speech tools.

The speaker encourages viewers to share their thoughts and ask for more details in the comments.

The video is the first of the year 2024 for the speaker's channel, marking a new beginning.