Text-to-speech

A lot of people are interested in talking to their LLMs by voice. To get realistic human speech from a computer, you need text-to-speech (TTS).

I found a library (Coqui TTS) that is super simple to get going:

pip install TTS

tts --model_name tts_models/en/ljspeech/tacotron2-DDC_ph --vocoder_name vocoder_models/en/ljspeech/univnet --text "When I choose to see the good side of things, I'm not being naive. It is strategic and necessary. It's how I've learned to survive through everything.  I know you go through life with your fists held tight.  You see yourself as a fighter. Well, I see myself as one too. This is how I fight." --progress_bar True --use_cuda True

Here’s the output:

This model splits the input into sentences and then generates a wav file for each one. So a good way to reduce latency on LLM output is to split it at sentence boundaries yourself, which is easy: scan for a period followed by a space.
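As a minimal sketch of that splitting step (the function name and the exact boundary heuristic are my own, not from the library), a regex lookbehind keeps the punctuation attached to each sentence:

```python
import re

def split_sentences(text):
    """Split streamed LLM text at sentence boundaries.

    Deliberately simple heuristic: a sentence ends at '.', '!', or '?'
    followed by whitespace. Good enough for feeding TTS one sentence at
    a time, though it will mis-split abbreviations like "Dr. Smith".
    """
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

print(split_sentences("You see yourself as a fighter. Well, I see myself as one too. This is how I fight."))
```

Each returned string can then be handed to the `tts` command (or the Python API) as its own job.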

Then, to reduce latency further, the TTS can start speaking as soon as the LLM finishes generating the first sentence, and the next sentence can be synthesized while the first is being spoken.
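That overlap is a classic producer/consumer pipeline. Here is a sketch using a thread and a queue; `synthesize` and `play` are hypothetical stand-ins for a real TTS call (e.g. shelling out to the `tts` CLI) and an audio player:

```python
import queue
import threading
import time

def synthesize(sentence):
    # Stand-in for a real TTS call; simulate the work and
    # return a fake wav path.
    time.sleep(0.01)
    return f"{abs(hash(sentence))}.wav"

def play(wav_path):
    # Stand-in for an audio player; simulate playback time.
    time.sleep(0.01)

def speak_pipelined(sentences):
    """Synthesize sentence N+1 in the background while sentence N plays."""
    q = queue.Queue(maxsize=2)  # small buffer: stay close to real time

    def producer():
        for s in sentences:
            q.put(synthesize(s))
        q.put(None)  # sentinel: no more audio coming

    threading.Thread(target=producer, daemon=True).start()

    played = []
    while (wav := q.get()) is not None:
        play(wav)
        played.append(wav)
    return played

print(speak_pipelined(["First sentence.", "Second one.", "Third."]))
```

With this structure, total latency is roughly the time to synthesize the first sentence, rather than the whole reply.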

Different Speakers

You can list the available speakers:

tts --model_name tts_models/en/vctk/vits --list_speaker_idxs

And then you can specify a speaker, e.g. p330:

tts --model_name tts_models/en/vctk/vits --text "When I choose to see the good side of things, I'm not being naive. It is strategic and necessary. It's how I've learned to survive through everything.  I know you go through life with your fists held tight.  You see yourself as a fighter. Well, I see myself as one too. This is how I fight." --progress_bar True --use_cuda True --speaker_idx p330

The VITS model takes about twice as long to run, but the output is much better. It is still fast enough that this seems like a good tradeoff for real-time speech.

Here’s the output: