Self-Hosted Qwen3-TTS: Voice Without the Cloud

Deploy a 1.7B-parameter open-source TTS model that clones voices from a ten-second sample. No cloud APIs, no surprise bills, total data privacy.


You find a cool cloud service. You integrate it everywhere. Three months later, you get the bill. Coffee meets keyboard.

That's been my TTS API journey.

Price is only half the problem. The other half? Realizing at 2 AM that your audio data, including that confidential client demo, is going through servers you know nothing about. GDPR loves this.

Qwen3-TTS fixes this.

If you followed my Fish Speech installation guide, consider this its successor. Fish Speech was good. Qwen3-TTS is better. Same self-hosted philosophy, cleaner French accents, simpler setup. I've since retired Fish Speech entirely.

Open-source. 1.7B parameters. Clones voices from a ten-second sample. Ten seconds. I thought it was fake the first time I tried it. It's not.

Runs locally on your GPU. Your data never leaves your machine. This is how you deploy it on Ubuntu with an RTX 3090.

Prerequisites

You'll need:

  • Ubuntu 22.04+ with SSH
  • NVIDIA GPU, 8GB+ VRAM (RTX 3090 ideal, 3080 works)
  • CUDA 12.1 with drivers installed (yes, the painful part)
  • Python 3.10+
  • ~5GB disk space

Check CUDA first. This saves you 30 minutes of debugging:

nvidia-smi

GPU shows up with stats? Good. If not, the NVIDIA forums await.
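If you'd rather script these checks (say, for a provisioning playbook), here's a minimal Python sketch using only the standard library. The 5 GB threshold comes from the prerequisites list above; the nvidia-smi query flags are standard, and the script degrades gracefully if the binary is missing:

```python
import shutil
import subprocess

# Disk: the model needs roughly 5 GB free (per the prerequisites list).
free_gb = shutil.disk_usage("/").free / 1e9
print(f"Free disk: {free_gb:.1f} GB -> {'OK' if free_gb >= 5 else 'too little'}")

# GPU: the same check as running nvidia-smi by hand, but scriptable.
try:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(f"GPU: {out}")
except (FileNotFoundError, subprocess.CalledProcessError):
    print("nvidia-smi not found or failed -- check your driver install.")
```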

Installation

GPU server infrastructure

Install the qwen-tts package:

pip install qwen-tts

Test it:

qwen-tts-demo Qwen/Qwen3-TTS-12Hz-1.7B-Base --no-flash-attn --device cuda:0 --port 8082

Gradio interface on port 8082. Upload a voice sample, type some text. Hearing "yourself" say things you never said is weird. Cool, but weird. You get used to it.

Now let's make it run properly.

Systemd Service

Auto-restart. Clean logs. Sleep peacefully:

sudo tee /etc/systemd/system/qwen3-tts.service << 'EOF'
[Unit]
Description=Qwen3-TTS Voice Cloning Server
After=network.target

[Service]
Type=simple
User=pi
WorkingDirectory=/home/pi
ExecStart=/home/pi/.local/bin/qwen-tts-demo \
    Qwen/Qwen3-TTS-12Hz-1.7B-Base \
    --no-flash-attn \
    --device cuda:0 \
    --port 8082
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

Enable it:

sudo systemctl daemon-reload
sudo systemctl enable qwen3-tts
sudo systemctl start qwen3-tts

Check with systemctl status qwen3-tts, and tail the logs with journalctl -u qwen3-tts -f. Done.

Using the API

Qwen3-TTS exposes a Gradio API. You can use the Python client or call it directly with curl. Three steps:

1. Upload your voice reference file:

curl -X POST "http://localhost:8082/gradio_api/upload" \
  -F "files=@voice_sample.wav"

# Returns: ["/tmp/gradio/.../voice_sample.wav"]

2. Call the voice clone endpoint:

curl -X POST "http://localhost:8082/gradio_api/call/run_voice_clone" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      {"path": "/tmp/gradio/.../voice_sample.wav", "meta": {"_type": "gradio.FileData"}},
      "",
      true,
      "Your text to synthesize goes here.",
      "English"
    ]
  }'

# Returns: {"event_id": "abc123..."}

3. Stream the result and download:

# Poll for completion (SSE stream)
curl -N "http://localhost:8082/gradio_api/call/run_voice_clone/{event_id}"

# Returns: event: complete
# data: [{"path": "/tmp/gradio/.../audio.wav", ...}, "Finished."]

# Download the audio
curl "http://localhost:8082/gradio_api/file=/tmp/gradio/.../audio.wav" -o output.wav

The meta field is required. Without it, Gradio 6 rejects the request.
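If you're scripting this from Python instead of curl, the fiddly part is getting the request body right. A minimal sketch of a helper that builds the JSON payload for step 2 (build_clone_payload is a hypothetical name; the field order and the required meta entry mirror the curl example above, and the empty string and boolean are passed through as in that example):

```python
import json

def build_clone_payload(ref_path: str, text: str, language: str = "English") -> str:
    """Build the JSON body for the run_voice_clone endpoint.

    ref_path is the server-side path returned by the upload step.
    The "meta" field marks the entry as a file for Gradio; without it,
    Gradio 6 rejects the request.
    """
    payload = {
        "data": [
            {"path": ref_path, "meta": {"_type": "gradio.FileData"}},
            "",    # second field, left empty as in the curl example
            True,  # third field, passed as in the curl example
            text,
            language,
        ]
    }
    return json.dumps(payload)

body = build_clone_payload("/tmp/gradio/.../voice_sample.wav",
                           "Your text to synthesize goes here.")
print(body)
```

POST this body to /gradio_api/call/run_voice_clone with a Content-Type: application/json header, then follow steps 3 as above to stream and download the result.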

Quality tip: 10 seconds works. 20-30 seconds with clear voice, no background noise, no music? That's where it gets impressive.
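A quick way to check a reference clip's duration before uploading it, using only the standard library wave module (WAV files only; the 10 and 20-30 second thresholds are from the tip above, and the example writes a short silent file just to demonstrate):

```python
import wave

def reference_stats(path: str) -> tuple[float, int, int]:
    """Return (duration_seconds, sample_rate, channels) of a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return wav.getnframes() / rate, rate, wav.getnchannels()

# Demo: generate a 2-second silent mono 16 kHz WAV and inspect it.
with wave.open("ref_check.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000 * 2)

duration, rate, channels = reference_stats("ref_check.wav")
print(f"{duration:.1f}s @ {rate} Hz, {channels} channel(s)")
if duration < 10:
    print("Too short for good cloning; aim for 20-30 seconds.")
```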

[Audio example: FR 🇫🇷, 4.8 s] [Audio example: EN 🇬🇧, 5.6 s]

What's Next

Qwen3-TTS runs locally. No cloud. No surprise bills. Your data stays yours.

The 1.7B model is a good balance between speed and quality. I use it daily. Qwen has bigger models for broadcast-quality stuff, but your GPU needs to keep up.