Setting Up Qwen3-TTS with GPU Acceleration on Ubuntu Server

A step-by-step guide to setting up Qwen3-TTS with GPU acceleration on Ubuntu Server. Complete with FastAPI wrapper, voice cloning via x-vector embeddings, and all the troubleshooting tips I wish I had when I started.

I spent a few sleepless nights chasing different TTS stacks - Qwen3-TTS, Fish Speech, a handful of other libraries - only to finally get a stable, GPU-accelerated text-to-speech API humming on my AI01 server. Below is the exact recipe that worked.

Hardware Setup

  • Server: AI01 (192.168.0.253)
  • GPU: 2x NVIDIA RTX 3090 (24GB VRAM each)
  • OS: Ubuntu Server with NVIDIA Driver 535.288.01
  • Allocation: GPU 0 for ComfyUI, GPU 1 for Qwen3-TTS

Prerequisites

Before you dive in, make sure you have the following:

  • NVIDIA drivers installed and working (nvidia-smi should list your GPUs)
  • A CUDA 12.x compatible system
  • Python 3.10 or newer, plus pip
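If you want to check these from a script instead of by eye, here is a minimal sketch using only the standard library. The function name check_prerequisites is my own, and it only confirms that the tools are on your PATH; it does not verify the driver or the CUDA version.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a dict of basic checks; it does not verify the CUDA version."""
    return {
        # Compare (major, minor) of the running interpreter with the minimum.
        "python_ok": sys.version_info[:2] >= min_python,
        # Either pip or pip3 on PATH counts as "pip is available".
        "pip_found": shutil.which("pip") is not None
        or shutil.which("pip3") is not None,
        # nvidia-smi ships with the NVIDIA driver.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

If nvidia_smi_found comes back False, fix the driver installation before going any further.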

Installation Script

Create the installation script at /home/pi/qwen3tts.sh:

#!/usr/bin/env bash
set -euo pipefail

### === Config ===
INSTALL_DIR="${INSTALL_DIR:-/opt/qwen3-tts}"
VENV_DIR="${VENV_DIR:-$INSTALL_DIR/venv}"
PYTHON_BIN="${PYTHON_BIN:-python3}"
PORT="${PORT:-8000}"

### === System deps ===
sudo apt-get update -y
sudo apt-get install -y python3 python3-venv python3-pip ffmpeg git

### === Create install dir ===
sudo mkdir -p "$INSTALL_DIR"
sudo chown -R "$(id -u):$(id -g)" "$INSTALL_DIR"

### === Create venv ===
"$PYTHON_BIN" -m venv "$VENV_DIR"
source "$VENV_DIR/bin/activate"

pip install --upgrade pip wheel setuptools

### === Install Python deps ===
pip install --upgrade \
  huggingface_hub \
  fastapi uvicorn \
  transformers accelerate \
  soundfile numpy

### === Install Torch with CUDA support ===
pip install --upgrade torch --index-url https://download.pytorch.org/whl/cu121

### === Install Qwen3-TTS ===
pip install qwen-tts

echo "Installation complete!"
echo "Activate venv: source $VENV_DIR/bin/activate"

Run the script:

chmod +x /home/pi/qwen3tts.sh
/home/pi/qwen3tts.sh
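Once the script finishes, it is worth confirming that everything actually landed in the venv. The sketch below uses importlib.metadata from the standard library; the package list mirrors the pip commands in the script, and the helper name installed_versions is hypothetical. Run it with /opt/qwen3-tts/venv/bin/python so it inspects the right environment.

```python
from importlib import metadata

# Distribution names taken from the pip install lines in the script.
EXPECTED = (
    "torch", "transformers", "accelerate", "fastapi", "uvicorn",
    "soundfile", "numpy", "huggingface_hub", "qwen-tts",
)


def installed_versions(packages=EXPECTED):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```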

API Server

Create the FastAPI server at /opt/qwen3-tts/server.py:

import os

# Expose only physical GPU 1 to this process. Set this before importing torch
# so CUDA never sees GPU 0; inside the process the visible GPU is "cuda:0".
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import uuid

import soundfile as sf
import torch
from fastapi import FastAPI, HTTPException
from fastapi.responses import FileResponse
from pydantic import BaseModel
from qwen_tts import Qwen3TTSModel

app = FastAPI(title="Qwen3-TTS API")

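# Load the model once at startup. Only physical GPU 1 is visible to this
# process (see CUDA_VISIBLE_DEVICES), so it is addressed as "cuda:0" here.
print("Loading Qwen3-TTS model on GPU 1...")
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    device_map="cuda:0",
    torch_dtype=torch.float16,
)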
print(f"Model loaded on device: {model.device}")

# Language mapping for convenience
LANG_MAP = {
    "fr": "french",
    "en": "english",
    "de": "german",
    "es": "spanish",
    "it": "italian",
    "pt": "portuguese",
    "ru": "russian",
    "zh": "chinese",
    "ja": "japanese",
    "ko": "korean"
}

class TTSRequest(BaseModel):
    text: str
    voice: str = "fr"

@app.get("/health")
def health():
    return {
        "status": "ok",
        "gpu": torch.cuda.is_available(),
        "device": str(model.device)
    }

@app.post("/generate")
async def generate(req: TTSRequest):
    try:
        output_path = f"/tmp/{uuid.uuid4()}.wav"

        # Map language code to full name
        lang = LANG_MAP.get(req.voice, req.voice)

        # Use voice reference with x_vector_only_mode (no ref_text needed)
        voice_ref = f"/home/pi/voice-refs/samantha_voice_{req.voice}.wav"
        if not os.path.exists(voice_ref):
            voice_ref = "/home/pi/voice-refs/samantha_voice.wav"

        audio_list, sr = model.generate_voice_clone(
            text=req.text,
            language=lang,
            ref_audio=voice_ref,
            x_vector_only_mode=True
        )

        audio = audio_list[0]
        sf.write(output_path, audio, sr)
        return FileResponse(output_path, media_type="audio/wav")

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Voice References

Drop your voice reference files into /home/pi/voice-refs/:

mkdir -p /home/pi/voice-refs
# Add your reference audio files (5-30 seconds of clear speech):
#   samantha_voice_fr.wav   used when "voice" is "fr"
#   samantha_voice.wav      fallback for any language without its own file (e.g. English)

The x_vector_only_mode=True flag extracts just the voice embedding, so you don't need to transcribe the reference audio.
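To catch clips that fall outside the 5-30 second guideline before the server ever loads them, a quick duration check helps. This is a minimal sketch with the standard-library wave module, which assumes the reference is a plain PCM WAV file; the function names are my own.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, from its frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path, min_s=5.0, max_s=30.0):
    """True when the clip length falls inside the 5-30 second guideline."""
    return min_s <= wav_duration_seconds(path) <= max_s


if __name__ == "__main__":
    import sys

    for ref in sys.argv[1:]:
        print(ref, f"{wav_duration_seconds(ref):.1f}s", is_usable_reference(ref))
```

Length is the only thing this checks. It says nothing about whether the speech is clear, so still listen to the clip.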

Start the Server

cd /opt/qwen3-tts
./venv/bin/uvicorn server:app --host 0.0.0.0 --port 8000

Or run it in the background:

nohup ./venv/bin/uvicorn server:app --host 0.0.0.0 --port 8000 > server.log 2>&1 &
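Because the model loads when server.py is imported, the API does not answer immediately after launch. If another script depends on it, polling /health until it responds is a simple way to wait. The sketch below uses only the standard library; wait_for_health and its timeout values are my own choices, not part of the server.

```python
import json
import time
import urllib.request


def wait_for_health(url, timeout_s=120.0, interval_s=2.0):
    """Poll the /health endpoint until it reports status "ok" or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                payload = json.load(resp)
            if isinstance(payload, dict) and payload.get("status") == "ok":
                return True
        except (OSError, ValueError):
            # OSError covers refused connections and HTTP errors;
            # ValueError covers a body that is not valid JSON.
            pass
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    ready = wait_for_health("http://192.168.0.253:8000/health")
    print("server ready" if ready else "server did not come up in time")
```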

API Usage

Health check:

curl http://192.168.0.253:8000/health

Generate speech (French):

curl -X POST http://192.168.0.253:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Bonjour, ceci est un test.", "voice": "fr"}' \
  --output output.wav

Generate speech (English):

curl -X POST http://192.168.0.253:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, this is a test.", "voice": "en"}' \
  --output output.wav
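The same calls work from Python without extra dependencies. This is a sketch of a small client built on urllib; the function names build_tts_request and synthesize are hypothetical, and the payload matches the TTSRequest model in server.py (text plus voice).

```python
import json
import urllib.request


def build_tts_request(base_url, text, voice="fr"):
    """Build the POST request that the /generate endpoint expects."""
    payload = json.dumps({"text": text, "voice": voice}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(base_url, text, voice="fr", out_path="output.wav", timeout_s=120):
    """Send the request and save the returned WAV bytes to out_path."""
    request = build_tts_request(base_url, text, voice)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return out_path
```

For example, synthesize("http://192.168.0.253:8000", "Hello, this is a test.", voice="en") writes output.wav in the current directory. A 500 response from the server surfaces as urllib.error.HTTPError.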

Verify GPU Usage

Run nvidia-smi while generating audio to confirm GPU utilization. GPU 1 should show ~5GB VRAM usage and active compute during inference.
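To capture the numbers instead of eyeballing them, you can ask nvidia-smi for CSV output and parse it. The sketch below relies on the --query-gpu and --format=csv,noheader,nounits options; the helper names are my own, and the parser assumes every field comes back as a plain value (it will fail on entries such as [N/A]).

```python
import subprocess

QUERY_FIELDS = "index,name,memory.used,utilization.gpu"


def parse_gpu_csv(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, mem_used, util = [part.strip() for part in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "memory_used_mib": int(mem_used),   # reported in MiB with nounits
            "utilization_pct": int(util),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

Call query_gpus() while a /generate request is in flight and compare the entry for GPU 1 against the ~5GB figure above.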

Troubleshooting

Language Error

Error: Unsupported languages: ['fr']

Solution: Use full language names: french, english, german, etc. The LANG_MAP in server.py handles this automatically.
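Note that LANG_MAP.get(req.voice, req.voice) passes unknown values straight through, so a typo still reaches the model and comes back as a 500. If you would rather reject bad input up front, here is a sketch of a stricter lookup. normalize_language is a hypothetical helper, and the supported set is copied from the Supported Languages list in this guide.

```python
SUPPORTED_LANGUAGES = {
    "auto", "chinese", "english", "french", "german", "italian",
    "japanese", "korean", "portuguese", "russian", "spanish",
}

LANG_MAP = {
    "fr": "french", "en": "english", "de": "german", "es": "spanish",
    "it": "italian", "pt": "portuguese", "ru": "russian",
    "zh": "chinese", "ja": "japanese", "ko": "korean",
}


def normalize_language(value):
    """Accept a two-letter code or a full name; raise ValueError otherwise."""
    key = value.strip().lower()
    name = LANG_MAP.get(key, key)
    if name not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {value!r}; "
            f"expected one of {sorted(SUPPORTED_LANGUAGES)}"
        )
    return name
```

In the endpoint you could catch the ValueError and return a 400 instead of letting it turn into a generic 500.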

ref_text Required Error

Error: ref_text is required when x_vector_only_mode=False

Solution: Set x_vector_only_mode=True in the generate_voice_clone call.

GPU Not Used

If the model runs on CPU, verify:

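  1. CUDA_VISIBLE_DEVICES=1 is set in server.py before anything initializes CUDA (safest: before import torch)
  2. device_map="cuda:0" is passed to from_pretrained(). With only GPU 1 visible, that GPU is cuda:0 inside the process
  3. PyTorch was installed with CUDA support: pip install torch --index-url https://download.pytorch.org/whl/cu121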
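The checks above can be run in one go from the venv. This is a minimal sketch; describe_cuda is my own helper name, and it only calls torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count() and torch.cuda.get_device_name(). It returns early if torch is not importable at all.

```python
import os


def describe_cuda():
    """Summarize what PyTorch can see; returns a plain dict for printing."""
    info = {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")}
    try:
        import torch
    except ImportError:
        info["torch_installed"] = False
        return info
    info["torch_installed"] = True
    info["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["visible_gpus"] = (
        [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]
        if info["cuda_available"] else []
    )
    return info


if __name__ == "__main__":
    for key, value in describe_cuda().items():
        print(f"{key}: {value}")
```

Run it with the same environment as the server, for example CUDA_VISIBLE_DEVICES=1 /opt/qwen3-tts/venv/bin/python followed by the script path. If torch_cuda_build is None, the CPU-only wheel is installed and step 3 is the fix. If visible_gpus lists exactly one card, the GPU pinning is working.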

Supported Languages

The model supports: auto, chinese, english, french, german, italian, japanese, korean, portuguese, russian, spanish

Conclusion

This setup delivers a self-hosted TTS API running on a dedicated GPU. On my hardware, the model uses around 5GB of VRAM and responds in under 10 seconds for typical phrases. With voice cloning via x-vector embeddings, you can build consistent voice personas without transcribing the reference audio.

Update: I have since moved from Qwen3-TTS to Chatterbox Multilingual, which handles French much better out of the box and requires far less tuning. That said, this guide remains useful if you specifically want to try Qwen3-TTS, or if your use case is English-only, where it performs well.