Auto-Tagging 28,000 Metal Tracks with MusicBrainz and Python

How I used Python, MusicBrainz, AcoustID fingerprinting, and fuzzy matching to automatically tag 28,000 metal tracks, fetch album artwork, and embed it all — starting from a messy NFS share on a Sunday morning.


I host my own music server on Navidrome, carrying roughly 28,000 tracks spread over 2,987 albums and 1,078 artists. Almost all of them are metal—from headliners to the deep-cut underground. The files sit on a TrueNAS NFS share under a tidy Artist/Album/Track layout, but the actual ID3/FLAC tags were a nightmare: titles missing, track numbers scrambled, and no release year at all.

Navidrome doesn’t look at folder names; it consumes the tags themselves. As a result, my library’s display was a disaster. I decided one Sunday morning to straighten it all out with Python.

The homelab: TrueNAS NFS serving 28,000 tracks to Navidrome on Kubernetes

The Plan

Artist/Album/
 │
 ├─ Search MusicBrainz (artist + album)
 │        ↓
 │   Confidence ≥ 90%?
 │   ┌──Yes──┐      ┌──No──┐
 │   Tags MB        Tags from
 │   (exact         filename
 │    title,        (fallback)
 │    year,
 │    label)
 │        ↓
 ├─ For each track: match by track# + fuzzy title
 │
 └─ Still unmatched? AcoustID fingerprint → MusicBrainz

I ended up writing four scripts that run as a pipeline:

  1. tag_music_mb.py — tags files using MusicBrainz text search with fuzzy matching
  2. acoustid_tagger.py — fingerprints unmatched files and identifies them via AcoustID
  3. fetch_covers.py — downloads album artwork from Cover Art Archive + Deezer
  4. embed_covers.py — embeds the artwork directly into each audio file

The idea is straightforward: for every Artist/Album folder I hit MusicBrainz. If the match confidence is 90% or higher, I grab the official metadata. If not, I fall back to whatever I can extract from the filenames.

import musicbrainzngs as mb
from rapidfuzz import fuzz

mb.set_useragent("NavidromeAutoTagger", "1.0", "https://example.com")
mb.set_rate_limit(limit_or_interval=1.0)  # 1 req/s

def search_release(artist: str, album: str):
    query = f'artist:"{artist}" AND release:"{album}"'
    result = mb.search_releases(query=query, limit=5)
    releases = result.get("release-list", [])

    best, best_score = None, 0.0
    for rel in releases:
        artist_score = fuzz.token_sort_ratio(
            artist.lower(),
            rel.get("artist-credit-phrase", "").lower()
        )
        album_score = fuzz.token_sort_ratio(
            album.lower(), rel.get("title", "").lower()
        )
        sc = (artist_score + album_score) / 2
        if sc > best_score:
            best, best_score = rel, sc

    return best, best_score

RapidFuzz’s token_sort_ratio does the heavy lifting here: it splits each string into tokens, sorts them, and compares the sorted results, so “The Black Dahlia Murder” and “Black Dahlia Murder, The” score as near-identical despite the different word order. The 90% cutoff weeds out spurious matches while still catching minor variations.
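
To make the token-sort idea concrete, here is a minimal stdlib-only sketch. It is not RapidFuzz's implementation: difflib.SequenceMatcher stands in for its scorer so the block runs without third-party packages, and token_sort_similarity is a name made up for this illustration. The exact scores will differ from what rapidfuzz reports, but the sorting step is the same idea.

```python
from difflib import SequenceMatcher


def token_sort_similarity(a: str, b: str) -> float:
    """Rough stand-in for token_sort_ratio: sort the words, then compare."""
    def normalize(s: str) -> str:
        # Lowercase, split on whitespace, sort the tokens, re-join with spaces.
        return " ".join(sorted(s.lower().split()))

    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Same words, different order: identical once the tokens are sorted.
same_words = token_sort_similarity("The Black Dahlia Murder", "Black Dahlia Murder The")
# A stray comma costs a little, but the pair still clears a 90 cutoff.
with_comma = token_sort_similarity("The Black Dahlia Murder", "Black Dahlia Murder, The")
# Two unrelated names stay far below the cutoff.
unrelated = token_sort_similarity("Darkthrone", "Burzum")

print(same_words)        # 100.0
print(with_comma >= 90)  # True
print(unrelated >= 90)   # False
```

Sorting first is what makes "Artist, The" style folder names harmless; a plain character-by-character ratio would punish the reordering much harder.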

Track Matching

When a release is identified, I pull its full tracklist from MusicBrainz and then try to map each local file to a track entry:

  • By track number — if the filename starts with 01 - Title.flac, match to position 1
  • By fuzzy title — if numbers don’t match, fall back to title similarity

Filename Fallback

If MusicBrainz turns up empty—typical for obscure releases with only a couple hundred listeners—I extract whatever clues I can from the filenames and surrounding metadata:

import re
from pathlib import Path

def parse_track_filename(filename: str):
    """'01 - Song Title.mp3' -> ('1', 'Song Title')"""
    stem = Path(filename).stem
    m = re.match(r'^(\d{1,3})\s*[-._\s]\s*(.+)$', stem)
    if m:
        return m.group(1).lstrip("0") or "1", m.group(2).strip()
    return None, stem.strip()
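
To see what that regex actually does, here is the same function run against a few filenames. The function body is repeated only so the block runs on its own, and the sample filenames are made up for illustration.

```python
import re
from pathlib import Path


def parse_track_filename(filename: str):
    """Same function as in the article, repeated so this block is self-contained."""
    stem = Path(filename).stem
    m = re.match(r'^(\d{1,3})\s*[-._\s]\s*(.+)$', stem)
    if m:
        return m.group(1).lstrip("0") or "1", m.group(2).strip()
    return None, stem.strip()


samples = [
    "01 - Song Title.mp3",      # classic "NN - Title" layout
    "07.Blackwater Park.flac",  # dot as the separator
    "Intro.mp3",                # no leading number at all
    "10 Years.mp3",             # a title that happens to start with a number
]
parsed = {name: parse_track_filename(name) for name in samples}
for name, result in parsed.items():
    print(f"{name!r} -> {result}")
```

The last sample is the one to watch: the regex cannot tell a numeric title from a track number, so "10 Years" comes back as track 10 with the title "Years". In the fallback path that kind of file ends up with a bogus track number.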

Step 2: AcoustID Fingerprinting

For the ~12,000 files that MusicBrainz text search couldn’t match, I brought in the heavy artillery: AcoustID. Instead of searching by name, it analyzes the actual audio and generates a fingerprint using Chromaprint. That fingerprint gets looked up against a database of 40+ million tracks.

This catches albums that MusicBrainz missed due to spelling differences, regional editions, or simply obscure names—because the audio itself doesn’t lie.

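import acoustid
import requests

ACOUSTID_KEY = "YOUR_ACOUSTID_API_KEY"  # placeholder: your application key from acoustid.org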

def lookup(filepath):
    duration, fp = acoustid.fingerprint_file(str(filepath))
    params = {
        "client": ACOUSTID_KEY,
        "duration": str(int(duration)),
        "fingerprint": fp,
        "meta": "recordings releases"
    }
    r = requests.post(
        "https://api.acoustid.org/v2/lookup",
        data=params, timeout=15
    )
    results = r.json().get("results", [])
    if results and results[0].get("score", 0) >= 0.7:
        rec = results[0]["recordings"][0]
        return {
            "title": rec.get("title"),
            "artist": rec["artists"][0]["name"],
            "album": rec["releases"][0]["title"],
            # ... (remaining fields elided)
        }
    return None  # no result, or top score below 0.7

Even for underground black metal demos from 1996, AcoustID pulled matches at 97-100% confidence. The metal community on MusicBrainz is nothing if not thorough.

Tagging Results

28,000 tracks across nearly 3,000 albums — from Abbath to Zyklon
Method                       Files     %
MusicBrainz (text search)    16,033    57%
AcoustID (fingerprint)          253     1%
Local fallback (filename)    11,625    41%
Errors                          384     1%

A 58% verified match rate across both methods is solid for a library this underground. The remaining 41% still got tagged from filenames—not perfect metadata, but way better than blank tags.
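
For anyone who wants to check the arithmetic, the percentages follow directly from the file counts in the table. This is just a small sketch that recomputes them; the variable names are mine.

```python
# File counts taken from the results table above.
counts = {
    "MusicBrainz (text search)": 16_033,
    "AcoustID (fingerprint)": 253,
    "Local fallback (filename)": 11_625,
    "Errors": 384,
}

total = sum(counts.values())
# Share of each method, rounded to a whole percent as in the table.
shares = {name: round(100 * n / total) for name, n in counts.items()}

# "Verified" means matched against MusicBrainz data, by text search or fingerprint.
verified = counts["MusicBrainz (text search)"] + counts["AcoustID (fingerprint)"]
verified_share = round(100 * verified / total)

print(total)           # 28295
print(shares)
print(verified_share)  # 58
```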

Step 3: Album Covers

With the tags in the right place, the next missing piece was album art. Navidrome automatically picks up any cover.jpg that lives in an album’s folder.

Cover Art Archive

The Cover Art Archive lives alongside MusicBrainz, offering free, community-curated album covers. Once a release is matched, we already have its release_id:

import requests

def fetch_cover_caa(release_id: str):
    url = f"https://coverartarchive.org/release/{release_id}/front-500"
    r = requests.get(url, timeout=15, allow_redirects=True)
    if r.status_code == 200 and len(r.content) > 1000:
        return r.content
    return None

Deezer Fallback

For releases that escape MusicBrainz or lack CAA art, the Deezer API steps in as a reliable fallback—no API key needed:

def fetch_cover_deezer(artist: str, album: str):
    r = requests.get(
        "https://api.deezer.com/search/album",
        params={"q": f'artist:"{artist}" album:"{album}"', "limit": 5},
        timeout=10,
    )
    data = r.json().get("data", [])
    # Find best fuzzy match, download cover_big

This pulls in a surprising number of albums that MusicBrainz misses—because Deezer’s catalog is built around commercial releases, its coverage pattern is a little different.
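
The "find best fuzzy match" comment in fetch_cover_deezer hides the interesting part, so here is one way that step could look. Treat it as a sketch under assumptions rather than the code from the script: pick_cover_url is a hypothetical helper, the 90 cutoff simply mirrors the tagging threshold, difflib replaces rapidfuzz so the block has no dependencies, and the title / artist.name / cover_big fields are what I expect Deezer's album search results to contain.

```python
from difflib import SequenceMatcher


def _sim(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale (stdlib stand-in for rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_cover_url(artist: str, album: str, candidates: list, threshold: float = 90.0):
    """Return the cover_big URL of the best-scoring candidate, or None if none is close enough."""
    best_url, best_score = None, 0.0
    for item in candidates:
        # Average the artist and album similarity, like search_release does.
        score = (
            _sim(artist, item.get("artist", {}).get("name", ""))
            + _sim(album, item.get("title", ""))
        ) / 2
        if score > best_score and item.get("cover_big"):
            best_url, best_score = item["cover_big"], score
    return best_url if best_score >= threshold else None


# Hand-written stand-ins for search results; no network involved.
fake_results = [
    {"title": "Reign in Blood", "artist": {"name": "Slayer"},
     "cover_big": "https://example.invalid/reign.jpg"},
    {"title": "South of Heaven", "artist": {"name": "Slayer"},
     "cover_big": "https://example.invalid/south.jpg"},
]
hit = pick_cover_url("Slayer", "Reign In Blood", fake_results)
miss = pick_cover_url("Slayer", "Show No Mercy", fake_results)
print(hit)   # https://example.invalid/reign.jpg
print(miss)  # None
```

Returning None for a weak match matters more for artwork than for tags: a wrong cover is more visible in the player than a missing one.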

Source               Covers
Cover Art Archive    611
Deezer (fallback)    685
Already present      558
Not found            861 albums (29%)
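
The four rows above correspond to a simple fallback chain: keep an existing cover.jpg, otherwise try each source in order and stop at the first hit. A minimal sketch of that control flow is below. ensure_cover and the fake fetchers are hypothetical names for illustration; in the real pipeline the callables would wrap fetch_cover_caa and fetch_cover_deezer.

```python
import tempfile
from pathlib import Path


def ensure_cover(album_dir: Path, fetchers) -> str:
    """Write cover.jpg using the first fetcher that returns image bytes.

    Returns "present" if a cover already exists, the name of the source
    that supplied the image, or "not found" if every source came up empty.
    """
    target = album_dir / "cover.jpg"
    if target.exists():
        return "present"
    for name, fetch in fetchers:
        data = fetch()
        if data:
            target.write_bytes(data)
            return name
    return "not found"


# Demo with stand-in fetchers (no network): the first source misses, the second hits.
with tempfile.TemporaryDirectory() as tmp:
    album = Path(tmp)
    fake_sources = [
        ("caa", lambda: None),
        ("deezer", lambda: b"\xff\xd8\xff fake jpeg bytes"),
    ]
    first_run = ensure_cover(album, fake_sources)
    second_run = ensure_cover(album, fake_sources)
    print(first_run, second_run)  # deezer present
```

Checking for an existing file first is also what makes a rerun cheap: albums that already have artwork never touch the network.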

Step 4: Embedding Artwork

After getting tags and cover files sorted, the last step was embedding the artwork directly into the audio files. Some players—especially mobile ones—don’t look for cover.jpg in the folder. They want the art inside the file itself.

from mutagen.mp3 import MP3
from mutagen.id3 import APIC
from mutagen.flac import FLAC, Picture

def embed_mp3(filepath, cover_data, mime):
    audio = MP3(filepath)
    if audio.tags is None:
        audio.add_tags()
    audio.tags.delall("APIC")
    audio.tags.add(APIC(
        encoding=3, mime=mime, type=3,
        desc="Front cover", data=cover_data
    ))
    audio.save()

def embed_flac(filepath, cover_data, mime):
    audio = FLAC(filepath)
    audio.clear_pictures()
    pic = Picture()
    pic.type = 3  # Front cover
    pic.mime = mime
    pic.desc = "Front cover"
    pic.data = cover_data
    audio.add_picture(pic)
    audio.save()
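
Both functions take a mime argument, and the snippets above don't show where it comes from. One option, sketched here as an assumption rather than what embed_covers.py necessarily does, is to sniff it from the first bytes of the image. sniff_image_mime is a hypothetical helper; the JPEG and PNG signatures themselves are standard.

```python
def sniff_image_mime(data: bytes):
    """Guess the MIME type from the file signature; None if it is not JPEG or PNG."""
    if data.startswith(b"\xff\xd8\xff"):       # JPEG start-of-image marker
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):  # 8-byte PNG signature
        return "image/png"
    return None


jpeg_guess = sniff_image_mime(b"\xff\xd8\xff\xe0" + b"\x00" * 16)
png_guess = sniff_image_mime(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
html_guess = sniff_image_mime(b"<html>404 Not Found</html>")
print(jpeg_guess, png_guess, html_guess)  # image/jpeg image/png None
```

A None result is a useful guard as well: it catches the case where a cover download returned an error page instead of an image, before that page gets embedded into thousands of files.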

Final State

After running the full pipeline, here’s what the library looks like:

Metric                    Value
Artists                   1,078
Albums                    2,987
Audio files               28,295 (25,753 MP3 + 2,542 FLAC)
Total size                401 GB
Title tag coverage        99%
Artist tag coverage       99%
Album tag coverage        99%
Year tag coverage         99%
Embedded artwork          91%
Albums with cover file    2,167 (73%)

Top artists by album count: Shining (28), Motörhead (27), Watain (26), Overkill (26), Napalm Death (26), Satyricon (25), Kreator (25), Testament (24), Kataklysm (24), Heaven Shall Burn (23).

The Full Pipeline

All four scripts live in a GitLab repo. Run them in order—the whole pipeline takes about three hours for a 28,000-track library and requires zero manual intervention.

pip install mutagen musicbrainzngs rapidfuzz requests pyacoustid
apt install libchromaprint-tools  # for fpcalc

# Step 1: Tag via MusicBrainz text search
python3 tag_music_mb.py /mnt/music

# Step 2: Fingerprint unmatched files via AcoustID
python3 acoustid_tagger.py /mnt/music --log tag_music_mb.log

# Step 3: Download missing artwork
python3 fetch_covers.py /mnt/music

# Step 4: Embed artwork into audio files
python3 embed_covers.py /mnt/music

For a big library, run them inside screen or tmux—they’re slow but completely safe, no risk of corrupting your data.

The Full Tagger Script

Here’s the full tag_music_mb.py script—drop it on any machine with Python 3.10+ and its dependencies installed:

#!/usr/bin/env python3
"""
tag_music_mb.py: MP3/FLAC tagger for Navidrome, backed by MusicBrainz
Expected layout: /music_root/Artist/Album/track.mp3|flac

Dependencies:
    pip install mutagen musicbrainzngs rapidfuzz
"""

import os
import re
import sys
import time
import logging
import argparse
from pathlib import Path

# ── External dependencies ─────────────────────────────────────────────────────
try:
    import musicbrainzngs as mb
except ImportError:
    print("❌  pip install musicbrainzngs")
    sys.exit(1)

try:
    from rapidfuzz import fuzz
except ImportError:
    print("❌  pip install rapidfuzz")
    sys.exit(1)

try:
    from mutagen.mp3 import MP3
    from mutagen.id3 import (
        ID3, ID3NoHeaderError,
        TIT2, TPE1, TALB, TRCK, TDRC, TPE2, TPOS, TCON, TPUB
    )
    from mutagen.flac import FLAC
except ImportError:
    print("❌  pip install mutagen")
    sys.exit(1)

# ── MusicBrainz user agent (required) ────────────────────────────────────────
mb.set_useragent("NavidromeAutoTagger", "1.0", "https://github.com/local/tagger")
mb.set_rate_limit(limit_or_interval=1.0)   # 1 req/s max (MusicBrainz rule)

# ── Logger ────────────────────────────────────────────────────────────────────
logging.basicConfig(
    level=logging.INFO,
    format="%(levelname)s  %(message)s",
    handlers=[
        logging.StreamHandler(),
        logging.FileHandler("tag_music_mb.log", encoding="utf-8"),
    ],
)
log = logging.getLogger(__name__)

# ── Confidence threshold ──────────────────────────────────────────────────────
CONFIDENCE_THRESHOLD = 90   # %


# ═══════════════════════════════════════════════════════════════════════════════
# Local parsing (fallback)
# ═══════════════════════════════════════════════════════════════════════════════

def parse_track_filename(filename: str):
    """'01 - Track Title.mp3'  →  ('1', 'Track Title')"""
    stem = Path(filename).stem
    m = re.match(r'^(\d{1,3})\s*[-._\s]\s*(.+)$', stem)
    if m:
        return m.group(1).lstrip("0") or "1", m.group(2).strip()
    return None, stem.strip()


def parse_year(text: str):
    m = re.search(r'\b(19|20)\d{2}\b', text)
    return m.group(0) if m else None


def parse_disc(folder_name: str):
    """'CD1' → '1', 'Disc 2' → '2', 'Disk1' → '1'  (None for any other folder name)"""
    m = re.match(r'^(?:CD|Disc|Disk)\s*(\d+)$', folder_name, re.IGNORECASE)
    return m.group(1) if m else None


# ═══════════════════════════════════════════════════════════════════════════════
# MusicBrainz helpers
# ═══════════════════════════════════════════════════════════════════════════════

def _score(a: str, b: str) -> float:
    """Case-insensitive token_sort similarity between two strings (0-100)."""
    return fuzz.token_sort_ratio(a.lower(), b.lower())


def search_release(artist: str, album: str):
    """
    Search MusicBrainz for a release.
    Returns (release_dict, confidence_float) or (None, 0).
    """
    query = f'artist:"{artist}" AND release:"{album}"'
    try:
        result = mb.search_releases(query=query, limit=5)
    except mb.WebServiceError as e:   # base class of NetworkError and ResponseError
        log.warning(f"MusicBrainz network error: {e}")
        return None, 0

    releases = result.get("release-list", [])
    if not releases:
        return None, 0

    best, best_score = None, 0.0
    for rel in releases:
        artist_credit = rel.get("artist-credit-phrase", "")
        rel_title     = rel.get("title", "")
        sc = (_score(artist, artist_credit) + _score(album, rel_title)) / 2
        if sc > best_score:
            best, best_score = rel, sc

    return best, best_score


def fetch_tracks(release_id: str):
    """
    Return the track list of a MusicBrainz release.
    [{'disc': '1', 'position': '1', 'title': '…'}, …]
    """
    try:
        data = mb.get_release_by_id(release_id, includes=["recordings", "media"])
    except mb.WebServiceError as e:   # base class of NetworkError and ResponseError
        log.warning(f"MusicBrainz fetch error: {e}")
        return []

    tracks = []
    for media in data["release"].get("medium-list", []):
        disc_pos = media.get("position", "1")
        for t in media.get("track-list", []):
            tracks.append({
                "disc":     str(disc_pos),
                "position": t.get("position", ""),
                "title":    t.get("recording", {}).get("title", ""),
            })
    return tracks


def match_track(local_title: str, local_num, mb_tracks: list):
    """
    Find the matching MusicBrainz track.
    Priority: track number → fuzzy title.
    Returns (mb_track_dict, confidence) or (None, 0).
    """
    # 1. Match by track number
    if local_num:
        for t in mb_tracks:
            if t["position"] == str(local_num):
                sc = _score(local_title, t["title"])
                return t, max(sc, 70)   # the number matched, so floor the score at 70

    # 2. Fuzzy title match only
    best, best_sc = None, 0.0
    for t in mb_tracks:
        sc = _score(local_title, t["title"])
        if sc > best_sc:
            best, best_sc = t, sc

    return best, best_sc


# ═══════════════════════════════════════════════════════════════════════════════
# mutagen taggers
# ═══════════════════════════════════════════════════════════════════════════════

def apply_mp3(filepath: Path, tags: dict):
    try:
        audio = ID3(filepath)
    except ID3NoHeaderError:
        audio = ID3()

    def s(v): return v or ""

    audio.add(TIT2(encoding=3, text=s(tags.get("title"))))
    audio.add(TPE1(encoding=3, text=s(tags.get("artist"))))
    audio.add(TPE2(encoding=3, text=s(tags.get("albumartist"))))
    audio.add(TALB(encoding=3, text=s(tags.get("album"))))
    if tags.get("tracknumber"):
        audio.add(TRCK(encoding=3, text=str(tags["tracknumber"])))
    if tags.get("year"):
        audio.add(TDRC(encoding=3, text=str(tags["year"])))
    if tags.get("discnumber"):
        audio.add(TPOS(encoding=3, text=str(tags["discnumber"])))
    if tags.get("genre"):
        audio.add(TCON(encoding=3, text=s(tags["genre"])))
    if tags.get("label"):
        audio.add(TPUB(encoding=3, text=s(tags["label"])))
    audio.save(filepath, v2_version=3)


def apply_flac(filepath: Path, tags: dict):
    audio = FLAC(filepath)
    audio["title"]       = tags.get("title", "")
    audio["artist"]      = tags.get("artist", "")
    audio["albumartist"] = tags.get("albumartist", "")
    audio["album"]       = tags.get("album", "")
    if tags.get("tracknumber"):
        audio["tracknumber"] = str(tags["tracknumber"])
    if tags.get("year"):
        audio["date"] = str(tags["year"])
    if tags.get("discnumber"):
        audio["discnumber"] = str(tags["discnumber"])
    if tags.get("genre"):
        audio["genre"] = tags["genre"]
    if tags.get("label"):
        audio["organization"] = tags["label"]
    audio.save()


TAGGERS = {".mp3": apply_mp3, ".flac": apply_flac}


# ═══════════════════════════════════════════════════════════════════════════════
# Album cache (avoids repeated MusicBrainz calls)
# ═══════════════════════════════════════════════════════════════════════════════
_album_cache: dict = {}   # (artist, album) → (release, mb_tracks, confidence, extra)


def resolve_album(artist: str, album: str):
    key = (artist.lower(), album.lower())
    if key in _album_cache:
        return _album_cache[key]

    release, confidence = search_release(artist, album)
    mb_tracks = []
    extra = {}

    if release and confidence >= CONFIDENCE_THRESHOLD:
        rid = release["id"]
        mb_tracks = fetch_tracks(rid)
        # Additional metadata
        extra["year"]  = (release.get("date") or "")[:4] or parse_year(album)
        extra["label"] = (release.get("label-info-list") or [{}])[0] \
                              .get("label", {}).get("name", "")
        # Genre: first MusicBrainz tag if available (often missing without an include); left empty here
        extra["genre"] = ""
        log.info(
            f"  ✔ MusicBrainz [{confidence:.0f}%]  "
            f"{release.get('artist-credit-phrase')} — {release.get('title')}  "
            f"({extra.get('year','')})"
        )
    else:
        if release:
            log.warning(
                f"  ⚠ MusicBrainz confidence too low [{confidence:.0f}%] "
                f"for '{artist} / {album}' → local tags"
            )
        else:
            log.warning(f"  ⚠ No MusicBrainz result for '{artist} / {album}' → local tags")
        release = None

    _album_cache[key] = (release, mb_tracks, confidence, extra)
    time.sleep(0.2)   # extra politeness on top of the rate limit
    return _album_cache[key]


# ═══════════════════════════════════════════════════════════════════════════════
# Library walk
# ═══════════════════════════════════════════════════════════════════════════════

def process_library(root: Path, dry_run: bool):
    stats = {"mb": 0, "local": 0, "error": 0, "skip": 0}

    for artist_dir in sorted(root.iterdir()):
        if not artist_dir.is_dir():
            continue
        artist_name = artist_dir.name

        for album_dir in sorted(artist_dir.iterdir()):
            if not album_dir.is_dir():
                continue

            album_raw  = album_dir.name
            local_year = parse_year(album_raw)
            # Strip the year from the album name before searching
            album_clean = re.sub(r'[\(\[]\s*(19|20)\d{2}\s*[\)\]]', '', album_raw).strip()
            album_clean = re.sub(r'\s*[-–]\s*(19|20)\d{2}$', '', album_clean).strip()

            log.info(f"\n{'─'*60}")
            log.info(f"🎵  {artist_name}  /  {album_clean}")

            release, mb_tracks, confidence, extra = resolve_album(artist_name, album_clean)
            use_mb = release is not None

            # Collect audio files (current folder + CD/Disc subfolders)
            def gather_files(directory: Path, disc_num=None):
                disc = disc_num or parse_disc(directory.name)
                for fp in sorted(directory.iterdir()):
                    if fp.is_file() and fp.suffix.lower() in TAGGERS:
                        yield fp, disc
                    elif fp.is_dir():
                        d = parse_disc(fp.name)
                        if d:
                            yield from gather_files(fp, disc_num=d)

            for filepath, disc_num in gather_files(album_dir):
                ext = filepath.suffix.lower()
                local_num, local_title = parse_track_filename(filepath.name)
                rel = filepath.relative_to(root)

                # ── Base tags (local fallback) ─────────────────────────────
                tags = {
                    "artist":      artist_name,
                    "albumartist": artist_name,
                    "album":       album_clean,
                    "title":       local_title,
                    "tracknumber": local_num,
                    "discnumber":  disc_num,
                    "year":        local_year,
                    "genre":       "",
                    "label":       "",
                }

                source = "local"

                # ── MusicBrainz enrichment ─────────────────────────────────
                if use_mb:
                    mb_track, track_sc = match_track(local_title, local_num, mb_tracks)
                    if mb_track and track_sc >= CONFIDENCE_THRESHOLD:
                        tags["title"]       = mb_track["title"]
                        tags["tracknumber"] = mb_track["position"]
                        tags["discnumber"]  = mb_track.get("disc", disc_num)
                        tags["year"]        = extra.get("year") or local_year
                        tags["genre"]       = extra.get("genre", "")
                        tags["label"]       = extra.get("label", "")
                        source = f"MB[{track_sc:.0f}%]"
                    else:
                        sc_str = f"{track_sc:.0f}%" if mb_track else "—"
                        log.warning(
                            f"    ⚠ Track not matched [{sc_str}]: {filepath.name} → local tags"
                        )

                # ── Write ──────────────────────────────────────────────────
                if dry_run:
                    log.info(f"  [DRY] {rel}")
                    log.info(f"        [{source}] {tags['tracknumber']}. {tags['title']}  "
                             f"| {tags['year']} | disc:{tags['discnumber']}")
                    stats["mb" if source != "local" else "local"] += 1
                    continue

                try:
                    TAGGERS[ext](filepath, tags)
                    log.info(
                        f"  ✔ [{source}] {rel.name}  →  "
                        f"{tags['tracknumber']}. {tags['title']}"
                    )
                    stats["mb" if source != "local" else "local"] += 1
                except Exception as exc:
                    log.error(f"  ✘ {rel}  →  {exc}")
                    stats["error"] += 1

    return stats


# ═══════════════════════════════════════════════════════════════════════════════
# Main
# ═══════════════════════════════════════════════════════════════════════════════

def main():
    parser = argparse.ArgumentParser(
        description="MP3/FLAC tagger for Navidrome using MusicBrainz (at least 90 percent confidence)."
    )
    parser.add_argument(
        "root",
        help="Library root (e.g. /mnt/music)"
    )
    parser.add_argument(
        "--dry-run", "-n",
        action="store_true",
        help="Simulation only: no file is modified"
    )
    args = parser.parse_args()

    root = Path(args.root)
    if not root.exists():
        log.error(f"Path not found: {root}")
        sys.exit(1)

    mode = "DRY-RUN 🔍" if args.dry_run else "WRITE ✏️"
    log.info(f"{'═'*60}")
    log.info(f"  tag_music_mb.py  —  {mode}")
    log.info(f"  Root         : {root}")
    log.info(f"  MB threshold : {CONFIDENCE_THRESHOLD}%")
    log.info(f"{'═'*60}")

    stats = process_library(root, dry_run=args.dry_run)

    log.info(f"\n{'═'*60}")
    log.info(f"  ✔ MB       : {stats['mb']} files tagged via MusicBrainz")
    log.info(f"  ✔ Local    : {stats['local']} files tagged locally (fallback)")
    log.info(f"  ✘ Errors   : {stats['error']}")
    log.info(f"  Full log → tag_music_mb.log")
    log.info(f"{'═'*60}")


if __name__ == "__main__":
    main()

What’s Next

  • Incremental mode: Currently these are full-scan scripts. A --newer-than flag or inotify watcher would make them useful for ongoing library maintenance.
  • Submit fingerprints back: The ~12,000 unmatched files could be submitted to AcoustID to help the next person with obscure taste.
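
As a rough idea of what the incremental mode could look like, the sketch below only yields album folders that contain an audio file modified after a cutoff. Nothing here exists in the scripts yet: albums_newer_than is a hypothetical helper, and a --newer-than flag would still need to be wired to it.

```python
import os
import tempfile
import time
from pathlib import Path

AUDIO_EXTS = {".mp3", ".flac"}


def albums_newer_than(root: Path, cutoff: float):
    """Yield Artist/Album dirs holding at least one audio file modified after cutoff."""
    for artist_dir in sorted(root.iterdir()):
        if not artist_dir.is_dir():
            continue
        for album_dir in sorted(artist_dir.iterdir()):
            if not album_dir.is_dir():
                continue
            # rglob also covers CD1/Disc 2 subfolders.
            if any(
                f.is_file() and f.suffix.lower() in AUDIO_EXTS and f.stat().st_mtime > cutoff
                for f in album_dir.rglob("*")
            ):
                yield album_dir


# Demo on a throwaway tree: one album untouched for a month, one just added.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    old = root / "Artist A" / "Old Album"
    new = root / "Artist B" / "New Album"
    for d in (old, new):
        d.mkdir(parents=True)
        (d / "01 - Track.mp3").write_bytes(b"")
    now = time.time()
    month_ago = now - 30 * 86400
    os.utime(old / "01 - Track.mp3", (month_ago, month_ago))
    recent = [p.name for p in albums_newer_than(root, cutoff=now - 86400)]
    print(recent)  # ['New Album']
```

One caveat with modification times: writing tags updates them, so every file the pipeline touches looks new on the next run. A small state file recording the last run time, or the set of albums already processed, would probably be needed alongside this.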

Now my Navidrome feels a lot more polished: accurate artist names, tracks in the right order, correct release years, and album art on most albums. Not bad for a Sunday project.

Tools I used: MusicBrainz, AcoustID, Cover Art Archive, Deezer API, Navidrome, mutagen, rapidfuzz. All scripts: navidrome-tools