Building a Semantic Search Engine for Music Cues: Local-First AI Without the API Costs

When you have hundreds of music files with inconsistent metadata, natural language search becomes essential—here's how I'm planning to build it without spending a dollar on API costs


About Me: I'm a business and product executive with zero coding experience. I've spent my career building products by working with engineering teams at Amazon, Wondery, Fox, Rovi, and TV Guide, but never wrote production code myself. Until recently.

Frustrated with the pace of traditional development and inspired by the AI coding revolution, I decided to build my own projects using AI assistants (primarily Claude Code, Codex, and Cursor). This blog post is part of that journey—documenting what I've learned building real production systems as a complete beginner.


TL;DR

The Problem: Several hundred music cues scattered across directories, inconsistent metadata, and no good way to search for "uplifting orchestral with strings" or "dark and tense synth."

The Solution I'm Planning: A local-first semantic search system that extracts audio features with librosa, turns them into rich descriptions with a local LLM (Ollama), embeds those descriptions with sentence-transformers, and answers natural language queries with cosine similarity search over SQLite.

Cost: $0 (everything runs locally)
Privacy: 100% (audio never leaves my machine)
Trade-off: Slower initial indexing (~30-60 sec per file vs 5-10 sec with cloud APIs)

Key Learnings:

KISS is hard: my first instinct was to over-engineer.
YAGNI applies to infrastructure too: SQLite + FastAPI on localhost is plenty.
The best architecture is the one you'll actually finish.
Audio feature extraction is shockingly powerful, even without "listening" to the audio.


The Spark: An Audio Engineer's Real Problem

This project didn't start with me. It started with a conversation with a friend—a seasoned audio engineer with decades of experience in music production.

We were catching up over coffee when he mentioned his frustration: hundreds of music cues scattered across drives, inconsistent metadata, and no good way to search for what he needed. "I waste hours just trying to find the right cue for a project," he said. "I know I have something that would work, but I can't remember what it's called or where it is."

I pulled out my laptop and showed him Claude Code. "What if we could build something to solve this?"

We spent the next hour ideating together. He explained the problem from an audio engineer's perspective—what metadata matters, what makes a good search result, how he actually thinks about music cues. I explained what was technically possible with AI, embeddings, and semantic search.

By the end of the conversation, we had a plan. This blog post is that plan—documented, thought through, and ready to build.

What I love about this: This isn't a solution looking for a problem. It's a real problem, faced by a real professional, that we can solve together using AI coding tools. He has the domain expertise; I have the AI coding tools. Perfect partnership.


The Problem: Music Cue Hell

As my friend explained it (and I'm experiencing it too): several hundred music cues from various projects spread across multiple machines and drives. They're in multiple formats (WAV, MP3, FLAC, AIFF), scattered across different directories, and the metadata quality is... inconsistent.

Some files have decent metadata:

upbeat-guitar-loop-120bpm.wav
dark-ambient-drone.mp3
jazz-piano-solo-Cmajor.flac

Others are basically useless:

track_final_v3_FINAL.wav
bounce_2024_03_15.mp3
audio_export.wav

What He (and I) Need to Search For

When working on a project, audio engineers think in terms of mood and emotion ("uplifting", "dark and tense"), genre and style, instrumentation ("orchestral with strings", "synth"), tempo and energy, and the scene or use case the cue needs to support.

Traditional file search (Spotlight, grep, whatever) completely fails at this. I need semantic search—search that understands meaning, not just filename matches.


The Initial Question: Cloud or Local?

When I started planning this, my first instinct was to use cloud APIs: Claude (Haiku) to describe audio samples, and OpenAI's text-embedding-3-small for the embeddings.

But then I did the math.

Cost Analysis: Cloud vs Local

| Task | Cloud Approach | Estimated Cost (500 files) | Local Approach | Cost |
| --- | --- | --- | --- | --- |
| Audio Analysis | Claude Haiku w/ audio samples | $3-8 | librosa feature extraction + local LLM | $0 |
| Embeddings | OpenAI text-embedding-3-small | $1-2 | sentence-transformers (local) | $0 |
| Re-indexing | Same costs every time | $4-10 each time | Free unlimited re-runs | $0 |
| Total | - | $5-10 (one-time) | - | $0 |

Okay, $5-10 isn't a lot. But here's what changed my mind:

  1. Re-indexing flexibility - With cloud APIs, every re-index costs money. Want to tweak the prompts? That's another $5-10. Add 50 new files? More costs.
  2. Privacy - Music cues might be from commercial projects. Uploading to cloud APIs feels sketchy.
  3. Learning opportunity - I've never built with local LLMs or audio analysis libraries. This is a chance to learn.
  4. Speed isn't critical - This is a one-time indexing job. If it takes 4 hours instead of 1 hour, so what?

Decision: Go local.


The Architecture: Local-First Everything

How Audio Search Actually Works

Here's the high-level flow:

1. SCAN FILES
   └─> Find all audio files in specified directories
   └─> Extract basic metadata (title, artist, genre, duration)

2. ANALYZE AUDIO
   └─> Extract features with librosa:
       • Tempo (BPM)
       • Key/pitch
       • Spectral features (brightness, rolloff)
       • MFCC (timbre characteristics)
       • Onset rate (rhythm complexity)
   └─> Convert features to text description
   └─> Use local LLM (Ollama) to synthesize into searchable description

3. GENERATE EMBEDDINGS
   └─> Combine all metadata + AI description into rich text
   └─> Generate embedding with sentence-transformers
   └─> Store in SQLite

4. SEARCH
   └─> Convert user query to embedding
   └─> Cosine similarity search against all cues
   └─> Return ranked results
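
To make that concrete, here's a rough sketch of how the four steps could hang together per file. The helper names (scan_metadata, describe_with_llm, build_searchable_text, embed_text, store_cue) are placeholders for the components described below, not a settled API:

# Hypothetical per-file indexing pipeline; each helper maps to one step above
def index_file(path, db):
    basic = scan_metadata(path)                            # 1. title, artist, genre, duration
    features = analyze_audio(path)                         # 2a. librosa features (tempo, key, brightness, ...)
    described = describe_with_llm(path, basic, features)   # 2b. local LLM synthesis via Ollama
    text = build_searchable_text(basic, described)         # 3a. combine metadata into rich text
    embedding = embed_text(text)                           # 3b. sentence-transformers vector
    store_cue(db, path, basic, described, embedding)       # 3c. persist to SQLite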

The Key Components

1. Audio Feature Extraction: librosa

librosa is a Python library for music and audio analysis. It can extract tons of useful features without actually "listening" to the audio:

import librosa

# Load audio file
y, sr = librosa.load('upbeat-guitar.wav')

# Extract tempo
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
# Result: 120 BPM

# Extract chroma (pitch-class energy, used to estimate the key)
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
# Result: dominant pitch classes suggest, e.g., C major

# Extract spectral features
spectral_centroids = librosa.feature.spectral_centroid(y=y, sr=sr)
# Result: higher centroid = "brighter" sound, lower = "darker"

# Extract MFCCs (timbre)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
# Result: timbral characteristics

From these features, I can generate text like:

"Tempo: 120 BPM, Key: C major, Brightness: High, Timbre: Warm, Rhythm complexity: Moderate"

2. Description Generation: Ollama (Local LLM)

Ollama makes running local LLMs ridiculously easy. No Docker, no complicated setup—just install and run.

The flow:

# Extract features from librosa (analyze_audio is a stand-in for a helper wrapping the calls above)
features = analyze_audio('upbeat-guitar.wav')

# Generate prompt for LLM
prompt = f"""
Based on these audio features, generate a rich description for music search:

Filename: upbeat-guitar.wav
Existing metadata: Title: "Upbeat Guitar Loop", Genre: "Rock"
Tempo: {features['tempo']} BPM
Key: {features['key']}
Spectral brightness: {features['brightness']}
Timbre: {features['timbre']}
Rhythm complexity: {features['rhythm']}

Generate a concise description including:
- Genre/style
- Mood/emotion
- Instrumentation (inferred from timbre)
- Suggested use cases

Return JSON: {{"genre": "...", "mood": "...", "instruments": "...", "description": "...", "tags": [...]}}
"""

# Call Ollama API (runs locally)
import json
import ollama
response = ollama.chat(model='llama3.2', messages=[
    {'role': 'user', 'content': prompt}
])

# Parse JSON response
metadata = json.loads(response['message']['content'])
# Result: {
#   "genre": "Rock, Indie",
#   "mood": "Uplifting, Energetic",
#   "instruments": "Electric guitar, drums, bass",
#   "description": "Bright and energetic rock loop with driving rhythm",
#   "tags": ["upbeat", "guitar", "rock", "energetic", "loop"]
# }

Why this works: Even though the LLM hasn't "heard" the audio, the features from librosa give it enough context to generate meaningful descriptions.

3. Semantic Embeddings: sentence-transformers

sentence-transformers is a Python library that generates embeddings locally. No API calls, no costs.

from sentence_transformers import SentenceTransformer

# Load model (downloads once, then cached)
model = SentenceTransformer('all-MiniLM-L6-v2')  # Fast, good quality

# Combine all metadata into searchable text
searchable_text = f"""
Title: {metadata['title']}
Genre: {metadata['genre']}
Mood: {metadata['mood']}
Instruments: {metadata['instruments']}
Description: {metadata['description']}
Tags: {', '.join(metadata['tags'])}
"""

# Generate embedding
embedding = model.encode(searchable_text)
# Result: 384-dimensional vector

# Store in SQLite (the embedding is a float32 numpy array; tobytes() keeps that dtype)
db.execute("INSERT INTO cues (path, metadata, embedding) VALUES (?, ?, ?)",
           (file_path, json.dumps(metadata), embedding.tobytes()))

4. Search: Cosine Similarity

When a user searches, convert their query to an embedding and find the most similar cues:

# User query
query = "uplifting orchestral music with strings for a dramatic scene"

# Generate query embedding
query_embedding = model.encode(query)

# Load all embeddings from database (dict-style access assumes db.row_factory = sqlite3.Row)
cues = db.execute("SELECT path, metadata, embedding FROM cues").fetchall()

# Calculate cosine similarity
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

scores = []
for cue in cues:
    # Decode with the same dtype the embeddings were stored as (float32)
    cue_embedding = np.frombuffer(cue['embedding'], dtype=np.float32)
    similarity = cosine_similarity([query_embedding], [cue_embedding])[0][0]
    scores.append((cue['path'], cue['metadata'], similarity))

# Sort by similarity (highest first)
results = sorted(scores, key=lambda x: x[2], reverse=True)[:10]

# Return top 10 results
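
For a few hundred cues, that loop is plenty fast. If the library ever grows, the same computation can be done in one shot by stacking the stored vectors into a matrix; a sketch, assuming the embeddings decode as above:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stack all stored embeddings into one (n_cues, 384) matrix
matrix = np.vstack([np.frombuffer(c['embedding'], dtype=np.float32) for c in cues])

# One call scores the query against every cue at once
sims = cosine_similarity([query_embedding], matrix)[0]
top_indices = np.argsort(sims)[::-1][:10]
results = [(cues[i]['path'], cues[i]['metadata'], float(sims[i])) for i in top_indices]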

The Tech Stack

| Component | Technology | Why |
| --- | --- | --- |
| Backend | Python + FastAPI | Great ecosystem for ML/audio, FastAPI is simple and fast |
| Audio Analysis | librosa | Industry standard for music feature extraction |
| LLM | Ollama (Llama 3.2 or Mistral) | Easy local inference, good quality, free |
| Embeddings | sentence-transformers | Local, fast, no API costs |
| Database | SQLite | Zero config, perfect for local apps |
| Frontend | HTML/CSS/JS (vanilla) | Simple web UI with search box and audio player |

Dependencies

# requirements.txt
fastapi==0.104.1
uvicorn==0.24.0
librosa==0.10.1
ollama==0.1.6
sentence-transformers==2.2.2
scikit-learn==1.3.2
mutagen==1.47.0  # For metadata extraction

Total disk space: ~5GB (mostly for sentence-transformers model cache and Ollama models)


The Implementation Plan (KISS/YAGNI)

I'm planning to build this in phases, keeping it simple and only adding complexity when needed.

Phase 1: Audit & Extract (Est. 2 hours)

Goal: Understand what I have.

Quality scoring:

Phase 2: Audio Analysis (Est. 3 hours)

Goal: Generate rich descriptions for all files.

Smart optimization: Only analyze the intro/middle/outro (3x 10-second clips) instead of the full file. This reduces processing time by ~70% while still capturing the essence of the music.
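
A sketch of what that could look like using librosa's offset/duration loading. The 10-second window and the three positions mirror the idea above, and averaging the per-clip features is just one simple way to combine them:

import librosa
import numpy as np

def analyze_clips(path, clip_seconds=10):
    total = librosa.get_duration(path=path)
    # Intro, middle, and outro start times (clamped so the last clip still fits)
    starts = [0, max(0, total / 2 - clip_seconds / 2), max(0, total - clip_seconds)]

    tempos, centroids = [], []
    for start in starts:
        y, sr = librosa.load(path, offset=start, duration=clip_seconds)
        tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
        tempos.append(float(tempo))
        centroids.append(float(librosa.feature.spectral_centroid(y=y, sr=sr).mean()))

    # Average across the three clips as a rough summary of the whole file
    return {'tempo': np.mean(tempos), 'brightness': np.mean(centroids)}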

Phase 3: Indexing (Est. 1 hour)

Goal: Generate embeddings for semantic search.
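
The storage side stays dead simple. A sketch of the table I have in mind (the column names are provisional):

import sqlite3

db = sqlite3.connect('cues.db')
db.execute("""
    CREATE TABLE IF NOT EXISTS cues (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        path TEXT UNIQUE NOT NULL,
        metadata TEXT NOT NULL,      -- JSON blob: title, genre, mood, tags, ...
        embedding BLOB NOT NULL      -- float32 vector from sentence-transformers
    )
""")
db.commit()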

Phase 4: Search API (Est. 2 hours)

Goal: Build a simple FastAPI backend.

# main.py
from fastapi import FastAPI
from fastapi.responses import FileResponse
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer('all-MiniLM-L6-v2')

@app.post("/search")
async def search(query: str, limit: int = 10):
    # Generate query embedding
    query_embedding = model.encode(query)

    # Search database for similar cues
    results = search_similar(query_embedding, limit)

    return {"results": results}

@app.get("/cue/{id}")
async def get_cue(id: int):
    # Return full metadata for a specific cue
    return get_cue_by_id(id)

@app.get("/audio/{id}")
async def stream_audio(id: int):
    # Stream audio file for preview
    return FileResponse(get_audio_path(id))
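
Once that's running (e.g., with uvicorn main:app --reload), querying it is a one-liner. A quick sanity check using the requests library against uvicorn's default port:

import requests

response = requests.post(
    "http://localhost:8000/search",
    params={"query": "dark and tense synth", "limit": 5},
)
for hit in response.json()["results"]:
    print(hit)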

Phase 5: Web UI (Est. 3 hours)

Goal: A simple, functional interface with a search box, a ranked list of results and their metadata, and an inline audio player for previews.


Trade-offs: What I'm Giving Up

Going local-first means accepting some compromises:

| Aspect | Cloud APIs | Local-First |
| --- | --- | --- |
| Quality | Best-in-class (GPT-4, Claude) | Good enough (Llama 3.2, Mistral) |
| Speed | 5-10 sec per file | 30-60 sec per file |
| Setup | API keys only | Install Ollama, download models (~5GB) |
| Cost | $5-10 per index | $0 forever |
| Privacy | Audio sent to APIs | 100% local |
| Re-indexing | Costs every time | Free unlimited |

Is "good enough" quality acceptable?

For music cue search, yes. I'm not generating creative content or writing essays. I need descriptions that capture genre, mood, tempo, and instrumentation well enough that a natural language query surfaces the right cues.

The descriptions don't need to be perfect. They need to be searchable. Big difference.


What I'd Do Differently

1. Start with a Smaller Test Set

Instead of indexing all 500 files at once, I should start with 50 representative samples. Test the pipeline, tune the prompts, validate search quality—then scale up.
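
Something as simple as sampling from the Phase 1 audit report would do; a tiny sketch, where audit_report is the hypothetical list produced by the audit script:

import random

random.seed(42)  # reproducible pilot set
pilot_set = random.sample(audit_report, k=min(50, len(audit_report)))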

2. Build the Search UI First

I'm planning to build the indexer first, then the search UI. But actually, I should build a mock search UI with fake data first. This helps me understand what metadata I actually need before spending hours extracting it.

3. Consider Hybrid Approach

Maybe use local LLMs for most files, but fall back to Claude API for files where local analysis struggles (e.g., very complex orchestral pieces). Best of both worlds: mostly free, occasionally high-quality.
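
If we go that route, the fallback could be a simple confidence check. A sketch using the anthropic Python SDK, where the model name, the 0.7 threshold, and the local helpers are all placeholders:

import anthropic

def describe(prompt, local_confidence):
    # Use the free local model unless its output looked weak for this file
    if local_confidence >= 0.7:  # placeholder threshold
        return describe_with_ollama(prompt)  # hypothetical local helper
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model="claude-3-5-haiku-latest",  # placeholder model name
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text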

4. Version the Embeddings

If I improve the prompts or switch models later, I'll need to re-generate embeddings. Should include a version field in the database so I can track which cues need re-indexing.
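
In SQLite terms that's one extra column plus a query for stale rows. A sketch building on the Phase 3 table, where EMBEDDING_VERSION is a constant I'd bump whenever the prompts or the model change:

EMBEDDING_VERSION = 2  # bump whenever the prompts or the embedding model change

# One-time migration for the existing table
db.execute("ALTER TABLE cues ADD COLUMN embedding_version INTEGER DEFAULT 1")

# Anything indexed with an older version needs re-embedding
stale = db.execute(
    "SELECT id, path FROM cues WHERE embedding_version < ?",
    (EMBEDDING_VERSION,),
).fetchall()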


Key Learnings from the Planning Process

1. KISS (Keep It Simple, Stupid) Is Hard

My first instinct was to over-engineer this with extra features and infrastructure that have nothing to do with the core job.

No. I need to search music cues. That's it. Everything else is scope creep.

2. YAGNI (You Aren't Gonna Need It) Applies to Infrastructure Too

I almost convinced myself I needed a dedicated vector database instead of SQLite, and a deployed service built to scale.

Why? This is a local tool for me. It doesn't need to scale to millions of users. SQLite + FastAPI running on localhost:8000 is plenty.

3. The Best Architecture Is One You'll Finish

Cloud APIs would give me slightly better quality and faster indexing. But the local-first approach is more exciting to me because re-indexing is free (so I can keep tweaking prompts), nothing ever leaves my machine, and I get to learn local LLMs and audio analysis along the way.

If the architecture motivates me to finish the project, that's the right architecture.

4. Audio Feature Extraction Is Shockingly Powerful

I assumed I'd need to actually "listen" to the audio (like Whisper does for speech). But librosa can extract so much information just from the waveform: tempo, key, spectral brightness, timbre (via MFCCs), and rhythm complexity (via onset rate).

Combined with a local LLM, these features are enough to generate rich, searchable descriptions.


What's Next

This post is the planning phase. I haven't written a single line of code yet.

Next steps:

  1. Install dependencies - Ollama, librosa, sentence-transformers
  2. Build the audit script - Scan directories, extract metadata, generate report
  3. Test audio analysis - Run librosa + Ollama on 10 sample files, validate quality
  4. Build the indexer - Full pipeline from audio → embeddings → SQLite
  5. Build the search API - FastAPI backend with cosine similarity search
  6. Build the web UI - Simple search interface with audio preview
  7. Write a follow-up post - Document what actually happened vs what I planned

I'm estimating 12-15 hours of work total. We'll see if that's optimistic or pessimistic.


Final Thoughts

This project is a great example of how AI coding tools change the calculus of what's worth building.

Before AI coding assistants: a niche tool like this meant hiring engineers or convincing a team to prioritize it, and a personal itch like searchable music cues simply wouldn't get built.

With AI coding assistants: I can plan the architecture, generate working code with Claude Code, and build the whole thing myself in an estimated 12-15 hours.

The difference isn't that the AI "does it for me." The difference is that I can focus on what to build instead of how to implement every detail.

I still need to understand the problem deeply, make the architecture decisions, weigh the trade-offs, and judge whether the search results are actually good.

But I don't need to memorize the librosa API or debug sentence-transformers tensor dimensions. That's what Claude Code is for.

The best part? Even if this project doesn't work perfectly, I'll have learned a ton about audio analysis, semantic search, and local LLMs. That knowledge carries forward to the next project.

And working with my audio engineer friend means I'm building something that solves a real problem for a real professional. His domain expertise combined with AI coding tools is a powerful combination—he knows exactly what he needs, and I can help him build it.

That's the real value of building in public: the journey, the collaboration, and the learning—not just the destination.


Follow along as I build this with my friend. Next post will cover the actual implementation—what worked, what didn't, and what we learned along the way.