Building a Semantic Search Engine for Music Cues: Local-First AI Without the API Costs
When you have hundreds of music files with inconsistent metadata, natural language search becomes essential—here's how I'm planning to build it without spending a dollar on API costs
About Me: I'm a business and product executive with zero coding experience. I've spent my career building products by working with engineering teams at Amazon, Wondery, Fox, Rovi, and TV Guide, but never wrote production code myself. Until recently.
Frustrated with the pace of traditional development and inspired by the AI coding revolution, I decided to build my own projects using AI assistants (primarily Claude Code, Codex, and Cursor). This blog post is part of that journey—documenting what I've learned building real production systems as a complete beginner.
TL;DR
The Problem: Several hundred music cues scattered across directories, inconsistent metadata, and no good way to search for "uplifting orchestral with strings" or "dark and tense synth."
The Solution I'm Planning: A local-first semantic search system that:
- Extracts audio features using librosa (tempo, key, spectral characteristics)
- Generates rich descriptions using local LLMs via Ollama (Llama 3.2 or Mistral)
- Creates semantic embeddings with sentence-transformers (100% local, no API calls)
- Indexes everything in SQLite for fast similarity search
- Serves results through a simple FastAPI + web UI
Cost: $0 (everything runs locally)
Privacy: 100% (audio never leaves my machine)
Trade-off: Slower initial indexing (~30-60 sec per file vs 5-10 sec with cloud APIs)
Key Learnings:
- Local LLMs have reached "good enough" quality for many tasks
- Audio feature extraction can replace true audio understanding for music cues
- Sentence-transformers are surprisingly powerful for semantic search
- KISS principle: Start with simple features, resist over-engineering
- The best architecture is one you'll actually finish
The Spark: An Audio Engineer's Real Problem
This project didn't start with me. It started with a conversation with a friend—a seasoned audio engineer with decades of experience in music production.
We were catching up over coffee when he mentioned his frustration: hundreds of music cues scattered across drives, inconsistent metadata, and no good way to search for what he needed. "I waste hours just trying to find the right cue for a project," he said. "I know I have something that would work, but I can't remember what it's called or where it is."
I pulled out my laptop and showed him Claude Code. "What if we could build something to solve this?"
We spent the next hour ideating together. He explained the problem from an audio engineer's perspective—what metadata matters, what makes a good search result, how he actually thinks about music cues. I explained what was technically possible with AI, embeddings, and semantic search.
By the end of the conversation, we had a plan. This blog post is that plan—documented, thought through, and ready to build.
What I love about this: This isn't a solution looking for a problem. It's a real problem, faced by a real professional, that we can solve together using AI coding tools. He has the domain expertise; I have the AI coding tools. Perfect partnership.
The Problem: Music Cue Hell
As my friend explained it (and I'm experiencing it too): several hundred music cues from various projects spread across multiple machines and drives. They're in multiple formats (WAV, MP3, FLAC, AIFF), scattered across different directories, and the metadata quality is... inconsistent.
Some files have decent metadata:
upbeat-guitar-loop-120bpm.wav
dark-ambient-drone.mp3
jazz-piano-solo-Cmajor.flac
Others are basically useless:
track_final_v3_FINAL.wav
bounce_2024_03_15.mp3
audio_export.wav
What He (and I) Need to Search For
When working on a project, audio engineers think in terms of:
- Mood/emotion: "Something melancholic and reflective"
- Genre/style: "80s synth vibes" or "orchestral with strings"
- Instrumentation: "Solo piano" or "electric guitar with drums"
- Use case: "Background music for an interview" or "uplifting transition music"
Traditional file search (Spotlight, grep, whatever) completely fails at this. I need semantic search—search that understands meaning, not just filename matches.
The Initial Question: Cloud or Local?
When I started planning this, my first instinct was to use cloud APIs:
- OpenAI Whisper for audio understanding
- Claude or GPT-4 for generating descriptions
- OpenAI embeddings for semantic search
But then I did the math.
Cost Analysis: Cloud vs Local
| Task | Cloud Approach | Estimated Cost (500 files) | Local Approach | Cost |
|---|---|---|---|---|
| Audio Analysis | Claude Haiku w/ audio samples | $3-8 | librosa feature extraction + local LLM | $0 |
| Embeddings | OpenAI text-embedding-3-small | $1-2 | sentence-transformers (local) | $0 |
| Re-indexing | Same costs every time | $4-10 each time | Free unlimited re-runs | $0 |
| Total | - | $5-10 (one-time) | - | $0 |
Okay, $5-10 isn't a lot. But here's what changed my mind:
- Re-indexing flexibility - With cloud APIs, every re-index costs money. Want to tweak the prompts? That's another $5-10. Add 50 new files? More costs.
- Privacy - Music cues might be from commercial projects. Uploading to cloud APIs feels sketchy.
- Learning opportunity - I've never built with local LLMs or audio analysis libraries. This is a chance to learn.
- Speed isn't critical - This is a one-time indexing job. If it takes 4 hours instead of 1 hour, so what?
Decision: Go local.
The Architecture: Local-First Everything
How Audio Search Actually Works
Here's the high-level flow:
1. SCAN FILES
└─> Find all audio files in specified directories
└─> Extract basic metadata (title, artist, genre, duration)
2. ANALYZE AUDIO
└─> Extract features with librosa:
• Tempo (BPM)
• Key/pitch
• Spectral features (brightness, rolloff)
• MFCC (timbre characteristics)
• Onset rate (rhythm complexity)
└─> Convert features to text description
└─> Use local LLM (Ollama) to synthesize into searchable description
3. GENERATE EMBEDDINGS
└─> Combine all metadata + AI description into rich text
└─> Generate embedding with sentence-transformers
└─> Store in SQLite
4. SEARCH
└─> Convert user query to embedding
└─> Cosine similarity search against all cues
└─> Return ranked results
The Key Components
1. Audio Feature Extraction: librosa
librosa is a Python library for music and audio analysis. It can extract tons of useful features without actually "listening" to the audio:
import librosa
# Load audio file
y, sr = librosa.load('upbeat-guitar.wav')
# Extract tempo
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
# Result: 120 BPM
# Extract chroma (pitch-class energy, used to estimate the key)
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
# Result: dominant pitch classes suggest C major
# Extract spectral features
spectral_centroids = librosa.feature.spectral_centroid(y=y, sr=sr)
# Result: higher centroid = "brighter" sound, lower = "darker"
# Extract MFCCs (timbre)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
# Result: timbral characteristics
From these features, I can generate text like:
"Tempo: 120 BPM, Key: C major, Brightness: High, Timbre: Warm, Rhythm complexity: Moderate"
2. Description Generation: Ollama (Local LLM)
Ollama makes running local LLMs ridiculously easy. No Docker, no complicated setup—just install and run.
The flow:
# Extract features from librosa
features = analyze_audio('upbeat-guitar.wav')
# Generate prompt for LLM
prompt = f"""
Based on these audio features, generate a rich description for music search:
Filename: upbeat-guitar.wav
Existing metadata: Title: "Upbeat Guitar Loop", Genre: "Rock"
Tempo: {features['tempo']} BPM
Key: {features['key']}
Spectral brightness: {features['brightness']}
Timbre: {features['timbre']}
Rhythm complexity: {features['rhythm']}
Generate a concise description including:
- Genre/style
- Mood/emotion
- Instrumentation (inferred from timbre)
- Suggested use cases
Return JSON: {{"genre": "...", "mood": "...", "instruments": "...", "description": "...", "tags": [...]}}
"""
# Call Ollama API (runs locally)
import json
import ollama
response = ollama.chat(model='llama3.2', messages=[
{'role': 'user', 'content': prompt}
])
# Parse JSON response
metadata = json.loads(response['message']['content'])
# Result: {
# "genre": "Rock, Indie",
# "mood": "Uplifting, Energetic",
# "instruments": "Electric guitar, drums, bass",
# "description": "Bright and energetic rock loop with driving rhythm",
# "tags": ["upbeat", "guitar", "rock", "energetic", "loop"]
# }
Why this works: Even though the LLM hasn't "heard" the audio, the features from librosa give it enough context to generate meaningful descriptions.
3. Semantic Embeddings: sentence-transformers
sentence-transformers is a Python library that generates embeddings locally. No API calls, no costs.
from sentence_transformers import SentenceTransformer
# Load model (downloads once, then cached)
model = SentenceTransformer('all-MiniLM-L6-v2') # Fast, good quality
# Combine all metadata into searchable text
searchable_text = f"""
Title: {metadata['title']}
Genre: {metadata['genre']}
Mood: {metadata['mood']}
Instruments: {metadata['instruments']}
Description: {metadata['description']}
Tags: {', '.join(metadata['tags'])}
"""
# Generate embedding
embedding = model.encode(searchable_text)
# Result: 384-dimensional vector
# Store in SQLite
db.execute("INSERT INTO cues (path, metadata, embedding) VALUES (?, ?, ?)",
(file_path, json.dumps(metadata), embedding.tobytes()))
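The INSERT above assumes a cues table already exists. A minimal schema sketch for it might look like this (the column names and id field are my assumptions and may change once I actually build it):
import sqlite3

db = sqlite3.connect('cues.db')
db.execute("""
CREATE TABLE IF NOT EXISTS cues (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT UNIQUE NOT NULL,        -- absolute path to the audio file
    metadata TEXT NOT NULL,           -- JSON blob: title, genre, mood, tags, ...
    embedding BLOB NOT NULL           -- 384-dim float32 vector from sentence-transformers
)
""")
db.commit()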
4. Search: Cosine Similarity
When a user searches, convert their query to an embedding and find the most similar cues:
# User query
query = "uplifting orchestral music with strings for a dramatic scene"
# Generate query embedding
query_embedding = model.encode(query)
# Load all embeddings from database (set db.row_factory = sqlite3.Row for dict-style rows)
cues = db.execute("SELECT path, metadata, embedding FROM cues").fetchall()
# Calculate cosine similarity
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
scores = []
for cue in cues:
    # Embeddings were stored as float32 bytes, so read them back with the same dtype
    cue_embedding = np.frombuffer(cue['embedding'], dtype=np.float32)
    similarity = cosine_similarity([query_embedding], [cue_embedding])[0][0]
    scores.append((cue['path'], cue['metadata'], similarity))
# Sort by similarity (highest first)
results = sorted(scores, key=lambda x: x[2], reverse=True)[:10]
# Return top 10 results
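With only a few hundred cues, all the embeddings fit comfortably in memory, so the per-row loop could also be collapsed into one vectorized computation. A rough sketch, assuming the same float32 storage and variables as above:
import numpy as np

# Stack all stored embeddings into one (n_cues, 384) matrix
matrix = np.vstack([np.frombuffer(cue['embedding'], dtype=np.float32) for cue in cues])

# Cosine similarity = dot product of L2-normalized vectors
matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
q = query_embedding / np.linalg.norm(query_embedding)
similarities = matrix @ q  # one score per cue

# Indices of the top 10 matches, best first
top = np.argsort(similarities)[::-1][:10]
results = [(cues[i]['path'], cues[i]['metadata'], float(similarities[i])) for i in top]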
The Tech Stack
| Component | Technology | Why |
|---|---|---|
| Backend | Python + FastAPI | Great ecosystem for ML/audio, FastAPI is simple and fast |
| Audio Analysis | librosa | Industry standard for music feature extraction |
| LLM | Ollama (Llama 3.2 or Mistral) | Easy local inference, good quality, free |
| Embeddings | sentence-transformers | Local, fast, no API costs |
| Database | SQLite | Zero config, perfect for local apps |
| Frontend | HTML/CSS/JS (vanilla) | Simple web UI with search box and audio player |
Dependencies
# requirements.txt
fastapi==0.104.1
uvicorn==0.24.0
librosa==0.10.1
ollama==0.1.6
sentence-transformers==2.2.2
scikit-learn==1.3.2
mutagen==1.47.0 # For metadata extraction
Total disk space: ~5GB (mostly for sentence-transformers model cache and Ollama models)
The Implementation Plan (KISS/YAGNI)
I'm planning to build this in phases, keeping it simple and only adding complexity when needed.
Phase 1: Audit & Extract (Est. 2 hours)
Goal: Understand what I have.
- Recursively scan directories for audio files
- Extract existing metadata using mutagen
- Use ffprobe for format info (duration, sample rate, bitrate)
- Generate CSV report: path, title, artist, genre, duration, quality_score
- Identify which files need AI analysis (poor metadata)
Quality scoring (a rough version is sketched right after this list):
- Good (3 points): Has title, artist, genre, and description
- Fair (2 points): Has some metadata but missing key fields
- Poor (1 point): Filename only, no useful metadata
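Here's a minimal sketch of that scoring using mutagen's easy-tag interface. The field list and thresholds are assumptions I expect to tune during the audit:
import mutagen

def quality_score(path):
    # Read whatever tags the file has; easy=True normalizes common fields (title, artist, genre)
    audio = mutagen.File(path, easy=True)
    tags = dict(audio.tags or {}) if audio else {}
    fields = ['title', 'artist', 'genre', 'description']
    present = sum(1 for f in fields if tags.get(f))
    if present >= 3:
        return 3   # Good: enough metadata to skip AI analysis
    if present >= 1:
        return 2   # Fair: some metadata, missing key fields
    return 1       # Poor: filename only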
Phase 2: Audio Analysis (Est. 3 hours)
Goal: Generate rich descriptions for all files.
- For files with "good" metadata: Skip AI analysis, just use existing data
- For "fair" and "poor" files: Run full analysis pipeline
- Extract audio features with librosa (30-second samples to save time)
- Generate descriptions using Ollama
- Store results in SQLite
Smart optimization: Only analyze the intro/middle/outro (3x 10-second clips) instead of the full file. This reduces processing time by ~70% while still capturing the essence of the music.
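librosa can load just a slice of a file via its offset and duration arguments, so the intro/middle/outro sampling could look roughly like this (the load_clips helper is my own sketch):
import librosa
import numpy as np

def load_clips(path, clip_seconds=10):
    # Sample the intro, the middle, and the end instead of decoding the whole file
    total = librosa.get_duration(path=path)
    offsets = [0, max(0, total / 2 - clip_seconds / 2), max(0, total - clip_seconds)]
    clips = []
    for offset in offsets:
        y, sr = librosa.load(path, offset=offset, duration=clip_seconds)
        clips.append(y)
    # Analyze the concatenated ~30 seconds as if it were one short file
    return np.concatenate(clips), sr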
Phase 3: Indexing (Est. 1 hour)
Goal: Generate embeddings for semantic search.
- Combine all metadata into searchable text
- Generate embeddings using sentence-transformers
- Store in SQLite with proper indexes
- Test search with sample queries
Phase 4: Search API (Est. 2 hours)
Goal: Build a simple FastAPI backend.
# main.py
from fastapi import FastAPI
from fastapi.responses import FileResponse
from sentence_transformers import SentenceTransformer
app = FastAPI()
model = SentenceTransformer('all-MiniLM-L6-v2')
@app.post("/search")
async def search(query: str, limit: int = 10):
# Generate query embedding
query_embedding = model.encode(query)
# Search database for similar cues
results = search_similar(query_embedding, limit)
return {"results": results}
@app.get("/cue/{id}")
async def get_cue(id: int):
# Return full metadata for a specific cue
return get_cue_by_id(id)
@app.get("/audio/{id}")
async def stream_audio(id: int):
# Stream audio file for preview
return FileResponse(get_audio_path(id))
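The search_similar, get_cue_by_id, and get_audio_path helpers above are still just names. Here's a hedged sketch of what search_similar might look like, reusing the cosine-similarity logic from earlier and the assumed cues schema:
import json
import sqlite3
import numpy as np

def search_similar(query_embedding, limit=10, db_path='cues.db'):
    # Pull every stored cue and rank by cosine similarity to the query
    db = sqlite3.connect(db_path)
    db.row_factory = sqlite3.Row
    rows = db.execute("SELECT id, path, metadata, embedding FROM cues").fetchall()
    q = query_embedding / np.linalg.norm(query_embedding)
    scored = []
    for row in rows:
        vec = np.frombuffer(row['embedding'], dtype=np.float32)
        score = float(np.dot(q, vec / np.linalg.norm(vec)))
        scored.append({'id': row['id'], 'path': row['path'],
                       'metadata': json.loads(row['metadata']), 'score': score})
    scored.sort(key=lambda r: r['score'], reverse=True)
    return scored[:limit]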
Phase 5: Web UI (Est. 3 hours)
Goal: Simple, functional interface.
- Search box with example queries
- Real-time results (debounced, 300ms delay)
- Result cards showing:
- Title, genre, mood, instruments
- Description and tags
- Similarity score (0-100%)
- HTML5 audio player for preview
- "Copy path" button
- Filters: duration range, genre, mood
- Keyboard shortcuts (↑/↓ to navigate results, Enter to play/pause)
Trade-offs: What I'm Giving Up
Going local-first means accepting some compromises:
| Aspect | Cloud APIs | Local-First |
|---|---|---|
| Quality | Best-in-class (GPT-4, Claude) | Good enough (Llama 3.2, Mistral) |
| Speed | 5-10 sec per file | 30-60 sec per file |
| Setup | API keys only | Install Ollama, download models (~5GB) |
| Cost | $5-10 per index | $0 forever |
| Privacy | Audio sent to APIs | 100% local |
| Re-indexing | Costs every time | Free unlimited |
Is "good enough" quality acceptable?
For music cue search, yes. I'm not generating creative content or writing essays. I need:
- "This is orchestral music" → Local LLM can do this
- "Mood is uplifting and dramatic" → Features + LLM can infer this
- "Instruments include strings and piano" → Timbre analysis + LLM works
The descriptions don't need to be perfect. They need to be searchable. Big difference.
What I'd Do Differently
1. Start with a Smaller Test Set
Instead of indexing all 500 files at once, I should start with 50 representative samples. Test the pipeline, tune the prompts, validate search quality—then scale up.
2. Build the Search UI First
I'm planning to build the indexer first, then the search UI. But actually, I should build a mock search UI with fake data first. This helps me understand what metadata I actually need before spending hours extracting it.
3. Consider Hybrid Approach
Maybe use local LLMs for most files, but fall back to Claude API for files where local analysis struggles (e.g., very complex orchestral pieces). Best of both worlds: mostly free, occasionally high-quality.
4. Version the Embeddings
If I improve the prompts or switch models later, I'll need to re-generate embeddings. Should include a version field in the database so I can track which cues need re-indexing.
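A lightweight way to do that, assuming the SQLite setup sketched earlier: tag each row with the model/prompt version used to build its embedding, then query for rows that are stale.
EMBEDDING_VERSION = "all-MiniLM-L6-v2/prompt-v1"   # bump when the model or prompt changes

db.execute("ALTER TABLE cues ADD COLUMN embedding_version TEXT")  # one-time migration

# Find cues whose embeddings were built with an older model/prompt combination
stale = db.execute(
    "SELECT id, path FROM cues WHERE embedding_version IS NULL OR embedding_version != ?",
    (EMBEDDING_VERSION,)
).fetchall()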
Key Learnings from the Planning Process
1. KISS (Keep It Simple, Stupid) Is Hard
My first instinct was to over-engineer this:
- "Should I build a recommendation engine?"
- "What about playlist generation?"
- "Should I support collaborative filtering?"
No. I need to search music cues. That's it. Everything else is scope creep.
2. YAGNI (You Aren't Gonna Need It) Applies to Infrastructure Too
I almost convinced myself I needed:
- Kubernetes for deployment
- PostgreSQL instead of SQLite
- Redis for caching
- Docker for containerization
Why? This is a local tool for me. It doesn't need to scale to millions of users. SQLite + FastAPI running on localhost:8000 is plenty.
3. The Best Architecture Is One You'll Finish
Cloud APIs would give me slightly better quality and faster indexing. But the local-first approach is more exciting to me because:
- I get to learn new tools (Ollama, librosa, sentence-transformers)
- I can experiment freely without worrying about API costs
- The entire system runs on my laptop with no dependencies
If the architecture motivates me to finish the project, that's the right architecture.
4. Audio Feature Extraction Is Shockingly Powerful
I assumed I'd need to actually "listen" to the audio (like Whisper does for speech). But librosa can extract so much information just from the waveform:
- Tempo and rhythm
- Pitch and key
- Brightness and timbre
- Harmonic vs percussive content
Combined with a local LLM, these features are enough to generate rich, searchable descriptions.
What's Next
This post is the planning phase. I haven't written a single line of code yet.
Next steps:
- Install dependencies - Ollama, librosa, sentence-transformers
- Build the audit script - Scan directories, extract metadata, generate report
- Test audio analysis - Run librosa + Ollama on 10 sample files, validate quality
- Build the indexer - Full pipeline from audio → embeddings → SQLite
- Build the search API - FastAPI backend with cosine similarity search
- Build the web UI - Simple search interface with audio preview
- Write a follow-up post - Document what actually happened vs what I planned
I'm estimating 12-15 hours of work total. We'll see if that's optimistic or pessimistic.
Final Thoughts
This project is a great example of how AI coding tools change the calculus of what's worth building.
Before AI coding assistants:
- I'd need to learn librosa, Ollama, sentence-transformers, FastAPI, and SQLite
- Each library would take days or weeks to understand
- Total time to build: 2-3 months
- Likelihood of finishing: 20%
With AI coding assistants:
- I understand the concepts (embeddings, semantic search, audio features)
- Claude Code handles the implementation details
- Total time to build: 12-15 hours
- Likelihood of finishing: 90%
The difference isn't that the AI "does it for me." The difference is that I can focus on what to build instead of how to implement every detail.
I still need to:
- Understand the problem domain (music search, audio analysis)
- Choose the right architecture (local vs cloud, SQLite vs PostgreSQL)
- Make UX decisions (what metadata to show, how to display results)
- Validate quality (do the descriptions make sense? does search work?)
But I don't need to memorize the librosa API or debug sentence-transformers tensor dimensions. That's what Claude Code is for.
The best part? Even if this project doesn't work perfectly, I'll have learned a ton about audio analysis, semantic search, and local LLMs. That knowledge carries forward to the next project.
And working with my audio engineer friend means I'm building something that solves a real problem for a real professional. His domain expertise combined with AI coding tools is a powerful combination—he knows exactly what he needs, and I can help him build it.
That's the real value of building in public: the journey, the collaboration, and the learning—not just the destination.
Follow along as I build this with my friend. Next post will cover the actual implementation—what worked, what didn't, and what we learned along the way.