Building a Semantic Search Engine for Music Cues: Local-First AI Without the API Costs
When you have hundreds of music files with inconsistent metadata, natural language search becomes essential—here's how I'm planning to build it without spending a dollar on API costs
About Me: I'm a business and product executive with zero coding experience. I've spent my career building products by working with engineering teams at Amazon, Wondery, Fox, Rovi, and TV Guide, but never wrote production code myself. Until recently.
Frustrated with the pace of traditional development and inspired by the AI coding revolution, I decided to build my own projects using AI assistants (primarily Claude Code, Codex, and Cursor). This blog post is part of that journey—documenting what I've learned building real production systems as a complete beginner.
TL;DR
The Problem: Several hundred music cues scattered across directories, inconsistent metadata, and no good way to search for "uplifting orchestral with strings" or "dark and tense synth."
The Solution I'm Planning: A local-first semantic search system that:
- Extracts audio features using librosa (tempo, key, spectral characteristics)
- Generates rich descriptions using local LLMs via Ollama (Llama 3.2 or Mistral)
- Creates semantic embeddings with sentence-transformers (100% local, no API calls)
- Indexes everything in SQLite for fast similarity search
- Serves results through a simple FastAPI + web UI
Cost: $0 (everything runs locally)
Privacy: 100% (audio never leaves my machine)
Trade-off: Slower initial indexing (~30-60 sec per file vs 5-10 sec with cloud APIs)
Key Learnings:
- Local LLMs have reached "good enough" quality for many tasks
- Audio feature extraction can replace true audio understanding for music cues
- Sentence-transformers are surprisingly powerful for semantic search
- KISS principle: Start with simple features, resist over-engineering
- The best architecture is one you'll actually finish
The Spark: An Audio Engineer's Real Problem
This project didn't start with me. It started with a conversation with a friend—a seasoned audio engineer with decades of experience in music production.
We were catching up over coffee when he mentioned his frustration: hundreds of music cues scattered across drives, inconsistent metadata, and no good way to search for what he needed. "I waste hours just trying to find the right cue for a project," he said. "I know I have something that would work, but I can't remember what it's called or where it is."
I pulled out my laptop and showed him Claude Code. "What if we could build something to solve this?"
We spent the next hour ideating together. He explained the problem from an audio engineer's perspective—what metadata matters, what makes a good search result, how he actually thinks about music cues. I explained what was technically possible with AI, embeddings, and semantic search.
By the end of the conversation, we had a plan. This blog post is that plan—documented, thought through, and ready to build.
What I love about this: This isn't a solution looking for a problem. It's a real problem, faced by a real professional, that we can solve together using AI coding tools. He has the domain expertise; I have the AI coding tools. Perfect partnership.
The Problem: Music Cue Hell
As my friend explained it (and I'm experiencing it too): several hundred music cues from various projects spread across multiple machines and drives. They're in multiple formats (WAV, MP3, FLAC, AIFF), scattered across different directories, and the metadata quality is... inconsistent.
Some files have decent metadata:
upbeat-guitar-loop-120bpm.wav
dark-ambient-drone.mp3
jazz-piano-solo-Cmajor.flac
Others are basically useless:
track_final_v3_FINAL.wav
bounce_2024_03_15.mp3
audio_export.wav
What He (and I) Need to Search For
When working on a project, audio engineers think in terms of:
- Mood/emotion: "Something melancholic and reflective"
- Genre/style: "80s synth vibes" or "orchestral with strings"
- Instrumentation: "Solo piano" or "electric guitar with drums"
- Use case: "Background music for an interview" or "uplifting transition music"
Traditional file search (Spotlight, grep, whatever) completely fails at this. I need semantic search—search that understands meaning, not just filename matches.
The Initial Question: Cloud or Local?
When I started planning this, my first instinct was to use cloud APIs:
- OpenAI Whisper for audio understanding
- Claude or GPT-4 for generating descriptions
- OpenAI embeddings for semantic search
But then I did the math.
Cost Analysis: Cloud vs Local
| Task | Cloud Approach | Estimated Cost (500 files) | Local Approach | Cost |
|---|---|---|---|---|
| Audio Analysis | Claude Haiku w/ audio samples | $3-8 | librosa feature extraction + local LLM | $0 |
| Embeddings | OpenAI text-embedding-3-small | $1-2 | sentence-transformers (local) | $0 |
| Re-indexing | Same costs every time | $4-10 each time | Free unlimited re-runs | $0 |
| Total | - | $5-10 (one-time) | - | $0 |
Okay, $5-10 isn't a lot. But here's what changed my mind:
- Re-indexing flexibility - With cloud APIs, every re-index costs money. Want to tweak the prompts? That's another $5-10. Add 50 new files? More costs.
- Privacy - Music cues might be from commercial projects. Uploading to cloud APIs feels sketchy.
- Learning opportunity - I've never built with local LLMs or audio analysis libraries. This is a chance to learn.
- Speed isn't critical - This is a one-time indexing job. If it takes 4 hours instead of 1 hour, so what?
Decision: Go local.
The Architecture: Local-First Everything
How Audio Search Actually Works
Here's the high-level flow:
1. SCAN FILES
└─> Find all audio files in specified directories
└─> Extract basic metadata (title, artist, genre, duration)
2. ANALYZE AUDIO
└─> Extract features with librosa:
• Tempo (BPM)
• Key/pitch
• Spectral features (brightness, rolloff)
• MFCC (timbre characteristics)
• Onset rate (rhythm complexity)
└─> Convert features to text description
└─> Use local LLM (Ollama) to synthesize into searchable description
3. GENERATE EMBEDDINGS
└─> Combine all metadata + AI description into rich text
└─> Generate embedding with sentence-transformers
└─> Store in SQLite
4. SEARCH
└─> Convert user query to embedding
└─> Cosine similarity search against all cues
└─> Return ranked results
The Key Components
1. Audio Feature Extraction: librosa
librosa is a Python library for music and audio analysis. It can extract tons of useful features without actually "listening" to the audio:
import librosa
# Load audio file
y, sr = librosa.load('upbeat-guitar.wav')
# Extract tempo
tempo, beats = librosa.beat.beat_track(y=y, sr=sr)
# Result: 120 BPM
# Extract chroma (pitch-class energy, used to estimate the key)
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
# Result: dominant pitch classes suggest C major
# Extract spectral features
spectral_centroids = librosa.feature.spectral_centroid(y=y, sr=sr)
# Result: higher centroid = "brighter" sound, lower = "darker"
# Extract MFCCs (timbre)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
# Result: timbral characteristics
From these features, I can generate text like:
"Tempo: 120 BPM, Key: C major, Brightness: High, Timbre: Warm, Rhythm complexity: Moderate"
2. Description Generation: Ollama (Local LLM)
Ollama makes running local LLMs ridiculously easy. No Docker, no complicated setup—just install and run.
The flow:
# Extract features from librosa
features = analyze_audio('upbeat-guitar.wav')
# Generate prompt for LLM
prompt = f"""
Based on these audio features, generate a rich description for music search:
Filename: upbeat-guitar.wav
Existing metadata: Title: "Upbeat Guitar Loop", Genre: "Rock"
Tempo: {features['tempo']} BPM
Key: {features['key']}
Spectral brightness: {features['brightness']}
Timbre: {features['timbre']}
Rhythm complexity: {features['rhythm']}
Generate a concise description including:
- Genre/style
- Mood/emotion
- Instrumentation (inferred from timbre)
- Suggested use cases
Return JSON: {{"genre": "...", "mood": "...", "instruments": "...", "description": "...", "tags": [...]}}
"""
# Call Ollama API (runs locally)
import json
import ollama
response = ollama.chat(model='llama3.2', messages=[
{'role': 'user', 'content': prompt}
])
# Parse JSON response
metadata = json.loads(response['message']['content'])
# Result: {
# "genre": "Rock, Indie",
# "mood": "Uplifting, Energetic",
# "instruments": "Electric guitar, drums, bass",
# "description": "Bright and energetic rock loop with driving rhythm",
# "tags": ["upbeat", "guitar", "rock", "energetic", "loop"]
# }
Why this works: Even though the LLM hasn't "heard" the audio, the features from librosa give it enough context to generate meaningful descriptions.
3. Semantic Embeddings: sentence-transformers
sentence-transformers is a Python library that generates embeddings locally. No API calls, no costs.
from sentence_transformers import SentenceTransformer
# Load model (downloads once, then cached)
model = SentenceTransformer('all-MiniLM-L6-v2') # Fast, good quality
# Combine all metadata into searchable text
searchable_text = f"""
Title: {metadata['title']}
Genre: {metadata['genre']}
Mood: {metadata['mood']}
Instruments: {metadata['instruments']}
Description: {metadata['description']}
Tags: {', '.join(metadata['tags'])}
"""
# Generate embedding
embedding = model.encode(searchable_text)
# Result: 384-dimensional vector
# Store in SQLite
db.execute("INSERT INTO cues (path, metadata, embedding) VALUES (?, ?, ?)",
(file_path, json.dumps(metadata), embedding.tobytes()))
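The INSERT above assumes a cues table already exists. A minimal schema sketch for it might look like this (the column names and id field are my assumptions and may change once I actually build it):
import sqlite3

db = sqlite3.connect('cues.db')
db.execute("""
CREATE TABLE IF NOT EXISTS cues (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT UNIQUE NOT NULL,        -- absolute path to the audio file
    metadata TEXT NOT NULL,           -- JSON blob: title, genre, mood, tags, ...
    embedding BLOB NOT NULL           -- 384-dim float32 vector from sentence-transformers
)
""")
db.commit()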
4. Search: Cosine Similarity
When a user searches, convert their query to an embedding and find the most similar cues:
# User query
query = "uplifting orchestral music with strings for a dramatic scene"
# Generate query embedding
query_embedding = model.encode(query)
# Load all embeddings from database (set db.row_factory = sqlite3.Row for dict-style rows)
cues = db.execute("SELECT path, metadata, embedding FROM cues").fetchall()
# Calculate cosine similarity
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
scores = []
for cue in cues:
    # Embeddings were stored as float32 bytes, so read them back with the same dtype
    cue_embedding = np.frombuffer(cue['embedding'], dtype=np.float32)
    similarity = cosine_similarity([query_embedding], [cue_embedding])[0][0]
    scores.append((cue['path'], cue['metadata'], similarity))
# Sort by similarity (highest first)
results = sorted(scores, key=lambda x: x[2], reverse=True)[:10]
# Return top 10 results
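With only a few hundred cues, all the embeddings fit comfortably in memory, so the per-row loop could also be collapsed into one vectorized computation. A rough sketch, assuming the same float32 storage and variables as above:
import numpy as np

# Stack all stored embeddings into one (n_cues, 384) matrix
matrix = np.vstack([np.frombuffer(cue['embedding'], dtype=np.float32) for cue in cues])

# Cosine similarity = dot product of L2-normalized vectors
matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
q = query_embedding / np.linalg.norm(query_embedding)
similarities = matrix @ q  # one score per cue

# Indices of the top 10 matches, best first
top = np.argsort(similarities)[::-1][:10]
results = [(cues[i]['path'], cues[i]['metadata'], float(similarities[i])) for i in top]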
The Tech Stack
| Component | Technology | Why |
|---|---|---|
| Backend | Python + FastAPI | Great ecosystem for ML/audio, FastAPI is simple and fast |
| Audio Analysis | librosa | Industry standard for music feature extraction |
| LLM | Ollama (Llama 3.2 or Mistral) | Easy local inference, good quality, free |
| Embeddings | sentence-transformers | Local, fast, no API costs |
| Database | SQLite | Zero config, perfect for local apps |
| Frontend | HTML/CSS/JS (vanilla) | Simple web UI with search box and audio player |
Dependencies
# requirements.txt
fastapi==0.104.1
uvicorn==0.24.0
librosa==0.10.1
ollama==0.1.6
sentence-transformers==2.2.2
scikit-learn==1.3.2
mutagen==1.47.0 # For metadata extraction
Total disk space: ~5GB (mostly for sentence-transformers model cache and Ollama models)
The Implementation Plan (KISS/YAGNI)
I'm planning to build this in phases, keeping it simple and only adding complexity when needed.
Phase 1: Audit & Extract (Est. 2 hours)
Goal: Understand what I have.
- Recursively scan directories for audio files
- Extract existing metadata using mutagen
- Use ffprobe for format info (duration, sample rate, bitrate)
- Generate CSV report: path, title, artist, genre, duration, quality_score
- Identify which files need AI analysis (poor metadata)
Quality scoring (a rough version is sketched right after this list):
- Good (3 points): Has title, artist, genre, and description
- Fair (2 points): Has some metadata but missing key fields
- Poor (1 point): Filename only, no useful metadata
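Here's a minimal sketch of that scoring using mutagen's easy-tag interface. The field list and thresholds are assumptions I expect to tune during the audit:
import mutagen

def quality_score(path):
    # Read whatever tags the file has; easy=True normalizes common fields (title, artist, genre)
    audio = mutagen.File(path, easy=True)
    tags = dict(audio.tags or {}) if audio else {}
    fields = ['title', 'artist', 'genre', 'description']
    present = sum(1 for f in fields if tags.get(f))
    if present >= 3:
        return 3   # Good: enough metadata to skip AI analysis
    if present >= 1:
        return 2   # Fair: some metadata, missing key fields
    return 1       # Poor: filename only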
Phase 2: Audio Analysis (Est. 3 hours)
Goal: Generate rich descriptions for all files.
- For files with "good" metadata: Skip AI analysis, just use existing data
- For "fair" and "poor" files: Run full analysis pipeline
- Extract audio features with librosa (30-second samples to save time)
- Generate descriptions using Ollama
- Store results in SQLite
Smart optimization: Only analyze the intro/middle/outro (3x 10-second clips) instead of the full file. This reduces processing time by ~70% while still capturing the essence of the music.
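librosa can load just a slice of a file via its offset and duration arguments, so the intro/middle/outro sampling could look roughly like this (the load_clips helper is my own sketch):
import librosa
import numpy as np

def load_clips(path, clip_seconds=10):
    # Sample the intro, the middle, and the end instead of decoding the whole file
    total = librosa.get_duration(path=path)
    offsets = [0, max(0, total / 2 - clip_seconds / 2), max(0, total - clip_seconds)]
    clips = []
    for offset in offsets:
        y, sr = librosa.load(path, offset=offset, duration=clip_seconds)
        clips.append(y)
    # Analyze the concatenated ~30 seconds as if it were one short file
    return np.concatenate(clips), sr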
Phase 3: Indexing (Est. 1 hour)
Goal: Generate embeddings for semantic search.
- Combine all metadata into searchable text
- Generate embeddings using sentence-transformers
- Store in SQLite with proper indexes
- Test search with sample queries
Phase 4: Search API (Est. 2 hours)
Goal: Build a simple FastAPI backend.
# main.py
from fastapi import FastAPI
from fastapi.responses import FileResponse
from sentence_transformers import SentenceTransformer
app = FastAPI()
model = SentenceTransformer('all-MiniLM-L6-v2')
@app.post("/search")
async def search(query: str, limit: int = 10):
# Generate query embedding
query_embedding = model.encode(query)
# Search database for similar cues
results = search_similar(query_embedding, limit)
return {"results": results}
@app.get("/cue/{id}")
async def get_cue(id: int):
# Return full metadata for a specific cue
return get_cue_by_id(id)
@app.get("/audio/{id}")
async def stream_audio(id: int):
# Stream audio file for preview
return FileResponse(get_audio_path(id))
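The search_similar, get_cue_by_id, and get_audio_path helpers above are still just names. Here's a hedged sketch of what search_similar might look like, reusing the cosine-similarity logic from earlier and the assumed cues schema:
import json
import sqlite3
import numpy as np

def search_similar(query_embedding, limit=10, db_path='cues.db'):
    # Pull every stored cue and rank by cosine similarity to the query
    db = sqlite3.connect(db_path)
    db.row_factory = sqlite3.Row
    rows = db.execute("SELECT id, path, metadata, embedding FROM cues").fetchall()
    q = query_embedding / np.linalg.norm(query_embedding)
    scored = []
    for row in rows:
        vec = np.frombuffer(row['embedding'], dtype=np.float32)
        score = float(np.dot(q, vec / np.linalg.norm(vec)))
        scored.append({'id': row['id'], 'path': row['path'],
                       'metadata': json.loads(row['metadata']), 'score': score})
    scored.sort(key=lambda r: r['score'], reverse=True)
    return scored[:limit]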
Phase 5: Web UI (Est. 3 hours)
Goal: Simple, functional interface.
- Search box with example queries
- Real-time results (debounced, 300ms delay)
- Result cards showing:
- Title, genre, mood, instruments
- Description and tags
- Similarity score (0-100%)
- HTML5 audio player for preview
- "Copy path" button
- Filters: duration range, genre, mood
- Keyboard shortcuts (↑/↓ to navigate results, Enter to play/pause)
Trade-offs: What I'm Giving Up
Going local-first means accepting some compromises:
| Aspect | Cloud APIs | Local-First |
|---|---|---|
| Quality | Best-in-class (GPT-4, Claude) | Good enough (Llama 3.2, Mistral) |
| Speed | 5-10 sec per file | 30-60 sec per file |
| Setup | API keys only | Install Ollama, download models (~5GB) |
| Cost | $5-10 per index | $0 forever |
| Privacy | Audio sent to APIs | 100% local |
| Re-indexing | Costs every time | Free unlimited |
Is "good enough" quality acceptable?
For music cue search, yes. I'm not generating creative content or writing essays. I need:
- "This is orchestral music" → Local LLM can do this
- "Mood is uplifting and dramatic" → Features + LLM can infer this
- "Instruments include strings and piano" → Timbre analysis + LLM works
The descriptions don't need to be perfect. They need to be searchable. Big difference.
What I'd Do Differently
1. Start with a Smaller Test Set
Instead of indexing all 500 files at once, I should start with 50 representative samples. Test the pipeline, tune the prompts, validate search quality—then scale up.
2. Build the Search UI First
I'm planning to build the indexer first, then the search UI. But actually, I should build a mock search UI with fake data first. This helps me understand what metadata I actually need before spending hours extracting it.
3. Consider Hybrid Approach
Maybe use local LLMs for most files, but fall back to Claude API for files where local analysis struggles (e.g., very complex orchestral pieces). Best of both worlds: mostly free, occasionally high-quality.
4. Version the Embeddings
If I improve the prompts or switch models later, I'll need to re-generate embeddings. Should include a version field in the database so I can track which cues need re-indexing.
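A lightweight way to do that, assuming the SQLite setup sketched earlier: tag each row with the model/prompt version used to build its embedding, then query for rows that are stale.
EMBEDDING_VERSION = "all-MiniLM-L6-v2/prompt-v1"   # bump when the model or prompt changes

db.execute("ALTER TABLE cues ADD COLUMN embedding_version TEXT")  # one-time migration

# Find cues whose embeddings were built with an older model/prompt combination
stale = db.execute(
    "SELECT id, path FROM cues WHERE embedding_version IS NULL OR embedding_version != ?",
    (EMBEDDING_VERSION,)
).fetchall()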
Key Learnings from the Planning Process
1. KISS (Keep It Simple, Stupid) Is Hard
My first instinct was to over-engineer this:
- "Should I build a recommendation engine?"
- "What about playlist generation?"
- "Should I support collaborative filtering?"
No. I need to search music cues. That's it. Everything else is scope creep.
2. YAGNI (You Aren't Gonna Need It) Applies to Infrastructure Too
I almost convinced myself I needed:
- Kubernetes for deployment
- PostgreSQL instead of SQLite
- Redis for caching
- Docker for containerization
Why? This is a local tool for me. It doesn't need to scale to millions of users. SQLite + FastAPI running on localhost:8000 is plenty.
3. The Best Architecture Is One You'll Finish
Cloud APIs would give me slightly better quality and faster indexing. But the local-first approach is more exciting to me because:
- I get to learn new tools (Ollama, librosa, sentence-transformers)
- I can experiment freely without worrying about API costs
- The entire system runs on my laptop with no dependencies
If the architecture motivates me to finish the project, that's the right architecture.
4. Audio Feature Extraction Is Shockingly Powerful
I assumed I'd need to actually "listen" to the audio (like Whisper does for speech). But librosa can extract so much information just from the waveform:
- Tempo and rhythm
- Pitch and key
- Brightness and timbre
- Harmonic vs percussive content
Combined with a local LLM, these features are enough to generate rich, searchable descriptions.
What's Next
This post is the planning phase. I haven't written a single line of code yet.
Next steps:
- Install dependencies - Ollama, librosa, sentence-transformers
- Build the audit script - Scan directories, extract metadata, generate report
- Test audio analysis - Run librosa + Ollama on 10 sample files, validate quality
- Build the indexer - Full pipeline from audio → embeddings → SQLite
- Build the search API - FastAPI backend with cosine similarity search
- Build the web UI - Simple search interface with audio preview
- Write a follow-up post - Document what actually happened vs what I planned
I'm estimating 12-15 hours of work total. We'll see if that's optimistic or pessimistic.
Final Thoughts
This project is a great example of how AI coding tools change the calculus of what's worth building.
Before AI coding assistants:
- I'd need to learn librosa, Ollama, sentence-transformers, FastAPI, and SQLite
- Each library would take days or weeks to understand
- Total time to build: 2-3 months
- Likelihood of finishing: 20%
With AI coding assistants:
- I understand the concepts (embeddings, semantic search, audio features)
- Claude Code handles the implementation details
- Total time to build: 12-15 hours
- Likelihood of finishing: 90%
The difference isn't that the AI "does it for me." The difference is that I can focus on what to build instead of how to implement every detail.
I still need to:
- Understand the problem domain (music search, audio analysis)
- Choose the right architecture (local vs cloud, SQLite vs PostgreSQL)
- Make UX decisions (what metadata to show, how to display results)
- Validate quality (do the descriptions make sense? does search work?)
But I don't need to memorize the librosa API or debug sentence-transformers tensor dimensions. That's what Claude Code is for.
The best part? Even if this project doesn't work perfectly, I'll have learned a ton about audio analysis, semantic search, and local LLMs. That knowledge carries forward to the next project.
And working with my audio engineer friend means I'm building something that solves a real problem for a real professional. His domain expertise combined with AI coding tools is a powerful combination—he knows exactly what he needs, and I can help him build it.
That's the real value of building in public: the journey, the collaboration, and the learning—not just the destination.
Follow along as I build this with my friend. Next post will cover the actual implementation—what worked, what didn't, and what we learned along the way.