Building Merlin: A Word-by-Word Video Caption App with Zero Ongoing Costs



About Me: I'm a business and product executive with zero coding experience. I've spent my career building products by working with engineering teams at Amazon, Wondery, Fox, Rovi, and TV Guide, but never wrote production code myself. Until recently.

Frustrated with the pace of traditional development and inspired by the AI coding revolution, I decided to build my own projects using AI assistants (primarily Claude Code, Codex, and Cursor). This blog post is part of that journey—documenting what I've learned building real production systems as a complete beginner.


TL;DR

I built Merlin, a desktop app that adds word-by-word captions to videos using Whisper AI and FFmpeg—all running locally with zero ongoing costs. This post covers the technical decisions, challenges solved, and lessons learned building a distributable Electron app in one development sprint.

Key Learnings:

  1. User experience beats technical purity: bundle dependencies rather than asking users to install them.
  2. Ship with constraints: a handful of controls, one platform, and an estimated progress bar were enough for v0.1.
  3. Test with real-world data: filenames with spaces, special characters, and odd resolutions will break your assumptions.

The Problem: Expensive SaaS for Simple Video Captions

Video captions have become table stakes for content creators. Word-by-word captions (where each word appears precisely when spoken) are especially popular on social media, but the existing solutions have major drawbacks: most are subscription SaaS tools that push your footage to the cloud and keep charging month after month for what is fundamentally local processing.

I wanted something different: a tool that runs entirely on your machine, with zero ongoing costs, that anyone can download and use immediately.


The Solution: Local-First Desktop App

Merlin is an Electron app that combines three powerful open-source tools:

  1. Whisper (OpenAI) - AI transcription with word-level timestamps
  2. FFmpeg - Video processing and caption rendering
  3. Electron - Cross-platform desktop app framework

The entire workflow runs locally:

  1. Upload your video
  2. Whisper transcribes with word-level precision
  3. Style your captions in real-time preview
  4. FFmpeg burns captions into the video
  5. Download your captioned video

No cloud. No subscriptions. No ongoing costs.
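
Under the hood, that workflow boils down to two FFmpeg invocations wrapped around one Whisper run. Here is a rough sketch of the orchestration; transcribeWithWhisper() and buildAssFile() are placeholder names for pieces covered later in this post, and the real app resolves bundled binary paths and flags differently:

const { execFileSync } = require('child_process');

async function captionVideo(videoPath, outputPath, style) {
    // 1. Pull the audio track out for Whisper (stream copy keeps this fast,
    //    assuming the source audio fits an .m4a container)
    const audioPath = outputPath.replace(/\.[^.]+$/, '.audio.m4a');
    execFileSync('ffmpeg', ['-y', '-i', videoPath, '-vn', '-c:a', 'copy', audioPath]);

    // 2. Transcribe with word-level timestamps (placeholder helper)
    const words = await transcribeWithWhisper(audioPath);

    // 3. Build an ASS subtitle file from the word timings and chosen style (placeholder helper)
    const assPath = buildAssFile(words, style, videoPath);

    // 4. Burn the captions into the video, copying the audio untouched
    execFileSync('ffmpeg', ['-y', '-i', videoPath, '-vf', `subtitles='${assPath}'`, '-c:a', 'copy', outputPath]);
}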


Architecture Decisions

Decision 1: Electron vs. Web-Only

The Question: Should I build a web app or a desktop app?

Web App Approach: nothing to install and instant access, but FFmpeg and Whisper have to run somewhere. That means either paying for cloud processing (breaking the zero-cost goal) or asking users to install the tools themselves from the terminal.

Desktop App Approach: a bigger download, but everything ships in one bundle and all processing stays on the user's machine.

Verdict: Electron was the clear winner for user experience. Yes, the bundle is larger (~250MB), but the alternative requires 30+ minutes of terminal commands for non-technical users.

Trade-offs: a ~250MB download, platform-specific builds, and an unsigned app that triggers macOS Gatekeeper warnings, in exchange for a setup-free experience.


Decision 2: ASS Subtitles vs. FFmpeg Drawtext

The Question: How should I render word-by-word captions?

I tested two approaches:

Option A: FFmpeg drawtext filter. One drawtext filter per word, each gated with an enable='between(t,start,end)' expression; a five-minute video turns into hundreds of chained filters and an unreadable filtergraph.

Option B: ASS (Advanced SubStation Alpha) subtitles. A single subtitle file carries per-word karaoke timing tags and is rendered in one pass by FFmpeg's subtitles filter.

Example ASS subtitle for karaoke effect:

Dialogue: 0,0:00:01.20,0:00:01.45,Default,,0,0,0,,{\k25}Hello
Dialogue: 0,0:00:01.45,0:00:01.89,Default,,0,0,0,,{\k44}world

The {\k25} syntax creates the precise timing—Whisper provides start/end times, and ASS handles the rest.
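
Generating those lines from Whisper's output is a small mapping step. A minimal sketch, assuming the transcription has already been flattened into [{ word, start, end }] objects with times in seconds:

// Convert seconds to ASS timestamps (H:MM:SS.cc, centisecond precision)
function toAssTime(seconds) {
    const totalCs = Math.round(seconds * 100);
    const cs = totalCs % 100;
    const totalSec = Math.floor(totalCs / 100);
    const s = totalSec % 60;
    const m = Math.floor(totalSec / 60) % 60;
    const h = Math.floor(totalSec / 3600);
    const pad = (n) => String(n).padStart(2, '0');
    return `${h}:${pad(m)}:${pad(s)}.${pad(cs)}`;
}

// One Dialogue line per word; the \k duration is the word's length in centiseconds
function wordsToDialogueLines(words) {
    return words.map(({ word, start, end }) => {
        const k = Math.round((end - start) * 100);
        return `Dialogue: 0,${toAssTime(start)},${toAssTime(end)},Default,,0,0,0,,{\\k${k}}${word.trim()}`;
    }).join('\n');
}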

Verdict: ASS subtitles won by a landslide. Cleaner code, better debugging, industry-standard format.


Decision 3: Native Whisper vs. Whisper.cpp

The Question: Should I bundle Python Whisper or use whisper.cpp (C++ port)?

Python Whisper: the original OpenAI implementation. Mature, well documented, and word-level timestamps work out of the box, but it needs a Python environment and PyTorch on the user's machine.

whisper.cpp: a C++ port that runs fast on Apple Silicon with no Python dependency, but it means building and bundling native binaries for each platform.

Verdict: I chose Python Whisper for v0.1. It's battle-tested, and since we're already bundling Electron (~200MB), the Python dependency is acceptable. whisper.cpp would be a great optimization for v2.
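
For reference, calling the Whisper CLI from the Electron main process can look roughly like this (option values are illustrative, and the real app locates the whisper binary and chooses the model differently):

const { spawn } = require('child_process');
const fs = require('fs');
const path = require('path');

// Run the openai-whisper CLI and resolve with the parsed JSON transcript,
// which includes per-word timings when --word_timestamps is enabled
function transcribeWithWhisper(audioPath, outDir) {
    return new Promise((resolve, reject) => {
        const whisper = spawn('whisper', [
            audioPath,
            '--model', 'base',
            '--word_timestamps', 'True',
            '--output_format', 'json',
            '--output_dir', outDir
        ]);
        whisper.on('error', reject);
        whisper.on('close', (code) => {
            if (code !== 0) return reject(new Error(`whisper exited with code ${code}`));
            const jsonName = path.basename(audioPath).replace(/\.[^.]+$/, '.json');
            resolve(JSON.parse(fs.readFileSync(path.join(outDir, jsonName), 'utf8')));
        });
    });
}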


Technical Challenges

Challenge 1: Caption Size Mismatch Between Preview and Export

Problem: Captions looked perfect in the preview but appeared tiny in the exported video.

Root Cause: ASS subtitles have a reference resolution (PlayResX and PlayResY) that must match the video resolution. I had hardcoded these to 1920x1080, but the test video was 640x360.

Solution:

1. Use ffprobe to detect video resolution:

const { execSync } = require('child_process');

// ffprobePath points at the bundled ffprobe binary
const probeOutput = execSync(
    `${ffprobePath} -v error -select_streams v:0 -show_entries stream=width,height -of csv=p=0 "${videoPath}"`,
    { encoding: 'utf8' }
);
const [width, height] = probeOutput.trim().split(',').map(Number);

2. Set ASS resolution dynamically:

PlayResX: ${videoWidth}
PlayResY: ${videoHeight}
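
Those values end up in the generated [Script Info] block. A minimal sketch of that header, trimmed to the fields that matter here:

// Build the ASS header so subtitle coordinates match the actual video frame
function buildScriptInfo(videoWidth, videoHeight) {
    return [
        '[Script Info]',
        'ScriptType: v4.00+',
        `PlayResX: ${videoWidth}`,
        `PlayResY: ${videoHeight}`,
        ''
    ].join('\n');
}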

Time Spent: 2 hours of debugging, 10 minutes to fix once I understood the issue.

Learning: Always test with multiple video resolutions. What works on 1080p might break on 360p.


Challenge 2: FFmpeg Path Escaping for Special Characters

Problem: Export failed with error: No such filter: 'Pal - YMH Studios (360p'

The filename was: Can AI Satisfy My Girlfriend Not Today, Pal - YMH Studios (360p, h264).mp4

Root Cause: FFmpeg's subtitles filter requires special character escaping. Spaces, colons, and backslashes break the filter string.

Solution: Escape the ASS file path before passing to FFmpeg:

const { spawn } = require('child_process');

const escapedAssPath = assPath
    .replace(/\\/g, '\\\\')  // Escape backslashes
    .replace(/:/g, '\\:');   // Escape colons

const ffmpeg = spawn(ffmpegPath, [
    '-i', videoPath,
    '-vf', `subtitles='${escapedAssPath}'`,  // Wrap in single quotes
    '-c:a', 'copy',
    '-y',
    outputPath
]);

Time Spent: 1 hour of frustration, 5 minutes to fix.

Learning: Always test with filenames containing spaces, special characters, and Unicode. Users will upload videos named "My Trip to São Paulo! (2025).mp4" and expect it to work.


Challenge 3: Progress Tracking Without Whisper API

Problem: Whisper doesn't provide progress callbacks—it just runs and outputs the result.

Solution: I estimated progress based on audio duration:

// Estimate Whisper processing time (base model ~0.5x realtime)
const estimatedTime = audioDuration * 2;  // 2 seconds per 1 second of audio
const progressInterval = 1000;            // ms between UI updates (illustrative value)
const progressPerUpdate = 65 / (estimatedTime / (progressInterval / 1000));  // transcription gets ~65% of the bar

let lastProgress = 0;
progressTimer = setInterval(() => {
    if (lastProgress < 95) {
        lastProgress = Math.min(95, lastProgress + progressPerUpdate);
        event.sender.send('transcription-progress', {
            progress: Math.round(lastProgress),
            message: 'Transcribing with Whisper...'
        });
    }
}, progressInterval);

Verdict: Not perfect, but good enough for v0.1. Users get visual feedback instead of a frozen UI.

Learning: When you can't get real progress, estimate it. A slightly inaccurate progress bar is better than no progress bar.


What Worked Well

1. Test-Driven Development with Puppeteer

I wrote 10 Puppeteer tests before building the UI, then implemented the UI to pass the tests. Writing the tests first forced me to pin down element IDs and sensible defaults before touching any markup, and the suite doubled as a regression check while I iterated.

Example test:

test('Caption styling controls exist and have correct defaults', async () => {
    const fontSize = await page.$eval('#fontSize', el => el.value);
    const fontColor = await page.$eval('#fontColor', el => el.value);
    expect(fontSize).toBe('32');
    expect(fontColor).toBe('#ffffff');
});

Verdict: TDD worked brilliantly for UI development. 100% of tests passed on first try after fixing the expect helper.


2. KISS Principle: Only 5 Styling Controls

I resisted the temptation to add 15+ controls (outline, shadow, rotation, animation, etc.) and stuck with exactly 5:

  1. Font size
  2. Font color
  3. Background color
  4. Background opacity
  5. Position

Result: Clean, uncluttered UI. Advanced users can export the ASS file and edit it manually if needed.
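
For the curious, those five controls collapse into a single ASS Style line. A rough sketch of the mapping (ASS colour fields are &HAABBGGRR, alignment uses the numpad convention, and the exact field values Merlin writes may differ):

// Convert a #rrggbb hex colour plus opacity % into ASS &HAABBGGRR form;
// in ASS, alpha 00 is opaque and FF is fully transparent
function hexToAss(hex, opacityPercent = 100) {
    const r = hex.slice(1, 3), g = hex.slice(3, 5), b = hex.slice(5, 7);
    const alpha = Math.round((1 - opacityPercent / 100) * 255).toString(16).padStart(2, '0');
    return `&H${alpha}${b}${g}${r}`.toUpperCase();
}

function buildStyleLine({ fontSize, fontColor, bgColor, bgOpacity, position }) {
    const alignment = { bottom: 2, middle: 5, top: 8 }[position] || 2;  // numpad: 2 = bottom centre
    const primary = hexToAss(fontColor);
    const box = hexToAss(bgColor, bgOpacity);
    // BorderStyle 3 draws an opaque box behind the text; renderers differ on which
    // colour field they use for the box, so set both OutlineColour and BackColour
    return `Style: Default,Arial,${fontSize},${primary},${primary},${box},${box},` +
           `0,0,0,0,100,100,0,0,3,2,0,${alignment},20,20,20,1`;
}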

Learning: Constraints breed creativity. Limiting controls forced me to choose sensible defaults.


3. Quick Style Templates

Instead of making users fiddle with controls, I added 6 one-click templates tuned for common looks (bold social captions, TikTok-style text, and so on).

Each template is just a preset object:

const templates = {
    'social-bold': {
        fontSize: 48,
        fontFamily: 'Impact',
        fontColor: '#ffffff',
        bgColor: '#000000',
        bgOpacity: 90,
        position: 'bottom'
    },
    // ...
};
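
Applying a template is just merging the preset into the current style state and refreshing the preview (function and variable names here are illustrative):

// Hypothetical click handler for a template button
function applyTemplate(name) {
    Object.assign(currentStyle, templates[name]);  // overwrite the active settings
    renderPreview(currentStyle);                   // placeholder for the preview refresh
}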

User feedback: "I just clicked TikTok and it looked perfect immediately."


Performance Results

Tested on MacBook Pro M1 with 5-minute video (640x360):

Operation               Time            Notes
Audio extraction        ~2 seconds      FFmpeg -c copy for speed
Whisper transcription   ~2.5 minutes    Base model, ~0.5x realtime
Caption rendering       ~5 minutes      FFmpeg burns captions into the video
Total                   ~7.5 minutes    For the 5-minute test video

Accuracy: ~95% word-level timestamp precision (Whisper base model)

Scaling: Roughly linear. Transcription takes about half the video's duration and rendering about the full duration, so total processing lands around 1.5x the video length; a 10-minute video takes ~15 minutes.


Lessons Learned

For Technical Founders

  1. User experience beats technical purity
    Bundling dependencies is "wasteful" but eliminates 30 minutes of setup frustration.
  2. Ship with constraints
    Only 5 controls, Apple Silicon only, no code signing—these constraints let me ship v0.1 in days instead of months.
  3. Estimate when you can't measure
    Whisper doesn't report progress, so I estimated it. Close enough is good enough for v0.1.
  4. Test with real-world data
    Filenames with spaces, emojis, special characters—users will break your assumptions.
  5. Optimize later, if ever
    7.5 minutes for a 5-minute video is acceptable. Don't optimize prematurely.

For AI-Assisted Development

I built Merlin with Claude Code as my pair programmer. What worked:

What didn't work:

Verdict: AI is a force multiplier, not a replacement. I still needed to understand the domain deeply.


What's Next (If I Build V2)

Features I'd Add

  1. Batch processing - Queue multiple videos
  2. Animation effects - Fade in/out, slide, bounce
  3. Multi-language support - Whisper supports 99 languages
  4. Cloud sync - Optional backup to S3/Dropbox
  5. Intel Mac builds - Currently Apple Silicon only

Optimizations

  1. whisper.cpp integration - Faster transcription without Python
  2. GPU acceleration - Use CUDA/Metal for rendering
  3. Streaming preview - Real-time caption preview without full render
  4. Code signing - Remove macOS Gatekeeper warning

Try Merlin

Download: github.com/sparrowfm/merlin/releases
Source Code: github.com/sparrowfm/merlin
License: MIT (free forever)

Prerequisites:

# macOS
brew install ffmpeg
pip install openai-whisper

Then download Merlin-0.1.0-arm64.dmg, drag to Applications, and start captioning.


Closing Thoughts

Building Merlin taught me that "zero ongoing costs" is a viable business model for desktop apps. Not everything needs a subscription. Sometimes, a one-time download solves the problem better than a cloud service.

The tech stack (Whisper + FFmpeg + Electron) is accessible to any developer who can write JavaScript. The hard parts aren't the code—they're understanding user needs, making smart trade-offs, and shipping something imperfect but useful.

If you're building a similar tool, ship v0.1 quickly. Get it in users' hands. Learn what actually matters. Optimize later, if ever.


Questions? Feedback? Open an issue on GitHub or reach out at sparrowfm.github.io/sparrow.

Precision captioning, frame by frame.