Stripping YouTube down to text: a tour of Transcriptor

A small local-first pipeline that takes a YouTube URL and gives back a transcript file. Why I built it, the Windows-specific landmines I tap-danced across to get it working, and the hacks that exist because Windows is shit.

Ollie · 12 May 2026 · 11 min read

Star Citizen's content scene is a lot of one-hour streams, ninety-minute patch breakdowns, and the occasional two-hour lore essay. I'm not opposed to any of that. What I am opposed to is sitting still for ninety minutes when I could be doing literally... anything else. Driving to work (wherein I may listen to those update videos as a podcast). Washing dishes. Watching paint cure on a freshly-glued shelf that I built without following the instructions, and which is consequently wonky, shit, and generally unsuitable for the task.

Whatever.

Most of the time what I actually want from a video is the bit at minute thirty-seven where the host says the thing about the new flight model. Or the bit at minute fifty-two where they spoil what's actually coming in 4.9. The video is fine. The video is just inconvenient fucking packaging for the information inside it.

So I built Transcriptor.

(I'm aware "Transcriptor" sounds like a knock-off Decepticon. I was tired. Naming things is hard. Moving on.)

What the thing is

A small Python pipeline that takes a YouTube URL and gives back a transcript file. Audio extraction via yt-dlp, transcription via faster-whisper, GPU-accelerated when there's a GPU to accelerate on. Five output formats (txt, srt, vtt, json, tsv), large-v3 Whisper model by default, language auto-detected, voice-activity-detection on by default so the model doesn't have to spend cycles transcribing dead air.

The whole pipeline is one Python file (transcribe.py, ~250 lines). The daily use looks like this:

Invoke-YouTubeTranscribe "https://www.youtube.com/watch?v=..."

Invoke-YouTubeTranscribe is a PowerShell function in my profile that wraps the Python entry point with sensible defaults, an output directory, and (eventually, when I get round to it) a Claude Code summarisation switch I'll come back to. But the wrapper is boring. The wrapper is fifteen lines of param() and a call. The interesting stuff lives in transcribe.py, and a terrifying amount of that interesting stuff exists because Windows is, well, Windows.
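For a flavour of what the srt and vtt writers boil down to, here's a minimal sketch. It is not the code from transcribe.py: the Segment tuple and the function names are made up for illustration, and it assumes each segment carries a start and an end in seconds plus some text.

```python
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    """Hypothetical stand-in for one transcribed segment (times in seconds)."""
    start: float
    end: float
    text: str


def format_timestamp(seconds: float, decimal_marker: str) -> str:
    """Render seconds as HH:MM:SS<marker>mmm (SRT uses ',', VTT uses '.')."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{millis:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Numbered cues, comma before the milliseconds, blank line between cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        span = f"{format_timestamp(seg.start, ',')} --> {format_timestamp(seg.end, ',')}"
        cues.append(f"{index}\n{span}\n{seg.text.strip()}\n")
    return "\n".join(cues)


def to_vtt(segments: Iterable[Segment]) -> str:
    """WEBVTT header, then unnumbered cues with a dot before the milliseconds."""
    cues = ["WEBVTT\n"]
    for seg in segments:
        span = f"{format_timestamp(seg.start, '.')} --> {format_timestamp(seg.end, '.')}"
        cues.append(f"{span}\n{seg.text.strip()}\n")
    return "\n".join(cues)
```

The only real difference between the two subtitle formats at this level is the header, the cue numbering, and whether the milliseconds follow a comma or a dot, which is why one timestamp helper can serve both.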

Why local instead of just paying somebody

The path of least resistance for "I want a YouTube video as text" is to throw the URL into an Otter.ai or a Riverside or one of the dozen SaaS transcription services that will do it in their cloud and email you the result. That's the thing a sane person would do.

I am, frequently, not a sane person. Three reasons:

Privacy. Random YouTube audio is a weird thing to ship into somebody else's pipeline. It's content somebody else made; I have no right to feed it through a third party's ML stack and approximately fuck-all visibility into what they do with it on the way out.
Composability. The whole point of getting a transcript is to do something with the transcript. In my case, that's "pipe it into claude -p with an SC-specific summarisation prompt and drop the output back to the terminal." That flow does not exist on Otter. It does exist on a transcript file that lands in my working directory and emits exactly one line of stdout for the shell wrapper to grep.
Cost. Once the model's downloaded, the marginal cost of a transcript is the electricity. On a 4080 Super running large-v3 in float16, an hour of audio comes back in a handful of minutes. Not a number worth metering.

There's a personal axiom hiding in there. Anything I'd plausibly run more than once a week is worth owning the pipeline for. Otter would have cost nothing to start; it would have started costing money the moment I hit the free-tier limit, and would have kept costing forever. One weekend of Python is a much better trade for somebody who, like me, has the patience of a thousand monks and the financial discipline of an absolute toddler.

(I have an espresso machine. Two grinders. I've spent an inordinate amount of money on Star Citizen. I'm not allowed to make the financial argument with a straight face.)

faster-whisper, not openai-whisper

The obvious starting point for any of this is OpenAI's reference whisper package. It's the one everybody uses, the one every tutorial points at, the one I started with, and the one I rapidly stopped using.

It's a fucking pain. It drags torch along for the ride (about 2 GB on disk), which is fine if you've already got torch, deeply tedious if you haven't. It's also slower than it needs to be: the inference path goes through PyTorch's Python bindings with all the per-step overhead that implies.

faster-whisper replaces that path with the CTranslate2 inference engine. Same model weights, same accuracy, but a C++ runtime that handles batching and kernel dispatch directly. The number it advertises is up to 4 times faster than the reference implementation at the same accuracy. On my box it comes out nearer 5x with large-v3 in float16. And I don't have to keep torch on disk. Two wins.

There's a secondary win that doesn't get talked about enough. faster-whisper decides whether to use the GPU by calling ctranslate2.get_cuda_device_count(). That call doesn't need torch. The reference whisper package uses torch.cuda.is_available(), which obviously does need torch, and which has a delightful failure mode where you have a perfectly working GPU and a perfectly installed torch and CUDA inference is still off because at some point pip silently grabbed the CPU-only torch wheel and now you're transcribing on your sad little CPU and you can't work out why everything's taking ten times longer than it should.

Yes, I'm describing something I did. Took me about two hours the first time. Two hours I will never get back. With torch out of the picture entirely, that whole class of "is CUDA actually working today" question goes away. The library either sees the GPU or it doesn't. There's no third "yes it sees it but secretly it's not using it" state. Beautiful.
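A minimal sketch of that torch-free check, under a few assumptions: pick_device and the injectable counter are names invented here, the int8 CPU fallback is my guess at a sensible default rather than anything transcribe.py is confirmed to do, and ctranslate2 is imported lazily so the snippet still runs on a machine that doesn't have it installed.

```python
from typing import Callable, Optional, Tuple


def _cuda_device_count() -> int:
    """Ask CTranslate2 how many CUDA devices it sees; 0 if it isn't installed."""
    try:
        import ctranslate2  # imported lazily so the sketch runs without it
    except ImportError:
        return 0
    return ctranslate2.get_cuda_device_count()


def pick_device(count_devices: Optional[Callable[[], int]] = None) -> Tuple[str, str]:
    """Return a (device, compute_type) pair to hand to faster-whisper."""
    counter = count_devices or _cuda_device_count
    if counter() > 0:
        return "cuda", "float16"
    return "cpu", "int8"  # assumption: int8 as the CPU fallback


# Intended use (not executed here, needs faster-whisper installed):
#   from faster_whisper import WhisperModel
#   device, compute_type = pick_device()
#   model = WhisperModel("large-v3", device=device, compute_type=compute_type)
```

Passing the counter in as a parameter is only there so the decision can be exercised without a GPU; in real use you'd call pick_device() with no arguments.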

No system CUDA Toolkit. This is the bit I'm proudest of

If you've ever installed CUDA on a Windows machine for ML work, you know what's coming. The system CUDA Toolkit is a 3 GB download, an installer that wants admin rights, a PATH manipulation that fights with whatever else you've got installed, a versioning matrix you have to keep in your head, and the constant low-grade dread that a driver update or a Visual Studio update will quietly turn it into a brick.

You don't actually need any of that for faster-whisper. The CUDA runtime libraries it cares about (cuBLAS for the matmuls, cuDNN for the convolutions) all ship as pip wheels:

# pyproject.toml
dependencies = [
    "yt-dlp>=2026.1.1",
    "faster-whisper>=1.1.0",
    "nvidia-cublas-cu12",
    "nvidia-cudnn-cu12>=9.0",
]

uv sync and you're done. No system Toolkit. No PATH hacks. No version juggling. The wheels are huge (a couple of gigs between them) but they're scoped to the venv, deterministic, and disappear cleanly when you nuke .venv/. The only thing you genuinely still need is an Nvidia driver new enough to expose CUDA 12.x, which is true of every driver shipped in the last two years.

This is the kind of quality-of-life improvement that should be the default on every ML stack, and it is, somehow, not. I cannot tell you how many hours of my life I've spent fighting system CUDA installs that pip wheels would have made a non-issue.

(I can tell you. It's a lot. Don't make me count.)

The Windows DLL search path landmine

This is the bit where Windows really shows its pimpled, disgusting, gross rear end.

CTranslate2 loads cuBLAS and cuDNN at runtime via LoadLibrary. That's the plain old Win32 dynamic-link load call from approximately the dawn of computing. It searches a fixed set of locations: the executable's directory, the system directories, the current directory, and anything on PATH. That's it. That's the list.

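What LoadLibrary does not search is the modern "secure DLL search list" you get when you call AddDllDirectory. That's a separate API, introduced with Windows 8 and backported to Windows 7 and Vista via an update (KB2533623), specifically because the old PATH-based search was a security disaster (DLL planting was the world's biggest hobby in 2008). Directories registered that way are only consulted by loads that opt in, via LoadLibraryEx with the LOAD_LIBRARY_SEARCH_* flags or a process-wide SetDefaultDllDirectories call. Modern Python's os.add_dll_directory calls into that newer API.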

So. You've pip-installed nvidia-cublas-cu12. The DLLs are sitting in .venv\Lib\site-packages\nvidia\cublas\bin\. Python's os.add_dll_directory knows they're there. CTranslate2's LoadLibrary call does not, because nobody told PATH.

You import faster_whisper, and you get a Library not found error referencing a DLL that is unambiguously on disk a few directories below the script that's looking for it. Cool. Great. Love it.

The fix sits at the top of transcribe.py and runs before any CUDA-touching import:

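import os
import sys
from pathlib import Path


def _register_nvidia_dlls() -> None:
    """Make the pip-installed NVIDIA DLLs findable on Windows."""
    if sys.platform != "win32":
        return
    # The nvidia-* wheels unpack to <venv>\Lib\site-packages\nvidia\<lib>\bin.
    site_pkgs = Path(sys.prefix) / "Lib" / "site-packages"
    nvidia_root = site_pkgs / "nvidia"
    if not nvidia_root.exists():
        return
    bin_dirs: list[str] = []
    for bin_dir in nvidia_root.glob("*/bin"):
        if not bin_dir.is_dir():
            continue
        bin_dirs.append(str(bin_dir))
        try:
            # Secure search list, for loaders that honour AddDllDirectory.
            os.add_dll_directory(str(bin_dir))
        except OSError:  # FileNotFoundError is already a subclass of OSError
            pass
    if bin_dirs:
        # Legacy search path, for plain LoadLibrary callers such as CTranslate2.
        os.environ["PATH"] = (
            os.pathsep.join(bin_dirs) + os.pathsep + os.environ.get("PATH", "")
        )


# Must run before any import that touches CUDA.
_register_nvidia_dlls()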

Belt and braces. Both. Set the secure list with os.add_dll_directory for any modern library that respects it. Prepend PATH for the legacy LoadLibrary path that CTranslate2 actually uses. Without both, the script breaks in a different and confusing way every time one of the underlying libraries decides to update its own load semantics.
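If you want to check whether the PATH half of the fix actually took, a small diagnostic along these lines can help. It's a sketch rather than part of transcribe.py: find_on_path is a name invented here, and it only walks PATH, so it says nothing about the other places the Windows loader looks (the executable's directory, the system directories and so on).

```python
import os
from pathlib import Path
from typing import List, Optional


def find_on_path(dll_name: str, path_value: Optional[str] = None) -> List[Path]:
    """Return every PATH directory that contains dll_name, in search order."""
    raw = os.environ.get("PATH", "") if path_value is None else path_value
    hits: List[Path] = []
    for entry in raw.split(os.pathsep):
        if not entry:
            continue  # skip empty entries left by stray separators
        if (Path(entry) / dll_name).is_file():
            hits.append(Path(entry))
    return hits


# Intended use, after _register_nvidia_dlls() has run:
#   find_on_path("cublas64_12.dll")
# An empty list means a PATH-based load of that DLL has nowhere to find it.
```

More than one hit is worth knowing about too: the first directory in the list is the one a PATH-based lookup would reach first, so a stale system CUDA install sitting earlier on PATH would show up here.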

Five minutes of work the first time you hit it. About six hours of work to figure out that's the actual problem, because the error messages along the way are uniformly terrible and Google's first ten results all confidently suggest that you reinstall CUDA, which is wrong, and which will not help you, and which is exactly the kind of advice that makes me want to bin Windows and go live in a yurt.

But fine. It works now. Moving on.

The os._exit hack, which I'm not proud of

This one's grosser. Here's the literal bottom of main():

sys.stdout.flush()
sys.stderr.flush()
os._exit(code)

os._exit is the nuclear option. It bypasses every cleanup hook Python has. No atexit callbacks. No object destructors. No flushing of any IO you hadn't already flushed yourself. The OS reclaims the process and that's it. Goodnight.

You do not write os._exit casually. You write it because the normal sys.exit path has been observed, repeatedly, to make your day worse.

What happens here is that CTranslate2's CUDA context and cuDNN 9's internal allocator both have destructors that run during the Python interpreter's shutdown sequence. On Windows specifically, with the combination of versions I'm pinned to, those destructors race each other into a state where one of them tries to free memory the other has already freed. Windows catches it as a stack buffer overrun (STATUS_STACK_BUFFER_OVERRUN, exit code 0xC0000409) and fast-fails the process with the kind of error code that makes you assume you've done something deeply wrong.

The transcript file is, at that point, already on disk. Flushed. Fine. Functionally nothing is wrong. But the wrapper sees a non-zero exit code and treats the whole run as a failure, which means it never goes looking for the TRANSCRIPT: line on stdout, which means the (future, theoretical, vapourware) Claude Code summarisation step never runs, which means I sit there blinking at a terminal wondering why my perfectly good transcript hasn't been summarised.

os._exit skips the destructors entirely. The Python interpreter exits before either destructor gets a chance to misbehave. The OS reclaims the CUDA context the same way it would reclaim it if I'd hard-killed the process from Task Manager. Everything that needed to land on disk has already landed on disk by then.
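To see the skip-everything behaviour in isolation, here's a small self-contained demonstration. It has nothing to do with CUDA: it spawns a child interpreter that registers an atexit hook, flushes by hand, and then leaves via os._exit. The child script and the demo.txt path are invented for the example.

```python
import subprocess
import sys
import textwrap

# Child script: the atexit hook would run on a normal exit, but not on os._exit.
CHILD = textwrap.dedent(
    """
    import atexit, os, sys
    atexit.register(lambda: print("atexit ran"))
    print("TRANSCRIPT: demo.txt")
    sys.stdout.flush()   # nothing flushes for us once os._exit is called
    sys.stderr.flush()
    os._exit(0)
    """
)


def run_child() -> subprocess.CompletedProcess:
    """Run the child in a fresh interpreter and capture what it printed."""
    return subprocess.run(
        [sys.executable, "-c", CHILD], capture_output=True, text=True
    )


if __name__ == "__main__":
    result = run_child()
    print(repr(result.stdout), result.returncode)
```

The explicit flush is the part that matters: with stdout going to a pipe, Python buffers the line, and os._exit would throw the buffer away along with everything else.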

Is this elegant? No. Is it the kind of thing I'd put my name to in a code review at work? Absolutely not. Is it staying in until somebody upstream fixes the destructor race in either CTranslate2 or cuDNN 9 on Windows? You bet your sweet ass it is.

The stdout/stderr discipline

The pipeline is meant to be called by a shell wrapper. The wrapper needs two things: somewhere to surface progress to the user (so they know the transcription hasn't hung, which it can absolutely look like it has, because large-v3 takes a beat to load) and a machine-readable handoff to find the output file (so the wrapper can pipe it to the next step).

The contract is one line. All progress goes to stderr ("Downloading audio", "Loading model", "Transcribing", that sort of thing). Exactly one line goes to stdout, in this format:

TRANSCRIPT: C:\Users\[USERNAME]\AppData\Local\Temp\yt_whisper_a1b2c3d4e5f6.txt

The PowerShell wrapper does this:

$transcriptLine = $output | Where-Object { $_ -like "TRANSCRIPT: *" }
$transcriptPath = $transcriptLine -replace '^TRANSCRIPT: ', ''

Two lines. No regex sophistication. No JSON parsing. No mode flags. The Python side and the shell side agree on a single prefix and a single absolute path, and that's the whole interop contract. If you want to pipe transcribe.py into a tool that doesn't care about progress, you redirect stderr to $null and you've got a clean machine-readable stream. If you want to watch it work, you don't.
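The Python half of that contract can be sketched in a few lines. The function names here are illustrative rather than the ones in transcribe.py, and the parser is just the PowerShell filter above restated in Python for anyone calling the script from something other than a shell.

```python
import sys
from pathlib import Path
from typing import Iterable, Optional

PREFIX = "TRANSCRIPT: "


def progress(message: str) -> None:
    """Human-facing status goes to stderr so stdout stays machine-readable."""
    print(message, file=sys.stderr, flush=True)


def announce(transcript_path: Path) -> None:
    """The single stdout line: fixed prefix plus the path to the transcript."""
    print(f"{PREFIX}{transcript_path}", flush=True)


def parse_transcript_path(stdout_lines: Iterable[str]) -> Optional[str]:
    """What a caller does with captured stdout: find the prefix, strip it."""
    for line in stdout_lines:
        if line.startswith(PREFIX):
            return line[len(PREFIX):].strip()
    return None
```

Keeping the prefix in one constant means the producer and any Python consumer can't drift apart; the caller is expected to pass announce() an absolute path, as the contract requires.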

This isn't a thing you have to do. It's just a thing that, once you've done it twice, you do for every CLI tool you write that's meant to be called by something else, because the alternative is parsing log output with regex and hating yourself.

What about security

The threat model for a tool that runs on my laptop, transcribes YouTube audio, and only ever touches the network to download things is, charitably, pretty fucking thin. But for the record:

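No secrets, no API keys, no credentials stashed in environment variables. Transcription itself is fully offline once the model's downloaded; the only network traffic on a normal run is yt-dlp fetching the audio.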
No third-party data flow. The audio doesn't leave the machine. Compare and contrast with literally every cloud transcription service.
yt-dlp, ffmpeg, and deno are system tools, not bundled into the venv. They're updated via winget on a normal package-manager cadence. That's a real supply chain surface but it's a separate one from the project, and it's the same surface every other tool on the machine is using.
The Whisper model weights come from Hugging Face's CDN on first run and are cached locally. That's the one trust decision the tool makes on your behalf; subsequent runs load the model straight from that cache.

The real risk surface for this kind of tool is the model. A maliciously-tuned ASR model could in principle smuggle text into transcripts it had no business producing. I'm running the CTranslate2 conversion of OpenAI's large-v3 weights from SYSTRAN's distribution, which has been around long enough and used widely enough that the trust is approximately the same as the trust I'm placing in any other ML model I'm running locally. Which is: moderate, eyes open, fingers crossed.

I'll write up an article on ASR models at some point, I'm sure.

Maybe.

What's still missing

A few things, in roughly the order I'd add them if I ever get round to it:

The Claude Code summarisation switch. I had this in the previous version of the script and lost it in the refactor, because I am fundamentally the world's worst project manager of my own free time. The flow is: transcript file on disk, pipe its contents to claude -p with a system prompt that asks for a Discord-shaped summary, drop the result back to the terminal (and optionally to a .summary.md next to the transcript). Two evenings of work. Possibly less. I've been saying "two evenings" for about three months.
A --keep-audio flag. Right now the temp audio file is always deleted in the finally block. Useful default, occasionally inconvenient if you want to re-run with different model settings without re-downloading 14 MiB of opus.
Playlist mode. yt-dlp --no-playlist is the current behaviour. A --playlist flag that iterates and writes one transcript per video (with a sensible naming scheme rather than yt_whisper_<random-hex>) is the obvious next step.
Chapter detection from the VAD output. Silero VAD already chunks the audio into voice/non-voice regions; the longer silences correlate weakly with topic shifts. I'd want to verify that claim with real data before committing to it, but if it holds, the transcript could be auto-segmented into chapters that map to the speaker's pacing.
A real macOS port. The DLL-resolution code is a no-op on non-Windows so the script should work on macOS unmodified. I haven't tested it. The nvidia-* wheels obviously don't apply, and as far as I know CTranslate2 has no Metal backend, so on a Mac it would be CPU-only unless I wire in a different inference path. One day. Probably not soon.
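For the chapter-detection idea in the list above, the core of it would be something like the following. This is a sketch of the untested hypothesis, not working chapter detection: split_on_silence is a made-up name, the input is assumed to be a list of (start, end) voiced regions in seconds, and the 4-second threshold is an arbitrary placeholder that would need tuning against real videos.

```python
from typing import List, Sequence, Tuple

Span = Tuple[float, float]  # (start, end) of one voiced region, in seconds


def split_on_silence(spans: Sequence[Span], min_gap: float = 4.0) -> List[List[Span]]:
    """Group consecutive voiced spans, starting a new group after a long gap."""
    chapters: List[List[Span]] = []
    for span in spans:
        # Gap = silence between the previous span's end and this span's start.
        if chapters and span[0] - chapters[-1][-1][1] < min_gap:
            chapters[-1].append(span)
        else:
            chapters.append([span])
    return chapters


if __name__ == "__main__":
    voiced = [(0.0, 5.0), (5.5, 10.0), (20.0, 25.0), (26.0, 30.0)]
    for number, chapter in enumerate(split_on_silence(voiced), start=1):
        print(number, chapter[0][0], chapter[-1][1])
```

Whether long silences really line up with topic shifts is exactly the claim that still needs checking; this only shows that the mechanical part is cheap once the voiced regions are in hand.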

Nothing on this list is urgent. The thing transcribes videos. The next layer up is where the interesting work lives, and it's also where I've been studiously not doing the interesting work for months. So it goes.

Why I bother

As per the last article, I don't really know. Same answer as last time.

What I do know is what I want out of personal tooling. Comprehensibility, locality, the ability to fix it, and the freedom to stick a load-bearing os._exit at the bottom of main() if that's what the situation calls for.

If you want a closer look at the repo: it lives at a path on my laptop that is profoundly unhelpful to you. I'll get a public mirror up at some point, probably, in the same energy as my "two evenings of work" promises above. If you're trying to build your own and you got stuck on the Windows DLL bit specifically, the snippet earlier in this post is the bit you're looking for. Steal it. It's not load-bearing IP. It's just an annoying afternoon I don't want anybody else to have to repeat.

Next post will probably be about something less load-bearing. Still pending: the birdwatching one.