We Argued About Podcast Titles Every Week So I Trained a Model to Help

Brian Stever

2026 · Python, ranking models, LLMs, transcripts, LaTeX

Abstract. Every week at Snack Labs, we'd have the same conversation about episode titles for Sickboy. Someone would pitch something clever, someone else would pitch something searchable, and the final choice would come down to whoever argued longest. After years of that across 1,193 episodes, I decided to stop guessing. This project builds an AI-assisted ranking framework: LLMs generate title and description candidates from episode transcripts, and a show-specific model trained on historical performance data ranks them. Across 84 model variants and 869 transcripts, the best configuration identifies the top-performing episode 34% of the time and lands the winner in the top three 64% of the time. The funniest finding is also the most useful: LLMs are decent brainstorming partners and mediocre judges of their own work.

1. Problem Statement

Picking podcast titles never got less stressful, no matter how many times we did it. Everyone has an opinion, very few of those opinions are testable, and the argument that wins is often just the one delivered with the most confidence. After a while, that started to feel like a solvable problem rather than an inevitable one.

The central idea is straightforward: let LLMs do what they're good at (generating lots of plausible alternatives) and let a show-specific ranking model do what it's better at: learning from a real archive of past episode performance. Separate ideation from selection. The distinction sounds obvious once you say it out loud, which is usually a sign it was worth formalizing.

2. Dataset and Scope

The study uses 1,193 Sickboy episodes, dual-platform performance metrics, and 869 episode transcripts. There's a formal runbook, a claims-to-artifacts map, and frozen requirements, partly because I wanted the work to be reproducible, and partly because I'd been burned before by analysis I couldn't retrace six months later.

The project isn't just a paper companion. It's a full reproducibility package. Candidate generation, evaluation, ablations, transcript gates, visualization scripts, and production-pipeline runners all sit in the same place. It reads less like a one-off analysis and more like a small research operating system for the metadata problem I'd been complaining about for years.

Table 1. Study scale.

Component | Value
Episodes in archive | 1,193
Transcripts available | 869
Model variants evaluated | 84
Opening-alignment sample | 1,128 episodes

3. Ranking Pipeline

The pipeline has three stages. First, generate many candidates from transcript material. Then score those candidates using a learning-to-rank approach grounded in the show's historical performance. Finally, return a shortlist rather than pretending a single “best” title emerged from pure mathematical truth. That last step is important. Editorial judgment isn't replaced; it's constrained to a much better neighborhood.

Stage 1 (Generate). Use transcript-informed prompts to produce many title and description candidates.

Stage 2 (Rank). Score candidates with a show-specific model trained on historical performance data.

Stage 3 (Editorial Pick). Choose from the top cluster with human judgment instead of asking the model to cosplay taste.

Figure 1. The generate-rank-review loop. The main move is separating candidate generation from candidate selection.
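The generate-rank-review loop can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function names, the `Candidate` class, and the stand-in generator are all hypothetical, and `scorer` is a placeholder for whatever learning-to-rank model you've trained on the archive.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    title: str
    score: float = 0.0

def generate_candidates(transcript: str, n: int = 20) -> list[Candidate]:
    """Stage 1 stand-in: in the real pipeline an LLM drafts candidates
    from transcript-informed prompts. Here we just fabricate placeholders."""
    return [Candidate(title=f"Draft title {i}") for i in range(n)]

def rank_candidates(cands: list[Candidate], scorer) -> list[Candidate]:
    """Stage 2: score each candidate with the show-specific model
    and sort descending by predicted performance."""
    for c in cands:
        c.score = scorer(c.title)
    return sorted(cands, key=lambda c: c.score, reverse=True)

def shortlist(cands: list[Candidate], k: int = 5) -> list[Candidate]:
    """Stage 3 input: hand the top cluster to a human, not a single 'winner'."""
    return cands[:k]
```

The point of the structure is that the model's output is a neighborhood, not a verdict; the editorial pick happens on `shortlist(...)`, outside the code.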

I also wanted the project to survive contact with our actual workflow, not just produce a manuscript. So there's a practical app layer alongside the research scripts: a Gradio tool the team can actually use on publish day. That felt more honest than a project that lives only in a PDF.

4. Findings

The best configuration finds the highest-performing episode 34% of the time and places it in the top three 64% of the time, with NDCG@5 reaching 0.909. That's materially better than random and, more importantly, good enough to actually change how we pick titles. If your top five options are tightly clustered and historically informed, the final choice can safely go back to brand voice and human instinct.
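For readers unfamiliar with the metric, NDCG@5 is the discounted cumulative gain of the top five ranked items divided by the gain of the ideal ordering, so 1.0 means the ranking is perfect and 0.909 means the high performers sit very near the top. A standard implementation, independent of this project's code:

```python
import math

def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain: each item's relevance is discounted
    by log2 of its (1-indexed) rank + 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: list[float], k: int = 5) -> float:
    """Normalize DCG by the DCG of the ideal (descending) ordering."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0
```

Here `relevances` is the true performance of each candidate in the order the model ranked them; a perfectly sorted list yields exactly 1.0.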

The secondary finding is funnier: when you ask LLMs to rank their own title candidates by likely performance, they hover around 51% pairwise accuracy. That is approximately coin-flip territory with better prose. The system works best when the model writes and then politely steps aside.
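Pairwise accuracy here means: over all pairs of candidates, how often does the judge's ordering agree with actual performance? A coin flip scores 0.5, which is why 51% is damning. A small, generic sketch (the function and argument names are mine, not the project's):

```python
from itertools import combinations

def pairwise_accuracy(judge_scores: list[float], true_perf: list[float]) -> float:
    """Fraction of candidate pairs where the judge ranks the
    better-performing item higher. Ties in true performance are
    skipped because they carry no ordering signal."""
    correct = total = 0
    for i, j in combinations(range(len(true_perf)), 2):
        if true_perf[i] == true_perf[j]:
            continue
        total += 1
        # Signs agree when the judge and reality order the pair the same way.
        if (judge_scores[i] - judge_scores[j]) * (true_perf[i] - true_perf[j]) > 0:
            correct += 1
    return correct / total if total else 0.0
```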

The opening-alignment analysis was the part that surprised me most. Episodes whose title topic appears earlier in the audio tend to get lower average consumption. This feels slightly rude until you think about listener psychology for a few seconds. If you spend the title promising the answer, then immediately deliver the answer, you may have built an exit ramp instead of an entrance.

Table 2. Key findings.

Metric | Value
Top-1 hit rate | 34%
Top-3 hit rate | 64%
NDCG@5 | 0.909
LLM pairwise ranking accuracy | ~51%

5. Reflection

What I like about this project is that it refuses both easy extremes. It's not “AI will replace editors,” and it's not “AI is useless slop.” It's a more practical claim: AI is useful for generating options, historical data is useful for narrowing them, and humans are still needed for the final act of choosing something that sounds like the show. I've sat in those meetings. The model can't do what the team does. But the team makes better choices when the options are better.

Also, there's something satisfying about proving that the model which writes the title is not automatically the best judge of that title. That feels like a narrow metadata result and also a pretty decent life principle.