Building an AI-powered Podcast Intelligence Tool

João Fiadeiro · 2023-08-03

In this post I will take you through building an AI-powered tool, using Langchain and Whisper, that helps podcast lovers generate summaries and make sense of their favorite content.

My goal was to leverage some wonderful tools that make speech transcription (Whisper) and language understanding (Claude) so easy, and use them to build something useful. I will cover downloading all the episodes of your favorite podcast, transcribing them with high accuracy, and leveraging an LLM to produce detailed summaries.

I ran through the steps below for my favorite podcast, “The Delphi Podcast”, which features in-depth conversations with founders and researchers in the web3 space.

The Github repo is linked below. It’s fully open source, so I encourage you to check it out, contribute, or fork it.


Context

Podcasts are my favorite medium for good reason. They deliver a wealth of relevant information in an easily digestible format. I can consume passively, even offline. It’s no wonder 62% of Americans have listened to a podcast, with 42% tuning in monthly. But there's a major downside - it's incredibly difficult to rediscover forgotten content. I've lost count of the hours I've wasted desperately scanning episodes for a half-remembered quote, listening at 2x speed. More often than not, it's in a different episode altogether.

Even hardcore podcast fans get through maybe a third of the total runtime their shows publish, tops. Information retrieval should be far easier. Google's podcast transcripts enable search, but users can't download them. Podcast lovers need better tools to efficiently mine these audio goldmines.

This pain is especially acute in web3. Podcasts are popular for their decentralized, community-driven nature. They grant entrepreneurs a platform to share ideas and build brands as thought leaders. After Twitter, podcasts are the best source for VCs researching use cases, degens hunting alpha, and newbies looking to learn.

But many shows, like my favorite The Delphi Podcast, don't provide transcripts. This limits accessibility for millions of potential listeners. It also shortchanges people without hours to spare, since episodes routinely exceed an hour. There must be a better way, I figured.

So I built a tool to automatically transcribe and summarize episodes from my go-to podcasts. It taps natural language processing from Anthropic and OpenAI to extract key insights without having to listen to full episodes. Now I can quickly reference old quotes or get the main points from a recent episode.

I invite fellow podcast lovers to try my new tool and unlock podcasts' full potential. Together we can transform an untamed flood of audio into a searchable reservoir of knowledge. If you love podcasts like I do, I think you’ll enjoy this.

AI to the rescue

Recent advances in two key technologies provide the perfect solution to the podcast problems I outlined.

First, automatic speech recognition (ASR) is now effectively a solved problem for English thanks to systems like OpenAI's Whisper. Released in 2022, Whisper was trained on a massive 680,000 hours of diverse supervised data. It transcribes English audio with 96% accuracy at blazing speeds, ideal for long-form content like podcasts.

Second, large language models (LLMs) like Anthropic's Claude allow us to make sense of these massive transcripts. We can summarize episodes, retrieve relevant snippets using natural language, or even generate new related content.

Crucially, the tooling to leverage these technologies has also taken a huge leap. Streamlit enables developers to easily build UI-driven apps powered by machine learning. Supabase offers open source alternatives to services like Firebase for data storage/retrieval and authentication.

With these building blocks, we can achieve so much more. Let me walk you through how I built a pipeline that automatically downloads, transcribes, and summarizes podcast series on demand.

Part 1: Downloading audio

RSS feeds demonstrate why protocols trump platforms. An RSS feed is a simple XML file containing episode metadata - titles, descriptions, dates, and crucially audio URLs. Podcasters publish this file, which apps then subscribe to for automatic new episode delivery.

Without RSS enabling this seamless push model, listeners would have to manually check sites instead of getting automatic updates. RSS allows decentralized distribution, letting listeners subscribe directly without a centralized platform. For podcasters, it provides efficient sharing with audiences across the web. RSS has been core to podcasting from the start by powering automated episodic content delivery.

If we have a podcast's RSS feed URL, downloading the audio is trivial. The Python script get_podcasts.py does the following:

  1. Uses the feedparser library to parse the RSS feed and identify mp3 files.
  2. Downloads the mp3s in parallel to a local directory.

Usage is simple:

python get_podcasts.py \
--rss_feed 'https://anchor.fm/s/89f4aa68/podcast/rss' \
--podcast_title 'delphi_podcast'
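
Under the hood, the script boils down to something like the following sketch (simplified; the helper name, thread count, and file-naming logic here are my assumptions rather than the exact contents of get_podcasts.py):

# Minimal sketch of RSS parsing + parallel mp3 download (not the exact get_podcasts.py)
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import feedparser
import requests

def download_feed(rss_url: str, out_dir: str, max_workers: int = 8) -> None:
    feed = feedparser.parse(rss_url)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Each entry's enclosure link holds the audio URL
    mp3_urls = [
        link["href"]
        for entry in feed.entries
        for link in entry.get("links", [])
        if link.get("type") == "audio/mpeg"
    ]

    def fetch(url: str) -> None:
        filename = out / url.split("/")[-1].split("?")[0]
        resp = requests.get(url, timeout=60)
        resp.raise_for_status()
        filename.write_bytes(resp.content)

    # Download episodes in parallel
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(fetch, mp3_urls))

download_feed("https://anchor.fm/s/89f4aa68/podcast/rss", "podcasts/delphi_podcast")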

By leveraging the decentralized RSS protocol, we've enabled automated bulk audio download for any podcast. Next we'll transcribe the audio using Whisper.

Part 2: Transcribing the audio and processing

Now we get to the fun part. With the audio files downloaded, we pass them through Whisper to get our transcripts. The script transcribe.py does exactly this. Given a directory of audio files, we call model.transcribe with the “small” pre-trained model. I found that this model provided the best balance between speed (roughly 6x real time, meaning a 60-minute podcast is transcribed in ~10 minutes) and error rate. The model returns both the raw transcription as a single string (with punctuation) and a “segments” object containing the transcription broken down by segment (i.e. a continuous stretch of speech, typically separated by silences) along with timestamps per segment (useful for captions!).
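
For reference, the core of that step looks roughly like this (a minimal sketch with made-up paths, not the exact contents of transcribe.py):

# Minimal sketch of the Whisper transcription step (not the exact transcribe.py)
import json
from pathlib import Path

import whisper

model = whisper.load_model("small")  # best speed/accuracy trade-off I found

audio_dir = Path("podcasts/delphi_podcast")  # hypothetical local directory
for audio_file in sorted(audio_dir.glob("*.mp3")):
    result = model.transcribe(str(audio_file))
    output = {
        "text": result["text"],          # full transcript as one string
        "segments": result["segments"],  # per-segment text plus start/end timestamps
    }
    with open(audio_file.with_suffix(".json"), "w") as f:
        json.dump(output, f)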

Note that this step is much faster if you have a beefy GPU. I used Paperspace’s free A6000 for this step and observed a rate of 2-4 minutes per episode (depending on the length). I highly recommend Paperspace, but you can also use Google Colab.

Next comes the tricky bit - prepping transcripts for the LLM. LLMs struggle with massive chunks of text. Like humans, they perform best with concise, focused information. We can't just feed a 10,000 word transcript to Claude or GPT-4.

Instead, we need a chunking strategy to pull relevant snippets as context, with two constraints: we don't want to truncate sentences, and a single sentence on its own doesn't carry enough context about what a segment is discussing.

My solution: intelligently chunk transcripts into coherent paragraphs of 3-5 sentences. This provides sufficient context without overwhelming the LLM. The scripts below handle this transcript segmentation.

It combines Whisper's sentence boundary detection with custom logic to group related sentences. Given a full transcript, it outputs a list of topical paragraph chunks ready for Claude.

Now we have a pipeline to download, transcribe, and prepare podcasts for summarization and search. Next I'll cover the Claude QA system to interact with transcripts.

Joining truncated sentences

Consider a case where Whisper splits a single sentence across two segments, marked with <SPLIT> below. We want to join them so that we recover the full sentence: “Mr. Zuckerberg said that Reels is increasing overall app engagement and that the company <SPLIT> believes it is gaining share in the short-form video market.”

My approach was to check whether a segment ends with a sentence-final character like a question mark or period; if it doesn't, I keep queueing subsequent segments until one does, then join the queue into a single sentence.

# Group transcript segments into paragraph chunks
# Check if segment ends with sentence boundary suffix
# If so, join queued segments into paragraph and reset queue
# Else just append segment to current queue
# Continue segment by segment to build multi-sentence paragraphs
# Provides coherent context chunks for downstream NLP

segment_queue = []
docs = []
suffixes = ("?", ".", '"')
for segment in segments:
    text = segment["text"].strip()
    if text.endswith(suffixes):
        if len(segment_queue) > 0:
            docs.append(" ".join(segment_queue) + " " + text)
            segment_queue = []
        else:
            docs.append(text)
    else:
        segment_queue.append(text)

# Flush any leftover segments that never hit a sentence boundary
if segment_queue:
    docs.append(" ".join(segment_queue))

Joining segments into semantically coherent chunks

Then I implemented a separate loop that collects segments until a maximum size is reached:

# Further chunk paragraphs into model-friendly sizes 
# Aim for chunks of 1024 characters for Claude

# Track running character count
# Append segments to queue until hit chunk size

# If exceeded chunk size, join queue into chunk
# Add to final chunks list
# Clear queue and reset count

# Continue appending to queue for next chunk
# End result is bite-sized chunks for Claude

chunk_size_chars = 1024

chunked_docs = []
running_char_count = 0
new_queue = []

for segment in docs:
    segment_length = len(segment)
    if running_char_count + segment_length <= chunk_size_chars:
        new_queue.append(segment)
        running_char_count += segment_length
    else:
        # Flush the current chunk and start a new one with this segment
        chunked_docs.append(" ".join(new_queue))
        new_queue = [segment]
        running_char_count = segment_length

# Flush the final (possibly partial) chunk
if new_queue:
    chunked_docs.append(" ".join(new_queue))

This allowed me to generate some really clean chunks of text that are (mostly) semantically coherent and of an appropriate size for an LLM to retrieve, parse, and understand. For example:

"Mr. Zuckerberg attributed some of those gains to Reels, the company's short-form video product. Mr. Zuckerberg said that Reels is increasing overall app engagement and that the company believes it is gaining share in the short-form video market. When we started this work last year, our business wasn't performing as well as I wanted. But now we're increasingly doing this work from a position of strength, Mr. Zuckerberg said on Wednesday in a call with analysts. Mr. Zuckerberg said he expects so-called generative AI to have impacts on every one of Meta's apps and services.”

Part 3: Interacting with this content with an LLM

Now for the truly exciting part - unleashing LLMs on our processed transcripts to generate value. We've done the heavy lifting of downloading, transcribing, and preparing the data. Now the fun really begins.

As their name implies, large language models excel at understanding and generating natural language. They're the perfect solution for summarizing podcasts, extracting key points, and even creating interactive experiences.

We can build a "Retrieval QA" system, where user questions fetch relevant transcript snippets to prime an LLM response. The possibilities are endless.

I built my tool using two fantastic frameworks - Langchain for LLM integration and Streamlit for the web app UI. Langchain simplifies interacting with models like Claude and GPT-4. Streamlit makes building interactive, Python-driven web apps a breeze.

With just a few lines of code, I can:

  • Summarize episodes into concise overviews
  • Extract key quotes on any topic
  • Generate learning notes and flashcards
  • Build a smart search engine over transcripts
  • Create an AI assistant to answer podcast-related questions

The tutorials for Langchain and Streamlit are excellent, so I won't rehash the details here. But in a nutshell, they enabled me to quickly build an engaging podcast tool with Claude's natural language capabilities.

I'm thrilled to finally put all this data to work for podcast lovers like myself. Keep reading to see my tool in action!

Overview

Building a podcast summarizer

Our first use-case is producing a summary for a particular podcast. If we can generate some cliff-notes for a podcast episode, we can either use it as a replacement (if we don’t have time to listen to it) or to augment our understanding of the content.

As I discussed above, LLMs (at the time of writing) don’t love huge contexts. They tend to forget stuff half-way through and over-index on the beginning and end of a prompt. So if we just feed in a massive transcript, they will likely not do a great job. In fact, if the input is too large, you might even hit the token limit. We need a way of handling summarization when the input is very large.

We will use the MapReduce pattern, which is fairly common for handling large datasets. The idea is simple: break the problem into parts, do an operation on each part, then aggregate. In our case, we take each chunk (remember, we’ve already split the transcript into chunks), summarize it, then summarize the summaries.

Since this is such a common pattern for summarizing very large documents with LLMs, Langchain gives us a handy Summarization Chain. Let’s break it down.

For the Map step, I am having Claude write a concise summary of each chunk created during the pre-processing phase. I am using the following prompt:

Write a concise summary of the text enclosed in <content></content>. Be specific and factual. Add detail where appropriate. The summary will be used to write an article. Do not write anything before your response. <content> {text} </content> CONCISE SUMMARY:

This creates N summaries which are combined in the Reduce step using the prompt:

<content> {text} </content>

Article:

I have found that prompting the LLM to write a “blog” yields nice content that is both thorough and well-written. Below is an example of the summary “article” produced. I also output the “thinking” process (the MapReduce intermediate steps) as bullet-point notes.
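
For reference, here is roughly how the chain can be wired up with Langchain (a sketch using load_summarize_chain; the model choice and the abbreviated prompt text are assumptions, not the exact repo code):

# Sketch of a MapReduce summarization chain with Langchain (not the exact repo code)
from langchain.chat_models import ChatAnthropic
from langchain.prompts import PromptTemplate
from langchain.docstore.document import Document
from langchain.chains.summarize import load_summarize_chain

llm = ChatAnthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

map_prompt = PromptTemplate(
    input_variables=["text"],
    template=(
        "Write a concise summary of the text enclosed in <content></content>. "
        "Be specific and factual.\n<content> {text} </content>\nCONCISE SUMMARY:"
    ),
)
combine_prompt = PromptTemplate(
    input_variables=["text"],
    template="<content> {text} </content>\n\nArticle:",
)

chain = load_summarize_chain(
    llm,
    chain_type="map_reduce",
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
    return_intermediate_steps=True,  # keep the per-chunk summaries as bullet-point notes
)

docs = [Document(page_content=chunk) for chunk in chunked_docs]
result = chain({"input_documents": docs})
article = result["output_text"]        # the final "article"
notes = result["intermediate_steps"]   # the Map-step summaries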

Intermediate steps (example)

  • Celestia is a modular blockchain network that pioneered the concept of separating data availability/ordering from execution.
  • Traditional blockchains like Bitcoin and Ethereum bundle availability, ordering, and execution together in a "monolithic" design.
  • Celestia's modular design orders and makes data available but leaves execution to clients. This avoids forcing all nodes to execute every transaction.
  • Modular blockchains solve the "data availability problem" - ensuring all nodes can access transaction data needed to verify blocks.
  • Celestia's innovation is "data availability sampling" - allowing light clients to verify data availability without downloading full blocks. This prevents block producers from withholding data needed for fraud proofs.

Building a Question & Answering system

Building the Q&A tool is quite similar. Instead of taking the entire content and running it through a “summarization” pipeline, we compress information with embeddings. A text embedding is a numerical vector representation of a piece of text that captures the contextual meaning of its words and the relationships between them, in a form machine learning models can work with. The other nice property of representing your text as a bunch of vectors is that you can compute similarity between them using very simple methods like cosine similarity.
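
As a toy illustration of that last point, cosine similarity is just a normalized dot product between two embedding vectors (the numbers below are made up):

# Toy cosine similarity between two made-up embedding vectors
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.12, -0.48, 0.33, 0.07])
chunk_vec = np.array([0.10, -0.51, 0.29, 0.11])
print(cosine_similarity(query_vec, chunk_vec))  # close to 1.0 means semantically similar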

First, I compute embeddings for all the chunks in a given podcast and insert them into a DeepLake vector store. Then, when a user submits a query, we identify the chunks that are most semantically relevant to the query and insert those into a prompt. Once more, LangChain makes this very easy thanks to a RetrievalQA chain which does most of the heavy lifting.
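
A sketch of that flow might look like the following (the dataset path, embedding model, and retriever settings are assumptions rather than the exact repo code):

# Sketch of the retrieval QA flow with a DeepLake vector store (not the exact repo code)
from langchain.chat_models import ChatAnthropic
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.chains import RetrievalQA

embeddings = OpenAIEmbeddings()  # assumes OPENAI_API_KEY is set

# Index all chunks for one podcast (hypothetical local dataset path)
db = DeepLake.from_texts(
    chunked_docs,
    embeddings,
    dataset_path="./deeplake/delphi_podcast",
)

qa = RetrievalQA.from_chain_type(
    llm=ChatAnthropic(),
    chain_type="stuff",  # stuff the top-k retrieved chunks directly into the prompt
    retriever=db.as_retriever(search_kwargs={"k": 4}),
)

print(qa.run("What is data availability sampling?"))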

Here is an example output:

Conclusion

Building this podcast tool highlights the power of combining off-the-shelf AI. With just a weekend of hacking, we've created something truly useful. I now efficiently mine my favorite shows, like The Delphi Podcast, for insights. The possibilities are endless when you blend solutions like Whisper, Claude, and Streamlit.

I encourage you to fork the code and experiment. Let's build an engaged community around surfacing podcasts' hidden value. Here are some ideas I'm excited to explore:

  • A decentralized effort to archive and transcribe all the world's podcasts. There are projects preserving books and film - why not audio? A DAO could collect funds and charge for access to AI-generated summaries and notes.
    • There are over 4 million podcasts out there publishing ~300,000 episodes per month. Assuming an average footprint of 50 MB per episode and 2 minutes of compute (on an A6000) for transcription, we’re talking about 15TB per month and 10,000 hours of compute (that’s roughly 15 GPUs running continuously…).
    • Apparently, only 7% of podcast creators store backups. I imagine that in a few years many of the servers hosting the mp3 files will be down… Audio and transcripts should be archived with Filecoin. There are solutions to fund decentralized storage perpetually.
  • Adding translations to unlock this content for non-English speakers. The incremental compute cost is negligible.
  • Synthesizing audio reports using advances in text-to-speech, like ElevenLabs. Listeners with only 15 minutes could digest a concise episode.
  • Building personalized recommender systems by connecting transcripts to user profiles and interests.
  • Enabling smart search across archives to instantly find relevant podcast moments.

The possibilities are truly endless. Podcasts represent a goldmine of insightful conversations and ideas. But they remain trapped in audio silos. With the right tools, we can finally unleash their potential. Let's work together to drive innovation in this space!

