Talk Features: Searching Transcripts and Closed Captions

Slides: index one frame per 10 seconds, for example

Closed captions: convert VTT/SRT to text (eliminating overlapping lines)

To actually search within a talk, the times need to be stored alongside the text

Configuring Solr

Closed captions are lists of timings paired with the text displayed during each interval.

Solr is not designed to store complex data structures, so the easiest way to store
caption files is as a string array. By adding an analysis step to the index
pipeline, you can let people search the caption text without matching the times.

This approach won’t pick up matches that extend over several lines.
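
As a rough sketch of that approach (the field and format here are illustrative, not the exact schema): each VTT cue becomes one "start|end|text" string, and the field's index analyzer - for example, Solr's PatternReplaceCharFilterFactory on a text field - strips the leading times, so only the caption text is searchable while the stored value keeps the times for display and deep links.

    // Sketch: flatten VTT cues into "start|end|text" strings for a
    // multi-valued Solr field (names and format are illustrative).
    interface Cue { start: string; end: string; text: string; }

    function parseVtt(vtt: string): Cue[] {
      const cues: Cue[] = [];
      for (const block of vtt.split(/\r?\n\r?\n+/)) {
        const m = block.match(
          /(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\s*\r?\n([\s\S]+)/);
        if (m) cues.push({ start: m[1], end: m[2], text: m[3].replace(/\s+/g, ' ').trim() });
      }
      return cues;
    }

    // The index analyzer strips everything up to the last '|' (e.g. a
    // PatternReplaceCharFilterFactory with pattern "^[^|]*\|[^|]*\|"),
    // so queries match the text but the stored value keeps the times.
    function toSolrCaptions(vtt: string): string[] {
      return parseVtt(vtt).map(c => `${c.start}|${c.end}|${c.text}`);
    }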

Talk Ranking: Measures of Difficulty

What are we measuring?

  • Is the content needlessly difficult to read?
  • Is the content ‘advanced’ for its designated field? (This is something someone offered to pay for.)

Examples

  • Keynotes at a conference vs. masters level classes
  • Continuing education for doctors vs. TED talks

Options

  • Flesch-Kincaid and similar readability formulas (see the sketch below)
  • Measure word usage vs. others in the category
  • Measure word usage vs. standard usage

Are there papers on this?
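
For reference, the Flesch-Kincaid grade level is 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59. A minimal sketch with a naive syllable counter (real implementations use a dictionary or better heuristics):

    // Naive syllable count: runs of vowels, minus a trailing silent 'e'.
    function countSyllables(word: string): number {
      const w = word.toLowerCase().replace(/[^a-z]/g, '');
      let n = (w.match(/[aeiouy]+/g) || []).length;
      if (w.endsWith('e') && n > 1) n -= 1;
      return Math.max(n, 1);
    }

    // Flesch-Kincaid grade level:
    // 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    function fleschKincaidGrade(text: string): number {
      const sentences = text.split(/[.!?]+/).filter(s => s.trim()).length || 1;
      const words = text.split(/\s+/).filter(w => w);
      if (words.length === 0) return 0;
      const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0);
      return 0.39 * (words.length / sentences) + 11.8 * (syllables / words.length) - 15.59;
    }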

Talk Ranking: Measures of Quality

FindLectures.com ranks talks differently than Google or more traditional full text search engines.

To encourage discovering new topics, it attempts to show a variety of topics and speakers in the results, until you narrow using a search term or facets. This helps to replicate the experience of browsing a library. This is notably different from Amazon or YouTube, which show you topics based on what you previously watched.

If talks aren’t of high quality, browsing a collection at random leads to a lot of dead-ends. FindLectures tracks some quality measures to improve this experience, listed below. Most of these are experimental, so not every factor may work correctly or apply to everything.

One of the goals of FindLectures is to surface speakers who are doing interesting work, but who are not focusing on marketing their efforts. Consequently, I’ve tended to treat some ranking factors as binary values (you have written a book or you haven’t). This ideally acts as a low-pass filter on talks, without eliminating people who are early in their careers.

Note that popularity is largely ignored - I see this as a measure of how good the speaker is at marketing themselves. On social media, there are many sizeable communities that focus their efforts on rigging the rankings of various sites (Google image search, Reddit, and so on).

Quality factors on text

  • Does the talk start with a long introduction?
  • Does the speaker say “um” frequently?
  • Do they speak excessively quickly or slowly?
  • If the speaker uses an esoteric vocabulary, the talk may be good for practitioners of their field, but poor for others.

Speaker Attributes

  • Is there a Wikipedia entry about the speaker?
  • Did the speaker write a book? Who is the publisher?
  • Does the speaker get invited to many different conferences?
  • Does the speaker get invited back to the same conference?
  • Did someone else vet this speaker? (e.g. a conference or university)

History

  • How old is the talk?
  • Are there books written about the speaker?

Audio

  • Is the audio in both channels (if stereo)?
  • Are there mic problems (clipping, etc.)?
  • How well can a machine transcribe the text? If the stated transcript is very different from a generated transcript, it may indicate unclear audio (see the sketch after this list).
  • Length (minutes) - 20-45 minutes is ideal
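
A sketch of that transcript comparison: word error rate, i.e. edit distance over word tokens normalized by the reference length, which is the standard speech recognition metric:

    // Word error rate: Levenshtein distance over word tokens. A high WER
    // between the stated transcript and a machine-generated one suggests
    // the audio is unclear.
    function wordErrorRate(reference: string, hypothesis: string): number {
      const ref = reference.toLowerCase().split(/\s+/).filter(w => w);
      const hyp = hypothesis.toLowerCase().split(/\s+/).filter(w => w);
      const d: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
        Array.from({ length: hyp.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)));
      for (let i = 1; i <= ref.length; i++) {
        for (let j = 1; j <= hyp.length; j++) {
          const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
          d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
        }
      }
      return d[ref.length][hyp.length] / Math.max(ref.length, 1);
    }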

Video

  • Are there slides? (Indicates some preparation)
  • Are there closed captions?
  • Is the camera stable?

Other measures

  • Did someone recommend this speaker / conference / course?

Software Architecture of FindLectures.com

UI

  • TypeScript + React + React-Bootstrap
  • Node (TypeScript)

DB

  • Solr

Schema

JSON stored in Git.

The size of the Solr index is approximately 2GB per 100k videos.

Fields:

  • Title
  • Description
  • Speaker Bio
  • Speaker Names
  • Speaker Amazon IDs
  • Speaker Wikipedia URLs
  • IAB Categories (actual taxonomy extended a bit)
  • Collection name
  • Tags
  • Closed captions
  • Transcript
  • Auto-generated Transcript
  • Length
  • Audio Quality Score
  • Text Quality Score
  • Metadata Quality Score
  • Year Given
  • Talk Type
  • Features (Closed Captions, etc)
  • Link to video / audio player
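
As a sketch, one talk record might look like the following (the field names are illustrative, not the exact schema):

    // Illustrative shape of one talk record, stored as JSON in Git.
    interface Talk {
      title: string;
      description?: string;
      speakerBios?: string[];
      speakerNames: string[];
      speakerAmazonIds?: string[];
      speakerWikipediaUrls?: string[];
      iabCategories?: string[];      // IAB taxonomy, extended a bit
      collection: string[];          // can be multi-valued (see below)
      tags?: string[];
      captions?: string[];           // "start|end|text" strings
      transcript?: string;
      autoTranscript?: string;
      lengthMinutes?: number;
      audioQualityScore?: number;
      textQualityScore?: number;
      metadataQualityScore?: number;
      yearGiven?: number;
      talkType?: string;
      features?: string[];           // e.g. "Closed Captions"
      url: string;                   // link to video / audio player
    }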

Why?

  • Small files
  • Easy to correct data by hand
  • Any language can access it trivially (except Scala Play, which uses an awkward JSON format)
  • Diffs work great for testing [TODO: get image]

Attribute types:

  • Datatype (string, integer, date)
  • Array valued vs non-array
  • Hierarchical

    More things are hierarchies and arrays than you’d think -
    e.g. the ‘collection’ a video is from. A video might come
    from two places (Hacker News, some agency website). It could
    be a hierarchy if you pull from Reddit and mark the subreddit
    as something you can drill into.
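
One common way to make such hierarchies drillable in Solr (a sketch of the general technique, not necessarily what FindLectures does) is to store depth-prefixed ancestor paths in a multi-valued field:

    // Encode "reddit > programming" as depth-prefixed facet values, so
    // the UI can facet on depth 0, then drill into depth 1 with a filter.
    function hierarchyFacetValues(path: string[]): string[] {
      return path.map((_, depth) => `${depth}/${path.slice(0, depth + 1).join('/')}`);
    }

    // hierarchyFacetValues(['reddit', 'programming'])
    //   => ['0/reddit', '1/reddit/programming']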

ETL Overview

Designed around progressive enhancement, like progressive image rendering
(from https://blog.codinghorror.com/progressive-image-rendering/): each pass
adds more detail.

This allows using the “best” tool for the job: Node, Python, Scala, and Bash
scripts, coordinated through RabbitMQ.

Progressive enhancement occurs by timescale.

  1. Short - scraping a site. Typically 1-500 pages.
    E.g., a speaker agency might have 20 pages, one per speaker.
    Some conferences put everything on one page.

    At this stage, you might get:

    - title
    - length
    - speaker bio (typically written by the speaker)
    - transcripts (sometimes)
    - talk description
    
  2. Video lookup
    This means getting basic info from YouTube, Vimeo, SoundCloud, etc.

    This might get you:

    • title (can tell you the year the video was made)
    • length
    • video description (can also tell you the year)
    • closed captions
    • category (YouTube uses the ad-industry IAB taxonomy, like Watson)
  3. Enhancement
    These are scripts that get run once in a while.
    E.g.:

    • Checking for 404’ed pages
    • Checking for “ums” in texts
    • Adding a Flesch-Kincaid difficulty score
    • Batch uploads to Watson for category tagging
    • Batch uploads to a speech recognition API
  4. Index time changes

    • Quality scoring
    • Merge captions into a “transcript”
    • Normalize how frequently people / topics show up so you get breadth
  5. Processing heavy data
    Download videos, audio, run these files through various processing techniques.

    This is very dependent on “use the best tool for the job”, so language independence
    is most important at this stage.

    Attributes obtained at this stage:

    • Audio quality (clipping, SNR, channel imbalance)
    • Speech to text
    • Slides
    • Facial recognition?
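
Each stage can be a small script that reads a talk’s JSON file, adds whatever fields it can compute, and writes the file back, so later stages enrich earlier ones. A minimal sketch (the file layout is hypothetical):

    import * as fs from 'fs';

    // One enhancement pass: merge newly computed fields into a record.
    // New values win; everything already present is kept.
    function enhance(file: string, newFields: Record<string, unknown>): void {
      const record = JSON.parse(fs.readFileSync(file, 'utf8'));
      fs.writeFileSync(file, JSON.stringify({ ...record, ...newFields }, null, 2));
    }

    // e.g. enhance('talks/yt-abc123.json', { textQualityScore: 8.2 });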

ETL - Crawling

First version was Python + BeautifulSoup (lots of mini scrapers).

Migrated to Node.js for the “glue” code of ETL.

The good parts:

  • Use the “best tool for the job”, e.g. child_process.spawn to call curl for downloads. Some tools
    write to stderr when they should write to stdout, so you have to fiddle to capture the right stream.
    If processing takes a long time, key everything off the video ID and write the processed output of
    each source video to its own file.

  • Node’s async behavior lets you parallelize file system operations, but you can get ‘too many open
    files’ errors. I used graceful-fs to fix this.

  • curl gives you text. Run “unescape” on it, then use the “url-regex” library from npm to get ALL URLs.
    This avoids worrying about the differences between URLs in hrefs, URLs written in text (e.g. on a
    forum), and iframes. YouTube has a ton of different URL formats; these get normalized later (we just
    want the video ID):

    • m.youtube.com
    • www.youtube.com/embed/[id]
    • www.youtube.com?v=[id]
    • www.youtube.com/playlist/1234?v=[id]
    • etc
  • For sites with structure, use Cheerio -
    a jQuery-style selector API for server-side Node.

  • To parallelize this, make a bunch of jobs in RabbitMQ. The definition
    of a job is:

    1. Links to crawl
    2. Rules (regex) for links to traverse
    3. Pre-set data (e.g. if you know the speaker’s name already)
    4. Rules to find data (lambdas with jQuery selectors)

      This needs to be serializable - enter serialize-javascript (see the sketch after this list).

  • There is also a script that does an ‘upsert’ at the end of crawling.

    If a video has already been seen, we want to update any new fields we
    found so there aren’t duplicates.

    I had to add locking at this stage, since I’m not using a real database;
    I used async-lock.
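
A sketch of publishing one of those jobs (the queue name and job shape are illustrative; serialize-javascript keeps the regexes and lambdas that JSON.stringify would drop, and the worker revives them with eval):

    import * as amqp from 'amqplib';
    import serialize from 'serialize-javascript';

    interface CrawlJob {
      links: string[];                                  // links to crawl
      followPatterns: RegExp[];                         // rules for links to traverse
      presetData: Record<string, string>;               // e.g. a known speaker name
      extractors: Record<string, ($: any) => string>;   // jQuery-selector lambdas
    }

    async function publishJob(job: CrawlJob): Promise<void> {
      const conn = await amqp.connect('amqp://localhost');
      const channel = await conn.createChannel();
      await channel.assertQueue('crawl-jobs', { durable: true });
      channel.sendToQueue('crawl-jobs', Buffer.from(serialize(job)));
      await channel.close();
      await conn.close();
    }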

ETL - Video lookup

Use youtube-dl here, plus ffmpeg and sox. Not hosting videos puts DMCA enforcement mostly on YouTube
(there are a ton of bootlegged videos posted to Reddit, either from YouTube or from sites designed
to facilitate copyright infringement).

On YouTube, captions are machine-generated if a person hasn’t provided them. Captions are interesting
in that they are just a timing plus text, potentially with words highlighted. Big players in music
do fun CSS things with their captions, which means the caption text repeats (as each word is
highlighted). These repeats can be spliced back together - essentially the same idea as genome
sequencing algorithms (see the sketch below).

Highlighting in search results is a bit of a pain, since you can only highlight the timed text
within a boundary. However, this lets you skip to a portion of a video you care about.
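
A sketch of that splicing: for each new cue, find the longest overlap with the text so far and append only the new tail, much like sequence assembly:

    // Merge consecutive cues where the next cue repeats the end of the
    // previous one (word-highlighting captions re-send the same line).
    function spliceCaptions(cues: string[]): string {
      let merged: string[] = [];
      for (const cue of cues) {
        const words = cue.split(/\s+/).filter(w => w);
        let overlap = Math.min(merged.length, words.length);
        // Longest suffix of the merged text equal to a prefix of this cue.
        while (overlap > 0 &&
               merged.slice(merged.length - overlap).join(' ') !==
               words.slice(0, overlap).join(' ')) {
          overlap--;
        }
        merged = merged.concat(words.slice(overlap));
      }
      return merged.join(' ');
    }

    // spliceCaptions(['never gonna', 'never gonna give', 'give you up'])
    //   => 'never gonna give you up'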

ETL - Large file Processing

70,000 videos @ 45 minutes each is around 24 TB.
This would cost about $300/mo on S3, and similar on other services.

Hard drives cost ~$45/TB at the moment, and you can get up to 12 TB
in a single drive. The cost doubles if you mirror with RAID, and there is a big
performance / electricity hit as well.

Can download / split 4-6k videos per day.

Ways to mitigate:

- Use ffmpeg to split the audio / video portions, re-encoding audio as MP3 (see the sketch below)
- Download lower-quality video (risky for OCR)
- Quality tasks may only need the first N minutes of video
- ffmpeg can compress video, but this is very time consuming
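
For example, stripping the audio track out to MP3 with ffmpeg from Node (the flags are standard ffmpeg; the paths are illustrative):

    import { spawn } from 'child_process';

    // Extract the audio track and re-encode it as MP3; -vn drops the
    // video stream. Audio-only files are far smaller, so most analysis
    // can run against these instead of the full video.
    function extractAudio(videoPath: string, mp3Path: string): Promise<void> {
      return new Promise((resolve, reject) => {
        const ff = spawn('ffmpeg',
          ['-i', videoPath, '-vn', '-codec:a', 'libmp3lame', '-qscale:a', '4', mp3Path]);
        ff.on('close', code =>
          code === 0 ? resolve() : reject(new Error(`ffmpeg exited with ${code}`)));
      });
    }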

Slides:

- Use Tesseract for OCR: grab frames every N seconds to find out if there’s text
  (this indicates either slides or an intro) - see the sketch below
- Store in the same format as closed captions (time window + text)
- [TODO insert paper reference] describes a more sophisticated approach (finding out when
  a slide becomes visible). They indicate their processing takes 1/10th (verify?) the
  time of the actual video. I think this can be improved a lot with my cruder approach.
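
A sketch of the frame-grab and OCR loop, assuming ffmpeg and the tesseract CLI are installed (fps=1/10 grabs one frame every 10 seconds):

    import { execFile } from 'child_process';
    import { promisify } from 'util';
    import * as fs from 'fs';

    const run = promisify(execFile);

    // Grab one frame every 10 seconds, then OCR each frame. Frames with
    // a meaningful amount of text suggest slides (or an intro card).
    async function ocrFrames(videoPath: string, outDir: string): Promise<string[]> {
      await run('ffmpeg', ['-i', videoPath, '-vf', 'fps=1/10', `${outDir}/frame_%04d.png`]);
      const texts: string[] = [];
      for (const frame of fs.readdirSync(outDir).filter(f => f.endsWith('.png'))) {
        // "tesseract <image> stdout" prints the recognized text to stdout.
        const { stdout } = await run('tesseract', [`${outDir}/${frame}`, 'stdout']);
        texts.push(stdout.trim());
      }
      return texts;
    }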

Some people train models on these. One neat trick is to use partial video.

SNR is easy to compute in Python.

Many machine learning tools are native to Python, e.g. numpy for matrix work.

Scala / Spark has some useful tools as well (and can max out your CPU).

sox (the audio Swiss Army knife) detects clipping as a side-effect, which is useful; it also
spits out audio stats.
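
For instance, "sox in.wav -n stat" writes summary statistics to stderr, and a maximum amplitude pinned at 1.0 is a decent hint of clipping. A sketch of checking that from Node:

    import { execFile } from 'child_process';
    import { promisify } from 'util';

    const run = promisify(execFile);

    // "sox <file> -n stat" writes summary stats to stderr; a maximum
    // amplitude at (or very near) 1.0 suggests the recording clipped.
    async function looksClipped(wavPath: string): Promise<boolean> {
      const { stderr } = await run('sox', [wavPath, '-n', 'stat']);
      const m = stderr.match(/Maximum amplitude:\s+([\d.]+)/);
      return m !== null && parseFloat(m[1]) >= 0.999;
    }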

External Services

  • Deepgram
  • Watson

FindLectures.com: A Launch Story

I started FindLectures.com as a wide-ranging research project. Once it was usable, I showed a few people and posted comments on Reddit and Hacker News.

The Reddit posts led to write-ups on The Next Web and Lifehacker.

Since then, several people have expressed interest in how this went and what I’ve learned. In this post, I’ll show you the stats from the launch and screenshots of the tools I’ve been using.

Prior to Dec 3 there was very little usage, so the charts start there. The first chart shows the total traffic since then:
Google Analytics traffic during launch

For scale, the peak usage was 3,371 sessions in one day, and the lowest is 294. A “page view” occurs every time someone selects a facet in the search UI. At the time of writing, my personal blog gets a little over 19,000 sessions a month, mostly from SEO, so I’m pleased with the results so far.

The next Google Analytics screenshot shows the breakdown by which site referred the traffic. The Next Web uses Facebook for commenting on its articles, which I suspect is the source of that traffic. Similarly, the traffic from Flipboard, Pocket, Tumblr, and Feedly is all effectively engineered by TNW.

You can also see here that someone submitted FindLectures.com to Product Hunt, and it didn’t do well. A comparable utility called Class Central did well there, but site stability issues hurt me (more on that later). I suspect that design plays a big role as well.

A newspaper in Chennai also wrote up FindLectures.com, in Tamil - note the traffic from tamil.thehindu.com.

Google Analytics - Traffic by Referrer

Google Analytics has a view that groups traffic by “channel”, i.e. whether it came from a search engine, a regular website, or a social media site. This is a crude metric, since they conveniently don’t recognize DuckDuckGo as a search engine. They also treat YouTube and StackOverflow as “social media.”

The interesting thing about this screenshot is that it shows FindLectures.com receiving several hundred click-throughs from search engines during the period when it was most active. Prior to this, FindLectures.com didn’t show up in any results. This made me wonder, what are these people searching for, and will it affect the rankings in Google?

Google Analytics - Channels view

It turns out that we can answer both questions. Google Webmaster Tools shows that most of these people were searching for “findlectures”:

Google Webmaster Tools - Screenshot

I’m more interested in how I rank for “find lectures”, because that’s something people would actually look for without knowing about the site. The top spot for that query is held by the Church of Christian Science, who deserve it, having sponsored public lectures since 1879.

To find out how this affected my ranking, Google Webmaster Tools lets you drill into the results for a term. Here we can see that ranking for this term did increase after a few days. Whether it will stay this way remains to be seen.

Search engine queries give you interesting insights into people’s minds - who searches for “bill clinton hobbies”?

Google Webmaster Tools - Screenshot

If you build something useful, ideally you naturally accumulate links to what you’ve built. For instance, the next screenshot shows the traffic I get from StackOverflow over time, which follows the ideal pattern, increasing slowly over several years:

Gary's blog - Stackoverflow

Google Analytics also has a “search” integration - if you tell it which URL parameter contains people’s queries, it gives you a nice report.

From these queries, it seems that many people using FindLectures.com are software developers. This is a good thing, because if these people like what you build, they’re likely to refer friends and family. For free applications, there are many successful products that help with recruiting (e.g. Stackoverflow, LinkedIn ads) or continuing education (e.g., books, app academies).

There are some interesting rarer searches - speaker names come up a lot (e.g. Ayn Rand, Trump), as well as topics around sports and cars. “Football” is an interesting specialty category, because it’s impossible to know which of the two sports people want. Where possible, I’m trying to hide highly nuanced cultural terms that could unfairly bias you for or against a speaker. The most prominent instance is titles: “president” could be a university president or a U.S. president, “bishop” means something different in a Roman Catholic church than in a Baptist one, UK universities have many titles that mean nothing to me, etc.

Top Searches on FindLectures.com

For a while the site crashed periodically, so I monitored the real-time view of Google Analytics to see if it was running. This shows you how many people are on the site at a given time. For several hours after the Lifehacker article was posted, there were over a hundred concurrent users.

In the long run, I’d like to encourage people to discover talks they wouldn’t normally find on their own (there are some great historical lectures).

Anecdotally, I noticed that a lot of people clicked into topics on religion, philosophy, and spirituality. I think this is an area where the application can be useful. Some of these lectures are especially difficult to tag, but I think the library-style categorization system really shines here. A well-designed tagging system can offer a neutral judgement on the value of the content and let you listen to speakers you might not approach in the real world.

Google Analytics Real Time View

The next chart shows the top ten videos, which are almost entirely software development topics. I included tech talks to help me keep track of the state of the art in the field. This is a big differentiation point from similar search engines, which typically focus on full courses covering introductions to computer science. There is some guilt associated with not finishing a class, and introductory CS material is not that useful on its own.

Top Ten most clicked lectures

Google Analytics can also build custom reports if you send data to its JavaScript API. The free version “limits” you to 20 metrics and dimensions (basically rows/columns describing an event, like a search or a click).

I set up a report on what portion of search results had lectures playable directly in the results, and how many talks were by known authors.
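
With the classic analytics.js API, that looks roughly like this (which dimension and metric slots to use is configured in the GA admin UI; the ones below are illustrative):

    // Assumes the standard analytics.js snippet has already defined ga().
    declare const ga: (...args: unknown[]) => void;

    // Report a search, along with custom metrics configured in the GA
    // admin UI (the metric slots used here are illustrative).
    function reportSearch(query: string, playable: number, knownAuthors: number): void {
      ga('send', 'event', 'search', 'results', query, {
        metric1: playable,       // results with an inline player
        metric2: knownAuthors,   // results by known authors
      });
    }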

Number of search results with books or inline player

The API for Google Analytics is so easy to use that unscrupulous people write scripts to send fraudulent referrer data to your analytics. These are typically lead pipelines to sites that helpfully offer you the opportunity to join a botnet. The most interesting recent example is “vote trump” spam, which continues well after the election.

Consequently, you should be careful about trusting the analytics of any site. This spam is also regressive - smaller sites will be off by far more than larger ones.

Real user actions speak more clearly to usage - I’ve received a dozen emailed thank-you notes, one hand-written card, and five people have written in recommending video series (usually their own). Four or five people also wrote in to report stability issues during outages.

After doing a few in-person demos, I realized that faceted navigation is overwhelming for some people. Positioning this application as “search” is a little misleading, because results without queries are intentionally a high-quality random selection, to aid discovery. More specifically, results try to give you an even distribution of non-technical topics, correct for gender, and filter out talks with major issues (too short, bad audio, lots of ums, etc.).

I decided to try making an email list to help the people who found this overwhelming and send out the best videos we’ve found. For now, I’m using Drip, but considering migrating to AWeber.

I initially assumed that no one would sign up for this, but you can see that the percentage of people who sign up is quite high (about 10x higher than I anticipated).

Many lecture collections came from research that I and others at Wingspan did while finding videos for our lunch-and-learn program. My sister and I also share a spreadsheet containing 176 video ratings, so I have enough good videos to populate this for some time.

Drip - Overview Screenshot

The Next Web article was much more positive than Lifehacker’s, so I suspect that is why its signups are so much higher.

It’s interesting that there are 0 signups from “Facebook”. I think this is a combination of those visitors being on phones and coming from the comments section of TNW. If this traffic were from people writing a Facebook status that references the site, the number would be much higher (among the first 20 subscribers, about half are my friends and family).

Drip - Signups by referrer

I set the emails up to be like a “course”, so that I can see how a small number of people react to the videos I select, before a large group sees them. Ideally, the initial sequence encourages people to trust my judgement on video choices (assuming I have any).

I notice that there is significant psychological pressure when emailing 300 people at once, which I imagine will be much greater if this list grows.

Drip also tracks metrics on what people click on - currently it’s not enough to be interesting, but I do notice that this seems to encourage people to return to the site.

Drip - Email Sequence

Overall I was hurt by site stability issues - it always takes time to stabilize an application when you set it up for the first time. In this case, any time an invalid HTTP request came through, the site would briefly crash. Sometimes, the site would come back up.

The most significant stability improvement came from migrating the site to Heroku, which took me just over an hour.

Currently, the Node.js server runs there. I also use Solr, which runs on the same server as my personal blog. This makes Solr effectively free (since I already pay Linode for my blog). Heroku and Linode are in different data centers in New Jersey, so the traffic between the two goes out over the public internet. This adds approximately 25 ms latency to every request.

There is a “Solr” offering as an add-on to Heroku, but it requires your data to match a preset schema, and there is no way to upload your own data, as it’s designed to be an add-on to a Rails app.

One of the big selling points of Heroku is that you can drag a slider to choose how many nodes you want (see below). This means you get a free load balancer. They bill per node, per day - for a two-server setup this costs $50/mo, but it could go down to $7 on a low-RAM, single-node configuration. You can easily increase or decrease the cost day by day based on anticipated usage, which would be very expensive to build yourself.

Heroku - server configuration

At one point I tried using a VM on DigitalOcean, but found that it would require a $20/mo server to have enough RAM for both Solr and the Node.js server.

One of the cool things you get on Heroku is application health reporting - here, you can see how much RAM the servers use under actual usage:

Heroku - metrics #1

You can also see throughput - there are options for alerting, although I haven’t explored that because I don’t want to get paged for this site :)

Heroku metrics #2

There are a ton of add-ons in the Heroku market. Below, you can see a Splunk-style search engine for logs; Heroku doesn’t retain logs for long, so this is necessary.

Heroku - Logging

I noticed that iPhones have a particular tendency to send tons of HTTP requests to test what features your site supports, one of which you can see in the log above.

If you watch the logs for a web app in real time, you will see all kinds of strange things come through. As a taste, Google Analytics has a report of what browsers hit the site - it’s more varied than I ever imagined:

Google Analytics report of browser

I set up an application called Bugsnag, which captures JavaScript errors. Some of these browsers have unexpected JavaScript errors. Three-quarters of the way down the screenshot, there is an error parsing the documentation comments in Lodash, which would make the site unusable for those users.

Bugsnag screenshot

There is a risk of spending too much time thinking about metrics. For me, the ideal outcome of watching a talk is that it sinks into my mind and changes the decisions I make in the future, but this is difficult to measure. A site like this is only successful in “traffic” terms if people remember it and return, and for now it’s too early to see how that will play out.