Historic videos and talks are recorded on different media depending on the time period - the earliest recordings are on wax cylinders and silent film, followed later by reel-to-reel tape and VHS. Online video didn’t take off until Youtube became the primary destination for consumption.
Early videos must be digitized, so funding is a huge issue. Most are held in museums dedicated to a person or cause, like a US Presidential library, with some aspects of preservation and curation outsourced to the National Archives. How much gets digitized depends on the popularity of the person or their family, the size of their estate, and the ability to convince people to give to clearly optional causes - I suspect donating to the Bush library is like girl scout cookies for rich people. Copyright is also an issue, but it’s typically significantly less of a problem.
In a few cases, major libraries digitize a significant portion of their collections - the Digital Public Library of America is helping a lot in this area. When this is done by fan clubs of the individual, like presidential libraries, these tend to only be videos that help the image of the individual (try finding videos about Bill Clinton’s extracurricular activities in his library, for instance).
For older videos, categorization is tricky, because modern video production styles didn’t exist. In particular, documentaries are a modern style - early styles included “actualities”, which are just recordings of life.
You could also justify including speeches in text only, by earlier Presidents, or historical figures, but this introduces copyright as an issue (when you load a video or audio from a third party site, you’re outsourcing this significant legal issue).
Many important historical figures from the era of video are just not available online, because they are not well-known enough. Many countries that gained independence from colonialism post WWII had important leaders, who are rarely available online, unless they are really popular. For example, Haile Selassie, the Ethiopian leader, is available online, because of the popularity of Rastafarianism and his involvement in the League of Nations, but other English speaking leaders are harder to find - to the best of my knowledge, the first leader of Ghana (Kwame Nkrumah) is unavailable.
Language curation is tricky, because closely related languages are hard to tell apart - I found the accuracy of language detection for Spanish, Portuguese, and French to be poor, for instance. Tools that do language identification are also biased towards European languages, and not trained for native texts.
Tools which do entity recognition can often tag countries, but assume a “modern” concept of countries, and a list that more or less matches what is on a world map today, although they may deal well with disputed territories. Prior to the modern diplomatic system, and during colonialism, these categories differed from today’s in definition, name, and boundaries.
With Presidents, it is a challenge to decide what to include - ideally you don’t want to succumb to sensationalism, but if you’re pulling from Youtube, this is a risk. Many presidents, but not all, have presidential libraries, so you want to manually find stuff that is missing and tag it - each additional dataset (presidents, astronauts, etc) is like a product though, and for marketing purposes additional data added can be like a product launch.
Presenting historical information is also a challenge in several ways. Titles have specific cultural meaning - it means a different thing to be a Baptist bishop than a Catholic one, and many countries have certifications that are official titles, which do not translate well. For this site, I’ve tended to remove titles, for egalitarian purposes, and to make it harder to discriminate for or against someone on this basis.
Years are also interesting - for presidential talks I’ve fixed the facets so they are chronological. This makes it so you can learn things about the talks, just by seeing the lists over time.
Amazon retains author identifiers, but these are managed and created by publishers, so dead presidents from earlier eras do not have their metadata managed particularly well, and you have to do it yourself.
Many libraries have tons of cool artifacts, which they’ve digitized into obscene formats, like realplayer, or custom applications that are a pain to scrape and have no metadata.
- Data Storage
- Handling Sparse Data For Ranking
- Crawler Design
- Misc. Problems
- Content-type specific advice
FindLectures.com indexes talks from conference websites, speaker agencies, select social media sites, and conference recording companies.
Rather than crawl the internet at large, I select individual sites to crawl, which results in a higher quality list of videos, because they’ve been vetted by the site advertising them.
I think of this vetting as a low-pass filter on speakers - you still get bad material, but it rules out a lot of the youtube manifesto content.
This technique allows the crawler to find more metadata. Each video is stored in a JSON file, indexed by its URL on Youtube, Vimeo, or Soundcloud, so if it shows up in multiple places, the metadata can be combined.
Speakers typically write their own bios for conferences, so you can get validated information about them - if you wanted to get a view of tech conferences that evenly balances men and women, you can easily get their preferred pronouns from their bio.
Conceptually, building a crawler is simple:
- Load the site’s robots.txt file, parse it, and test future URLs against it
- Load a starting page, or a sitemap to obtain a list of URLs
- Parse any desired metadata from the page
- For each URL on the page, note any links, filtering out previously seen URLs
- Load remaining pages, in sequence
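The steps above can be sketched roughly as follows. This is a minimal illustration, not the site's actual crawler: `fetchPage` and `parseMetadata` are hypothetical callbacks supplied by the caller, links are pulled out with a simple regex, and the robots.txt handling only understands `Disallow` prefixes for `User-agent: *` (a full parser such as robots-parser handles far more).

```javascript
// Collect Disallow prefixes that apply to all user agents (simplified).
function parseDisallows(robotsTxt) {
  const rules = [];
  let applies = false;
  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.split('#')[0].trim();
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const key = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (key === 'user-agent') applies = value === '*';
    else if (key === 'disallow' && applies && value) rules.push(value);
  }
  return rules;
}

// Resolve every href on the page against the page URL, dropping fragments.
function extractLinks(html, baseUrl) {
  const links = [];
  const re = /href\s*=\s*["']([^"']+)["']/gi;
  let m;
  while ((m = re.exec(html)) !== null) {
    try {
      const u = new URL(m[1], baseUrl);
      u.hash = '';
      links.push(u.href);
    } catch (e) {
      // ignore hrefs that are not valid URLs
    }
  }
  return links;
}

// fetchPage(url) -> Promise<string>, parseMetadata(html) -> object
async function crawl(startUrl, fetchPage, parseMetadata) {
  const start = new URL(startUrl);
  const disallows = parseDisallows(await fetchPage(start.origin + '/robots.txt'));
  const allowed = (url) =>
    !disallows.some((prefix) => new URL(url).pathname.startsWith(prefix));

  const seen = new Set([start.href]);
  const queue = [start.href];
  const results = [];
  while (queue.length > 0) {
    const url = queue.shift();
    if (!allowed(url)) continue;
    const html = await fetchPage(url); // pages are loaded in sequence
    results.push({ url, ...parseMetadata(html) });
    for (const link of extractLinks(html, url)) {
      // stay on the selected site and skip previously seen URLs
      if (new URL(link).origin === start.origin && !seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return results;
}
```

Passing the fetcher in as a function keeps the loop easy to test against canned pages, and leaves rate limiting and timeouts to the caller.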
The JSON files are stored in git.
This has a few advantages - you can test by looking at diffs:
You can also have the Node.js script upload changed files to Solr:
A couple notes:
- This uses solr’s auto-field syntax.
- Ending a field with “_s” is a string, “_i” is a number.
- Ending a field with “_ss” is an array.
- Most fields can be part of a hierarchy (e.g. collection, topic)
- Many fields can be arrays - topic, speaker
- Some fields are floats (length)
- Some fields are sparse, or have an implicit associated with them - this is important for ranking.
- Works as a linter, plus you get types.
- Easy to add incrementally
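Going back to the Solr field-suffix notes, here is a small sketch of turning one stored video record into a Solr document. The field names and the suffix table are illustrative assumptions, and the `_f` suffix for floats assumes the dynamic fields that ship with Solr's default configset.

```javascript
// Hypothetical mapping from record fields to dynamic-field suffixes.
const FIELD_SUFFIXES = {
  title: '_s',      // string
  year: '_i',       // integer
  length: '_f',     // float (seconds)
  topic: '_ss',     // array of strings
  speaker: '_ss',
  collection: '_ss',
};

// Build a Solr document; missing (sparse) fields are simply left out.
function toSolrDoc(video) {
  const doc = { id: video.url };
  for (const [field, suffix] of Object.entries(FIELD_SUFFIXES)) {
    const value = video[field];
    if (value === undefined || value === null) continue;
    doc[field + suffix] = value;
  }
  return doc;
}
```

Using an explicit table, rather than guessing the suffix from each value, keeps a field from switching between `_i` and `_f` when a length happens to be a whole number.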
Write a script to do soft-validation (fields that can’t be blank, types must match Solr definitions)
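A soft-validation script along these lines might look like the sketch below. The rule table is hypothetical; in practice the types would be taken from the Solr field definitions.

```javascript
// Hypothetical rules: required fields and expected types.
const VALIDATION_RULES = {
  url: { type: 'string', required: true },
  title: { type: 'string', required: true },
  year: { type: 'integer', required: false },
  length: { type: 'number', required: false },
  speaker: { type: 'string[]', required: false },
};

function matchesType(value, type) {
  switch (type) {
    case 'string': return typeof value === 'string';
    case 'integer': return Number.isInteger(value);
    case 'number': return typeof value === 'number' && Number.isFinite(value);
    case 'string[]': return Array.isArray(value) && value.every((v) => typeof v === 'string');
    default: return false;
  }
}

// Returns warnings instead of throwing, so one bad record
// does not stop a whole run ("soft" validation).
function validate(video) {
  const warnings = [];
  for (const [field, rule] of Object.entries(VALIDATION_RULES)) {
    const value = video[field];
    const blank = value === undefined || value === null || value === '';
    if (blank) {
      if (rule.required) warnings.push(`${field}: required but blank`);
      continue;
    }
    if (!matchesType(value, rule.type)) warnings.push(`${field}: expected ${rule.type}`);
  }
  return warnings;
}
```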
To get data into the search engine, it’s pulled from several sources, modelled with progressive enhancement in mind - data from the internet first, then data that is joined in from other APIs or scraped sites.
The final results are stored in Solr, and re-built from scratch periodically, each time the ranking is changed.
Recently, a security firm uncovered a major botnet that automated fake page views on video ads, to generate ad revenue. I read a report on this botnet, and found some good Node libraries for scraping. [2] [3]
const urlRegex = require('url-regex');
str = decodeHTMLEntities(str);
Most websites are built with content management tools like Wordpress, so scraping has become easier over time. Most conference sites have very few pages (e.g. one per speaker), so they are typically easy to process.
In your browser console, you can often write a jQuery expression to obtain metadata:
$('ul.speaker-list li').map((i, el) => $(el).text().trim()).get()
Videos may be referenced in multiple ways.
To find these, we apply a URL regular expression, apply an HTML entity decoder to get the source URL, and for Youtube, normalize the URL format. The major benefit of this is that it discards the structure of the page - most crawlers hit issues with invalid HTML (I gave up on BeautifulSoup for this). This also allows me to get information out of hacker news comments, which may be just text that contains URLs, but not an actual link.
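As a rough sketch of that pipeline, using only built-in JavaScript rather than the libraries mentioned above: the entity decoder covers just a handful of common entities, the URL pattern is deliberately loose, and decoding before matching is a choice made here for simplicity. All function names are made up for illustration.

```javascript
// Decode the few HTML entities that commonly show up inside URLs.
function decodeEntities(str) {
  return str
    .replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#(\d+);/g, (_, dec) => String.fromCodePoint(parseInt(dec, 10)))
    .replace(/&quot;/g, '"')
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&'); // last, so "&amp;lt;" is not decoded twice
}

// Map the common Youtube URL shapes onto one canonical form.
function normalizeYoutube(url) {
  const m = url.match(
    /(?:youtube\.com\/(?:watch\?(?:.*&)?v=|embed\/|v\/)|youtu\.be\/)([\w-]{11})/
  );
  return m ? `https://www.youtube.com/watch?v=${m[1]}` : url;
}

// Works on raw text or broken HTML: the page structure is ignored.
function findVideoUrls(text) {
  const matches = decodeEntities(text).match(/https?:\/\/[^\s"'<>]+/g) || [];
  const cleaned = matches.map((u) =>
    normalizeYoutube(u.replace(/[.,;:!?)\]]+$/, '')) // drop trailing punctuation
  );
  return [...new Set(cleaned)];
}
```

Because every Youtube variant collapses to the same string, the normalized URL can double as the key that lets metadata from several sources be combined.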
For the individual steps, I use the best available tools: Curl, Youtube-dl, FFMpeg, Sox. Curl is used to get HTML, and Youtube-dl to get videos or metadata from Youtube, Soundcloud, Vimeo, and many others. FFMpeg is a general purpose tool for processing media, which can extract screenshots, and Sox is the swiss army knife of audio processing.
Subtitles can have duplicated content - one line for each highlighted word. This resembles the classic DNA re-assembly problem in Bioinformatics - we need to find the overlaps and get the minimum set of text.
Once we do this, we also need to store text with timings, so that we can show where in a video text is.
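One simple way to approach the overlap problem is sketched below, assuming the cues have already been parsed into `{ start, text }` objects (a hypothetical shape). Each cue is compared against the words emitted so far, and only the words it adds are kept, along with the time at which they first appear.

```javascript
// Length of the longest run of words that both ends `prev` and starts `next`.
function overlapLength(prev, next) {
  const max = Math.min(prev.length, next.length);
  for (let n = max; n > 0; n--) {
    let same = true;
    for (let i = 0; i < n; i++) {
      if (prev[prev.length - n + i] !== next[i]) {
        same = false;
        break;
      }
    }
    if (same) return n;
  }
  return 0;
}

// cues: [{ start: seconds, text: string }] in time order.
// Returns [{ start, text }] holding only the new words from each cue.
function mergeCues(cues) {
  const words = []; // every word emitted so far
  const merged = [];
  for (const cue of cues) {
    const next = cue.text.split(/\s+/).filter(Boolean);
    const fresh = next.slice(overlapLength(words, next));
    if (fresh.length === 0) continue; // cue repeated earlier text only
    words.push(...fresh);
    merged.push({ start: cue.start, text: fresh.join(' ') });
  }
  return merged;
}
```

This greedy approach can drop a phrase the speaker genuinely repeats across a cue boundary, so it is a starting point rather than a complete solution.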
To make the site useful with potentially sparse metadata, I model this process after progressive enhancement. If you can only find basic metadata about a video on a pass, like the title and length, it’s still more useful than Youtube, because Youtube doesn’t let you filter by length, year, or topic. I have a script that does an upsert, combining the output of a new data processing script with existing data.
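An upsert of this kind could be sketched as below. The merge policy is an assumption for illustration: blank values never overwrite existing ones, arrays are unioned, and otherwise the newer value wins.

```javascript
// Merge a new, possibly partial, record into an existing one.
function upsert(existing, incoming) {
  const result = { ...existing };
  for (const [key, value] of Object.entries(incoming)) {
    // a pass that found nothing for a field must not erase earlier data
    if (value === undefined || value === null || value === '') continue;
    if (Array.isArray(value) && Array.isArray(result[key])) {
      result[key] = [...new Set([...result[key], ...value])];
    } else {
      result[key] = value;
    }
  }
  return result;
}
```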
- JSON.stringify / JSON.parse are blocking
- Use RabbitMQ
- Folder structure - don’t write everything to one place, or you get x0,000 folders [use hex hash]
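For the folder structure point, one common approach is to shard files by the first characters of a hash of the key. A minimal sketch, assuming the video URL is the key and SHA-1 is an acceptable hash (neither is confirmed above):

```javascript
const crypto = require('crypto');
const path = require('path');

// Two hex characters give 256 subfolders, so no single
// directory ends up holding tens of thousands of entries.
function pathForUrl(baseDir, url) {
  const hash = crypto.createHash('sha1').update(url).digest('hex');
  return path.join(baseDir, hash.slice(0, 2), hash + '.json');
}
```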
Serialization - because this process requires lambdas and Regex tests, we need to do better than serializing key/value props.
One problem is that JSON.stringify is blocking.
If you try to crawl the entire internet, you’ll get temporarily blocked by your ISP for doing DNS queries too fast. [Q: how do I handle timeouts?]
Once we have a list of videos, we obtain additional metadata:
- Youtube-dl to get title / description
- Regex to get year given
- Sox to get audio quality information (left / right balance, clipping)
- FFMpeg to extract and format subtitles
- Regex to count ums / etc
- Speaker name, bio, wikipedia if available on a conference site
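The two regex steps in the list above might look something like this. The patterns are guesses for illustration, not the ones the site uses, and `maxYear` is passed in so that large numbers in a title are not mistaken for a year.

```javascript
// First plausible four-digit year in a title or description, or null.
function extractYear(text, maxYear) {
  const years = (text.match(/\b(?:19|20)\d{2}\b/g) || [])
    .map(Number)
    .filter((y) => y <= maxYear);
  return years.length > 0 ? years[0] : null;
}

// Count filler words ("um", "uh", "erm") in a transcript.
function countFillers(transcript) {
  return (transcript.match(/\b(?:u+m+|u+h+|erm+)\b/gi) || []).length;
}
```

Dividing the filler count by the talk length gives a rate that is easier to compare across talks than the raw count.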
Each time a video is found, it is marked as coming from a specific collection - the collection is a useful concept because you can identify whether a speaker is referenced on social media, to boost their ranking.
- 1. Robots.txt: https://www.npmjs.com/package/robots-parser
- 2. http://go.whiteops.com/rs/179-SQE-823/images/WO_Methbot_Operation_WP.pdf
- 3. Cheerio.js: https://github.com/cheeriojs/cheerio
A few months ago, I launched a discovery engine for lectures, called FindLectures.com. To keep quality high, I manually select individual presentations, speakers, and video collections for inclusion — now over 125,000 talks.
Many people struggle to finish Coursera courses, so I prioritize standalone talks. Even so, a search engine full of options can be overwhelming, so I offer an email list where I send the best talks I’ve seen. I was initially unsure if anyone would care, so I didn’t write any emails until a dozen people signed up.
As it turned out, once I launched the site over 500 people signed up. This isn’t a huge list, but it’s comparable in size to the email list of a typical church or small non-profit.
I periodically get wonderful responses like this:
“I just love this newsletter! I usually find at least one of the lectures very interesting and watch it later.
Thank you so much for this =)”
Most people sign up through a pop-up in the lower right hand corner. One person wrote a blog post reviewing FindLectures.com, including a note about how this annoyed him. Another asked me to add a form so he could send a link to his coworkers and students.
So far, 3% of people who visit the site request the emails — but a surprising number fail to confirm the subscription. I’m not sure why this is. I found that the meaning of “subscribe” varies by cultural background—one gentleman wrote in concerned that he had purchased something.
I thought I might run out of material to send, but it hasn’t happened yet!
Emails go out at 9:00 AM on a Monday, and “opens” spike around this time. I suspect a lot of people watch videos over their lunch break - you can see some evidence in the spike around noon:
Recommending that someone watch a half hour video is not a neutral act. The best videos seep into your mind and change how you think. It’s difficult to know if a talk will interest a global audience — depending on your perspective, a single lecture could be too technical, fluffy, or edgy. Many people also prefer the convenience of podcasts, especially for listening in the car or at jobs without wi-fi.
When people don’t get value from the emails, they eventually unsubscribe. Typically this is visible in their usage history, by either not opening emails or not clicking links.
The email list is set up like a class - every person starts from the beginning, so these emails are still useful if I’m ever unable to continue. A nice benefit of this is that if there is a spelling error in an email, it only goes out to a few people.
Occasionally someone unsubscribes and later returns, so for the time being they will resume where they left off.
Email software removes an address if the recipient hits ‘report spam’ or the address bounces. For a list of this size, I only need 2–3 new emails per week to keep growing, at the rate people are unsubscribing.
A few friends asked if I have plans to “monetize” this project. I did include Amazon affiliate links to speakers’ books - they get a few clicks, but on average each click is worth pennies. I imagine this would be more valuable if there was a call-to-action, or if I was selling an item directly.
The following screenshot shows the clicks for an individual email:
Notice that I copied a marketing tactic from Cooper Press - URLs to talks include UTM parameters, which cause click-throughs to display in Google Analytics for the target website as an advertising campaign.
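Adding UTM parameters to a link is a small URL rewrite. A minimal sketch, where the source, medium, and campaign values are placeholders rather than the ones used in the newsletter:

```javascript
// Append the standard utm_* query parameters to a link,
// keeping any query string the URL already has.
function addUtm(url, { source, medium, campaign }) {
  const u = new URL(url);
  u.searchParams.set('utm_source', source);
  u.searchParams.set('utm_medium', medium);
  u.searchParams.set('utm_campaign', campaign);
  return u.href;
}

// Example call with placeholder values:
// addUtm('https://example.com/talk?id=1',
//        { source: 'findlectures', medium: 'email', campaign: 'weekly-01' })
```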
If you read along this far, you might be interested to know where all these people came from — there was an initial spike from The Next Web and Life Hacker, followed by smaller traffic spikes from some smaller blogs and large Facebook pages.
Below, I’ve taken screenshots of each month:
This is still a fun project, so I’m going to keep writing the emails.
I moved the list to AWeber. This has allowed me to seamlessly add a Typeform survey on signup. Several people already filled out the survey, and I learned that there is huge demand for curated software development talks.
AWeber is big on ‘personalization’ of emails, so I’m looking forward to experimenting with this aspect of the product. I’m considering adding a personalized ‘track’ to the emails, which would give me more opportunities to promote speakers directly.
“Lunch and Learn” programs are a great way for a team to develop common knowledge by watching presentations together. This can be helpful for research or training - software development teams often start lunch and learn programs to learn about the latest tools in the field. Lunch and learn programs also change the culture of a team - ideas that come from an outside party tend to have inherent credibility.
Here are some tips for ensuring your lunch and learn is a success:
Consistency is important - having the same time and location each week helps people get in the habit of attending.
If you are choosing videos, you’ll get better attendance if you cater topics to the audience. Highly technical content is suitable for a very focused team, whereas general interest content is suitable for building relationships in a heterogeneous team.
Often, the best material from a lunch and learn is the discussion at the end of a presentation. To facilitate this, it’s important to leave time.
If you’re watching a video, make sure to leave time to set up! This can be more time consuming than you’d expect.
If you don’t have lunch places nearby, have a team member get a group order. It takes extra work to coordinate, but food is usually more interesting than any video.
Use the lunch and learn program to practice talks for conferences.
Invite sales team members to demonstrate what they show to clients - this is a great way to get a different perspective on your business.
If you choose videos, preview them in advance to check for problems with the audio or speaker. Videos that discuss sensitive issues can be “not safe for work” in a mixed audience, even if the content is otherwise of high quality.
Use a shared spreadsheet to vote on lunch places or video topics
Encourage team members to develop their public speaking skills. If you have summer interns, have them work on a project and demo their work at the end of the term.
FindLectures.com started as a list of videos from our lunch and learn.
If you’re looking for a place to start, here are ten places with great collections of videos, that aren’t just TED talks:
- Gresham College - This UK university has offered public lectures for over 400 years. Many of these are fascinating historical topics.
- Startup School - This educational series by Y Combinator covers aspects of starting a tech business.
- Hacker News - Also a Y Combinator production, searching for interviews and conference talks turns up great crowd-sourced suggestions.
- InfoQ puts on sizeable tech conferences every year - they have a ton of content.
- Lanyrd - this is a social media site for conference speakers / organizers - lots of great material buried in the site.
- Confreaks.tv is a video recording company that does a lot of tech-oriented conferences.
- Google has a great lunch and learn program - why not use their videos?
- University of California TV - This is a public broadcasting channel with tons of great videos.
- UCLA Archives - UCLA maintains a fantastic collection of speeches by cultural icons from the 1960s and 1970s. I’ve found that videos by famous musicians are especially popular for company lunch and learns.
- The Chicago Humanities Festival brings in cultural icons to speak about a range of contemporary topics.
FindLectures.com is a curated list of collections of video and audio selections. Entries in a collection go through automated checks, which eliminate a talk or lower its rank.
Now that the collection has grown, more people have been asking what “curated” means. When I started FindLectures.com, I used videos selected for Wingspan’s lunch and learn program. Most came from larger collections - talks at specific conferences, university lecture series, book authors, or lectures by renowned speakers. For this site, I’ve expanded the last category to include anyone of historical significance, as these are often quite interesting, and this technique saves me from having to distinguish between truth and alternative facts.
While not all conferences are organized the same, they tend to include speakers that are a draw for attendees, or that at least allow the conference organizers to enter the social networks of respected people in their chosen field.
For universities, good commencement speakers get the institution in the news, and lecture series are a way to showcase the research of staff. In some cases, a generous alumnus will set aside money to pay for a lecture series. From what I have seen, the “vetting” of speakers for these tends to be based on seniority, but this often allows an experienced professor to explore a topic of interest.
I typically use the presence of a wikipedia page as a crude indication of influence. Having written a book for a recognized publisher is also a positive signal, although some publishers clearly do much better vetting of their authors.
There are two common types of presentation that don’t map well to this method of curation: youtube stars, and people paid to hold a specific opinion (e.g. lobbyists, some activist organizations, politicians). Both types tend to be prolific, which risks starving out better content, so when these are included, they get significant ranking penalties.
Length tends to be a good quality filter - most university Youtube channels include dozens of short interviews with students, so by filtering to talks over the 5-10 minute range, you get a significantly better experience. FindLectures.com is often used by people on their lunch break, so talks in the 20-40 minute range get a ranking boost. Similarly, for historic videos, newer videos are not ranked as highly.
Social media sites can be good sources of recommended videos, but come with some issues. Reddit tends to have a lot of bootlegged videos, for instance. If a video is recommended by multiple sources (e.g. a Hacker News link to a tech conference video), it gets a ranking boost.
Finally, videos may be removed entirely if the audio is unintelligible (Youtube does not do this), or ranked lower for significant issues (clipping, lots of ums, and so on).
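To make the combination of these signals concrete, here is a toy scoring function. Every weight and threshold in it is a made-up placeholder, the field names are hypothetical, and the real ranking is not necessarily computed this way; the point is only that boosts and penalties multiply, while some problems remove a talk outright.

```javascript
// Returns null when a talk should be dropped, otherwise a rank multiplier.
// All numbers below are illustrative placeholders.
function rankAdjustment(talk) {
  if (talk.audioUnintelligible) return null; // removed entirely
  const minutes = talk.lengthSeconds / 60;
  if (minutes < 5) return null; // short clips are filtered out

  let score = 1.0;
  if (minutes >= 20 && minutes <= 40) score *= 1.5; // lunch-break length
  if ((talk.sources || []).length > 1) score *= 1.2; // recommended by multiple sources
  if (talk.paidOpinion || talk.youtubeStar) score *= 0.5; // prolific categories
  if (talk.clipping) score *= 0.8; // audio quality issue
  return score;
}
```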
Like software architecture, there is some structure around this - this is like debugging.
Drip emails set up like a course.
Google analytics - tracks usage, search terms, facet usage
- HN Comments (lot of individual feedback on missing talks)
- Reddit posts in /r/startups (being found by writers)
- dev.to (90k twitter list)
- Something that probably will work is Cooper Press (weekly dev emails - they do a lot from reddit posts)
This shows changes after adding ‘cards’ about speakers:
- The feature seems really instructive, to me. Getting data was not difficult, as most of the people who are famous enough to end up here are the first results in book searches already.
- Data comes from Open Library. There is an opportunity to use Wikidata to get some additional trivia, but that will require learning their arcane query language.
- These could also be useful as topic cards (books on Python, etc).
- These changes are to make the site look better, so that it’s more credible.
- Semantic UI is mostly compatible with Bootstrap, so I was able to get this to work piecemeal
- Semantic UI has a lot more components than Bootstrap - I chose it because it had cards. A lot of the “Bootstrap alternatives” or themes have obvious rendering defects on their sites.
- The footer looks much more professional than before.
- Cards are more complex than I expected - you have to establish rules for when to show them (e.g. I filtered to a speaker, all the talks from a search result are by one speaker, I searched for a speaker by name)
- I set it up so that hovering over a speaker name brings up the card, like Rapportive. There is a bit of code to ignore the case where your mouse moves fast across the screen, catching the hover in the process.
- I tried to ‘sticky’ the cards. I had to do a custom implementation - I think you need to convert the whole site for this Semantic UI feature to work. It also would otherwise require jQuery. Right now there are some issues, e.g. if you get a big card and scroll to the bottom, it can overlap the footer.
- Introducing Semantic UI made the search results look better, but in some cases the rendering is glitchy, especially on a Mac. The play buttons in headers don’t wrap correctly, and the vertical alignment is wrong for some of the metadata (speaker names).
This shows the search suggestion box after adding Semantic UI:
- This is based on the top prior searches - a fixed list of around 400.
- It pulls suggestions doing ‘starts with’.
- There is an opportunity to use full text search (e.g. lunr) to make that better.
- The search box is from Semantic UI, and I think it looks much more professional than the original Bootstrap one.
- People have some trouble finding download buttons.
- Some people have suggested it could be “slicker”
- I think the facet changing is distracting, because it refreshes too fast.
- The inline help doesn’t render well.
- No one uses the inline player (Reddit has this, but their icons are larger)
- A couple people have suggested “tabs” for categories
- Maybe show some options for things you can search for (to help with discovery)
- I’d like a version where the search box is tied to the top of the screen
- This doesn’t show what matched your query, or why you should watch a talk
- Show talks from a variety of topics (novel sorting technique)
- Show talks from speakers of a different background
- Include historical talks as a category, as well as personal interviews
- Topics assigned by a machine (lets you browse stuff you didn’t know was there)
- Year filtering
- Speaker name filtering
Slides: index one per 10 seconds (e.g.)
Closed captions: convert vtt/srt to text (eliminate overlap)
To actually do searching, need times in the text
Closed captions are lists of timings and the text that is displayed in that time.
Solr is not made to store complex data formats, so the easiest way to store
caption files is a string array. You can easily make it so that people can
search the caption and not the times, by adding an analysis step to the
This approach won’t pick up matches that extend over several lines.
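As an illustration of the string-array idea, the sketch below flattens a WebVTT file into one string per cue, with the start time as a prefix. The `"<start> <text>"` layout is an assumption; any format works as long as the search-side analysis can strip the time back off.

```javascript
// Convert WebVTT text into ["00:00:01.000 caption text", ...].
function vttToLines(vtt) {
  const lines = [];
  // Cues are separated by blank lines; the first block is the WEBVTT header.
  for (const block of vtt.replace(/\r/g, '').split(/\n{2,}/)) {
    const rows = block.split('\n');
    const t = rows.findIndex((row) => row.includes('-->'));
    if (t === -1) continue; // header, NOTE, or STYLE block
    const start = rows[t].split('-->')[0].trim();
    const text = rows
      .slice(t + 1)
      .join(' ')
      .replace(/<[^>]+>/g, '') // drop inline markup such as <c> tags
      .trim();
    if (text) lines.push(`${start} ${text}`);
  }
  return lines;
}
```

Each array entry then maps back to a position in the video, which is what makes it possible to show where a matched phrase occurs.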