FindLectures.com is a discovery engine for talks, including historic speeches and interviews. The site organizes audio and video content by topic, speaker, and time period. I’ve included videos of every U.S. President from Theodore Roosevelt on, some hard-to-find speeches by post-colonial African leaders, talks by activists recorded in the 70s at UCLA, and interviews from libraries (via DPLA). This essay discusses lessons learned from crawling and organizing historical videos.
The earliest audio recordings date from the late 1800s [citation], and the earliest video from the early 1900s [citation] - the first recorded speech by a U.S. President is from Woodrow Wilson [citation]. Most videos from this time were silent, with still frames containing a caption - this text is significantly easier to extract than the text in modern videos of a speaker with their slides, although the typeface is unusual.
[graph length vs. time]
If you survey popular YouTube channels, many of them have hundreds of short videos, so when I built FindLectures.com, I looked for longer talks that would be suitable for someone to watch on their lunch break (15-45 minutes). With older material this is hard to do, because early recording media such as wax cylinders and silent film didn’t lend themselves to long content.
As recording technologies improved, recordings got much longer. It’s worth noting that online video didn’t take off until YouTube, which launched in 2005, became the primary destination for consumption [citation].
Looking at U.S. Presidential archives, Obama was the first president to target YouTube as a primary video destination - the George W. Bush library claims to have tens of thousands of hours of VHS content that is not indexed. Presidential libraries outsource curation work to the National Archives, but the libraries provide funding - how many videos are eventually digitized depends on the popularity of the president, and how much they care about historical preservation. There are also numerous small museums with video collections that don’t have the resources to digitize them (e.g. the Ayn Rand Library or the Foxfire library).
In a few cases, major libraries digitize a significant portion of their collections - the Digital Public Library of America is helping a lot in this area. When digitization is done by what amount to fan clubs of the individual, like presidential libraries, the results tend to include only videos that help the individual’s image (try finding videos about Bill Clinton’s extracurricular activities in his library, for instance).
Many important historical figures from the era of video are just not available online, because they are not well-known enough. Many countries that gained independence from colonial rule after WWII had important leaders who are rarely available online unless they are really popular. For example, Haile Selassie, the Ethiopian leader, is available online because of the popularity of Rastafarianism and his involvement in the League of Nations, but other English-speaking leaders are harder to find - to the best of my knowledge, the first leader of Ghana (Kwame Nkrumah) is unavailable.
For older videos, categorization can be tricky, because modern video production styles didn’t exist.
Documentaries are a modern video format [insert definition]. Precursors to this style were actualities and [insert] - actualities are simply footage of everyday life, more like the videos people record on camera phones and post to social media.
When I built the first version of FindLectures, I used the IBM Watson API to tag videos by topic, and to find countries referenced in a video. The most common topic tagging systems use one of two taxonomies: the IAB taxonomy (ad industry) or [insert news format], which is used for newspaper tagging. The ad industry taxonomy is the richest (five layers), but it has some categories which are really strange for text.
Libraries use several taxonomies for books which initially appear to be promising alternatives. The Dewey Decimal system is popular, but requires paid licensing, and the licensing is not priced to accommodate my use case. There is a book store tagging system which is promising [citation], and is more “user friendly.” There is also a European book cataloging system, which is licensed freely, but lacks depth in technical areas (some companies sell add-ons for legal or medical taxonomies, for instance - I had to build my own for software). The Library of Congress system is fantastically rich, but in order to use it effectively, one would need a ton of text to train each category, or a way to map videos into a subset. Essentially, there is no way to train an AI against it without the text of hundreds of books for each category.
Tools which do entity recognition can often tag countries, but they assume a “modern” concept of countries and a list that more or less matches what is on a world map today, although they may deal well with disputed territories. Prior to the modern diplomatic system, and during colonialism, countries differed from today’s in definition, name, and boundaries.
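One way to work around this is a small hand-built gazetteer that maps a historical name, plus the year of the recording, to the present-day country a talk should be filed under. The sketch below is an illustration only, not how FindLectures does it: `HistoricalName`, `GAZETTEER`, and `resolve_country` are hypothetical names, and the four entries are a tiny sample.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HistoricalName:
    name: str         # name as it appears in a transcript or entity tag
    last_year: int    # last year the name was in official use
    modern_name: str  # present-day country to file the talk under


# Hypothetical sample; a usable gazetteer would need many more entries.
GAZETTEER = [
    HistoricalName("Gold Coast", 1957, "Ghana"),
    HistoricalName("Ceylon", 1972, "Sri Lanka"),
    HistoricalName("Upper Volta", 1984, "Burkina Faso"),
    HistoricalName("Zaire", 1997, "Democratic Republic of the Congo"),
]


def resolve_country(tagged_name: str, recorded_year: int) -> Optional[str]:
    """Map a historical name to a modern country, or return None.

    The mapping is only applied when the recording predates the rename,
    so a modern mention of a reused name is left for manual review.
    """
    wanted = tagged_name.strip().lower()
    for entry in GAZETTEER:
        if entry.name.lower() == wanted and recorded_year <= entry.last_year:
            return entry.modern_name
    return None
```

Checking the recording year is a deliberate design choice: a 1950 speech that mentions "Gold Coast" almost certainly means the colony that became Ghana, while the same phrase in a recent recording may refer to somewhere else entirely.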
Language curation is tricky, because closely related languages are hard to tell apart - I found the accuracy of language detection among Romance languages such as Spanish, Portuguese, and French to be poor, for instance. Tools that do language identification are also biased towards European languages, and not trained for native texts.
Presenting historical information is also a challenge in several ways. Titles have specific cultural meaning - it means a different thing to be a Baptist bishop than a Catholic one, and many countries have certifications that are official titles, which do not translate well. For this site, I’ve tended to remove titles, for egalitarian purposes, and to make it harder to discriminate for or against someone on this basis.
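Removing titles can be as simple as stripping a known list of honorifics from the front of a speaker name. This is a minimal sketch under that assumption; the `TITLES` list and the `strip_titles` helper are hypothetical, and a real collection would need a longer, culturally broader list.

```python
import re

# Hypothetical, deliberately short list of honorifics; extend per collection.
TITLES = ["Dr", "Prof", "Professor", "Rev", "Reverend", "Bishop",
          "President", "Senator", "Sir"]

# One or more leading titles, each optionally followed by a period.
# Longer titles are listed first so "Professor" is tried before "Prof".
_TITLE_PREFIX = re.compile(
    r"^(?:(?:" + "|".join(sorted(TITLES, key=len, reverse=True)) + r")\.?\s+)+",
    flags=re.IGNORECASE,
)


def strip_titles(name: str) -> str:
    """Remove leading honorifics from a speaker name."""
    return _TITLE_PREFIX.sub("", name.strip())
```

Because each title must be followed by whitespace, a first name that merely starts with the same letters (such as "Drew") is left alone.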
Years are also interesting - for presidential talks I’ve fixed the facets so they are chronological. This lets you learn things about the talks just by seeing the lists over time.
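Search facets are usually ordered by how many results each value has, which would put the most-recorded president first. A rough sketch of chronological ordering, assuming each talk carries a speaker and a recording year (the `talks` records and the `chronological_facets` function are made up for illustration and are not the site's actual code):

```python
from collections import Counter

# Hypothetical records: (speaker, year the talk was recorded).
talks = [
    ("Barack Obama", 2009),
    ("John F. Kennedy", 1961),
    ("Barack Obama", 2012),
    ("Franklin D. Roosevelt", 1933),
    ("Barack Obama", 2015),
    ("John F. Kennedy", 1962),
]


def chronological_facets(records):
    """Return (speaker, count) facet entries ordered by earliest talk year."""
    counts = Counter(speaker for speaker, _ in records)
    first_year = {}
    for speaker, year in records:
        # Track the earliest year seen for each speaker.
        first_year[speaker] = min(year, first_year.get(speaker, year))
    return [(s, counts[s]) for s in sorted(counts, key=first_year.get)]
```

With the sample data, ordering by count would list Obama first, while `chronological_facets(talks)` lists Roosevelt, then Kennedy, then Obama.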
Amazon retains author identifiers, but these are managed and created by publishers, so dead presidents from earlier eras do not have their metadata managed particularly well, and you have to do it yourself.
Many libraries have tons of cool artifacts, which they’ve digitized into obscure formats, like RealPlayer, or custom applications that are a pain to scrape and have no metadata.
You get a lot more metadata from Wikipedia: rich bios about speakers’ lives and what they were involved in, plus measures of controversy and influence. This makes it possible to find people with similar material in their bios, and to correct for gaps in what I learned in school.
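As one possible way to find people with similar bios, the sketch below compares the sets of words in two bios using Jaccard similarity (shared words divided by total distinct words). This is a stand-in technique chosen for brevity, not a description of what the site does; the function names and the placeholder bios are hypothetical.

```python
import re


def tokens(bio: str) -> set:
    """Lowercase a bio and split it into a set of alphabetic words."""
    return set(re.findall(r"[a-z]+", bio.lower()))


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of two bios: |shared words| / |all words|."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def most_similar(target: str, bios: dict) -> list:
    """Rank every other speaker by bio similarity to `target`, best first."""
    scores = [(name, jaccard(bios[target], text))
              for name, text in bios.items() if name != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Placeholder bios, not real people.
bios = {
    "Speaker A": "civil rights activist and lawyer",
    "Speaker B": "labor rights activist and organizer",
    "Speaker C": "physicist and science educator",
}
```

Common words such as "and" inflate the score here, so a real version would likely drop stop words or weight terms by rarity.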
to talk about: cultural sensitivity, accent folding, titles