Building a Crawler in Node.js


Introduction

FindLectures.com is a discovery engine for tech talks, historic speeches, and academic lectures. The site rates audio and video content for quality, showing different recommended talks each day on a variety of topics.

[Screenshot: filtering by topic]

FindLectures.com crawls conference sites to get talk metadata, such as speaker names and bios, descriptions, and the date a video was recorded. Often these attributes are sparsely populated, or spread across multiple websites. Additional attributes are inferred from audio and video content, but this requires more sophisticated data extraction to be useful in a text-oriented search engine like Solr.

[Screenshot: filtering by year]

This essay will discuss interesting lessons learned from crawling historical videos, demonstrate information extraction with machine learning, and show how to map real-world problems to search engine functionality.

Design

In 2015, VentureBeat reported that WordPress powered 25% of the internet [6]. The rise of content management systems means sites are typically structured in a formulaic way, which makes them easy to scrape.

Building a custom crawler is not the most efficient way to obtain data, but few alternatives exist, and obtaining data directly from websites is getting easier. Well-funded sites often offer APIs to deter scraping, but these tend to impose severe restrictions on what data you can obtain. Very few sites have APIs that are easy to use (notable exceptions are the DPLA [9] and Confreaks.tv [10]).

[Screenshot: filtering by speaker]

In the future, more sites will incorporate structured data [7], which tags information directly within a page. Major search engines are driving this change - Google offers custom search result rendering for recipes and reviews, for instance, and seems to be adding more search integrations as sites adopt the technology.

[Screenshot: several pieces of data extracted from sites]

FindLectures.com crawls one site at a time, which limits the amount of data it can collect, but allows for high-quality metadata and keeps spam videos out. The crawler extracts information from different types of content: HTML, text, video, and audio. Additionally, talks are the central concept, so if a talk is mentioned by multiple sites, new metadata is merged as it is found.

Search engines maintain many attributes that affect ranking, but which are hidden from end users. Google claims to consider over 200 ranking factors, for instance [5]. YouTube weights popularity heavily, which allows manifestos and conspiracy theories to crowd out legitimate content. A key decision for FindLectures.com was to favor quality content in search results, while also discouraging homogeneity of topics (i.e. ideological diversity for political videos, a range of difficulty levels for science videos).

Data Structure

Each video is stored in a single JSON file. All files are checked into a git repository.

This makes it easy to test a change by inspecting file diffs:
[Screenshot: a file diff]

This also gives you a simple way to track changes that need to be published to your Solr index:

git diff --name-only | grep json/

JSON files are easy to handle in most programming languages (a notable exception is Scala's Play framework, which does not follow conventional JSON formatting). Multi-language support is important to me - it allows using the best libraries from any language. There are many good machine learning tools for Python, for instance, but the search engine runs on the JVM.

In the future, Apache Arrow [11] may be a compelling alternative to JSON.

Some miscellaneous notes on data types:

  1. FindLectures.com uses Solr’s auto-field syntax - with a special naming scheme, Solr knows the type of every attribute, so you don’t have to change Solr configs (see the example document after this list).
    1. A field ending in “_s” is a string; “_i” is an integer.
    2. A field ending in “_ss” is an array of strings (e.g. speaker list, topic list).
  2. Fields can be hierarchical (e.g. collection, topic).
  3. Fields can be integers or floating point (length, year given).
  4. Many fields are sparsely populated.
  5. Fields can have uncertainty associated with them. When the search engine is populated, this is resolved in a step that computes a ranking boost.
  6. Attributes can correspond to features in the product (e.g. “can_play_inline”, “speaker_has_books_for_sale”). This is common - e-commerce sites include fields like gross margin in ranking [4].
  7. Descriptions, transcripts, and closed captions are run through the IBM Watson API to obtain additional attributes.
  8. Speakers typically write their own bios for conferences, so you get high-quality biographical information (e.g. preferred gender pronouns).
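
For instance, a single talk’s JSON file might look roughly like this (the values, and any field names not used elsewhere in this essay, are illustrative):

{
  "title_s": "An Example Lecture",
  "url_s": "https://www.gresham.ac.uk/watch/example-talk",
  "speakerName_ss": ["Jane Example"],
  "topic_ss": ["Science", "Science/Physics"],
  "collection_ss": ["Gresham College"],
  "talk_year_i": 2016,
  "length_i": 3540,
  "description_s": "A placeholder description."
}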

Crawler Design

Conceptually, building a crawler is simple (a minimal sketch of the loop follows this list):

  1. Load the site’s robots.txt file, parse it, and test future URLs against it
  2. Load a starting page, or a sitemap to obtain a list of URLs
  3. Parse any desired metadata from the page
  4. Note any links on the page, filtering out previously seen URLs
  5. Load remaining pages, in sequence
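
Here is a minimal sketch of that loop. The helper functions are hypothetical stand-ins for the steps above, each of which is covered in more detail later in this essay:

// Hypothetical helpers, one per step above
declare function isAllowed(url: string): boolean;              // step 1: robots.txt check
declare function fetchPage(url: string): Promise<string>;      // e.g. shell out to curl
declare function extractMetadata(page: string): object;        // step 3: jQuery-style selectors
declare function extractUrls(page: string): string[];          // step 4: links on the page
declare function saveMetadata(data: object): void;             // e.g. write a JSON file

async function crawlSite(startUrl: string) {
  const seen = new Set<string>();
  const queue: string[] = [startUrl];      // step 2: a starting page or sitemap
  while (queue.length > 0) {
    const url = queue.shift()!;
    if (seen.has(url) || !isAllowed(url)) {
      continue;
    }
    seen.add(url);
    const page = await fetchPage(url);
    saveMetadata(extractMetadata(page));
    for (const link of extractUrls(page)) { // step 5: load remaining pages in sequence
      if (!seen.has(link)) {
        queue.push(link);
      }
    }
  }
}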

FindLectures.com uses a custom crawler written in TypeScript. While JavaScript is a great platform for working with the DOM or JSON, Node.js lacks robust libraries for basically everything else - handling HTTP, processing video, and processing audio are all painful. For these tasks I use command-line tools, with TypeScript as the glue code. This also prevents lock-in to specific libraries.

Execute a shell command in Node.js:

const child_process = require('child_process');

function runCommand(command: string, cb: (stderr: string, stdout: string) => void) {
  let out = '';
  let err = '';
  const ran = child_process.exec(command, () => {
    // Hand the accumulated output back to the caller once the process exits
    cb(err, out);
  });
  ran.stdout.on('data', (data) => {
    out += data;
  });
  ran.stderr.on('data', (data) => {
    err += data;
  });
}

Regardless of language, crawling involves hard problems: handling HTTP (especially HTTPS), parsing HTML, or working with unknown video/audio formats. Any of these “standards” could take years to handle in a robust way, so it’s important to choose strong libraries.

Currently, I’m using several fantastic and well-maintained command-line tools, which are available for basically any OS: curl (HTTP), sox (audio), ffmpeg (video decoding), and youtube-dl (obtaining video titles, descriptions, and subtitles).
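
For example, fetching a video’s title through the runCommand wrapper above might look something like this (the URL is a placeholder, and error handling is omitted):

runCommand(
  'youtube-dl --get-title "https://www.youtube.com/watch?v=VIDEO_ID"',
  (stderr, stdout) => {
    // youtube-dl prints one title per line for each URL it was given
    const title = stdout.trim();
    console.log('title:', title);
  }
);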

To scale the process, I store the list of URLs to crawl in a RabbitMQ queue, which allows new worker processes to pull from it. This also aids reliability - if a process crashes or runs out of RAM, it can simply be restarted.

The crawler is designed around the principle of progressive enhancement. Each content type runs at a different pace (crawling, video metadata from YouTube, analysis from Watson, audio processing, and video processing). Each successive stage is substantially slower, so each gets its own script, allowing intermediate results to be uploaded and useful while later stages run. Additionally, metadata on the same video found by crawling multiple sites is merged.

To support the principle of progressive enhancement, the crawler implements an upsert routine, which recognizes the many different forms a URL to the same video can take.

Upon completion, results are pushed to a Solr index. The index is rebuilt from scratch periodically, for instance each time the ranking algorithm changes.
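
Pushing a batch of JSON documents into Solr can be done with the same curl-as-glue approach (here the core name “talks” and the file “batch.json” are placeholders):

runCommand(
  'curl -s -H "Content-Type: application/json" ' +
  '--data-binary @batch.json ' +
  '"http://localhost:8983/solr/talks/update?commit=true"',
  (stderr, stdout) => {
    // Solr replies with a small JSON status document
    console.log(stdout);
  }
);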

If you research scraping, you’ll find many commercial and open source tools. A very popular Python library is BeautifulSoup [8]. Before I wrote a JavaScript crawler, I wrote about two dozen scrapers with this library. For these scripts I would pull content from the site with curl, then parse it with Beautiful Soup. With practice I could write a scraper in an hour or two, but I would occasionally hit a site with HTML malformed in a way that caused BeautifulSoup to completely choke.

[Screenshot: my Bitbucket account - lots of scraping utilities]

Parsing HTML

Recently, a security firm uncovered a major botnet that automated fake page views on video ads to generate ad revenue. This botnet was written in Node.js, and used a lot of libraries that you might consider in a crawler [2][3].

In this application, there are two types of URLs we care about: links to video or audio files, and links to other pages on the current site.

Video links can be iframes for embedded content, hrefs, or pure text (this is sometimes true in forums).

Video URLs will always be absolute URLs, so the easiest way to find all of them is to run a regular expression over the page contents. This has some interesting side effects - you will even find video links in JavaScript comments.

const urlRegex = require('url-regex');

// Hacker News does some odd encoding, so decode HTML entities first
// (decodeHTMLEntities is a small helper defined elsewhere in the crawler)
const decodedPage = decodeHTMLEntities(page);
const urls = decodedPage.match(urlRegex()) || [];

From there, we can apply a filter to decide whether each URL is one we want to follow. If a site has speaker pages, for instance, you may wish to treat them differently from video pages.

const videoUrls = urls.filter(
  (url) =>
    url.indexOf("https://www.gresham.ac.uk/watch/") >= 0 ||
    url.indexOf("https://www.gresham.ac.uk/lectures-and-events/") >= 0
);

We need to do a second pass over the page to get metadata. To do this, we’ll apply a list of jQuery selectors. Since jQuery is a browser library, we’ll use the excellent cheerio library, which gives us a jQuery-style API outside the browser.

const cheerio = require('cheerio');
const $ = cheerio.load(str);

The reason I like this technique is that basically every website includes jQuery, so you can usually write expressions for all the attributes you want to scrape in the browser console, although the jQuery API is a bit idiosyncratic if you’re used to ‘modern’ JavaScript.

$('ul.speaker-list li').map(function() { return $(this).text() }).get()

We can put all of this together and build a full configuration object which, when applied, gives us every interesting piece of metadata about a video:

{
  speakerName_ss: () => {
    return $('.speaker-name').map(
      function() {
        return $(this).text().trim()
      }
    )
  },
  audio_url_s: () => {
    return $('.audio-player a').attr('href')
  },
  transcript_s: () => {
    return $('.transcript-text-content p[style*="text-align: justify"]').text()
  },
  talk_year_i: () => {
    return $('.sidebar-block-header')
      .text()
      .match('\\b\\d\\d\\d\\d\\b')[0]
  },
  tags_ss: () => $('.tags a').map(
    function() {
      return $(this).text().trim()
    }
  ),
  description_s: () => $('.copy p').text(),
  url_s: url_s,
  collection_ss: ['Gresham College']
}

Before we decide to go to the next page, we’ll want to check against robots.txt, to ensure the site owner is ok with us crawling.

const robotsParser = require('robots-parser');

const robots = robotsParser('http://www.example.com/robots.txt', [
  'User-agent: *',
  'Disallow: /dir/',
  'Disallow: /test.html',
  'Allow: /dir/test.html',
  'Allow: /test.html',
  'Crawl-delay: 1',
  'Sitemap: http://example.com/sitemap.xml',
  'Host: example.com'
].join('\n'));

robots.isAllowed(
  'http://www.example.com/test.html',
  'FindLectures.com/1.0'
);

Upsert

To make the site useful even with sparse metadata, I model this process after progressive enhancement. If a pass can only find basic metadata about a video, like the title and length, that is still more useful than YouTube, which doesn’t let you filter by length, year, or topic. A separate script performs the upsert, combining the output of each new data-processing script with the existing data; a sketch of the URL normalization it relies on follows the list below.

URLs pointing to the same video can come in many forms:

  m.youtube.com
  www.youtube.com
  youtu.be (the YouTube URL shortener)
  youtube.com?123&list=
  youtube.com?123
  youtube.com&v=123
  “You should visit youtube.com/123…” (a text comment on a forum)
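
A sketch of one way to normalize these, assuming we only need a stable key for YouTube-hosted videos (the regular expression and function name are illustrative):

// Reduce the many YouTube URL variants to a single video id, so records
// crawled from different sites can be merged under one key.
function canonicalVideoKey(url: string): string | null {
  const match = url.match(/(?:youtube\.com\/watch\?(?:.*&)?v=|youtu\.be\/)([\w-]{6,})/i);
  return match ? 'youtube:' + match[1] : null;
}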

Subtitles

SRT subtitle files interleave sequence numbers, timing lines, and caption text; this helper strips the numbers and timings to leave a plain transcript:

function parseSrt(text) {
  let lines = text.split("\n");
  // Timing lines look like "00:01:02,000 --> 00:01:05,000"
  let matchBreak = /\d\d:\d\d:\d\d,\d\d\d --> \d\d:\d\d:\d\d,\d\d\d/i;
  // Sequence numbers are lines containing only digits
  let aNumber = /^\d+$/;
  let transcript = "";
  for (let i: number = 0; i < lines.length; i++) {
    let line = lines[i];
    if (!line.match(matchBreak) && !line.match(aNumber)) {
      transcript += ' ' + line;
    }
  }
  return transcript.replace(/\s+/ig, ' ');
}

Subtitles can contain duplicated content - one line for each highlighted word. This resembles the classic DNA re-assembly problem in bioinformatics: we need to find the overlaps and keep the minimal set of text.

function findOverlap(a, b) {
  // Find the longest prefix of b that also appears at the end of (or within) a
  if (b.length === 0) {
    return "";
  }
  if (a.endsWith(b)) {
    return b;
  }
  if (a.indexOf(b) >= 0) {
    return b;
  }
  return findOverlap(a, b.substring(0, b.length - 1));
}

function filterDuplicateText(lines) {
  for (var i = 0; i < lines.length; i++) {
    lines[i] = lines[i].trim();
  }
  let idx = 0;
  let text = lines[0];
  while (idx < lines.length - 1) {
    let overlap = findOverlap(lines[idx], lines[idx + 1]);
    if (overlap.length >= 5) {
      // The next line repeats the end of this one - append only the new part
      let nonOverlap = lines[idx + 1].substring(overlap.length);
      if (nonOverlap.length > 0) {
        text += ' ' + nonOverlap;
      }
    } else {
      text += ' ' + lines[idx + 1];
    }
    idx++;
  }
  return text;
}

Once we do this, we also need to store the text along with its timings, so that we can show where in a video a given piece of text occurs.

function captionSearch(lines) {
  let line0 = /^\d+$/;                                  // sequence number
  let line1 = /^\d+:\d+:\d+,\d+ --> \d+:\d+:\d+,\d+$/;  // timing line
  let seen = {};
  // A small state machine: sequence number -> timing -> text -> back to number
  let states = ["line0", "line1", "text"];
  let processors = [x => null, x => x, x => x];         // what to keep from each line
  let nexts = [x => !!x.match(line0), x => x.match(line1), x => x === ''];
  let skips = [x => false, x => false, x => { let res = seen[x]; seen[x] = true; return res; }];
  let transitions = [1, 2, 0];
  let idx = 0;
  let stateIdx = 0;
  let result = [];
  let thisRow = [];
  while (idx < lines.length) {
    let line = lines[idx].trim();
    let thisLineResult = processors[stateIdx](line);
    if (thisLineResult !== null && thisLineResult !== "") {
      if (!skips[stateIdx](thisLineResult)) {
        thisRow = thisRow.concat(thisLineResult);
      }
    }
    if (nexts[stateIdx](line)) {
      stateIdx = transitions[stateIdx];
      if (stateIdx === 0) {
        // A blank line ends the caption block - emit the timing plus its text
        const allText = thisRow.join(' ');
        if (!allText.match(line1)) {
          result.push(allText);
        }
        thisRow = [];
      }
    }
    idx++;
  }
  return result;
}

Using RabbitMQ

If you build a crawler, chances are you’ll eventually want to run more worker processes. Using RabbitMQ to store the work list is a great way to do this.

If you come from a non-JavaScript background, it’s really tempting to write code like this:

writeFileSync("json/1/" + itemNumber + ".json",
  JSON.stringify(result, null, 2));

While this is a simple implementation, it has two potentially slow blocking calls (writeFileSync and JSON.stringify), and only the file write can be converted to an async operation.
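
A sketch of the non-blocking version of the write, using fs.promises (the path is the same placeholder as above):

const fsp = require('fs').promises;

async function writeResult(itemNumber, result) {
  // JSON.stringify still runs on the main thread, but the disk write no longer blocks
  const body = JSON.stringify(result, null, 2);
  await fsp.writeFile("json/1/" + itemNumber + ".json", body);
}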

You’ll recall from above that the crawler for each site is a series of lambdas or regular expressions, as well as static data:

{
  speakerName_ss: () => {
    return $('.speaker-name').map(
      function() {
        return $(this).text().trim()
      }
    )
  },
  collection_ss: ["Gresham College"]
}
This object is basically a template for the site data, but since it’s JavaScript and not JSON, it is more difficult to store in RabbitMQ. Rather than serializing it with JSON.stringify, we can use a fantastic library named serialize-javascript [13], developed at the late Yahoo!.

Code serialized this way can be revived using “eval”, which is safe only because we wrote these lambdas ourselves. For the sake of my own sanity, I include GUIDs in the payloads sent to RabbitMQ, which you can see in the code example:

const serialize = require('serialize-javascript');
const uuidV4 = require('uuid/v4');

const message = {
  messageId: uuidV4(),
  data: data
};
const messageData = serialize(message, { space: 2 });

// ch is an amqplib channel opened elsewhere
ch.sendToQueue(queueName, Buffer.from(messageData), { persistent: true });
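
On the consuming side, a worker can revive the payload with eval (a sketch, assuming the same amqplib channel ch and queue name as above):

ch.consume(queueName, (msg) => {
  if (msg === null) {
    return;
  }
  // Wrap in parentheses so eval treats the payload as an expression
  const message = eval('(' + msg.content.toString() + ')');
  console.log('processing message', message.messageId);
  // ... run the crawl work described by message.data ...
  ch.ack(msg);
});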

Reliability

TypeScript adds types to JavaScript - it’s easy to incorporate gradually, and adds some linting features as well.
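
For instance, a minimal (and purely illustrative) type for the per-site configuration objects shown earlier - each field is either static data, a regular expression, or a selector function run against the loaded page:

interface SiteConfig {
  // Keys follow the Solr auto-field naming scheme, e.g. speakerName_ss
  [fieldName: string]: string | string[] | RegExp | (() => any);
}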

Even though this system has no pre-defined schema, I found it valuable to write a script that does some basic validation. This includes a list of fields that can’t be blank, and verifying that values match the types implied by the Solr field names (i.e. a field named “speakerName_ss” should be an array of strings).
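
A rough sketch of that validation pass (the required-field list is illustrative, not the real configuration):

const requiredFields = ['title_s', 'url_s', 'collection_ss'];

function validate(doc) {
  const errors = [];
  for (const field of requiredFields) {
    if (doc[field] === undefined || doc[field] === '') {
      errors.push('missing required field ' + field);
    }
  }
  for (const key of Object.keys(doc)) {
    // Check that values match the type implied by the Solr suffix
    if (key.endsWith('_ss') && !Array.isArray(doc[key])) {
      errors.push(key + ' should be an array of strings');
    }
    if (key.endsWith('_i') && typeof doc[key] !== 'number') {
      errors.push(key + ' should be a number');
    }
  }
  return errors;
}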

Coding in a functional style makes it easier to spot defects - side-effecting code can change behavior far from where it acts. Seamless-immutable [12] is a great library for flushing out this kind of defect.

On the file system, you’ll need to watch how much content you download. If you put tens of thousands of files or folders in one location, querying the file system gets incredibly slow (on NTFS, at least). A simple fix is to hash part of the file name into a short hex code (e.g. A1BDE1) and use it as an intermediate layer of folders.
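
A sketch of that approach (the hash choice and prefix length are arbitrary):

const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Shard files into subfolders keyed on the first few hex characters of a hash,
// so no single directory accumulates tens of thousands of entries.
function shardedPath(baseDir, fileName) {
  const prefix = crypto.createHash('md5')
    .update(fileName)
    .digest('hex')
    .slice(0, 6)
    .toUpperCase();
  const dir = path.join(baseDir, prefix);
  fs.mkdirSync(dir, { recursive: true });
  return path.join(dir, fileName);
}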

For Node, you almost always want to use graceful-fs, which backs off and retries when it encounters the dreaded “too many open files” error. The library can easily monkey-patch itself into Node’s fs module:

const realFs = require('fs');
const gracefulFs = require('graceful-fs');
gracefulFs.gracefulify(realFs);

If you build a crawler and run it on your home network, you will likely find your DNS queries being rate-limited (thanks, FiOS!).


[8] https://www.crummy.com/software/BeautifulSoup/bs4/doc/
[9] https://dp.la/
[10] Confreaks.tv
[11] https://arrow.apache.org/
[12] https://github.com/rtfeldman/seamless-immutable
[13] https://github.com/yahoo/serialize-javascript
