Query IMDb with DuckDB
Recently a friend shared a link to a fantastic microsite all about movies where characters say the name of the movie… in the movie. A phenomenon my friends and I have long dubbed “The Faceoff”. The authors did a great job with their research, process, visualizations, and just being awesome all around. No notes.
Some of the data that went into the site came from IMDb’s non-commercial datasets, which I had not previously known about, and was curious to poke around the data myself.
With the recent SQLite renassiance, my first idea was to to use SQLite’s built-in CSV support to transform the data into a queryable format. However I quickly noticed IMDb’s data was in TSV (Tab-separated values) format, and it was unclear how to import that into SQLite without bringing in additional tools.
Then I remembered DuckDB, which I had filed away in my
head as the SQLite of OLAP databases. I figured if anything can import a .tsv
file out of the box, it has to be DuckDB! Well spoilers, it can’t exactly, but
you can still get it to work.
Importing
After installing the DuckDB binary (I used Nix), I needed
to figure out how to get DuckDB import IMDb’s .tsv
files. After some trial and
error, turns read_csv
works just fine after overriding some of the built-in
options. I ended up defining a macro for this:
CREATE TEMP MACRO read_tsv(
source,
columns_ := NULL
) AS TABLE SELECT * FROM read_csv(
source,
delim = '\t',
nullstr = '\N',
header = true,
escape = NULL,
quote = NULL,
columns = columns_
);
TEMP
means the macro only is only available in the current connection, and
doesn’t persist once it is closed. This is exactly I ended up gathering this and
the rest of the statements into the same import.sql
file and piping it into
duckdb
in one go. I wouldn’t need the macro afterwards.
Next, we can use the macro to start bringing in the data, starting with the various “names”, or people involved with creative works.
CREATE TABLE IF NOT EXISTS names AS SELECT * FROM read_tsv(
'https://datasets.imdbws.com/name.basics.tsv.gz',
columns_ := {
'id': 'TEXT PRIMARY KEY NOT NULL',
'name': 'TEXT NOT NULL',
'birth_year': 'INT NOT NULL',
'death_year': 'INT',
'professions_raw': 'TEXT',
'known_for_title_ids_raw': 'TEXT'
}
);
Pretty neat that DuckDB can read directly from a URL, automatically decompressing the response, and creating a table directly from the results.
A few of those columns contain comma-separated values, so we can extract those
into a column using the native
LIST
type.
ALTER TABLE names ADD COLUMN IF NOT EXISTS professions TEXT[];
ALTER TABLE names ADD COLUMN IF NOT EXISTS known_for_title_ids TEXT[];
UPDATE names
SET professions = regexp_split_to_array(professions_raw, ',')
WHERE professions IS NULL;
UPDATE names
SET known_for_title_ids =
regexp_split_to_array( known_for_title_ids_raw, ',')
WHERE known_for_title_ids IS NULL;
Next we can bring in the “titles”, aka the TV shows and Movies themselves, using the same process.
CREATE TABLE IF NOT EXISTS titles AS SELECT * FROM read_tsv(
'https://datasets.imdbws.com/title.basics.tsv.gz',
columns_ := {
'id': 'TEXT PRIMARY KEY NOT NULL',
'type': 'TEXT NOT NULL',
'title': 'TEXT NOT NULL',
'original_title': 'TEXT NOT NULL',
'is_adult': 'BOOLEAN NOT NULL',
'start_year': 'INT NOT NULL',
'end_year': 'INT',
'runtime': 'INT NOT NULL',
'genres_raw': 'TEXT'
}
);
ALTER TABLE titles ADD COLUMN IF NOT EXISTS genres TEXT[];
UPDATE titles
SET genres = regexp_split_to_array(genres_raw, ',')
WHERE genres IS NULL;
ALTER TABLE titles ADD COLUMN IF NOT EXISTS average_rating DOUBLE;
ALTER TABLE titles ADD COLUMN IF NOT EXISTS votes INT;
UPDATE titles
SET average_rating = ratings.average_rating,
votes = ratings.votes
FROM read_tsv(
'https://datasets.imdbws.com/title.ratings.tsv.gz',
columns_ := {
'title_id': 'TEXT NOT NULL',
'average_rating': 'DOUBLE NOT NULL',
'votes': 'BIGINT NOT NULL',
}
) ratings
WHERE ratings.title_id = titles.id
AND (titles.average_rating IS NULL OR titles.votes IS NULL);
Finally, the join table between the people and the movies they are related to.
DuckDB can extract the characters_raw
column content as a LIST
, but we want
to remove the quotations surrounding each element using regexp_extract
.
CREATE TABLE IF NOT EXISTS titles_names AS SELECT * FROM read_tsv(
'https://datasets.imdbws.com/title.principals.tsv.gz',
columns_ := {
'title_id': 'TEXT NOT NULL',
'ordering': 'INT NOT NULL',
'name_id': 'TEXT NOT NULL',
'category': 'TEXT NOT NULL',
'job': 'TEXT',
'characters_raw': 'TEXT[]'
}
);
ALTER TABLE titles_names ADD COLUMN IF NOT EXISTS characters TEXT[];
UPDATE titles_names
SET characters = list_transform(
characters_raw,
c -> regexp_extract(c, '^"(.*)"$', 1)
)
WHERE characters IS NULL;
And yes, that ->
is a
“lambda”. Maybe other database
engines have them, but it’s my first time seeing one in a SQL dialect.
Querying
Now that we’ve got a few gigabytes of data loaded up, we can start running some queries! DuckDB’s SQL dialect has all the basics you’d expect, so its easy to start exploring.
If you’ve been on IMDb’s site, you’ve probably seen the “Top 250”. How about the “Bottom 10”?
SELECT title, average_rating
FROM titles
WHERE votes > 100000
AND type = 'movie'
ORDER BY average_rating ASC
LIMIT 10;
-- ┌──────────────────────┬────────────────┐
-- │ title │ average_rating │
-- │ varchar │ double │
-- ├──────────────────────┼────────────────┤
-- │ Radhe │ 1.9 │
-- │ Epic Movie │ 2.4 │
-- │ Meet the Spartans │ 2.8 │
-- │ Adipurush │ 3.1 │
-- │ Catwoman │ 3.4 │
-- │ Batman & Robin │ 3.8 │
-- │ The Last Airbender │ 4.0 │
-- │ Fifty Shades of Grey │ 4.2 │
-- │ Movie 43 │ 4.3 │
-- │ Fantastic Four │ 4.3 │
-- ├──────────────────────┴────────────────┤
-- │ 10 rows 2 columns │
-- └───────────────────────────────────────┘
Yep, those are all pretty universally panned. Though this is the first time I’ve heard of Radhe.
How about the actors or directors who have been associated with the most “well-known” movies? With the same 10,000 threshold of votes defining “well-known”.
SELECT names.name, count(*) AS total_movies
FROM names
JOIN titles_names ON titles_names.name_id = names.id
JOIN titles ON titles.id = titles_names.title_id
WHERE titles.type = 'movie'
AND votes > 100000
GROUP BY ALL
ORDER BY total_movies DESC
LIMIT 10;
-- ┌──────────────────┬──────────────┐
-- │ name │ total_movies │
-- │ varchar │ int64 │
-- ├──────────────────┼──────────────┤
-- │ Robert De Niro │ 39 │
-- │ Brad Pitt │ 37 │
-- │ Tom Hanks │ 37 │
-- │ Johnny Depp │ 35 │
-- │ Tim Bevan │ 34 │
-- │ Matt Damon │ 32 │
-- │ Stan Lee │ 30 │
-- │ John Williams │ 30 │
-- │ Tom Cruise │ 30 │
-- │ Steven Spielberg │ 30 │
-- ├──────────────────┴──────────────┤
-- │ 10 rows 2 columns │
-- └─────────────────────────────────┘
Again, that looks about right. In hindsight, it makes sense Robert De Niro would be high up, having been around for so long, and being prolific even into his older age.
This was the first time I had used DuckDB, and I came away very impressed. I can see myself reaching for it again in the future, if I have more data-spelunking needs.