Encyclopedia Britannica data visualisation

2020 National Library of Scotland
Overview
This project was developed as part of the University of Edinburgh’s Data Science for Design course.
Our dataset is the eight OCR’ed editions of the Encyclopaedia Britannica, which were released between 1768 and 1860.
The Encyclopaedia is an aggregation of general knowledge, which makes it an interesting dataset: it shows how knowledge and its categorisation evolved over time.
As the UI designer, I worked closely with the data scientist and game designer in our group. We aimed to present our findings as an entry point: an initial overview of the content and structure of the Encyclopaedia, delivered in an accessible and engaging way.

Challenge

01

It was impossible to create RegEx queries on the data that wouldn’t pick up noise.

02

Letters were misinterpreted as other letters or punctuation symbols.

03

Most of the headers were badly recognised in the .txt files.

04

Images didn't match the content in the same file.

Solution
Due to the data being unstructured text, we had to generate our own structured data from it.
We decided to focus on very specific aspects of the data: simple entries of the form “TERM, definition”, as well as references to topics of the form “See x”, and used reference counts as a proxy for the popularity of a topic.
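As a rough illustration of this kind of extraction, here is a minimal Python sketch. The two patterns, the function names and the sample lines are all assumptions made up for this example; they are not the expressions used in the project, and real OCR text would need looser patterns.

```python
import re
from collections import Counter

# Hypothetical patterns, not the project's actual expressions.
# Simple entry: an upper-case headword at the start of a line, then a comma.
ENTRY_RE = re.compile(r"^([A-Z][A-Z'\- ]+),\s+(.+)$", re.MULTILINE)
# Cross-reference: "See" followed by a capitalised topic name.
SEE_RE = re.compile(r"\bSee\s+([A-Z][A-Za-z]+)")


def extract_entries(text):
    """Return (term, definition) pairs for lines shaped like 'TERM, definition'."""
    return [(m.group(1).strip(), m.group(2).strip()) for m in ENTRY_RE.finditer(text)]


def count_references(text):
    """Count how often each topic is the target of a 'See x' reference."""
    return Counter(m.group(1).upper() for m in SEE_RE.finditer(text))


# Invented sample lines in the style of the source, for illustration only.
sample = (
    "ALPHA, an example headword with a short definition. See Arithmetic.\n"
    "BETA, another example headword.\n"
    "a continuation line; see plate IV. See Anatomy.\n"
    "GAMMA, in anatomy. See Anatomy.\n"
)

entries = extract_entries(sample)
refs = count_references(sample)

for term, definition in entries:
    print(term, "->", definition)
print(refs.most_common())
```

The lower-case continuation line is skipped by the entry pattern but still contributes a "See Anatomy" reference, which is the sort of partial match that makes noisy input hard to handle cleanly.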
As our data was the original, unclean OCR, it was impossible to create RegEx queries that wouldn’t pick up noise or leave some data out. However, this wasn’t much of an issue for us, as we could assume that the same kinds of omissions would be made across editions and terms, so the processed data would still be indicative of the proportional differences between terms across editions. Having noisy data did mean that extracting the longest and most reference-heavy terms required manual inspection of the top contenders, as it couldn’t easily be done computationally.
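One way to read the "proportional differences" idea is to compare each topic's share of references within an edition rather than its raw count. The sketch below assumes that interpretation; the function name, edition labels and numbers are placeholders for illustration, not results from the project.

```python
def reference_shares(counts_by_edition):
    """Convert raw 'See x' counts into each topic's share of its edition's total."""
    shares = {}
    for edition, counts in counts_by_edition.items():
        total = sum(counts.values())
        shares[edition] = {
            topic: (n / total if total else 0.0) for topic, n in counts.items()
        }
    return shares


# Placeholder numbers for illustration only; not results from the project.
counts_by_edition = {
    "edition_1": {"ANATOMY": 30, "BOTANY": 10},
    "edition_2": {"ANATOMY": 60, "BOTANY": 60},
}

shares = reference_shares(counts_by_edition)
for edition, topic_shares in shares.items():
    print(edition, topic_shares)
```

If OCR omissions affect all topics in an edition roughly equally, the shares stay comparable across editions even when the raw counts are too low.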
Target Users
Our audience is the general public: anyone who is interested in the Encyclopaedia but overwhelmed by the amount of content available.

Website as a platform to show our investigation of the dataset.

We selected five commonly referenced topics in the Encyclopaedia Britannica: Anatomy, Architecture, Agriculture, Botany and Chemistry.
We presented an image related to each of these fields. When you move the cursor over a picture, the number of references to that topic in each edition pops up, so you can see how it changed. In the lower-left corner, there is also an image showing the popularity of these topics. We adjusted this after our data holder suggested that the concepts of popularity and reference counts might be misleading to people, since they were initially on different pages.

Open Resource

Let's dig into the code a bit 💻
This project was also published on the
All the code used for processing the data is openly available on