Learn more about the people and technologies that make this project possible.

In March of 2012, the Encyclopedia Britannica ceased printing paper editions of its handsomely bound reference books. The Encyclopedia Britannica, first published in Edinburgh, Scotland in 1768, remains the oldest English-language encyclopedia in continuous production, but from now on it will be updated only through its online offering. In the era of community-based online encyclopedias like Wikipedia, now is an interesting time to reflect on the content of the complete print run of the Encyclopedia Britannica. It is also a fitting moment to reflect on how past systems of defining general knowledge have worked to shape societal prejudices, beliefs, and assumptions.

This project uses data science techniques like Natural Language Processing (NLP) and Machine Learning (ML) to chart the evolution of popular conceptions of race and racialization across all 15 editions of the encyclopedia released over its 244-year history. We have built this project site to disseminate our findings, validate our methods, and build a network of researchers interested in working with our dataset and in pursuing questions related to general knowledge texts. We have prepared a robust Application Programming Interface (API) hosting 1.6 GB of processed data relating to our work. Our site also hosts interactive visualizations that allow users to explore this large dataset alongside our analysis. The visualizations are offered on an Open Access model under a Creative Commons license (CC BY-SA), which requires only attribution and that derived research be shared under the same terms. Portions of our code are provided on GitHub so that our methods can be validated in detail through peer review, and the code is also available for reuse under the BSD-3 license.


  • Aaron Mauro (co-PI) Assistant Professor of Digital Humanities and English
     Director of the Penn State Digital Humanities Lab
     Penn State (Behrend)
  • Nina Jablonski (co-PI) Evan Pugh University Professor of Anthropology
     Director of the Center for Human Evolution and Diversity
     Penn State (UP)
  • Theresa Wilson (associate researcher)
     Research Technologist 4
     Penn State (UP)


This project is made possible by the recent growth in open source data science tools in the Python community. The preprocessing stages rely on Python 3's support for UTF-8 throughout and on its built-in string handling. The project also relies on the Natural Language Toolkit (NLTK), which is excellent for computational linguistic work because of the extremely high quality of its documentation. The project benefits from many NLTK features, such as its WordNet interface and the VADER sentiment analysis tool. The analysis of this large corpus uses distributed computing methods with Apache Spark when appropriate. However, much of the preliminary processing for this project has been possible with Python’s own built-in multiprocessing library. The first edition of the EB (1768-1771) was just three volumes in length and only ~2,962,590 words. By the 1987 edition, the EB was ~35,957,712 words, an expansion by a factor of roughly 12!
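As a rough illustration of the kind of preliminary processing described above, the following minimal sketch counts words across several chunks of text with Python's built-in multiprocessing library and checks the growth factor from the two word totals quoted in the text. It is not the project's own code: the tokenization rule, the function names, and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter
from multiprocessing import Pool

# Word totals quoted in the text above.
FIRST_EDITION_WORDS = 2_962_590   # 1768-1771 edition
EDITION_1987_WORDS = 35_957_712   # 1987 edition

# Assumed tokenization rule: runs of letters, allowing internal apostrophes.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def count_words(text):
    """Return a Counter of lowercased word tokens for one chunk of text."""
    return Counter(m.group(0).lower() for m in TOKEN_PATTERN.finditer(text))


def merge_counts(counters):
    """Combine per-chunk Counters into a single Counter."""
    total = Counter()
    for counter in counters:
        total.update(counter)
    return total


def count_words_parallel(chunks, processes=2):
    """Count words in each chunk in a separate worker process, then merge."""
    with Pool(processes=processes) as pool:
        return merge_counts(pool.map(count_words, chunks))


def growth_factor(early_words, late_words):
    """Ratio of the later word total to the earlier one."""
    return late_words / early_words


if __name__ == "__main__":
    # Invented sample text standing in for pages of two volumes.
    sample_chunks = [
        "The arts and the sciences are treated in distinct articles.",
        "The sciences are arranged in alphabetical order.",
    ]
    totals = count_words_parallel(sample_chunks)
    print(totals.most_common(3))
    print(round(growth_factor(FIRST_EDITION_WORDS, EDITION_1987_WORDS), 2))
```

Counting each chunk independently and merging afterwards keeps the workers free of shared state, which is what makes this kind of job easy to spread across processes, or across a Spark cluster when the corpus outgrows one machine.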

To approach this large corpus, we have used several ML-based text classification methods. The scikit-learn library of algorithms has given us access to Bayesian analysis methods in a reliable and open format. Like the NLTK, scikit-learn is very well documented and often used in a research context. With this robust set of tools, we have developed a classifier capable of differentiating between racialized features and those associated with nationality. To demonstrate the functionality of the classifier and help validate our results, we have produced a classification tool that allows anyone to use the classifier in real time and learn how it behaves. Because the EB has a consistent style that has evolved only slowly between volumes, we believe our classifier is surprisingly robust and functions well across a range of linguistic styles and lexical norms between periods. Finally, we have used Gensim’s implementation of LDA topic modeling to determine how topics shift between editions and between volumes within an edition. We have presented these results with pyLDAvis, which works with Gensim models and is a Python port of the popular R package LDAvis.
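To give a sense of how a Bayesian text classifier of this general kind behaves, here is a minimal from-scratch sketch of multinomial naive Bayes with add-one smoothing. It is an assumption that naive Bayes is the relevant method; the text above says only that scikit-learn's Bayesian methods were used. The class name, the whitespace tokenizer, and the tiny training sentences are invented for illustration and are not drawn from the project's data. In scikit-learn, the comparable estimator is `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        # Assumed tokenizer: lowercase and split on whitespace.
        return text.lower().split()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_counts[label] += 1
            for word in self.tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior for each label."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            label_total = sum(self.word_counts[label].values())
            for word in self.tokenize(text):
                if word not in self.vocabulary:
                    continue  # skip words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (label_total + vocab_size))
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy training data; real training text would come from the corpus.
TRAIN_TEXTS = [
    "the inhabitants of the kingdom are subjects of the crown",
    "citizens of the republic elect a parliament",
    "the laws and commerce of the kingdom",
    "authors of the period classed peoples by complexion",
    "described by complexion and form of the hair",
    "naturalists classed varieties by complexion and hair",
]
TRAIN_LABELS = ["nationality"] * 3 + ["racialized"] * 3

classifier = NaiveBayesTextClassifier().fit(TRAIN_TEXTS, TRAIN_LABELS)
print(classifier.predict("subjects of the crown and parliament"))
print(classifier.predict("classed by complexion and hair"))
```

Working in log space avoids numerical underflow on long passages, and the smoothing term keeps a single unseen word from zeroing out a label's score.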

The open and transparent presentation of our data and methods is a cornerstone of our project. We are following a somewhat unorthodox practice, at least in an academic context, of sharing data first and publishing second. The project site is developed with the Flask framework, and the visualizations are generated with Pygal. We are indebted to the Kozea team, who develop both Pygal and CairoSVG. We are collecting a sample of our source code on the project GitHub page. Our API gives access to all our non-copyrighted data for other researchers to explore. We believe this Open Data methodology is central to validating these results in the humanities and social sciences. We have prepared a detailed guide to our API for non-programmers to assist in the use, validation, and exploration of our project data. Finally, we have provided access to our word count data in an interactive tool on our main page! Please follow our site as this project continues to grow, and contact us to learn more or become part of our network!