Raymond Hicks received his PhD in Political Science from Emory. He worked for 10 years as a Statistical Programmer at the Niehaus Center for Globalization and Governance at Princeton University. Recently, he moved to Columbia University, where he is the Project Manager for the History Lab. His research interests include monetary and trade policy, and his work has appeared in The Journal of Politics, the British Journal of Political Science, and International Organization, among other journals.
While many social scientists use quantitative data, the increasing volume of digitized and “born digital” collections of books, articles, documents, and social media presents tremendous new opportunities for research. But for those with little training in computer science, the barriers to entry can seem daunting.
The goal of this workshop is to offer a very practical introduction to software packages accessible to social scientists with varying skill sets. Participants will learn how to organize and analyze textual data, get an overview of advances in natural language processing and machine learning, see how they can grapple with old research problems with new rigor, and take on entirely new kinds of questions. Focusing on textual data from History Lab – a comprehensive resource that has culled millions of declassified government documents from the U.S., United Kingdom, and Brazil – we will introduce principles of examining digitized text. We will examine how to bring textual data into Python or R, how to use Python for web scraping, and how to look at textual data using string functions.
At the end of the course, participants should be comfortable with the basics of digitized text. This includes understanding the structure of the data, what can and cannot be done with digital data, and some simple techniques for analyzing the data. They should be able to open text data from a variety of sources and understand its formatting.
The course is an introduction to digital text. Some familiarity with computers is necessary. We will be using a couple of different packages to examine the data, so some experience with computer languages would be helpful.
Background knowledge required
OLS = e
Stata = e
R = e
Python = e
SQL = e
e = elementary, m = moderate, s = strong
Introduction to Text Analysis
The goal of this workshop is to provide an introduction to the use of digitized text. Humanities scholars stand poised to benefit from exciting new technologies in text analysis. Machine-learning tools can help scholars quickly process large amounts of text from nearly any imaginable source. “Text as data” can offer compelling new insights into research questions and scholarly topics in the humanities. Focusing on textual data from History Lab – a comprehensive resource that has culled millions of declassified government documents from the U.S., United Kingdom, and Brazil – we will cover the basics of examining digitized text. The course is very hands-on. Participants will be expected to bring their own laptops. Each session will include some time devoted to testing out the concepts taught in the session.
Session 1: What is Digital Text?
In the first session we will introduce digital text. How does it differ from numeric data? Can the same types of methods be used for both? What is metadata and how is it used?
We will walk through the different programs that could be used to analyze textual data. The most important of these will be R and Python, but we will also look at MySQL and Stata. Some packages are better at some tasks than others. In the end, the choice of software is one of personal predilection.
We will get a first look at the data available through the History Lab website. We will talk about the different collections available as well as how the collections are formatted.
Session 2: Bringing text into software, part 1
When we have data, whether textual or numeric, we often want to bring it into different programs to analyze. Text data comes in many different formats so we will discuss a variety of methods for inputting data. We will also discuss some basics of data manipulation to help us combine data from different sources.
The History Lab data is available from an SQL database. We will learn how to use SQL queries to extract data from the database. Because the data are often available in different tables, we will focus on combining fields from different tables to get the information we need. Sometimes it is easier to combine data in other programs so we will also explore how to merge data in Stata and R.
– SQL data
– Merging data
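As a rough illustration of combining fields from different tables, the sketch below builds a tiny in-memory SQLite database from Python and joins two tables on a shared key. The table names, column names, and rows are invented for the example and are not the actual History Lab schema.

```python
import sqlite3

# In-memory database with two hypothetical tables; the real History Lab
# schema will differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE collections (collection_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, title TEXT, collection_id INTEGER);
    INSERT INTO collections VALUES (1, 'Collection A'), (2, 'Collection B');
    INSERT INTO docs VALUES
        (10, 'Memo on trade', 1),
        (11, 'Cable on tariffs', 2),
        (12, 'Letter on exchange rates', 1);
""")

# JOIN combines fields from both tables using the shared key collection_id.
query = """
    SELECT d.doc_id, d.title, c.name
    FROM docs AS d
    JOIN collections AS c ON d.collection_id = c.collection_id
    ORDER BY d.doc_id
"""
rows = conn.execute(query).fetchall()
for doc_id, title, name in rows:
    print(doc_id, title, name)
conn.close()
```

The same kind of key-based combination is what a merge does in Stata or R; only the syntax changes.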
Session 3: Bringing text into software, part 2
Often text data is not available preformatted as in SQL tables. In those cases, it becomes more difficult to get the data organized in the format we would like. In this session, we will investigate how to bring in textual data in other formats, including HTML and JSON data.
– Beautiful Soup
– JSON data
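To give a feel for these two formats, here is a minimal sketch that reads a JSON record and strips the tags from a small HTML page. To stay self-contained it uses only Python's standard library (json and html.parser) rather than Beautiful Soup, which offers a more convenient interface for the HTML step. The record and the page are made up for illustration.

```python
import json
from html.parser import HTMLParser

# A made-up JSON record, standing in for whatever a file or web service provides.
raw_json = '{"id": 42, "title": "Sample cable", "topics": ["trade", "aid"]}'
record = json.loads(raw_json)  # JSON text becomes a Python dict


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())


raw_html = ("<html><body><h1>Sample cable</h1>"
            "<p>Talks on <b>trade</b> resumed.</p></body></html>")
parser = TextExtractor()
parser.feed(raw_html)
parser.close()
page_text = " ".join(parser.chunks)

print(record["title"], record["topics"])
print(page_text)
```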
Session 3: Getting around the data
Once the data is in a program, what do we do with it? How can we explore the information that is in the data, especially if there is a lot of it? In this session, we will focus on string functions in different packages to more easily get a sense of the data. One aspect that will receive extra attention is regular expressions.
– How do we know what is there?
– Regular expressions
– Subsetting data
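The following sketch shows the flavor of these tasks in Python: a regular expression pulls date-like strings out of some text, and a second pattern subsets the documents to those mentioning a keyword. The example sentences are invented, and the date pattern is deliberately simple; real documents will need more forgiving patterns.

```python
import re

# Made-up document snippets for illustration.
docs = [
    "Meeting held on 12 March 1974 regarding oil prices.",
    "No date given; subject is grain exports.",
    "Cable sent 3 June 1975 about OIL embargo talks.",
]

# Pattern: one or two digits, a capitalized word (the month), a four-digit year.
date_pattern = re.compile(r"\b(\d{1,2}) ([A-Z][a-z]+) (\d{4})\b")

# What is there? Pull out every date-like string.
dates = [m.group(0) for d in docs for m in date_pattern.finditer(d)]

# Subsetting: keep only documents that mention "oil", ignoring case.
oil_docs = [d for d in docs if re.search(r"\boil\b", d, flags=re.IGNORECASE)]

print(dates)
print(len(oil_docs), "of", len(docs), "documents mention oil")
```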
Session 4: Quantifying text data
We will continue looking at the data and concentrate on different ways to quantify the information it contains. There are several options for turning text into numbers. This session will discuss those options and how to get the data into the most useful format for analysis.
– Counting mentions
– Collapsing data
– Reshaping data
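The three steps listed above can be sketched in a few lines of Python. The documents, years, and search terms below are invented; the point is only the sequence: count mentions per document, collapse the counts to one row per year and term, then reshape from long to wide format.

```python
from collections import Counter, defaultdict

# Made-up (year, text) pairs.
docs = [
    (1973, "oil prices rose as oil supply fell"),
    (1973, "grain exports and trade talks"),
    (1974, "oil embargo ends; trade resumes"),
]
terms = ["oil", "trade"]

# Counting mentions: one row per (document, term), i.e. "long" format.
long_rows = []
for year, text in docs:
    counts = Counter(text.replace(";", " ").split())
    for term in terms:
        long_rows.append((year, term, counts[term]))

# Collapsing: sum the counts within each year and term.
by_year = defaultdict(int)
for year, term, n in long_rows:
    by_year[(year, term)] += n

# Reshaping: long to wide, one row per year and one column per term.
wide = {}
for (year, term), n in by_year.items():
    wide.setdefault(year, {})[term] = n

for year in sorted(wide):
    print(year, wide[year])
```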
Session 5: Statistical introduction
In the next three sessions, we will move away from text data a bit and focus on more general quantitative aspects. This session will look at descriptive statistics such as means and standard deviations and how to calculate them.
– Means, standard deviations
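As a small worked example, the sketch below computes a mean and a sample standard deviation with Python's built-in statistics module. The numbers are hypothetical word counts for five documents.

```python
import statistics

# Hypothetical word counts for five documents.
word_counts = [120, 95, 240, 180, 65]

mean_len = statistics.mean(word_counts)
# stdev() is the sample standard deviation (divides by n - 1).
sd_len = statistics.stdev(word_counts)

print(f"mean = {mean_len}, sd = {sd_len:.2f}")
```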
Session 6: Graphs and tables
How do we present descriptive statistics of our data? We will focus on graphs and tables in this session and the best ways to show readers the data we have.
– Dates in statistical packages
– Line graphs
– Bar charts
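Before anything can be graphed over time, date strings have to be converted into real date values. The sketch below does this with Python's datetime module, collapses made-up document dates to monthly counts, and prints a crude text bar chart. In practice a plotting library would draw the line graph or bar chart; the date format shown is an assumption about how the dates are stored.

```python
from collections import Counter
from datetime import datetime

# Made-up document dates stored as text in year-month-day form.
date_strings = [
    "1974-03-12", "1974-03-28", "1974-04-02",
    "1974-06-15", "1974-06-16", "1974-06-30",
]

# Convert text to date objects so they sort and group correctly.
dates = [datetime.strptime(s, "%Y-%m-%d").date() for s in date_strings]

# Count documents per (year, month).
per_month = Counter((d.year, d.month) for d in dates)

# A crude text bar chart: one '#' per document.
for (year, month), n in sorted(per_month.items()):
    print(f"{year}-{month:02d} {'#' * n} ({n})")
```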
Session 7: Measures of association
How do we determine whether differences are meaningful? That is, what are the ways to test hypotheses about data? Here we will focus on some of the simpler forms of hypothesis testing.
T-tests examine whether the means of two groups are different from one another. Chi squared is a test of association for categorical variables: it asks whether the counts observed in a table differ from the counts we would expect based on the row and column totals alone. We will also look at the concept of correlation between two variables.
– Chi squared, t-tests
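To make the two test statistics concrete, the sketch below computes them by hand in Python on invented data. The t statistic shown is the unequal-variance (Welch) version, which is one common choice; the chi-squared statistic compares each observed cell with the count expected from the row and column totals. Statistical packages will also report p-values, which this sketch leaves out.

```python
import math
import statistics

# Two made-up groups, e.g. mentions per document in two collections.
group_a = [12, 15, 11, 14, 13]
group_b = [18, 17, 20, 16, 19]


def welch_t(a, b):
    """t statistic for a difference in means, allowing unequal variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se


def chi_squared(observed):
    """Chi-squared statistic for a table of counts (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if rows and columns were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat


# A made-up 2x2 table of counts.
table = [[30, 10], [20, 40]]

t_stat = welch_t(group_a, group_b)
chi2 = chi_squared(table)
print(f"t = {t_stat:.2f}, chi-squared = {chi2:.2f}")
```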
Session 8: Descriptive text
We can do some interesting things with textual data that we cannot easily do with numeric data. In this session we will explore some of these techniques which can give a better sense of the information within the text.
– Word clouds
– Key words in context
– Frequency tables
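A frequency table and a key-words-in-context (KWIC) listing can both be built from a list of tokens. The sketch below is a minimal version in Python on an invented passage; the kwic helper and its window size are illustrative choices, not a standard function.

```python
import re
from collections import Counter

# A made-up passage.
text = ("The embassy reported that trade talks stalled. "
        "Officials said trade would resume after the embassy reopened.")

# Tokenize: lowercase, keep runs of letters only.
tokens = re.findall(r"[a-z]+", text.lower())

# Frequency table: how often each word appears.
freq = Counter(tokens)


def kwic(tokens, keyword, window=2):
    """Return each occurrence of keyword with `window` words on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines


trade_lines = kwic(tokens, "trade")
print(freq.most_common(3))
for line in trade_lines:
    print(line)
```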
Session 9: Regression and logit analysis
The advanced topics in textual analysis are better understood with some knowledge of regression analysis, particularly logit/probit analysis. We will first look at multivariate regression and then turn to logit/probit analysis, which is used when the dependent variable is dichotomous.
– Dependent/independent variables
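To show what a logit model does with a dichotomous dependent variable, here is a bare-bones sketch in pure Python. It fits one intercept and one slope by gradient ascent on the log-likelihood, using invented data. This is only meant to expose the mechanics; the learning rate and number of steps are arbitrary choices, and in practice a statistical package would do the estimation and report standard errors.

```python
import math

# Made-up data: x is the independent variable, y the 0/1 dependent variable.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 0, 1, 0, 1, 1, 1]


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(x, y, lr=0.05, steps=5000):
    """Estimate intercept b0 and slope b1 by simple gradient ascent."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            err = yi - sigmoid(b0 + b1 * xi)  # observed minus predicted
            g0 += err
            g1 += err * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


b0, b1 = fit_logit(x, y)

# Predicted probability that y = 1 at a low and a high value of x.
p_low = sigmoid(b0 + b1 * 1)
p_high = sigmoid(b0 + b1 * 6)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"P(y=1 | x=1) = {p_low:.2f}, P(y=1 | x=6) = {p_high:.2f}")
```

A positive slope here means that higher values of x make y = 1 more likely, which is the basic interpretation carried over to multivariate logit models.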
Session 10: Advanced topics: Topic models and network analysis
In the final session, we will briefly touch upon some of the more advanced topics in text analysis.