Please note: This course will be taught online only. In-person study is not available for this course.

Iulia Cioroianu is a Lecturer in the Department of Politics, Languages and International Studies at the University of Bath, and an affiliate of the Institute for Policy Research and the ART-AI Centre for Doctoral Training. Iulia is a computational social scientist who studies the effects of social media and online information exposure on political competition and polarization, using natural language processing, quantitative text analysis, machine learning and survey experiments. She holds a Ph.D. in Political Science from New York University and an M.A. from Central European University. Before joining the University of Bath, she was a research fellow in the Q-Step Centre for Quantitative Social Sciences at the University of Exeter, and a teaching fellow in the LSE Department of Methodology.

 

Course content

Our world is increasingly being recorded as digital text, capturing human knowledge and interactions to an unprecedented level and providing a rich source of data for researchers across different academic disciplines and subjects. Consequently, computational text analysis methods and tools are becoming increasingly popular and starting to make their way into the core research methods curriculum.

This course is designed to provide social science researchers with an entry point to computational text analysis. Participants will gain hands-on experience designing and implementing a quantitative text analysis research project and will learn to discuss, evaluate and interpret the results. Each class consists of a 2-hour lecture followed by a 1.5-hour lab in which participants apply the methods covered in the lecture in R.

We will start with an overview of computational text analysis methods and discuss examples of their application across multiple disciplines and research fields. We will then survey the main ways in which text data can be acquired and present several major online text data sources.

The first steps in a text analysis research project – inputting, importing, manipulating and storing text data in different formats, as well as cleaning and processing it – often prove to be the most challenging for beginners. After addressing this initial set of issues we will study: the main ways in which text data can be turned into numbers; descriptive methods such as frequency tables and word clouds; automated dictionary methods (such as those developed to extract different emotions from text); text comparison methods (which are often used to study the diffusion and evolution of laws, policies and ideas); and text scaling methods (such as those used by political scientists to map the positions of political actors in the ideological space).

Finally, the course provides an introduction to machine learning applied to text data: supervised classification (routinely used in multiple disciplines to label large volumes of text documents based on a small subset of coded data) and unsupervised learning methods (as a very light introduction to topic modelling).

Course objectives

At the end of the course, participants will have an understanding of the current quantitative text analysis research landscape, the ways in which computational text analysis can be applied to their area of interest and the main data sources, tools and methods available for further exploration.

Participants will also gain hands-on experience designing and implementing a quantitative text analysis research project in R and will be able to discuss and interpret the results and acknowledge the limitations of the methods used.

Essential text (this text will be provided by ESS):

Grimmer, J., Roberts, M. E., & Stewart, B. M.: Text as Data: A New Framework for Machine Learning and the Social Sciences, Princeton University Press, 2022. ISBN: 978-0691207551

Course prerequisites 

Students are expected to have a working knowledge of R and to be familiar with basic (undergraduate-level) concepts in research design and statistical analysis.

Background knowledge required

Maths

Calculus – elementary

Linear regression – moderate

Statistics

OLS – moderate

Maximum likelihood – elementary

Software / Programming

R – moderate

2F: Quantitative Text Analysis

The course is delivered in ten sessions. Each session consists of a lecture and a lab. During the lab, students will work through sets of examples that involve collecting different forms of online text data and analysing them using the text analysis methods introduced in the course.

Textbook: Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.

Day 1

Overview of computational text analysis methods and their application across multiple disciplines. Preview of the software and packages used in the course.

Readings:

  1. Benoit, K. 2020. “Text as Data: An Overview.” In Curini, Luigi and Robert Franzese, eds. Handbook of Research Methods in Political Science and International Relations. Thousand Oaks: Sage.
  2. Textbook Chapter 2.  
  3. Gilardi, F., & Wüest, B. (2018). Text-as-Data Methods for Comparative Policy Analysis. Working paper. Available at: https://www.fabriziogilardi.org/resources/papers/Gilardi-Wueest-TextAsData-Policy-Analysis.pdf

 

Day 2

What is digital text? Acquiring text data. Collecting data from different sources and in different formats – introduction to Application Programming Interfaces (APIs) and web scraping.

Readings:

  1. Textbook Chapters 3 and 4.
  2. McCormick, T. H., Lee, H., Cesare, N., Shojaie, A., & Spiro, E. S. (2017). Using Twitter for Demographic and Social Science Research: Tools for Data Collection and Processing. Sociological Methods & Research, 46(3), 390–421. https://doi.org/10.1177/0049124115605339
  3. Glez-Peña, D., Lourenço, A., López-Fernández, H., Reboiro-Jato, M., & Fdez-Riverola, F. (2014). Web scraping technologies in an API world. Briefings in Bioinformatics, 15(5), 788–797. https://doi.org/10.1093/bib/bbt026

 

Day 3

Manipulating and storing text data. Formatting and working with text files in different formats. File formats, storage options, indexing and meta-data.

Readings:

  1. Batrinca, B., & Treleaven, P. C. (2015). Social media analytics: A survey of techniques, tools and platforms. AI & SOCIETY, 30(1), 89–116. https://doi.org/10.1007/s00146-014-0549-4

 

Day 4

Text cleaning and pre-processing. Introduction to regular expressions and natural language processing.
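
As an illustration of the kind of cleaning step this session covers, here is a minimal sketch. The labs use R; Python is used here only to show the logic, and the stop-word list, the `clean_text` helper and the example sentence are all hypothetical choices made for this illustration, not course materials.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption for illustration).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def clean_text(raw):
    """Lowercase, strip URLs and non-letters, tokenize, and drop stop words."""
    text = raw.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep ASCII letters and whitespace only
    tokens = text.split()                      # split on runs of whitespace
    return [t for t in tokens if t not in STOP_WORDS]

example = "The Minister said: 'Taxes WILL fall in 2024!' See https://example.org/news"
tokens = clean_text(example)
print(tokens)
```

Each choice here (lowercasing, dropping digits and punctuation, removing stop words) discards information, which is why the Denny and Spirling reading below treats pre-processing as a modelling decision rather than a neutral step.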

Readings:

  1. Denny, M. J., & Spirling, A. (2018). Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It. Political Analysis, 26(2), 168–189. https://doi.org/10.1017/pan.2017.44
  2. Welbers, K., Van Atteveldt, W., & Benoit, K. (2017). Text Analysis in R. Communication Methods and Measures, 11(4), 245–265. https://doi.org/10.1080/19312458.2017.1387238
  3. Textbook Chapters 5 and 9.
  4. Regular Expressions as used in R, https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html

 

Day 5

Text to numbers. Descriptive text methods. Comparing texts and computing text similarity measures.
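
To make "text to numbers" concrete, the following sketch builds a small document-term matrix and compares documents with cosine similarity. It is an illustrative Python example (the labs use R); the three toy documents and the helper names `term_counts` and `cosine_similarity` are assumptions made for this example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical texts, for illustration only).
docs = {
    "bill_a": "the state shall fund public schools",
    "bill_b": "the state shall fund public roads",
    "speech": "we celebrate our local football team",
}

def term_counts(text):
    """Bag-of-words representation: how often each word occurs."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

counts = {name: term_counts(text) for name, text in docs.items()}

# Document-term matrix: one row per document, one column per vocabulary word.
vocabulary = sorted(set().union(*counts.values()))
dtm = {name: [c[w] for w in vocabulary] for name, c in counts.items()}

sim_ab = cosine_similarity(counts["bill_a"], counts["bill_b"])
sim_a_speech = cosine_similarity(counts["bill_a"], counts["speech"])
print(round(sim_ab, 3), sim_a_speech)
```

The two bills share five of their six words and so score close to 1, while the bill and the speech share none and score 0. Text reuse studies such as those in the readings below rest on the same intuition, with more refined measures.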

Readings:

  1. Textbook Chapters 7 and 8.
  2. Erk, K. (2012). Vector Space Models of Word Meaning and Phrase Meaning: A Survey. Language and Linguistics Compass, 6(10), 635–653. https://doi.org/10.1002/lnco.362
  3. Jansa, J. M., Hansen, E. R., & Gray, V. H. (2018). Copy and Paste Lawmaking: Legislative Professionalism and Policy Reinvention in the States. American Politics Research. https://doi.org/10.1177/1532673X18776628
  4. Linder, F., Desmarais, B., Burgess, M., & Giraudy, E. (2020). Text as Policy: Measuring Policy Similarity through Bill Text Reuse. Policy Studies Journal. https://doi.org/10.1111/psj.12257

 

Day 6

Automated dictionary methods.
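
A dictionary method counts how many words in a document belong to predefined categories. The sketch below shows the idea in Python (the labs use R); the two-category `EMOTION_DICTIONARY` is a made-up toy example, not LIWC or any other published dictionary.

```python
import re
from collections import Counter

# Hypothetical toy dictionary: category -> set of words (assumption for illustration).
EMOTION_DICTIONARY = {
    "anger": {"angry", "outrage", "furious"},
    "joy": {"happy", "delighted", "proud"},
}

def dictionary_scores(text):
    """Share of a document's tokens that fall in each dictionary category."""
    tokens = re.findall(r"[a-z]+", text.lower())
    total = len(tokens)
    counts = Counter(tokens)
    scores = {}
    for category, words in EMOTION_DICTIONARY.items():
        hits = sum(counts[w] for w in words)
        scores[category] = hits / total if total else 0.0
    return scores

scores = dictionary_scores("We are proud and happy, but voters remain angry.")
print(scores)
```

Dividing by document length makes scores comparable across texts of different sizes. The result is only as valid as the word lists behind it, a point the Loughran and McDonald reading below makes for financial text.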

Readings:

  1. Textbook Chapters 15 and 16.
  2. Tausczik, Y. R., & Pennebaker, J. W. (2010). The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology, 29(1), 24–54. https://doi.org/10.1177/0261927X09351676
  3. Loughran, T., & McDonald, B. (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x

 

Day 7

Introduction to machine learning applied to text data. Supervised classification.
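
The logic of supervised classification (learn from a small hand-coded set, then label new documents) can be sketched with a multinomial Naive Bayes classifier. This is an illustrative Python example under simplifying assumptions (the labs use R); the four training sentences, their labels and the function names are hypothetical, and Naive Bayes is just one of many possible classifiers.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-coded training set: (text, label).
training = [
    ("cut taxes and reduce regulation", "right"),
    ("lower taxes for business", "right"),
    ("expand public healthcare and welfare", "left"),
    ("raise welfare spending for workers", "left"),
]

def train_naive_bayes(labelled_docs):
    """Count documents per class and word occurrences per class."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        tokens = text.lower().split()
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary

def predict(text, model):
    """Return the class with the highest log posterior for the given text."""
    class_doc_counts, word_counts, vocabulary = model
    n_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_doc_counts:
        score = math.log(class_doc_counts[label] / n_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            p = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(training)
prediction = predict("reduce taxes", model)
print(prediction)
```

In real applications the coded set is split into training and test portions so that performance can be measured on documents the model has not seen.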

Readings:

  1. Textbook Chapters 17-20. 
  2. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R. Springer. Chapter 2.
  3. Yu, B., Kaufmann, S., & Diermeier, D. (2008). Classifying Party Affiliation from Political Speech. Journal of Information Technology & Politics, 5(1), 33–48. https://doi.org/10.1080/19331680802149608

Day 8

Unsupervised learning. Principal component analysis, clustering methods and introduction to topic models. 

Readings:

  1. Textbook Chapters 12 and 13.
  2. Wilkerson, J., & Casas, A. (2017). Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges. Annual Review of Political Science, 20(1), 529–544. https://doi.org/10.1146/annurev-polisci-052615-025542
  3. Quinn, K. M., Monroe, B. L., Colaresi, M., Crespin, M. H., & Radev, D. R. (2010). How to Analyze Political Attention with Minimal Assumptions and Costs. American Journal of Political Science, 54(1), 209–228. https://doi.org/10.1111/j.1540-5907.2009.00427.x

 

Day 9

Document scaling.
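
The core idea of scaling with reference texts can be sketched as follows. This is a simplified Python illustration (the labs use R) of the word-scoring logic described in Laver, Benoit and Garry (2003), listed below; it omits the rescaling and uncertainty estimates of the full method, and the two reference "manifestos", their positions of -1 and +1, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical reference texts with assumed known positions (-1 = left, +1 = right).
reference_texts = {
    "left_manifesto": ("welfare welfare spending workers", -1.0),
    "right_manifesto": ("taxes taxes markets business", 1.0),
}

def relative_frequencies(text):
    """Relative frequency of each word within one text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def word_scores(references):
    """Score each word as the position-weighted probability of its source text."""
    freqs = {name: relative_frequencies(text) for name, (text, _) in references.items()}
    vocabulary = set().union(*freqs.values())
    scores = {}
    for word in vocabulary:
        total = sum(f.get(word, 0.0) for f in freqs.values())
        scores[word] = sum(
            (freqs[name].get(word, 0.0) / total) * position
            for name, (_, position) in references.items()
        )
    return scores

def score_document(text, scores):
    """Average word score over the tokens that appear in the reference texts."""
    tokens = [t for t in text.lower().split() if t in scores]
    if not tokens:
        return None  # no scored words, so no position can be estimated
    return sum(scores[t] for t in tokens) / len(tokens)

scores = word_scores(reference_texts)
position = score_document("taxes welfare welfare workers", scores)
print(position)
```

The new document uses three left-associated words and one right-associated word, so it is placed on the left half of the scale.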

Readings:

  1. Laver, M., Benoit, K., & Garry, J. (2003). Extracting Policy Positions from Political Texts Using Words as Data. American Political Science Review, 97(2), 311–331. https://doi.org/10.1017/S0003055403000698
  2. Slapin, J. B., & Proksch, S.-O. (2008). A Scaling Model for Estimating Time-Series Party Positions from Texts. American Journal of Political Science, 52(3), 705–722. https://doi.org/10.1111/j.1540-5907.2008.00338.x
  3. Perry, P. O., & Benoit, K. (2017). Scaling Text with the Class Affinity Model. ArXiv: http://arxiv.org/abs/1710.08963

 

Day 10

Further topics and applications. The final lecture and lab will expand upon the previous topics in response to student interests.