Please note: This course will be taught online only. In-person study is not available for this course.

Iulia Cioroianu is a Senior Lecturer (Associate Professor) in the Department of Politics, Languages and International Studies at the University of Bath, and an affiliate of the Centre for Doctoral Training in Accountable, Responsible and Transparent AI. Iulia is a computational social scientist who studies the effects of social media and online information exposure on political competition and polarization, using natural language processing, quantitative text analysis, machine learning and survey experiments. She holds a Ph.D. in Political Science from New York University and an M.A. from Central European University. Before joining the University of Bath, she was a research fellow in the Q-Step Centre for Quantitative Social Sciences at the University of Exeter, and a teaching fellow in the LSE Department of Methodology.
Course Description
Our world is increasingly being recorded as digital text, capturing human knowledge and interactions to an unprecedented level and providing a rich source of data for researchers across different academic disciplines and subjects. Consequently, computational text analysis methods and tools are becoming increasingly popular and starting to make their way into the core research methods curriculum.
This course is designed to provide social science researchers with an entry point to computational text analysis. Participants will gain hands-on experience designing and implementing a quantitative text analysis research project and will learn to discuss, evaluate and interpret the results. Each class consists of a 2-hour lecture followed by a 1.5-hour lab in which participants apply the methods covered in the lecture in R. We will start with an overview of computational text analysis methods and discuss examples of their application across multiple disciplines and research fields. We will then survey the main ways in which text data can be acquired and present several major online text data sources.
The first steps in a text analysis research project – inputting, importing, manipulating and storing text data in different formats, as well as cleaning and processing it – often prove to be the most challenging for beginners. After addressing this initial set of issues we will study: the main ways in which text data can be turned into numbers; descriptive methods such as frequency tables and word clouds; automated dictionary methods (such as those developed to extract different emotions from text); text comparison methods (which are often used to study the diffusion and evolution of laws, policies and ideas); and text scaling methods (such as those used by political scientists to map the positions of political actors in the ideological space).
Finally, the course provides an introduction to machine learning applied to text data: supervised classification (routinely used in multiple disciplines to label large volumes of text documents based on a small subset of coded data) and unsupervised learning methods (as a very light introduction to topic modelling). The final session covers the use of Large Language Models (LLMs) in political science research tasks.
Course Objectives
At the end of the course, participants will have an understanding of the current quantitative text analysis research landscape, the ways in which computational text analysis can be applied to their area of interest and the main data sources, tools and methods available for further exploration.
Participants will also gain hands-on experience designing and implementing a quantitative text analysis research project in R and will be able to discuss and interpret the results and acknowledge the limitations of the methods used.
Essential text (this text will be provided by ESS):
Grimmer, J., Roberts, M. E., & Stewart, B. M.: Text as Data: A New Framework for Machine Learning and the Social Sciences, Princeton University Press, 2022. ISBN: 978-0691207551
Course Prerequisites
Students are expected to have a working knowledge of R and to be familiar with basic (undergraduate-level) concepts in research design and statistical analysis.
Background Knowledge Required
Maths
Calculus – Elementary
Linear Regression – Elementary
Statistics
OLS – Elementary
Maximum Likelihood – Elementary
Software
R – Moderate

Background textbook (cited below as [GRS]): Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
Day 1: Overview of computational text analysis methods and their application across multiple disciplines. Preview of the software and packages used in the course.
Readings:
- Benoit, K. 2020. “Text as Data: An Overview.” In Curini, Luigi and Robert Franzese, eds. Handbook of Research Methods in Political Science and International Relations. Thousand Oaks: Sage.
- [GRS] Chapters 1 and 2.
- Gilardi, F., & Wüest, B. (2018). Text-as-Data Methods for Comparative Policy Analysis. Working paper available at: https://www.fabriziogilardi.org/resources/papers/Gilardi-Wueest-TextAsData-Policy-Analysis.pdf
Day 2: What is digital text? Acquiring text data. Collecting data from different sources and in different formats – intro to Application Programming Interfaces and web scraping.
Readings:
- [GRS] Chapters 3 and 4.
- McCormick, T. H., Lee, H., Cesare, N., Shojaie, A., & Spiro, E. S. (2017). Using Twitter for Demographic and Social Science Research: Tools for Data Collection and Processing. Sociological Methods & Research, 46(3), 390–421. https://doi.org/10.1177/0049124115605339
- Glez-Peña, D., Lourenço, A., López-Fernández, H., Reboiro-Jato, M., & Fdez-Riverola, F. (2014). Web scraping technologies in an API world. Briefings in Bioinformatics, 15(5), 788–797. https://doi.org/10.1093/bib/bbt026
Day 3: Manipulating and storing text data. Formatting and working with text files in different formats. File formats, storage options, indexing and meta-data.
Readings:
- Batrinca, B., & Treleaven, P. C. (2015). Social media analytics: A survey of techniques, tools and platforms. AI & SOCIETY, 30(1), 89–116. https://doi.org/10.1007/s00146-014-0549-4
Day 4: Text cleaning and pre-processing. Introduction to regular expressions and natural language processing.
Readings:
- Denny, M. J., & Spirling, A. (2018). Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It. Political Analysis, 26(2), 168–189. https://doi.org/10.1017/pan.2017.44
- Welbers, K., Atteveldt, W. V., & Benoit, K. (2017). Text Analysis in R. Communication Methods and Measures, 11(4), 245–265. https://doi.org/10.1080/19312458.2017.1387238
- [GRS] Chapters 5 and 9.
- Regular Expressions as used in R, https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
Day 5: Text to numbers. Descriptive text methods. Comparing texts and computing text similarity measures.
Readings:
- [GRS] Chapters 7 and 8.
- Erk, K. (2012). Vector Space Models of Word Meaning and Phrase Meaning: A Survey. Language and Linguistics Compass, 6(10), 635–653. https://doi.org/10.1002/lnco.362
- Jansa, J. M., Hansen, E. R., & Gray, V. H. (2018). Copy and Paste Lawmaking: Legislative Professionalism and Policy Reinvention in the States. American Politics Research. https://doi.org/10.1177/1532673X18776628
- Linder, F., Desmarais, B., Burgess, M., & Giraudy, E. (2020). Text as Policy: Measuring Policy Similarity through Bill Text Reuse. Policy Studies Journal. https://doi.org/10.1111/psj.12257
Day 6: Automated dictionary methods.
Readings:
- [GRS] Chapters 15 and 16.
- Tausczik, Y. R., & Pennebaker, J. W. (2010). The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology, 29(1), 24–54. https://doi.org/10.1177/0261927X09351676
- Loughran, T., & McDonald, B. (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x
Day 7: Introduction to machine learning applied to text data. Supervised classification.
Readings:
- [GRS] Chapters 17–20.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R. Springer. Chapter 2.
- Yu, B., Kaufmann, S., & Diermeier, D. (2008). Classifying Party Affiliation from Political Speech. Journal of Information Technology & Politics, 5(1), 33–48. https://doi.org/10.1080/19331680802149608
Day 8: Unsupervised learning. Principal component analysis, clustering methods and introduction to topic models.
Readings:
- [GRS] Chapters 12 and 13.
- Wilkerson, J., & Casas, A. (2017). Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges. Annual Review of Political Science, 20(1), 529–544. https://doi.org/10.1146/annurev-polisci-052615-025542
- Quinn, K. M., Monroe, B. L., Colaresi, M., Crespin, M. H., & Radev, D. R. (2010). How to Analyze Political Attention with Minimal Assumptions and Costs. American Journal of Political Science, 54(1), 209–228. https://doi.org/10.1111/j.1540-5907.2009.00427.x
Day 9: Document scaling.
Readings:
- Laver, M., Benoit, K., & Garry, J. (2003). Extracting Policy Positions from Political Texts Using Words as Data. American Political Science Review, 97(2), 311–331. https://doi.org/10.1017/S0003055403000698
- Slapin, J. B., & Proksch, S.-O. (2008). A Scaling Model for Estimating Time-Series Party Positions from Texts. American Journal of Political Science, 52(3), 705–722. https://doi.org/10.1111/j.1540-5907.2008.00338.x
- Perry, P. O., & Benoit, K. (2017). Scaling Text with the Class Affinity Model. ArXiv: http://arxiv.org/abs/1710.08963
Day 10: Using large language models for research.
Readings:
- Ornstein, J. T., Blasingame, E. N., & Truscott, J. S. (2025). How to train your stochastic parrot: Large language models for political texts. Political Science Research and Methods, 1–18. https://doi.org/10.1017/psrm.2024.64
- Törnberg, P. (2024). Large Language Models Outperform Expert Coders and Supervised Classifiers at Annotating Political Social Media Messages. Social Science Computer Review, 08944393241286471. https://doi.org/10.1177/08944393241286471
- Barrie, C., Palmer, A., & Spirling, A. (2024). Replication for Language Models: Problems, Principles, and Best Practice for Political Science. URL: http://arthurspirling.org/documents/BarriePalmerSpirling_TrustMeBro.pdf
- Li, L., Li, J., Chen, C., Gui, F., Yang, H., Yu, C., Wang, Z., Cai, J., Zhou, J. A., Shen, B., Qian, A., Chen, W., Xue, Z., Sun, L., He, L., Chen, H., Ding, K., Du, Z., Mu, F., … Dong, Y. (2024). Political-LLM: Large Language Models in Political Science (arXiv:2412.06864). arXiv. https://doi.org/10.48550/arXiv.2412.06864