Iulia Cioroianu is a Lecturer in the Department of Politics, Languages and International Studies at the University of Bath, and an affiliate of the Institute for Policy Research and the ART-AI Centre for Doctoral Training. Iulia is a computational social scientist who studies the effects of social media and online information exposure on political competition and polarization using natural language processing and quantitative text analysis, machine learning and survey experiments. She holds a Ph.D. in Political Science from New York University and an M.A. from Central European University. Before joining the University of Bath, she was a research fellow in the Q-Step Centre for Quantitative Social Sciences at the University of Exeter, and a teaching fellow in the LSE Department of Methodology.
More and more of our world is recorded as digital text, capturing human knowledge and interactions at an unprecedented scale and providing a rich source of data for researchers across academic disciplines. Consequently, computational text analysis methods and tools are growing in popularity and making their way into the core research methods curriculum.
This course is designed to provide social science researchers with an entry point to computational text analysis. Participants will gain hands-on experience designing and implementing a quantitative text analysis research project and will learn to discuss, evaluate and interpret the results. Each class consists of a 2-hour lecture followed by a 1.5-hour lab in which participants apply the methods covered in the lecture.
We will start with an overview of computational text analysis methods and discuss examples of their application across multiple disciplines and research fields. We will then survey the main ways in which text data can be acquired and present several major online text data sources.
The first steps in a text analysis research project – covering inputting, importing, manipulating and storing text data under different formats, as well as cleaning and processing it – often prove to be the most challenging for beginners. After addressing this initial set of issues we will study: the main ways in which text data can be turned into numbers; descriptive methods such as frequency tables and word clouds; automated dictionary methods (such as those developed to extract different emotions from text); text comparison methods (which are often used to study the diffusion and evolution of laws, policies and ideas); and text scaling methods (such as those used by political scientists to map the positions of political actors in the ideological space).
Finally, the course provides an introduction to machine learning applied to text data: supervised classification (routinely used in multiple disciplines to label large volumes of text documents based on a small subset of coded data) and unsupervised learning methods (as a very light introduction to topic modelling).
At the end of the course, participants will have an understanding of the current quantitative text analysis research landscape, the ways in which computational text analysis can be applied to their area of interest and the main data sources, tools and methods available for further exploration. Participants will also gain hands-on experience designing and implementing a quantitative text analysis research project in R and will be able to discuss and interpret the results and acknowledge the limitations of the methods used.
Familiarity with basic research design and statistical analysis is expected, and exposure to the R computing environment before the course is strongly encouraged.
Representative Background Reading
Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267–297.
Silge, J., & Robinson, D. (2018). Text Mining with R: A Tidy Approach. O’Reilly Media. Available online at https://www.tidytextmining.com
Background knowledge required
OLS = elementary
Maximum Likelihood = elementary
R = elementary
The course is delivered in ten sessions. Each session consists of a lecture and a lab. During the lab students will work through sets of examples, some of which are part of a larger project which involves collecting online articles and analysing them using the methods introduced in the course.
Overview of computational text analysis methods and their applications. Preview of the software and packages used in the course.
1. Kenneth Benoit. July 16, 2019. “Text as Data: An Overview.” Forthcoming in Curini, Luigi and Robert Franzese, eds. Handbook of Research Methods in Political Science and International Relations. Thousand Oaks: Sage.
2. Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3), 267–297. https://doi.org/10.1093/pan/mps028
3. Gilardi and Wuest, 2018. Text-as-Data Methods for Comparative Policy Analysis. Working paper. Available at: https://www.fabriziogilardi.org/resources/papers/Gilardi-Wueest-TextAsData-Policy-Analysis.pdf
What is digital text? Acquiring text data. Collecting data from different sources and in different formats – intro to Application Programming Interfaces and web scraping.
1. Wallach, H. (2016, March). Computational Social Science: Toward a Collaborative Future. Computational Social Science: Discovery and Prediction. https://doi.org/10.1017/CBO9781316257340.014
2. McCormick, T. H., Lee, H., Cesare, N., Shojaie, A., & Spiro, E. S. (2017). Using Twitter for Demographic and Social Science Research: Tools for Data Collection and Processing. Sociological Methods & Research, 46(3), 390–421. https://doi.org/10.1177/0049124115605339
3. Glez-Peña, D., Lourenço, A., López-Fernández, H., Reboiro-Jato, M., & Fdez-Riverola, F. (2014). Web scraping technologies in an API world. Briefings in Bioinformatics, 15(5), 788–797. https://doi.org/10.1093/bib/bbt026
Manipulating and storing text data. Formatting and working with text files in different formats. Storage options, indexing and meta-data.
1. Batrinca, B., & Treleaven, P. C. (2015). Social media analytics: A survey of techniques, tools and platforms. AI & SOCIETY, 30(1), 89–116. https://doi.org/10.1007/s00146-014-0549-4
Text cleaning and pre-processing. Introduction to regular expressions and natural language processing.
1. Denny, M. J., & Spirling, A. (2018). Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It. Political Analysis, 26(2), 168–189. https://doi.org/10.1017/pan.2017.44
2. Welbers, K., Atteveldt, W. V., & Benoit, K. (2017). Text Analysis in R. Communication Methods and Measures, 11(4), 245–265. https://doi.org/10.1080/19312458.2017.1387238
3. Regular Expressions as used in R, https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
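As an illustration of the cleaning and pre-processing steps covered in this session, the following is a minimal sketch of regular-expression-based cleaning and tokenization. It is written in Python with the standard library only, purely for illustration; the course labs use R, and the function names `clean_text` and `tokenize`, the sample string and the particular cleaning rules are assumptions chosen for this example, not course materials.

```python
import re

def clean_text(raw):
    """Lowercase the text, then strip HTML tags, URLs, and non-letter characters."""
    text = raw.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # remove HTML tags such as <p> or </p>
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # keep only ASCII letters and whitespace
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

def tokenize(raw):
    """Split cleaned text into word tokens on whitespace."""
    return clean_text(raw).split()

# Hypothetical scraped snippet containing markup, a URL, digits and punctuation.
sample = "<p>Read MORE at https://example.org/news!</p>  Taxes rose 5% in 2019."
print(tokenize(sample))
```

Every rule here is a choice with consequences: dropping digits, for example, discards "5%" and "2019", which may or may not matter for a given research question.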
Text to numbers. Descriptive text methods. Comparing texts and computing text similarity measures.
1. Erk, K. (2012). Vector Space Models of Word Meaning and Phrase Meaning: A Survey. Language and Linguistics Compass, 6(10), 635–653. https://doi.org/10.1002/lnco.362
2. Jansa, J. M., Hansen, E. R., & Gray, V. H. (2018). Copy and Paste Lawmaking: Legislative Professionalism and Policy Reinvention in the States. American Politics Research. https://doi.org/10.1177/1532673X18776628
3. Linder, F., Desmarais, B., Burgess, M., & Giraudy, E. (2018). Text as Policy: Measuring Policy Similarity through Bill Text Reuse. Policy Studies Journal. https://doi.org/10.1111/psj.12257
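To make the "text to numbers" and similarity ideas of this session concrete, here is a minimal sketch that turns documents into bag-of-words count vectors and compares them with cosine similarity. It uses Python's standard library for illustration only (the labs use R); the three toy "bills" and the function names are invented for this example, and cosine similarity is just one of several similarity measures that could be used.

```python
import math
from collections import Counter

def term_counts(text):
    """Represent a document as a bag of words: a mapping from term to count."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0 = no shared terms)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical one-sentence "bills" used only to illustrate the computation.
bill_a = term_counts("the state shall fund public schools")
bill_b = term_counts("the state shall fund public roads")
bill_c = term_counts("hunting licences expire each winter")

print(cosine_similarity(bill_a, bill_b))  # five of six terms shared
print(cosine_similarity(bill_a, bill_c))  # no terms shared
```

Raw counts over whitespace tokens are the simplest possible representation; weighting schemes and pre-processing decisions change the resulting similarities.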
Automated dictionary methods.
1. Tausczik, Y. R., & Pennebaker, J. W. (2010). The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology, 29(1), 24–54. https://doi.org/10.1177/0261927X09351676
2. Loughran, T., & McDonald, B. (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x
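The core mechanics of a dictionary method can be sketched in a few lines: count how often words from each category appear and divide by document length. The example below is in Python (standard library only, for illustration; the labs use R). The two-category word list is a toy assumption made up for this sketch and is not taken from LIWC, the Loughran and McDonald lists, or any other published dictionary.

```python
from collections import Counter

# Toy dictionary invented for illustration; real dictionaries are far larger
# and are validated for a specific domain.
TOY_DICTIONARY = {
    "positive": {"good", "great", "hope", "win"},
    "negative": {"bad", "fear", "loss", "crisis"},
}

def dictionary_scores(text, dictionary):
    """Return, for each category, the share of tokens that match its word list."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return {
        category: (sum(counts[w] for w in words) / total if total else 0.0)
        for category, words in dictionary.items()
    }

speech = "we hope to win but fear a crisis and fear a loss"
print(dictionary_scores(speech, TOY_DICTIONARY))
```

As the readings for this session stress, a word list built for one domain can misclassify words in another, so the scores are only as good as the dictionary's fit to the texts.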
Introduction to machine learning applied to text data. Supervised classification.
1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R (2013 edition). Springer. Chapter 2.
2. Yu, B., Kaufmann, S., & Diermeier, D. (2008). Classifying Party Affiliation from Political Speech. Journal of Information Technology & Politics, 5(1), 33–48. https://doi.org/10.1080/19331680802149608
Unsupervised learning. Principal component analysis, clustering methods and introduction to topic models.
1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R (2013 edition). Springer. Chapter 10.
2. Wilkerson, J., & Casas, A. (2017). Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges. Annual Review of Political Science, 20(1), 529–544. https://doi.org/10.1146/annurev-polisci-052615-025542
3. Quinn, K. M., Monroe, B. L., Colaresi, M., Crespin, M. H., & Radev, D. R. (2010). How to Analyze Political Attention with Minimal Assumptions and Costs. American Journal of Political Science, 54(1), 209–228. https://doi.org/10.1111/j.1540-5907.2009.00427.x
Text scaling methods.
1. Laver, M., Benoit, K., & Garry, J. (2003). Extracting Policy Positions from Political Texts Using Words as Data. American Political Science Review, 97(2), 311–331. https://doi.org/10.1017/S0003055403000698
2. Slapin, J. B., & Proksch, S.-O. (2008). A Scaling Model for Estimating Time-Series Party Positions from Texts. American Journal of Political Science, 52(3), 705–722. https://doi.org/10.1111/j.1540-5907.2008.00338.x
3. Perry, P. O., & Benoit, K. (2017). Scaling Text with the Class Affinity Model. ArXiv:1710.08963 [Cs, Stat]. http://arxiv.org/abs/1710.08963
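The basic logic of the Wordscores approach in the first reading can be sketched as follows: words are scored by how strongly they are associated with reference texts whose positions are assumed known, and a new text is scored as the frequency-weighted mean of its word scores. This is a simplified sketch in Python (standard library only, for illustration; the labs use R). It omits the rescaling of scores and the uncertainty estimates discussed in the paper, and the reference texts, their positions of -1 and +1, and all function names are assumptions invented for the example.

```python
from collections import Counter

def relative_freqs(tokens):
    """Relative frequency of each word within one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def word_scores(reference_texts, reference_positions):
    """Score each word as the average reference position, weighted by the
    probability of being in each reference text given that word."""
    freqs = {name: relative_freqs(text.lower().split())
             for name, text in reference_texts.items()}
    vocabulary = set().union(*freqs.values())
    scores = {}
    for w in vocabulary:
        total = sum(f.get(w, 0.0) for f in freqs.values())
        scores[w] = sum(f.get(w, 0.0) / total * reference_positions[name]
                        for name, f in freqs.items())
    return scores

def score_text(text, scores):
    """Mean word score over the tokens that appear in the reference vocabulary."""
    scored = [t for t in text.lower().split() if t in scores]
    if not scored:
        return None  # no overlap with the reference texts
    return sum(scores[t] for t in scored) / len(scored)

# Hypothetical reference texts with assumed positions on a left-right scale.
references = {"left": "tax spend welfare spend", "right": "cut tax market market"}
positions = {"left": -1.0, "right": 1.0}

scores = word_scores(references, positions)
print(score_text("spend tax market cut cut", scores))
```

Words used equally by both reference texts (here "tax") receive a score of zero and pull the estimate toward the centre, which is one reason raw scores of new texts tend to cluster and are rescaled in practice.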
Further topics and applications. The final lecture and lab will expand upon the previous topics in response to student interests.