Dr Iulia Cioroianu is a Prize Fellow at the Institute for Policy Research (IPR) at the University of Bath. She holds a Ph.D. in political science from New York University and an M.A. in political science from Central European University. Before joining the IPR, she was a research fellow in the Q-Step Centre for quantitative social sciences at the University of Exeter, and a pre-doctoral fellow in the LSE Department of Methodology.

Iulia is a social data scientist who studies the effects of social media and online information exposure on political competition and polarization, using natural language processing, quantitative text analysis, machine learning and survey experiments. Her work has been published in Electoral Studies, Social Networks and AAAI conference proceedings, and has been featured in NCRM podcasts and research methods videos. She received an IBM Faculty Award as well as an ESRC IAA Innovation Fellowship, and is currently working with the IBM Centre for Advanced Studies in Amsterdam on the project Understanding News Bias (UNBias). The project develops algorithms for measuring topic-specific ideological positions in news articles, and a web browser extension that reveals these positions to users and offers them the opportunity to read other articles on the same topic that may present a different ideological perspective.

Course Content
Our world is increasingly being recorded as digital text, capturing human knowledge and interactions to an unprecedented level and providing a rich source of data for researchers across different academic disciplines and subjects. Consequently, computational text analysis methods and tools are becoming increasingly popular and starting to make their way into the core research methods curriculum.

This course is designed to provide social science researchers with an entry point to computational text analysis. Participants will gain hands-on experience designing and implementing a quantitative text analysis research project and will learn to discuss, evaluate and interpret the results. Each class consists of a 2-hour lecture followed by a 1.5-hour lab in which participants apply the methods covered in the lecture.

We will start with an overview of computational text analysis methods and discuss examples of their application across multiple disciplines and research fields. We will then survey the main ways in which text data can be acquired and present several major online text data sources.

The first steps in a text analysis research project – covering imputing, importing, manipulating and storing text data in different formats, as well as cleaning and processing it – often prove to be the most challenging for beginners. After addressing this initial set of issues we will study: the main ways in which text data can be turned into numbers; descriptive methods such as frequency tables and word clouds; automated dictionary methods (such as those developed to extract different emotions from text); text comparison methods (which are often used to study the diffusion and evolution of laws, policies and ideas); and text scaling methods (such as those used by political scientists to map the positions of political actors in the ideological space).
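
As a rough illustration of what "turning text into numbers" means, the sketch below builds a small document-term matrix and a frequency table. It is written in plain Python for brevity (the course labs use R), and the toy corpus, the whitespace tokenizer and all names are assumptions made for this example, not course materials.

```python
from collections import Counter

# Toy corpus invented for illustration; not course data.
docs = {
    "doc1": "taxes should be lower and spending should be lower",
    "doc2": "spending on health should be higher",
}

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

# Term frequencies for each document.
counts = {name: Counter(tokenize(text)) for name, text in docs.items()}

# Shared vocabulary, sorted so that the columns have a stable order.
vocab = sorted(set().union(*counts.values()))

# Document-term matrix: one row per document, one column per vocabulary term.
dtm = {name: [c[term] for term in vocab] for name, c in counts.items()}

print(vocab)
for name, row in dtm.items():
    print(name, row)

# A simple frequency table for one document.
print(counts["doc1"].most_common(3))
```

Each row of the matrix is a numeric representation of one document, which is the starting point for most of the methods listed above.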

Finally, the course provides an introduction to machine learning applied to text data: supervised classification (routinely used in multiple disciplines to label large volumes of text documents based on a small subset of coded data) and unsupervised learning methods (as a very light introduction to topic modelling).

Course Objectives
At the end of the course, participants will have an understanding of the current quantitative text analysis research landscape, the ways in which computational text analysis can be applied to their area of interest and the main data sources, tools and methods available for further exploration. Participants will also gain hands-on experience designing and implementing a quantitative text analysis research project in R and will be able to discuss and interpret the results and acknowledge the limitations of the methods used.

Course Prerequisites
Familiarity with basic research design and statistical analysis is expected, and exposure to the R computing environment before the course is strongly encouraged.

Representative Background Reading
Grimmer, J., & Stewart, B. M. (2013). Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis, 21(3), 267–297.

Required texts

Silge, J., & Robinson, D. (2018). Text Mining with R: A Tidy Approach. O’Reilly Media. Available online at https://www.tidytextmining.com

Background knowledge required
Statistics
OLS = elementary
Maximum Likelihood = elementary

Computer Background
R = elementary

The course is delivered in ten sessions. Each session consists of a lecture and a lab. During the lab, students will work through sets of examples, some of which are part of a larger project that involves collecting online articles and analysing them using the methods introduced in the course.

Day 1

Overview of computational text analysis methods and their applications. Preview of the software and packages used in the course.

Readings:

1. Kenneth Benoit. July 16, 2019. “Text as Data: An Overview.” Forthcoming in Curini, Luigi and Robert Franzese, eds. Handbook of Research Methods in Political Science and International Relations. Thousand Oaks: Sage.
2. Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3), 267–297. https://doi.org/10.1093/pan/mps028
3. Gilardi and Wüest, 2018. Text-as-Data Methods for Comparative Policy Analysis. Working paper. Available at: https://www.fabriziogilardi.org/resources/papers/Gilardi-Wueest-TextAsData-Policy-Analysis.pdf

 

Day 2

What is digital text? Acquiring text data. Collecting data from different sources and in different formats – intro to Application Programming Interfaces and web scraping.

Readings:

1. Wallach, H. (2016, March). Computational Social Science: Toward a Collaborative Future. Computational Social Science: Discovery and Prediction. https://doi.org/10.1017/CBO9781316257340.014
2. McCormick, T. H., Lee, H., Cesare, N., Shojaie, A., & Spiro, E. S. (2017). Using Twitter for Demographic and Social Science Research: Tools for Data Collection and Processing. Sociological Methods & Research, 46(3), 390–421. https://doi.org/10.1177/0049124115605339
3. Glez-Peña, D., Lourenço, A., López-Fernández, H., Reboiro-Jato, M., & Fdez-Riverola, F. (2014). Web scraping technologies in an API world. Briefings in Bioinformatics, 15(5), 788–797. https://doi.org/10.1093/bib/bbt026

 

Day 3

Manipulating and storing text data. Formatting and working with text files in different formats. Storage options, indexing and meta-data.

Readings:

1. Batrinca, B., & Treleaven, P. C. (2015). Social media analytics: A survey of techniques, tools and platforms. AI & SOCIETY, 30(1), 89–116. https://doi.org/10.1007/s00146-014-0549-4

 

Day 4

Text cleaning and pre-processing. Introduction to regular expressions and natural language processing.

Readings:

1. Denny, M. J., & Spirling, A. (2018). Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It. Political Analysis, 26(2), 168–189. https://doi.org/10.1017/pan.2017.44
2. Welbers, K., Atteveldt, W. V., & Benoit, K. (2017). Text Analysis in R. Communication Methods and Measures, 11(4), 245–265. https://doi.org/10.1080/19312458.2017.1387238
3. Regular Expressions as used in R, https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html
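
To give a sense of the kind of cleaning step covered in this session, here is a minimal sketch using regular expressions. It is written in Python for brevity (the labs use R), and the function names, the tiny stop word list and the example sentence are assumptions made for illustration only. Which steps to apply is a research decision, not a neutral default.

```python
import re

# Tiny illustrative stop word list; real lists are much longer.
STOPWORDS = {"the", "a", "of", "and", "to", "is"}

def clean(text):
    """Lowercase, drop URLs, keep only letters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # remove digits and punctuation
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text

def preprocess(text):
    """Clean the text, split it into tokens and remove stop words."""
    return [tok for tok in clean(text).split() if tok not in STOPWORDS]

print(preprocess("The Bill (No. 42) is available at https://example.org/bill!"))
```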

 

Day 5

Text to numbers. Descriptive text methods. Comparing texts and computing text similarity measures.

Readings:

1. Erk, K. (2012). Vector Space Models of Word Meaning and Phrase Meaning: A Survey. Language and Linguistics Compass, 6(10), 635–653. https://doi.org/10.1002/lnco.362
2. Jansa, J. M., Hansen, E. R., & Gray, V. H. (2018). Copy and Paste Lawmaking: Legislative Professionalism and Policy Reinvention in the States. American Politics Research. https://doi.org/10.1177/1532673X18776628
3. Linder, F., Desmarais, B., Burgess, M., & Giraudy, E. (2018). Text as Policy: Measuring Policy Similarity through Bill Text Reuse. Policy Studies Journal. https://doi.org/10.1111/psj.12257
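
One common way to compare two texts is the cosine similarity between their word-count vectors. The sketch below is a minimal Python illustration (the labs use R); the function name and the example sentences are assumptions, and the text reuse measures in the readings above are more elaborate than this.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words that the two texts share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)

bill_1 = "an act to regulate the sale of tobacco products"
bill_2 = "an act to regulate the sale of alcohol products"
print(round(cosine_similarity(bill_1, bill_2), 3))
```

Values close to 1 indicate that the two texts use nearly the same words in similar proportions; a value of 0 means they share no words.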

 

Day 6

Automated dictionary methods.

Readings:

1. Tausczik, Y. R., & Pennebaker, J. W. (2010). The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology, 29(1), 24–54. https://doi.org/10.1177/0261927X09351676
2. Loughran, T., & McDonald, B. (2011). When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10-Ks. The Journal of Finance, 66(1), 35–65. https://doi.org/10.1111/j.1540-6261.2010.01625.x
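
The core idea behind dictionary methods can be sketched in a few lines: count how many words in a text belong to each category of a predefined word list. The example below is in Python (the labs use R), and its two-category dictionary is entirely hypothetical; it is not LIWC or the Loughran and McDonald word lists.

```python
import re

# Hypothetical mini-dictionary, invented for this example only.
DICTIONARY = {
    "positive": {"good", "great", "hope"},
    "negative": {"bad", "fear", "crisis"},
}

def dictionary_scores(text):
    """Share of tokens in the text that fall into each dictionary category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    return {
        category: (sum(tok in words for tok in tokens) / total if total else 0.0)
        for category, words in DICTIONARY.items()
    }

print(dictionary_scores("Good news: no crisis, great hope"))
```

As the Loughran and McDonald reading stresses, a word list built for one domain can misclassify words in another, so scores like these need validation against the texts being studied.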

 

Day 7

Introduction to machine learning applied to text data. Supervised classification.

Readings:

1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R (2013 edition). Springer. Chapter 2.
2. Yu, B., Kaufmann, S., & Diermeier, D. (2008). Classifying Party Affiliation from Political Speech. Journal of Information Technology & Politics, 5(1), 33–48. https://doi.org/10.1080/19331680802149608

 

Day 8

Unsupervised learning. Principal component analysis, clustering methods and introduction to topic models. 

Readings:

1. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: With Applications in R (2013 edition). Springer. Chapter 10.
2. Wilkerson, J., & Casas, A. (2017). Large-Scale Computerized Text Analysis in Political Science: Opportunities and Challenges. Annual Review of Political Science, 20(1), 529–544. https://doi.org/10.1146/annurev-polisci-052615-025542
3. Quinn, K. M., Monroe, B. L., Colaresi, M., Crespin, M. H., & Radev, D. R. (2010). How to Analyze Political Attention with Minimal Assumptions and Costs. American Journal of Political Science, 54(1), 209–228. https://doi.org/10.1111/j.1540-5907.2009.00427.x

 

Day 9

Document scaling.

Readings:

1. Laver, M., Benoit, K., & Garry, J. (2003). Extracting Policy Positions from Political Texts Using Words as Data. American Political Science Review, 97(2), 311–331. https://doi.org/10.1017/S0003055403000698
2. Slapin, J. B., & Proksch, S.-O. (2008). A Scaling Model for Estimating Time-Series Party Positions from Texts. American Journal of Political Science, 52(3), 705–722. https://doi.org/10.1111/j.1540-5907.2008.00338.x
3. Perry, P. O., & Benoit, K. (2017). Scaling Text with the Class Affinity Model. ArXiv:1710.08963 [Cs, Stat]. http://arxiv.org/abs/1710.08963

 

Day 10

Further topics and applications. The final lecture and lab will expand upon the previous topics in response to student interests.