Workshop 4: Introduction to natural language processing and topic modeling with Python

Lecturers

Dr. Nicolò Gozzi & Dr. N. Gizem Bacaksizlar Turbic

Description

Documents and full texts have a long history as data in the social sciences. Computational Social Science is also concerned with new forms of text data that can be collected from digital platforms and the web. All such datasets are expressions of natural language, which brings methods from computational linguistics and machine learning, such as Natural Language Processing (NLP) and automated content analysis, to center stage.

In this workshop, we will give an introduction to how text data can be preprocessed and analyzed in Python. In particular, we will discuss how information can be extracted from raw texts using regular expressions, how words can be reduced to their base forms, what language models are and how they allow us to extract meaningful pieces of symbolic communication such as n-grams, how grammatical parts of speech (e.g., nouns, verbs) can be identified, and how all of these steps combine into a text preprocessing pipeline. At the end of such a pipeline stands a document-term matrix that is ready for analysis.

For the analysis, we will introduce Latent Dirichlet Allocation (LDA), a topic modeling method for fully automated content analysis that reduces the dimensionality of the document-term matrix. It assumes that documents are generated from mixtures of topics and infers topics as groups of words that tend to occur together. As data, we will use a popular text corpus still to be determined.

The workshop will alternate between live-coding demonstrations and periods in which participants apply that knowledge in context, both using Jupyter Notebooks. We will use spaCy and Gensim, two standard Python libraries for NLP and topic modeling.
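To give a feel for the pipeline described above, here is a minimal sketch that goes from raw text to a document-term matrix. It is not the workshop material: it deliberately uses only the Python standard library so that it runs without any installation, and the toy corpus, the stopword list, and all function names are assumptions made up for illustration. In the workshop itself, spaCy would take over steps such as reduction to base forms and part-of-speech tagging, and Gensim would fit the topic model on the resulting matrix.

```python
import re
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "Contact us at info@example.org about the NLP workshop.",
    "The workshop covers topic models and language models.",
    "Topic models group words; language models predict words.",
]

# Simplified patterns: real e-mail addresses and tokens are messier than this.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
TOKEN_RE = re.compile(r"[a-z]+")
STOPWORDS = {"the", "and", "at", "us", "about"}  # tiny illustrative list


def extract_emails(text):
    """Information extraction: pull e-mail addresses out of raw text."""
    return EMAIL_RE.findall(text)


def tokenize(text):
    """Lowercase, drop e-mail addresses, split into words, remove stopwords."""
    text = EMAIL_RE.sub(" ", text.lower())
    return [tok for tok in TOKEN_RE.findall(text) if tok not in STOPWORDS]


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(tokenized_docs):
    """Count how often each vocabulary term occurs in each document."""
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


emails = [extract_emails(d) for d in docs]
tokenized = [tokenize(d) for d in docs]
bigrams = [ngrams(toks) for toks in tokenized]
vocab, dtm = document_term_matrix(tokenized)

print(vocab)
for row in dtm:
    print(row)
```

Each row of `dtm` is one document and each column one vocabulary term, which is the input format a topic model such as LDA works on. Note that the bigrams here are built after stopword removal, so they can join words that were not adjacent in the original sentence; whether that is desirable depends on the analysis.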