Workshop material on working with social media data and text mining methods in R
Made with woRkshoptools
Part of the conference: „Forschung zur Digitalisierung in der kulturellen Bildung“ (29-09-2022)
Contact: Veronika Batzdorfer ([email protected])
Social media are central sites of collective opinion formation and form an important basis for describing and explaining social phenomena (e.g., online radicalisation). However, every decision along the research cycle, from data collection through pre-processing to analysis, can introduce bias that threatens the validity and reliability of the results.
This workshop introduces how large amounts of openly available text data from Twitter can be made accessible and usable for research purposes. It combines conceptual considerations with practical applications in R.
- Strategies to collect and process textual data with application programming interfaces (APIs) using common R tools
- Potentials of bias in the research data cycle
- Basics of natural language processing (NLP), data cleaning (e.g. with 'quanteda' or 'textclean') and application of common NLP tools for automated text analysis
- Outlook on topic modelling (or word embeddings)
- Bias and ethics in NLP
- Twitter data: Kaggle Data Dump, Depression Tweets
- Download and install R from: https://cran.r-project.org/
- Download and install RStudio from: https://www.rstudio.com/
- Dependencies:
# Packages used in the workshop
pkgs <- c("here", "lubridate", "quanteda", "quanteda.textstats", "tidyverse",
          "academictwitteR", "tibble", "kableExtra", "tidytext",
          "textclean")
install.packages(pkgs)
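
Reinstalling every dependency on each run can be slow. The following is a minimal sketch of one way to install only what is missing. The helper name `install_if_missing` is hypothetical and not part of the workshop material; it relies only on base R functions.

```r
# Hypothetical helper (not part of the workshop material): install only the
# packages that are not yet available and return their names invisibly.
install_if_missing <- function(pkgs) {
  pkgs <- unique(pkgs)                          # ignore repeated names
  installed <- rownames(installed.packages())   # packages already on this machine
  missing <- setdiff(pkgs, installed)
  if (length(missing) > 0) {
    install.packages(missing)
  }
  invisible(missing)
}

# Demonstration with packages that ship with R, so nothing is downloaded here.
newly_installed <- install_if_missing(c("stats", "utils"))
length(newly_installed)  # number of packages that had to be installed
```

Calling `install_if_missing(pkgs)` with the dependency vector above would then download only the packages that are not yet present.
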
Time | Content |
---|---|
09:00 - 10:30 | Concepts & challenges when analysing social web data |
10:30 - 11:00 | Coffee break |
11:30 - 12:30 | Getting started with Twitter data: (i) sampling, (ii) pre-processing/data wrangling & (iii) basics of textual analyses (frequencies/co-occurrences/networks) |
12:30 - 13:30 | Lunch |
13:30 - 15:00 | Twitter Demo & Crawling Social web data |
15:00 - 15:30 | Coffee break |
15:30 - 17:00 | Outlook: advanced NLP techniques (e.g., topic modelling) & social web data collection; bias and ethics in NLP |
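
As a rough illustration of the pre-processing and basic textual analyses listed in the timetable (frequencies and co-occurrences), the sketch below uses `quanteda`. The example texts are invented for illustration and are not taken from the workshop data, and the cleaning choices (removing punctuation, numbers, URLs and English stopwords) are assumptions rather than the workshop's actual pipeline.

```r
library(quanteda)

# Toy texts invented for illustration; they are not from the workshop data.
texts <- c(
  t1 = "Feeling tired and sad today, could not sleep again.",
  t2 = "Great run this morning! Feeling happy and awake.",
  t3 = "Sad news today. Could not focus at work."
)

# Tokenise, then drop punctuation, numbers and URLs.
toks <- tokens(texts, remove_punct = TRUE, remove_numbers = TRUE,
               remove_url = TRUE)
toks <- tokens_tolower(toks)                   # lower-case all tokens
toks <- tokens_remove(toks, stopwords("en"))   # remove English stopwords

# Document-feature matrix: one row per text, one column per term.
term_matrix <- dfm(toks)
topfeatures(term_matrix, n = 5)                # most frequent terms

# Feature co-occurrence matrix: how often two terms share a document.
cooc <- fcm(toks, context = "document")
```

The co-occurrence matrix `cooc` could serve as a starting point for a term network, for example after restricting it to the most frequent features.
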
Feature ID | Type | Description |
---|---|---|
post_id | Numeric | identifier of tweet |
followers | Numeric | number of followers in profile |
friends | Numeric | number of friends in profile |
post_created | Character | date of posting tweet |
post_text | Character | text of original tweet |
user_id | Numeric | identifier of user |
label | Numeric | depression categorization: 1 = depression tweet, 2 = non-depression |
favourites | Numeric | number of external favorites of the tweet |
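
To make the variable table concrete, here is a minimal base R sketch of a data frame with the same columns. All values are invented for illustration, and the timestamp format passed to `as.POSIXct()` is an assumption that should be checked against the actual data file.

```r
# Toy rows invented for illustration; values are not from the Kaggle data.
tweets <- data.frame(
  post_id      = c(1, 2, 3),
  user_id      = c(10, 10, 11),
  followers    = c(120, 120, 15),
  friends      = c(80, 80, 40),
  favourites   = c(3, 0, 7),
  post_created = c("2015-08-16 10:15:00", "2015-08-17 21:40:00",
                   "2015-08-18 07:05:00"),
  post_text    = c("first example tweet", "second example tweet",
                   "third example tweet"),
  label        = c(1, 2, 1),
  stringsAsFactors = FALSE
)

# Parse the character timestamps; the format string is an assumption.
tweets$post_created <- as.POSIXct(tweets$post_created,
                                  format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Recode the numeric label using the coding given in the table.
tweets$label <- factor(tweets$label, levels = c(1, 2),
                       labels = c("depression", "non-depression"))

table(tweets$label)  # number of tweets per category
str(tweets)          # check the resulting column types
```

Storing `label` as a factor keeps the category names attached to the codes, which makes later tables and plots easier to read.
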