[
["index.html", "An Introduction to Text Processing and Analysis with R", " An Introduction to Text Processing and Analysis with R In the beginning was the word ... Michael Clark m-clark.github.io 2018-09-09 "],
["intro.html", "Introduction Overview Initial Steps", " Introduction Overview Dealing with text is typically not even considered in the applied statistical training of most disciplines. This is in direct contrast with how often it has to be dealt with prior to more common analysis, or how interesting it might be to have text be the focus of analysis. This document and corresponding workshop will aim to provide a sense of the things one can do with text, and the sorts of analyses that might be useful. Goals The goal of this workshop is primarily to provide a sense of common tasks related to dealing with text as part of the data or the focus of analysis, and provide some relatively easy to use tools. It must be stressed that this is only a starting point, a hopefully fun foray into the world of text, not definitive statement of how you should analyze text. In fact, some of the methods demonstrated would likely be too rudimentary for most goals. Additionally, we’ll have exercises to practice, but those comfortable enough to do so should follow along with the in-text examples. Note that there is more content here than will be covered in a single workshop. Prerequisites The document is for the most part very applied in nature, and doesn’t assume much beyond familiarity with the R statistical computing environment. For programming purposes, it would be useful if you are familiar with the tidyverse, or at least dplyr specifically, otherwise some of the code may be difficult to understand (and is required if you want to run it). Here are some of the packages used in this document: Throughout tidyverse tidytext Strings stringr lubridate Sentiment gutenbergr janeaustenr POS openNLP NLP tm Topic Models topicmodels quanteda Word Embedding text2vec Note the following color coding used in this document: emphasis package function object/class link Initial Steps Download the zip file here. It contains an RStudio project with several data files that you can use as you attempt to replicate the analyses. Be mindful of where you put it. Unzip it. Be mindful of where you put the resulting folder. Open RStudio. File/Open Project and navigate to and click on the blue icon in the folder you just created. Install any of the above packages you want. "],
["string-theory.html", "String Theory Basic data types Basic Text Functionality Regular Expressions Text Processing Examples Exercises", " String Theory Basic data types R has several core data structures: Vectors Factors Lists Matrices/arrays Data frames Vectors form the basis of R data structures. There are two main types- atomic and lists. All elements of an atomic vector are the same type. Examples include: character numeric (double) integer logical Character strings When dealing with text, objects of class character are what you’d typically be dealing with. x = c('... Of Your Fake Dimension', 'Ephemeron', 'Dryswch', 'Isotasy', 'Memory') x Not much to it, but be aware there is no real limit to what is represented as a character vector. For example, in a data frame, you could have a column where each entry is one of the works of Shakespeare. Factors Although not exactly precise, one can think of factors as integers with labels. So, the underlying representation of a variable for sex is 1:2 with labels ‘Male’ and ‘Female’. They are a special class with attributes, or metadata, that contains the information about the levels. x = factor(rep(letters[1:3], e=10)) attributes(x) $levels [1] "a" "b" "c" $class [1] "factor" While the underlying representation is numeric, it is important to remember that factors are categorical. They can’t be used as numbers would be, as the following demonstrates. as.numeric(x) [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 sum(x) Error in Summary.factor(structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, : 'sum' not meaningful for factors Any numbers could be used, what we’re interested in are the labels, so a ‘sum’ doesn’t make any sense. All of the following would produce the same factor. factor(c(1, 2, 3), labels=c('a', 'b', 'c')) factor(c(3.2, 10, 500000), labels=c('a', 'b', 'c')) factor(c(.49, 1, 5), labels=c('a', 'b', 'c')) Because of the integer+metadata representation, factors are actually smaller than character strings, often notably so. x = sample(state.name, 10000, replace=T) format(object.size(x), units='Kb') [1] "80.8 Kb" format(object.size(factor(x)), units='Kb') [1] "42.4 Kb" format(object.size(as.integer(factor(x))), units='Kb') [1] "39.1 Kb" However, if memory is really a concern, it’s probably not that using factors will help, but rather better hardware. Analysis It is important to know that raw text cannot be analyzed quantitatively. There is no magic that takes a categorical variable with text labels and estimates correlations among words and other words or numeric data. Everything that can be analyzed must have some numeric representation first, and this is where factors come in. For example, here is a data frame with two categorical predictors (factor*), a numeric predictor (x), and a numeric target (y). What follows is what it looks like if you wanted to run a regression model in that setting. df = crossing(factor_1 = c('A', 'B'), factor_2 = c('Q', 'X', 'J')) %>% mutate(x=rnorm(6), y=rnorm(6)) df # A tibble: 6 x 4 factor_1 factor_2 x y <chr> <chr> <dbl> <dbl> 1 A J 0.797 -0.190 2 A Q -1.000 -0.496 3 A X 1.05 0.487 4 B J -0.329 -0.101 5 B Q 0.905 -0.809 6 B X 1.18 -1.92 ## model.matrix(lm(y ~ x + factor_1 + factor_2, data=df)) (Intercept) x factor_1B factor_2Q factor_2X 1 0.7968603 0 0 0 1 -0.9999264 0 1 0 1 1.0522363 0 0 1 1 -0.3291774 1 0 0 1 0.9049071 1 1 0 1 1.1754300 1 0 1 The model.matrix function exposes the underlying matrix that is actually used in the regression analysis. 
You’d get a coefficient for each column of that matrix. As such, even the intercept must be represented in some fashion. For categorical data, the default coding scheme is dummy coding. A reference category is arbitrarily chosen (it doesn’t matter which, and you can always change it), while the other categories are represented by indicator variables, where a 1 represents the corresponding label and everything else is zero. For details on this coding scheme or others, consult any basic statistical modeling book. In addition, you’ll note that in all text-specific analysis, the underlying information is numeric. For example, with topic models, the base data structure is a document-term matrix of counts. Characters vs. Factors The main thing to note is that factors are generally a statistical phenomenon, and are required to do statistical things with data that would otherwise be a simple character string. If you know the relatively few levels the data can take, you’ll generally want to use factors, or at least know that statistical packages and methods will require them. In addition, factors allow you to easily overcome the silly default alphabetical ordering of category levels in some very popular visualization packages. For other things, such as text analysis, you’ll almost certainly want character strings instead, and in many cases it will be required. It’s also worth noting that a lot of base R and other behavior will coerce strings to factors. This made a lot more sense in the early days of R, but is not really necessary these days. For more on this stuff see the following: http://adv-r.had.co.nz/Data-structures.html http://forcats.tidyverse.org/ http://r4ds.had.co.nz/factors.html https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/ http://notstatschat.tumblr.com/post/124987394001/stringsasfactors-sigh Basic Text Functionality Base R A lot of folks new to R are not aware of just how much basic text processing R comes with out of the box. Here are examples of note. paste: glue text/numeric values together substr: extract or replace substrings in a character vector grep family: use regular expressions to deal with patterns of text strsplit: split strings nchar: how many characters in a string as.numeric: convert a string to numeric if it can be strtoi: convert a string to integer if it can be (faster than as.integer) adist: string distances I probably use paste/paste0 more than most things when dealing with text, as string concatenation comes up so often. The following provides some demonstration. paste(c('a', 'b', 'cd'), collapse='|') [1] "a|b|cd" paste(c('a', 'b', 'cd'), collapse='') [1] "abcd" paste0('a', 'b', 'cd') # shortcut to collapse='' [1] "abcd" paste0('x', 1:3) [1] "x1" "x2" "x3" Beyond that, use of regular expression and functionality included in the grep family is a major way to save a lot of time during data processing. I leave that to its own section later. Useful packages A couple packages will probably take care of the vast majority of your standard text processing needs. Note that even if they aren’t adding anything to the functionality of the base R functions, they typically will have been optimized in some fashion, particularly with regard to speed. stringr/stringi: More or less the same stuff you’ll find with substr, grep etc. except easier to use and/or faster. They also add useful functionality not in base R (e.g. str_to_title). The stringr package is mostly a wrapper for the stringi functions, with some additional functions. 
tidyr: has functions such as unite, separate, replace_na that can often come in handy when working with data frames. glue: a newer package that can be seen as a fancier paste. Most likely it will be useful when creating functions or shiny apps in which variable text output is desired. One issue I have with both packages and base R is that often they return a list object, when it should be simplifying to the vector format it was initially fed. This sometimes requires an additional step or two of further processing that shouldn’t be necessary, so be prepared for it1. Other In this section, I’ll add some things that come to mind that might come into play when you’re dealing with text. Dates Dates are not character strings. Though they may start that way, if you actually want to treat them as dates you’ll need to convert the string to the appropriate date class. The lubridate package makes dealing with dates much easier. It comes with conversion, extraction and other functionality that will be sure to save you some time. library(lubridate) today() [1] "2018-03-06" today() + 1 [1] "2018-03-07" today() + dyears(1) [1] "2019-03-06" leap_year(2016) [1] TRUE span = interval(ymd("2017-07-01"), ymd("2017-07-04")) span [1] 2017-07-01 UTC--2017-07-04 UTC as.duration(span) [1] "259200s (~3 days)" span %/% minutes(1) [1] 4320 This package makes dates so much easier, you should always use it when dealing with them. Categorical Time In regression modeling with few time points, one often has to decide on whether to treat the year as categorical (factor) or numeric (continuous). This greatly depends on how you want to tell your data story or other practical concerns. For example, if you have five years in your data, treating year as categorical means you are interested in accounting for unspecified things that go on in a given year. If you treat it as numeric, you are more interested in trends. Either is fine. Web A major resource for text is of course the web. Packages like rvest,httr, xml2, and many other packages specific to website APIs are available to help you here. See the R task view for web technologies as a starting point. Encoding Encoding can be a sizable PITA sometimes, and will often come up when dealing with webscraping and other languages. The rvest and stringr packages may be able to get you past some issues at least. See their respective functions repair_encoding and str_conv as starting points on this issue. Summary of basic text functionality Being familiar with commonly used string functionality in base R and packages like stringr can save a ridiculous amount of time in your data processing. The more familiar you are with them the easier time you’ll have with text. Regular Expressions A regular expression, regex for short, is a sequence of characters that can be used as a search pattern for a string. Common operations are to merely detect, extract, or replace the matching string. There are actually many different flavors of regex for different programming languages, which are all flavors that originate with the Perl approach, or can enable the Perl approach to be used. However, knowing one means you pretty much know the others with only minor modifications if any. To be clear, not only is regex another language, it’s nigh on indecipherable. You will not learn much regex, but what you do learn will save a potentially enormous amount of time you’d otherwise spend trying to do things in a more haphazard fashion. 
Furthermore, practically every situation that will come up has already been asked and answered on Stack Overflow, so you’ll almost always be able to search for what you need. Here is an example: ^r.*shiny[0-9]$ What is that you may ask? Well here is an example of strings it would and wouldn’t match. string = c('r is the shiny', 'r is the shiny1', 'r shines brightly') grepl(string, pattern='^r.*shiny[0-9]$') [1] FALSE TRUE FALSE What the regex is esoterically attempting to match is any string that starts with ‘r’ and ends with ‘shiny_’ where _ is some single digit. Specifically, it breaks down as follows: ^ : starts with, so ^r means starts with r . : any character * : match the preceding zero or more times shiny : match ‘shiny’ [0-9] : any digit 0-9 (note that we are still talking about strings, not actual numbered values) $ : ends with preceding Typical Uses None of it makes sense, so don’t attempt to do so. Just try to remember a couple key approaches, and search the web for the rest. Along with ^ . * [0-9] $, a couple more common ones are: [a-z] : letters a-z [A-Z] : capital letters + : match the preceding one or more times () : groupings | : logical or e.g. [a-z]|[0-9] (a lower-case letter or a number) ? : preceding item is optional, and will be matched at most once. Typically used for ‘look ahead’ and ‘look behind’ \\ : escape a character, like if you actually wanted to search for a period instead of using it as a regex pattern, you’d use \\., though in R you need \\\\, i.e. double slashes, for escape. In addition, in R there are certain predefined characters that can be called: [:punct:] : punctuation [:blank:] : spaces and tabs [:alnum:] : alphanumeric characters Those are just a few. The key functions can be found by looking at the help file for the grep function (?grep). However, the stringr package has the same functionality with perhaps a slightly faster processing (though that’s due to the underlying stringi package). See if you can guess which of the following will turn up TRUE. grepl(c('apple', 'pear', 'banana'), pattern='a') grepl(c('apple', 'pear', 'banana'), pattern='^a') grepl(c('apple', 'pear', 'banana'), pattern='^a|a$') Scraping the web, munging data, just finding things in your scripts … you can potentially use this all the time, and not only with text analysis, as we’ll now see. dplyr helper functions The dplyr package comes with some poorly documented2 but quite useful helper functions that essentially serve as human-readable regex, which is a very good thing. These functions allow you to select variables3 based on their names. They are usually just calling base R functions in the end. starts_with: starts with a prefix (same as regex ‘^blah’) ends_with: ends with a prefix (same as regex ‘blah$’) contains: contains a literal string (same as regex ‘blah’) matches: matches a regular expression (put your regex here) num_range: a numerical range like x01, x02, x03. (same as regex ‘x[0-9][0-9]’) one_of: variables in character vector. (if you need to quote variable names, e.g. within a function) everything: all variables. (a good way to spend time doing something only to accomplish what you would have by doing nothing, or a way to reorder variables) For more on using stringr and regular expressions in R, you may find this cheatsheet useful. Text Processing Examples Example 1 Let’s say you’re dealing with some data that has been handled typically, that is to say, poorly. 
For example, you have a variable in your data representing whether something is from the north or south region. It might seem okay until… ## table(df$region) Var1 Freq South 76 north 68 North 75 north 70 North 70 south 65 South 76 Even if you spotted the casing issue, there is still a white space problem4. Let’s say you want this to be capitalized ‘North’ and ‘South’. How might you do it? It’s actually quite easy with the stringr tools. library(stringr) df %>% mutate(region = str_trim(region), region = str_to_title(region)) The str_trim function trims white space from either side (or both), while str_to_title converts everything to first letter capitalized. ## table(df_corrected$region) Var1 Freq North 283 South 217 Compare that to how you would have done it before knowing how to use text processing tools. One might have spent several minutes with some find and replace approach in a spreadsheet, or maybe even several if... else statements in R until all problematic cases were taken care of. Not very efficient. Example 2 Suppose you import a data frame, and the data was originally in wide format, where each column represented a year of data collection for the individual. Since it is bad form for data columns to have numbers for names, when you import it, the result looks like the following. So, the problem now is to change the names to be Year_1, Year_2, etc. You might think you might have to use colnames and manually create a string of names to replace the current ones. colnames(df)[-1] = c('Year_1', 'Year_2', 'Year_3', 'Year_4', 'Year_5') Or perhaps you’re thinking of the paste0 function, which works fine and saves some typing. colnames(df)[-1] = paste0('Year_', 1:5) However, data sets may be hundreds of columns, and the columns of data may have the same pattern but not be next to one another. For example, the first few dozen columns are all data that belongs to the first wave, etc. It is tedious to figure out which columns you don’t want, but even then you’re resulting to using magic numbers with the above approach, and one column change to data will mean that redoing the name change will fail. However, the following accomplishes what we want, and is reproducible regardless of where the columns are in the data set. df %>% rename_at(vars(num_range('X', 1:5)), str_replace, pattern='X', replacement='Year_') %>% head() id Year_1 Year_2 Year_3 Year_4 Year_5 1 1 1.18 -2.04 -0.03 -0.36 0.43 2 2 0.34 -1.34 -0.30 -0.15 0.47 3 3 -0.32 -0.97 1.03 0.20 0.97 4 4 -0.57 1.36 1.29 0.00 0.32 5 5 0.64 0.73 -0.16 -1.29 -0.79 6 6 -0.59 0.16 -1.28 0.55 0.75 Let’s parse what it’s specifically doing. rename_at allows us to rename specific columns Which columns? X1 through X:5. The num_range helper function creates the character strings X1, X2, X3, X4, and X5. Now that we have the names, we use vars to tell rename_at which ones. It would have allowed additional sets of variables as well. rename_at needs a function to apply to each of those column names. In this case the function is str_replace, to replace patterns of strings with some other string The specific arguments to str_replace (pattern to be replaced, replacement pattern) are also supplied. So in the end we just have to use the num_range helper function within the function that tells rename_at what it should be renaming, and let str_replace do the rest. Exercises In your own words, state the difference between a character string and a factor variable. Consider the following character vector. 
x = c('A', '1', 'Q') How might you paste the elements together so that there is an underscore _ between characters and no space (“A_1_Q”)? If you highlight the next line you’ll see the hint. Revisit how we used the collapse argument within paste. paste(..., collapse=?) Paste Part 2: The following application of paste produces this result. paste(c('A', '1', 'Q'), c('B', '2', 'z')) [1] "A B" "1 2" "Q z" Now try to produce "A - B" "1 - 2" "Q - z". To do this, note that one can paste any number of things together (i.e. more than two). So try adding ’ - ’ to it. Use regex to grab the Star Wars names that have a number. Use both grep and grepl and compare the results grep(starwars$name, pattern = ?) Now use your hacking skills to determine which one is the tallest. Load the dplyr package, and use its helper functions to grab all the columns in the starwars data set (comes with the package) with color in the name, but without referring to them directly. The following shows a generic example. There are several ways to do this. Try two if you can. starwars %>% select(helper_function('pattern')) I also don’t think it necessary to have separate functions for str_* functions in stringr depending on whether, e.g. I want ‘all’ matches (practically every situation) or just the first (very rarely). It could have just been an additional argument with default all=TRUE.↩ At least they’re exposed now.↩ For rows, you’ll have to use a grepl/str_detect approach. For example, filter(grepl(col1, pattern='^X')) would subset to only rows where col1 starts with X.↩ This is a very common issue among Excel users, and just one of the many reasons not to use it.↩ "],
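To close out the chapter, here is a small, hedged sketch that strings several of the tools above together (str_trim, str_to_title, rename_at with num_range, and str_detect for row filtering as in the footnote above); the data frame, column names, and patterns are made up purely for illustration.

library(dplyr)
library(stringr)

# a made-up data frame with messy region labels and wide-format year columns
d = tibble(
  region = c(' north', 'North ', 'south', ' South'),
  X1     = rnorm(4),
  X2     = rnorm(4)
)

d %>%
  mutate(region = str_trim(region),            # drop stray white space
         region = str_to_title(region)) %>%    # 'north' -> 'North'
  rename_at(vars(num_range('X', 1:2)), str_replace,
            pattern = 'X', replacement = 'Year_') %>%
  filter(str_detect(region, pattern = '^N'))   # keep rows whose region starts with N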
["sentiment-analysis.html", "Sentiment Analysis Basic idea Issues Sentiment Analysis Examples Sentiment Analysis Summary Exercise", " Sentiment Analysis Basic idea A common and intuitive approach to text is sentiment analysis. In a grand sense, we are interested in the emotional content of some text, e.g. posts on Facebook, tweets, or movie reviews. Most of the time, this is obvious when one reads it, but if you have hundreds of thousands or millions of strings to analyze, you’d like to be able to do so efficiently. We will use the tidytext package for our demonstration. It comes with a lexicon of positive and negative words that is actually a combination of multiple sources, one of which provides numeric ratings, while the others suggest different classes of sentiment. library(tidytext) sentiments %>% slice(sample(1:nrow(sentiments))) # A tibble: 27,314 x 4 word sentiment lexicon score <chr> <chr> <chr> <int> 1 decomposition negative nrc NA 2 imaculate positive bing NA 3 greatness positive bing NA 4 impatient negative bing NA 5 contradicting negative loughran NA 6 irrecoverableness negative bing NA 7 advisable trust nrc NA 8 humiliation disgust nrc NA 9 obscures negative bing NA 10 affliction negative bing NA # ... with 27,304 more rows The gist is that we are dealing with a specific, pre-defined vocabulary. Of course, any analysis will only be as good as the lexicon. The goal is usually to assign a sentiment score to a text, possibly an overall score, or a generally positive or negative grade. Given that, other analyses may be implemented to predict sentiment via standard regression tools or machine learning approaches. Issues Context, sarcasm, etc. Now consider the following. sentiments %>% filter(word=='sick') # A tibble: 5 x 4 word sentiment lexicon score <chr> <chr> <chr> <int> 1 sick disgust nrc NA 2 sick negative nrc NA 3 sick sadness nrc NA 4 sick negative bing NA 5 sick <NA> AFINN -2 Despite the above assigned sentiments, the word sick has been used at least since 1960s surfing culture as slang for positive affect. A basic approach to sentiment analysis as described here will not be able to detect slang or other context like sarcasm. However, lots of training data for a particular context may allow one to correctly predict such sentiment. In addition, there are, for example, slang lexicons, or one can simply add their own complements to any available lexicon. Lexicons In addition, the lexicons are going to maybe be applicable to general usage of English in the western world. Some might wonder where exactly these came from or who decided that the word abacus should be affiliated with ‘trust’. You may start your path by typing ?sentiments at the console if you have the tidytext package loaded. Sentiment Analysis Examples The first thing the baby did wrong We demonstrate sentiment analysis with the text The first thing the baby did wrong, which is a very popular brief guide to parenting written by world renown psychologist Donald Barthelme who, in his spare time, also wrote postmodern literature. This particular text talks about an issue with the baby, whose name is Born Dancin’, and who likes to tear pages out of books. Attempts are made by her parents to rectify the situation, without much success, but things are finally resolved at the end. The ultimate goal will be to see how sentiment in the text evolves over time, and in general we’d expect things to end more positively than they began. How do we start? Let’s look again at the sentiments data set in the tidytext package. 
sentiments %>% slice(sample(1:nrow(sentiments))) # A tibble: 27,314 x 4 word sentiment lexicon score <chr> <chr> <chr> <int> 1 blunder sadness nrc NA 2 solidity positive nrc NA 3 mortuary fear nrc NA 4 absorbed positive nrc NA 5 successful joy nrc NA 6 virus negative nrc NA 7 exorbitantly negative bing NA 8 discombobulate negative bing NA 9 wail negative nrc NA 10 intimidatingly negative bing NA # ... with 27,304 more rows The bing lexicon provides only positive or negative labels. The AFINN, on the other hand, is numerical, with ratings -5:5 that are in the score column. The others get more imaginative, but also more problematic. Why assimilate is superfluous is beyond me. It clearly should be negative given the Borg connotations. sentiments %>% filter(sentiment=='superfluous') # A tibble: 56 x 4 word sentiment lexicon score <chr> <chr> <chr> <int> 1 aegis superfluous loughran NA 2 amorphous superfluous loughran NA 3 anticipatory superfluous loughran NA 4 appertaining superfluous loughran NA 5 assimilate superfluous loughran NA 6 assimilating superfluous loughran NA 7 assimilation superfluous loughran NA 8 bifurcated superfluous loughran NA 9 bifurcation superfluous loughran NA 10 cessions superfluous loughran NA # ... with 46 more rows Read in the text files But I digress. We start with the raw text, reading it in line by line. In what follows we read in all the texts (three) in a given directory, such that each element of ‘text’ is the work itself, i.e. text is a list column5. The unnest function will unravel the works to where each entry is essentially a paragraph form. library(tidytext) barth0 = data_frame(file = dir('data/texts_raw/barthelme', full.names = TRUE)) %>% mutate(text = map(file, read_lines)) %>% transmute(work = basename(file), text) %>% unnest(text) Iterative processing One of the things stressed in this document is the iterative nature of text analysis. You will consistently take two steps forward, and then one or two back as you find issues that need to be addressed. For example, in a subsequent step I found there were encoding issues6, so the following attempts to fix them. In addition, we want to tokenize the documents such that our tokens are sentences (e.g. as opposed to words or paragraphs). The reason for this is that I will be summarizing the sentiment at sentence level. # Fix encoding, convert to sentences; you may get a warning message barth = barth0 %>% mutate( text = sapply( text, stringi::stri_enc_toutf8, is_unknown_8bit = TRUE, validate = TRUE ) ) %>% unnest_tokens( output = sentence, input = text, token = 'sentences' ) Tokenization The next step is to drill down to just the document we want, and subsequently tokenize to the word level. However, I also create a sentence id so that we can group on it later. # get baby doc, convert to words baby = barth %>% filter(work=='baby.txt') %>% mutate(sentence_id = 1:n()) %>% unnest_tokens( output = word, input = sentence, token = 'words', drop = FALSE ) %>% ungroup() Get sentiments Now that the data has been prepped, getting the sentiments is ridiculously easy. But that is how it is with text analysis. All the hard work is spent with the data processing. Here all we need is an inner join of our words with a sentiment lexicon of choice. This process will only retain words that are also in the lexicon. I use the numeric-based lexicon here. At that point, we get a sum score of sentiment by sentence. 
# get sentiment via inner join baby_sentiment = baby %>% inner_join(get_sentiments("afinn")) %>% group_by(sentence_id, sentence) %>% summarise(sentiment = sum(score)) %>% ungroup() Alternative approach As we are interested in the sentence level, it turns out that the sentimentr package has built-in functionality for this, and includes a more nuanced sentiment scores that takes into account valence shifters, e.g. words that would negate something with positive or negative sentiment (‘I do not like it’). baby_sentiment = barth0 %>% filter(work=='baby.txt') %>% get_sentences(text) %>% sentiment() %>% drop_na() %>% # empty lines mutate(sentence_id = row_number()) The following visualizes sentiment over the progression of sentences (note that not every sentence will receive a sentiment score). You can read the sentence by hovering over the dot. The ▬ is the running average. In general, the sentiment starts out negative as the problem is explained. It bounces back and forth a bit but ends on a positive note. You’ll see that some sentences’ context are not captured. For example, sentence 16 is ‘But it didn’t do any good’. However good is going to be marked as a positive sentiment in any lexicon by default. In addition, the token length will matter. Longer sentences are more likely to have some sentiment, for example. Romeo & Juliet For this example, I’ll invite you to more or less follow along, as there is notable pre-processing that must be done. We’ll look at sentiment in Shakespeare’s Romeo and Juliet. I have a cleaner version in the raw texts folder, but we can take the opportunity to use the gutenbergr package to download it directly from Project Gutenberg, a storehouse for works that have entered the public domain. library(gutenbergr) gw0 = gutenberg_works(title == "Romeo and Juliet") # look for something with this title # A tibble: 1 x 4 gutenberg_id title author gutenberg_author_id <int> <chr> <chr> <int> 1 1513 Romeo and Juliet Shakespeare, William 65 rnj = gutenberg_download(gw0$gutenberg_id) We’ve got the text now, but there is still work to be done. The following is a quick and dirty approach, but see the Shakespeare section to see a more deliberate one. We first slice off the initial parts we don’t want like title, author etc. Then we get rid of other tidbits that would interfere, using a little regex as well to aid the process. rnj_filtered = rnj %>% slice(-(1:49)) %>% filter(!text==str_to_upper(text), # will remove THE PROLOGUE etc. !text==str_to_title(text), # will remove names/single word lines !str_detect(text, pattern='^(Scene|SCENE)|^(Act|ACT)|^\\\\[')) %>% select(-gutenberg_id) %>% unnest_tokens(sentence, input=text, token='sentences') %>% mutate(sentenceID = 1:n()) The following unnests the data to word tokens. In addition, you can remove stopwords like a, an, the etc., and tidytext comes with a stop_words data frame. However, some of the stopwords have sentiments, so you would get a bit of a different result if you retain them. As Black Sheep once said, the choice is yours, and you can deal with this, or you can deal with that. 
# show some of the matches stop_words$word[which(stop_words$word %in% sentiments$word)] %>% head(20) [1] "able" "against" "allow" "almost" "alone" "appear" "appreciate" "appropriate" "available" "awfully" "believe" "best" "better" "certain" "clearly" [16] "could" "despite" "downwards" "enough" "furthermore" # remember to call output 'word' or antijoin won't work without a 'by' argument rnj_filtered = rnj_filtered %>% unnest_tokens(output=word, input=sentence, token='words') %>% anti_join(stop_words) Now we add the sentiments via the inner_join function. Here I use ‘bing’, but you can use another, and you might get a different result. rnj_filtered %>% count(word) %>% arrange(desc(n)) # A tibble: 3,288 x 2 word n <chr> <int> 1 thou 276 2 thy 165 3 love 140 4 thee 139 5 romeo 110 6 night 83 7 death 71 8 hath 64 9 sir 58 10 art 55 # ... with 3,278 more rows rnj_sentiment = rnj_filtered %>% inner_join(sentiments) rnj_sentiment # A tibble: 12,668 x 5 sentenceID word sentiment lexicon score <int> <chr> <chr> <chr> <int> 1 1 dignity positive nrc NA 2 1 dignity trust nrc NA 3 1 dignity positive bing NA 4 1 fair positive nrc NA 5 1 fair positive bing NA 6 1 fair <NA> AFINN 2 7 1 ancient negative nrc NA 8 1 grudge anger nrc NA 9 1 grudge negative nrc NA 10 1 grudge negative bing NA # ... with 12,658 more rows rnj_sentiment_bing = rnj_sentiment %>% filter(lexicon=='bing') table(rnj_sentiment_bing$sentiment) negative positive 1244 833 Looks like this one is going to be a downer. The following visualizes the positive and negative sentiment scores as one progresses sentence by sentence through the work using the plotly package. I also show same information expressed as a difference (opaque line). It’s a close game until perhaps the midway point, when negativity takes over and despair sets in with the story. By the end [[:SPOILER ALERT:]] Sean Bean is beheaded, Darth Vader reveals himself to be Luke’s father, and Verbal is Keyser Söze. Sentiment Analysis Summary In general, sentiment analysis can be a useful exploration of data, but it is highly dependent on the context and tools used. Note also that ‘sentiment’ can be anything, it doesn’t have to be positive vs. negative. Any vocabulary may be applied, and so it has more utility than the usual implementation. It should also be noted that the above demonstration is largely conceptual and descriptive. While fun, it’s a bit simplified. For starters, trying to classify words as simply positive or negative itself is not a straightforward endeavor. As we noted at the beginning, context matters, and in general you’d want to take it into account. Modern methods of sentiment analysis would use approaches like word2vec or deep learning to predict a sentiment probability, as opposed to a simple word match. Even in the above, matching sentiments to texts would probably only be a precursor to building a model predicting sentiment, which could then be applied to new data. Exercise Step 0: Install the packages If you haven’t already, install the tidytext package. Install the janeaustenr package and load both of them7. Step 1: Initial inspection First you’ll want to look at what we’re dealing with, so take a gander at austenbooks. 
library(tidytext); library(janeaustenr) austen_books() # A tibble: 73,422 x 2 text book * <chr> <fct> 1 SENSE AND SENSIBILITY Sense & Sensibility 2 "" Sense & Sensibility 3 by Jane Austen Sense & Sensibility 4 "" Sense & Sensibility 5 (1811) Sense & Sensibility 6 "" Sense & Sensibility 7 "" Sense & Sensibility 8 "" Sense & Sensibility 9 "" Sense & Sensibility 10 CHAPTER 1 Sense & Sensibility # ... with 73,412 more rows austen_books() %>% distinct(book) # A tibble: 6 x 1 book <fct> 1 Sense & Sensibility 2 Pride & Prejudice 3 Mansfield Park 4 Emma 5 Northanger Abbey 6 Persuasion We will examine only one text. In addition, for this exercise we’ll take a little bit of a different approach, looking for a specific kind of sentiment using the NRC database. It contains 10 distinct sentiments. get_sentiments("nrc") %>% distinct(sentiment) # A tibble: 10 x 1 sentiment <chr> 1 trust 2 fear 3 negative 4 sadness 5 anger 6 surprise 7 positive 8 disgust 9 joy 10 anticipation Now, select from any of those sentiments you like (or more than one), and one of the texts as follows. nrc_sadness <- get_sentiments("nrc") %>% filter(sentiment == "positive") ja_book = austen_books() %>% filter(book == "Emma") Step 2: Data prep Now we do a little prep, and I’ll save you the trouble. You can just run the following. ja_book = ja_book %>% mutate(chapter = str_detect(text, regex("^chapter [\\\\divxlc]", ignore_case = TRUE)), chapter = cumsum(chapter), line_book = row_number()) %>% unnest_tokens(word, text) ja_book = ja_book %>% mutate(chapter = str_detect(text, regex("^chapter [\\\\divxlc]", ignore_case = TRUE)), chapter = cumsum(chapter), line_book = row_number()) %>% group_by(chapter) %>% mutate(line_chapter = row_number()) %>% # ungroup() unnest_tokens(word, text) Step 3: Get sentiment Now, on your own, try the inner join approach we used previously to match the sentiments to the text. Don’t try to overthink this. The third pipe step will use the count function with the word column and also the argument sort=TRUE. Note this is just to look at your result, we aren’t assigning it to an object yet. ja_book %>% ? %>% ? The following shows my negative evaluation of Mansfield Park. # A tibble: 4,204 x 3 # Groups: chapter [48] chapter word n <int> <chr> <int> 1 24 feeling 35 2 7 ill 25 3 46 evil 25 4 26 cross 24 5 27 cross 24 6 48 punishment 24 7 7 cutting 20 8 19 feeling 20 9 33 feeling 20 10 34 feeling 20 # ... with 4,194 more rows Step 4: Visualize Now let’s do a visualization for sentiment. So redo your inner join, but we’ll create a data frame that has the information we need. plot_data = ja_book %>% inner_join(nrc_bad) %>% group_by(chapter, line_book, line_chapter) %>% count() %>% group_by(chapter) %>% mutate(negativity = cumsum(n), mean_chapter_negativity=mean(negativity)) %>% group_by(line_chapter) %>% mutate(mean_line_negativity=mean(n)) plot_data # A tibble: 4,398 x 7 # Groups: line_chapter [453] chapter line_book line_chapter n negativity mean_chapter_negativity mean_line_negativity <int> <int> <int> <int> <int> <dbl> <dbl> 1 1 17 7 2 2 111. 3.41 2 1 18 8 4 6 111. 2.65 3 1 20 10 1 7 111. 3.31 4 1 24 14 1 8 111. 2.88 5 1 26 16 2 10 111. 2.54 6 1 27 17 3 13 111. 2.67 7 1 28 18 3 16 111. 3.58 8 1 29 19 2 18 111. 2.31 9 1 34 24 3 21 111. 2.17 10 1 41 31 1 22 111. 2.87 # ... with 4,388 more rows At this point you have enough to play with, so I leave you to plot whatever you want. The following8 shows both the total negativity within a chapter, as well as the per line negativity within a chapter. 
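The interactive figure itself is not reproduced here, but as a rough ggplot2 sketch of one way to draw something along these lines from plot_data; the specific aesthetic choices are arbitrary and just for illustration (the gradient draws later chapters darker).

library(ggplot2)

ggplot(plot_data, aes(x = line_chapter, y = negativity, group = chapter, color = chapter)) +
  geom_line(alpha = .5) +                                              # cumulative negativity within each chapter
  geom_line(aes(y = mean_line_negativity, group = 1), color = 'red') + # average per-line negativity across chapters
  scale_color_gradient(low = 'gray80', high = 'gray20') +              # later chapters drawn darker
  theme_minimal()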
We can see that there is less negativity towards the end of chapters. We can also see that there appears to be more negativity in later chapters (darker lines). I suggest not naming your column ‘text’ in practice. It is a base function in R, and using it within the tidyverse may result in problems distinguishing the function from the column name (similar to n() function and the n column created by count and tally). I only do so for pedagogical reasons.↩ There are almost always encoding issues in my experience.↩ This exercise is more or less taken directly from the tidytext book.↩ This depiction goes against many of my visualization principles. I like it anyway.↩ "],
["part-of-speech-tagging.html", "Part of Speech Tagging Basic idea POS Examples Tagging summary POS Exercise", " Part of Speech Tagging As an initial review of parts of speech, if you need a refresher, the following Schoolhouse Rocks videos should get you squared away: A noun is a person, place, or thing. Interjections Pronouns Verbs Unpack your adjectives Lolly Lolly Lolly Get Your Adverbs Here Conjunction Junction (personal fave) Aside from those, you can also learn how bills get passed, about being a victim of gravity, a comparison of the decimal to other numeric systems used by alien species (I recommend the Chavez remix), and a host of other useful things. Basic idea With part-of-speech tagging, we classify a word with its corresponding part of speech. The following provides an example. JJ JJ NNS VBP RB Colorless green ideas sleep furiously. We have two adjectives (JJ), a plural noun (NNS), a verb (VBP), and an adverb (RB). Common analysis may then be used to predict POS given the current state of the text, comparing the grammar of different texts, human-computer interaction, or translation from one language to another. In addition, using POS information would make for richer sentiment analysis as well. POS Examples The following approach to POS-tagging is very similar to what we did for sentiment analysis as depicted previously. We have a POS dictionary, and can use an inner join to attach the words to their POS. Unfortunately, this approach is unrealistically simplistic, as additional steps would need to be taken to ensure words are correctly classified. For example, without more information, we are unable to tell if some words are being used as nouns or verbs (human being vs. being a problematic part of speech). However, this example can serve as a starting point. Barthelme & Carver In the following we’ll compare three texts from Donald Barthelme: The Balloon The First Thing The Baby Did Wrong Some Of Us Had Been Threatening Our Friend Colby As another comparison, I’ve included Raymond Carver’s What we talk about when we talk about love, the unedited version. First we’ll load an unnested object from the sentiment analysis, the barth object. Then for each work we create a sentence id, unnest the data to words, join the POS data, then create counts/proportions for each POS. load('data/barth_sentences.RData') barthelme_pos = barth %>% mutate(work = str_replace(work, '.txt', '')) %>% # remove file extension group_by(work) %>% mutate(sentence_id = 1:n()) %>% # create a sentence id unnest_tokens(word, sentence, drop=F) %>% # get words inner_join(parts_of_speech) %>% # join POS count(pos) %>% # count mutate(prop=n/sum(n)) Next we read in and process the Carver text in the same manner. carver_pos = data_frame(file = dir('data/texts_raw/carver/', full.names = TRUE)) %>% mutate(text = map(file, read_lines)) %>% transmute(work = basename(file), text) %>% unnest(text) %>% unnest_tokens(word, text, token='words') %>% inner_join(parts_of_speech) %>% count(pos) %>% mutate(work='love', prop=n/sum(n)) This visualization depicts the proportion of occurrence for each part of speech across the works. It would appear Barthelme is fairly consistent, and also that relative to the Barthelme texts, Carver preferred nouns and pronouns. More taggin’ More sophisticated POS tagging would require the context of the sentence structure. Luckily there are tools to help with that here, in particular via the openNLP package. 
In addition, it will require a certain language model to be installed (English is only one of many available). I don’t recommend doing so unless you are really interested in this (the openNLPmodels.en package is fairly large). We’ll reexamine the Barthelme texts above with this more involved approach. Initially we’ll need to get the English-based tagger we need and load the libraries. # install.packages("openNLPmodels.en", repos = "http://datacube.wu.ac.at/", type = "source") library(NLP) library(tm) # make sure to load this prior to openNLP library(openNLP) library(openNLPmodels.en) Next comes the processing. This more or less follows the help file example for ?Maxent_POS_Tag_Annotator. Given the several steps involved I show only the processing for one text for clarity. Ideally you’d write a function, and use a group_by approach, to process each of the texts of interest. load('data/barthelme_start.RData') baby_string0 = barth0 %>% filter(id=='baby.txt') baby_string = unlist(baby_string0$text) %>% paste(collapse=' ') %>% as.String init_s_w = annotate(baby_string, list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator())) pos_res = annotate(baby_string, Maxent_POS_Tag_Annotator(), init_s_w) word_subset = subset(pos_res, type=='word') tags = sapply(word_subset$features , '[[', "POS") baby_pos = data_frame(word=baby_string[word_subset], pos=tags) %>% filter(!str_detect(pos, pattern='[[:punct:]]')) Let’s take a look. I’ve also done the other Barthelme texts as well for comparison. word pos text The DT baby first JJ baby thing NN baby the DT baby baby NN baby did VBD baby wrong JJ baby was VBD baby to TO baby tear VB baby pages NNS baby out IN baby of IN baby her PRP$ baby books NNS baby As we can see, we have quite a few more POS to deal with here. They come from the Penn Treebank. The following table notes what the acronyms stand for. I don’t pretend to know all the facets to this. Plotting the differences, we now see a little more distinction between The Balloon and the other two texts. It is more likely to use the determiners, adjectives, singular nouns, and less likely to use personal pronouns and verbs (including past tense). Tagging summary For more information, consult the following: Penn Treebank Maxent function As with the sentiment analysis demos, the above should be seen only starting point for getting a sense of what you’re dealing with. The ‘maximum entropy’ approach is just one way to go about things. Other models include hidden Markov models, conditional random fields, and more recently, deep learning techniques. Goals might include text prediction (i.e. the thing your phone always gets wrong), translation, and more. POS Exercise As this is a more involved sort of analysis, if nothing else in terms of the tools required, as an exercise I would suggest starting with a cleaned text, and seeing if the above code in the last example can get you to the result of having parsed text. Otherwise, assuming you’ve downloaded the appropriate packages, feel free to play around with some strings of your choosing as follows. string = 'Colorless green ideas sleep furiously' initial_result = string %>% annotate(list(Maxent_Sent_Token_Annotator(), Maxent_Word_Token_Annotator())) %>% annotate(string, Maxent_POS_Tag_Annotator(), .) %>% subset(type=='word') sapply(initial_result$features , '[[', "POS") %>% table "],
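As a small addendum, the simple dictionary-lookup tagging from the start of this chapter can be sketched in a few lines; the sentence is the chapter's own example, and tidytext's parts_of_speech data frame supplies the dictionary. A plain join like this has no way to resolve words with multiple possible tags, which is exactly the limitation noted earlier.

library(dplyr)
library(tidytext)   # provides the parts_of_speech lookup table

tibble(word = c('colorless', 'green', 'ideas', 'sleep', 'furiously')) %>%
  inner_join(parts_of_speech, by = 'word')   # words not in the dictionary are simply dropped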
["topic-modeling.html", "Topic modeling Basic idea Steps Topic Model Example Extensions Topic Model Exercise", " Topic modeling Basic idea Topic modeling as typically conducted is a tool for much more than text. The primary technique of Latent Dirichlet Allocation (LDA) should be as much a part of your toolbox as principal components and factor analysis. It can be seen merely as a dimension reduction approach, but it can also be used for its rich interpretative quality as well. The basic idea is that we’ll take a whole lot of features and boil them down to a few ‘topics’. In this sense LDA is akin to discrete PCA. Another way to think about this is more from the perspective of factor analysis, where we are keenly interested in interpretation of the result, and want to know both what terms are associated with which topics, and what documents are more likely to present which topics. In the standard setting, to be able to conduct such an analysis from text one needs a document-term matrix, where rows represent documents, and columns terms. Each cell is a count of how many times the term occurs in the document. Terms are typically words, but could be any n-gram of interest. Outside of text analysis terms could represent bacterial composition, genetic information, or whatever the researcher is interested in. Likewise, documents can be people, geographic regions, etc. The gist is, despite the common text-based application, that what constitutes a document or term is dependent upon the research question, and LDA can be applied in a variety of research settings. Steps When it comes to text analysis, most of the time in topic modeling is spent on processing the text itself. Importing/scraping it, dealing with capitalization, punctuation, removing stopwords, dealing with encoding issues, removing other miscellaneous common words. It is a highly iterative process such that once you get to the document-term matrix, you’re just going to find the stuff that was missed before and repeat the process with new ‘cleaning parameters’ in place. So getting to the analysis stage is the hard part. See the Shakespeare section, which comprises 5 acts, of which the first four and some additional scenes represent all the processing needed to get to the final scene of topic modeling. In what follows we’ll start at the end of that journey. Topic Model Example Shakespeare In this example, we’ll look at Shakespeare’s plays and poems, using a topic model with 10 topics. For our needs, we’ll use the topicmodels package for the analysis, and mostly others for post-processing. Due to the large number of terms, this could take a while to run depending on your machine (maybe a minute or two). We can also see how things compare with the academic classifications for the texts. load('Data/shakes_dtm_stemmed.RData') library(topicmodels) shakes_10 = LDA(convert(shakes_dtm, to = "topicmodels"), k = 10) Examine Terms within Topics One of the first things to do is attempt to interpret the topics, and we can start by seeing which terms are most probable for each topic. get_terms(shakes_10, 20) We can see there is a lot of overlap in these topics for top terms. Just looking at the top 10, love occurs in all of them, god and heart as well, but we could have guessed this just looking at how often they occur in general. 
Other measures can be used to assess term importance, such as those that seek to balance the term’s probability of occurrence within a document, and term exclusivity, or how likely a term is to occur in only one particular topic. See the Shakespeare section for some examples of those. Examine Document-Topic Expression Next we can look at which documents are more likely to express each topic. t(topics(shakes_10, 2)) For example, based just on term frequency, Hamlet is most likely to be associated with Topic 1. That topic is affiliated with the (stemmed words) love, night, heaven, heart, natur, ey, hear, hand, life, fear, death, prai, poor, friend, soul, hold, word, live, stand, head. Sounds about right for Hamlet. The following visualization shows a heatmap for the topic probabilities of each document. Darker values mean higher probability for a document expressing that topic. I’ve also added a cluster analysis based on the cosine distance matrix, and the resulting dendrogram. The colored bar on the right represents the given classification of a work as history, tragedy, comedy, or poem. A couple things stand out. To begin with, most works are associated with one topic9. In terms of the discovered topics, traditional classification really probably only works for the historical works, as they cluster together as expected (except for Henry the VIII, possibly due to it being a collaborative work). Furthermore, tragedies and comedies might hit on the same topics, albeit from different perspectives. In addition, at least some works are very poetical, or at least have topics in common with the poems (love, beauty). If we take four clusters from the cluster analysis, the result boils down to Phoenix (on its own), standard poems, a mixed bag of more love-oriented works and the remaining poems, then everything else. Alternatively, one could merely classify the works based on their probable topics, which would make more sense if clustering of the works is in fact the goal. The following visualization attempts to order them based on their most probable topic. The order is based on the most likely topics across all documents. So we can see that topic modeling can be used to classify the documents themselves into groups of documents most likely to express the same sorts of topics. Extensions There are extensions of LDA used in topic modeling that will allow your analysis to go even further. Correlated Topic Models: the standard LDA does not estimate the topic correlation as part of the process. Supervised LDA: In this scenario, topics can be used for prediction, e.g. the classification of tragedy, comedy etc. (similar to PC regression) Structured Topic Models: Here we want to find the relevant covariates that can explain the topics (e.g. year written, author sex, etc.) Other: There are still other ways to examine topics. Topic Model Exercise Movie reviews Perform a topic model on the Cornell Movie review data. I’ve done some initial cleaning (e.g. removing stopwords, punctuation, etc.), and have both a tidy data frame and document term matrix for you to use. The former is provided if you want to do additional processing. But otherwise, just use the topicmodels package and perform your own analysis on the DTM. You can compare to this result. load('data/movie_reviews.RData') library(topicmodels) Associated Press articles Do some topic modeling on articles from the Associated Press data from the First Text Retrieval Conference in 1992. The following will load the DTM, so you are ready to go. 
See how your result compares with that of Dave Blei, based on 100 topics. library(topicmodels) data("AssociatedPress") There isn’t a lot to work within the realm of choosing an ‘optimal’ number of topics, but I investigated it via a measure called perplexity. It bottomed out at around 50 topics. Usually such an approach is done through cross-validation. However, the solution chosen has no guarantee to produce human interpretable topics.↩ "],
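As a rough sketch of the perplexity idea mentioned in the note above, one could fit a few models and compare; this is evaluated on the training data rather than cross-validated, the values of k are arbitrary, and fitting several models on this DTM can take a few minutes.

library(topicmodels)
data("AssociatedPress")

ks   = c(5, 10, 20)
fits = lapply(ks, function(k) LDA(AssociatedPress, k = k, control = list(seed = 1234)))

# lower perplexity is better, all else being equal
data.frame(k = ks, perplexity = sapply(fits, perplexity))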
["word-embeddings.html", "Word Embeddings Shakespeare example Wikipedia", " Word Embeddings A key idea in the examination of text concerns representing words as numeric quantities. There are a number of ways to go about this, and we’ve actually already done so. In the sentiment analysis section words were given a sentiment score. In topic modeling, words were represented as frequencies across documents. Once we get to a numeric representation, we can then run statistical models. Consider topic modeling again. We take the document-term matrix, and reduce the dimensionality of it to just a few topics. Now consider a co-occurrence matrix, where if there are \\(k\\) words, it is a \\(k\\) x \\(k\\) matrix, where the diagonal values tell us how frequently wordi occurs with wordj. Just like in topic modeling, we could now perform some matrix factorization technique to reduce the dimensionality of the matrix10. Now for each word we have a vector of numeric values (across factors) to represent them. Indeed, this is how some earlier approaches were done, for example, using principal components analysis on the co-occurrence matrix. Newer techniques such as word2vec and GloVe use neural net approaches to construct word vectors. The details are not important for applied users to benefit from them. Furthermore, applications have been made to create sentence and other vector representations11. In any case, with vector representations of words we can see how similar they are to each other, and perform other tasks based on that information. A tired example from the literature is as follows: \\[\\mathrm{king - man + woman = queen}\\] So a woman-king is a queen. Here is another example: \\[\\mathrm{Paris - France + Germany = Berlin}\\] Berlin is the Paris of Germany. The idea is that with vectors created just based on co-occurrence we can recover things like analogies. Subtracting the man vector from the king vector and adding woman, the most similar word to this would be queen. For more on why this works, take a look here. Shakespeare example We start with some already tokenized data from the works of Shakespeare. We’ll treat the words as if they just come from one big Shakespeare document, and only consider the words as tokens, as opposed to using n-grams. We create an iterator object for text2vec functions to use, and with that in hand, create the vocabulary, keeping only those that occur at least 5 times. This example generally follows that of the package vignette, which you’ll definitely want to spend some time with. load('data/shakes_words_df_4text2vec.RData') library(text2vec) ## shakes_words shakes_words_ls = list(shakes_words$word) it = itoken(shakes_words_ls, progressbar = FALSE) shakes_vocab = create_vocabulary(it) shakes_vocab = prune_vocabulary(shakes_vocab, term_count_min = 5) Let’s take a look at what we have at this point. We’ve just created word counts, that’s all the vocabulary object is. shakes_vocab Number of docs: 1 0 stopwords: ... ngram_min = 1; ngram_max = 1 Vocabulary: term term_count doc_count 1: bounties 5 1 2: rag 5 1 3: merchant's 5 1 4: ungovern'd 5 1 5: cozening 5 1 --- 9090: of 17784 1 9091: to 20693 1 9092: i 21097 1 9093: and 26032 1 9094: the 28831 1 The next step is to create the token co-occurrence matrix (TCM). The definition of whether two words occur together is arbitrary. Should we just look at previous and next word? Five behind and forward? This will definitely affect results so you will want to play around with it. 
# maps words to indices vectorizer = vocab_vectorizer(shakes_vocab) # use window of 10 for context words shakes_tcm = create_tcm(it, vectorizer, skip_grams_window = 10) Note that such a matrix will be extremely sparse. Most words do not go with other words in the grand scheme of things. So when they do, it usually matters. Now we are ready to create the word vectors based on the GloVe model. Various options exist, so you’ll want to dive into the associated help files and perhaps the original articles to see how you might play around with it. The following takes roughly a minute or two on my machine. I suggest you start with n_iter = 10 and/or convergence_tol = 0.001 to gauge how long you might have to wait. In this setting, we can think of our word of interest as the target, and any/all other words (within the window) as the context. Word vectors are learned for both. glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = shakes_vocab, x_max = 10) shakes_wv_main = glove$fit_transform(shakes_tcm, n_iter = 1000, convergence_tol = 0.00001) # dim(shakes_wv_main) shakes_wv_context = glove$components # dim(shakes_wv_context) # Either word-vectors matrices could work, but the developers of the technique # suggest the sum/mean may work better shakes_word_vectors = shakes_wv_main + t(shakes_wv_context) Now we can start to play. The measure of interest in comparing two vectors will be cosine similarity, which, if you’re not familiar, you can think of it similarly to the standard correlation12. Let’s see what is similar to Romeo. rom = shakes_word_vectors["romeo", , drop = F] # ham = shakes_word_vectors["hamlet", , drop = F] cos_sim_rom = sim2(x = shakes_word_vectors, y = rom, method = "cosine", norm = "l2") # head(sort(cos_sim_rom[,1], decreasing = T), 10) romeo juliet tybalt benvolio nurse iago friar mercutio aaron roderigo 1 0.78 0.72 0.65 0.64 0.63 0.61 0.6 0.6 0.59 Obviously Romeo is most like Romeo, but after that comes the rest of the crew in the play. As this text is somewhat raw, it is likely due to names associated with lines in the play. As such, one may want to narrow the window13. Let’s try love. love = shakes_word_vectors["love", , drop = F] cos_sim_rom = sim2(x = shakes_word_vectors, y = love, method = "cosine", norm = "l2") # head(sort(cos_sim_rom[,1], decreasing = T), 10) x love 1.00 that 0.80 did 0.72 not 0.72 in 0.72 her 0.72 but 0.71 so 0.71 know 0.71 do 0.70 The issue here is that love is so commonly used in Shakespeare, it’s most like other very common words. What if we take Romeo, subtract his friend Mercutio, and add Nurse? This is similar to the analogy example we had at the start. test = shakes_word_vectors["romeo", , drop = F] - shakes_word_vectors["mercutio", , drop = F] + shakes_word_vectors["nurse", , drop = F] cos_sim_test = sim2(x = shakes_word_vectors, y = test, method = "cosine", norm = "l2") # head(sort(cos_sim_test[,1], decreasing = T), 10) x nurse 0.87 juliet 0.72 romeo 0.70 It looks like we get Juliet as the most likely word (after the ones we actually used), just as we might have expected. Again, we can think of this as Romeo is to Mercutio as Juliet is to the Nurse. Let’s try another like that. test = shakes_word_vectors["romeo", , drop = F] - shakes_word_vectors["juliet", , drop = F] + shakes_word_vectors["cleopatra", , drop = F] cos_sim_test = sim2(x = shakes_word_vectors, y = test, method = "cosine", norm = "l2") # head(sort(cos_sim_test[,1], decreasing = T), 3) x cleopatra 0.81 romeo 0.70 antony 0.70 One can play with stuff like this all day. 
For example, you may find that a Romeo without love is a Tybalt! Wikipedia The following shows the code for analyzing text from Wikipedia, and comes directly from the text2vec vignette. Note that this is a relatively large amount of text (100MB), and so will take notably longer to process. text8_file = "data/texts_raw/text8" if (!file.exists(text8_file)) { download.file("http://mattmahoney.net/dc/text8.zip", "data/text8.zip") unzip("data/text8.zip", files = "text8", exdir = "data/texts_raw/") } wiki = readLines(text8_file, n = 1, warn = FALSE) tokens = space_tokenizer(wiki) it = itoken(tokens, progressbar = FALSE) vocab = create_vocabulary(it) vocab = prune_vocabulary(vocab, term_count_min = 5L) vectorizer = vocab_vectorizer(vocab) tcm = create_tcm(it, vectorizer, skip_grams_window = 5L) glove = GlobalVectors$new(word_vectors_size = 50, vocabulary = vocab, x_max = 10) wv_main = glove$fit_transform(tcm, n_iter = 100, convergence_tol = 0.001) wv_context = glove$components word_vectors = wv_main + t(wv_context) Let’s try our Berlin example. berlin = word_vectors["paris", , drop = FALSE] - word_vectors["france", , drop = FALSE] + word_vectors["germany", , drop = FALSE] berlin_cos_sim = sim2(x = word_vectors, y = berlin, method = "cosine", norm = "l2") head(sort(berlin_cos_sim[,1], decreasing = TRUE), 5) paris berlin munich germany at 0.7575511 0.7560328 0.6721202 0.6559778 0.6519383 Success! Now let’s try the queen example. queen = word_vectors["king", , drop = FALSE] - word_vectors["man", , drop = FALSE] + word_vectors["woman", , drop = FALSE] queen_cos_sim = sim2(x = word_vectors, y = queen, method = "cosine", norm = "l2") head(sort(queen_cos_sim[,1], decreasing = TRUE), 5) king son alexander henry queen 0.8831932 0.7575572 0.7042561 0.6769456 0.6755054 Not so much, though it is still a top result. Results are of course highly dependent upon the data and settings you choose, so keep the context in mind when trying this out. Now that words are vectors, we can use them in any model we want, for example, to predict sentimentality. Furthermore, extensions have been made to deal with sentences, paragraphs, and even lda2vec! In any event, hopefully you have some idea of what word embeddings are and can do for you, and have added another tool to your text analysis toolbox. You can imagine how it might be difficult to deal with the English language, which might be something on the order of 1 million words.↩ Simply taking the average of the word vector representations within a sentence to represent the sentence as a vector is surprisingly performant.↩ It’s also used in the Shakespeare Start to Finish section.↩ With a window of 5, Romeo’s top 10 includes others like Troilus and Cressida.↩ "],
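Since the analogy queries above repeat the same few lines, a small convenience wrapper around sim2 can save some typing. The function names here are made up (they are not part of text2vec), and this assumes the word_vectors object from the Wikipedia example:
nearest_words = function(wv, target, n = 5) {
  # target is a 1 x k matrix: a single word vector or a combination of them
  sims = sim2(x = wv, y = target, method = 'cosine', norm = 'l2')
  head(sort(sims[, 1], decreasing = TRUE), n)
}
analogy = function(wv, a, b, c, n = 5) {
  # 'a is to b as ? is to c', e.g. paris - france + germany
  nearest_words(wv, wv[a, , drop = FALSE] - wv[b, , drop = FALSE] + wv[c, , drop = FALSE], n)
}
analogy(word_vectors, 'paris', 'france', 'germany')
analogy(word_vectors, 'king', 'man', 'woman')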
["summary.html", "Summary", " Summary It should be clear at this point that text can be seen as amenable to analysis as anything else in statistics. Depending on the goals, the exploration of text can take on one of many forms. In most situations, at least some preprocessing may be required, and often it will be quite an undertaking to make the text amenable to analysis. However, this is often rewarded by interesting insights and a better understanding of the data at hand, and makes possible what otherwise would not be if only human-powered analysis were applied. For more natural language processing tools in R, one should consult the corresponding task view. However, one should be aware that it doesn’t take much to strain one’s computing resources with R’s tools and standard approach. As an example, the Shakespeare corpus is very small by any standard, and even then it will take some time for certain statistics or topic modeling to be conducted. As such, one should be prepared to also spend time learning ways to make computing more efficient. Luckily, many aspects of the process may be easily distributed/parallelized. Much natural language processing is actually done with deep learning techniques, which generally requires a lot of data, notable computing resources, copious amounts of fine tuning, and often involves optimization towards a specific task. Most of the cutting-edge work there is done in Python, and as a starting point for more common text-analytic approaches, you can check out the Natural Language Toolkit. Dealing with text is not always easy, but it’s definitely easier than it ever has been. The number of tools at your disposal is vast, and more are being added all the time. One of the main take home messages is that text analysis can be a lot of fun, so enjoy the process! Best of luck with your data! \\(\\qquad\\sim\\mathbb{M}\\) "],
["shakespeare.html", "Shakespeare Start to Finish ACT I. Scrape MIT and Gutenberg Shakespeare ACT II. Preliminary Cleaning ACT III. Stop words ACT IV. Other fixes ACT V. Fun stuff", " Shakespeare Start to Finish The following attempts to demonstrate the usual difficulties one encounters dealing with text by procuring and processing the works of Shakespeare. The source is MIT, which has made the ‘complete’ works available on the web since 1993, plus one other from Gutenberg. The initial issue is simply getting the works from the web. Subsequently there is metadata, character names, stopwords etc. to be removed. At that point, we can stem and count the words in each work, which, when complete, puts us at the point we are ready for analysis. The primary packages used are tidytext, stringr, and when things are ready for analysis, quanteda. ACT I. Scrape MIT and Gutenberg Shakespeare Scene I. Scrape main works Initially we must scrape the web to get the documents we need. The rvest package will be used as follows. Start with the url of the site Get the links off that page to serve as base urls for the works Scrape the document for each url Deal with the collection of Sonnets separately Write out results library(rvest); library(tidyverse); library(stringr) page0 = read_html('http://shakespeare.mit.edu/') works_urls0 = page0 %>% html_nodes('a') %>% html_attr('href') main = works_urls0 %>% grep(pattern='index', value=T) %>% str_replace_all(pattern='index', replacement='full') other = works_urls0[!grepl(works_urls0, pattern='index|edu|org|news')] works_urls = c(main, other) works_urls[1:3] Now we just paste the main site url to the work urls and download them. Here is where we come across our first snag. The html_text function has what I would call a bug but what the author feels is a feature. Basically, it ignores line breaks of the form <br> in certain situations. This means it will smash text together that shouldn’t be, thereby making any analysis of it fairly useless14. Luckily, @rentrop provided a solution, which is in r/fix_read_html.R. works0 = lapply(works_urls, function(x) read_html(paste0('http://shakespeare.mit.edu/', x))) source('r/fix_read_html.R') html_text_collapse(works0[[1]]) #works works = lapply(works0, html_text_collapse) names(works) = c("All's Well That Ends Well", "As You Like It", "Comedy of Errors" "Cymbeline", "Love's Labour's Lost", "Measure for Measure" "The Merry Wives of Windsor", "The Merchant of Venice", "A Midsummer Night's Dream" "Much Ado about Nothing", "Pericles Prince of Tyre", "The Taming of the Shrew" "The Tempest", "Troilus and Cressida", "Twelfth Night" "The Two Gentlemen of Verona", "The Winter's Tale", "King Henry IV Part 1" "King Henry IV Part 2", "Henry V", "Henry VI Part 1" "Henry VI Part 2", "Henry VI Part 3", "Henry VIII" "King John", "Richard II", "Richard III" "Antony and Cleopatra", "Coriolanus", "Hamlet" "Julius Caesar", "King Lear", "Macbeth" "Othello", "Romeo and Juliet", "Timon of Athens" "Titus Andronicus", "Sonnets", "A Lover's Complaint" "The Rape of Lucrece", "Venus and Adonis", "Elegy") Scene II. Sonnets We now hit a slight nuisance with the Sonnets. The Sonnets have a bit of a different structure than the plays. All links are in a single page, with a different form for the url, and each sonnet has its own page. 
sonnet_urls = paste0('http://shakespeare.mit.edu/', grep(works_urls0, pattern='sonnet', value=T)) %>% read_html() %>% html_nodes('a') %>% html_attr('href') sonnet_urls = grep(sonnet_urls, pattern = 'sonnet', value=T) # remove amazon link # read the texts sonnet0 = purrr::map(sonnet_urls, function(x) read_html(paste0('http://shakespeare.mit.edu/Poetry/', x))) # collapse to one 'Sonnets' work sonnet = sapply(sonnet0, html_text_collapse) works$Sonnets = sonnet Scene III. Save and write out Now we can save our results so we won’t have to repeat any of the previous scraping. We want to save the main text object as an RData file, and write out the texts to their own file. When dealing with text, you’ll regularly want to save stages so you can avoid repeating what you don’t have to, as often you will need to go back after discovering new issues further down the line. save(works, file='data/texts_raw/shakes/moby_from_web.RData') Scene IV. Read text from files After the above is done, it’s not required to redo, so we can always get what we need. I’ll start with the raw text as files, as that is one of the more common ways one deals with documents. When text is nice and clean, this can be fairly straightforward. The function at the end comes from the tidyr package. Up to that line, each element in the text column is the entire text, while the column itself is thus a ‘list-column’. In other words, we have a 42 x 2 matrix. But to do what we need, we’ll want to have access to each line, and the unnest function unpacks each line within the title. The first few lines of the result are shown after. library(tidyverse); library(stringr) shakes0 = data_frame(file = dir('data/texts_raw/shakes/moby/', full.names = TRUE)) %>% transmute(id = basename(file), text) %>% unnest(text) save(shakes0, file='data/initial_shakes_dt.RData') # Alternate that provides for more options # library(readtext) # shakes0 = # data_frame(file = dir('data/texts_raw/shakes/moby/', full.names = TRUE)) %>% # mutate(text = map(file, readtext, encoding='UTF8')) %>% # unnest(text) Scene V. Add additional works It is typical to be gathering texts from multiple sources. In this case, we’ll get The Phoenix and the Turtle from the Project Gutenberg website. There is an R package that will allow us to work directly with the site, making the process straightforward15. I also considered two other works, but I refrained from “The Two Noble Kinsmen” because like many other of Shakespeare’s versions on Gutenberg, it’s basically written in a different language. I also refrained from The Passionate Pilgrim because it’s mostly not Shakespeare. When first doing this project, I actually started with Gutenberg, but it became a notable PITA. The texts were inconsistent in source, and sometimes reproduced printing errors purposely, which would have compounded typical problems. I thought it could have been solved by using the Complete Works of Shakespeare but the download only came with that title, meaning one would have to hunt for and delineate each separate work. This might not have been too big of an issue, except that there is no table of contents, nor consistent naming of titles across different printings. The MIT approach, on the other hand, was a few lines of code. This represents a common issue in text analysis when dealing with sources, a different option may save a lot of time in the end. The following code could be more succinct to deal with one text, but I initially was dealing with multiple works, so I’ve left it in that mode. 
In the end, we’ll have a tibble with an id column for the file/work name, and another column that contains the lines of text. library(gutenbergr) works_not_included = c("The Phoenix and the Turtle") # add others if desired gute0 = gutenberg_works(title %in% works_not_included) gute = lapply(gute0$gutenberg_id, gutenberg_download) gute = mapply(function(x, y) mutate(x, id=y) %>% select(-gutenberg_id), x=gute, y=works_not_included, SIMPLIFY=F) shakes = shakes0 %>% bind_rows(gute) %>% mutate(id = str_replace_all(id, " |'", '_')) %>% mutate(id = str_replace(id, '.txt', '')) %>% arrange(id) # shakes %>% split(.$id) # inspect save(shakes, file='data/texts_raw/shakes/shakes_df.RData') ACT II. Preliminary Cleaning If you think we’re even remotely getting close to being ready for analysis, I say Ha! to you. Our journey has only just begun (cue the Carpenters). Now we can start thinking about prepping the data for eventual analysis. One of the nice things about having the data in a tidy format is that we can use string functionality over the column of text in a simple fashion. Scene I. Remove initial text/metadata First on our to-do list is to get rid of all the preliminary text of titles, authorship, and similar. This is fairly straightforward when you realize the text we want will be associated with something like ACT I, or in the case of the Sonnets, the word Sonnet. So, the idea it to drop all text up to those points. I’ve created a function that will do that, and then just apply it to each works tibble16. For the poems and A Funeral Elegy for Master William Peter, we look instead for the line where his name or initials start the line. source('r/detect_first_act.R') shakes_trim = shakes %>% split(.$id) %>% lapply(detect_first_act) %>% bind_rows shakes %>% filter(id=='Romeo_and_Juliet') %>% head # A tibble: 6 x 2 id text <chr> <chr> 1 Romeo_and_Juliet Romeo and Juliet: Entire Play 2 Romeo_and_Juliet " " 3 Romeo_and_Juliet "" 4 Romeo_and_Juliet "" 5 Romeo_and_Juliet "" 6 Romeo_and_Juliet Romeo and Juliet shakes_trim %>% filter(id=='Romeo_and_Juliet') %>% head # A tibble: 6 x 2 id text <chr> <chr> 1 Romeo_and_Juliet "" 2 Romeo_and_Juliet "" 3 Romeo_and_Juliet PROLOGUE 4 Romeo_and_Juliet "" 5 Romeo_and_Juliet "" 6 Romeo_and_Juliet "" Scene II. Miscellaneous removal Next, we’ll want to remove empty rows, any remaining titles, lines that denote the act or scene, and other stuff. I’m going to remove the word prologue and epilogue as a stopword later. While some texts have a line that just says that (PROLOGUE), others have text that describes the scene (Prologue. Blah blah) and which I’ve decided to keep. As such, we just need the word itself gone. 
titles = c("A Lover's Complaint", "All's Well That Ends Well", "As You Like It", "The Comedy of Errors", "Cymbeline", "Love's Labour's Lost", "Measure for Measure", "The Merry Wives of Windsor", "The Merchant of Venice", "A Midsummer Night's Dream", "Much Ado about Nothing", "Pericles Prince of Tyre", "The Taming of the Shrew", "The Tempest", "Troilus and Cressida", "Twelfth Night", "The Two Gentlemen of Verona", "The Winter's Tale", "King Henry IV, Part 1", "King Henry IV, Part 2", "Henry V", "Henry VI, Part 1", "Henry VI, Part 2", "Henry VI, Part 3", "Henry VIII", "King John", "Richard II", "Richard III", "Antony and Cleopatra", "Coriolanus", "Hamlet", "Julius Caesar", "King Lear", "Macbeth", "Othello", "Romeo and Juliet", "Timon of Athens", "Titus Andronicus", "Sonnets", "The Rape of Lucrece", "Venus and Adonis", "A Funeral Elegy", "The Phoenix and the Turtle") shakes_trim = shakes_trim %>% filter(text != '', # remove empties !text %in% titles, # remove titles !str_detect(text, '^ACT|^SCENE|^Enter|^Exit|^Exeunt|^Sonnet') # remove acts etc. ) shakes_trim %>% filter(id=='Romeo_and_Juliet') # we'll get prologue later # A tibble: 3,992 x 2 id text <chr> <chr> 1 Romeo_and_Juliet PROLOGUE 2 Romeo_and_Juliet Two households, both alike in dignity, 3 Romeo_and_Juliet In fair Verona, where we lay our scene, 4 Romeo_and_Juliet From ancient grudge break to new mutiny, 5 Romeo_and_Juliet Where civil blood makes civil hands unclean. 6 Romeo_and_Juliet From forth the fatal loins of these two foes 7 Romeo_and_Juliet A pair of star-cross'd lovers take their life; 8 Romeo_and_Juliet Whose misadventured piteous overthrows 9 Romeo_and_Juliet Do with their death bury their parents' strife. 10 Romeo_and_Juliet The fearful passage of their death-mark'd love, # ... with 3,982 more rows Scene III. Classification of works While we’re at it, we can save the classical (sometimes arbitrary) classifications of Shakespeare’s works for later comparison to what we’ll get in our analyses. We’ll save them to call as needed. shakes_types = data_frame(title=unique(shakes_trim$id)) %>% mutate(class = 'Comedy', class = if_else(str_detect(title, pattern='Adonis|Lucrece|Complaint|Turtle|Pilgrim|Sonnet|Elegy'), 'Poem', class), class = if_else(str_detect(title, pattern='Henry|Richard|John'), 'History', class), class = if_else(str_detect(title, pattern='Troilus|Coriolanus|Titus|Romeo|Timon|Julius|Macbeth|Hamlet|Othello|Antony|Cymbeline|Lear'), 'Tragedy', class), problem = if_else(str_detect(title, pattern='Measure|Merchant|^All|Troilus|Timon|Passion'), 'Problem', 'Not'), late_romance = if_else(str_detect(title, pattern='Cymbeline|Kinsmen|Pericles|Winter|Tempest'), 'Late', 'Other')) save(shakes_types, file='data/shakespeare_classification.RData') # save for later ACT III. Stop words As we’ve noted before, we’ll want to get rid of stop words, things like articles, possessive pronouns, and other very common words. In this case, we also want to include character names. However, the big wrinkle here is that this is not English as currently spoken, so we need to remove ‘ye’, ‘thee’, ‘thine’ etc. In addition, there are things that need to be replaced, like o’er to over, which may then also be removed. In short, this is not so straightforward. Scene I. Character names We’ll get the list of character names from opensourceshakespeare.org via rvest, but I added some from the poems and others that still came through the processing one way or another, e.g. abbreviated names. 
shakes_char_url = 'https://www.opensourceshakespeare.org/views/plays/characters/chardisplay.php' page0 = read_html(shakes_char_url) tabs = page0 %>% html_table() shakes_char = tabs[[2]][-(1:2), c(1,3,5)] # remove header and phantom columns colnames(shakes_char) = c('Nspeeches', 'Character', 'Play') shakes_char = shakes_char %>% distinct(Character,.keep_all=T) save(shakes_char, file='data/shakespeare_characters.RData') A new snag is that some characters with multiple names may be represented (typically) by the first or last name, or in the case of three, the middle, e.g. Sir Toby Belch. Others are still difficultly named e.g. RICHARD PLANTAGENET (DUKE OF GLOUCESTER). The following should capture everything by splitting the names on spaces, removing parentheses, and keeping unique terms. # remove paren and split chars = shakes_char$Character chars = str_replace_all(chars, '\\\\(|\\\\)', '') chars = str_split(chars, ' ') %>% unlist # these were found after intial processsing chars_other = c('enobarbus', 'marcius', 'katharina', 'clarence','pyramus', 'andrew', 'arcite', 'perithous', 'hippolita', 'schoolmaster', 'cressid', 'diomed', 'kate', 'titinius', 'Palamon', 'Tarquin', 'lucrece', 'isidore', 'tom', 'thisbe', 'paul', 'aemelia', 'sycorax', 'montague', 'capulet', 'collatinus') chars = unique(c(chars, chars_other)) chars = chars[chars != ''] sample(chars)[1:3] [1] "Children" "Dionyza" "Aaron" Scene II. Old, Middle, & Modern English While Shakespeare is considered Early Modern English, some text may be more historical, so I include Middle and Old English stopwords, as they were readily available from the cltk Python module (link). I also added some things to the modern English list like “thou’ldst” that I found lingering after initial passes. I first started using the works from Gutenberg, and there, the Old English might have had some utility. As the texts there were inconsistently translated and otherwise problematic, I abandoned using them. Here, the Old English vocabulary applied to these texts it only removes ‘wit’, so I refrain from using it. # old and me from python cltk module; # em from http://earlymodernconversions.com/wp-content/uploads/2013/12/stopwords.txt; # I also added some to me old_stops0 = read_lines('data/old_english_stop_words.txt') # sort(old_stops0) old_stops = data_frame(word=str_conv(old_stops0, 'UTF8'), lexicon = 'cltk') me_stops0 = read_lines('data/middle_english_stop_words') # sort(me_stops0) me_stops = data_frame(word=str_conv(me_stops0, 'UTF8'), lexicon = 'cltk') em_stops0 = read_lines('data/early_modern_english_stop_words.txt') # sort(em_stops0) em_stops = data_frame(word=str_conv(em_stops0, 'UTF8'), lexicon = 'emc') Scene III. Remove stopwords We’re now ready to start removing words. However, right now, we have lines not words. We can use the tidytext function unnest_tokens, which is like unnest from tidyr, but works on different tokens, e.g. words, sentences, or paragraphs. Note that by default, the function will make all words lower case to make matching more efficient. library(tidytext) shakes_words = shakes_trim %>% unnest_tokens(word, text, token='words') save(shakes_words, file='data/shakes_words_df_4text2vec.RData') We also will be doing a little stemming here. I’m getting rid of suffixes that end with the suffix after an apostrophe. Many of the remaining words will either be stopwords or need to be further stemmed later. I also created a middle/modern English stemmer for words that are not caught otherwise (me_st_stem). 
Again, this is the sort of thing you discover after initial passes (e.g. ‘criedst’). After that, we can use the anti_join remove the stopwords. source('r/st_stem.R') shakes_words = shakes_words %>% mutate(word = str_trim(word), # remove possible whitespace word = str_replace(word, "'er$|'d$|'t$|'ld$|'rt$|'st$|'dst$", ''), # remove me style endings word = str_replace_all(word, "[0-9]", ''), # remove sonnet numbers word = vapply(word, me_st_stem, 'a')) %>% anti_join(em_stops) %>% anti_join(me_stops) %>% anti_join(data_frame(word=str_to_lower(c(chars, 'prologue', 'epilogue')))) %>% anti_join(data_frame(word=str_to_lower(paste0(chars, "'s")))) %>% # remove possessive names anti_join(stop_words) As before, you should do a couple spot checks. any(shakes_words$word == 'romeo') any(shakes_words$word == 'prologue') any(shakes_words$word == 'mayst') [1] FALSE [1] FALSE [1] FALSE ACT IV. Other fixes Now we’re ready to finally do the word counts. Just kidding! There is still work to do for the remainder, and you’ll continue to spot things after runs. One remaining issue is the words that end in ‘st’ and ‘est’, and others that are not consistently spelled or otherwise need to be dealt with. For example, ‘crost’ will not be stemmed to ‘cross’, as ‘crossed’ would be. Finally, I limit the result to any words that have more than two characters, as my inspection suggested these are left-over suffixes, or otherwise would be considered stopwords anyway. # porter should catch remaining 'est' add_a = c('mongst', 'gainst') # words to add a to shakes_words = shakes_words %>% mutate(word = if_else(word=='honour', 'honor', word), word = if_else(word=='durst', 'dare', word), word = if_else(word=='wast', 'was', word), word = if_else(word=='dust', 'does', word), word = if_else(word=='curst', 'cursed', word), word = if_else(word=='blest', 'blessed', word), word = if_else(word=='crost', 'crossed', word), word = if_else(word=='accurst', 'accursed', word), word = if_else(word %in% add_a, paste0('a', word), word), word = str_replace(word, "'s$", ''), # strip remaining possessives word = if_else(str_detect(word, pattern="o'er"), # change o'er over str_replace(word, "'", 'v'), word)) %>% filter(!(id=='Antony_and_Cleopatra' & word == 'mark')) %>% # mark here is almost exclusively the character name filter(str_count(word)>2) At this point we could still maybe add things to this list of additional fixes, but I think it’s time to actually start playing with the data. ACT V. Fun stuff We are finally ready to get to the fun stuff. Finally! And now things get easy. Scene I. Count the terms We can get term counts with standard dplyr approaches, and packages like tidytext will take that and also do some other things we might want. Specifically, we can use the latter to create the document-term matrix (DTM) that will be used in other analysis. The function cast_dfm will create a dfm class object, or ‘document-feature’ matrix class object (from quanteda), which is the same thing but recognizes this sort of stuff is not specific to words. With word counts in hand, would be good save to save at this point, since they’ll serve as the basis for other processing. 
term_counts = shakes_words %>% group_by(id, word) %>% count term_counts %>% arrange(desc(n)) library(quanteda) shakes_dtm = term_counts %>% cast_dfm(document=id, term=word, value=n) ## save(shakes_words, term_counts, shakes_dtm, file='data/shakes_words_df.RData') # A tibble: 115,954 x 3 # Groups: id, word [115,954] id word n <chr> <chr> <int> 1 Sonnets love 195 2 The_Two_Gentlemen_of_Verona love 171 3 Romeo_and_Juliet love 150 4 As_You_Like_It love 118 5 Love_s_Labour_s_Lost love 118 6 A_Midsummer_Night_s_Dream love 114 7 Richard_III god 111 8 Titus_Andronicus rome 103 9 Much_Ado_about_Nothing love 92 10 Coriolanus rome 90 # ... with 115,944 more rows Now things are looking like Shakespeare, with love for everyone17. You’ll notice I’ve kept place names such as Rome, but this might be something you’d prefer to remove. Other candidates would be madam, woman, man, majesty (as in ‘his/her’) etc. This sort of thing is up to the researcher. Scene II. Stemming Now we’ll stem the words. This is actually more of a pre-processing step, one that we’d do along with (and typically after) stopword removal. I do it here to mostly demonstrate how to use quanteda to do it, as it can also be used to remove stopwords and do many of the other things we did with tidytext. Stemming will make words like eye and eyes just ey, or convert war, wars and warring to war. In other words, it will reduce variations of a word to a common root form, or ‘word stem’. We could have done this in a step prior to counting the terms, but then you only have the stemmed result to work with for the document term matrix from then on. Depending on your situation, you may or may not want to stem, or maybe you’d want to compare results. The quanteda package will actually stem with the DTM (i.e. work on the column names) and collapse the word counts accordingly. I note the difference in words before and after stemming. shakes_dtm ncol(shakes_dtm) shakes_dtm = shakes_dtm %>% dfm_wordstem() shakes_dtm ncol(shakes_dtm) Document-feature matrix of: 43 documents, 22,052 features (87.8% sparse). [1] 22052 Document-feature matrix of: 43 documents, 13,325 features (83.8% sparse). [1] 13325 The result is notably fewer columns, which will speed up any analysis, as well as produce a slightly more dense matrix. Scene III. Exploration Top features Let’s start looking at the data more intently. The following shows the 10 most common words and their respective counts. This is also an easy way to find candidates to add to the stopword list. Note that dai and prai are stems for day and pray. Love occurs 2.15 times as much as the most frequent word! top10 = topfeatures(shakes_dtm, 10) top10 love heart eye god day hand hear live death night 2918 1359 1300 1284 1229 1226 1043 1015 1010 1001 The following is a word cloud. They are among the most useless visual displays imaginable. Just because you can, doesn’t mean you should. If you want to display relative frequency do so. Similarity The quanteda package has some built in similarity measures such as cosine similarity, which you can think of similarly to the standard correlation (also available as an option). I display it visually to better get a sense of things. ## textstat_simil(shakes_dtm, margin = "documents", method = "cosine") We can already begin to see the clusters of documents. For example, the more historical are the clump in the upper left. The oddball is The Phoenix and the Turtle, though Lover’s Complaint and the Elegy are also less similar than standard Shakespeare. 
The Phoenix and the Turtle is about the death of ideal love, represented by the Phoenix and Turtledove, for which there is a funeral. It actually is considered by scholars to be in stark contrast to his other output. Elegy itself is actually written for a funeral, but probably not by Shakespeare. A Lover’s Complaint is thought to be an inferior work by the Bard by some critics, and maybe not even authored by him, so perhaps what we’re seeing is a reflection of that lack of quality. In general, we’re seeing things that we might expect. Readability We can examine readability scores for the texts, but for this we’ll need them in raw form. We already had them from before, I just added Phoenix from the Gutenberg download. raw_texts # A tibble: 43 x 2 id text <chr> <list> 1 A_Lover_s_Complaint.txt <chr [813]> 2 A_Midsummer_Night_s_Dream.txt <chr [6,630]> 3 All_s_Well_That_Ends_Well.txt <chr [10,993]> 4 Antony_and_Cleopatra.txt <chr [14,064]> 5 As_You_Like_It.txt <chr [9,706]> 6 Coriolanus.txt <chr [13,440]> 7 Cymbeline.txt <chr [11,388]> 8 Elegy.txt <chr [1,316]> 9 Hamlet.txt <chr [13,950]> 10 Henry_V.txt <chr [9,777]> # ... with 33 more rows With raw texts, we need to convert them to a corpus object to proceed more easily. The corpus function from quanteda won’t read directly from a list column or a list at all, so we’ll convert it via the tm package, which more or less defeats the purpose of using the quanteda package, except that the textstat_readability function gives us what we want, but I digress. Unfortunately, the concept of readability is ill-defined, and as such, there are dozens of measures available dating back nearly 75 years. The following is based on the Coleman-Liau grade score (higher grade = more difficult). The conclusion here is first, Shakespeare isn’t exactly a difficult read, and two, the poems may be more so relative to the other works. library(tm) raw_text_corpus = corpus(VCorpus(VectorSource(raw_texts$text))) shakes_read = textstat_readability(raw_text_corpus) Lexical diversity There are also metrics of lexical diversity. As with readability, there is no one way to measure ‘diversity’. Here we’ll go back to using the standard DTM, as the focus is on the terms, whereas readability is more at the sentence level. Most standard measures of lexical diversity are variants on what is called the type-token ratio, which in our setting is the number of unique terms (types) relative to the total terms (tokens). We can use textstat_lexdiv for our purposes here, which will provide several measures of diversity by default. ld = textstat_lexdiv(shakes_dtm) This visual is based on the (absolute) scaled values of those several metrics, and might suggest that the poems are relatively more diverse. This certainly might be the case for Phoenix, but it could also be a reflection of the limitation of several of the measures, such that longer works are seen as less diverse, as tokens are added more so than types the longer the text goes. As a comparison, the following shows the results of the ‘Measure of Textual Diversity’ calculated using the koRpus package18. It is notably less affected by text length, though the conclusions are largely the same. There is notable correlation between the MTLD and readability as well19. In general, Shakespeare tends to be more expressive in poems, and less so with comedies. Scene IV. Topic model I’d say we’re now ready for topic model. That didn’t take too much did it? Running the model and exploring the topics We’ll run one with 10 topics. 
As in the previous example in this document, we’ll use topicmodels and the LDA function. Later, we’ll also compare our results with the traditional classifications of the texts. Note that this will take a while to run depending on your machine (maybe a minute or two). Faster implementation can be found with text2vec. library(topicmodels) shakes_10 = LDA(convert(shakes_dtm, to = "topicmodels"), k = 10, control=list(seed=1234)) One of the first things to do is to interpret the topics, and we can start by seeing which terms are most probable for each topic. get_terms(shakes_10, 20) We can see there is a lot of overlap in these topics for top terms. Just looking at the top 10, love occurs in all of them, god and heart are common as well, but we could have guessed this just looking at how often they occur in general. Other measures can be used to assess term importance, such as those that seek to balance the term’s probability of occurrence within a document, and term exclusivity, or how likely a term is to occur in only one particular topic. See the stm package and corresponding labelTopics function as a way to get several alternatives. As an example, I show the results of their version of the following20: FREX: FRequency and EXclusivity, it is a weighted harmonic mean of a term’s rank within a topic in terms of frequency and exclusivity. lift: Ratio of the term’s probability within a topic to its probability of occurrence across all documents. Overly sensitive to rare words. score: Another approach that will give more weight to more exclusive terms. prob: This is just the raw probability of the term within a given topic. As another approach, consider the saliency and relevance of term via the LDAvis package. While you can play with it here, it’s probably easier to open it separately. Note that this has to be done separately from the model, and may have topic numbers in a different order. Your browser does not support iframes. Given all these measures, one can assess how well they match what topics the documents would be most associated with. t(topics(shakes_10, 3)) For example, based just on term frequency, Hamlet is most likely to be associated with Topic 1. That topic is affiliated with the (stemmed words) love, night, heaven, heart, natur, ey, hear, hand, life, fear, death, prai, poor, friend, soul, hold, word, live, stand, head. The other measures pick up on words like Dane and Denmark. Sounds about right for Hamlet. The following visualization shows a heatmap for the topic probabilities of each document. Darker values mean higher probability for a document expressing that topic. I’ve also added a cluster analysis based on the cosine distance matrix, and the resulting dendrogram21. The colored bar on the right represents the given classification of a work as history, tragedy, comedy, or poem. A couple things stand out. To begin with, most works are associated with one topic22. In terms of the discovered topics, traditional classification really probably only works for the historical works, as they cluster together as expected (except for Henry the VIII, possibly due to it being a collaborative work). Furthermore, tragedies and comedies might hit on the same topics, albeit from different perspectives. In addition, at least some works are very poetical, or at least have topics in common with the poems (love, beauty). 
If we take four clusters from the cluster analysis, the result boils down to Phoenix, Complaint, standard poems, a mixed bag of more romance-oriented works and the remaining poems, then everything else. Alternatively, one could merely classify the works based on their probable topics, which would make more sense if clustering of the works is in fact the goal. The following visualization attempts to order them based on their most probable topic. The order is based on the most likely topics across all documents. The following shows the average topic probability for each of the traditional classes. Topics are represented by their first five most probable terms. Aside from the poems, the classes are a good mix of topics, and appear to have some overlap. Tragedies are perhaps most diverse. Summary of Topic Models This is where the summary would go, but I grow weary… FIN If you can think of a use case where x<br>y<br>z leading to xyz would be both expected as default behavior and desired please let me know.↩ If this surprises you, let me remind you that there are over 10k packages on CRAN alone.↩ I found it easier to work with the entire data frame for the function, hence splitting it on id and recombining. Some attempt was made to work within the tidyverse, but there were numerous issues to what should have been a fairly easy task.↩ Love might as well be a stopword for Shakespeare.↩ I don’t show this as I actually did it in parallel due to longer works taking a notable time to calculate MTLD.↩ The Pearson correlation between MTLD and the Coleman Liau grade readability depicted previously was .87.↩ These descriptions are from Sievert and Shirley 2014.↩ If you are actually interested in clustering the documents (or anything for that matter in my opinion), this would not be the way to do so. For one, the documents are already clustered based on most probable topic. Second, cosine distance isn’t actually a proper distance. Third, as shocking as it may seem, newer methods have been developed since the hierarchical clustering approach, which basically has a dozen arbitrary choices to be made at each step. However, as a simple means to a visualization, the method is valuable if it helps with understanding the data.↩ There isn’t a lot to work within the realm of choosing an ‘optimal’ number of topics, but I investigated it via a measure called perplexity. It bottomed out at around 50 topics. Usually such an approach is done through cross-validation. However, the solution chosen has no guarantee to produce human interpretable topics.↩ "],
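For reference, the document-by-topic heatmap and dendrogram described in this chapter can be roughly reconstructed from the fitted model alone. The plotting details aren’t shown in the text, but a bare-bones base R version might look like the following, using the shakes_10 model from above:
topic_probs = posterior(shakes_10)$topics        # documents x topics matrix
# cosine similarity between works based on their topic profiles
cos_sim  = tcrossprod(topic_probs / sqrt(rowSums(topic_probs^2)))
doc_dist = as.dist(1 - cos_sim)
plot(hclust(doc_dist), cex = .7)                 # dendrogram of the works
heatmap(topic_probs, scale = 'none',
        distfun = function(x) as.dist(1 - tcrossprod(x / sqrt(rowSums(x^2)))))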
["appendix.html", "Appendix Texts R Python A Faster LDA", " Appendix Texts Donald Barthelme “I have to admit we are mired in the most exquisite mysterious muck. This muck heaves and palpitates. It is multi-directional and has a mayor.” “You may not be interested in absurdity, but absurdity is interested in you.” The First Thing the Baby Did Wrong This short story is essentially a how-to on parenting. link The Balloon This story is about a balloon that can represent whatever you want it to. link Some of Us Had Been Threatening Our Friend Colby A brief work about etiquette and how to act in society. link Raymond Carver “It ought to make us feel ashamed when we talk like we know what we’re talking about when we talk about love.” “That’s all we have, finally, the words, and they had better be the right ones.” What We Talk About When We Talk About Love The text we use is actually Beginners, or the unedited version. A drink is required in order to read it with the proper context. Probably several. No. Definitely several. link Billy Dee Shakespeare “It works every time.” These old works have pretty much no relevance today, and are mostly forgotten by everyone except humanities faculty. The analysis of them depicted in this document is pretty much definitive, and leaves little else to say regarding them, so don’t bother reading them if you haven’t already. R Up until even a couple years ago, R was terrible at text. You really only had base R for basic processing and a couple packages that were not straightforward to use. There was little for scraping the web. Nowadays, I would say it’s probably easier to deal with text in R than it is elsewhere, including Python. Packages like rvest, stringr/stringi, and tidytext and more make it almost easy enough to jump right in. One can peruse the Natural Language Processing task view to start getting a sense of what all is available in R. NLP task view The one drawback with R is that most of the dealing with text is slow and/or memory intensive. The Shakespeare texts are only a few dozen and not very long works, and yet your basic LDA might still take a minute or so. Most text analysis situations might have thousands to millions of texts, such that the corpus itself may be too much to hold in memory, and thus R, at least on a standard computing device or with the usual methods, might not be viable for your needs. Python While R has done a lot to catch up, more advanced text analysis techniques are developed in Python (if not lower level languages), and so the state of the art may be found there. Furthermore, much of text analysis is a high volume affair, and that means it will likely be done much more efficiently in the Python environment if so, though one still might need a high performance computing environment. Here are some of the popular modules in Python. nltk textblob (the tidytext for Python) gensim (topic modeling) spaCy A Faster LDA We noted in the Shakespeare start to finish example that there are faster alternatives than the standard LDA in topicmodels. In particular, the powerful text2vec package contains a faster and less memory intensive implementation of LDA and dealing with text generally. Both of which are very important if you’re wanting to use R for text analysis. The other nice thing is that it works with LDAvis for visualization. For the following, we’ll use one of the partially cleaned document term matrix for the Shakespeare texts. 
One of the things to get used to is that text2vec uses the newer R6 classes of R objects, hence the $ approach you see to using specific methods. library(text2vec) load('data/shakes_dtm_stemmed.RData') # load('data/shakes_words_df.RData') # non-stemmed # convert to the sparse matrix representation using Matrix package shakes_dtm = as(shakes_dtm, 'CsparseMatrix') # setup the model lda_model = LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01) # fit the model doc_topic_distr = lda_model$fit_transform(x = shakes_dtm, n_iter = 1000, convergence_tol = 0.0001, n_check_convergence = 25, progressbar = FALSE) INFO [2018-03-06 19:16:15] iter 25 loglikelihood = -1746173.024 INFO [2018-03-06 19:16:16] iter 50 loglikelihood = -1683541.903 INFO [2018-03-06 19:16:17] iter 75 loglikelihood = -1660985.396 INFO [2018-03-06 19:16:17] iter 100 loglikelihood = -1648984.411 INFO [2018-03-06 19:16:18] iter 125 loglikelihood = -1641481.467 INFO [2018-03-06 19:16:19] iter 150 loglikelihood = -1638983.461 INFO [2018-03-06 19:16:20] iter 175 loglikelihood = -1636730.733 INFO [2018-03-06 19:16:20] iter 200 loglikelihood = -1636356.883 INFO [2018-03-06 19:16:21] iter 225 loglikelihood = -1636487.222 INFO [2018-03-06 19:16:21] early stopping at 225 iteration lda_model$get_top_words(n = 10, topic_number = 1:10, lambda = 1) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] "prai" "hear" "ey" "love" "word" "natur" "night" "god" "friend" "death" [2,] "honor" "madam" "sweet" "dai" "letter" "fortun" "fear" "dai" "hand" "grace" [3,] "heaven" "bring" "fair" "true" "hous" "world" "ear" "england" "nobl" "soul" [4,] "life" "sea" "heart" "wit" "prai" "power" "sleep" "crown" "word" "live" [5,] "matter" "bear" "light" "fair" "sweet" "poor" "death" "war" "stand" "blood" [6,] "honest" "seek" "desir" "live" "husband" "set" "dead" "arm" "rome" "life" [7,] "fellow" "heard" "beauti" "youth" "woman" "nobl" "bid" "majesti" "honor" "dai" [8,] "hear" "lose" "black" "heart" "reason" "truth" "bed" "fight" "leav" "hope" [9,] "heart" "strang" "kiss" "marri" "hand" "leav" "mad" "sword" "deed" "heaven" [10,] "friend" "sister" "sun" "night" "talk" "command" "hand" "heart" "tear" "die" which.max(doc_topic_distr['Hamlet', ]) [1] 7 # top-words could be sorted by “relevance” which also takes into account # frequency of word in the corpus (0 < lambda < 1) lda_model$get_top_words(n = 10, topic_number = 1:10, lambda = 0.2) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] "honest" "madam" "ey" "love" "letter" "natur" "ear" "england" "rome" "bloodi" [2,] "beseech" "sea" "cheek" "youth" "merri" "report" "sleep" "majesti" "deed" "royal" [3,] "knave" "water" "black" "wit" "woo" "spirit" "beat" "field" "banish" "graciou" [4,] "warrant" "sister" "wretch" "signior" "jest" "judgment" "night" "uncl" "countri" "high" [5,] "glad" "women" "flower" "count" "finger" "worst" "air" "march" "citi" "subject" [6,] "action" "hair" "sweet" "lover" "choos" "author" "soft" "lieg" "son" "sovereign" [7,] "worship" "lose" "vow" "danc" "ring" "qualiti" "knock" "fight" "rise" "foe" [8,] "matter" "entreat" "mortal" "song" "horn" "virgin" "poison" "battl" "kneel" "flourish" [9,] "fellow" "seek" "wing" "paint" "bond" "wine" "shake" "harri" "fly" "king" [10,] "walk" "passion" "short" "wed" "troth" "direct" "move" "crown" "wert" "tide" # ldavis not shown # lda_model$plot() Given that most text analysis can be very time consuming for a model, consider any approach that might give you more efficiency. "]
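As a quick usage note on the output above, doc_topic_distr is just a documents-by-topics matrix with the works as row names, so base R is enough to explore it, e.g. to see which other works load most heavily on the topic Hamlet favors:
hamlet_topic = which.max(doc_topic_distr['Hamlet', ])
head(sort(doc_topic_distr[, hamlet_topic], decreasing = TRUE), 5)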
]