Wouter van Atteveldt & Kasper Welbers 2019-04
For text analysis it is often useful to POS tag and lemmatize your text, especially with non-English data. Lemmatizing generally works much better than stemming, especially for richly inflected languages such as German or French. Part-of-Speech (POS) tags identify the type of word (noun, verb, etc.), so they can be used to e.g. analyse only the verbs (actions) or adjectives (descriptions). Finally, you can often automatically extract named entities such as people or organizations, making it easy to e.g. generate a list of actors in a corpus.
There are two packages that do this that are both easy to use and support multiple languages: spacyr and udpipe. This handout will review both packages and show how they can be used to analyse text and convert the results to a tcorpus and/or dfm.
Spacy is a python package with processing models for a number of different languages, which makes it attractive to use if you need e.g. French or German lemmatizing.
Since it is natively made in python, it uses the reticulate package to communicate between python and R. Fortunately, installing it has become a lot easier by using miniconda, basically a prepackaged python environment that can be installed directly from R. According to the documentation, on Windows you need to run R as administrator to be able to install spacyr from R.
To install spacy and spacyr, first install the package as usual, and then use spacyr to download and install spacy in the miniconda environment:
If asked whether to install miniconda, just answer yes.
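The installation steps described above can be sketched as follows. This is a minimal sketch that assumes a CRAN installation and the default settings of spacy_install; the exact prompts and the type of Python environment that gets created may differ between spacyr versions.

```r
# Install the R interface from CRAN as usual
install.packages("spacyr")
library(spacyr)

# Let spacyr download and set up spacy (with the default English model)
# in its own Python environment; answer "yes" if asked to install miniconda
spacy_install()
```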
Before using spacyr, you initialize it with the name of the language model you want to use. By default, the English model is installed and will be used if you don’t specify another option:
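A minimal sketch of the initialization call, assuming spacy was installed through spacyr with its default settings:

```r
library(spacyr)

# Start the background python process with the default (English) model;
# another model can be selected by passing its name as the model argument
spacy_initialize()
```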
If all is well, this should give an informational message that it found the environment and successfully initialized spacy. You are now ready to use spacy for parsing an English sentence:
# Parse two short example sentences with the initialized model
txt = c("Spacy was successfully installed", "Isn't it miraculous?")
tokens = spacy_parse(txt)
This yields a dataset with one word on each row, and the columns containing the original word (token), its lemma (i.e. dictionary stem, so the lemma of ‘was’ is (to) ‘be’) and its part-of-speech tag (pos, so adverb, verb, proper name, etc.). The final column identifies named entities, i.e. persons, organizations, and locations.
As a slightly bigger example, this parses the State of the Union speeches of Obama (taken from the sotu package):
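library(sotu)
library(tidyverse)
# Combine metadata and text; spacy_parse expects columns named doc_id and text
speeches_obama = add_column(sotu_meta, text=sotu_text) |>
  as_tibble() |>
  rename(doc_id=X) |>
  filter(president == 'Barack Obama')
tokens = spacy_parse(speeches_obama)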
As the tokens are just a data frame with one row per token (word), they are already in the format required for tidytext and we can use regular tidyverse functions to inspect and manipulate the outcome.
For example, to list the nouns used by Obama:
tokens |>
  as_tibble() |>
  filter(pos == "NOUN") |>
  group_by(lemma) |>
  summarize(n=n()) |>
  arrange(desc(n))
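The group_by, summarize and arrange steps used here can also be written more compactly with dplyr's count function. A minimal sketch on a hand-made toy data frame (toy_tokens and its values are hypothetical and only mimic the columns of the parser output):

```r
library(dplyr)

# Hypothetical hand-made tokens, mimicking the columns of the parser output
toy_tokens = tibble(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  lemma  = c("nation", "be", "job", "job", "strong"),
  pos    = c("NOUN", "AUX", "NOUN", "NOUN", "ADJ")
)

# count() groups by lemma, counts the rows, and sorts by frequency in one step
noun_counts = toy_tokens |>
  filter(pos == "NOUN") |>
  count(lemma, sort = TRUE)
noun_counts
```

Both forms should give the same table; count is simply shorter when a frequency is all you need.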
There are also some built-in functions to help deal with multi-word entities or noun phrases:
entities = entity_extract(tokens)
As you can see, this ‘merges’ words that together form a name, such as ‘Justice Roberts’. You can also consolidate these so that the original tokens are replaced by the new merged token:
tokens2 = entity_consolidate(tokens)
tokens2 |>
  as_tibble() |>
  filter(entity_type == "LOC") |>
  group_by(lemma) |>
  summarize(n=n()) |>
  arrange(desc(n))
A similar function pair exists to deal with noun phrases, but this requires you to enable noun phrase parsing in the original parse call (you could enable dependency parsing in a similar fashion with the dependency argument):
nptokens = spacy_parse(speeches_obama, nounphrase = TRUE)
nps = nounphrase_extract(nptokens)
As you can see, this detects phrases such as ‘president Carter’ as well as ‘fellow Americans’.
Spacyr was developed by the same people that made quanteda, so as you can guess they work together quite well. In fact, the data frame returned by spacyr can be directly used in most quanteda functions. For functions that do not accept a tokens data frame themselves, there is an as.tokens function that converts it:
tokens %>% as.tokens(include_pos="pos", use_lemma=TRUE) %>% dfm() %>% textplot_wordcloud()
Spacyr keeps a python process running, which can consume quite a lot of memory. When you are done with spacy (but want to continue with R), you can finalize spacy:
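A minimal sketch of that call, assuming a spacy session was started earlier in the same R session:

```r
library(spacyr)

# Shut down the background python process started by spacy_initialize()
spacy_finalize()
```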
This saves some memory and allows you to re-initialize it, e.g. with a different language model.
By default, other language models are not included with spacy. You can download these models using the built-in download function, for example for German:
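A minimal sketch of the download step. The model name de_core_news_sm (spacy's small German model) is an assumption chosen for illustration; any model from the spacy model overview could be used instead.

```r
library(spacyr)

# Download the small German model into the environment used by spacyr
spacy_download_langmodel("de_core_news_sm")
```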
To use it, initialize spacy with this model (note that you need to finalize an existing session before you can do this):
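As a sketch, switching to a German model could look like this, again assuming the de_core_news_sm model has been downloaded:

```r
library(spacyr)

# Only needed if a session (e.g. the English one) is still running
spacy_finalize()

# Start a new session that uses the German model
spacy_initialize(model = "de_core_news_sm")
```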
spacy_parse("Ich bin ein Berliner")
Note that if you prefer to install spacy yourself rather than through spacyr, you can pass the virtual environment location to the initialize and download functions. We would recommend just using the miniconda environment, though, unless for some reason that doesn’t work.
See https://spacy.io/usage/models for an overview of available languages.
UDPipe is an R package that is quite similar to spacy in many regards. It is slightly easier to install (but doesn’t integrate as well with quanteda) and both should be comparable in performance.
What is nice about UDPipe is that you can call it directly after installing it: if you use a language for which the model is not already installed, it will automatically download the language model.
So, we can simply call udpipe directly:
library(udpipe)
library(tidyverse)
txt = c("UDPipe was successfully installed as well", "It doesn't even need to be initialized")
tokens = udpipe(txt, "english", parser="none") %>% as_tibble()
tokens %>% select(doc_id, token_id:xpos)
Or with German text:
udpipe("Ich bin ein Berliner", "german", parser="none") %>% select(doc_id, token_id:xpos)
As you can see, the output is very similar to spacy’s output, although confusingly they use different naming conventions. The xpos tag will depend on the language model, but the upos (universal part-of-speech) tag will be the same for all languages.
Note that we specified parser="none" to disable dependency parsing, which would make it quite a lot slower.
Similar to spacy, the output of udpipe is a data frame with a word per row, so it can be directly used in tidytext.
To preserve document identifiers (so results can be merged back with metadata), it is best to call udpipe on a data frame rather than on a character vector. Again similar to spacy, this assumes that the data frame contains columns called doc_id and text:
speeches_obama = add_column(sotu_meta, text=sotu_text) |>
  as_tibble() |>
  rename(doc_id=X) |>
  filter(president == 'Barack Obama')
tokens = udpipe(speeches_obama, "english", parser="none")
tokens = tokens |>
  as_tibble() |>
  select(doc_id, paragraph_id, sentence_id, token, lemma, upos)
To use udpipe output in quanteda, you need to first convert it into a list of tokens per document:
tokens = udpipe(speeches_obama, "english", parser="none")
tokenlist = split(tokens$lemma, tokens$doc_id)
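To make clear what split produces here, the following minimal sketch applies the same call to a hand-made toy data frame (toy and its values are hypothetical, standing in for the udpipe output):

```r
# Hypothetical miniature version of the udpipe output: one row per token
toy = data.frame(
  doc_id = c("a", "a", "b", "b", "b"),
  lemma  = c("we", "work", "job", "be", "good"),
  stringsAsFactors = FALSE
)

# split() groups the lemma vector by document id,
# giving a named list with one character vector per document
toy_list = split(toy$lemma, toy$doc_id)
str(toy_list)
```

The list elements are ordered by the sorted document ids rather than by first appearance, which is worth keeping in mind when merging back with metadata.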
Now, you can use as.tokens and proceed as above:
tokenlist %>% as.tokens() %>% dfm() %>% textplot_wordcloud()
Note that we used the lemma above. We can also add the POS tags to the lemmas, so that we get the same lemma/POS format that as.tokens produced for the spacy output:
split(str_c(tokens$lemma, tokens$upos, sep = "/"), tokens$doc_id) %>% as.tokens() %>% tokens_select("*/NOUN") %>% dfm() %>% textplot_wordcloud()
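The tagging and filtering logic of this pipeline can be illustrated without quanteda. The sketch below is a base R stand-in on hypothetical toy data: paste plays the role of str_c, and endsWith is used here in place of the "*/NOUN" pattern passed to tokens_select.

```r
# Hypothetical toy tokens with lemma and universal POS tag
toy = data.frame(
  doc_id = c("a", "a", "b", "b"),
  lemma  = c("job", "create", "nation", "strong"),
  upos   = c("NOUN", "VERB", "NOUN", "ADJ"),
  stringsAsFactors = FALSE
)

# Glue lemma and POS tag together, e.g. "job/NOUN"
tagged = paste(toy$lemma, toy$upos, sep = "/")

# One character vector of tagged lemmas per document
tagged_list = split(tagged, toy$doc_id)

# Keep only the entries ending in "/NOUN"
nouns_only = lapply(tagged_list, function(x) x[endsWith(x, "/NOUN")])
nouns_only
```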
Of course, for udpipe as well as for spacy we can also do filtering or other operations on the tokens data frame before converting it to a quanteda object:
nouns = tokens %>% filter(upos == "NOUN")
split(nouns$lemma, nouns$doc_id) %>% as.tokens() %>% dfm() %>% textplot_wordcloud()