Showing 18 changed files with 343 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
.Rproj.user | ||
.Rhistory | ||
.RData | ||
.Ruserdata | ||
*_book/ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# Introduction | ||
|
||
This work is in progress. Until this sentence is deleted you should probably ignore everything in it. | ||
|
||
## Overview | ||
|
||
Dealing with text is rarely covered in the applied statistical training of most disciplines. This is at odds with how often text has to be processed prior to analysis, and with how interesting it can be as the focus of analysis itself. This document and the corresponding workshop aim to provide a sense of the things one can do with text, and the sorts of analyses that might be useful. It must be stressed that this is only a starting point. | ||
|
||
### Goals | ||
|
||
The primary goal of this workshop is to provide a sense of common tasks related to dealing with text, whether as part of the data or as the focus of analysis. Additionally, there will be exercises for practice. | ||
|
||
|
||
### Prerequisites | ||
|
||
The document is for the most part very applied in nature, and doesn't assume much beyond familiarity with the R statistical computing environment. | ||
|
||
Note the following color coding used in this document: | ||
|
||
- <span class="emph">emphasis</span> | ||
- <span class="pack">package</span> | ||
- <span class="func">function</span> | ||
- <span class="objclass">object/class</span> | ||
- [link]() | ||
|
||
|
||
## Initial Steps | ||
|
||
0. Download the zip file at . Be mindful of where you put it. | ||
1. Unzip it. Be mindful of where you put the resulting folder. | ||
2. Open RStudio. | ||
3. File/Open Project and click on the blue icon in the folder you just created. | ||
4. File/Open: click on the ReadMe file and do what it says. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# String Theory | ||
|
||
regex and related, factors |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Sentiment |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# POS tagging |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Topic modeling |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
book_filename: "text_analysis" | ||
language: | ||
ui: | ||
chapter_name: "Chapter " |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
#!/bin/sh | ||
|
||
Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::gitbook')" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
#!/bin/sh | ||
|
||
set -e | ||
|
||
[ -z "${GITHUB_PAT}" ] && exit 0 | ||
[ "${TRAVIS_BRANCH}" != "master" ] && exit 0 | ||
|
||
git config --global user.email "[email protected]" | ||
git config --global user.name "Michael Clark" | ||
|
||
git clone -b gh-pages https://${GITHUB_PAT}@github.com/${TRAVIS_REPO_SLUG}.git book-output | ||
cd book-output | ||
cp -r ../_book/* ./ | ||
git add --all * | ||
git commit -m"Update the book" || true | ||
git push origin gh-pages |
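
The two guard lines at the top of the deploy script above make it exit quietly unless a `GITHUB_PAT` token is set and the Travis build is on `master`. As a minimal sketch of that logic (the function name `should_deploy` is hypothetical and not part of this commit), the same check could be written as a small function that takes the token and branch as arguments:

```shell
#!/bin/sh
# Hypothetical helper, not part of the commit: succeeds (status 0) only when
# a token is present AND the branch is "master", mirroring the two guards.
should_deploy() {
  # $1 = token (e.g. GITHUB_PAT), $2 = branch (e.g. TRAVIS_BRANCH)
  [ -n "$1" ] && [ "$2" = "master" ]
}

# Use empty defaults so the check also works when the variables are unset.
if should_deploy "${GITHUB_PAT:-}" "${TRAVIS_BRANCH:-}"; then
  echo "deploy"
else
  echo "skip"
fi
```

Passing the values as arguments, rather than reading the environment inside the function, makes the check easy to try by hand: `should_deploy "" master` and `should_deploy sometoken dev` both return a non-zero status.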
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
|
||
bookdown::gitbook: | ||
css: [css/standard_html.css, css/book.css] | ||
font-import: http://fonts.googleapis.com/css?family=Roboto|Open+Sans | ||
font-family: 'Roboto' | ||
config: | ||
toc: | ||
before: | | ||
<li><a href="weblink_to_file"><span style="font-size:125%; font-variant:small-caps; font-style:italic; color:#ff5503">Text Analysis in R</span></a></li> | ||
after: | | ||
<li><a href="https://m-clark.github.io" target="_blank" style="font-size:150%; font-variant:small-caps; color:#ff5500">Michael Clark</a></li> | ||
edit: | ||
link: https://github.com/m-clark/workshops/ | ||
highlight: pygments | ||
# bookdown::pdf_book: | ||
# includes: | ||
# in_header: preamble.tex | ||
# latex_engine: xelatex | ||
# citation_package: natbib | ||
# bookdown::epub_book: default |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,58 @@ | ||
@import url("https://fonts.googleapis.com/css?family=Roboto"); | ||
|
||
.book { | ||
background-color: #fffff8; | ||
color: red; | ||
} | ||
|
||
.section { | ||
background-color: #fffff8; | ||
} | ||
|
||
.book .book-body .page-wrapper .page-inner section.normal { | ||
display: block; | ||
background-color: #fffff8; | ||
color: #595959; | ||
font-family: Roboto; | ||
word-wrap: break-word; | ||
overflow: hidden; | ||
line-height: 1.7; | ||
text-size-adjust: 100%; | ||
-ms-text-size-adjust: 100%; | ||
-webkit-text-size-adjust: 100%; | ||
-moz-text-size-adjust: 100%; | ||
} | ||
|
||
.body-inner { | ||
background-color: #fffff8; | ||
} | ||
|
||
.book .book-body .page-wrapper { | ||
position: relative; | ||
outline: 0; | ||
background-color: #fffff8; | ||
} | ||
|
||
.book .book-header { | ||
overflow: visible; | ||
height: 50px; | ||
padding: 0 8px; | ||
z-index: 2; | ||
font-family: Roboto, sans-serif; | ||
font-size: .85em; | ||
color: #7e888b; | ||
background-color: #fffff8; | ||
/* background: 0 0; */ | ||
} | ||
|
||
element .style { | ||
background-color: #fffff8; | ||
} | ||
|
||
.book.with-summary .book-header.fixed { | ||
background-color: #fffff8; | ||
} | ||
|
||
i { | ||
background-color: #fffff8; | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,109 @@ | ||
@import url("https://fonts.googleapis.com/css?family=Roboto+Mono"); | ||
@import url("https://fonts.googleapis.com/css?family=Roboto"); | ||
@import url("https://fonts.googleapis.com/css?family=Roboto+Condensed"); | ||
@import url("https://fonts.googleapis.com/css?family=Open+Sans"); | ||
@import url('https://fonts.googleapis.com/css?family=Open+Sans+Condensed:300'); | ||
@import url('https://fonts.googleapis.com/css?family=Source+Code+Pro'); | ||
|
||
|
||
.emph { | ||
color: #ff5500; | ||
font-weight: 500; | ||
} | ||
|
||
/* pack func and objclass colors come from hcl(seq(90,360, length.out=4), c=80, l=80)*/ | ||
.pack { | ||
color: #AC9CFF; /*#e41a1c*/ | ||
font-weight: 500; | ||
} | ||
|
||
.func { /*#984ea3;can just use `` instead*/ | ||
color: #00CBB6; | ||
font-weight: 500; | ||
} | ||
|
||
.objclass { | ||
color: #AAB400; /*#4daf4a; #FFC5D0*/ | ||
font-weight: 500; | ||
} | ||
|
||
a { | ||
color: #1e90ff; /*dodgerblue*/ | ||
} | ||
|
||
body { | ||
color: #595959; /*#595959 (i.e. gray35) provides recommended contrast ratio 7:1; previous #7f7f7f (4:1); */ | ||
background-color: #fffff8; | ||
} | ||
|
||
p { | ||
|
||
} | ||
|
||
.author { | ||
font-variant: small-caps; | ||
} | ||
|
||
code { | ||
box-sizing: border-box; | ||
left: 0; /* This changes where the code chunk box actually starts */ | ||
/* padding: 10px 0 10px 60px; /* Change the last value here to move the text left or right */ | ||
position: relative; | ||
width: 100%; /* This changes where the code chunk box ends on the right side */ | ||
font-family: 'Roboto Mono', 'Lucida Sans', Consolas, monospace; | ||
font-size: 100%; /*changes output size; and comments*/ | ||
/*border: 10px solid #ff5500;*/ /*code block and results border*/ | ||
/*background-color:#ff5500;*/ /* results background*/ | ||
|
||
} | ||
pre .r { | ||
box-sizing: border-box; | ||
left: 0; /* This changes where the code chunk box actually starts */ | ||
/* padding: 10px 0 10px 60px; /* Change the last value here to move the text left or right */ | ||
position: relative; | ||
/*width: 100%; /* This changes where the code chunk box ends on the right side */ | ||
font-family: 'Roboto Mono', 'Lucida Sans', Consolas, monospace; | ||
/*font-size: 100%; /*changes output size; and comments*/ | ||
/*border: 10px solid #ff5500;*/ /*code block and results border*/ | ||
/*background-color:#ff5500;*/ /* results background*/ | ||
|
||
} | ||
|
||
#TOC{ | ||
font-size: 75%; | ||
|
||
} | ||
|
||
.table { | ||
margin: 0 auto; | ||
} | ||
|
||
|
||
.tocify ul, .tocify li { | ||
list-style: none; | ||
margin: 0; | ||
padding: 0; | ||
border: none; | ||
width: 100%; | ||
line-height: 30px; | ||
background-color: #fffff9; | ||
} | ||
|
||
li.tocify-item.list-group-item.active { | ||
background-color: rgb(97,151,213); /* background of current section in toc */ | ||
} | ||
|
||
h1, h2, h3, h4, h5 { | ||
color:rgb(97,151,213); | ||
} | ||
|
||
.col2 { | ||
columns: 2 200px; /* number of columns and width in pixels*/ | ||
-webkit-columns: 2 200px; /* chrome, safari */ | ||
-moz-columns: 2 200px; /* firefox */ | ||
} | ||
.col3 { | ||
columns: 3 100px; | ||
-webkit-columns: 3 100px; | ||
-moz-columns: 3 100px; | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
--- | ||
title: "An Introduction to Text Analysis with R" | ||
author: | | ||
<div class="title"><span style="font-size:125%; font-variant:small-caps; ">Michael Clark</span><br><br> | ||
<span class="" style="font-size:75%">http://m-clark.github.io/workshops/</span><br><br> | ||
<img src="img/signature-acronym.png" style="width:30%; padding:10px 0;"> <br> | ||
<img src="img/ARC-acronym-signature.png" style="width:21%; padding:10px 0;"> </div> | ||
date: "`r Sys.Date()`" | ||
site: bookdown::bookdown_site | ||
output: | ||
bookdown::gitbook: | ||
css: [css/standard_html.css, css/book.css] | ||
highlight: pygments | ||
number_sections: false | ||
# split_by: section | ||
config: | ||
toc: | ||
collapse: subsection | ||
scroll_highlight: yes | ||
before: null | ||
after: null | ||
toolbar: | ||
position: fixed | ||
edit : null | ||
download: null | ||
search: yes | ||
# fontsettings: | ||
# theme: white | ||
# family: sans | ||
# size: 2 | ||
sharing: | ||
facebook: yes | ||
twitter: yes | ||
google: no | ||
weibo: no | ||
instapaper: no | ||
vk: no | ||
all: ['facebook', 'google', 'twitter', 'weibo', 'instapaper'] | ||
always_allow_html: yes | ||
font-import: http://fonts.googleapis.com/css?family=Roboto|Open+Sans | ||
font-family: 'Roboto' | ||
documentclass: book | ||
# bibliography: refs.bib | ||
biblio-style: apalike | ||
link-citations: yes | ||
description: "An Introduction to Text Analysis with R" | ||
cover-image: img/nineteeneightyR.png | ||
url: 'https\://m-clark.github.io/Workshops/' # evidently the \: is required or you'll get text in the title/toc area | ||
github-repo: m-clark/ | ||
--- | ||
|
||
```{r setup, include=FALSE} | ||
knitr::opts_chunk$set(echo = TRUE, message=FALSE, error=FALSE, comment=NA, R.options=list(width=120), # code | ||
dev.args=list(bg = 'transparent'), dev='svg', # viz | ||
cache.rebuild=TRUE, cache=TRUE) # cache | ||
``` | ||
|
||
|
||
# |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
|
||
- dealing with character strings | ||
- regex, factor labels, standard modeling | ||
- sentiment analysis | ||
- https://github.com/mjockers/syuzhet; https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html | ||
- https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html | ||
- https://uc-r.github.io/sentiment_analysis | ||
- pos tagging | ||
- http://martinschweinberger.de/docs/articles/PosTagR.pdf | ||
- topic modeling | ||
- your stuff | ||
- https://cran.r-project.org/web/packages/tidytext/vignettes/topic_modeling.html |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
Version: 1.0 | ||
|
||
RestoreWorkspace: Default | ||
SaveWorkspace: Default | ||
AlwaysSaveHistory: Default | ||
|
||
EnableCodeIndexing: Yes | ||
UseSpacesForTab: Yes | ||
NumSpacesForTab: 2 | ||
Encoding: UTF-8 | ||
|
||
RnwWeave: knitr | ||
LaTeX: pdfLaTeX | ||
|
||
StripTrailingWhitespace: Yes | ||
|
||
BuildType: Website |