getting started
m-clark committed Jul 2, 2017
1 parent f278d0d commit d58d938
Showing 18 changed files with 343 additions and 0 deletions.
5 changes: 5 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
.Rproj.user
.Rhistory
.RData
.Ruserdata
*_book/
34 changes: 34 additions & 0 deletions 00_intro.Rmd
@@ -0,0 +1,34 @@
# Introduction

This work is in progress. Until this sentence is deleted you should probably ignore everything in it.

## Overview

The applied statistical training of most disciplines typically does not cover dealing with text at all. This is at odds with how often text has to be processed prior to analysis, and with how interesting it can be to make text the focus of analysis. This document and the corresponding workshop aim to provide a sense of the things one can do with text, and the sorts of analyses that might be useful. It must be stressed that this is only a starting point.

### Goals

The primary goal of this workshop is to provide a sense of common tasks that arise when text is part of the data or the focus of analysis. Additionally, we'll have exercises for practice.


### Prerequisites

The document is for the most part very applied in nature, and doesn't assume much beyond familiarity with the R statistical computing environment.

Note the following color coding used in this document:

- <span class="emph">emphasis</span>
- <span class="pack">package</span>
- <span class="func">function</span>
- <span class="objclass">object/class</span>
- [link]()


## Initial Steps

0. Download the zip file at . Be mindful of where you put it.
1. Unzip it. Be mindful of where you put the resulting folder.
2. Open RStudio.
3. File/Open Project and click on the blue icon in the folder you just created.
4. File/Open: click on the ReadMe file and do what it says.
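
For those who prefer to do steps 1 through 3 from the R console, the following is a minimal sketch and not part of the official setup. The helper name `unpack_workshop` and the example path are hypothetical, and it assumes the "blue icon" refers to the `.Rproj` file inside the unzipped folder.

```r
# Hypothetical helper: unzip the workshop archive and locate its RStudio project file.
# `zip_path` and `dest_dir` are placeholders; point them at wherever you saved the download.
unpack_workshop <- function(zip_path, dest_dir = dirname(zip_path)) {
  if (!file.exists(zip_path)) {
    stop("Zip file not found: ", zip_path)
  }
  # utils::unzip() extracts into `exdir` and returns the extracted file paths
  extracted <- utils::unzip(zip_path, exdir = dest_dir)
  # Assumption: the project file to open in RStudio is the one ending in .Rproj
  rproj <- extracted[grepl("\\.Rproj$", extracted)]
  list(files = extracted, project = rproj)
}

# Example call (the path is an assumption, adjust to your machine):
# res <- unpack_workshop("~/Downloads/workshop.zip")
# res$project   # open this file via File/Open Project in RStudio
```

Returning the extracted paths makes it easy to check where the folder ended up, which is the point of "be mindful of where you put it".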

3 changes: 3 additions & 0 deletions 01_strings.Rmd
@@ -0,0 +1,3 @@
# String Theory

regex and related, factors
1 change: 1 addition & 0 deletions 02_sentiment.Rmd
@@ -0,0 +1 @@
# Sentiment
1 change: 1 addition & 0 deletions 03_pos.Rmd
@@ -0,0 +1 @@
# POS tagging
1 change: 1 addition & 0 deletions 04_topic.Rmd
@@ -0,0 +1 @@
# Topic modeling
4 changes: 4 additions & 0 deletions _bookdown.yml
@@ -0,0 +1,4 @@
book_filename: "text_analysis"
language:
  ui:
    chapter_name: "Chapter "
3 changes: 3 additions & 0 deletions _build.sh
@@ -0,0 +1,3 @@
#!/bin/sh

Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::gitbook')"
16 changes: 16 additions & 0 deletions _deploy.sh
@@ -0,0 +1,16 @@
#!/bin/sh

set -e

[ -z "${GITHUB_PAT}" ] && exit 0
[ "${TRAVIS_BRANCH}" != "master" ] && exit 0

git config --global user.email "[email protected]"
git config --global user.name "Michael Clark"

git clone -b gh-pages https://${GITHUB_PAT}@github.com/${TRAVIS_REPO_SLUG}.git book-output
cd book-output
cp -r ../_book/* ./
git add --all *
git commit -m"Update the book" || true
git push origin gh-pages
20 changes: 20 additions & 0 deletions _output.yml
@@ -0,0 +1,20 @@

bookdown::gitbook:
  css: [css/standard_html.css, css/book.css]
  font-import: http://fonts.googleapis.com/css?family=Roboto|Open+Sans
  font-family: 'Roboto'
  config:
    toc:
      before: |
        <li><a href="weblink_to_file"><span style="font-size:125%; font-variant:small-caps; font-style:italic; color:#ff5503">Text Analysis in R</span></a></li>
      after: |
        <li><a href="https://m-clark.github.io" target="_blank" style="font-size:150%; font-variant:small-caps; color:#ff5500">Michael Clark</a></li>
    edit:
      link: https://github.com/m-clark/workshops/
  highlight: pygments
# bookdown::pdf_book:
#   includes:
#     in_header: preamble.tex
#   latex_engine: xelatex
#   citation_package: natbib
# bookdown::epub_book: default
58 changes: 58 additions & 0 deletions css/book.css
@@ -0,0 +1,58 @@
@import url("http://fonts.googleapis.com/css?family=Roboto");

.book {
background-color: #fffff8;
color: red;
}

.section {
background-color: #fffff8;
}

.book .book-body .page-wrapper .page-inner section.normal {
display: block;
background-color: #fffff8;
color: #595959;
font-family: Roboto;
word-wrap: break-word;
overflow: hidden;
line-height: 1.7;
text-size-adjust: 100%;
-ms-text-size-adjust: 100%;
-webkit-text-size-adjust: 100%;
-moz-text-size-adjust: 100%;
}

.body-inner {
background-color: #fffff8;
}

.book .book-body .page-wrapper {
position: relative;
outline: 0;
background-color: #fffff8;
}

.book .book-header {
overflow: visible;
height: 50px;
padding: 0 8px;
z-index: 2;
font-family: Roboto, sans-serif;
font-size: .85em;
color: #7e888b;
background-color: #fffff8;
/* background: 0 0; */
}

element .style {
background-color: #fffff8;
}

.book.with-summary .book-header.fixed {
background-color: #fffff8;
}

i {
background-color: #fffff8;
}
109 changes: 109 additions & 0 deletions css/standard_html.css
@@ -0,0 +1,109 @@
@import url("http://fonts.googleapis.com/css?family=Roboto+Mono");
@import url("http://fonts.googleapis.com/css?family=Roboto");
@import url("http://fonts.googleapis.com/css?family=Roboto+Condensed");
@import url("http://fonts.googleapis.com/css?family=Open+Sans");
@import url('https://fonts.googleapis.com/css?family=Open+Sans+Condensed:300');
@import url('https://fonts.googleapis.com/css?family=Source+Code+Pro');


.emph {
color: #ff5500;
font-weight: 500;
}

/* pack func and objclass colors come from hcl(seq(90,360, length.out=4), c=80, l=80)*/
.pack {
color: #AC9CFF; /*#e41a1c*/
font-weight: 500;
}

.func { /*#984ea3;can just use `` instead*/
color: #00CBB6;
font-weight: 500;
}

.objclass {
color: #AAB400; /*#4daf4a; #FFC5D0*/
font-weight: 500;
}

a {
color: #1e90ff; /*dodgerblue*/
}

body {
color: #595959; /*#595959 (i.e. gray35) provides recommended contrast ratio 7:1; previous #7f7f7f (4:1); */
background-color: #fffff8;
}

p {

}

.author {
font-variant: small-caps;
}

code {
box-sizing: border-box;
left: 0; /* This changes where the code chunk box actually starts */
/* padding: 10px 0 10px 60px; /* Change the last value here to move the text left or right */
position: relative;
width: 100%; /* This changes where the code chunk box ends on the right side */
font-family: 'Roboto Mono', 'Lucida Sans', Consolas, monospace;
font-size: 100%; /*changes output size; and comments*/
/*border: 10px solid #ff5500;*/ /*code block and results border*/
/*background-color:#ff5500;*/ /* results background*/

}
pre .r {
box-sizing: border-box;
left: 0; /* This changes where the code chunk box actually starts */
/* padding: 10px 0 10px 60px; /* Change the last value here to move the text left or right */
position: relative;
/*width: 100%; /* This changes where the code chunk box ends on the right side */
font-family: 'Roboto Mono', 'Lucida Sans', Consolas, monospace;
/*font-size: 100%; /*changes output size; and comments*/
/*border: 10px solid #ff5500;*/ /*code block and results border*/
/*background-color:#ff5500;*/ /* results background*/

}

#TOC{
font-size: 75%;

}

.table {
margin: 0 auto;
}


.tocify ul, .tocify li {
list-style: none;
margin: 0;
padding: 0;
border: none;
width: 100%;
line-height: 30px;
background-color: #fffff9;
}

li.tocify-item.list-group-item.active {
background-color: rgb(97,151,213); /* background of current section in toc */
}

h1, h2, h3, h4, h5 {
color:rgb(97,151,213);
}

.col2 {
columns: 2 200px; /* number of columns and width in pixels*/
-webkit-columns: 2 200px; /* chrome, safari */
-moz-columns: 2 200px; /* firefox */
}
.col3 {
columns: 3 100px;
-webkit-columns: 3 100px;
-moz-columns: 3 100px;
}
Binary file added img/ARC-acronym-signature.png
Binary file added img/nineteeneightyR.png
Binary file added img/signature-acronym.png
59 changes: 59 additions & 0 deletions index.Rmd
@@ -0,0 +1,59 @@
---
title: "An Introduction to Text Analysis with R"
author: |
  <div class="title"><span style="font-size:125%; font-variant:small-caps; ">Michael Clark</span><br><br>
  <span class="" style="font-size:75%">http://m-clark.github.io/workshops/</span><br><br>
  <img src="img/signature-acronym.png" style="width:30%; padding:10px 0;"> <br>
  <img src="img/ARC-acronym-signature.png" style="width:21%; padding:10px 0;"> </div>
date: "`r Sys.Date()`"
site: bookdown::bookdown_site
output:
  bookdown::gitbook:
    css: [css/standard_html.css, css/book.css]
    highlight: pygments
    number_sections: false
    # split_by: section
    config:
      toc:
        collapse: subsection
        scroll_highlight: yes
        before: null
        after: null
      toolbar:
        position: fixed
      edit : null
      download: null
      search: yes
      # fontsettings:
      #   theme: white
      #   family: sans
      #   size: 2
      sharing:
        facebook: yes
        twitter: yes
        google: no
        weibo: no
        instapaper: no
        vk: no
        all: ['facebook', 'google', 'twitter', 'weibo', 'instapaper']
always_allow_html: yes
font-import: http://fonts.googleapis.com/css?family=Roboto|Open+Sans
font-family: 'Roboto'
documentclass: book
# bibliography: refs.bib
biblio-style: apalike
link-citations: yes
description: "An Introduction to Text Analysis with R"
cover-image: img/nineteeneightyR.png
url: 'https\://m-clark.github.io/Workshops/' # evidently the \: is required or you'll get text in the title/toc area
github-repo: m-clark/
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE, message = FALSE, error = FALSE, comment = NA, R.options = list(width = 120),  # code
  dev.args = list(bg = 'transparent'), dev = 'svg',                                          # viz
  cache.rebuild = TRUE, cache = TRUE                                                         # cache
)
```


#
12 changes: 12 additions & 0 deletions stuff
@@ -0,0 +1,12 @@

- dealing with character strings
  - regex, factor labels, standard modeling
- sentiment analysis
  - https://github.com/mjockers/syuzhet; https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html
  - https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html
  - https://uc-r.github.io/sentiment_analysis
- pos tagging
  - http://martinschweinberger.de/docs/articles/PosTagR.pdf
- topic modeling
  - your stuff
  - https://cran.r-project.org/web/packages/tidytext/vignettes/topic_modeling.html
17 changes: 17 additions & 0 deletions text-analysis-with-R.Rproj
@@ -0,0 +1,17 @@
Version: 1.0

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: knitr
LaTeX: pdfLaTeX

StripTrailingWhitespace: Yes

BuildType: Website
