title: Web Scraping Tutorial
author: Dusty Turner
date: April 27, 2018
output: html_document (theme: united, highlight: tango)

knitr::opts_chunk$set(
	echo = TRUE,
	message = FALSE,
	warning = FALSE
)

Load / Install Packages


# install.packages("tidyverse")
library(tidyverse)
# install.packages("rvest")
library(rvest)

Just a refresher on the '%>%' operator

multiplyfunction = function(x,y) {
  z=x*y
  return(z)
}

multiplyfunction(3,4)

3 %>% multiplyfunction(4)

Why do this?

subtractfunction = function(i,j) {
  k=i-j
  return(k)
}

3 %>% multiplyfunction(4) %>% subtractfunction(2)

subtractfunction(multiplyfunction(3,4),2)

It provides the same answer but is more 'readable'. To quote Hadley Wickham:

"R is optimized for human performance not computer performance."
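To make the equivalence concrete, here is a minimal sketch that checks the nested and piped forms against each other. The helper functions `add_tax` and `round_cents` and the numbers are made up for illustration; they are not part of the original tutorial.

```r
library(magrittr)

# Hypothetical helpers, used only to illustrate the pipe
add_tax = function(price, rate) price * (1 + rate)
round_cents = function(x) round(x, 2)

# Nested form: read from the inside out
nested = round_cents(add_tax(19.99, 0.08))

# Piped form: read left to right; the left-hand side becomes the first argument
piped = 19.99 %>% add_tax(0.08) %>% round_cents()

identical(nested, piped)
```

The pipe only changes how the call is written, not what is computed, so both forms return the same value.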

Let's Get Scraping

To find the CSS selector, you'll need to install SelectorGadget. Follow the instructions on its website to select the elements you want to 'scrape', then paste the resulting CSS selector into the html_nodes() call in the command below.
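As a rough offline illustration of what a CSS selector does, the sketch below parses a small made-up HTML string instead of a live page. The table contents are invented; only the `.bpi__table td` selector comes from the tutorial.

```r
library(rvest)

# A tiny stand-in page (hypothetical content), parsed from a string
page = read_html('
  <table class="bpi__table">
    <tr><td>1</td><td>Virginia</td></tr>
    <tr><td>2</td><td>Villanova</td></tr>
  </table>
  <table class="other"><tr><td>ignore me</td></tr></table>
')

# Keep only the <td> cells inside elements with class "bpi__table"
cells = page %>%
  html_nodes(".bpi__table td") %>%
  html_text()

cells
```

Only the four cells from the first table are returned; the second table does not match the selector.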

url = "http://www.espn.com/mens-college-basketball/bpi/_/view/overview/page/1"

stuff = url %>% read_html() %>% html_nodes(".bpi__table td") %>% html_text()

espnBPI = url %>%
  read_html() %>%
  html_nodes(".bpi__table td") %>%
  html_text() %>%
  matrix(ncol = 8, byrow = TRUE)

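# as_tibble() replaces the deprecated as.tibble(); the matrix has no column names yet
as_tibble(espnBPI, .name_repair = "unique")

Let's Extend This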
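The `matrix(ncol = 8, byrow = TRUE)` step works because `html_text()` returns one long character vector of cells, read row by row. A small sketch with made-up values and three columns instead of eight shows the reshaping:

```r
# Flat vector of cells, as html_text() would return them (values are made up)
flat = c("1", "Virginia", "ACC",
         "2", "Villanova", "Big East")

# Fill the matrix row by row, three cells per row
m = matrix(flat, ncol = 3, byrow = TRUE)

m
```

Without `byrow = TRUE`, R would fill the matrix column by column and the cells of each team would end up scattered across rows.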

url = c("http://www.espn.com/mens-college-basketball/bpi/_/view/overview/page/1",
        "http://www.espn.com/mens-college-basketball/bpi/_/view/overview/page/2")

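# Start with a placeholder row; it is dropped in the cleanup step below
espnBPITotal = NA

for (i in seq_along(url)) {
  # Scrape one page and reshape its cells into 8 columns
  espnBPI = url[i] %>%
    read_html() %>%
    html_nodes(".bpi__table td") %>%
    html_text() %>%
    matrix(ncol = 8, byrow = TRUE)
  # Append this page's rows to the running total
  espnBPITotal = rbind(espnBPITotal, espnBPI)
}

as_tibble(espnBPITotal, .name_repair = "unique")

Let's Extend This More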

for (i in 1:15) {
  url[i]=paste0("http://www.espn.com/mens-college-basketball/bpi/_/view/overview/page/", i)
}

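# Start with a placeholder row; it is dropped in the cleanup step below
espnBPITotal = NA

for (i in seq_along(url)) {
  # Scrape one page and reshape its cells into 8 columns
  espnBPI = url[i] %>%
    read_html() %>%
    html_nodes(".bpi__table td") %>%
    html_text() %>%
    matrix(ncol = 8, byrow = TRUE)
  # Append this page's rows to the running total
  espnBPITotal = rbind(espnBPITotal, espnBPI)
}

as_tibble(espnBPITotal, .name_repair = "unique")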
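As a side note, the loop that builds the 15 page URLs could also be written as a single call, since `paste0()` is vectorized over its arguments. This is an alternative sketch, not the form used in the original presentation; `base_url` is a name introduced here for illustration.

```r
# base_url is a hypothetical helper variable holding the common URL prefix
base_url = "http://www.espn.com/mens-college-basketball/bpi/_/view/overview/page/"

# paste0() recycles base_url across the page numbers 1 to 15
url = paste0(base_url, 1:15)

length(url)
url[15]
```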

One More Extension: Ethics

Build in a pause between requests so you don't overload the site or frustrate its owners.

for (i in 1:15) {
  url[i]=paste0("http://www.espn.com/mens-college-basketball/bpi/_/view/overview/page/", i)
}

paste0("http://www.espn.com/mens-college-basketball/bpi/_/view/overview/page/", i)

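# Start with a placeholder row; it is dropped in the cleanup step below
espnBPITotal = NA

for (i in seq_along(url)) {
  # Scrape one page and reshape its cells into 8 columns
  espnBPI = url[i] %>%
    read_html() %>%
    html_nodes(".bpi__table td") %>%
    html_text() %>%
    matrix(ncol = 8, byrow = TRUE)
  # Append this page's rows to the running total
  espnBPITotal = rbind(espnBPITotal, espnBPI)
  # Pause for a random 0.001 to 0.01 seconds before the next request
  Sys.sleep(runif(1, .001, .01))
}

as_tibble(espnBPITotal, .name_repair = "unique")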
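The pause in the loop above lasts only a few milliseconds. If a longer, more conservative delay is wanted, one possible sketch is a small helper that draws the pause length from a wider range. The helper name `pause_length` and the 1 to 3 second range are assumptions for illustration, not values from the original tutorial.

```r
# Hypothetical helper: draw a random pause length, in seconds
pause_length = function(min_sec = 1, max_sec = 3) {
  runif(1, min = min_sec, max = max_sec)
}

# Draw a few example delays to see the spread
set.seed(1)
delays = replicate(5, pause_length())
range(delays)

# Inside the scraping loop, this would be used as:
# Sys.sleep(pause_length())
```

Randomizing the delay spreads requests out unevenly, and the right range depends on the site and any limits it publishes.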

Clean up the output:

espnBPITotal = espnBPITotal[-1,]

colnames(espnBPITotal) = c("Rank", "Team", "Conf", "W-L", "BPI", "SOS", "SOR", "RPI")

as_tibble(espnBPITotal)
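Because `html_text()` returns text, every column in the cleaned matrix is still a character column. A minimal sketch of converting selected columns to numbers, using a made-up three-column table rather than the scraped data:

```r
library(tibble)

# Toy stand-in for the scraped matrix (values are made up)
toy = matrix(c("1", "VirginiaUVA", "22.1",
               "2", "UMBCUMBC",    "3.4"),
             ncol = 3, byrow = TRUE)
colnames(toy) = c("Rank", "Team", "BPI")

toy_tbl = as_tibble(toy)

# Convert the numeric-looking columns; Team stays as text
toy_tbl$Rank = as.integer(toy_tbl$Rank)
toy_tbl$BPI  = as.numeric(toy_tbl$BPI)

toy_tbl
```

Once the columns are numeric, sorting and comparisons behave as numbers rather than as text.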

Write the file to a CSV if you like

write.csv(espnBPITotal, "ESPNBPI.csv")
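By default, `write.csv()` also writes a leading column of row names. If that column is not wanted, `row.names = FALSE` leaves it out. The sketch below uses a toy data frame and a temporary file, both assumptions for illustration, so that nothing in the working directory is overwritten.

```r
# Toy data standing in for the scraped table (values are made up)
toy = data.frame(Rank = c("1", "2"),
                 Team = c("VirginiaUVA", "UMBCUMBC"))

# Write to a temporary file without the row-name column
out_file = tempfile(fileext = ".csv")
write.csv(toy, out_file, row.names = FALSE)

# Read it back to confirm only the two real columns were written
back = read.csv(out_file, stringsAsFactors = FALSE)
names(back)
```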

Filter to look at your data in specific ways:

espnBPITotal %>%
  as_tibble() %>%
  filter(Team == "UMBCUMBC" | Team == "VirginiaUVA")
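When the list of teams grows, chaining `==` comparisons with `|` gets long. One alternative sketch uses `%in%` with a vector of team labels. The toy table and its rank values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the cleaned table (rank values are made up)
toy = tibble(Team = c("UMBCUMBC", "VirginiaUVA", "VillanovaVILL"),
             Rank = c("188", "1", "2"))

# Vector of team labels to keep
teams_of_interest = c("UMBCUMBC", "VirginiaUVA")

picked = toy %>%
  filter(Team %in% teams_of_interest)

picked
```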

Now for a note about APIs

Many services offer APIs: ESPN Fantasy; Google (Places, Geocoding, Email, Books, Calendar, and many others); Yahoo (Answers, Flickr, Maps); Yelp; Zillow; Facebook; and

TWITTER!!

Load twitteR Package

Note: since I gave this presentation, I've found that the rtweet package is much better.

# install.packages("twitteR")
library(twitteR)

# Set API Keys
api_key <- "xxx"
api_secret <- "xxx"
access_token <- "xxx"
access_token_secret <- "xxx"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

My API keys are hidden for privacy. You can get your own keys from the Twitter developer site.

# Set API Keys
api_key <- "key"
api_secret <- "secret"
access_token <- "token"
access_token_secret <- "token secret"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
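One common way to keep keys out of a script is to read them from environment variables with `Sys.getenv()`. The sketch below is an assumption about how that might look, not the approach used in the original presentation; the variable names `TWITTER_API_KEY` and `TWITTER_API_SECRET` are hypothetical.

```r
# Demo only: in practice these would be set outside the script,
# for example in a ~/.Renviron file, not in the code itself
Sys.setenv(TWITTER_API_KEY = "key", TWITTER_API_SECRET = "secret")

# Read the values back at run time (hypothetical variable names)
api_key = Sys.getenv("TWITTER_API_KEY")
api_secret = Sys.getenv("TWITTER_API_SECRET")

# Sys.getenv() returns "" for an unset variable, so check before using the key
nchar(api_key) > 0
```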

tweets <- searchTwitter(
  searchString = "West Point",
  n = 100,
  # since = since,
  # until = until,
  geocode = '41.3915,-73.956,10mi'
)

cleantweets = tweets %>%
  twListToDF()

See All Data

as_tibble(cleantweets)

When were they created?

head(cleantweets$created)

What is the text?

head(cleantweets$text)
