---
title: "merge-nsgl-records"
author: "Amanda Whitmire"
date: "`r Sys.Date()`"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library("tidyverse")
library("purrr")
```
# Goal: merge CSV catalog records into one data table
When the National Sea Grant Library in Rhode Island was set to close, the librarian tried to find a home for collection materials that would otherwise have been sent to the trash. The Harold A. Miller Library, a branch of the Stanford Libraries, volunteered to take the full collection of materials for California (comprising the California Sea Grant and Southern California Sea Grant programs). For various reasons, the National Sea Grant librarian was not able to send us catalog records for these materials. To solve the problem, we scraped catalog records from the online National Sea Grant Library Catalog using Python, which produced one CSV file for each record we scraped. The purpose of the code in this Rmd file is to concatenate the information in those per-record CSV files into one spreadsheet.
Postscript: after completing the process for CA, we decided to download records for OR and WA, too.
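For context, each scraped record is a small two-column table of field names and values. The sketch below (not evaluated) is only an illustration: it assumes the CA per-record files live in `~/github/nsgl/itemRecords-csv/` (the folder referenced later) and contain at least the `MetadataField` and `Metadata` columns used throughout this document.
```{r peek-record, eval=FALSE}
# Illustration only (assumed path): peek at one scraped record to see its
# shape before concatenating everything.
one_file <- list.files("~/github/nsgl/itemRecords-csv/",
                       pattern = "*.csv", full.names = TRUE)[1]
read_csv(one_file, show_col_types = FALSE) %>%
  select(MetadataField, Metadata) %>%   # the two columns used throughout
  head(10)                              # first few rows of one record
```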
## Explore the data so we know what to expect with the final data table
Start by creating a list of the file names (its length will tell us how many rows we should have in our final spreadsheet).
Then look at the first 200 CSV files to determine what the possible metadata fields are. This will tell us how many columns we should expect.
```{r message=FALSE, warning=FALSE, paged.print=TRUE, eval=FALSE}
# Look at the first 200 CSV files to determine what the possible metadata fields are.
# Change the file path for each state.
fnames <- list.files(path = "~/github/nsgl/itemRecords-or-csv/", # get list of files
                     pattern = "*.csv", full.names = TRUE)
dat <- read_csv(fnames[1], col_select = c(MetadataField, Metadata),
                show_col_types = FALSE) # read in only the 2 columns we need (read_csv already returns a tibble)
dat <- fill(dat, MetadataField) # fill in blanks in the MetadataField column
for (i in 2:200) { # loop through 200 files to get a good sample of metadata fields
  data <- read_csv(fnames[i], col_select = c(MetadataField, Metadata),
                   show_col_types = FALSE)
  data <- fill(data, MetadataField)
  dat <- bind_rows(dat, data)
}
# find out how many unique headers there will be
headers <- unique(dat[[1]]) # There are 17 unique metadata fields in this dataset
rm(dat, data) # clean up after yourself
```
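As a quick follow-up (also not evaluated, since the chunk above has `eval=FALSE`), the expected dimensions of the final table can be read directly off the objects created there:
```{r expected-dims, eval=FALSE}
# Expected size of the final table: one row per record file, one column per
# unique metadata field (plus the id column added below).
length(fnames)  # expected number of rows
length(headers) # expected number of metadata columns
```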
Now we will load in the CSV files, grouping them by their file name (a unique identifier), and then pivoting out to a wide data table.
```{r message=FALSE, warning=FALSE}
fnames <- list.files(path = "~/github/nsgl/itemRecords-wa-csv", # get list of files
                     pattern = "*.csv", full.names = TRUE)
records <- fnames %>%                                    # using this list of file names
  map_dfr(~ read_csv(.x, show_col_types = FALSE) %>%     # map_dfr() from purrr reads each file and row-binds the results into one data frame
            mutate(id = .x)) %>%                         # add an "id" column holding the file name for each row
  fill(MetadataField) %>%                                # fill blanks in the MetadataField column
  group_by(id, MetadataField) %>%                        # group the records by id (file name) and metadata field
  summarise(records = str_c(Metadata, collapse = ";")) %>% # collapse duplicate metadata entries, e.g. when there are multiple authors
  pivot_wider(names_from = "MetadataField", values_from = "records") # pivot the long records into a proper wide table, using MetadataField as column headers
# save the data to a CSV file. You can now move to OpenRefine
write_csv(records,"nsgl-wa-records.csv")
```
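As a quick sanity check (an addition here, using only objects created above): after the pivot, the wide table should contain exactly one row per input file.
```{r row-check}
# Each input CSV should have become exactly one row in the wide table.
nrow(records) == length(fnames)
```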
Now that the scraping and concatenating of data from the three states is done, it's time to merge all three CSV files into one giant spreadsheet.
```{r echo=TRUE, message=FALSE, warning=FALSE}
data_all <- list.files(path = "~/github/nsgl/catalogRecords-raw", # Identify all CSV files
pattern = "*.csv", full.names = TRUE) %>%
lapply(read_csv) %>% # Store all files in list
bind_rows # Combine data sets into one data set
data_all
# strip the local directory prefixes so "id" is just the file name
data_all <- data_all %>%
  mutate(id = str_replace_all(id,
                              c("/Users/thalassa/github/nsgl/itemRecords-csv/"    = "",
                                "/Users/thalassa/github/nsgl/itemRecords-or-csv/" = "",
                                "/Users/thalassa/github/nsgl/itemRecords-wa-csv/" = "")))
# remove duplicates, keeping one row per NSGL document number
records <- data_all %>%
  distinct(`NSGL Document #:`, .keep_all = TRUE)
# save the data to a CSV file. You can now move to OpenRefine
write_csv(records,"nsgl-all-records.csv")
```
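A quick check on how much the de-duplication removed (an addition here; it only uses objects created above):
```{r dedup-check}
# How many rows were dropped as duplicate NSGL document numbers?
nrow(data_all) - nrow(records)
```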
I wasn't convinced that all of the CA item IDs made it into my first round of processing, so I added this step and re-did the catalog web scraping. That added ~600 records.
```{r}
# gather the item-ID text files and combine them into a single list of IDs
fnames <- list.files(path = "~/github/nsgl/item-ids", # get list of files
                     pattern = "*.txt", full.names = TRUE)
itemids <- fnames %>%                                 # using this list of file names
  map_dfr(~ read_csv(.x, col_names = FALSE, show_col_types = FALSE))
write_csv(itemids, "seagrant-ids.txt", col_names = FALSE)
```
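To close the loop on the concern above, a rough cross-check could compare the scraped IDs against the record files that were actually processed. This is only a sketch (not evaluated) under the assumption, not verified here, that each per-record CSV is named after its item ID; adjust the path and the name matching to the real naming scheme.
```{r id-check, eval=FALSE}
# Sketch only: assumes each record file is named "<item id>.csv", which may
# not match the actual file naming.
processed_ids <- list.files("~/github/nsgl/itemRecords-csv/", pattern = "*.csv") %>%
  str_remove("\\.csv$")
missing_ids <- setdiff(itemids[[1]], processed_ids) # scraped IDs with no record file
length(missing_ids)
```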