-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathReport.Rmd
335 lines (258 loc) · 16.5 KB
/
Report.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
---
title: "Analysing Coffee"
subtitle: ETC5513 Assignment 4, Master of Business Analytics
author: Yiwen Liu, Panagiotis Stylianos, Sahinya Akila
date: '`r Sys.Date()`'
output:
bookdown::html_document2:
css: monashreport.css
includes:
before_body: header.html
bibliography: references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
out.width = "100%",
fig.width=10,
fig.height=7,
fig.align = "center")
```
```{r warning=FALSE, message=FALSE}
library(tidyverse)
library(readr)
library(kableExtra)
library(maps)
library(knitr)
library(dplyr)
library(hrbrthemes)
library(viridis)
library(plotly)
library(bookdown)
library(leaflet)
library(ggridges)
```
```{r reading-data, echo = FALSE, message = FALSE, warning = FALSE}
# Get the Data
coffee_data <- read_csv(here::here("Data/coffee_ratings.csv"))
```
# Introduction
All through our lives, there definitely has been a coffee lover we know or we will be coffee lovers ourselves. Coffee has helped a lot of us to stay awake to complete our assignments or any other important tasks. It has also worked adversely for many of us, where we drink coffee at very late hours and find it difficult to fall asleep. Coffee was first exported from Ethiopia to Yemen in the late 15th Century @wikipedia_2021. This analysis will help in understanding the nature of the different coffee beans (Arabica and Robusta), the regions where it is grown, the ratings for the different types of coffee beans and also the processing methods.
# Analysis by Panagiotis Stylianos
## How many bags of coffee from each country were sampled?
We begin our analysis by providing a summary of the samples used to provide the coffee ratings. For each country different varieties of coffee were sampled from different regions and companies. The samples from each country can be summarised using the number_of_bags variable.
It is important to know the total quantity for each country to possibly identify a relationship between the number of samples used and the countries rating.
```{r bags, echo = FALSE, message = FALSE, warning = FALSE}
# count bags per country
country_bags <- coffee_data %>%
group_by(country_of_origin) %>%
summarise(total_bags = sum(number_of_bags)) %>%
arrange(desc(total_bags))
# obtain coordinates
country_coord <- map_data("world") %>%
group_by(region) %>%
summarise(mean_long = mean(long, na.rm = TRUE),
mean_lat = mean(lat, na.rm = TRUE))
# join data
country_bags <- country_bags %>%
left_join(
country_coord,
by = c("country_of_origin" = "region")
)
mapCountry<- maps::map("world", fill = TRUE, plot = FALSE)
pal_fun <- colorQuantile("YlOrRd", NULL, n = 7)
total_bags <- country_bags$total_bags[match(mapCountry$names, country_bags$country_of_origin)]
```
```{r, bagmap, echo = FALSE, message = FALSE, warning = FALSE}
leaflet(mapCountry) %>% # create a blank canvas
addProviderTiles("NASAGIBS.ViirsEarthAtNight2012") %>%
addPolygons( # draw polygons on top of the base map (tile)
stroke = FALSE,
smoothFactor = 0.2,
fillOpacity = 1,
color = ~pal_fun(total_bags)
) %>%
addCircles(data = country_bags,
lat = ~mean_lat,
lng =~mean_long,
radius = total_bags,
layerId = country_bags$country_of_origin,
weight = log(total_bags),
stroke = T,
fillColor = "white",
color = "black",
fillOpacity = 3,
label = paste("Total bags of coffe tested in",
country_bags$country_of_origin, ":", country_bags$total_bags),
labelOptions = labelOptions(noHide = F, textsize = "15px"),
popupOptions = country_bags$country_of_origin)
```
The below table summarises the total bags counted for each country.
```{r bagtable, echo = FALSE, message = FALSE, warning = FALSE}
DT::datatable(
country_bags %>%
select(country_of_origin, total_bags),
colnames = c("Country", "Total Bags"),
caption = "Total Bags of Coffee sampled by Country"
)
```
We observe that more than 40000 samples of Colombian coffee were used for grading.
```{r cor, fig.cap="Scatterplot of average coffee rating and number of testing samples", echo = FALSE, message = FALSE, warning = FALSE}
coffee_data %>%
group_by(country_of_origin) %>%
summarise(total_bags = sum(number_of_bags),
avg_rating = mean(total_cup_points, na.rm = TRUE)) %>%
ggplot(aes(total_bags, avg_rating)) +
geom_point() +
theme_classic() +
labs(
x = "Total Bags",
y = "Average Coffee Rating",
title = "Association between Rating and Number of Testing Samples"
)
```
Figure \@ref(fig:cor) indicates that there is not an apparent association between the average coffee rating and the number of testing samples.
## Which country produces the highest rated coffee?
After we identified that the number of testing samples doesn't influence the coffee rating, we can answer which countries produce the best quality of coffee.
```{r rating, fig.cap="Coffee Rating Distribution by Country", echo = FALSE, message = FALSE, warning = FALSE}
ggplot(coffee_data %>% filter(total_cup_points > 50), aes(x = total_cup_points, y = country_of_origin, fill = stat(x))) +
geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01, gradient_lwd = 1.) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_discrete(expand = expand_scale(mult = c(0.01, 0.25))) +
scale_fill_viridis_c(name = "Rating", option = "C") +
labs(
title = 'Coffee Rating Distribution'
) +
theme_ridges(font_size = 10, grid = TRUE) +
theme(axis.title.y = element_blank())
```
From Figure \@ref(fig:rating) we notice that the distribution of Ethiopian coffee rating is highly skewed to the right indicating that the coffee quality is excellent.
# Analysis by Yiwen Liu
In this section, I want to analyze some interesting content about Arabica coffee beans and Robusta coffee beans.
```{r out.width = "30%", fig.width=5, fig.height=5, echo = FALSE, message = FALSE, warning = FALSE}
knitr::include_graphics(here::here("image/coffee-bean.jpg"))
```
## Which top3 countries cultivated most kinds of Arabica coffee beans and Robusta coffee beans respectively?
```{r, echo = FALSE, message = FALSE, warning = FALSE}
species_country <- coffee_data %>%
select(country_of_origin, species)
```
```{r, echo = FALSE, message = FALSE, warning = FALSE}
# filter the top3 countries
species_country_count <- species_country %>%
group_by(country_of_origin, species) %>%
summarise(n=n()) %>%
arrange(-n) %>%
ungroup() %>%
group_by(species) %>%
slice(1:3)
```
The Table \@ref(tab:species) shows that Mexico, Colombia and Guatemala cultivated the most kinds of Arabica coffee beans and India, Uganda and Ecuador cultivated the most kinds of Robusta coffee beans. Also it could find that there are much more types of Arabica coffee beans compared to Robusta coffee beans, which conforms to the content given by @bunn2015bitter.
```{r species, echo = FALSE, message = FALSE, warning = FALSE}
# make a table
species_country_count %>%
kable(caption = "The top3 countries which cultivated most kinds of Arabica coffee beans and Robusta coffee beans respectively") %>%
kable_styling()
```
Now, the Figure \@ref(fig:speciescountry) shows the geographical location of this 6 countries. It could easily find that it seems to be an obvious coffee production zone, which is between the equator and 30 degrees north latitude. In these zones, the annual average temperature and rainfall are in line with the coffee bean growing conditions.
Besides, it indicates that the countries that cultivated Arabica coffee beans are all located in Central and South America, while the countries that cultivated Robusta coffee beans are located in several continents like South America, Eastern Africa and India. It is related to the environment required for the growth of different coffee beans.
```{r, echo = FALSE, message = FALSE, warning = FALSE}
# world map
world_map <- map_data("world")
# top3 country geo data of two species
top3country <- map_data("world", region = species_country_count$country_of_origin)
# combine top3 country geo data of two species with count data
top3country_geo_count <- top3country %>%
left_join(species_country_count, by = c("region"="country_of_origin"))
```
```{r speciescountry, fig.cap= "The geographical location of the top3 countries which cultivated most kinds of Arabica coffee beans and Robusta coffee beans respectively", echo = FALSE, message = FALSE, warning = FALSE}
# make a map
ggplot(world_map, aes(x = long, y = lat, group = group)) +
geom_polygon(fill="white", colour = "gray50") +
geom_polygon(data =top3country_geo_count, aes(x=long, y = lat, group=group,fill=species))+
scale_x_continuous(breaks = seq(-180, 210, 45), labels = function(x){paste0(x, "°")}) +
scale_y_continuous(breaks = seq(-60, 100, 30), labels = function(x){paste0(x, "°")}) +
annotate("text", label = "Colombia", x = -74.2973, y = 4.5709, size = 3)+
annotate("text", label = "Mexico", x = -102.5528, y = 23.6345, size = 3)+
annotate("text", label = "Guatemala", x = -90.2308, y = 15.7835, size = 3)+
annotate("text", label = "India", x = 78.9629, y = 20.5937, size = 3)+
annotate("text", label = "Uganda", x = 32.2903, y = 1.3733, size = 3)+
annotate("text", label = "Ecuador", x = -78.1834, y = -1.8312, size = 3)+
theme_light()
```
## What is the difference of altitude of Arabica coffee beans and Robusta coffee beans production areas?
Figure \@ref(fig:altitude) indicates that the mean altitude of Arabica coffee beans production areas is concentrated from 1000 to 1800 meters. Besides, people could surprisingly find that there exists two peaks about the mean altitude of Robusta coffee Beans production areas, and the ranges are concentrated from 500 to 1600 meters and 2800 to 3400 meters respectively. However, the probability of the second peak is much less than the first one.
So it could say that the mean altitude of Arabica coffee beans production areas is higher than that of many Robusta coffee Beans production areas even though there are some exceptions.
```{r altitude, fig.cap="The mean altitude of Arabica coffee beans and Robusta coffee beans production areas", echo = FALSE, message = FALSE, warning = FALSE}
coffee_data %>%
select(species, altitude_mean_meters) %>%
drop_na() %>%
filter(altitude_mean_meters<=5900) %>%
pivot_longer(-species, names_to = "altitude", values_to = "meter") %>%
ggplot(aes(x = meter, color = species)) +
geom_density()
```
## In which species Arabica coffee beans or Robusta coffee beans has higher grades?
```{r, echo = FALSE, message = FALSE, warning = FALSE}
species_grades <- coffee_data %>%
select(total_cup_points,species,aroma,flavor,aftertaste,acidity,sweetness) %>%
filter(total_cup_points != 0)
```
Figure \@ref(fig:grades) shows the scores of several primary different aspects of Arabica coffee beans and Robusta coffee beans and their total points. It is obvious that in acidity, aftertaste, aroma and flavor aspects, the median score of Robusta coffee beans is higher than that of Arabica coffee beans, which means Robusta coffee beans have a better performance than Arabica coffee beans. As to sweetness, Arabica coffee beans is much better than Robusta coffee beans. In the end, total point, which combines these primary aspects and some other aspects, shows that Arabica coffee beans has a better quality. Maybe these grades would give some help when people choose coffee beans.
```{r grades, fig.cap="Scores of several different aspects of Arabica coffee beans and Robusta coffee beans and their total points", echo = FALSE, message = FALSE, warning = FALSE}
species_grades %>%
pivot_longer(-species, names_to = "measure", values_to = "grades") %>%
mutate(species = as.factor(species),
measure = as.factor(measure)) %>%
ggplot(aes(x = species, y = grades, color=species)) +
geom_boxplot()+
facet_wrap(~measure, scales = "free_y") +
labs(x = "")
```
# Analysis by Sahinya Akila
## Which processing method leads to better rating
```{r expiration-date, echo = FALSE, message = FALSE, warning = FALSE, fig.cap="Distribution of Coffee Ratings based on Processing method"}
processing_method <- coffee_data %>%
select(processing_method:cupper_points) %>%
filter(!is.na(processing_method))
processing_method$mean <- rowMeans(subset(processing_method, select = c(aroma, flavor, aftertaste, acidity, body, balance, uniformity, clean_cup, sweetness, cupper_points)), na.rm = TRUE)
graph <- ggplot(processing_method, aes(x=processing_method, y=mean, fill=processing_method)) +
geom_violin() +
scale_fill_viridis(discrete = TRUE, alpha=0.6, option="A") +
theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=11)
) +
xlab("Processing Method")
ggplotly(graph)
```
In Figure \@ref(fig:expiration-date) the ratings for Semi-washed/Semi-pulped and Pulped natural honey is better, as the average rating does not go below 8 for them. Pulped Natural honey process allows the coffee beans to be dried after removing the skin of the fruit when all the is still in the beans.It’s essentially a middle ground between the dry and wet processing methods. During the natural (or dry) method, the beans are dried entirely in their natural form, while the washed (or wet) process sees all of the soft fruit residue, both skin and pulp, removed before the coffee is dried @costa_2020. This can also be deduced from the graph above where the ratings for the Washed/Wet processing method has the least rating and suggests that it is not one of the best processing methods.
## Which harvest year produced the best coffee?
```{r harvest-year, echo = FALSE, message = FALSE, warning = FALSE, fig.cap="Coffee Ratings in each harvest year"}
best_coffee <- coffee_data %>%
filter(str_detect(harvest_year, "^[1-9][0-9][0-9][0-9]$"))
best_coffee$mean <- best_coffee$mean <- rowMeans(subset(best_coffee, select = c(aroma, flavor, aftertaste, acidity, body, balance, uniformity, clean_cup, sweetness, cupper_points)), na.rm = TRUE)
ggplot(best_coffee, aes(harvest_year, mean)) +
geom_col(fill = "#8B4513") +
theme_ipsum() +
theme(
legend.position="none",
plot.title = element_text(size=11)
)+
xlab("Harvest Year") +
ylab("Average Coffee Rating")
```
It can be observed from Figure \@ref(fig:harvest-year) that the data available is for the years 2010 - 2018. Among this, 2012 has the best ratings for the harvest. 2018 had the least ratings. This is due to the favorable weather conditions in almost all the countries that produce coffee @robinson_2012.
```{r harvesttable}
table <- best_coffee %>% group_by(harvest_year) %>% summarise(count = n()) %>% arrange(desc(count))
knitr::kable(table, caption = "Number of records for each year") %>% kable_styling()
```
As it can be seen in Table \@ref(tab:harvesttable), there is only one record for 2018. Which implies that there is some missing data in the data set.
# Conclusion
From the analysis above, we could find that there is no relationship between the number of test samples and average coffee rating. The highest number of samples is 41204 bag samples from Colombia and the least is 1 bag sample from Mauritius. The rating for Ethiopian Coffee is the highest.
Besides, we find that Arabica coffee beans are cultivated mostly by Mexico, Colombia and Guatemala (South American Region). Whereas Robusta coffee beans are cultivated mostly by India, Uganda and Ecuador. In addition, the mean altitude of Arabica coffee beans production areas is higher than that of many Robusta coffee Beans production areas even though there are some exceptions. What's more, Arabica coffee beans has a higher median total point than Robusta coffee Beans, which means it has a better performance.
Finally, it indicates that ratings for Semi-washed/Semi-pulped and Pulped natural honey is better when compared to Washed/Wet processing method. And it tells us that 2012 has been the best year in terms of harvesting the coffee beans.
# Acknowledgements
This report was written using R[@R]. The following R packages were used to produce this report:`tidyverse`[@tidyverse], `readr`[@readr], `kableExtra`[@kableExtra], `bookdown`[@bookdown], `maps`[@maps], `knitr`[@knitr], `dplyr`[@dplyr], `hrbrthemes`[@hrbrthemes], `viridis`[@viridis], `plotly`[@plotly], `leaflet`[@leaflet] and `ggridges`[@ggridges].
The origin data used for analysis comes from [this github website](https://github.com/rfordatascience/tidytuesday/tree/master/data/2020/2020-07-07).
# References