forked from alabeli/spotify-eda
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathindex.Rmd
586 lines (439 loc) · 30.2 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
---
title: "Exploratory data analysis"
output:
html_document:
code_folding: "hide"
---
<style type="text/css">
body{
font-size: 14pt;
}
</style>
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
```{r}
library(tidyverse)
library(plotly)
library(gridExtra)
library(reshape2)
library(DT)
```
```{r}
data <- read.csv("data.csv", encoding = 'UTF-8') %>% filter(year <= 2020)
data <- data %>%
select(id, name, artists, year, release_date, everything())
```
The catchphrase **Data is the new currency** has become quite prevalent in our day to day conversations and news feed. As per the [statista reoprt]([https://www.statista.com/statistics/871513/worldwide-data-created/) released in May 2020, this year, 2021, the world would consume 74 zettabytes (1 zettabyte = trillion gigabytes) of data worldwide. This number is projected to almost double in just 3 years by 2024.
```{r , echo=FALSE, fig.cap="Statista report on data creation/consumption", out.width = '100%'}
knitr::include_graphics("./imgs/statista-data-consumption.JPG")
```
Review: Add better transition from the previous paragraph before introducing the next section.
The fascinating thing about having data on subjects ranging from the absurd, squirrel census in Central Park, to the most arcane, dermatoscopic images, or even to the personal ,Whatsapp chats/location history, is that you can develop familiarity with an unfamiliar subject by following the data. Of course, you could also stumble upon new insights on familiar subjects too.
Having said that, following data requires understanding the definition of data fields, the quality of the data and methodical analysis to arrive at the correct conclusions. To avoid getting lost in the state of an analysis paralysis, one should have a sense of the questions that need to be answered through data analysis.
Exploratory data analysis primarily enable stakeholders to take data driven decisions
and understand the interplays of various attributes that affect decisions.
This post, I would be using exploratory data analysis to understand to learn about the worldwide music interest using data from a global audio streaming service,Spotify.
Some of the questions of interest are:
* how audio tracks on Spotify evolved over a century?
* what are the features associated with a particular track and their impact on popularity of a track?
* who's hot on Spotify?
Let's first get an overview of Spotify data and assess the quality of the same.
# Overview of Spotify data
I have obtained all Spotify tracks ranging from 1920 to 2020 (100 years!) from a [Kaggle dataset](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks). Each row ideally represents a track, with a Spotify track id, artist's name, track's name, and a bunch of audio features associated with the track. A sample of the data from year 2020 is shown below.
```{r}
datatable(
head(
data %>%
filter(year==2020)
),
rownames = FALSE,
options = list(dom = 'tp',
pageLength = 5,
scrollX = TRUE)
)
```
## Data types
Review: Explain, type of variable before, using them. In general , refrain from using data fields or type of fields etc before introducing the concept. The audience might not know about this.
We have few categorical fields in the data like `id`, `name`, `artists`, `release_date`, and `key` that can take finite number of values. We also have many numerical fields, mostly the audio features, like `acousticness`, `danceability` etc., that can take infinite number of values within a defined range. Notice, that two fields - `mode` and `explicit` - are logical/binary in nature.
## Data definitions
We now need to understand the definition of each field to then use them for our analyses. More specifically, I was interested to understand some definitions of audio features like `loudness`, `energy`, `danceability`, `popularity`, etc. I was also the most intrigued by the definition of `popularity` measure, which is derived from a Spotify algorithm that relies on number of plays a track has had and the recency of those plays. The table below has the definitons of all the fields.
```{r}
data_dict <- tibble("field" = names(data),
"definition" = c("Track unique id",
"Track name",
"Artist name",
"Year the track was released",
"Date the track was released",
"A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.",
"Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.",
"Duration of the song in milliseconds",
"Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.",
"If the track contains explicit content",
"Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.",
"The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C#/Db, 2 = D, and so on. If no key was detected, the value is -1.",
"Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.",
"The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.",
"Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.",
"The popularity of the track. The value will be between 0 and 100, with 100 being the most popular. The popularity of a track is a value between 0 and 100, with 100 being the most popular. The popularity is calculated by algorithm and is based, in the most part, on the total number of plays the track has had and how recent those plays are. Generally speaking, songs that are being played a lot now will have a higher popularity than songs that were played a lot in the past.",
"Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.",
"The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.",
"A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry)."))
datatable(
data_dict,
rownames = FALSE,
options = list(dom = 'tp',
pageLength = 5,
scrollX = TRUE),
caption = "Sourced from https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-01-21/readme.md"
)
```
Review( Rephrase a) validation check; )
After performing a validation check on the ideal state of this data (one record per track), certain issues appeared in terms of how it is curated. There were few records where all the fields were duplicated multiple times. Moreover, a track can be added to Spotify as a single entry and/or as a part of an album with separate ids for each addition. The same track can also be added multiple times with different release dates under different licenses and in different markets. Therefore, we need to overcome this issue in the data to not double count the statistics which can mislead our conclusions.
One way to tackle this issue is to define the track as a combination of an artist and a track name and keep the one that was released the earliest. For example, **Rain On Me (with Ariana Grande)** track has two entries in the data with release dates of `r format(as.Date("2020-05-22"), "%b %d %Y")` and `r format(as.Date("2020-05-29"), "%b %d %Y")` from which we would choose the one released on `r format(as.Date("2020-05-22"), "%b %d %Y")`. By following this logic, we will not be favoring the popularity measure (correlated with recent number of plays) for any track that could have same track + artist combination released across various years with varying popularity measures.
After cleaning the data, we now have total of **158,581** tracks - a reduction of ~14,000 records with duplicate entries.
```{r}
datatable(
data %>%
filter(name=="Rain On Me (with Ariana Grande)"),
rownames = FALSE,
options = list(dom = 'tp',
pageLength = 5,
scrollX = TRUE)
)
```
```{r}
key_map <- rev(c("0" = "C", "1" = "C#", "2" = "D", "3" = "D#", "4" = "E", "5" = "F", "6" = "F#", "7" = "G", "8" = "G#", "9" = "A", "10" = "A#", "11" = "B"))
data_prc <- data %>%
mutate(id = as.character(id)) %>%
#select(-id, -release_date) %>% # duplicate records for different ids and release date
distinct() %>%
# same song with same artist have multiple records with slight changes in audio featues and year
arrange(name, artists, year, release_date) %>%
group_by(name, artists) %>%
filter(id==min(id)) %>%
#filter(popularity == max(popularity)) %>%
ungroup() %>%
mutate(popularity_category = ifelse(popularity >= 80, "80+", "<80"),
valence_bin = cut(valence, seq(0,1,0.1), right = FALSE),
duration_min = duration_ms/(1000*60),
mode_type = case_when(mode==0 ~ "minor",
mode==1 ~ "major"),
key_str = as.character(key),
key_group = str_replace_all(key_str, key_map))
```
# Evolution of Spotify music
We have a fair understanding of the data and have processed it appropriately. Let's observe some trends!
## Tracks and artists over years on Spotify
The number of tracks released on Spotify fluctuates a lot between 1920 and 1950, probably becuase of the limited capacity for production. Since 1950, around 2,000 tracks were released consistently until 1999. The number dropped by almost a half in 2000-2001 and is steadily increasing since 2004. Interestingly, the number of tracks shot up by 1.5 in 2020. I am reigning my temptation to associate reasons with the high level fluctuations in number of tracks over years since Spotify has not been the most exhaustive platform to host all the music that gets created across the century. However, the trends are certainly intriguing and demands further understanding and investigation of the data.
The number of unique artists who released tracks by years has an increasing trend across all years, including the years when number of tracks were stable around 2,000.
```{r}
ggplotly(
data_prc %>%
group_by(year) %>%
summarise(tracks = n(),
artists = n_distinct(artists)) %>%
ungroup() %>%
ggplot(aes(x = year, y = n, group = 1)) +
geom_line(aes(y = tracks, color = "tracks")) +
geom_point(aes(y = tracks, color = "tracks")) +
geom_line(aes(y = artists, color = "artists")) +
geom_point(aes(y = artists, color = "artists")) +
scale_x_continuous(breaks = seq(1910, 2020, 10)) +
scale_color_manual(values = c("tracks" = "#8d52eb", "artists" = "#ec576c")) +
theme(axis.text.x = element_text(angle = 90),
legend.title = element_blank()) +
labs(y = "tracks vs artists")
)
```
## Popularity over years on Spotify
The mean popularity has gone down since 2000 while the maximum popularity has been on the rise since 2000 suggesting that despite having more tracks in the recent years, not all gained higher popularity. 2020 recorded the highest maximum popularity score of 96 out of 100. This also could be partly because of the method of popularity calculation which weighs more on number of plays from the recent time period. Therefore, making the tracks released in the recent years more popular.
```{r}
ggplotly(
data_prc %>%
group_by(year) %>%
summarise(mean_popularity = mean(popularity),
max_popularity = max(popularity)) %>%
ggplot(aes(x = year, group = 1)) +
geom_line(aes(y = mean_popularity, color = "mean_popularity")) +
geom_point(aes(y = mean_popularity, color = "mean_popularity")) +
geom_line(aes(y = max_popularity, color = "max_popularity")) +
geom_point(aes(y = max_popularity, color = "max_popularity")) +
scale_x_continuous(breaks = seq(1910, 2020, 10)) +
scale_color_manual(values = c("mean_popularity" = "#aaf6b1", "max_popularity" = "#019875")) +
labs(y = "mean and max popularity") +
theme(axis.text.x = element_text(angle = 90),
legend.title = element_blank())
)
```
### Anatomy of the most popular track(s)
Who does not enjoy popularity? Let's see if we can find the secret behind the most popular track on Spotify.
There are two most popular tracks released in 2020 that gained popularity of 96 - **positions** by _Ariana Grande_ and **Mood (feat. iann dior)** by _24kGoldn_.
Both the tracks are on the higher end of valence (happy mood), energy, and danceability. Both have explicit content present and have 0 instrumentalness. **Positions** is set in a major mode while the other is set in a minor mode.
Can we conclude from these two tracks that having high energy, danceability, and explicit content is the key to higher popularity? Probably we can analyze further.
```{r fig.dim=c(10,10)}
audio_features <- c("acousticness", "danceability", "duration_min", "energy", "instrumentalness", "explicit", "liveness", "loudness", "key", "mode", "speechiness", "tempo", "valence")
ggplotly(
data_prc %>%
filter(popularity==max(popularity)) %>%
pivot_longer(cols = all_of(audio_features), names_to = "feature_name", values_to = "feature_value") %>%
ggplot(aes(x = name, y = feature_value, fill = name)) +
geom_col() +
facet_wrap(~feature_name, scales = "free_y") +
theme(axis.text.x = element_text(angle = 45))
)
```
We will investigate the recipe of popularity further in a little bit. Before that, let's check out the trends in audio features over 100 years. You might have noticed that all the audio features have different scales. For example, loudness ranges from 0 (most loud) to -60 (least loud) while popularity ranges from 0 to 100. Therefore, to compare them on a standard scale, we need to normalize their numerical values between a standard range, say 0 to 1.
## Evolution of audio features
The average acousticness has drastically reduced over the years, probably because of the inventions of more electronic instruments. The average energy of tracks has drmamatically increased over the years, almost at the same time acousticness started reducing. Could there be a correlation there?
The average loudness of the tracks has slightly increased. The average valence has slightly decreased on the other hand. Instrumentalness was more present in earlier years although there has been an uptick in the recent years for the same.
```{r}
rescale <- function(x) (x-min(x))/(max(x) - min(x))
scales_data_prc <- data_prc %>%
mutate(year = as.character(year)) %>%
mutate_if(is.numeric, ~rescale(.)) %>%
mutate(year = as.integer(year))
```
```{r}
audio_features_2 <- c("acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence")
ggplotly(
scales_data_prc %>%
pivot_longer(cols = all_of(audio_features_2), names_to = "feature_name", values_to = "feature_value") %>%
group_by(year, feature_name) %>%
summarise(mean_feature_value = mean(feature_value)) %>%
ungroup() %>%
ggplot(aes(x = year, y = mean_feature_value, color = feature_name)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = seq(1910, 2020, 10)) +
scale_color_brewer(palette = "Set3")
)
```
We need to analyze trends for audio features that are either logical or categorical separately becuase taking an average for them would not be appropriate.
### Major/minor mode
There has been more tracks set in minor mode in recent years. Generally speaking, minor mode tracks tend to have sad/gloomhy mood. This is in alignmnet with decreasing valence in the recent years.
```{r}
ggplotly(
data_prc %>%
count(year, mode_type) %>%
group_by(year) %>%
mutate(perc_tracks = n/sum(n)) %>%
ungroup() %>%
ggplot(aes(x = year, y = perc_tracks, fill = mode_type, group = 1)) +
geom_col() +
scale_x_continuous(breaks = seq(1910, 2020, 10))
)
```
### Explicit content
Percentage of tracks with explicit conent has been increasing since 1980s with two peaks around 2000 and 2018.
```{r}
ggplotly(
data_prc %>%
count(year, explicit) %>%
group_by(year) %>%
mutate(perc_tracks = n/sum(n)) %>%
ungroup() %>%
ggplot(aes(x = year, y = perc_tracks, fill = factor(explicit))) +
geom_col() +
scale_x_continuous(breaks = seq(1910, 2020, 10))
)
```
<!-- # key -->
<!-- ```{r} -->
<!-- data_prc %>% -->
<!-- count(year, key_group) %>% -->
<!-- group_by(year) %>% -->
<!-- mutate(perc = n/sum(n)) %>% -->
<!-- ggplot(aes(x = year, y = perc, fill = key_group)) + -->
<!-- geom_col() + -->
<!-- scale_fill_brewer(palette = "Paired") + -->
<!-- scale_x_continuous(breaks = seq(1910, 2020, 10)) -->
<!-- ``` -->
### Duration of tracks
Average duarion of the tracks in minutes has been fluctuating between 4 to 4.5 minutes since 1970. Spotify has collected **`r round(sum(data_prc$duration_min)/(60*24*365), 2)`** years worth of tracks over 100 years!! If you would rather not spend that much time listening to each and every song but dance on the most "danceable" song, get wild listening to **Funky Cold Medina** by _Tone-Loc_.
```{r}
ggplotly(
data_prc %>%
group_by(year) %>%
summarise(mean_dur = mean(duration_min)) %>%
ungroup() %>%
ggplot(aes(x = year, y = mean_dur)) +
geom_line(color = "#800000") + geom_point(color = "#800000") +
labs(y = "mean duration (mins)") +
scale_x_continuous(breaks = seq(1910, 2020, 10))
)
```
# Correlation of audio features
Plotting the trends of audio features over years and analyzing audio features of the most popular tracks invoked some curiosity around the correlation of these features among themselves and with the popularity measure.
Below graph shows corrleation of each audio feature with other audio features. Note that mode and key are excluded from this matrix as they are kind of categorical variables. Colors closer to orange showcase higher positive correlation, colors closer to purple showcase no correlation, while colors closer to dark blue show negative correlation. It is quite trivial to point out that each audio feature is correlated to itself with 100% positive correlation.
Energy is negatively correlated with acousticness which aligns with what we observed in the trends earlier. Energy increased at the same time acousticness decreased.
Loudness is highly correlated with energy which makes loudness negatively correlated with acousticness as well.
Danceability is positively correlated with valence of the track, not surprising!
Popularity is negatively correlated with acousticness, instrumentalness, and speechiness while positively correlated with loudness and energy.
```{r}
ggplotly(
melt(cor(data_prc %>%
select(c(audio_features_2, "popularity")))) %>%
ggplot(aes(x = Var1, y = Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "#003366", high = "orange", mid = "purple") +
theme(axis.text.x = element_text(angle = 90)) +
labs(x = "", y = "")
)
```
Another intersting factor affecting popularity is presence of explicit content in tracks. Since it is a logical variable, plotting the popularity distribution by each value of `explicit` field is more appropriate to find out its impact on popularity. The violin chart below shows the density of popularity distribution in the vertical direction while the horizontal lines show 25th, 50th, and 75th percentile of the distribution. 75% of tracks with explicit content had popularity of no more than 65 while 75% of tracks without explicit content had popularity of no more than 42. This gives a slight hint of more explicit content being more popular.
```{r}
data_prc %>%
ggplot(aes(x = factor(explicit), y = popularity)) +
geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
labs(x = "explicit content")
```
Some of the above mentioned correlation observations are visualized in one chart below to fit linear trends. Such correlation plots could be very useful to understand the importance of each feature in predicting a variable, for example popularity.
It seems that tracks with more energy, loudness, explicit content and less instrumentalness, acousticness tend to be more popular. After all, this conclusion is not very far from our earlier conclusion on popularity from the most popular tracks.
```{r}
p_acoustic <- data_prc %>%
mutate(acoustic_bin = cut(acousticness, seq(0,1.1,0.0001), right = FALSE)) %>%
group_by(acoustic_bin) %>%
summarise(mean_popularity = mean(popularity),
mean_acoustic = mean(acousticness)) %>%
ungroup() %>%
ggplot(aes(x = mean_acoustic, y = mean_popularity)) +
geom_point(alpha = 0.2, size = 3) +
geom_smooth(method = "lm") +
labs(title = str_glue("correlation = {round(cor(data_prc$acousticness, data_prc$popularity),2)}"))
```
```{r}
p_energy <- data_prc %>%
mutate(energy_bin = cut(energy, seq(0,1.1,0.0001), right = FALSE)) %>%
group_by(energy_bin) %>%
summarise(mean_popularity = mean(popularity),
mean_energy = mean(energy)) %>%
ungroup() %>%
ggplot(aes(x = mean_energy, y = mean_popularity)) +
geom_point(alpha = 0.2, size = 3) +
geom_smooth(method = "lm") +
labs(title = str_glue("correlation = {round(cor(data_prc$energy, data_prc$popularity),2)}"))
```
```{r}
p_energy_loudness <- data_prc %>%
mutate(loudness_bin = cut(loudness, seq(0,-60,-0.01), right = FALSE)) %>%
group_by(loudness_bin) %>%
summarise(mean_energy = mean(energy),
mean_loudness = mean(loudness)) %>%
ungroup() %>%
ggplot(aes(x = mean_loudness, y = mean_energy)) +
geom_point(alpha = 0.2, size = 3) +
geom_smooth(method = "lm") +
labs(title = str_glue("correlation = {round(cor(data_prc$energy, data_prc$loudness),2)}"))
```
```{r}
p_valence_dance <- data_prc %>%
mutate(dance_bin = cut(danceability, seq(0,1,0.0001), right = FALSE)) %>%
group_by(dance_bin) %>%
summarise(mean_valence = mean(valence),
mean_dance = mean(danceability)) %>%
ungroup() %>%
ggplot(aes(x = mean_dance, y = mean_valence)) +
geom_point(alpha = 0.2, size = 3) +
geom_smooth(method = "lm") +
labs(title = str_glue("correlation = {round(cor(data_prc$valence, data_prc$danceability),2)}"))
```
```{r}
grid.arrange(p_acoustic, p_energy, p_energy_loudness, p_valence_dance, nrow = 2)
```
# Top 20 most productive artists
After visualizing trends and correlation of audio features, let's find out which artists have been the most productive on Spotify 100 years.
The chart below shows top 20 artists that released the most number of songs on Spotify. The shading represents time between release of their first track and last track. Top 4 most productive artists by a bigger margin have been active for < 30 years between 1920 to 1950. Higher number of active years may have some correlation with popularity as they could consistently release tracks over many years and those tracks have been played more times in recent times. **Ella Fitzgerald** is one of the early artists (from 1920s) whose last track was added in 1999 yet ranks pretty high on the popularity spectrum (73). Quite incredible! **Frank Sinatra** has been active for most amount of years (80 years) among the top 20 artists.
I could not resist but notice our beloved **Lata Mangeshkar ji's** name sitting nicely between **The Beatles** and **Queeen**.
```{r}
ggplotly(
data_prc %>%
group_by(artists) %>%
summarise(n_songs = n(),
first_activity = min(year),
last_activity = max(year)) %>%
ungroup() %>%
mutate(years_active = last_activity - first_activity + 1) %>%
arrange(desc(n_songs)) %>%
head(20) %>%
ggplot(aes(x = reorder(artists, n_songs), y = n_songs, fill = years_active)) +
geom_col() +
labs(y = "artist", x = "tracks") +
coord_flip()
)
```
Let's quickly go through audio features of the tracks by few well known artists.
## Audio features of tracks by Lata Mangeshkar
The most popular song of Lata Mangeshkar on Spotify is **Aaj Phir Jeene Ki Tamanna Hai** released in 1965. Below chart shows that her tracks are on the high spectrum of acousticness (not surprising). Danceability is in the medium range. Her tracks on Spotify are mostly on the higher end of valence. Since Spotify might have limited tracks from this prolific artist, it might be best to no conclude anything in particular.
```{r}
# data_prc %>%
# filter(artists=="['Lata Mangeshkar']") %>%
# arrange(desc(popularity))
ggplotly(
scales_data_prc %>%
filter(artists=="['Lata Mangeshkar']") %>%
pivot_longer(audio_features_2, names_to = "feature_name", values_to = "feature_value") %>%
ggplot(aes(x = feature_name, y = feature_value, color = feature_name)) +
geom_jitter(size = 3, alpha = 0.4) +
theme(axis.text.x = element_text(angle = 90)) +
scale_color_brewer(palette = "Set3")
)
```
## Audio features of tracks by few other popular artists
**The Beatles**: The most popular song recorded on Spotify is **Come Together - Remastered 2009** with popularity score of 78
**Queen**: The most popular song recorded on Spotify is **Don't Stop Me Now - Remastered 2011** with popularity score of 73
**Coldplay**: The most popular song recorded on Spotify is **Yellow** with popularity score of 85
Below is their anatomy of audio features.
**The Beatles** and **Queen** are not too different from each other in terms of their audio feature profile. **Coldplay** is high on loudness, medium on danceability, and low on valence.
```{r}
# data_prc %>%
# filter(artists=="['The Beatles']") %>%
# arrange(desc(popularity))
ggplotly(scales_data_prc %>%
filter(artists=="['The Beatles']") %>%
pivot_longer(audio_features_2, names_to = "feature_name", values_to = "feature_value") %>%
ggplot(aes(x = feature_name, y = feature_value, color = feature_name)) +
geom_jitter(size = 3, alpha = 0.5) +
theme(axis.text.x = element_text(angle = 90)) +
scale_color_brewer(palette = "Set3")+
labs(title = "The Beatles"))
```
```{r}
# data_prc %>%
# filter(artists=="['Queen']") %>%
# arrange(desc(popularity))
ggplotly(scales_data_prc %>%
filter(artists=="['Queen']") %>%
pivot_longer(audio_features_2, names_to = "feature_name", values_to = "feature_value") %>%
ggplot(aes(x = feature_name, y = feature_value, color = feature_name)) +
geom_jitter(size = 3, alpha = 0.5) +
theme(axis.text.x = element_text(angle = 90)) +
scale_color_brewer(palette = "Set3") +
labs(title = "Queen"))
```
```{r}
# data_prc %>%
# filter(artists=="['Coldplay']") %>%
# arrange(desc(popularity))
ggplotly(scales_data_prc %>%
filter(artists=="['Coldplay']") %>%
pivot_longer(audio_features_2, names_to = "feature_name", values_to = "feature_value") %>%
ggplot(aes(x = feature_name, y = feature_value, color = feature_name)) +
geom_jitter(size = 3, alpha = 0.5) +
theme(axis.text.x = element_text(angle = 90)) +
scale_color_brewer(palette = "Set3") +
labs(title = "Coldplay"))
```
One question that came to mind while exploring the data is what can the most commonly used words across all tracks tell us about the core inspiration of music.
# Common words in track names
Top 100 words from 100 years have **love** and **live** as the most common words in all track names after excluding commonly used words like _a/an/the/you/me/I/mixed/remaster_ etc. and years. I wonder which **live** - the verb or the noun - has been used the most commonly in track names. One could dissect the keyword **live** by the `liveness` audio feature to get a better distinction or it might require another blog on sentiment analysis. For now, let's wish it is live - the verb.
As a side note, I also checked the most commonly used word by each year from 1990 and **love** appeared for most of the years. Love never looses!
```{r , echo=FALSE, fig.cap="Top 100 words over 100 years", out.width = '70%'}
knitr::include_graphics("./imgs/word-cloud-100.JPG")
```
It is quite interesting to see words related to work out as more common words in tracks of 2020.
```{r , echo=FALSE, fig.cap="Top 100 words in 2020", out.width = '70%'}
knitr::include_graphics("./imgs/word-cloud-100-2020.JPG")
```
I Hope you had some of your musical and analytical curiosiry satisfied through this blog. As an avid listener of Indian classical and pop music, I got a good introduction to music and artists from other genres. I am certainly going to listen to tracks from all 20 most productive artists, specially **Ella Fitzgerald** and the most popular tracks ever recorded on Spotify **positions** and **Mood (feat. iann dior)**.
Have a lovely day!