-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathspeedtest-analysis.Rmd
1091 lines (769 loc) · 64.3 KB
/
speedtest-analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Independent Speed test Analysis of 4G Mobile Networks \ Performed by DIKW Consulting"
author: "Hugo Koopmans"
date: "03-11-2015"
output:
pdf_document:
fig_height: 4
keep_tex: yes
number_sections: yes
template: ./templates/mytemplatedikw.tex
toc: yes
word_document: default
---
\newpage
Colophon
-------------
```{r echo=FALSE }
# all parameters should go in here
source("global.R")
```
This analysis is performed by DIKW Consulting.
DIKW Consulting is a consulting firm that takes her customers on the path from Data to information to Knowledge to Wisdom. Our expertise is in the field of data logistics, data warehousing, data mining and machine learning.
T-Mobile has asked DIKW Consulting to perform this test as an independent third party. DIKW Consulting was paid to perform this test by T-mobile and has no other intentions then to perform this test by it's own high quality standards. The analysis was performed by generally accepted and approved standards and statistical methods using open source tools.
We let the data speak for itself.
If you have questions you can contact [DIKW Consulting](http://www.dikw.nl). If you want to repeat this test by yourself you are welcome to do so, all necessary scripts are available on GitHub. The data is commercially available at [Ookla](http://www.ookla.com/).
This analysis, method, tools and scripts are open sourced and placed on GitHub, see the read-me on the GitHub [repository](https://github.com/hugokoopmans/ookla-speedtest-analysis).
For questions contact Hugo Koopmans at
+31 6 43106780
Code generation
---------------
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
\newpage
Abstract
========
In this document we have conducted a statistical analysis of Ookla’s Speedtest Intelligence data from [speedtest.net](http://www.speedtest.net/). Ookla provides commercially available speed test data collected by their mobile app on the three main mobile platforms, Android, iOS and Windows mobile. We will load all the raw speed test data into a database, analyse the data ot the top three operators(T-Mobile, Vodafone and KPN) and perform a test on how fast their respective 4G networks are. Our test will be on download speed, upload speed and latency(or ping).
The speed test data that is provided by Ookla’s Speedtest Intelligence data from [speedtest.net](http://www.speedtest.net/) and obtained by T-Mobile is of `r p.ListQuarterMonths[1]`, `r p.ListQuarterMonths[2]` and `r p.ListQuarterMonths[3]` `r p.Year`. We will analyse and validate the data step by step. After investigation on suspicious testing circumstances, such as, but not limited to, devices, location, theoretical maximum speeds and specific dates the data is ready to be subjected to a significance test. This test will be done for all the data in the coverage area and for the biggest twenty cities in the Netherlands.
The result of these tests will give an answer to whether or not T-Mobile has on average a faster 4G network than Vodafone and/or KPN, based on the three metrics upload speed, download speed and latency.
\newpage
Introduction
===========
This document is a report of a statistical analysis of 4G network speed test data. The time period we consider is `r p.QuarterNumber` `r p.Year` so the months `r p.ListQuarterMonths[1]`, `r p.ListQuarterMonths[2]` and `r p.ListQuarterMonths[3]` of `r p.Year` are in scope.
We perform an analysis of [Ookla](http://www.ookla.com/) Speedtest Intelligence data on three different measures:
* download speed
* upload speed
* latency(or ping)
On these metrics we compare the three major providers of 4G mobile networks in the Netherlands, Vodafone, KPN and T-Mobile.
This analysis is set up as follows:
* Step 1: Data Collection
* Step 2: Data preprocessing
* Step 3: Data analysis
* Step 4: Test design
* Step 5: Test results coverage area
In section 8 we provide the conclusion, and additionally the analysis and results of the Top 20 cities per city is presented in section 9.
Step 1 : Data Collection
======
The data was downloaded from Ookla servers by T-Mobile. For this analysis we use a PostgreSQL database that is locally installed.
The data is loaded from three different files, each file resembling a mobile platform (Android,iPhone and Windows mobile). The data is loaded as-is as it was received from the Ookla server. Scripts to load this data directly into a PostgreSQL database can be found in the GitHub repository.
All scripts to process the data are in SQL(available on GitHub) or R(included in this document, thanks to [knitr](http://yihui.name/knitr/))
```{r options , echo=FALSE, cache=FALSE}
# set global chunk options
library(knitr)
opts_chunk$set(echo=FALSE, cache=FALSE, tidy=TRUE, warning=FALSE, message=FALSE, error=TRUE)
# load functions
source("statistical-analysis.R")
```
Let's get the data and do some basic counts.
```{r get-raw-data,cache=FALSE}
# postgreSQL
require("RPostgreSQL", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.0")
# Establish connection to PoststgreSQL using RPostgreSQL
drv <- dbDriver("PostgreSQL")
# read data from 4Gdb
con <- dbConnect(drv, dbname=db.name,host=db.hostname,port=db.portnumber,user=db.username,password=db.password )
# all data joined
df.ad <- dbReadTable(con, c("ookla_all_data"))
# disconnect
sink <- dbDisconnect(con)
```
The raw data
---------------
We load the raw files, downloaded from the Ookla server, into individual tables per mobile platform. In the table below we count the number of speed tests per mobile platform in `r p.QuarterNumber`/`r p.Year`.
```{r raw-table}
library(knitr)
tbl <- addmargins(table(df.ad$os))
kable(as.data.frame(tbl),col.names = c('','Counts'), caption="Raw test data counts")
```
\newpage
So we start this analysis with in total `r nrow(df.ad)` speed tests, which are represented as rows in the data set commercially downloaded from Ookla.
Speedtest.net data
----------------
Ookla designed their speed test in such a way that the results are as robust as possible. Ookla’s [speedtest.net](http://www.speedtest.net/) is the de-facto standard for internet speed testing. According to http://www.ookla.com/speedtest-intelligence, Speedtest Intelligence is the choice of nearly every Fortune 500 ISP and Mobile Provider in the world. For more information please visit [Speedtest.net](https://support.speedtest.net/hc/en-us).
Sample test data from Ookla
--------------------------
Ookla has some random sample data available, this data can be used to validate our method. To validate the test result one would need the specific data of the Netherlands.
A sample set of Ookla Speedtest Intelligence data can be found [here](http://www.ookla.com/speedtest-intelligence). The files differ per mobile platform. The file descriptors for all three mobile platforms are listed below.
### Android header descriptives
\footnotesize
```{}
test_id - unique id of test in our system
device_id - unique device id in our system
android_fingerprint
test_date - YYYY-MM-DD HH:MM:SS in Pacific time (we can accommodate different time zones if needed)
client_ip - ip of client
download_kbps - download speed in kilobits per second
upload_kbps - upload speed in kilobits per second
latency_ms - ping in milliseconds
server_name - name of server tested to (name of city it is located in)
server_country - country name of server
server_country_code - country code of server
server_latitude - latitude of server tested to
server_longitude - longitude of server tested to
server_sponsor_name - sponsor name of server
client_country - country name of the client
client_country_code - country code of the client
client_region - region name of client (this will be state in the US)
client_region_code - region code of client
client_city - city of client
client_latitude - latitude of client (GPS or Maxmind when location services disabled)
client_longitude - longitude of client (GPS or Maxmind when location services disabled)
miles_between - miles between the client and the server tested to
connection_type - http://developer.android.com/reference/android/telephony/TelephonyManager.html
0=unknown,1= Cell, 2=Wifi, 3=Gprs, 4=Edge, 5=Utms, 6=Cdma, 7=Evdo0, 8=EvdoA, 9=OnexRTT,
10=Hsdpa, 11=Hspa, 12=Iden, 13=Ehrpd, 14=EvdoB, 15=Lte, 16=Hsupa, 17=Hspap
isp_name - name of ISP (Maxmind)
is_isp - 0=Corporation/Academic, 1=ISP
network_operator_name - Mobile Carrier Name http://developer.android.com/reference/android
/telephony/TelephonyManager.html#getNetworkOperatorName()
network_operator_code - MCC + MNC http://developer.android.com/reference/android
/telephony/TelephonyManager.html#getNetworkOperator()
brand - http://developer.android.com/reference/android/os/Build.html#BRAND
device - http://developer.android.com/reference/android/os/Build.html#DEVICE
hardware - http://developer.android.com/reference/android/os/Build.html#HARDWARE
build_id - http://developer.android.com/reference/android/os/Build.html#ID
manufacturer - http://developer.android.com/reference/android/os/Build.html#MANUFACTURER
model - http://developer.android.com/reference/android/os/Build.html#MODEL
product - http://developer.android.com/reference/android/os/Build.html#PRODUCT
cdma_cell_id - http://developer.android.com/reference/android/telephony/cdma/package-summary.html
gsm_cell_id - http://developer.android.com/reference/android/telephony/gsm/package-summary.html
location_type - 0 = unknown, 1 = GPS, 2 = GeoIP
sim_network_operator_name - Mobile Carrier Name from the SIM
sim_network_operator_code - MCC + MNC from the SIM http://en.wikipedia.org/wiki/Mobile_Country_Code
```
\normalsize
### iOS header descriptives
\footnotesize
```{ }
test_id - unique id of test in our system
device_id - unique device id in our system
test_date - YYYY-MM-DD HH:MM:SS in Pacific time (we can accommodate different time zones if needed)
client_ip - ip of client
download_kbps - download speed in kilobits per second
upload_kbps - upload speed in kilobits per second
latency_ms - ping in milliseconds
server_name - name of server tested to (name of city it is located in)
server_country - country name of server
server_country_code - country code of server
server_latitude - latitude of server tested to
server_longitude - longitude of server tested to
server_sponsor_name - sponsor name of server
client_country - country name of the client
client_country_code - country code of the client
client_region - region name of client (this will be state in the US)
client_region_code - region code of client
client_city - city of client
client_latitude - latitude of client (GPS or Maxmind when location services disabled)
client_longitude - longitude of client (GPS or Maxmind when location services disabled)
miles_between - miles between the client and the server tested to
connection_type - 0=unknown, 1=cell, 2=wifi, 3=GPRS, 4=Edge, 5=WCDMA, 6=HSDPA,
7=HSUPA, 8=CDMA1x, 9=CDMAEVDORev0, 10=CDMAEVDORevB, 11=eHRPD, 12=LTE
isp_name - name of ISP (Maxmind)
is_isp - 0=Corporation/Academic, 1=ISP
carrier_name - http://developer.apple.com/library/ios/documentation/NetworkingInternet
/Reference/CTCarrier/Reference/Reference.html#//apple_ref/occ/instp/CTCarrier/carrierName
iso_country_code - http://developer.apple.com/library/ios/documentation/NetworkingInternet
/Reference/CTCarrier/Reference/Reference.html#//apple_ref/occ/instp/CTCarrier/isoCountryCode
mobile_country_code - http://developer.apple.com/library/ios/documentation/NetworkingInternet
/Reference/CTCarrier/Reference/Reference.html#//apple_ref/occ/instp/CTCarrier/mobileCountryCode
mobile_network_code - http://developer.apple.com/library/ios/documentation/NetworkingInternet
/Reference/CTCarrier/Reference/Reference.html#//apple_ref/occ/instp/CTCarrier/mobileNetworkCode
model - iPad, iPhone, iPod Touch
version - iOS version
location_type - 0 = unknown, 1 = GPS, 2 = GeoIP
```
\normalsize
### Windows Mobile header descriptives
\footnotesize
```{}
test_id - unique id of test in our system
device_id - unique device id in our system
test_date - YYYY-MM-DD HH:MM:SS in Pacific time (we can accommodate different timezones if needed)
client_ip - ip of client
download_kbps - download speed in kilobits per second
upload_kbps - upload speed in kilobits per second
latency_ms - ping in milliseconds
server_name - name of server tested to (name of city it is located in)
server_country - country name of server
server_country_code - country code of server
server_latitude - latitude of server tested to
server_longitude - longitude of server tested to
server_sponsor_name - sponsor name of server
client_country - country name of the client
client_country_code - country code of the client
client_region - region name of client (this will be state in the US)
client_region_code - region code of client
client_city - city of client
client_latitude - latitude of client (GPS or Maxmind when location services disabled)
client_longitude - longitude of client (GPS or Maxmind when location services disabled)
miles_between - miles between the client and the server tested to
connection_type - 0=unknown, 1=cell, 2=wifi, 3=GPRS, 4=1XRTT, 5=EVDO, 6=EDGE, 7=3G,
8=HSPA, 9=EVDV, 10=PassThru, 11=LTE, 12=EHRPD
isp_name - name of ISP (Maxmind)
is_isp - 0=Corporation/Academic, 1=ISP
carrier_name - AT&T, Verizon etc
manufacturer - Nokia, HTC, etc.
device_name - name of the device for e.g. "HD7 T9292"
hardware_version - device hardware version e.g. "1.0.0.0"
firmware_version - device firmware_version e.g. "1232.2107.1241.1001"
location_type - 0 = unknown, 1 = GPS, 2 = GeoIP
```
\normalsize
Step 2: Data preprocessing
==============
In order to compare the data from the three different mobile platforms, we need to perform basic data transformations and merge it into one table.
Following that, in this preprocessing and analysis step we validate the data on the following points: 1. Are there any specific individual devices that perform a suspiciously high number of tests?
2. We apply filters so only the tests from the three operators we are interested in remain.
3. We apply filters so only tests done on 4G technology remain.
4. We are only interested in the coverage area in which all three operators claim to have 4G coverage.
5. We look at speed test results that are “too good to be true” that is, measured speeds that are above the theoretical maximum possible for that specific technology. We remove these speed tests.
6. We look at specific coordinates that are very frequent, depending on the explanation as for why these coordinates are used to often we remove or delete the speed tests per coordinate.
7. We look at specific dates that have a high number of speed tests for that day.
After all these checks we end up with a data set that is cleaned and ready to perform a statistical significance test on the investigated metrics.
## Basic data transformations
As explained in section 3.3., the data from the three different mobile platforms comes in different formats and some basic data transformations are necessary before we can merge it into one table.
First of all, the names for the individual operators are spelled in various ways (e.g. ‘T-Mobile NL‘, ‘T-Mobile NL’). Next, we need to map connection types to the specific technology used (2G, 3G or 4G) depending on the operating system of the device. For more details on these transformations, please see the SQL script on GitHub.
After these transformations are performed, we proceed with the checks and cleaning steps explained at in the previous section.
## Suspicious devices
Are there any devices that perform tests very frequently?
In order to investigate this, let’s look at a frequency plot of devices that occur at least **ten** times in each month. On the right hand side we see devices that are used for testing very often, some of them even on an hourly basis. Obviously those devices are not in the hands of real customers so these will be removed from the data set.
```{r freq-plot-devices}
library(ggplot2)
# months
df.ad.m1 <- df.ad[as.Date(df.ad$test_date) >= as.Date('2015-07-01') & as.Date(df.ad$test_date) < as.Date('2015-08-01'),]
df.ad.m2 <- df.ad[as.Date(df.ad$test_date) >= as.Date('2015-08-01') & as.Date(df.ad$test_date) < as.Date('2015-09-01'),]
df.ad.m3 <- df.ad[as.Date(df.ad$test_date) >= as.Date('2015-09-01') & as.Date(df.ad$test_date) < as.Date('2015-10-01'),]
# plot grid
par(mfrow=c(3,1),mar=c(1,1,1,1),mai=c(0.1,1,0.1,0.1))
#Tabel voor maand 1
tbl <- table(df.ad.m1$device_id)
tbl <- sort(tbl[tbl>10]) # only keep > 10
nms <- rep("",dim(tbl))
barplot(tbl, names.arg = nms,main=paste(p.ListQuarterMonths[1],"2015"),ylab="Number of tests",xlab="Unique devices")
tbl.m1 <- tbl # TO DO aanpassen maand
#Tabel voor maand 2
tbl <- table(df.ad.m2$device_id)
tbl <- sort(tbl[tbl>10]) # only keep > 10
nms <- rep("",dim(tbl))
barplot(tbl, names.arg = nms,main=paste(p.ListQuarterMonths[2],"2015"),ylab="Number of tests",xlab="Unique devices")
tbl.m2 <- tbl
#Tabel voor maand 3
tbl <- table(df.ad.m3$device_id)
tbl <- sort(tbl[tbl>10]) # only keep > 10
nms <- rep("",dim(tbl))
barplot(tbl, names.arg = nms,main=paste(p.ListQuarterMonths[3],"2015"),ylab="Number of tests",xlab="Unique devices")
tbl.m3 <- tbl
# max number of test to call a device suspicious
max_susp_dev = 30
```
Based on these plots and the amount of data consumed when performing a speedtest, we decide to remove specific devices that test more than `r max_susp_dev` times a month. These devices are probably used by telecom professionals for testing purposes. The amount of data consumed per speedtest depends on the speed: the higher the speed, the more data is consumed. A test on 4G at very high speed can cost up to 100 MBytes per test out of the data bundle. So, 30 or more tests per month at high speed are equivalent to approximately 3GB of data usage just spend on tests alone and that is suspicious. The number 30 by itself is subjective, we could use 25 or 40 depending on what exact number of tests per device you would call suspicious.
Actually including these are not influencing the results significantly, but we want to use real customer data as much as possible, not affected by professionals testing their own (or others) network.
```{r get-clean-data, cache=FALSE}
# read data from 4Gdb
# Establish connection to PoststgreSQL using RPostgreSQL
drv <- dbDriver("PostgreSQL")
# Full version of connection setting
con <- dbConnect(drv, dbname=db.name,host=db.hostname,port=db.portnumber,user=db.username,password=db.password )
# all data joined and cleaned for suspicious devices in SQL
df.clnd <- dbReadTable(con, c("ookla_all_data_clean"))
# get data for tm coverage area only
df.4G <- dbReadTable(con, c("datatm4gcoverage"))
# disconnect
sink <- dbDisconnect(con)
```
We identify `r sum(tbl.m1>=max_susp_dev)` devices in `r p.ListQuarterMonths[1]`, `r sum(tbl.m2>=max_susp_dev)` devices in `r p.ListQuarterMonths[2]` and `r sum(tbl.m3>=max_susp_dev)` devices in `r p.ListQuarterMonths[3]`, which in total represent `r nrow(df.ad)-nrow(df.clnd)` speed tests.
After filtering these devices the data set has `r nrow(df.ad)`-`r nrow(df.ad)-nrow(df.clnd)`=`r nrow(df.clnd)` speed test cases.
Top three operators
-------------------
For this analysis we are only interested in the top three operators in the Netherlands. In the data set, at this point, there are `r length(levels(as.factor(df.clnd$operator)))` different operators identified. As we can see in the table below, in which we ranked the ten most used operators, most of the speed tests were performed by people using one of the three operators we are interested in. There are 7137 tests with no identifiable operator. We will filter out all but the top 3 operators and proceed with speed tests from these top three operators.
```{r top-operators}
top10 <- head(sort(table(df.clnd$operator),decreasing = TRUE),n=10)
kable(as.data.frame(cbind(rownames(top10),top10),row.names = FALSE),col.names = c('Operator','Number of speedtests'),align=c("l","r"),caption="Most frequent operators")
# filter on top three operators only
ops <- c("T-Mobile NL","Vodafone NL","KPN NL")
df.clnd.t3 <- df.clnd[df.clnd$operator %in% ops,]
```
The top three operators together are good for `r sum(tail(sort(table(df.clnd$operator)),n=3))` tests conducted all over the Netherlands in the test period of `r p.QuarterNumber` `r p.Year`. We keep only speed tests from the top three operators.
This leaves `r nrow(df.clnd)` - `r nrow(df.clnd)-nrow(df.clnd.t3)` = `r nrow(df.clnd.t3)` rows in the data set.
Focus on 4G technology
----------------
In the raw Ookla Speedtest Intelligence data the variable called 'connection_type' identifies which technology is used, this variable can be transformed into the network technology used while performing the test.
The variable Connection type defines 4G as connection type 15 for Android. For iOS connection type 12 is LTE, and for Windows Mobile connection type 11 is LTE.
Definition of 4G for Android OS from the SQL script(available on GitHub) :
```sql
Case WHEN CONNECTION_TYPE=0 THEN 'UNKNOWN'
WHEN CONNECTION_TYPE in (1,2) THEN 'WIFI/CELL'
WHEN CONNECTION_TYPE in (3,4) THEN '2G'
WHEN CONNECTION_TYPE=15 THEN '4G'
WHEN CONNECTION_TYPE between 5 and 17 THEN '3G'
ELSE 'UNKNOWN'
END AS TECHNOLOGY
```
Below we give an overview of the network technology types available in the data set.
```{r tech}
library(knitr)
tbl <- table(df.clnd.t3$technology)
kable(as.data.frame(cbind(tbl,100*prop.table(tbl))),col.names = c('Number of cases','Percentage'), digits = 2, caption="Technology used in tests")
# number of 4G tests to be used in document
n4g <- tbl["4G"]
```
**In the remainder of this analysis we will focus on 4G technology.**
Filtering on 4G technology leaves `r n4g` test cases in the data set.
Operating systems
------------------
For the top three operators we can look at the type of operating system used on these devices:
```{r top3-table}
library(knitr)
tbl <- table(df.clnd.t3$os,df.clnd.t3$operator)
kable(tbl)
```
Most of the tests were conducted on iOS closely followed by Android OS. Windows Mobile devices have limited representation in the data set. In this test we are not interested in testing the difference in performance per device or operating system.
Geographical coverage area for 4G
------------------
For this test to be fair to all three operators, we limit the comparison of the test to areas in which all three operators (KPN, Vodafone and T-Mobile) claim to have 4G coverage at the time of the measurements. While KPN and Vodafone already claim national 4G coverage, T-Mobile is still in the process of expanding their 4G network. Therefore, T-Mobile 4G coverage area is extended every month. This means that some areas only got 4G coverage during `r p.QuarterNumber` `r p.Year`, the period of the test. All of the top 21 cities in the Netherlands, which we will analyze per city, had 4G enabled prior to `r p.QuarterNumber ` `r p.Year`.
### Coordinates with very high number of tests
Are there any locations, or coordinates, that occur very often in the investigated 4G area?
If we join the coordinates latitude and longitude together and look at the most frequent occurrences we see that there are indeed some coordinates that are very frequent. How do these exact same coordinates end up in the data? To understand this we need to explain a bit more on how the Ookla Speed test application gets the coordinates from a mobile device([read more online](https://support.speedtest.net/hc/en-us/articles/203845480-Mobile-Test-Server-Selection)).
There are several scenario's that can be the case:
1)The customer has approved the application access to the GPS coordinates of his/her device.
2)For some reason the app cannot read the GPS coordinates from the device at the time of the test. This reason can be of different origins, the user has blocked access or we are in a building or there are other technical reasons why the exact GPS coordinates cannot be accessed.
Whenever the exact coordinates are not available, due to measurement issues or because the customer is not allowing the application to use the GPS coordinates Ookla uses GEO-IP. GEO-IP is a online service to estimate the physical location of an ip-internet address (more online [from maxmind](https://www.maxmind.com/en/geoip2-services-and-databases) ).
```{r susp-coord}
# combine coordinates in "lat-lon"
str.coord <- paste(df.4G$client_latitude,",",df.4G$client_longitude)
# top 10 most frequent locations
tbl <- head(sort(table(str.coord),decreasing = TRUE),n=15)
nms <- names(tbl)
df.tmp <- as.data.frame(cbind(nms,tbl))
names(df.tmp) <- c("Coordinates","Count")
kable(df.tmp, row.names = FALSE,align=c("l","r"))
coord <- names(sort(table(str.coord),decreasing = TRUE)[1])
n.test.at.coord <- sort(table(str.coord),decreasing = TRUE)[1]
```
We asked for an explanation from Ookla on frequent and rounded coordinates.
Coordinate (52.3667,4.9) : Response from Ookla: *Results are definitely coming from GEO-IP Coordinates as you know. 2. In a single day here, there is 191 results from the same IP block to this location. 3. Correct, this is similar to the 'Kansas' issue we discussed last year.*
Coordinate (52.35-4.9167) is in Amsterdam. Question to Ookla: Can we still assume the measurements are from Amsterdam?
Response Ookla: *Yes, GeoIP is used here. We can assume you are in Amsterdam but like any other GeoIP location result, the confidence level isn't as high as it would be if we were able to get location information directly from the device, meaning if we were able to obtain GPS instead of GEO-IP.*
Coordinate (52.3666-4.9027) is also in Amsterdam(Waterloo plein).
Coordinate (52.374-4.8897) is also in Amsterdam(Spuistraat).
Coordinate (51.9167-4.5) is in Rotterdam. Again with limited precision, same response as above from Ookla.
Coordinate (52.0666-4.3209) is in The Hague. All cases are tests with operator equal to "T-Mobile NL" also this location is close to the T-Mobile office in The Hague. We will exclude these tests as potentially being from T-Mobile employees.
Also see the [precision of the coordinates](https://en.wikipedia.org/wiki/Decimal_degrees) denotes the fact that we are unsure about the exact location.
**So what do we do with these suspicious coordinates?**
The "Unknown" location, which has coordinates (`r coord`) and is the default location from Ookla if the GPS cannot pick up the exact location during the test, we have `r n.test.at.coord` speed tests with this coordinate alone. We will exclude tests performed at this coordinate from the general analysis. Nevertheless, we will analyse this set the same way we analyse individual cities, the result for this set can be found as the last city labeled "unknown".
```{r default-location}
# filter unkown location
df.unkwnloc <- df.4G[paste(df.4G$client_latitude,",",df.4G$client_longitude)==coord,]
# filter out T-Mobile office 52.0666-4.3209
coord.tm <- "52.0666 , 4.3209"
df.tm <- df.4G[paste(df.4G$client_latitude,",",df.4G$client_longitude)==coord.tm,]
# filter out KPN office 52.0666 4.3479
coord.kpn <- "52.0666 , 4.3479"
df.kpn <- df.4G[paste(df.4G$client_latitude,",",df.4G$client_longitude)==coord.kpn,]
# filter out Vodafone office 52.3767 4.9061
coord.vf <- "52.3767 , 4.9061"
df.vf <- df.4G[paste(df.4G$client_latitude,",",df.4G$client_longitude)==coord.vf,]
# total number of tests around head offices
n.ho.tot = nrow(df.tm)+nrow(df.kpn)+nrow(df.vf)
# filter top 3 head offices location from 4G
df.4G <- df.4G[setdiff(rownames(df.4G),rownames(df.tm)),]
df.4G <- df.4G[setdiff(rownames(df.4G),rownames(df.kpn)),]
df.4G <- df.4G[setdiff(rownames(df.4G),rownames(df.vf)),]
# replace ookla default location with UNKNOWN
df.unkwnloc$gm_naam <- "Unkown Location"
# filter unkown location from 4G as we want only known locations
df.4G <- df.4G[setdiff(rownames(df.4G),rownames(df.unkwnloc)),]
```
**Head office locations**
We removed `r nrow(df.tm)` tests from coordinates "52.0666 , 4.3209" in The Hague. As they are close to the head-office of T-Mobile and indeed all tests from this location are done from a T-Mobile network.
We did the same for locations close to the head offices of KPN and Vodafone. We removed `r nrow(df.kpn)` tests from the co-ordinates "52.0666 4.3479" (KPN office) and we removed `r nrow(df.vf)` tests from the coordinates "52.3767 4.9061"" (Vodafone office Amsterdam).
After removing `r n.ho.tot` tests from the coordinates around the head-offices of the top three providers as described in the above paragraph, together with `r nrow(df.unkwnloc)` from the coordinates "`r coord`" the data set contains `r n4g` - `r n.ho.tot + nrow(df.unkwnloc)` = `r n4g - n.ho.tot - nrow(df.unkwnloc)` speed tests at this point.
### Mapping test coordiantes to city boundaries
To identify the exact 4G coverage area we will use in this analysis, we use data from CBS. CBS is the Central Bureau of Statistics in the Netherlands. They provide [publicly available](http://www.cbs.nl/nl-NL/menu/themas/dossiers/nederland-regionaal/publicaties/geografische-data/archief/2014/2013-wijk-en-buurtkaart-art.htm) polygon data on cities in the Netherlands. Based on these geographical city boundaries we map each latitude, longitude coordinate onto a city.
From T-Mobile we received a list of cities that, at the time of testing, have 4G coverage. From the data we can see that per city each provider has sufficient number of speed tests in the data set for the tests to be representative.
We use city boundaries to not be influenced by exact locations of network infrastructure.
Online we can get an up to date overview of network coverage for all three operators (see [here](http://www.4gdekking.nl/)).
This test is not about coverage but about speed of the 4G network.
T-Mobile has the least 4G coverage of the three operators and per end of `r p.QuarterNumber` `r p.Year` has actual coverage in the following area:
```{r area }
library(png)
library(grid)
img <- readPNG("./img/tm-coverage-q3-2015.png")
grid.raster(img)
```
In order to do a fair comparison, we focus only on the area presented in the picture above.
We do a Geo-spacial filter on the latitude and longitude coordinates provided in the data set and only keep speed tests that are in any of the above city boundaries. For the area, defined above, the following number of tests are available in the data set.
```{r}
tbl <- table(df.4G$operator)
mt <- addmargins(tbl)
pt <- c(as.numeric(prop.table(tbl)),1.0)
kable(as.data.frame(cbind(mt,100*pt)),col.names = c('Number of tests','percentage'), digits = 2,caption="Number of 4G speedtests in the selected coverage area.")
```
Areas where T-Mobile announced 4G coverage in `r p.QuarterNumber` `r p.Year`:
July: Heerenveen, Weststellingwerf, Hoogeveen, Meppel, Staphorst, Asten, Boekel, Deurne, Someren, Venray, Gemert-Bakel, Heeze-Leende, De Wolden, Westerveld and Midden-Drenthe
August: Leeuwarden, Graft-De Rijp, Beemster, Bergen (NH.), Castricum, Edam-Volendam, Enkhuizen, Heerhugowaard, Heiloo, Den Helder, Hoorn, Langedijk, Medemblik, Opmeer, Schagen, Texel, Schermer, Zeevang, Drechterland, Stede Broec, Waterland, Beesel, Nederweert, Venlo, Weert, Horst aan de Maas, Koggenland, Leudal, Cranendonck, Peel en Maas and Hollands Kroon
September: Emmen, Zeewolde, Skarsterlân, Lemsterland, Kampen, Noordoostpolder, Urk, Elburg, Oldebroek, Dronten, Baarle-Nassau, Etten-Leur, Rucphen, Steenbergen, Woensdrecht, Zundert, Halderberge, Roosendaal, Steenwijkerland, Alphen-Chaam and Zwartewaterland
All other cities in the TMNL coverage area had 4G enabled prior to Q3 2015.
Naturally, in the cities where 4G was announced in `r p.QuarterNumber`, almost all of T-Mobile’s 4G speedtests occur after announcing to customers that 4G is activated, so `r p.ListQuarterMonths[1]`, `r p.ListQuarterMonths[2]` or `r p.ListQuarterMonths[3]` respectively. Even though a very small number of tests were executed during the extension of the 4G coverage onto these cities, as 4G sites were being added (so before announcing 4G coverage in these cities; this is possible as for most 4G capable phones, the 4G network is selected automatically if available). In the same cities, KPN and Vodafone speedtests are distributed more evenly over the entire period of `r p.QuarterNumber`/`r p.Year`. However, this has no influence on the test results, as the test results per month do not differ substantially (please see section 4.6.4).
So we filtered the data set to include only speed tests from the coverage area, speed tests outside this area are neglected.
The data set now contains `r n4g - n.ho.tot - nrow(df.unkwnloc)` - `r n4g - n.ho.tot - nrow(df.unkwnloc) - nrow(df.4G)` = `r nrow(df.4G)` speed tests.
### Suspicious speeds
In the data we check for up and download speeds that are technically impossible.
*Download speeds* for 4G are limited to 150Mbps on the T-mobile technology.
KPN and Vodafone have a technology called LTE advanced which has a maximum download speed of 225Mbps.
Any speed tests that had a speed recorded above the technical maximum for that operator was removed from the data set.
```{r susp-down-speed}
# count number of cases that exceed theoretical max speed
n.max.tm <- nrow(subset(df.4G, download_kbps>150000 & operator == "T-Mobile NL"))
n.max.vf <- nrow(subset(df.4G, download_kbps>225000 & operator == "Vodafone NL"))
n.max.kpn <-nrow(subset(df.4G, download_kbps>225000 & operator == "KPN NL"))
# create filters
rw.fltr.1 <- rownames(subset(df.4G, download_kbps>150000 & operator == "T-Mobile NL"))
rw.fltr.2 <- rownames(subset(df.4G, download_kbps>225000 & operator == "Vodafone NL"))
rw.fltr.3 <- rownames(subset(df.4G, download_kbps>225000 & operator == "KPN NL"))
# filter out these extremes
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.1),]
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.2),]
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.3),]
# filter also applied to seperate data in df.unkwnloc ...
unk.fltr.1 <- rownames(subset(df.unkwnloc, download_kbps>150000))
df.unkwnloc <- df.unkwnloc[setdiff(rownames(df.unkwnloc),unk.fltr.1),]
```
So we remove suspicious measurements in which the **download** speed exceeded the maximum theoretical speed per individual operator.
For T-Mobile we removed `r n.max.tm` cases, for Vodafone we removed `r n.max.vf` cases and for KPN we removed `r n.max.kpn` cases, because they where above 150(or 225) MBps.
After removing in total `r n.max.tm+n.max.vf+n.max.kpn` of these suspicious measurements, the data set contains `r nrow(df.4G)` speed tests at this point.
Let's do the same for *upload speed*.
The maximum theoretical upload speed is for all operators the same: maximum upload speed of 50Mbps.
```{r susp-upload-speed}
# create filters
rw.fltr.4 <- rownames(subset(df.4G, upload_kbps>50000 & operator == "T-Mobile NL"))
rw.fltr.5 <-rownames(subset(df.4G, upload_kbps>50000 & operator == "Vodafone NL"))
rw.fltr.6 <-rownames(subset(df.4G, upload_kbps>50000 & operator == "KPN NL"))
# filter out these extremes
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.4),]
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.5),]
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.6),]
# filter also applied to seperate data in df.unkwnloc ...
unk.fltr.2 <- rownames(subset(df.unkwnloc, upload_kbps>50000))
df.unkwnloc <- df.unkwnloc[setdiff(rownames(df.unkwnloc),unk.fltr.2),]
```
So again we remove suspicious measurements in which the **upload** speed exceeded the maximum theoretical speed per individual operator. For T-Mobile we removed `r length(rw.fltr.4)` cases, for Vodafone we removed `r length(rw.fltr.5)` cases and for KPN we removed `r length(rw.fltr.6)` cases.
After removing in total `r length(rw.fltr.4)+length(rw.fltr.5)+length(rw.fltr.6)` of these suspicious measurements the data set contains `r nrow(df.4G)` speed tests at this point.
Let's look at **latency** also. We see the maximum value in latecy measurements is 65535, which is the maximum value for a variable of type unsigned short. This is clearly an outlier and should be removed.
Response Ookla: *"That issue was resolved in the latest Android release. As more and more clients download the latest version of the app, this will become less and less obvious. Just filter out for now until its cleaned up in the files."*
```{r susp-latency-speed}
# count number of cases that exceed theoretical max speed
n.l.mx.tm <- nrow(subset(df.4G, latency==65535 & operator == "T-Mobile NL"))
n.l.mx.vf <- nrow(subset(df.4G, latency==65535 & operator == "Vodafone NL"))
n.l.mx.kpn <-nrow(subset(df.4G, latency==65535 & operator == "KPN NL"))
# create filters
rw.fltr.7 <- rownames(subset(df.4G, latency==65535 & operator == "T-Mobile NL"))
rw.fltr.8 <- rownames(subset(df.4G, latency==65535 & operator == "Vodafone NL"))
rw.fltr.9 <- rownames(subset(df.4G, latency==65535 & operator == "KPN NL"))
# filter out these extremes
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.7),]
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.8),]
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.9),]
# filter also applied to seperate data in df.unkwnloc ...
unk.fltr.3 <- rownames(subset(df.unkwnloc, latency==65535))
df.unkwnloc <- df.unkwnloc[setdiff(rownames(df.unkwnloc),unk.fltr.3),]
```
So we remove `r n.l.mx.tm` measurements from the T-Mobile set , `r n.l.mx.kpn` measurements from the KPN set and `r n.l.mx.vf` measurements from the Vodafone set. After this the data set contains `r nrow(df.4G)` speed tests at this point.
\newpage
### Suspicious dates or times
We count the number of tests per day for the months `r p.ListQuarterMonths[1]`, `r p.ListQuarterMonths[2] ` and `r p.ListQuarterMonths[3]` of `r p.Year`. Again we are looking at any suspicious peaks in the data.
```{r timestamps, fig.align='center',fig.width=8}
library(ggplot2)
library(reshape2)
#Timeseries
counts <- table(df.4G$operator, df.4G$test_date)
df.cnts <- as.data.frame(counts)
names(df.cnts) <- c("operator","date","freq")
# Faceting is a good alternative:
ggplot(df.cnts, aes(x=date,y=freq)) + geom_bar(width=.5,stat="identity") +
facet_wrap(~ operator, nrow = 3) +
theme(axis.text.x = element_text(angle=90, vjust=0.5, size=6))
```
We find no disturbing or unknown peaks on a specific date, except for Vodafone on the 18/08/2015. This can be explained by the network incident that Vodafone had on this date, causing customers to experience slow or no internet. The problem was fixed the same day and it was communicated to customers, which is a stimulus to test.
```{r}
# table
tbl <- addmargins(table(df.4G$operator,paste(format(as.Date(df.4G$test_date, "%Y-%m-%d"), "%m"), format(as.Date(df.4G$test_date, "%Y-%m-%d"), "%B"))))
kable(tbl,caption="Counts per operator per month")
```
Above a count per month per operator in the 4G coverage area. This is the data set on which we conduct the remainder of the analysis.
In the tables below we list the averages for download speed, upload speed and latency per operator per month. For KPN we see a large increase in download and upload speed between August and September, as well as reduced latency. This is probably a result of KPN adding a substantial amount of sites in the large cities by activating 4G on the 1800Mhz frequency. This also enables them to activate LTE Advanced on those locations, which has an additional positive influence on average download speed, because it allows download speeds of up to 225Mbps, provided that the customer doing the tests has a device that can support this and the network conditions at the location allow this.
```{r}
library(plyr)
# add month
df.4G$month <- paste(format(as.Date(df.4G$test_date, "%Y-%m-%d"), "%m"),format(as.Date(df.4G$test_date, "%Y-%m-%d"), "%B"))
#df.4G$month <- format(as.Date(df.4G$test_date, "%Y-%m-%d"), "%m")
df.t<-ddply(df.4G,.(operator,month),summarize, mean_dwn=round(mean(download_kbps),1))
# reshape df to matrix
library(reshape)
df.t2 <- acast(df.t,operator ~ month)
kable(df.t2, caption="Average download speed(Kbps) per operator per month")
```
Table below shows details for upload speed.
```{r}
library(plyr)
df.t<-ddply(df.4G,.(operator,month),summarize, mean_up=round(mean(upload_kbps),1))
# reshape df to matrix
df.t2 <- acast(df.t,operator ~ month)
kable(df.t2, caption="Average upload speed(Kbps) per operator per month")
```
Table below shows details for latency.
```{r}
library(plyr)
df.t<-ddply(df.4G,.(operator,month),summarize, mean_up=round(mean(latency),1))
# reshape df to matrix
library(reshape)
df.t2 <- acast(df.t,operator ~ month)
kable(df.t2, caption="Average latency(ms) per operator per month")
```
\newpage
Step 3: Data analysis
=============
So we have pre-processed the data and looked for anomalies in the data. If found, we have corrected them. Finally we are ready to compare speed test data between the three major telecom operators in Netherlands in the above defined coverage area for the period of `r p.QuarterNumber` `r p.Year`.
We will analyse three different metrics:
* Download speed
* Upload speed
* Latency
There is no useful way to aggregate these individual metrics into one overall ‘speed aggregated score’. Most customers are interested in download speeds, because it affects the most of their experience (browsing,streaming,downloading etc). Most network speed comparisons only focus on download speed. However, since upload speed is also important for posting video’s and photos on social media, and ping times are important for gaming and fast opening of websites these metrics are also analyzed.
So how are the different metrics distributed?
Histogram distributions
----------------------
A histogram is a graphical representation of the distribution of data. It is an estimate of the probability distribution of a continuous variable (quantitative variable), [more](https://en.wikipedia.org/wiki/Histogram) about histograms on Wikipedia.
### Download speed
In the histogram below we see download speed in Kbps on the horizontal axis. The number of test cases are plotted as bars, on the vertical axis we see the count of the number of speed tests in a specific bin (or range). The histogram gives a visual representation of the distribution of the data.
```{r downloadspeed, fig.cap='Histogram download speed per operator', fig.height=3, fig.pos='H' }
library(plyr)
cdf <- ddply(df.4G, "operator", summarise, download_kbps.mean=mean(download_kbps))
# With mean lines, using cdf from above
ggplot(df.4G, aes(x=download_kbps)) + geom_histogram( colour="black", fill="white", binwidth = 5000) +
facet_grid(operator ~ .) +
xlab("Download speed (Kbps)") +
ylab("Number of speedtests (count)") +
geom_vline(data=cdf, aes(xintercept=download_kbps.mean),
linetype="solid", size=1, colour="red")
```
We plot the three histograms, one for each operator, right above each other so the horizontal axes are aligned. The red lines denotes the mean (or average) speed of all speed tests for the specific operator. If the red line is placed to the righthandside in this histogram that means the average speed for this operator is faster. The red line to the left means the avarage speed is slower. We can see that the red line of T-Mobile is farthest to the right so T-Mobile apparently has the highest average download speed.
### Upload speed
In the histogram below we see upload speed in Kbps on the horizontal axis. The number of test cases are plotted as bars, on the vertical axis we see the count of the number of speed tests in a specific bin (or range). The histogram gives a visual representation of the distribution of the data.
```{r uploadspeed, fig.cap='Histogram upload speed per operator', fig.height=3 }
library(plyr)
cdf <- ddply(df.4G, "operator", summarise, upload_kbps.mean=mean(upload_kbps))
# With mean lines, using cdf from above
ggplot(df.4G, aes(x=upload_kbps)) + geom_histogram( colour="black", fill="white", binwidth = 1000) +
facet_grid(operator ~ .) +
xlab("Upload speed (Kbps)") +
ylab("Number of speedtests (count)") +
geom_vline(data=cdf, aes(xintercept=upload_kbps.mean),
linetype="solid", size=1, colour="red")
```
We plot the three histograms, one for each operator, right above each other so the horizontal axes are aligned. The red lines denotes the mean (or average) speed of all speed tests for the specific operator. If the red line is placed to the righthandside in this histogram that means the average speed for this operator is faster. The red line to the left means the avarage speed is slower. We can see that the red line of T-Mobile is farthest to the right so T-Mobile apparently has the highest average upload speed as well.
### Latency
In the histogram below we see latency speed on the horizontal axis on a logaritmic scale. The log transformation makes the figure more readable. The number of test cases are plotted as bars, on the vertical axis we see the count of the number of speed tests in a specific range. For latency we take the log so the outlines scale and we can have a look at the shape of the distribution.
```{r latency, fig.cap='Histogram latency per operator', fig.height=3 }
library(plyr)
cdf <- ddply(df.4G, "operator", summarise, latency.mean=mean(latency))
# With mean lines, using cdf from above
ggplot(df.4G, aes(x=latency)) + geom_histogram( colour="black", fill="white", binwidth = 0.02) +
scale_x_log10(limits=c(10, 1000)) +
facet_grid(operator ~ .) +
xlab("Latency (ms) log scale") +
ylab("Number of speedtests (count)") +
geom_vline(data=cdf, aes(xintercept=latency.mean),
linetype="solid", size=1, colour="red")
```
For latency smaller is better, so in this plot we are looking at which operator has the smallest latency. Again we see the red lines per operator. The average latency(red line) for T-Mobile is the most to the left, which means T-Mobile has the smallest average latency of the three operators.
Box-plot
---------
In descriptive statistics, a box plot or box-plot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points. [More](https://en.wikipedia.org/wiki/Box_plot) about box-plots on Wikipedia.
### Download speed
We see a box plot for each of the operators on the x axis, the vertical axis shows the speed test values in Kbps. The daimond shape represents the mean, the thick line represents the median. the black dots on the top and bottom represent extreme cases. The data is split up into quartiles, that means four equally sized proportions. The first and fourth quarter are represented as a line, the second and third as a box.
```{r box-down, fig.cap='Boxplot download speed per operator',fig.height=3}
# boxplot
ggplot(df.4G, aes(x=operator, y=download_kbps, fill=operator)) +
geom_boxplot() +
ylab("Download speed (Kbps)") +
guides(fill=FALSE) + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
```
Also in the box-plot we see that for T-Mobile the diamond shape(average) and the thick line representing the median are higher then the same values for the other operators. So also in this box-plot for download speed we see that T-Mobile has the highest average and median download speed.
If we zoom in on percentiles we can look at the fastest 10%, 5% and 1% of speed tests:
```{r quant-down}
qdwn <- ddply(df.4G,.(operator), function(x) round(quantile(x$download,c(.9,.95,.99))))
names(qdwn) <- c("operator","10%","5%","1%")
kable(qdwn, caption="Top percentiles average download speed(Kbps)")
```
From the table we see that T-Mobile scores best in the 10% of download speed tests per operator. For the 5% percentile the T-Mobile outcome is very close to the KPN outcome but both better then Vodafone.
In the top 1% of download speed tests KPN scores best, followed by Vodafone. Both these operators they have the LTE-Advanced technology in some locations mostly in the largest cities. This technology is supported by a small number of devices. In excellent radio conditions and in case the user has a suitable device, download speeds higher than 150MBps (up to 225Mbps) can be achieved. This is the case in less than 1% of all tests of KPN and Vodafone, but also the main reason why the top 1% for these two operators is higher than T-Mobile’s. The LTE-advanced technology does not contribute to higher upload speeds nor lower latency.
### Upload speed
We see a box plot for each of the operators on the x axis, the vertical axis shows the speed test values in Kbps. The daimond shape represents the mean, the thick line represents the median. the black dots on the top and bottom represent extreme cases. The data is split up into quartiles, that means four equally sized proportions. The first and fourth quarter are represented as a line, the second and third as a box.
```{r box-up, fig.cap='Boxplot upload speed per operator',fig.height=3}
# boxplot
ggplot(df.4G, aes(x=operator, y=upload_kbps, fill=operator)) +
geom_boxplot() +
ylab("Upload speed (Kbps)") +
guides(fill=FALSE) + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
```
Also in the box-plot we see that for T-Mobile the diamond shape(average) and the thick line representing the median are higher then the same values for the other operators. So also in this box-plot for upload speed we see that T-Mobile has the highest average and median upload speed.
If we zoom in on percentiles we can look at the fastest 10%, 5% and 1% of speed tests:
```{r quant-up}
qdwn <- ddply(df.4G,.(operator), function(x) round(quantile(x$upload,c(.9,.95,.99))))
names(qdwn) <- c("operator","10%","5%","1%")
kable(qdwn, caption="Top percentiles average upload speed(Kbps)")
```
From the table we see that T-Mobile scores best in the fastest 10%,5% and 1% of upload speed tests per operator.
### Latency
For latency ( or ping) we take the log so the outliers scale and we can have a look at the shape of the distribution.
We see a box plot for each of the operators on the x axis, the vertical axis shows the speed test values in Kbps. The daimond shape represents the mean, the thick line represents the median. the black dots on the top and bottom represent extreme cases. The data is split up into quartiles, that means four equally sized proportions. The first and fourth quarter are represented as a line, the second and third as a box.
```{r box-latency, fig.cap='Boxplot latency speed per operator',fig.height=3}
# boxplot
ggplot(df.4G, aes(x=operator, y=latency, fill=operator)) +
geom_boxplot() +
scale_y_log10() + ylab("Latency (ms) log scale") +
guides(fill=FALSE) + stat_summary(fun.y=mean, geom="point", shape=5, size=4)
```
In the box-plot for latancy we see that for T-Mobile the diamond shape(average) and the thick line representing the median are lower then the same values for the other operators. Remember, for latancy lower is better.
If we look at the fastest 1%, 5% and 10% percentiles we see the following table:
```{r quant-lat}
qlat <- ddply(df.4G,.(operator), function(x) round(quantile(x$latency,c(.01,.05,.1)),2))
names(qlat) <- c("operator","1%","5%","10%")
kable(qlat, caption="Top percentiles average latency (ms)")
```
\newpage
Step 4: Test design
=========
What we want to test is if, on average, a customer that uses T-Mobile 4G Mobile Network gets a higher average speed(in terms of download speed, upload speed and latency) than a customer using KPN’s or Vodafone’s 4G Mobile Network, with all else equal. To do this we have collected thousands of Ookla Speedtest results taken from the three top operators (KPN, Vodafone and T-Mobile) which have been filtered as set out in the above to a final dataset consisting of `r nrow(df.4G)` data points. Now we want to compare T-Mobile with the other two operators. So we do two tests: the first is comparing T-Mobile with KPN and the second is comparing T-Mobile with Vodafone. In each test we compare all three metrics: Upload speed, Download speed and latency(or ping). For each operator we have a sample set available in the data. These sets are so called samples (from the Dutch population of mobile phone users) from which we calculate the sample means. Now our statistical test tests if these sample means are significantly different from one another.
In practice, the Central Limit Theorem assures us that, under a wide range of assumptions, the distributions of the two sample means being tested will themselves approach Normal distributions as the sample sizes get large, regardless (this is where the assumptions come in) of the distributions of the underlying data. As a consequence, as the sample size gets larger, the difference of the means becomes normally distributed, and the requirements necessary for the t-statistic of an unpaired t-test to have the nominal t distribution become satisfied.
Which statistical test do we need?
-----------
What we have here is a set of unpaired, independent, different sample size, different variance data. A suitable and powerful test for this kind of data is a [Welch t-test](https://en.wikipedia.org/wiki/Welch%27s_t_test).
In statistics, Welch's t-test (or Welch-Aspin Test) is a two-sample location test, and is used to check the hypothesis that two populations have equal means(our NULL hypothesis). Welch's t-test is an adaptation of Student's t-test, and is intended for use when the two samples have possibly unequal variances(which is the case here). These tests are often referred to as "unpaired" or "independent samples" t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping(in our case the units are different people performing the test with different devices on different networks).
```{r set-confidence-level}
conflevel <- 0.99
```
###Significance
So when is a test significant? And if so at what level? And furthermore can we qualify such a significant result as good or bad? To start with the last remark, all qualifications of a statistical result are subjective. One way of looking at 95% confidence is that 1 out of 20 trials (in 5% of the cases) you make a so called Type 1 error, in which you wrongly reject the null-hypothesis. So in this case, if the p-value would be 0.05(confidence level 95%) you would claim that operator x is faster then operator y while in fact they were not. In applied practice, confidence intervals are typically stated at the 95% or 99% confidence level (More on [significance](https://en.wikipedia.org/wiki/Statistical_significance) ).
In our test we will set the confidence level to be `r 100*conflevel`%, which is more strict then 95%. This means we will reject the Null Hypothesis only if we are `r 100*conflevel` % confident we do not make a mistake. From the test result we see that in most cases the calculated p-values are very much smaller then 1 - `r conflevel` = `r 1-conflevel`, so changes of making this type of error are even considerably smaller than the claimed confidence level of `r 100*conflevel`%.
### P-value
In statistics, the p-value is a function of the observed sample results (a statistic) that is used for testing a statistical hypothesis. Before performing the test a threshold value is chosen, called the significance level of the test, traditionally 5% or 1% and denoted as $\alpha$. If the p-value is equal or smaller than the significance level ($\alpha$), it suggests that the observed data is inconsistent with the assumption that the null hypothesis is true, and thus that hypothesis must be rejected and the alternative hypothesis is accepted as true ( see [wikipedia](https://en.wikipedia.org/wiki/P-value)).
### Confidence intervals
Confidence intervals consist of a range of values (interval) that act as good estimates of the unknown population parameter. The level of confidence of the confidence interval would indicate the probability that the confidence range captures this true population parameter given a distribution of samples. This value is represented by a percentage, so when we say, "we are `r 100*conflevel`% confident that the true value of the parameter is in our confidence interval", we express that 99% of the observed confidence intervals will hold the true value of the parameter. A confidence interval does **not** predict that the true value of the parameter has a particular probability of being in the confidence interval given the data actually obtained. (see [wikipedia](https://en.wikipedia.org/wiki/Confidence_interval)).
\newpage
Step 5: Test results coverage area
===========
We test T-Mobile against the other two operators so we have two tests. We put the confidence level to `r 100*conflevel` %. Our null-hypothesis is that the means are drawn from the same sample, so they are not different.
In this test we use the whole cleaned data set, in the next chapter we test each individual city.
Let's see what our test results are:
```{r absolute-test-result, cache=FALSE, results='asis'}
library(xtable)
# do test result use sourced function
df.tr.dwnload <- fn.ttest(df.4G,"download_kbps",conflevel)
df.tr.upload <- fn.ttest(df.4G,"upload_kbps",conflevel)
df.tr.latency <- fn.ttest(df.4G,"latency",conflevel)
# print test result
print(xtable(df.tr.dwnload,format = "pandoc" , caption = "Comparison of means for metric: Download (Kbps)",row.names=FALSE, booktabs=TRUE), size="\\footnotesize",include.rownames=FALSE,comment=FALSE)
print(xtable(df.tr.upload,format = "pandoc" , caption = "Comparison of means for metric: Upload (Kbps)",row.names=FALSE), size="\\footnotesize",include.rownames=FALSE,comment=FALSE)
print(xtable(df.tr.latency,format = "pandoc", caption = "Comparison of means for metric: Latency (ms)",row.names=FALSE), size="\\footnotesize",include.rownames=FALSE,comment=FALSE)
```
```{r relative conslusions}
# sometimes R is a bit cumbersome...
d1 <- as.numeric(as.character(df.tr.dwnload["T-Mobile vs KPN","Diff(Kbps)"]))
r1 <- as.numeric(as.character(df.tr.dwnload["T-Mobile vs KPN","Rel(%)"]))
l1 <- min(as.numeric(as.character(df.tr.dwnload["T-Mobile vs KPN","Mean 1"])),as.numeric(as.character(df.tr.dwnload["T-Mobile vs KPN","Mean 2"])))
d2 <- as.numeric(as.character(df.tr.dwnload["T-Mobile vs Vodafone","Diff(Kbps)"]))
r2 <- as.numeric(as.character(df.tr.dwnload["T-Mobile vs Vodafone","Rel(%)"]))
l2 <- min(as.numeric(as.character(df.tr.dwnload["T-Mobile vs Vodafone","Mean 1"])),as.numeric(as.character(df.tr.dwnload["T-Mobile vs Vodafone","Mean 2"])))
```
**Explanation of terms**
\footnotesize
**Sample 1**: Number of speed test samples for operator 1.
**Sample 2**: Number of speed test samples for operator 2.
**Mean 1**: Average speed of speed tests for operator 1 in Kbps. A high number here means that this operator has a fast download(or upload) speed. For the latency a high number means you have to wait longer to contact webpages or servers.
**Mean 2**: Average speed of speed tests for operator 2 in Kbps. A high number here means that this operator has a fast download(or upload) speed. For the latency a high number means you have to wait longer to contact webpages or servers.
**P-value**: The test statistic, for more explanation see the paragraph P-Value above
**Sign.**: Short for Significance. We compare the test statistic with the predefined confidence level(`r conflevel`). 'Yes' means the test is significant, 'No' means the test is not significant."))
**Diff(in Kbps or ms)**: Difference of the means (DoM) is the difference of Mean 1 and Mean 2(Mean 1 - Mean 2). For download and upload speeds(Kbps) big positive number here means operator 1 has a faster speed then operator 2. A big negative number means that operator 2 has a faster speed then operator 1. For latency(ms) this is the opposite, because smaller is better.
**Conf Int**: Confidence interval consist of a range of values (interval) that act as good estimate of the *true* difference of the mean. We are `r 100*conflevel`% confident that the true value of the difference of the mean is in our confidence interval.
**Rel(%)**: Relative difference in percentage. It is calculated as the difference of the mean divided by the slower of the two operators average speed. If the difference is not significant(column Sign is No), this column will state NA(Not Applicable). The comparison rules are similar to what is explained in the Diff(in Kbps or ms).
\normalsize
Looking at the tables above we see that all results are significant at $\alpha$ = `r 1-conflevel` level(`r 100*conflevel`% confidence level) and the resulting p-values are very small. This means we can reject the null-hypothesis with great confidence. Hence we can state that with `r 100*conflevel`% confidence the true difference in the means lies within the confidence interval provided in the table.
\newpage
Conclusion
=========
This analysis has been conducted with the utmost care and to the best knowledge of the analyst (DIKW Consulting). The analysis is opensource and all code can be downloaded, reviewed and repeated from [GitHub](https://github.com/hugokoopmans/ookla-speedtest-analysis).
Overall we can say that based on the speedtest data analysed in the investigated area the 4G network of T-Mobile outperforms both KPN and Vodafone on download speed, upload speed and latency.
From the data analysed in the investigated area the average download speed of the 4G network of T-Mobile outperforms KPN by `r round(d1/1000,2)` Mbps, which is `r r1`%. Also, from the data analysed in the investigated area the average download speed of the 4G network of T-Mobile outperforms Vodafone by `r round(d2/1000,2)` Mbps, which is `r r2`%. From table 14 above similar statements can be derived for upload speed. For deriving these statements for latency, please see table 15 keeping in mind that smaller values are better.
For conclusions per individual city we refer to the section below. Please keep in mind that the significance of a test per city does not influence the significance of a test over the whole 4G area. The significance of a test per city only shows if the 4G network speeds (download speed, upload speed and latency) are also significantly different on a local level, so for that city treated separately.
\newpage