speedtest-analysis.Rmd

---
title: "Independent Speed test Analysis of 4G Mobile Networks \ Performed by DIKW Consulting"
author: "Hugo Koopmans"
date: "03-11-2015"
output:
  pdf_document:
    fig_height: 4
    keep_tex: yes
    number_sections: yes
    template: ./templates/mytemplatedikw.tex
    toc: yes
  word_document: default
---

\newpage

Colophon
-------------
```{r echo=FALSE }
# all parameters should go in here
source("global.R")
```
This analysis is performed by DIKW Consulting.

DIKW Consulting is a consulting firm that takes her customers on the path from Data to information to Knowledge to Wisdom. Our expertise is in the field of data logistics, data warehousing, data mining and machine learning. 

T-Mobile has asked DIKW Consulting to perform this test as an independent third party. DIKW Consulting was paid to perform this test by T-mobile and has no other intentions then to perform this test by it's own high quality standards. The analysis was performed by generally accepted and approved standards and statistical methods using open source tools.

We let the data speak for itself.

If you have questions you can contact [DIKW Consulting](http://www.dikw.nl). If you want to repeat this test by yourself you are welcome to do so, all necessary scripts are available on GitHub. The data is commercially available at [Ookla](http://www.ookla.com/).

This analysis, method, tools and scripts are open sourced and placed on GitHub, see the read-me on the GitHub [repository](https://github.com/hugokoopmans/ookla-speedtest-analysis).

For questions contact Hugo Koopmans at
hugo.koopmans@dikw.com or
+31 6 43106780

Code generation
---------------
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

\newpage

Abstract
========
In this document we have conducted a statistical analysis of Ookla’s Speedtest Intelligence  data from [speedtest.net](http://www.speedtest.net/). Ookla provides commercially available speed test data collected by their mobile app on the three main mobile platforms, Android, iOS and Windows mobile. We will load all the raw speed test data into a database, analyse the data ot the top three operators(T-Mobile, Vodafone and KPN) and perform a test on how fast their respective 4G networks are. Our test will be on download speed, upload speed and latency(or ping).

The speed test data that is provided by Ookla’s Speedtest Intelligence  data from [speedtest.net](http://www.speedtest.net/) and obtained by T-Mobile is of  `r p.ListQuarterMonths[1]`, `r p.ListQuarterMonths[2]` and `r p.ListQuarterMonths[3]` `r p.Year`. We will analyse and validate the data step by step. After investigation on suspicious testing circumstances, such as, but not limited to, devices, location, theoretical maximum speeds and specific dates the data is ready to be subjected to a significance test. This test will be done for all the data in the coverage area and for the biggest twenty cities in the Netherlands.

The result of these tests will give an answer to whether or not T-Mobile has on average a faster 4G network than Vodafone and/or KPN, based on the three metrics upload speed, download speed and latency.

\newpage

Introduction
===========
This document is a report of a statistical analysis of 4G network speed test data. The time period we consider is `r p.QuarterNumber` `r p.Year` so the months `r p.ListQuarterMonths[1]`, `r p.ListQuarterMonths[2]` and `r p.ListQuarterMonths[3]` of `r p.Year` are in scope. 
We perform an analysis of [Ookla](http://www.ookla.com/) Speedtest Intelligence data on three different measures: 

* download speed 
* upload speed 
* latency(or ping) 

On these metrics we compare the three major providers of 4G mobile networks in the Netherlands, Vodafone, KPN and T-Mobile.

This analysis is set up as follows:

* Step 1: Data Collection
* Step 2: Data preprocessing  
* Step 3: Data analysis
* Step 4: Test design
* Step 5: Test results coverage area

In section 8 we provide the conclusion, and additionally the analysis and results of the Top 20 cities per city is presented in section 9.

Step 1 : Data Collection 
======

The data was downloaded from Ookla servers by T-Mobile. For this analysis we use a PostgreSQL database that is locally installed.

The data is loaded from three different files, each file resembling a mobile platform (Android,iPhone and Windows mobile). The data is loaded as-is as it was received from the Ookla server. Scripts to load this data directly into a PostgreSQL database can be found in the GitHub repository.

All scripts to process the data are in SQL(available on GitHub) or R(included in this document, thanks to [knitr](http://yihui.name/knitr/))

```{r options , echo=FALSE, cache=FALSE}

# set global chunk options 
library(knitr)
opts_chunk$set(echo=FALSE, cache=FALSE, tidy=TRUE, warning=FALSE, message=FALSE, error=TRUE)

# load functions
source("statistical-analysis.R")

```


Let's get the data and do some basic counts.

```{r get-raw-data,cache=FALSE}

# postgreSQL
require("RPostgreSQL", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.0")

# Establish connection to PoststgreSQL using RPostgreSQL
drv <- dbDriver("PostgreSQL")

# read data from 4Gdb
con <- dbConnect(drv, dbname=db.name,host=db.hostname,port=db.portnumber,user=db.username,password=db.password )

# all data joined
df.ad <- dbReadTable(con, c("ookla_all_data"))

# disconnect
sink <- dbDisconnect(con)

```

The raw data
---------------
We load the raw files, downloaded from the Ookla server, into individual tables per mobile platform. In the table below we count the number of speed tests per mobile platform in `r p.QuarterNumber`/`r p.Year`.

```{r raw-table}
library(knitr)
tbl <- addmargins(table(df.ad$os))

kable(as.data.frame(tbl),col.names = c('','Counts'), caption="Raw test data counts")

```

\newpage
So we start this analysis with in total `r nrow(df.ad)` speed tests, which are represented as rows in the data set commercially downloaded from Ookla.

Speedtest.net data
----------------
Ookla designed their speed test in such a way that the results are as robust as possible. Ookla’s [speedtest.net](http://www.speedtest.net/) is the de-facto standard for internet speed testing. According to http://www.ookla.com/speedtest-intelligence, Speedtest Intelligence  is the choice of nearly every Fortune 500 ISP and Mobile Provider in the world. For more information please visit [Speedtest.net](https://support.speedtest.net/hc/en-us).

Sample test data from Ookla
--------------------------
Ookla has some random sample data available, this data can be used to validate our method. To validate the test result one would need the specific data of the Netherlands.

A sample set of Ookla Speedtest Intelligence  data can be found [here](http://www.ookla.com/speedtest-intelligence). The files differ per mobile platform. The file descriptors for all three mobile platforms are listed below.

### Android header descriptives

\footnotesize

```{}
test_id - unique id of test in our system
device_id - unique device id in our system
android_fingerprint
test_date - YYYY-MM-DD HH:MM:SS in Pacific time (we can accommodate different time zones if needed)
client_ip - ip of client
download_kbps - download speed in kilobits per second
upload_kbps - upload speed in kilobits per second
latency_ms - ping in milliseconds
server_name - name of server tested to (name of city it is located in)
server_country - country name of server
server_country_code - country code of server
server_latitude - latitude of server tested to
server_longitude - longitude of server tested to
server_sponsor_name - sponsor name of server
client_country - country name of the client
client_country_code - country code of the client
client_region - region name of client (this will be state in the US)
client_region_code - region code of client
client_city - city of client
client_latitude - latitude of client (GPS or Maxmind when location services disabled)
client_longitude - longitude of client (GPS or Maxmind when location services disabled)
miles_between - miles between the client and the server tested to
connection_type - http://developer.android.com/reference/android/telephony/TelephonyManager.html
   0=unknown,1= Cell, 2=Wifi, 3=Gprs, 4=Edge, 5=Utms, 6=Cdma, 7=Evdo0, 8=EvdoA, 9=OnexRTT, 
   10=Hsdpa, 11=Hspa, 12=Iden, 13=Ehrpd, 14=EvdoB, 15=Lte, 16=Hsupa, 17=Hspap
isp_name  - name of ISP (Maxmind)
is_isp    - 0=Corporation/Academic, 1=ISP
network_operator_name - Mobile Carrier Name http://developer.android.com/reference/android
   /telephony/TelephonyManager.html#getNetworkOperatorName()
network_operator_code - MCC + MNC  http://developer.android.com/reference/android
   /telephony/TelephonyManager.html#getNetworkOperator() 
brand - http://developer.android.com/reference/android/os/Build.html#BRAND 
device - http://developer.android.com/reference/android/os/Build.html#DEVICE 
hardware - http://developer.android.com/reference/android/os/Build.html#HARDWARE 
build_id - http://developer.android.com/reference/android/os/Build.html#ID 
manufacturer - http://developer.android.com/reference/android/os/Build.html#MANUFACTURER 
model - http://developer.android.com/reference/android/os/Build.html#MODEL 
product - http://developer.android.com/reference/android/os/Build.html#PRODUCT  
cdma_cell_id - http://developer.android.com/reference/android/telephony/cdma/package-summary.html 
gsm_cell_id - http://developer.android.com/reference/android/telephony/gsm/package-summary.html 
location_type - 0 = unknown, 1 = GPS, 2 = GeoIP
sim_network_operator_name - Mobile Carrier Name from the SIM
sim_network_operator_code - MCC + MNC from the SIM  http://en.wikipedia.org/wiki/Mobile_Country_Code 
```
\normalsize

### iOS header descriptives

\footnotesize

```{ }
test_id - unique id of test in our system
device_id - unique device id in our system
test_date - YYYY-MM-DD HH:MM:SS in Pacific time (we can accommodate different time zones if needed)
client_ip - ip of client 
download_kbps - download speed in kilobits per second
upload_kbps - upload speed in kilobits per second
latency_ms - ping in milliseconds
server_name - name of server tested to (name of city it is located in)
server_country - country name of server
server_country_code - country code of server
server_latitude - latitude of server tested to
server_longitude - longitude of server tested to
server_sponsor_name - sponsor name of server
client_country - country name of the client
client_country_code - country code of the client
client_region - region name of client (this will be state in the US)
client_region_code - region code of client
client_city - city of client
client_latitude - latitude of client (GPS or Maxmind when location services disabled)
client_longitude - longitude of client (GPS or Maxmind when location services disabled)
miles_between - miles between the client and the server tested to
connection_type - 0=unknown, 1=cell, 2=wifi, 3=GPRS, 4=Edge, 5=WCDMA, 6=HSDPA, 
   7=HSUPA, 8=CDMA1x, 9=CDMAEVDORev0, 10=CDMAEVDORevB, 11=eHRPD, 12=LTE 
isp_name  - name of ISP (Maxmind) 
is_isp    - 0=Corporation/Academic, 1=ISP
carrier_name - http://developer.apple.com/library/ios/documentation/NetworkingInternet
   /Reference/CTCarrier/Reference/Reference.html#//apple_ref/occ/instp/CTCarrier/carrierName 
iso_country_code - http://developer.apple.com/library/ios/documentation/NetworkingInternet
   /Reference/CTCarrier/Reference/Reference.html#//apple_ref/occ/instp/CTCarrier/isoCountryCode 
mobile_country_code - http://developer.apple.com/library/ios/documentation/NetworkingInternet
   /Reference/CTCarrier/Reference/Reference.html#//apple_ref/occ/instp/CTCarrier/mobileCountryCode 
mobile_network_code - http://developer.apple.com/library/ios/documentation/NetworkingInternet
   /Reference/CTCarrier/Reference/Reference.html#//apple_ref/occ/instp/CTCarrier/mobileNetworkCode
model - iPad, iPhone, iPod Touch 
version - iOS version
location_type - 0 = unknown, 1 = GPS, 2 = GeoIP
```
\normalsize

### Windows Mobile header descriptives

\footnotesize

```{}
test_id - unique id of test in our system
device_id - unique device id in our system
test_date - YYYY-MM-DD HH:MM:SS in Pacific time (we can accommodate different timezones if needed)
client_ip - ip of client 
download_kbps - download speed in kilobits per second
upload_kbps - upload speed in kilobits per second
latency_ms - ping in milliseconds
server_name - name of server tested to (name of city it is located in)
server_country - country name of server
server_country_code - country code of server
server_latitude - latitude of server tested to
server_longitude - longitude of server tested to
server_sponsor_name - sponsor name of server
client_country - country name of the client
client_country_code - country code of the client
client_region - region name of client (this will be state in the US)
client_region_code - region code of client
client_city - city of client
client_latitude - latitude of client (GPS or Maxmind when location services disabled)
client_longitude - longitude of client (GPS or Maxmind when location services disabled)
miles_between - miles between the client and the server tested to
connection_type - 0=unknown, 1=cell, 2=wifi, 3=GPRS, 4=1XRTT, 5=EVDO, 6=EDGE, 7=3G, 
   8=HSPA, 9=EVDV, 10=PassThru, 11=LTE, 12=EHRPD
isp_name  - name of ISP (Maxmind) 
is_isp    - 0=Corporation/Academic, 1=ISP
carrier_name - AT&T, Verizon etc 
manufacturer - Nokia, HTC, etc.
device_name - name of the device for e.g. "HD7 T9292" 
hardware_version - device hardware version e.g. "1.0.0.0"
firmware_version - device firmware_version e.g. "1232.2107.1241.1001"
location_type - 0 = unknown, 1 = GPS, 2 = GeoIP

```

\normalsize


Step 2: Data preprocessing
==============
In order to compare the data from the three different mobile platforms, we need to perform basic data transformations and merge it into one table. 
 
Following that, in this preprocessing and analysis step we validate the data on the following points: 1. Are there any specific individual devices that perform a suspiciously high number of tests? 
2. We apply filters so only the tests from the three operators we are interested in remain.
3. We apply filters so only tests done on 4G technology remain.
4. We are only interested in the coverage area in which all three operators claim to have 4G coverage.
5. We look at speed test results that are “too good to be true” that is, measured speeds that are above the theoretical maximum possible for that specific technology. We remove these speed tests.
6. We look at specific coordinates that are very frequent, depending on the explanation as for why these coordinates are used to often we remove or delete the speed tests per coordinate.
7. We look at specific dates that have a high number of speed tests for that day.

After all these checks we end up with a data set that is cleaned and ready to perform a statistical  significance test on the investigated metrics.

## Basic data transformations

As explained in section 3.3., the data from the three different mobile platforms comes in different formats and some basic data transformations are necessary before we can merge it into one table. 

First of all, the names for the individual operators are spelled in various ways (e.g. ‘T-Mobile NL‘, ‘T-Mobile  NL’). Next, we need to map connection types to the specific technology used (2G, 3G or 4G) depending on the operating system of the device. For more details on these transformations, please see the SQL script on GitHub. 

After these transformations are performed, we proceed with the checks and cleaning steps explained at in the previous section.

## Suspicious devices

Are there any devices that perform tests very frequently?

In order to investigate this, let’s look at a frequency plot of devices that occur at least **ten** times in each month. On the right hand side we see devices that are used for testing very often, some of them even on an hourly basis. Obviously those devices are not in the hands of real customers so these will be removed from the data set.

```{r freq-plot-devices}
library(ggplot2)

# months
df.ad.m1 <- df.ad[as.Date(df.ad$test_date) >= as.Date('2015-07-01') & as.Date(df.ad$test_date) < as.Date('2015-08-01'),]
df.ad.m2 <- df.ad[as.Date(df.ad$test_date) >= as.Date('2015-08-01') & as.Date(df.ad$test_date) < as.Date('2015-09-01'),]
df.ad.m3 <- df.ad[as.Date(df.ad$test_date) >= as.Date('2015-09-01') & as.Date(df.ad$test_date) < as.Date('2015-10-01'),]

# plot grid
par(mfrow=c(3,1),mar=c(1,1,1,1),mai=c(0.1,1,0.1,0.1)) 

#Tabel voor maand 1 
tbl <- table(df.ad.m1$device_id)
tbl <- sort(tbl[tbl>10]) # only keep > 10
nms <- rep("",dim(tbl))
barplot(tbl, names.arg = nms,main=paste(p.ListQuarterMonths[1],"2015"),ylab="Number of tests",xlab="Unique devices")
tbl.m1 <- tbl # TO DO aanpassen maand

#Tabel voor maand 2
tbl <- table(df.ad.m2$device_id)
tbl <- sort(tbl[tbl>10]) # only keep > 10
nms <- rep("",dim(tbl))
barplot(tbl, names.arg = nms,main=paste(p.ListQuarterMonths[2],"2015"),ylab="Number of tests",xlab="Unique devices")
tbl.m2 <- tbl

#Tabel voor maand 3
tbl <- table(df.ad.m3$device_id)
tbl <- sort(tbl[tbl>10]) # only keep > 10
nms <- rep("",dim(tbl))
barplot(tbl, names.arg = nms,main=paste(p.ListQuarterMonths[3],"2015"),ylab="Number of tests",xlab="Unique devices")
tbl.m3 <- tbl

# max number of test to call a device suspicious
max_susp_dev = 30
```

Based on these plots and the amount of data consumed when performing a speedtest, we decide to remove specific devices that test more than `r max_susp_dev` times a month. These devices are probably used by telecom professionals for testing purposes. The amount of data consumed per speedtest depends on the speed: the higher the speed, the more data is consumed. A test on 4G at very high speed can cost up to 100 MBytes per test out of the data bundle. So, 30 or more tests per month at high speed are equivalent to approximately 3GB of data usage just spend on tests alone and that is suspicious. The number 30 by itself is subjective, we could use 25 or 40 depending on what exact number of tests per device you would call suspicious.

Actually including these are not influencing the results significantly, but we want to use real customer data as much as possible, not affected by professionals testing their own (or others) network.


```{r get-clean-data, cache=FALSE}
# read data from 4Gdb
# Establish connection to PoststgreSQL using RPostgreSQL
drv <- dbDriver("PostgreSQL")

# Full version of connection setting
con <- dbConnect(drv, dbname=db.name,host=db.hostname,port=db.portnumber,user=db.username,password=db.password ) 

# all data joined and cleaned for suspicious devices in SQL
df.clnd <- dbReadTable(con, c("ookla_all_data_clean"))

# get data for tm coverage area only
df.4G <- dbReadTable(con, c("datatm4gcoverage"))

# disconnect
sink <- dbDisconnect(con)
```

We identify `r sum(tbl.m1>=max_susp_dev)` devices in `r p.ListQuarterMonths[1]`, `r sum(tbl.m2>=max_susp_dev)` devices in `r p.ListQuarterMonths[2]` and `r sum(tbl.m3>=max_susp_dev)` devices in `r p.ListQuarterMonths[3]`, which in total represent `r nrow(df.ad)-nrow(df.clnd)` speed tests.
After filtering these devices the data set has `r nrow(df.ad)`-`r nrow(df.ad)-nrow(df.clnd)`=`r nrow(df.clnd)` speed test cases.

Top three operators
-------------------
For this analysis we are only interested in the top three operators in the Netherlands. In the data set, at this point, there are `r length(levels(as.factor(df.clnd$operator)))` different operators identified. As we can see in the table below, in which we ranked the ten most used operators, most of the speed tests were performed by people using one of the three operators we are interested in. There are 7137 tests with no identifiable operator. We will filter out all but the top 3 operators and proceed with speed tests from these top three operators.

```{r top-operators}
top10 <- head(sort(table(df.clnd$operator),decreasing = TRUE),n=10)
kable(as.data.frame(cbind(rownames(top10),top10),row.names = FALSE),col.names = c('Operator','Number of speedtests'),align=c("l","r"),caption="Most frequent operators")

# filter on top three operators only
ops <- c("T-Mobile NL","Vodafone NL","KPN NL")
df.clnd.t3 <- df.clnd[df.clnd$operator %in% ops,]

```

The top three operators together are good for `r sum(tail(sort(table(df.clnd$operator)),n=3))` tests conducted all over the Netherlands in the test period of `r p.QuarterNumber` `r p.Year`. We keep only speed tests from the top three operators. 

This leaves `r nrow(df.clnd)` - `r nrow(df.clnd)-nrow(df.clnd.t3)` = `r nrow(df.clnd.t3)` rows in the data set.

Focus on 4G technology
----------------
In the raw Ookla Speedtest Intelligence  data the variable called 'connection_type' identifies which technology is used, this variable can be transformed into the network technology used while performing the test.

The variable Connection type defines 4G as connection type 15 for Android. For iOS connection type 12 is LTE, and for Windows Mobile connection type 11 is LTE.

Definition of 4G for Android OS from the SQL script(available on GitHub) :

```sql
  Case  WHEN  CONNECTION_TYPE=0			THEN 'UNKNOWN'
  	WHEN	CONNECTION_TYPE in (1,2)		THEN 'WIFI/CELL'
  	WHEN	CONNECTION_TYPE in (3,4)		THEN '2G'
  	WHEN	CONNECTION_TYPE=15			THEN '4G'
  	WHEN	CONNECTION_TYPE between 5 and 17	THEN '3G'
  	ELSE	'UNKNOWN'
  END AS TECHNOLOGY
```
Below we give an overview of the network technology types available in the data set.

```{r tech}
library(knitr)
tbl <- table(df.clnd.t3$technology)
kable(as.data.frame(cbind(tbl,100*prop.table(tbl))),col.names = c('Number of cases','Percentage'), digits = 2, caption="Technology used in tests")

# number of 4G tests to be used in document
n4g <- tbl["4G"]
```

**In the remainder of this analysis we will focus on 4G technology.**

Filtering on 4G technology leaves `r n4g` test cases in the data set.

Operating systems
------------------
For the top three operators we can look at the type of operating system used on these devices:

```{r top3-table}
library(knitr)
tbl <- table(df.clnd.t3$os,df.clnd.t3$operator)
kable(tbl)
```

Most of the tests were conducted on iOS closely followed by Android OS. Windows Mobile devices have limited representation in the data set. In this test we are not interested in testing the difference in performance per device or operating system.

Geographical coverage area for 4G
------------------
For this test to be fair to all three operators, we limit the comparison of the test to areas in which all three operators (KPN, Vodafone and T-Mobile) claim to have 4G coverage at the time of the measurements. While KPN and Vodafone already claim national 4G coverage, T-Mobile is still in the process of expanding their 4G network. Therefore, T-Mobile 4G coverage area is extended every month. This means that some areas only got 4G coverage during `r p.QuarterNumber` `r p.Year`, the period of the test. All of the top 21 cities in the Netherlands, which we will analyze per city,  had 4G enabled prior to `r p.QuarterNumber ` `r p.Year`.

### Coordinates with very high number of tests

Are there any locations, or coordinates, that occur very often in the investigated 4G area?
If we join the coordinates latitude and longitude together and look at the most frequent occurrences we see that there are indeed some coordinates that are very frequent. How do these exact same coordinates end up in the data? To understand this we need to explain a bit more on how the Ookla Speed test application gets the coordinates from a mobile device([read more online](https://support.speedtest.net/hc/en-us/articles/203845480-Mobile-Test-Server-Selection)). 
There are several scenario's that can be the case: 
1)The customer has approved the application access to the GPS coordinates of his/her device.
2)For some reason the app cannot read the GPS coordinates from the device at the time of the test. This reason can be of different origins, the user has blocked access or we are in a building or there are other technical reasons why the exact GPS coordinates cannot be accessed.

Whenever the exact coordinates are not available, due to measurement issues or because the customer is not allowing the application to use the GPS coordinates Ookla uses GEO-IP. GEO-IP is a online service to estimate the physical location of an ip-internet address (more online [from maxmind](https://www.maxmind.com/en/geoip2-services-and-databases) ).

```{r susp-coord}

# combine coordinates in "lat-lon" 
str.coord <- paste(df.4G$client_latitude,",",df.4G$client_longitude)

# top 10 most frequent locations
tbl <- head(sort(table(str.coord),decreasing = TRUE),n=15)
nms <- names(tbl)
df.tmp <- as.data.frame(cbind(nms,tbl))
names(df.tmp) <- c("Coordinates","Count")
kable(df.tmp, row.names = FALSE,align=c("l","r"))

coord <- names(sort(table(str.coord),decreasing = TRUE)[1])
n.test.at.coord <- sort(table(str.coord),decreasing = TRUE)[1]

```

We asked for an explanation from Ookla on frequent and rounded coordinates.

Coordinate (52.3667,4.9)  : Response from Ookla: *Results are definitely coming from GEO-IP Coordinates as you know. 2. In a single day here, there is 191 results from the same IP block to this location. 3. Correct, this is similar to the 'Kansas' issue we discussed last year.*

Coordinate (52.35-4.9167) is in Amsterdam. Question to Ookla: Can we still assume the measurements are from Amsterdam?
Response Ookla: *Yes, GeoIP is used here. We can assume you are in Amsterdam but like any other GeoIP location result, the confidence level isn't as high as it would be if we were able to get location information directly from the device, meaning if we were able to obtain GPS instead of GEO-IP.* 

Coordinate (52.3666-4.9027) is also in Amsterdam(Waterloo plein).

Coordinate (52.374-4.8897) is also in Amsterdam(Spuistraat).

Coordinate (51.9167-4.5) is in Rotterdam. Again with limited precision, same response as above from Ookla.

Coordinate (52.0666-4.3209) is in The Hague. All cases are tests with operator equal to "T-Mobile NL" also this location is close to the T-Mobile office in The Hague. We will exclude these tests as potentially being from T-Mobile employees.

Also see the [precision of the coordinates](https://en.wikipedia.org/wiki/Decimal_degrees) denotes the fact that we are unsure about the exact location.

**So what do we do with these suspicious coordinates?**
The "Unknown" location, which has coordinates (`r coord`) and is the default location from Ookla if the GPS cannot pick up the exact location during the test, we have `r n.test.at.coord` speed tests with this coordinate alone. We will exclude tests performed at this coordinate from the general analysis. Nevertheless, we will analyse this set the same way we analyse individual cities, the result for this set can be found as the last city labeled "unknown".

```{r default-location}
# filter unkown location
df.unkwnloc <- df.4G[paste(df.4G$client_latitude,",",df.4G$client_longitude)==coord,]

# filter out T-Mobile office 52.0666-4.3209
coord.tm <- "52.0666 , 4.3209"
df.tm <- df.4G[paste(df.4G$client_latitude,",",df.4G$client_longitude)==coord.tm,]

# filter out KPN office 52.0666 4.3479
coord.kpn <- "52.0666 , 4.3479"
df.kpn <- df.4G[paste(df.4G$client_latitude,",",df.4G$client_longitude)==coord.kpn,]

# filter out Vodafone office 52.3767 4.9061
coord.vf <- "52.3767 , 4.9061"
df.vf <- df.4G[paste(df.4G$client_latitude,",",df.4G$client_longitude)==coord.vf,]

# total number of tests around head offices
n.ho.tot = nrow(df.tm)+nrow(df.kpn)+nrow(df.vf)

# filter top 3 head offices location from 4G 
df.4G <- df.4G[setdiff(rownames(df.4G),rownames(df.tm)),]
df.4G <- df.4G[setdiff(rownames(df.4G),rownames(df.kpn)),]
df.4G <- df.4G[setdiff(rownames(df.4G),rownames(df.vf)),]

# replace ookla default location with UNKNOWN
df.unkwnloc$gm_naam <- "Unkown Location"

# filter unkown location from 4G as we want only known locations
df.4G <- df.4G[setdiff(rownames(df.4G),rownames(df.unkwnloc)),]

```

**Head office locations**
We removed `r nrow(df.tm)` tests from coordinates "52.0666 , 4.3209" in The Hague. As they are close to the head-office of T-Mobile and indeed all tests from this location are done from a T-Mobile network.

We did the same for locations close to the head offices of KPN and Vodafone. We removed `r nrow(df.kpn)` tests from the co-ordinates "52.0666 4.3479" (KPN office) and we removed `r nrow(df.vf)` tests from the coordinates "52.3767 4.9061"" (Vodafone office Amsterdam).

After removing `r n.ho.tot` tests from the coordinates around the head-offices of the top three providers as described in the above paragraph, together with  `r nrow(df.unkwnloc)` from the coordinates "`r coord`" the data set contains `r n4g` - `r n.ho.tot + nrow(df.unkwnloc)`  = `r n4g - n.ho.tot - nrow(df.unkwnloc)` speed tests at this point.


### Mapping test coordiantes to city boundaries
To identify the exact 4G coverage area we will use in this analysis, we use data from CBS. CBS is the Central Bureau of Statistics in the Netherlands. They provide [publicly available](http://www.cbs.nl/nl-NL/menu/themas/dossiers/nederland-regionaal/publicaties/geografische-data/archief/2014/2013-wijk-en-buurtkaart-art.htm) polygon data on cities in the Netherlands. Based on these geographical city boundaries we map each latitude, longitude coordinate onto a city.

From T-Mobile we received a list of cities that, at the time of testing,  have 4G coverage. From the data we can see that per city each provider has sufficient number of speed tests in the data set for the tests to be representative. 

We use city boundaries to not be influenced by exact locations of network infrastructure.

Online we can get an up to date overview of network coverage for all three operators (see [here](http://www.4gdekking.nl/)).

This test is not about coverage but about speed of the 4G network. 

T-Mobile has the least 4G coverage of the three operators and per end of `r p.QuarterNumber` `r p.Year` has actual coverage in the following area:

```{r area }
library(png)
library(grid)
img <- readPNG("./img/tm-coverage-q3-2015.png")
grid.raster(img)

```

In order to do a fair comparison, we focus only on the area presented in the picture above.

We do a Geo-spacial filter on the latitude and longitude coordinates provided in the data set and only keep speed tests that are in any of the above city boundaries. For the area, defined above, the following number of tests are available in the data set.
```{r}
tbl <- table(df.4G$operator)
mt <- addmargins(tbl)
pt <- c(as.numeric(prop.table(tbl)),1.0)
kable(as.data.frame(cbind(mt,100*pt)),col.names = c('Number of tests','percentage'), digits = 2,caption="Number of 4G speedtests in the selected coverage area.")

```

Areas where T-Mobile announced 4G coverage in `r p.QuarterNumber` `r p.Year`: 

July: Heerenveen, Weststellingwerf, Hoogeveen, Meppel, Staphorst, Asten, Boekel, Deurne, Someren, Venray, Gemert-Bakel, Heeze-Leende, De Wolden, Westerveld and Midden-Drenthe
 
August: Leeuwarden, Graft-De Rijp, Beemster, Bergen (NH.), Castricum, Edam-Volendam, Enkhuizen, Heerhugowaard, Heiloo, Den Helder, Hoorn, Langedijk, Medemblik, Opmeer, Schagen, Texel, Schermer, Zeevang, Drechterland, Stede Broec, Waterland, Beesel, Nederweert, Venlo, Weert, Horst aan de Maas, Koggenland, Leudal, Cranendonck, Peel en Maas and Hollands Kroon
 
September: Emmen, Zeewolde, Skarsterlân, Lemsterland, Kampen, Noordoostpolder, Urk, Elburg, Oldebroek, Dronten, Baarle-Nassau, Etten-Leur, Rucphen, Steenbergen, Woensdrecht, Zundert, Halderberge, Roosendaal, Steenwijkerland, Alphen-Chaam and Zwartewaterland

All other cities in the TMNL coverage area had 4G enabled prior to Q3 2015.

Naturally, in the cities where 4G was announced in `r p.QuarterNumber`, almost all of T-Mobile’s 4G speedtests occur after announcing to customers that 4G is activated, so `r p.ListQuarterMonths[1]`, `r p.ListQuarterMonths[2]` or `r p.ListQuarterMonths[3]` respectively. Even though a very small number of tests were executed during the extension of the 4G coverage onto these cities, as 4G sites were being added (so before announcing 4G coverage in these cities; this is possible as for most 4G capable phones, the 4G network is selected automatically if available). In the same cities, KPN and Vodafone speedtests are distributed more evenly over the entire period of `r p.QuarterNumber`/`r p.Year`. However, this has no influence on the test results, as the test results per month do not differ substantially (please see section 4.6.4). 

So we filtered the data set to include only speed tests from the coverage area, speed tests outside this area are neglected.
The data set now contains `r n4g - n.ho.tot - nrow(df.unkwnloc)` - `r n4g - n.ho.tot - nrow(df.unkwnloc) - nrow(df.4G)` = `r nrow(df.4G)` speed tests.

### Suspicious speeds

In the data we check for up and download speeds that are technically impossible. 
*Download speeds* for 4G are limited to 150Mbps on the T-mobile technology.

KPN and Vodafone have a technology called LTE advanced which has a maximum download speed of 225Mbps.

Any speed tests that had a speed recorded above the technical maximum for that operator was removed from the data set.
 
```{r susp-down-speed}
# count number of cases that exceed theoretical max speed
n.max.tm <- nrow(subset(df.4G, download_kbps>150000 & operator == "T-Mobile NL"))
n.max.vf <- nrow(subset(df.4G, download_kbps>225000 & operator == "Vodafone NL"))
n.max.kpn <-nrow(subset(df.4G, download_kbps>225000 & operator == "KPN NL"))

# create filters 
rw.fltr.1 <- rownames(subset(df.4G, download_kbps>150000 & operator == "T-Mobile NL"))
rw.fltr.2 <- rownames(subset(df.4G, download_kbps>225000 & operator == "Vodafone NL"))
rw.fltr.3 <- rownames(subset(df.4G, download_kbps>225000 & operator == "KPN NL"))

# filter out these extremes
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.1),]
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.2),]
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.3),]

# filter also applied to seperate data in df.unkwnloc ...
unk.fltr.1 <- rownames(subset(df.unkwnloc, download_kbps>150000))
df.unkwnloc <- df.unkwnloc[setdiff(rownames(df.unkwnloc),unk.fltr.1),]

```

So we remove suspicious measurements in which the **download** speed exceeded the maximum theoretical speed per individual operator. 

For T-Mobile we removed `r n.max.tm` cases, for Vodafone we removed `r n.max.vf` cases and for KPN we removed `r n.max.kpn` cases, because they where above 150(or 225) MBps.

After removing in total `r n.max.tm+n.max.vf+n.max.kpn` of these suspicious measurements, the data set contains `r nrow(df.4G)` speed tests at this point.

Let's do the same for *upload speed*.

The maximum theoretical upload speed is for all operators the same: maximum upload speed of 50Mbps.

```{r susp-upload-speed}
# create filters 
rw.fltr.4 <- rownames(subset(df.4G, upload_kbps>50000 & operator == "T-Mobile NL"))
rw.fltr.5 <-rownames(subset(df.4G, upload_kbps>50000 & operator == "Vodafone NL"))
rw.fltr.6 <-rownames(subset(df.4G, upload_kbps>50000 & operator == "KPN NL"))

# filter out these extremes
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.4),]
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.5),]
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.6),]

# filter also applied to seperate data in df.unkwnloc ...
unk.fltr.2 <- rownames(subset(df.unkwnloc, upload_kbps>50000))
df.unkwnloc <- df.unkwnloc[setdiff(rownames(df.unkwnloc),unk.fltr.2),]

```

So again we remove suspicious measurements in which the **upload** speed exceeded the maximum theoretical speed per individual operator. For T-Mobile we removed `r length(rw.fltr.4)` cases, for Vodafone we removed `r length(rw.fltr.5)` cases and for KPN we removed `r length(rw.fltr.6)` cases.

After removing in total `r length(rw.fltr.4)+length(rw.fltr.5)+length(rw.fltr.6)` of these suspicious measurements the data set contains `r nrow(df.4G)` speed tests at this point.

Let's look at **latency** also. We see the maximum value in latecy measurements is 65535, which is the maximum value for a variable of type unsigned short. This is clearly an outlier and should be removed. 

Response Ookla: *"That issue was resolved in the latest Android release. As more and more clients download the latest version of the app, this will become less and less obvious. Just filter out for now until its cleaned up in the files."*

```{r susp-latency-speed}
# count number of cases that exceed theoretical max speed
n.l.mx.tm <- nrow(subset(df.4G, latency==65535 & operator == "T-Mobile NL"))
n.l.mx.vf <- nrow(subset(df.4G, latency==65535 & operator == "Vodafone NL"))
n.l.mx.kpn <-nrow(subset(df.4G, latency==65535 & operator == "KPN NL"))

# create filters 
rw.fltr.7 <- rownames(subset(df.4G, latency==65535 & operator == "T-Mobile NL"))
rw.fltr.8 <- rownames(subset(df.4G, latency==65535 & operator == "Vodafone NL"))
rw.fltr.9 <- rownames(subset(df.4G, latency==65535 & operator == "KPN NL"))

# filter out these extremes
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.7),]
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.8),]
df.4G <- df.4G[setdiff(rownames(df.4G),rw.fltr.9),]

# filter also applied to seperate data in df.unkwnloc ...
unk.fltr.3 <- rownames(subset(df.unkwnloc, latency==65535))
df.unkwnloc <- df.unkwnloc[setdiff(rownames(df.unkwnloc),unk.fltr.3),]

```

So we remove `r n.l.mx.tm` measurements from the T-Mobile set , `r n.l.mx.kpn` measurements from the KPN set and `r n.l.mx.vf` measurements from the Vodafone set. After this the data set contains `r nrow(df.4G)` speed tests at this point.


\newpage

### Suspicious dates or times
We count the number of tests per day for the months `r p.ListQuarterMonths[1]`, `r p.ListQuarterMonths[2] ` and `r p.ListQuarterMonths[3]` of `r p.Year`. Again we are looking at any suspicious peaks in the data.
```{r timestamps, fig.align='center',fig.width=8}
library(ggplot2)
library(reshape2)

#Timeseries
counts <- table(df.4G$operator, df.4G$test_date)

df.cnts <- as.data.frame(counts)
names(df.cnts) <- c("operator","date","freq")

# Faceting is a good alternative:
ggplot(df.cnts, aes(x=date,y=freq)) + geom_bar(width=.5,stat="identity") +
  facet_wrap(~ operator, nrow = 3) +
  theme(axis.text.x  = element_text(angle=90, vjust=0.5, size=6))

```

We find no disturbing or unknown peaks on a specific date, except for Vodafone on the 18/08/2015. This can be explained by the network incident that Vodafone had on this date, causing customers to experience slow or no internet. The problem was fixed the same day and it was communicated to customers, which is a stimulus to test.

```{r}
# table
tbl <- addmargins(table(df.4G$operator,paste(format(as.Date(df.4G$test_date, "%Y-%m-%d"), "%m"), format(as.Date(df.4G$test_date, "%Y-%m-%d"), "%B"))))
kable(tbl,caption="Counts per operator per month")

```

Above a count per month per operator in the 4G coverage area. This is the data set on which we conduct the remainder of the analysis.

In the tables below we list the averages for download speed, upload speed and latency per operator per month. For KPN we see a large increase in download and upload speed between August and September, as well as reduced latency. This is probably a result of KPN adding a substantial amount of sites in the large cities by activating 4G on the 1800Mhz frequency. This also enables them to activate LTE Advanced on those locations, which has an additional positive influence on average download speed, because it allows download speeds of up to 225Mbps, provided that the customer doing the tests has a device that can support this and the network conditions at the location allow this.

```{r}
library(plyr)
# add month
df.4G$month <- paste(format(as.Date(df.4G$test_date, "%Y-%m-%d"), "%m"),format(as.Date(df.4G$test_date, "%Y-%m-%d"), "%B"))
#df.4G$month <- format(as.Date(df.4G$test_date, "%Y-%m-%d"), "%m")
df.t<-ddply(df.4G,.(operator,month),summarize, mean_dwn=round(mean(download_kbps),1))

# reshape df to matrix
library(reshape)
df.t2 <- acast(df.t,operator ~ month)

kable(df.t2, caption="Average download speed(Kbps) per operator per month")

```

Table below shows details for upload speed.
```{r}
library(plyr)
df.t<-ddply(df.4G,.(operator,month),summarize, mean_up=round(mean(upload_kbps),1))
# reshape df to matrix
df.t2 <- acast(df.t,operator ~ month)

kable(df.t2, caption="Average upload speed(Kbps) per operator per month")

```

Table below shows details for latency.
```{r}
library(plyr)
df.t<-ddply(df.4G,.(operator,month),summarize, mean_up=round(mean(latency),1))

# reshape df to matrix
library(reshape)
df.t2 <- acast(df.t,operator ~ month)

kable(df.t2, caption="Average latency(ms) per operator per month")

```

\newpage

Step 3: Data analysis
=============
So we have pre-processed the data and looked for anomalies in the data. If found, we have corrected them. Finally we are ready to compare speed test data between the three major telecom operators in Netherlands in the above defined coverage area for the period of `r p.QuarterNumber` `r p.Year`. 

We  will analyse three different metrics:

* Download speed
* Upload speed
* Latency

There is no useful way to aggregate these individual metrics into one overall ‘speed aggregated score’. Most customers are interested in download speeds, because it affects the most of their experience (browsing,streaming,downloading etc).  Most network speed comparisons only focus on download speed. However, since upload speed is also important for posting video’s and photos on social media, and ping times are important for gaming and fast opening of websites these metrics are also analyzed.

So how are the different metrics distributed?

Histogram distributions
----------------------
A histogram is a graphical representation of the distribution of data. It is an estimate of the probability distribution of a continuous variable (quantitative variable), [more](https://en.wikipedia.org/wiki/Histogram) about histograms on Wikipedia.

### Download speed

In the histogram below we see download speed in Kbps on the horizontal axis. The number of test cases are plotted as bars, on the vertical axis we see the count of the number of speed tests in a specific bin (or range). The histogram gives a visual representation of the distribution of the data.

```{r downloadspeed, fig.cap='Histogram download speed per operator', fig.height=3, fig.pos='H' }
library(plyr)
cdf <- ddply(df.4G, "operator", summarise, download_kbps.mean=mean(download_kbps))

# With mean lines, using cdf from above
ggplot(df.4G, aes(x=download_kbps)) + geom_histogram( colour="black", fill="white", binwidth = 5000) + 
    facet_grid(operator ~ .) +
    xlab("Download speed (Kbps)") +
    ylab("Number of speedtests (count)") +
    geom_vline(data=cdf, aes(xintercept=download_kbps.mean),
               linetype="solid", size=1, colour="red")

```

We plot the three histograms, one for each operator, right above each other so the horizontal axes are aligned. The red lines denotes the mean (or average) speed of all speed tests for the specific operator. If the red line is placed to the righthandside in this histogram that means the average speed for this operator is faster. The red line to the left means the avarage speed is slower. We can see that the red line of T-Mobile is farthest to the right so T-Mobile apparently has the highest average download speed. 

### Upload speed

In the histogram below we see upload speed in Kbps on the horizontal axis. The number of test cases are plotted as bars, on the vertical axis we see the count of the number of speed tests in a specific bin (or range). The histogram gives a visual representation of the distribution of the data.

```{r uploadspeed, fig.cap='Histogram upload speed per operator', fig.height=3 }
library(plyr)
cdf <- ddply(df.4G, "operator", summarise, upload_kbps.mean=mean(upload_kbps))

# With mean lines, using cdf from above
ggplot(df.4G, aes(x=upload_kbps)) + geom_histogram( colour="black", fill="white", binwidth = 1000) + 
    facet_grid(operator ~ .) +
    xlab("Upload speed (Kbps)") +
    ylab("Number of speedtests (count)") +
    geom_vline(data=cdf, aes(xintercept=upload_kbps.mean),
               linetype="solid", size=1, colour="red")

```

We plot the three histograms, one for each operator, right above each other so the horizontal axes are aligned. The red lines denotes the mean (or average) speed of all speed tests for the specific operator. If the red line is placed to the righthandside in this histogram that means the average speed for this operator is faster. The red line to the left means the avarage speed is slower. We can see that the red line of T-Mobile is farthest to the right so T-Mobile apparently has the highest average upload speed as well. 

### Latency

In the histogram below we see latency speed on the horizontal axis on a logaritmic scale. The log transformation makes the figure more readable. The number of test cases are plotted as bars, on the vertical axis we see the count of the number of speed tests in a specific range. For latency we take the log so the outlines scale and we can have a look at the shape of the distribution.

```{r latency, fig.cap='Histogram latency per operator', fig.height=3 }
library(plyr)

cdf <- ddply(df.4G, "operator", summarise, latency.mean=mean(latency))

# With mean lines, using cdf from above
ggplot(df.4G, aes(x=latency)) + geom_histogram( colour="black", fill="white", binwidth = 0.02) +  
    scale_x_log10(limits=c(10, 1000)) +
    facet_grid(operator ~ .)  +
    xlab("Latency (ms) log scale") +
    ylab("Number of speedtests (count)") +

     geom_vline(data=cdf, aes(xintercept=latency.mean),
                linetype="solid", size=1, colour="red")
```

For latency smaller is better, so in this plot we are looking at which operator has the smallest latency. Again we see the red lines per operator. The average latency(red line) for T-Mobile is the most to the left, which means T-Mobile has the smallest average latency of the three operators.

Box-plot
---------
In descriptive statistics, a box plot or box-plot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points. [More](https://en.wikipedia.org/wiki/Box_plot) about box-plots on Wikipedia.

### Download speed
We see a box plot for each of the operators on the x axis, the vertical axis shows the speed test values in Kbps. The daimond shape represents the mean, the thick line represents the median. the black dots on the top and bottom represent extreme cases. The data is split up into quartiles, that means four equally sized proportions. The first and fourth quarter are represented as a line, the second and third as a box.

```{r box-down, fig.cap='Boxplot download speed per operator',fig.height=3}
# boxplot
ggplot(df.4G, aes(x=operator, y=download_kbps, fill=operator)) + 
  geom_boxplot() +
  ylab("Download speed (Kbps)") +
  guides(fill=FALSE) + stat_summary(fun.y=mean, geom="point", shape=5, size=4)

```

Also in the box-plot we see that for T-Mobile the diamond shape(average) and the thick line representing the median are higher then the same values for the other operators. So also in this box-plot for download speed we see that T-Mobile has the highest average and median download speed.

If we zoom in on percentiles we can look at the fastest 10%, 5% and 1% of speed tests:
```{r quant-down}
qdwn <- ddply(df.4G,.(operator), function(x) round(quantile(x$download,c(.9,.95,.99))))
names(qdwn) <- c("operator","10%","5%","1%")
kable(qdwn, caption="Top percentiles average download speed(Kbps)")
```

From the table we see that T-Mobile scores best in the 10%  of download speed tests per operator. For the 5% percentile the T-Mobile outcome is very close to the KPN outcome but both better then Vodafone.

In the top 1% of download speed tests KPN scores best, followed by Vodafone. Both these operators they have the LTE-Advanced technology in some locations mostly in the largest cities. This technology is supported by a small number of devices. In excellent radio conditions and in case the user has a suitable device, download speeds higher than 150MBps (up to 225Mbps) can be achieved. This is the case in less than 1% of all tests of KPN and Vodafone, but also the main reason why the top 1% for these two operators is higher than T-Mobile’s. The LTE-advanced technology does not contribute to higher upload speeds nor lower latency.

### Upload speed
We see a box plot for each of the operators on the x axis, the vertical axis shows the speed test values in Kbps. The daimond shape represents the mean, the thick line represents the median. the black dots on the top and bottom represent extreme cases. The data is split up into quartiles, that means four equally sized proportions. The first and fourth quarter are represented as a line, the second and third as a box.

```{r box-up, fig.cap='Boxplot upload speed per operator',fig.height=3}
# boxplot
ggplot(df.4G, aes(x=operator, y=upload_kbps, fill=operator)) + 
  geom_boxplot() +
    ylab("Upload speed (Kbps)") +
    guides(fill=FALSE) + stat_summary(fun.y=mean, geom="point", shape=5, size=4)

```

Also in the box-plot we see that for T-Mobile the diamond shape(average) and the thick line representing the median are higher then the same values for the other operators. So also in this box-plot for upload speed we see that T-Mobile has the highest average and median upload speed.

If we zoom in on percentiles we can look at the fastest 10%, 5% and 1% of speed tests:
```{r quant-up}
qdwn <- ddply(df.4G,.(operator), function(x) round(quantile(x$upload,c(.9,.95,.99))))
names(qdwn) <- c("operator","10%","5%","1%")
kable(qdwn, caption="Top percentiles average upload speed(Kbps)")
```

From the table we see that T-Mobile scores best in the fastest 10%,5% and 1% of upload speed tests per operator. 

### Latency
For latency ( or ping) we take the log so the outliers scale and we can have a look at the shape of the distribution.

We see a box plot for each of the operators on the x axis, the vertical axis shows the speed test values in Kbps. The daimond shape represents the mean, the thick line represents the median. the black dots on the top and bottom represent extreme cases. The data is split up into quartiles, that means four equally sized proportions. The first and fourth quarter are represented as a line, the second and third as a box.

```{r box-latency, fig.cap='Boxplot latency speed per operator',fig.height=3}
# boxplot
ggplot(df.4G, aes(x=operator, y=latency, fill=operator)) + 
    geom_boxplot() +
    scale_y_log10() + ylab("Latency (ms) log scale") +
    guides(fill=FALSE) + stat_summary(fun.y=mean, geom="point", shape=5, size=4)

```

In the box-plot for latancy we see that for T-Mobile the diamond shape(average) and the thick line representing the median are lower then the same values for the other operators. Remember, for latancy lower is better. 

If we look at the fastest 1%, 5% and 10% percentiles we see the following table:

```{r quant-lat}
qlat <- ddply(df.4G,.(operator), function(x) round(quantile(x$latency,c(.01,.05,.1)),2))
names(qlat) <- c("operator","1%","5%","10%")
kable(qlat, caption="Top percentiles average latency (ms)")
```


\newpage

Step 4: Test design
=========
What we want to test is if, on average, a customer that uses T-Mobile 4G Mobile Network gets a higher average speed(in terms of download speed, upload speed and latency) than a customer using KPN’s or Vodafone’s 4G Mobile Network, with all else equal. To do this we have collected thousands of Ookla Speedtest results taken from the three top operators (KPN, Vodafone and T-Mobile) which have been filtered as set out in the above to a final dataset consisting of `r nrow(df.4G)` data points. Now we want to compare T-Mobile with the other two operators. So we do two tests: the first is comparing T-Mobile with KPN and the second is comparing T-Mobile with Vodafone. In each test we compare all three metrics: Upload speed, Download speed and latency(or ping). For each operator we have a sample set available in the data. These sets are so called samples (from the Dutch population of mobile phone users) from which we calculate the sample means. Now our statistical test tests if these sample means are significantly different from one another.

In practice, the Central Limit Theorem assures us that, under a wide range of assumptions, the distributions of the two sample means being tested will themselves approach Normal distributions as the sample sizes get large, regardless (this is where the assumptions come in) of the distributions of the underlying data. As a consequence, as the sample size gets larger, the difference of the means becomes normally distributed, and the requirements necessary for the t-statistic of an unpaired t-test to have the nominal t distribution become satisfied.

Which statistical test do we need?
-----------
What we have here is a set of unpaired, independent, different sample size, different variance data. A suitable and powerful test for this kind of data is a [Welch t-test](https://en.wikipedia.org/wiki/Welch%27s_t_test).

In statistics, Welch's t-test (or Welch-Aspin Test) is a two-sample location test, and is used to check the hypothesis that two populations have equal means(our NULL hypothesis). Welch's t-test is an adaptation of Student's t-test, and is intended for use when the two samples have possibly unequal variances(which is the case here). These tests are often referred to as "unpaired" or "independent samples" t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping(in our case the units are different people performing the test with different devices on different networks). 

```{r set-confidence-level}
conflevel <- 0.99
```

###Significance

So when is a test significant? And if so at what level? And furthermore can we qualify such a significant result as good or bad? To start with the last remark, all qualifications of a statistical result are subjective. One way of looking at 95% confidence is that 1 out of 20 trials (in 5% of the cases) you make a so called Type 1 error, in which you wrongly reject the null-hypothesis. So in this case, if the p-value would be 0.05(confidence level 95%) you would claim that operator x is faster then operator y while in fact they were not. In applied practice, confidence intervals are typically stated at the 95% or 99% confidence level (More on [significance](https://en.wikipedia.org/wiki/Statistical_significance) ). 

In our test we will set the confidence level to be `r 100*conflevel`%, which is more strict then 95%. This means we will reject the Null Hypothesis only if we are `r 100*conflevel` % confident we do not make a mistake. From the test result we see that in most cases the calculated  p-values are very much smaller then 1 - `r conflevel` = `r 1-conflevel`, so changes of making this type of error are even considerably smaller than the claimed confidence level of `r 100*conflevel`%. 

### P-value
In statistics, the p-value is a function of the observed sample results (a statistic) that is used for testing a statistical hypothesis. Before performing the test a threshold value is chosen, called the significance level of the test, traditionally 5% or 1% and denoted as $\alpha$. If the p-value is equal or smaller than the significance level ($\alpha$), it suggests that the observed data is inconsistent with the assumption that the null hypothesis is true, and thus that hypothesis must be rejected and the alternative hypothesis is accepted as true ( see [wikipedia](https://en.wikipedia.org/wiki/P-value)).

### Confidence intervals
Confidence intervals consist of a range of values (interval) that act as good estimates of the unknown population parameter. The level of confidence of the confidence interval would indicate the probability that the confidence range captures this true population parameter given a distribution of samples. This value is represented by a percentage, so when we say, "we are `r 100*conflevel`% confident that the true value of the parameter is in our confidence interval", we express that 99% of the observed confidence intervals will hold the true value of the parameter. A confidence interval does **not** predict that the true value of the parameter has a particular probability of being in the confidence interval given the data actually obtained. (see [wikipedia](https://en.wikipedia.org/wiki/Confidence_interval)).


\newpage

Step 5: Test results coverage area
===========
We test T-Mobile against the other two operators so we have two tests. We put the confidence level to `r 100*conflevel` %. Our null-hypothesis is that the means are drawn from the same sample, so they are not different.

In this test we use the whole cleaned data set, in the next chapter we test each individual city. 

Let's see what our test results are:

```{r absolute-test-result, cache=FALSE, results='asis'}
library(xtable)

# do test result use sourced function
df.tr.dwnload <- fn.ttest(df.4G,"download_kbps",conflevel)
df.tr.upload <- fn.ttest(df.4G,"upload_kbps",conflevel)
df.tr.latency <- fn.ttest(df.4G,"latency",conflevel)

# print test result 
print(xtable(df.tr.dwnload,format = "pandoc" ,  caption = "Comparison of means for metric: Download (Kbps)",row.names=FALSE, booktabs=TRUE), size="\\footnotesize",include.rownames=FALSE,comment=FALSE)
print(xtable(df.tr.upload,format = "pandoc" ,  caption = "Comparison of means for metric: Upload (Kbps)",row.names=FALSE), size="\\footnotesize",include.rownames=FALSE,comment=FALSE)
print(xtable(df.tr.latency,format = "pandoc",  caption = "Comparison of means for metric: Latency (ms)",row.names=FALSE), size="\\footnotesize",include.rownames=FALSE,comment=FALSE)


```

```{r relative conslusions}
# sometimes R is a bit cumbersome...
d1 <- as.numeric(as.character(df.tr.dwnload["T-Mobile vs KPN","Diff(Kbps)"]))
r1 <- as.numeric(as.character(df.tr.dwnload["T-Mobile vs KPN","Rel(%)"]))
l1 <- min(as.numeric(as.character(df.tr.dwnload["T-Mobile vs KPN","Mean 1"])),as.numeric(as.character(df.tr.dwnload["T-Mobile vs KPN","Mean 2"])))

d2 <- as.numeric(as.character(df.tr.dwnload["T-Mobile vs Vodafone","Diff(Kbps)"]))
r2 <- as.numeric(as.character(df.tr.dwnload["T-Mobile vs Vodafone","Rel(%)"]))
l2 <- min(as.numeric(as.character(df.tr.dwnload["T-Mobile vs Vodafone","Mean 1"])),as.numeric(as.character(df.tr.dwnload["T-Mobile vs Vodafone","Mean 2"])))

```

**Explanation of terms**

\footnotesize

**Sample 1**: Number of speed test samples for operator 1.

**Sample 2**: Number of speed test samples for operator 2.

**Mean 1**: Average speed of speed tests for operator 1 in Kbps. A high number here means that this operator has a fast download(or upload) speed. For the latency a high number means you have to wait longer to contact webpages or servers.

**Mean 2**: Average speed of speed tests for operator 2 in Kbps. A high number here means that this operator has a fast download(or upload) speed. For the latency a high number means you have to wait longer to contact webpages or servers.

**P-value**: The test statistic, for more explanation see the paragraph P-Value above

**Sign.**: Short for Significance. We compare the test statistic with the predefined confidence level(`r conflevel`). 'Yes' means the test is significant, 'No' means the test is not significant."))

**Diff(in Kbps or ms)**: Difference of the means (DoM) is the difference of Mean 1 and Mean 2(Mean 1 - Mean 2). For download and upload speeds(Kbps) big positive number here means operator 1 has a faster speed then operator 2. A big negative number means that operator 2 has a faster speed then operator 1. For latency(ms) this is the opposite, because smaller is better.

**Conf Int**: Confidence interval consist of a range of values (interval) that act as good estimate of the *true* difference of the mean. We are `r 100*conflevel`% confident that the true value of the difference of the mean is in our confidence interval.

**Rel(%)**: Relative difference in percentage. It is calculated as the difference of the mean divided by the slower of the two operators average speed. If the difference is not significant(column Sign is No), this column will state NA(Not Applicable). The comparison rules are similar to what is explained in the Diff(in Kbps or ms).

\normalsize
Looking at the tables above we see that all results are significant at $\alpha$ = `r 1-conflevel` level(`r 100*conflevel`% confidence level) and the resulting p-values are very small. This means we can reject the null-hypothesis with great confidence. Hence we can state that with `r 100*conflevel`% confidence the true difference in the means lies within the confidence interval provided in the table.

\newpage

Conclusion
=========
This analysis has been conducted with the utmost care and to the best knowledge of the analyst (DIKW Consulting). The analysis is opensource and all code can be downloaded, reviewed and repeated from [GitHub](https://github.com/hugokoopmans/ookla-speedtest-analysis).

Overall we can say that based on the speedtest data analysed in the investigated area the 4G network of T-Mobile outperforms both KPN and Vodafone on download speed, upload speed and latency. 

From the data analysed in the investigated area the average download speed of the 4G network of T-Mobile outperforms KPN by `r round(d1/1000,2)` Mbps, which is `r r1`%. Also, from the data analysed in the investigated area the average download speed of the 4G network of T-Mobile outperforms Vodafone by `r round(d2/1000,2)` Mbps, which is `r r2`%. From table 14 above similar statements can be derived for upload speed.  For deriving these statements for latency, please see table 15 keeping in mind that smaller values are better. 

For conclusions per individual city we refer to the section below. Please keep in mind that the significance of a test per city does not influence the significance of a test over the whole 4G area. The significance of a test per city only shows if the 4G network speeds (download speed, upload speed and latency) are also significantly different on a local level, so for that city treated separately. 

\newpage
```{r topn}
# top n
top.n=21
```

Analysis and results Top `r top.n` cities per city 
=============
From CBS we have the following top `r top.n` cities based on number of inhabitants("aantal inwoners") from 2014. We see that Zwolle is now number twenty in favour of Maastricht(now `r top.n`). For completness we keep Maastricht in the list of cities to be analysed.  


```{r get_top_n}

# postgreSQL
require("RPostgreSQL", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.0")

# Establish connection to PoststgreSQL using RPostgreSQL
drv <- dbDriver("PostgreSQL")

# Full version of connection setting
con <- dbConnect(drv, dbname=db.name,host=db.hostname,port=db.portnumber,user=db.username,password=db.password )


# all data joined on top x gemeentes
res <- dbSendQuery(con, "SELECT gm_code,gm_naam,aant_inw,opp_land from top25layer")
df.top.25 <- fetch(res, n = -1)

# top n
df.top <- head(df.top.25,top.n)

# aantal inwoners
aantal.inw.sum <- as.integer(sum(df.top$aant_inw))
# CBS data is in Hectare
# 100 ha in 1 km2
df.top$opp_land <- df.top$opp_land/100
aantal.km2.sum <- as.integer(sum(df.top$opp_land))

# aantal inwoners NL 
res <- dbSendQuery(con, "select sum(aant_inw) from nlstaging.gem_2014 where water='NEE'")
aantal.inw.nl <- as.integer(fetch(res, n = -1))

# disconnect
sink <- dbDisconnect(con)

# print table
kable(df.top,col.names=c('code','naam','aantal inwoners','oppervlak(km^2^)'))

# Add unknown as seperate "gemeente"
df.top.plus.unknown <- rbind(df.top, c("---","Unkown Location", nrow(df.unkwnloc)))

# drop month from df.4G
df.4G$month <- NULL

# add data of unknown to df
df.4G.plus.hotspots <- rbind(df.4G,df.unkwnloc)

```

In this test we use the CBS cities in this area as the benchmark area for the overall comparison.

In more detailed analysis we investigate the top twenty "Gemeentes" based on number of inhabitants. based on data available at [CBS](http://www.cbs.nl/nl-NL/menu/themas/dossiers/nederland-regionaal/publicaties/geografische-data/archief/2014/2013-wijk-en-buurtkaart-art.htm).

The coverage area of top 21 cities looks like this.

```{r top, echo=FALSE}
library(png)
library(grid)
img <- readPNG("./img/top21gemeentes-cbs-2014.png")
 grid.raster(img)
```

From the CBS data we learn that the top `r top.n` cities cover `r aantal.inw.sum` out of `r aantal.inw.nl` inhabitants. Which is `r round(100*aantal.inw.sum/aantal.inw.nl,2)` %. In terms of area this is `r aantal.km2.sum` km^2^ of a total of 41545 km^2^, which is `r round(100*aantal.km2.sum/41545,2)`%. 

For each city we will do the significance test separately in the next pages.

```{r run-analysis-per-gemeente, echo=FALSE, cache=FALSE, message=FALSE, warning=FALSE, results='asis' }
# run the analysis for each gemeente in the df.top
out = NULL
for (i in  1:nrow(df.top.plus.unknown)) {
#  out = c(out, knit_expand('analysis-per-gemeente.Rmd'))
  out = c(out, knit_child('analysis-per-gemeente.Rmd'))
}

cat(paste(out, collapse = '\n'))
#cat(knit(text=unlist(paste(out, collapse = '\n')), quiet=TRUE))

```