Data

author: I. Bartomeus

Science = Ideas + Data

Data is at the core of the scientific process.
We need to take care of it.
- Ensure it is accurate
- Ensure it is easy to use
- Ensure it is preserved in the long term
- Ensure reproducibility!

Science is changing: a data-sharing culture is becoming established

Open science helps advance science (e.g. reproducibility)
It advances your career (more collaborations, more citations)
It is required by major journals (BES, PLoS, Proc B, ...)

More later...

Most Important

We are losing data!

What is your data life cycle?

What is the data life cycle?

understanding your data needs
collecting data
entering data
cleaning data
storing data
manipulating data
re-using data

What do you think of before collecting data?

Think about what you will need (think a lot)
Power analysis?
In which format will you need it?
How are you going to use it? (be explicit)
Write detailed protocols (this helps you work incrementally)

What do you do to collect high quality data?

use entry sheets
double labeling of samples when possible
standardize coding (4_NitrogenPhosphorus > 4NP > 4)

What do you do to collect high quality data?

double blind when possible (video about autosuggestion)

How do you enter data?

Software: Excel, GoogleDrive (forms), OpenOffice...

But remember, Excel is a data entry program, nothing else.*

Keep the link between physical and digital world
Use a consistent style (have a style guide)

*or if you have less than 10 rows, 4 columns...

Database software

Related CSV files
MS Access (only works on PC)
FileMaker Pro (only works on PC and Mac)
SQL (requires some setup): SQLite, MySQL, PostgreSQL (open source, with advantages over MySQL for spatially explicit data)
MariaDB – drop-in replacement for MySQL (even Google switched from MySQL to MariaDB)
MongoDB – open source, no-SQL database
Metadata:
- Ecological Metadata Language (EML) – standard way to format metadata for ecology based in XML
- Morpho – helps you write your metadata

How do you clean your Data?

Note: This is not data transformation; it is about knowing that your data are robust.

How do you clean your Data?

check impossible values vs improbable values
secure data quality
plot your data

Check your data: if you find no errors, look again
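The difference between impossible and improbable values can be sketched in a few lines. This is only an illustration in Python (the same check is easy to write in R): the measurements, the hard limits, and the cut-off of 5 median absolute deviations are all made-up assumptions, not values from this talk.

```python
import statistics

# Hypothetical field measurements (body length in mm); not real data.
body_length_mm = [11.2, 12.1, 10.8, 11.9, -3.0, 12.4, 11.5, 31.0, 11.0, 12.2]

# Assumed hard limits: a length must be positive, and for this made-up
# study species anything above 50 mm is treated as physically impossible.
MIN_POSSIBLE, MAX_POSSIBLE = 0.0, 50.0


def classify(values, low, high, n_mads=5.0):
    """Split values into impossible, improbable and plausible lists."""
    impossible = [v for v in values if not (low < v <= high)]
    possible = [v for v in values if low < v <= high]
    centre = statistics.median(possible)
    # Median absolute deviation: a measure of spread that outliers barely move.
    # Caveat: if more than half the values are identical, the MAD is 0 and
    # every other value gets flagged.
    mad = statistics.median(abs(v - centre) for v in possible)
    improbable = [v for v in possible if abs(v - centre) > n_mads * mad]
    plausible = [v for v in possible if abs(v - centre) <= n_mads * mad]
    return impossible, improbable, plausible


impossible, improbable, plausible = classify(
    body_length_mm, MIN_POSSIBLE, MAX_POSSIBLE
)
print("Impossible (fix or remove, check the entry sheet):", impossible)
print("Improbable (go back and verify, do not delete blindly):", improbable)
```

Impossible values are entry errors by definition; improbable ones may be real, so they are flagged for a look at the original entry sheet rather than dropped.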

Anscombe's quartet

And always plot your data...

Anscombe's quartet

Anscombe's quartet: four data sets with nearly identical summary statistics

Mean of x in each case -> 9 (exact)
Variance of x in each case -> 11
Mean of y in each case -> 7.50
Variance of y in each case -> 4.1
Correlation between x and y in each case -> 0.816
Linear regression line in each case -> y = 3.00 + 0.500x
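The numbers above can be checked directly. The sketch below (Python, standard library only) hard-codes the four published data sets and recomputes each statistic; the function and variable names are just for illustration.

```python
import statistics

# Anscombe's quartet (Anscombe 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Return means, sample variances, correlation and least-squares line."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    var_x, var_y = statistics.variance(x), statistics.variance(y)
    # Sample covariance, using the same n - 1 denominator as the variances.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / var_x
    return {
        "mean_x": mean_x,
        "var_x": var_x,
        "mean_y": mean_y,
        "var_y": var_y,
        "r": cov / (var_x * var_y) ** 0.5,
        "intercept": mean_y - slope * mean_x,
        "slope": slope,
    }


summaries = {name: summarise(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four rows print almost the same numbers, yet the scatterplots look nothing alike, which is the whole point: summary statistics do not replace a plot.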

Where do you store your master data?

Note: entering and storing are different things!
Do you have a master data set?

Where do you store your master data?

Master data:
Metadata (for others, and for you!)
Local repositories: csv, database (MySQL, Access, MongoDB...)
Online repositories: csv, database (MySQL, Access, MongoDB...)
Online public repositories: Dryad, Figshare, ...
Use plain & standard formats (e.g. .txt, .csv, GenBank, nexus… )
Licence it! CC0
use a tidy data framework

Repositories

Dryad
Figshare
Data papers: Ecological Archives (ESA), F1000Research, etc...
DataONE
Specialized Networks
- The Knowledge Network for Biocomplexity (KNB)
- Global Population Dynamics Database (GPDD)
- GBIF
- VertNet
- etc...

Note: check out rOpenSci to retrieve data!

Tidy data:

Each variable forms a column
Each observation forms a row
Each data set contains information on only one observational unit of analysis (e.g., Genus, species, species visits)

Use concepts from relational databases.
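As a rough sketch of what "one table per observational unit" looks like in practice, the example below keeps species and visits in two related tables linked by a key. It uses Python's built-in `sqlite3` with an in-memory database; the table names, column names and rows are invented for illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- One table per observational unit: species attributes live here once...
    CREATE TABLE species (
        species_id INTEGER PRIMARY KEY,
        genus      TEXT NOT NULL,
        species    TEXT NOT NULL
    );
    -- ...and each visit record only points to a species by its key.
    CREATE TABLE visits (
        visit_id   INTEGER PRIMARY KEY,
        species_id INTEGER NOT NULL REFERENCES species(species_id),
        plot       TEXT NOT NULL,
        n_visits   INTEGER NOT NULL
    );
    INSERT INTO species VALUES (1, 'Bombus', 'terrestris');
    INSERT INTO species VALUES (2, 'Apis', 'mellifera');
    INSERT INTO visits VALUES (1, 1, 'A', 4);
    INSERT INTO visits VALUES (2, 1, 'B', 2);
    INSERT INTO visits VALUES (3, 2, 'A', 7);
    """
)

# Join the two tables on the shared key to get total visits per species.
rows = con.execute(
    """
    SELECT s.genus, s.species, SUM(v.n_visits) AS total_visits
    FROM visits AS v
    JOIN species AS s ON v.species_id = s.species_id
    GROUP BY s.species_id
    ORDER BY s.species_id
    """
).fetchall()
con.close()

for genus, species, total in rows:
    print(genus, species, total)
```

The same layout works with plain related CSV files: a genus name is typed once in the species table, so it cannot be misspelled differently in every visit row.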

... it is 100 times easier to turn data from list form to cross-tab than vice-versa, particularly if you don’t code- you can actually do it very easily in Excel using Pivot tables. Your master dataset should be in list form.

@cbahlai
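For those who do code, going from list form to cross-tab takes only a few lines. A minimal sketch in Python with made-up records follows (in R, the reshape package mentioned later in this talk does the same job); `to_crosstab` is a hypothetical helper, not a library function.

```python
from collections import defaultdict

# Hypothetical master data in list (long) form: one row per observation.
records = [
    {"site": "A", "species": "Apis mellifera", "visits": 7},
    {"site": "A", "species": "Bombus terrestris", "visits": 4},
    {"site": "B", "species": "Bombus terrestris", "visits": 2},
]


def to_crosstab(rows, row_key, col_key, value_key, fill=0):
    """Pivot list-form rows into a {row: {column: value}} cross-tab."""
    columns = sorted({r[col_key] for r in rows})
    # Every row of the cross-tab starts with all columns set to `fill`.
    # Assumption: a missing site/species combination means zero visits,
    # which is only true if every site was actually sampled.
    table = defaultdict(lambda: dict.fromkeys(columns, fill))
    for r in rows:
        table[r[row_key]][r[col_key]] += r[value_key]
    return columns, dict(table)


columns, crosstab = to_crosstab(records, "site", "species", "visits")
print("site", *columns, sep="\t")
for site, counts in sorted(crosstab.items()):
    print(site, *(counts[c] for c in columns), sep="\t")
```

Going the other way means guessing which empty cells are true zeros and which were never sampled, which is why the master data set should stay in list form.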

Where do you manipulate your data?

Never manipulate the Master data.
Use a scripting language (e.g. R) to make it reproducible
- You can't reproduce… if you don't understand where a number came from.
- You can't reproduce… what you don't remember. And trust me: you won't.
- You can't reproduce… what you've lost. What if you need access to a file as it existed 1, 10, 100, or 1000 days ago? Use incremental backups (Git, Dropbox, Time Machine...). (Vargas lab)
Packages that simplify your life (reshape, dplyr)
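A minimal sketch of the "never manipulate the master data" rule: a script reads the master file and writes every derived data set to a separate file. It is written in Python for illustration (the same pattern applies to an R script); `derive_dataset`, the file contents, and the filter are assumptions, and the demo runs in a temporary directory so it touches no real data.

```python
import csv
import tempfile
from pathlib import Path


def derive_dataset(master_path, derived_path, keep_row):
    """Copy the rows of the master CSV that pass keep_row into a new file."""
    master_path, derived_path = Path(master_path), Path(derived_path)
    if master_path.resolve() == derived_path.resolve():
        raise ValueError("refusing to overwrite the master data")
    # The master file is only ever opened for reading.
    with master_path.open(newline="") as src:
        reader = csv.DictReader(src)
        rows = [row for row in reader if keep_row(row)]
        fieldnames = reader.fieldnames
    with derived_path.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo with made-up data in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    master = data_dir / "master.csv"
    master.write_text(
        "site,species,visits\nA,Apis mellifera,7\nB,Bombus terrestris,2\n"
    )
    before = master.read_text()

    derived = data_dir / "data_analysis_y.csv"
    n_kept = derive_dataset(master, derived, lambda row: row["site"] == "A")

    master_unchanged = master.read_text() == before
    derived_lines = derived.read_text().splitlines()
    print(n_kept, "row(s) written; master unchanged:", master_unchanged)
```

Because the derived file is rebuilt by the script, it can be deleted and regenerated at any time, while the master file stays exactly as it was entered and checked.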

What my projects look like

    project/
        data/
            master.csv
            data_analysis_y.csv
        get_data.R
        analysis_x.R
        analysis_y.R
        Figures/
        ms/

Do you re-use data? Was it easy?

Cleaning, securing, understanding and releasing data is 100 times harder a posteriori.
Re-use your own data
Re-use other people's data

Who owns your data?

Yourself?
Your advisor?
The project?
The funding agency?
Your University?
Science -> Free your data!

And if you don't know the answer: find out now!

Do you share your data? Why?

Interested in discussion here:

Do you share your data? Why?

We share ideas in publications
Why not data?
The only way to reproducibility

"I don't publish my data, someone may use them to write his/her own papers!"

"I can share with you my data, but only if I become a coauthor in any of your papers using it"

Data sharing

"I don't publish my hypotheses, someone may use them to write his/her own papers!"

"I can share with you the conclusions of the paper, but only if I become a coauthor in any of your papers using them"

scooping is rare in ecology
you are in best position to understand your data
people will do new things with your data

Data sharing

"Nobody will understand my data"

Write better metadata

"My data is not interesting"

Let others judge

"My data can be misinterpreted"

Write better metadata and trust researchers

Resources:

Tidy data paper

Must read about Data: Nine simple ways to make it easier to (re)use your data

Some Simple Guidelines for Effective Data Management

More: Best practice for biodiversity data management and publication

An introduction to data cleaning with R

Paper on Git and reproducibility

Resources:

Data sharing: Open data, practice and perceptions, enhanced citations and re-use

Resources:

For Excel fans

Best practices on DataONE

Data Management Practices

Reference on lost data

Blogs worth checking: http://practicaldatamanagement.wordpress.com/

Licence your data: Licences

Example of why to use appropriate licences

Resources:

Other presentation: R and reproducibility

about dplyr: dplyr blog and dplyr example

More from [Hadley Wickham](https://github.com/hadley) and [tidyr](https://github.com/hadley/tidyr)

About regular expressions

More about Excel
