Data

author: I. Bartomeus

Science = Ideas + Data

  • Data is at the core of the scientific process.
  • We need to take care of it.
    • Ensure it is accurate
    • Ensure it is easy to use
    • Ensure it is preserved in the long term
    • Ensure reproducibility!

Science is changing: a data-sharing culture is becoming established

  • Open science helps advance science (e.g. reproducibility)
  • It advances your career (more collaborations, more citations)
  • It is required by major journals (BES, PLoS, Proc B, ...)

More later...

Most Important

  • We are losing data!

What is your data life cycle?

  • understanding your data needs
  • collecting data
  • entering data
  • cleaning data
  • storing data
  • manipulating data
  • re-using data

What do you think of before collecting data?

  • Think about what you will need (think a lot)
  • Do you need a power analysis?
  • In which format will you need the data?
  • How are you going to use it? (be explicit)
  • Write detailed protocols (this helps you work incrementally)

What do you do to collect high quality data?

  • use entry sheets
  • double labeling of samples when possible
  • standardize coding, and prefer descriptive codes (4_NitrogenPhosphorus > 4NP > 4)

How do you enter data?

  • Software: Excel, Google Drive (Forms), OpenOffice...

But remember, Excel is a data entry program, nothing else.*

  • Keep the link between the physical and the digital world
  • Use a consistent style (have a style guideline)

*Unless you have fewer than 10 rows and 4 columns...

Database software

  • related csv files
  • MS Access (only works on PC)
  • FileMaker Pro (works only on PC and Mac)
  • SQL (requires some setup): SQLite, MySQL, PostgreSQL (open source, with advantages over MySQL for spatially explicit data)
  • MariaDB – drop-in replacement for MySQL (even Google switched from MySQL to MariaDB)
  • MongoDB – open source, no-SQL database
  • Metadata:
    • Ecological Metadata Language (EML) – standard way to format metadata for ecology based in XML
    • Morpho – helps you write your metadata
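The "related csv files" and SQL options above rest on the same idea: each table describes one kind of thing, and tables are linked by a shared key. A minimal sketch of that idea follows, using SQLite through Python's standard `sqlite3` module. The deck itself recommends R, so Python is only an assumption made for illustration, and the `sites` and `visits` tables, their columns, and their values are hypothetical.

```python
import sqlite3

# Hypothetical example: one table of sites, one table of visits at those sites.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sites (
        site_id TEXT PRIMARY KEY,
        habitat TEXT NOT NULL
    );
    CREATE TABLE visits (
        visit_id INTEGER PRIMARY KEY,
        site_id  TEXT NOT NULL REFERENCES sites(site_id),
        species  TEXT NOT NULL,
        n_visits INTEGER NOT NULL
    );
""")
conn.executemany(
    "INSERT INTO sites VALUES (?, ?)",
    [("A", "meadow"), ("B", "forest")],
)
conn.executemany(
    "INSERT INTO visits (site_id, species, n_visits) VALUES (?, ?, ?)",
    [
        ("A", "Bombus terrestris", 12),
        ("A", "Apis mellifera", 30),
        ("B", "Bombus terrestris", 4),
    ],
)

# Join the two tables on the shared key to total the visits per habitat.
visits_per_habitat = conn.execute("""
    SELECT s.habitat, SUM(v.n_visits)
    FROM visits AS v
    JOIN sites AS s ON s.site_id = v.site_id
    GROUP BY s.habitat
    ORDER BY s.habitat
""").fetchall()
print(visits_per_habitat)
```

Site-level information (here the habitat) is stored once, so a correction is made in a single place instead of in every visit row.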

How do you clean your Data?

  • Note: This is not data transformation; it is about knowing your data is robust.

  • check impossible values vs improbable values
  • secure data quality
  • plot your data

Check your data: if you find no errors, look again
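The difference between impossible and improbable values can be sketched as two nested ranges: values outside the wide range must be errors, while values outside the narrow range are merely suspicious and deserve a second look. The example below is a Python sketch (the deck recommends R, so the language is an assumption), and the variable, its limits, and the `classify` helper are all hypothetical.

```python
# Hypothetical plausibility limits for one variable (body length in mm).
IMPOSSIBLE = (0.0, 100.0)  # outside (0, 100] the value must be an error
IMPROBABLE = (5.0, 40.0)   # outside [5, 40] the value deserves a second look


def classify(value, impossible=IMPOSSIBLE, improbable=IMPROBABLE):
    """Return 'impossible', 'improbable' or 'ok' for a single measurement."""
    if value is None or not (impossible[0] < value <= impossible[1]):
        return "impossible"
    if not (improbable[0] <= value <= improbable[1]):
        return "improbable"
    return "ok"


measurements = [12.3, 14.1, -2.0, 55.0, 13.8, 140.0]
flags = [classify(v) for v in measurements]

# Report only the values that need attention.
for value, flag in zip(measurements, flags):
    if flag != "ok":
        print(f"{value}: {flag}")
```

Impossible values go back to the entry sheets for correction; improbable values are kept but checked against the physical record.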

Anscombe's quartet

And always plot your data...

[Figure: Anscombe's quartet]

The four datasets in Anscombe's quartet share these summary statistics:

  • Mean of x in each case -> 9 (exact)
  • Variance of x in each case -> 11
  • Mean of y in each case -> 7.50
  • Variance of y in each case -> 4.1
  • Correlation between x and y in each case -> 0.816
  • Linear regression line in each case -> y = 3.00 + 0.500x
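These figures can be checked directly. The sketch below uses the published Anscombe (1973) values and Python's standard `statistics` module (Python is an assumption here; R ships the same data as the built-in `anscombe` data frame). The `summarize` helper is hypothetical; it uses sample variances and an ordinary least-squares line.

```python
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Means, sample variances, Pearson r and least-squares line for x, y."""
    mx, my = mean(x), mean(y)
    vx, vy = variance(x), variance(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    slope = cov / vx
    return {
        "mean_x": mx,
        "var_x": vx,
        "mean_y": my,
        "var_y": vy,
        "r": cov / (vx * vy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(val, 3) for key, val in stats.items()})
```

The summaries agree to two or three significant figures, yet the four scatterplots look nothing alike, which is why the numbers alone are not a substitute for plotting.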

Where do you store your master data?

  • Note: entering and storing are different things!
  • Do you have a master dataset?

  • Master data:
  • Metadata (for others, and for you!)
  • Local repositories: csv, database (MySQL, Access, MongoDB...)
  • Online repositories: csv, database (MySQL, Access, MongoDB...)
  • Online public repositories: Dryad, Figshare, ...
  • Use plain & standard formats (e.g. .txt, .csv, GenBank, nexus… )
  • Licence it! CC0
  • Use a tidy data framework

Repositories

  • Dryad
  • Figshare
  • Data papers: Ecological Archives (ESA), F1000Research, etc...
  • DataONE
  • Specialized Networks
    • The Knowledge Network for Biocomplexity (KNB)
    • Global Population Dynamics Database (GPDD)
    • GBIF
    • VertNet
    • etc...

Note: check out rOpenSci to retrieve data!

Tidy data:

  • Each variable forms a column
  • Each observation forms a row
  • Each data set contains information on only one observational unit of analysis (e.g., Genus, species, species visits)

Use concepts from relational databases.

... it is 100 times easier to turn data from list form to cross-tab than vice-versa, particularly if you don’t code- you can actually do it very easily in Excel using Pivot tables. Your master dataset should be in list form.

@cbahlai
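The list-to-cross-tab direction is short even without Pivot tables. Below is a minimal Python sketch (the deck recommends R, where packages such as reshape or tidyr do this; Python is an assumption for illustration). The rows, the column names, and the `to_crosstab` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical master data in list (long) form: one row per observation.
long_rows = [
    {"site": "A", "species": "Apis mellifera", "visits": 30},
    {"site": "A", "species": "Bombus terrestris", "visits": 12},
    {"site": "B", "species": "Bombus terrestris", "visits": 4},
]


def to_crosstab(rows, row_key, col_key, value_key, fill=0):
    """Pivot list-form rows into {row: {column: summed value}}."""
    columns = sorted({r[col_key] for r in rows})
    # Every row of the cross-tab starts with all columns set to `fill`.
    table = defaultdict(lambda: dict.fromkeys(columns, fill))
    for r in rows:
        table[r[row_key]][r[col_key]] += r[value_key]
    return dict(table)


crosstab = to_crosstab(long_rows, "site", "species", "visits")
for site, counts in crosstab.items():
    print(site, counts)
```

Going the other way is harder because the cross-tab has already lost information: a 0 in a cell no longer says whether the species was absent or the combination was never recorded.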

Where do you manipulate your data?

  • Never manipulate the Master data.

  • Use a scripting language (e.g. R) to make it reproducible
    • You can't reproduce… if you don't understand where a number came from.
    • You can't reproduce… what you don't remember. And trust me: you won't.
    • You can't reproduce… what you've lost. What if you need access to a file as it existed 1, 10, 100, or 1000 days ago? Use incremental backups (Git, Dropbox, Time Machine...). (Vargas lab)
  • Packages that simplify your life (reshape, dplyr)

What my projects look like

project/
    data/
        master.csv
        data_analysis_y.csv
    get_data.R
    analysis_x.R
    analysis_y.R
    Figures/
    ms/
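The rule "never manipulate the master data" can be enforced by the script that reads it: the master file is opened read-only and every derived dataset is written to a separate file. The sketch below shows one way this could look in Python; the deck's own layout uses R scripts such as `get_data.R`, so the language, the `derive_dataset` helper, and the example columns are assumptions. It runs inside a temporary directory so it touches no real data.

```python
import csv
import tempfile
from pathlib import Path


def derive_dataset(master_path, out_path, keep_row):
    """Read the master file without modifying it and write a derived copy."""
    master_path, out_path = Path(master_path), Path(out_path)
    if master_path.resolve() == out_path.resolve():
        raise ValueError("refusing to overwrite the master data")
    with master_path.open(newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = [row for row in reader if keep_row(row)]
    with out_path.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


# Demo in a throwaway directory that mimics project/data/.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    master = data_dir / "master.csv"
    master.write_text(
        "site,species,visits\nA,Apis mellifera,30\nB,Bombus terrestris,4\n"
    )
    derived = data_dir / "data_analysis_y.csv"
    n_kept = derive_dataset(master, derived, lambda row: row["site"] == "A")
    derived_lines = derived.read_text().splitlines()
    master_lines_after = master.read_text().splitlines()

print(n_kept, derived_lines)
```

Because the derived file is always regenerated from the master by a script, it can be deleted and rebuilt at any time.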

Do you re-use data? Was it easy?

  • Cleaning, securing, understanding and releasing data is 100 times harder a posteriori.
  • Re-use your own data
  • Re-use other people's data

Who owns your data?

  • Yourself?
  • Your advisor?
  • The project?
  • The funding agency?
  • Your University?
  • Science -> Free your data!

And if you don't know the answer: find out now!

Do you share your data? Why?

Interested in discussion here:

  • We share ideas in publications
  • Why not data?
  • It is the only way to achieve reproducibility

"I don't publish my data, someone may use them to write his/her own papers!"

"I can share with you my data, but only if I become a coauthor in any of your papers using it"

Data sharing

"I don't publish my hypotheses, someone may use them to write his/her own papers!"

"I can share with you the conclusions of the paper, but only if I become a coauthor in any of your papers using them"

  • scooping is rare in ecology
  • you are in the best position to understand your data
  • people will do new things with your data

Data sharing

"Nobody will understand my data"

Write better metadata

"My data is not interesting"

Let others judge

"My data can be misinterpreted"

Write better metadata and trust researchers

Resources:

Tidy data paper

Must read about Data: Nine simple ways to make it easier to (re)use your data

Some Simple Guidelines for Effective Data Management

More: Best practice for biodiversity data management and publication

An introduction to data cleaning with R

Paper on Git and reproducibility

Resources:

Data sharing: Open data

practice and perceptions

enhanced citations and re-use

more on data sharing enhance citations

who shares data

Resources:

For Excel fans

Best practices on DataONE

Data Management Practices

Ref data lost

Blogs worth checking: http://practicaldatamanagement.wordpress.com/

Licence your data: Licences

Example of why to use appropriate licences

Resources:

Other presentation: R and reproducibility

about dplyr: dplyr blog and dplyr example

More from Hadley Wickham: https://github.com/hadley and https://github.com/hadley/tidyr

About regular expressions

More about Excel