author: I. Bartomeus
- Data is at the core of the scientific process.
- We need to take care of it.
- Ensure it is accurate
- It is easy to use
- It is preserved in the long term
- Ensure reproducibility!
- Open science helps advance Science (i.e. reproducibility)
- Advance your career (increase collaborations, enhance citations)
- Required by major journals (BES, PLoS, Proc B, ...)
More later...
- understanding your data needs
- collecting data
- entering data
- cleaning data
- storing data
- manipulating data
- re-using data
- Think about what you will need (think a lot)
- Power analysis?
- Which format will you need it in?
- How are you going to use it? (be explicit)
- write detailed protocols (this helps you work incrementally)
- use entry sheets
- double labeling of samples when possible
- standardize coding (4_NitrogenPhosphorus > 4NP > 4)
- double blind when possible (video about autosuggestion)
- Software: Excel, GoogleDrive (forms), OpenOffice...
But remember, Excel is a data entry program, nothing else.*
- Keep the link between physical and digital world
- Use a consistent style (have style guideline)
*or if you have fewer than 10 rows, 4 columns...
- related csv files
- MS Access (only works on PC)
- FileMaker Pro (works on PC and Mac only)
- SQL (requires some setup): SQLite, MySQL, PostgreSQL (open source, with advantages over MySQL for spatially explicit data)
- MariaDB – drop-in replacement for MySQL (even Google switched from MySQL to MariaDB)
- MongoDB – open source, no-SQL database
- Metadata:
- Ecological Metadata Language (EML) – standard way to format metadata for ecology, based on XML
- Morpho – helps you write your metadata
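The "related csv files" and SQL options above rest on the same idea: keep each kind of record in its own table and link the tables through a shared key. Below is a minimal sketch of that idea using Python's built-in `sqlite3` module. The table names (`sites`, `visits`), the column names, and all values are hypothetical and only serve as an illustration.

```python
import sqlite3

# In-memory database for the demo; pass a file path instead to persist it.
con = sqlite3.connect(":memory:")

# One table per kind of record, linked by the shared key site_id.
con.execute("CREATE TABLE sites (site_id TEXT PRIMARY KEY, habitat TEXT)")
con.execute(
    "CREATE TABLE visits ("
    " visit_id INTEGER PRIMARY KEY,"
    " site_id TEXT REFERENCES sites(site_id),"
    " species TEXT,"
    " n_visits INTEGER)"
)

# Made-up example rows.
con.executemany(
    "INSERT INTO sites VALUES (?, ?)",
    [("S1", "meadow"), ("S2", "forest")],
)
con.executemany(
    "INSERT INTO visits (site_id, species, n_visits) VALUES (?, ?, ?)",
    [
        ("S1", "Bombus terrestris", 4),
        ("S1", "Apis mellifera", 7),
        ("S2", "Bombus terrestris", 2),
    ],
)

# Join the two tables through the key and total the visits per habitat.
rows = con.execute(
    "SELECT s.habitat, SUM(v.n_visits)"
    " FROM visits v JOIN sites s ON s.site_id = v.site_id"
    " GROUP BY s.habitat ORDER BY s.habitat"
).fetchall()
print(rows)
```

Site-level information (here, habitat) is stored once instead of being repeated on every observation row, which is the main practical gain over a single wide spreadsheet.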
- Note: this is not data transformation; it is about knowing your data is robust.
- check impossible values vs improbable values
- secure data quality
- plot your data
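The difference between impossible and improbable values can be sketched in a few lines. The example below is only an illustration under stated assumptions: the measurements are made up, the rule "a length must be positive" stands in for whatever hard constraint applies to your variable, and the cutoff of 5 median absolute deviations is an arbitrary choice rather than a recommended standard.

```python
import statistics

# Hypothetical body-length measurements in mm (made-up values).
lengths_mm = [12.1, 11.8, 12.5, -3.0, 12.0, 11.9, 12.3, 30.0, 12.2, 11.7]

# Impossible: breaks a hard rule (a length cannot be zero or negative).
# These are errors and should be traced back to the entry sheet.
impossible = [x for x in lengths_mm if x <= 0]

# Improbable: allowed by the rule but far from the bulk of the data.
# These are flagged for a second look, not deleted automatically.
valid = [x for x in lengths_mm if x > 0]
center = statistics.median(valid)
mad = statistics.median(abs(x - center) for x in valid)
improbable = [x for x in valid if abs(x - center) > 5 * mad]

print("impossible:", impossible)
print("improbable:", improbable)
```

An improbable value may still be a real observation, so the sketch only reports it; deciding what to do with it remains a judgment call that should be documented.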
Check your data: if you find no errors, look again.
And always plot your data...
- Mean of x in each case -> 9 (exact)
- Variance of x in each case -> 11
- Mean of y in each case -> 7.50
- Variance of y in each case -> 4.1
- Correlation between x and y in each case -> 0.816
- Linear regression line in each case -> y = 3.00 + 0.500x
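These summary statistics match the well-known Anscombe's quartet, so the sketch below assumes that is the dataset shown on the slide. It recomputes the numbers for all four x/y sets using only the standard library; the function name `summarize` is introduced here for illustration.

```python
from statistics import mean, variance

# Anscombe's quartet (assumed to be the data behind the statistics above).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return means, sample variances, Pearson r and the least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "var_x": variance(x),
        "mean_y": my,
        "var_y": variance(y),
        "r": sxy / (sxx * syy) ** 0.5,
        "intercept": my - slope * mx,
        "slope": slope,
    }

summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {key: round(float(value), 2) for key, value in s.items()})
```

The four sets are numerically almost indistinguishable, yet their scatter plots look completely different, which is why summary statistics alone are not a substitute for plotting.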
- Note: entering and storing are different things!
- Do you have a master dataset?
- Master data:
- Metadata (for others, and for you!)
- Local repositories: csv, database (MySQL, Access, MongoDB...)
- Online repositories: csv, database (MySQL, Access, MongoDB...)
- Use plain & standard formats (e.g. .txt, .csv, GenBank, nexus… )
- Licence it! CC0
- Use a tidy data framework
- Dryad
- Figshare
- Data papers: Ecological Archives (ESA), F1000Research, etc...
- DataONE
- Specialized Networks
Note: check out rOpenSci to retrieve data!
- Each variable forms a column
- Each observation forms a row
- Each data set contains information on only one observational unit of analysis (e.g., Genus, species, species visits)
Use concepts from relational databases.
... it is 100 times easier to turn data from list form to cross-tab than vice versa, particularly if you don't code; you can actually do it very easily in Excel using Pivot tables. Your master dataset should be in list form.
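To make the list-form versus cross-tab distinction concrete, here is a minimal sketch that pivots list-form rows into a site-by-species cross-tab. The column names (`site`, `species`, `visits`), the values, and the helper `to_crosstab` are all hypothetical; in R, packages such as reshape or tidyr cover the same step.

```python
from collections import defaultdict

# Hypothetical master data in list (long) form: one row per observation.
long_rows = [
    {"site": "S1", "species": "Apis mellifera", "visits": 7},
    {"site": "S1", "species": "Bombus terrestris", "visits": 4},
    {"site": "S2", "species": "Bombus terrestris", "visits": 2},
]

def to_crosstab(rows, row_key, col_key, value_key):
    """Pivot list-form rows into table[row][column] = summed value."""
    columns = sorted({r[col_key] for r in rows})
    # Every row of the cross-tab starts with a 0 for each known column.
    table = defaultdict(lambda: dict.fromkeys(columns, 0))
    for r in rows:
        table[r[row_key]][r[col_key]] += r[value_key]
    return dict(table)

crosstab = to_crosstab(long_rows, "site", "species", "visits")
for site, counts in crosstab.items():
    print(site, counts)
```

Note that the cross-tab has to invent a 0 for the site and species pair that was never recorded. Going the other way, there is no way to tell such filled-in zeros from observed zeros, which is one reason the list form is the safer master format.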
- Never manipulate the master data.
- Use a scripting language (e.g. R) to make it reproducible
- You can't reproduce… if you don't understand where a number came from.
- You can't reproduce… what you don't remember. And trust me: you won't.
- You can't reproduce… what you've lost. What if you need access to a file as it existed 1, 10, 100, or 1000 days ago? Use incremental backups (Git, Dropbox, Time Machine...). (Vargas lab)
- Packages that simplify your life (reshape, dplyr)
    project/
        data/
            master.csv
            data_analysis_y.csv
        get_data.R
        analysis_x.R
        analysis_y.R
        Figures/
        ms/
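A script in the role of `get_data.R` would read the master file, leave it untouched, and write any derived dataset to a separate file. The slides suggest R for this; the sketch below uses Python purely as an illustration. The helper `derive_dataset`, the column names, and the filter rule are assumptions, and the demo runs in a temporary folder rather than a real project.

```python
import csv
import tempfile
from pathlib import Path

def derive_dataset(master_path, derived_path, keep_row):
    """Read the master file (opened read-only) and save a filtered copy."""
    with open(master_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = [row for row in reader if keep_row(row)]
    with open(derived_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)

# Demo in a temporary folder that mimics the project layout (made-up data).
project = Path(tempfile.mkdtemp())
data_dir = project / "data"
data_dir.mkdir()
master = data_dir / "master.csv"
master.write_text(
    "site,species,visits\n"
    "S1,Apis mellifera,7\n"
    "S2,Bombus terrestris,0\n"
)
master_before = master.read_text()

# Hypothetical rule: the analysis only needs rows with at least one visit.
derived = data_dir / "data_analysis_y.csv"
n_kept = derive_dataset(master, derived, keep_row=lambda row: int(row["visits"]) > 0)
print(n_kept, "rows written to", derived.name)
```

Because every derived file is produced by a script, it can be deleted and regenerated at any time, and the master file remains the single source of truth.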
- Cleaning, securing, understanding, and releasing data is 100 times harder a posteriori.
- Re-use your own data
- Other people's data
- Yourself?
- Your advisor?
- The project?
- The funding agency?
- Your University?
- Science -> Free your data!
And if you don't know the answer: find out now!
Interested in discussion here:
- We share ideas in publications
- Why not data?
- The only way to reproducibility
"I don't publish my data, someone may use them to write his/her own papers!"
"I can share with you my data, but only if I become a coauthor in any of your papers using it"
"I don't publish my hypothesis, someone may use them to write his own papers!"
"I can share with you the conclusions of the paper, but only if I become a coauthor in any of your papers using them"
- scooping is rare in ecology
- you are in best position to understand your data
- people will do new things with your data
"Nobody will understand my data"
Write better metadata
"My data is not interesting"
Let others judge
"My data can be misinterpreted"
Write better metadata and trust researchers
Must read about Data: Nine simple ways to make it easier to (re)use your data
Some Simple Guidelines for Effective Data Management
More: Best practice for biodiversity data management and publication
An introduction to data cleaning with R
Paper on Git and reproducibility
Data sharing: Open data
More on how data sharing enhances citations
Blogs worth checking: http://practicaldatamanagement.wordpress.com/
Licence your data: Licences
Example of why to use appropriate licences
Other presentation: R and reproducibility
about dplyr: dplyr blog and dplyr example
More from [Hadley Wickham](https://github.com/hadley) and [tidyr](https://github.com/hadley/tidyr)
About regular expressions
More about Excel