New chapters: Data munging; basic statistical analysis / lying with statistics #46

Open
wetherc opened this issue Dec 29, 2014 · 10 comments

Comments

@wetherc
Collaborator

wetherc commented Dec 29, 2014

Hey, y'all,

Any thoughts on adding a couple chapters (or a section) on best practices for data munging and basic statistical analysis using common data analytic software? Right now the book largely assumes that the individual's data are immediately usable or that he or she knows how to get them to that point.

There's also not much discussion of how to determine when differences in data are meaningful. That can turn into a deep, dark rabbit hole, but restricting it to a basic discussion of some common analyses with linkouts for more detail should be manageable.

Thoughts on this?

@infoactive
Collaborator

I'm so down to add that section if we have an author or two wanting to work
on it. Sounds like a great idea to me!

Trina Chiasson | Infoactive https://infoactive.co | 872-216-7802

@wetherc
Collaborator Author

wetherc commented Dec 29, 2014

I'm down to write as much as I know how to. I can comfortably do write-ups
for data munging in Excel, R, Python, SPSS, and probs Julia. Likewise for
any statistical analysis walkthroughs.

I'm thinking we might end up with something like:

Section: Data Munging and Analysis

- R
    - basic cleaning, aggregation, &c.
    - t-tests
    - ANOVAs
    - chi-squared
    - correlation
    - simple linear / multiple regression
- Excel
- Python
- SPSS
- Julia
- SAS

Or do we want more of a focus on munging than analysis and interpretation?

@grumpel7

Funny - I just finished half of the Coursera specialization courses on data
science and it's all about cleaning, processing and making things
reproducible. There are guidelines, but I'm not sure how easy it will be to
write them. What impressed me the most is the concept of Tidy Data as
laid out by Hadley Wickham (http://vita.had.co.nz/) - the idea that all
data should be processed in a way that can be used by statistical software.
I think a chapter clarifying why data cleaning and processing is so crucial
and how to get started with it would be fantastic.
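
To make the tidy-data idea above a bit more concrete, here is a minimal sketch in plain Python. It is not taken from Wickham's materials or from the book: the sample table, the column names, and the `to_tidy` helper are all hypothetical. It reshapes a "wide" table (one column per visit) into a "long" one where each row holds a single observation, which is the shape most statistical software expects.

```python
# Hypothetical wide table: one row per patient, one column per visit.
wide_rows = [
    {"patient": "A", "visit_1": 5.1, "visit_2": 4.8},
    {"patient": "B", "visit_1": 6.0, "visit_2": 6.3},
]


def to_tidy(rows, id_key, var_name, value_name):
    """Reshape wide rows into long rows: one observation per row.

    Every column other than `id_key` is treated as a measurement and
    becomes its own row, labelled by the original column name.
    """
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue  # the identifier is repeated on each output row
            tidy.append({id_key: row[id_key], var_name: column, value_name: value})
    return tidy


tidy_rows = to_tidy(wide_rows, "patient", "visit", "score")
for r in tidy_rows:
    print(r)
```

In R, packages such as tidyr cover the same kind of reshaping; the sketch only shows the shape of the result, not a recommended workflow.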

Analysis (and modelling/prediction) would easily fill an entire book...
maybe we can start with pointing to already written resources and go from
there?

I'm not quite up for writing the chapter due to time (and I'm still learning),
but I'll be happy to contribute anything (editing, proofreading, etc.) on
this.

Cheers,
Jane

@grumpel7

Sorry, I probably should have used the term munging rather than cleaning -
the chapters on data cleaning in the book are solid. I just reread them and
think they give plenty of reasons why data cleaning is necessary and
important. We could include more content on processing data and getting it ready
for the stats applications. And definitely more in the Appendix on
statistical analysis.

@wetherc
Collaborator Author

wetherc commented Dec 29, 2014

Very good points! Definitely agree that linking to existing resources for statistical analysis is a good way to go. There's a ton that can be said there and most of it is probably outside the scope of this book.

I guess what I'm wondering is how we'd want any data munging chapter/section to be structured. Right now it's basically a postscript to chapter 8 saying, "Please document!" Do we want to actually recommend programs or workflows for this, along with minimal reproducible examples? Reproducible data wrangling in Excel, for instance, is a whole different beast than in R with tools like knitr.

@dyannali
Contributor

Hi all!
I'm actually just about to start writing an entire set of plain-language
stat analysis chapters for use in my intro biostats course this coming fall
(commencing writing in January). It really is kind of a book unto itself.
I'm happy to pass along these chapters for comment as I finish them up -
and also happy to help come up with some modified examples to make it more
accessible/relevant to the general public, since the ones in this text will
focus on public health and medicine. However, the text won't all be
complete in the next 3 months or anything, so if people want to do a super
speedy time schedule like we did for the first round, they won't be done
that quickly. I do have a deadline of August though, so they'll certainly
be finished by then!

I'll be covering descriptives, distributions, chi-squared, t-tests, ANOVA,
correlation, simple linear regression, and basic multiple regression. Let
me know if you have an interest in that!

  • D

@wetherc
Collaborator Author

wetherc commented Dec 29, 2014

That's awesome, actually! I'm wondering now (although this may just be insomnia-induced delirium talking) if it'd be worthwhile just making a companion book to Data. Design. that covers a slew of basic stats in easily-digestible language?

Dyanna, you're about to get started on something like that; I got partway through a similar project (https://github.com/faulconbridge/appliedStats) this past summer. Seems like we've just about got the groundwork laid already and it's a topic that's very relevant to data presentation and interpretation, but doesn't quite fit in the scope of Data. Design.

@dyannali
Contributor

Ya, Trina and I actually talked about that when we were trying to figure
out how to make an intro stats chapter fit into the book. It wasn't coming
out right, and it didn't totally make sense in the flow of the rest of the
book, so we talked about making it a separate thing.

Chris, I think if we take what I end up with and what you already have,
we'll end up with something pretty sweet! I'm currently still pretty bad
with the GitHub HTML stuff, but willing to learn so that's not all on you.
Are you down for this?

@wetherc
Collaborator Author

wetherc commented Dec 29, 2014

You'd better believe I'm down for this, yeah!

@dyannali
Contributor

Sweet! I'll make a plan of attack and touch base just after the new year
(still out of town on holiday).

Excited!!!