Graph skewed 1958 - 1970 - Scatterplot request! #109

Open

titojankowski opened this issue Nov 19, 2017 · 18 comments

Contributor

titojankowski commented Nov 19, 2017

The 10-15 year period at the beginning of the graph is drawn much narrower than another 15 year period, e.g. 1980-1995. I think this is because there are fewer data points in the early period starting in 1958? @lwm @grady-lad
The effect is that levels look like they rocketed up from 1958-1970, but it's just that the graph is compressed there.

Member

decentral1se commented Nov 19, 2017

Yeah, I think you're probably right there.

Not sure there is much we can do about that? It's simply a reflection of the data we have.

Possibly a candidate for all the explanatory documentation we don't have ;)

Contributor

grady-lad commented Nov 19, 2017

We could double-check the number of data points in that period by querying the API?

Member

decentral1se commented Nov 19, 2017

Like, make two queries, where you grab all the data for the first 15 years?

And another for the remaining data using your usual 10 points per whatever query?

Contributor Author

titojankowski commented Nov 19, 2017

Well, the graph could be reconfigured to plot the points by date on the x-axis, like a scatterplot or something. Or, since we're sampling 1/10 of the data, could the sampling function pick a fixed number of samples from each year? I.e. 1960 and 1990 would have the same number of data points.
@lwm @grady-lad

Contributor

grady-lad commented Nov 19, 2017

Currently we have all the data in the frontend, so we could definitely update the sampling function to sample differently based on the year.

E.G. First 15 years -> show all data points
Remaining data -> sample every 10th item.

Contributor Author

titojankowski commented Nov 24, 2017

Sounds fine to me!

grady-lad self-assigned this

Contributor

grady-lad commented Nov 30, 2017

This is what the graph looks like when we show all the data for the first 1500 data points (up to 1976-07-16) and sample 1/10 of the data after.

And here is what the data looks like currently when we sample every 10th data point

Not much of a difference =/

Contributor Author

titojankowski commented Nov 30, 2017

How weird!
Every 10th data point = 36.5 datapoints per year (365 days / 10)
1500 datapoints over 18 years (1958 - 1976) = 83 datapoints per year

So in this experiment I would expect the early years to be wider than the later years on the graph (which presents its own issue). But instead the earlier years are still skinnier. Are you sure it's working as you intended?

The points should really be positioned relative to their date rather than each getting equal spacing. The graphing function is basically ignoring the fact that each datapoint has a corresponding date. How might we get the graph to do this?
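To make the difference concrete, here is a minimal sketch in Python (not the project's frontend code) comparing equal spacing with date-based positioning. The function name `x_positions` and the four sample dates are made up for illustration; a charting library's time axis would normally do this mapping for you.

```python
from datetime import date


def x_positions(dates):
    """Map each date to a 0..1 position proportional to elapsed time."""
    first, last = min(dates), max(dates)
    span = (last - first).days
    if span == 0:
        # A single date (or identical dates) has no extent to spread over.
        return [0.0 for _ in dates]
    return [(d - first).days / span for d in dates]


# Hypothetical readings: two sparse early points, then two consecutive days.
dates = [date(1958, 1, 1), date(1968, 1, 1), date(1978, 1, 1), date(1978, 1, 2)]

# Equal spacing: position depends only on the point's index in the list.
index_based = [i / (len(dates) - 1) for i in range(len(dates))]

# Date-based spacing: position depends on how much time has passed.
date_based = x_positions(dates)
```

With equal spacing the 1968 point sits a third of the way along the axis; with date-based spacing it sits near the middle, which matches where it falls in the 20 year span.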

Here's the gold standard, the Keeling Curve for reference!

Contributor

grady-lad commented Nov 30, 2017

Ah I've just realised I was not updating the values for the x axis when doing the sampling 🤦‍♂️

So here is what the sampling logic looks like now.

For the first 1000 items -> take every 2nd data point
For the remaining items -> take every 5th data point.
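A rough Python sketch of that rule, assuming the chart is fed two parallel arrays (date labels and CO2 values). The function names and the parallel-array layout are assumptions for illustration, not the actual frontend code. The point is that one index list is computed once and applied to both arrays, so the x-axis labels cannot drift out of step with the sampled values.

```python
def sample_indices(n, head=1000, head_step=2, tail_step=5):
    """Indices to keep: every head_step-th of the first `head` items,
    then every tail_step-th of the remaining items."""
    head = min(head, n)
    return list(range(0, head, head_step)) + list(range(head, n, tail_step))


def sample_series(labels, values, **kwargs):
    """Apply one index list to both arrays so each value keeps its own label."""
    if len(labels) != len(values):
        raise ValueError("labels and values must have the same length")
    keep = sample_indices(len(values), **kwargs)
    return [labels[i] for i in keep], [values[i] for i in keep]
```

Calling `sample_series(labels, values)` with the defaults applies the 1000 / 2nd / 5th rule described above; the keyword arguments make it easy to try other splits.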

Contributor

grady-lad commented Nov 30, 2017 (edited)

Is this more like the desired result? @titojankowski

Contributor Author

titojankowski commented Dec 1, 2017

@grady-lad That's headed in the right direction!

Can we do a scatterplot instead? Sorry to mention it again if you've already thought through it and it's not doable, but it would make everything easy! With the current method, we have to manually get the right datapoints otherwise it skews the graph a lot. @lwm thoughts?

With this method:
Looking at our raw data, we have roughly weekly data up until 1974-05-17. (API data: http://api.carbondoomsday.com/api/co2/?date__lte=1974-05-17)

The count is 790 data points of weekly data from the beginning until 1974-05-17. After that it's pretty much daily data.

Conclusion: I suggest trying all of the first 790 data points, and then every 7th data point after that (keep it consistent with weekly data). How's that look? Again, a scatter plot would not need any of this, and this count will change if we ever add more data to the early period. Let me know if there's anything I can do to help!
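Since the 790 count would change if early data were added, one option is to derive the split from the cutoff date instead of hardcoding the count. A minimal Python sketch, assuming the records are `(date, ppm)` tuples sorted by date; the function name and record shape are hypothetical, not the project's code.

```python
from bisect import bisect_right
from datetime import date


def sample_weekly_then_stride(records, cutoff=date(1974, 5, 17), stride=7):
    """Keep every record dated on or before `cutoff`, then every
    `stride`-th record after it.

    `records` is assumed to be a list of (date, ppm) tuples sorted by date.
    """
    dates = [d for d, _ in records]
    # Number of records with date <= cutoff (790 for the data described above).
    head = bisect_right(dates, cutoff)
    return records[:head] + records[head::stride]
```

This still treats the two periods differently, so it only removes the hardcoded count, not the underlying need for date-based positioning.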

Contributor

grady-lad commented Dec 1, 2017

I thought the sampling was there to reduce the chart's performance issues?
The reason the chart was skewed was an error on my part (not updating the x-axis values correctly).

@titojankowski I have added your suggestion for the sampling (first 790 items, then every 7th item) and it's looking good!

With a scatter plot, would we not still have to sample the data? I'm still trying to understand the full benefits of switching over to a scatter plot.

grady-lad mentioned this issue

Sampling by week for ALL #114

Merged

Contributor Author

titojankowski commented Dec 1, 2017

@grady-lad 2 separate issues

The sampling is to cut the total amount of data. Sampling is good overall...ie pulling 1/10 of all samples.

But the issue here is skewing, and it happens whether we sample or not. It has always been an issue on the ALL chart, we just didn't notice it.

A normal line graph works great if the spacing between every point is the same. ie Coinbase has one data point for every day.

But it’s not in our case. We have weekly data at the beginning of our dataset, and daily data towards the end.

X-Y Scatterplot is useful because it places the datapoints on the x-axis based on their date. We shouldn’t need the whole “treat the first 790 datapoints this way, and the rest this other way”. A scatterplot would just make it all work without that, put the data in, and it positions the data points correctly. Does that help? Or more confusing?

Contributor Author

titojankowski commented Dec 1, 2017

And yes! The current fix does look great, just saw it! tho it will break if the early data ever changes from 790. @grady-lad

Contributor Author

titojankowski commented Dec 1, 2017

The reason this skew issue matters is it made it look like CO2 rose really fast from 1958-1975 or so, and then slowed down. But that's not true at all! CO2 has risen steadily every year since 1958...and maybe now it's speeding up a bit. But the skew issue made it look like the CO2 increase isn't as bad now as it was around 1958.

That’s why I noticed it recently, I was wondering if CO2 was accelerating vs just steadily rising. That’s when I saw the issue with the 1958-1975 data.

So that’s why it’s important we convey the data correctly. And a scatterplot makes that easy.

Contributor Author

titojankowski commented Jan 5, 2018

@grady-lad I'm taking another look at the sampling. One major outlier is the year 1964. There are only 31 data points for 1964[1], so on the current graph that year ends up looking really skinny. We should really take an equal number of data points for each year. Thoughts on how to fix this? Maybe we could make our sampling function smarter so it picks the same number of points from every year? That way you wouldn't need separate code for 1958-1974 and >1974.

[1]: 31 datapoints in 1964: http://api.carbondoomsday.com/api/co2/?date__range=1964-01-01%2C1965-01-01
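One way the "same number of points from every year" idea could look, sketched in Python under the assumption that records are `(date, ppm)` tuples in date order. The name `sample_per_year` and the default of 30 are placeholders, not anything from the codebase. Note that a year with fewer points than the target cannot be padded, so it simply keeps everything it has.

```python
from collections import defaultdict


def sample_per_year(records, per_year=30):
    """Keep at most `per_year` evenly spaced records from each calendar year.

    `records` is assumed to be a list of (date, ppm) tuples in date order.
    """
    by_year = defaultdict(list)
    for record in records:
        by_year[record[0].year].append(record)

    sampled = []
    for year in sorted(by_year):
        group = by_year[year]
        if len(group) <= per_year:
            # Sparse year: nothing to thin out, keep every record.
            sampled.extend(group)
            continue
        # Dense year: pick indices spread evenly across the whole year.
        step = len(group) / per_year
        sampled.extend(group[int(i * step)] for i in range(per_year))
    return sampled
```

With a target at or below the sparsest year's count (31 for 1964), every year contributes the same number of points and no per-period special casing is needed.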

Member

decentral1se commented Jan 21, 2018 (edited)

> And a scatterplot makes that easy.

Briefly coming in here, didn't read everything 🦅 BUT is this the status of this issue?

A feature request for a new plot?

decentral1se added this to the P3: Optional milestone

Contributor Author

titojankowski commented Jan 22, 2018 via email

Yeah, make the x-axis position based on the date, rather than equally spacing data. At this point if we had a datapoint from 2025, it would simply appear right next to our existing data, rather than over on 2025 on the x-axis.

decentral1se changed the title ~~Graph skewed 1958 - 1970~~ Graph skewed 1958 - 1970 - Scatterplot request!
