Pump Chunk Marking #2

Open
fleerdayo opened this issue Jan 15, 2021 · 3 comments

Comments

@fleerdayo

For each resampled (chunked) pump CSV file, did you mark only one chunk as True?

E.g. if there is a pump at 2019-03-1 17.00 and I chunk my CSV data into 5-second chunks (considering only the pump day and one day before and after), I mark only the chunk from 17.00.00 to 17.00.05 as True.
This leaves me with an extremely imbalanced dataset, so a RandomForestClassifier ends up predicting every chunk as False.
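
To put a number on that imbalance, here is a minimal sketch of the labeling described above. It assumes the pump date is 1 March 2019, that chunks are aligned to midnight, and that the window is the pump day plus one day on each side; `label_chunks` is a hypothetical helper, not code from the repository.

```python
from datetime import datetime, timedelta

def label_chunks(pump_time, chunk_seconds=5, days_around=1):
    """Return (chunk_start, is_pump) pairs for fixed-size chunks around a pump.

    Covers the pump day plus `days_around` days before and after; only the
    chunk that contains `pump_time` is labeled True.
    """
    day_start = datetime(pump_time.year, pump_time.month, pump_time.day)
    window_start = day_start - timedelta(days=days_around)
    window_end = day_start + timedelta(days=days_around + 1)
    step = timedelta(seconds=chunk_seconds)

    labeled = []
    start = window_start
    while start < window_end:
        labeled.append((start, start <= pump_time < start + step))
        start += step
    return labeled

# Assumed pump time, matching the example in the question.
chunks = label_chunks(datetime(2019, 3, 1, 17, 0, 0))
positives = sum(is_pump for _, is_pump in chunks)
print(len(chunks), positives)  # 51840 chunks, 1 of them True
```

With one positive among 51,840 chunks per pump, a classifier that always answers False is already more than 99.99% accurate, which would explain the behaviour above.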

What am I missing here?

------- Offtopic -----------
Also thank you guys for your effort to collect all the data. I enjoyed reading your paper too and got lots of useful information out of it. It's a welcome distraction to fiddle around with your data during all the restrictions :)

@RaibekTussupbekov

Hello, @fleerdayo :)

I've been trying to reverse-engineer the paper's model for the last two weeks:)

I've been able to achieve 77.907% recall but a ridiculously low 0.185% precision:(
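
To make the precision figure concrete: since precision = TP / (TP + FP), the number of false alerts per true alert is 1 / precision - 1. A small sketch of that arithmetic (the function name is made up for illustration):

```python
def false_positives_per_true_positive(precision):
    """Rearrange precision = TP / (TP + FP) into FP / TP = 1 / precision - 1."""
    return 1.0 / precision - 1.0

# 0.185% precision expressed as a fraction.
ratio = false_positives_per_true_positive(0.00185)
print(round(ratio))  # 540
```

So at 0.185% precision there are roughly 540 false alerts for every correctly flagged chunk.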

I use imblearn.ensemble.BalancedRandomForestClassifier to undersample the data.
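
For readers unfamiliar with undersampling, the basic idea can be sketched with the standard library alone. This is a plain one-off random undersampling under assumed toy data, not the internals of `BalancedRandomForestClassifier` (which, as far as I know, balances each tree's bootstrap sample rather than the dataset once); `undersample` is a hypothetical helper.

```python
import random

def undersample(rows, labels, seed=0):
    """Keep every positive row and a random subset of negatives of at most the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y]
    neg = [i for i, y in enumerate(labels) if not y]
    # Sample without replacement; sort to preserve the original time order.
    keep = sorted(pos + rng.sample(neg, min(len(pos), len(neg))))
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data: 1000 chunks, 4 of them marked as pump chunks.
rows = list(range(1000))
labels = [i % 250 == 0 for i in rows]
balanced_rows, balanced_labels = undersample(rows, labels)
print(len(balanced_rows), sum(balanced_labels))  # 8 4
```

Undersampling balances the training set, but the evaluation data stays as imbalanced as before, so many false positives at prediction time are still possible.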

I tried to cut off 30 minutes after each pump chunk because the paper says that "...Once a pump is detected we pause our classifier for 30 minutes to avoid multiple alerts for the same event..."

However it does not help.
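
For reference, one way to read the quoted 30-minute pause is as post-processing on the predicted alert times, not as a change to the training data. A minimal sketch under that assumption (the timestamps and `apply_cooldown` are hypothetical):

```python
from datetime import datetime, timedelta

def apply_cooldown(alert_times, pause=timedelta(minutes=30)):
    """Keep an alert only if at least `pause` has passed since the last kept alert."""
    kept = []
    for t in sorted(alert_times):
        if not kept or t - kept[-1] >= pause:
            kept.append(t)
    return kept

# Made-up alert timestamps for illustration.
alerts = [
    datetime(2019, 3, 1, 17, 0, 0),
    datetime(2019, 3, 1, 17, 0, 5),
    datetime(2019, 3, 1, 17, 12, 0),
    datetime(2019, 3, 1, 17, 31, 0),
]
kept = apply_cooldown(alerts)
print([t.strftime("%H:%M:%S") for t in kept])  # ['17:00:00', '17:31:00']
```

This only collapses repeated alerts for the same event; it does nothing about alerts fired far away from any pump.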

So I believe that the data should be filtered before training. The paper says that the authors picked only 104 samples out of 175.

Maybe this is the main reason for so many false positives?

Let me know if you're still interested. We could collaborate:) I see that the authors are not responding here:) Maybe they are too busy...or too rich:) Just kidding:)

Btw I'm ready to share my code and collaborate with whoever is interested including the authors:)

@RazcoDev

Hey @RaibekTussupbekov, did you manage to make this work? I've also run into many issues with the dataset.
Thanks!

@RaibekTussupbekov

@RazcoDev Not yet
