Notebook on Data Wrangling with Polars #18

Open
koushikkhan opened this issue Feb 5, 2025 · 9 comments
Comments

@koushikkhan

Description

Hi team,

Thanks for taking the initiative to create high-quality notebooks covering fundamental topics. I would like to contribute notebooks on Polars, covering examples of filtering, joins, window operations, and more.

@akshayka
Contributor

akshayka commented Feb 6, 2025

@koushikkhan, that would be fantastic!

We could create a "course" (folder) that is an introduction to Polars and teaches these concepts through a number of small notebooks.

  • Do you have a sequence of notebooks in mind? If so I'd love to take a look.
  • Would you like to create a PR that creates the folder, and adds one or two introductory notebooks so we can get aligned on style and content?

@koushikkhan
Author

@akshayka I don't have any notebooks ready yet, but I can create a couple of them as you requested. Thanks.

@akshayka
Contributor

akshayka commented Feb 7, 2025

Great! It might be worth it to come up with a list of notebook titles too -- like a table of contents, without necessarily writing the notebooks. That would help others contribute too.

@koushikkhan
Author

@akshayka Alright, will work on that.

@koushikkhan
Author

@akshayka Here is my proposal for Polars:

  • Why Polars? (we can talk about its syntax, which feels natural to PySpark and SQL users, its faster processing, and its handling of larger datasets, maybe with some benchmarks)

  • Handling various data types

    • type casting (string to datetime or int to float)
    • handling datetimes
    • lists and arrays
    • handling strings (e.g. with regex)
    • handling missing data
    • using structs
  • Basic Data Wrangling Techniques

    • Loading datasets from various sources
    • Performing basic checks on a dataset
    • Selecting a subset of columns
    • Filtering rows by evaluating one or more conditions
    • Creating new columns with expressions
    • Performing basic aggregations using group-by (e.g. group-wise average)
    • Performing window operations (e.g. group-wise ranking / cumulative sum)
    • Performing joins between two data frames
      • inner joins
      • left/right joins
      • anti joins
  • Understanding lazy evaluation to reduce memory usage

  • Using user-defined functions

Please let me know your opinion on this. Thanks.

@akshayka
Contributor

akshayka commented Feb 7, 2025

This looks fantastic!

I like the idea of starting with "Why Polars".

Keeping with this repo's spirit of teaching the fundamentals, I think we should add two additional notebooks after Why Polars: "Series" and "Dataframes", where we define both and give basic example usage (more than given in "Why Polars").

The various data types and data wrangling techniques look like great topics; I wonder if it makes sense to split them into multiple notebooks, but we can cross that bridge when we get there.

I'd be happy to look at a notebook in a PR when you get the time; perhaps we can start with Why Polars?

@koushikkhan
Author

@akshayka Akshay, I agree with you. I will start working on the "Why Polars" notebook.

@akshayka
Contributor

akshayka commented Feb 8, 2025

Great, thank you!

@koushikkhan
Author

Hey @akshayka, I have added the first notebook for Polars: https://github.com/koushikkhan/learn/blob/feat/issue%2318/polars-data-wrangling/polars/001_why_polars.py

Let me know if you can access it.
