Kedro integration cannot load datasets created using dataset factories #988
Comments
BTW, I am happy to open a PR to fix this :) |
Thanks for the very clear issue @gtauzin! I haven't been keeping up to date with all the kedro developments but was always a strong supporter of the idea of a dict interface to the data catalog so very pleased to see it's happened 😀 Thank you @astrojuanlu @ElenaKhaustova and everyone else! Just skimming through the kedro issues now this seems like great progress 🙌 If you're happy to raise a PR for this then that would be amazing. I'm very happy to support with it and offer any help that you need over github issues or a Zoom call or anything - just let me know. A few things to note upfront:
And a few tangentially related points...
|
That's my pleasure @antonymilne. Let me answer your very first point now, since I have already asked this question on the kedro Slack and want to respect the time of the people who answered me there. I will answer the rest of your points later, as I want to take the time to address them properly.
First, thank you. I really appreciate how supportive the vizro team is (and the kedro team for that matter)! From the answer of @ankatiyar on Slack:
So, as I understand it, this is a simple workaround that should work for
I tested this workaround, and it works well for me. In practice that would simply mean that adding those two lines to |
Thanks for the answer! This makes sense although I have some follow on questions (no problem if you can't answer them @gtauzin, I can take it up with the kedro team)... I looked through the source of If I do The reason this slightly matters is that the above code that works for
Hence I am wondering how |
One thing to keep in mind is that For the new |
I see there's confusion between datasets, lazy datasets and patterns. We'll try to communicate the difference better when switching to the new catalog. Here are some things to keep in mind:
@antonymilne so
|
Thanks for the comments @ankatiyar and @ElenaKhaustova! Indeed I had assumed that lazy datasets were the same thing as dataset patterns, but part of the confusion came from me misreading things so it's not necessarily a problem with the docs! Let me explain how I understand things now and hopefully you can say if I got anything wrong here:
Let's say the catalog contains datasets A, B and pattern P, and we have a pipeline that contains A, C, D, where C resolves to P but D does not (and hence will be memory dataset or whatever the default is). Then the options for finding "all" datasets would be:
Am I correct so far?! 😅 Currently on vizro we use @ElenaKhaustova do you know what a future The other question would be when you do
wdyt? 🙏 |
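The scenario in this comment (a catalog with datasets A and B plus pattern P, and a pipeline using A, C and D) can be made concrete with a toy model. This is only a sketch with plain Python sets, not kedro's API: `matches_pattern_p` is a hypothetical stand-in for pattern resolution, and the variable names are invented for illustration.

```python
# Toy model of the scenario: plain sets, not kedro objects.
explicit_datasets = {"A", "B"}       # defined by name in the catalog
pipeline_datasets = {"A", "C", "D"}  # inputs/outputs used by the pipeline


def matches_pattern_p(name: str) -> bool:
    """Hypothetical stand-in for pattern P: in this scenario only C resolves to it."""
    return name == "C"


# Pipeline datasets that are defined explicitly in the catalog.
from_catalog = pipeline_datasets & explicit_datasets
# Pipeline datasets that are not explicit but resolve to pattern P.
from_pattern = {
    name for name in pipeline_datasets - explicit_datasets if matches_pattern_p(name)
}
# Pipeline datasets that match nothing and would fall back to the default.
default_only = pipeline_datasets - from_catalog - from_pattern
# Catalog datasets that no pipeline refers to.
unused_in_pipeline = explicit_datasets - pipeline_datasets

print(sorted(from_catalog))        # ['A']
print(sorted(from_pattern))        # ['C']
print(sorted(default_only))        # ['D']
print(sorted(unused_in_pipeline))  # ['B']
```

Which of these groups count as "all" datasets is exactly the question being asked here; the sketch only names the groups.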
I tried to clarify things with my previous reply but it looks like I achieved the opposite effect 😅
If you have any ideas about what you would like as an output, we can consider them as well. Now is a perfect time for that 🙂
|
From how I see it now, we will do the same thing as in the CLI command: accept optional pipelines as input, or use all pipelines if none are provided. I guess that's option 3 you mentioned above. |
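The "optional pipelines, otherwise all pipelines" behaviour described here could look roughly like the sketch below. It is an assumption-laden illustration, not kedro's or vizro's implementation: `pipeline_dataset_names` is a hypothetical helper, and pipelines are modelled as a plain mapping from pipeline name to the dataset names it uses.

```python
from __future__ import annotations

from typing import Iterable, Mapping, Optional


def pipeline_dataset_names(
    pipelines: Mapping[str, Iterable[str]],
    selected: Optional[Iterable[str]] = None,
) -> set[str]:
    """Collect dataset names from the selected pipelines (hypothetical helper).

    If `selected` is None, every registered pipeline is used.
    """
    names = pipelines.keys() if selected is None else selected
    result: set[str] = set()
    for pipeline_name in names:
        result.update(pipelines[pipeline_name])
    return result


# Hypothetical registry: pipeline name -> dataset names used as inputs/outputs.
registry = {
    "__default__": ["A", "C", "D"],
    "reporting": ["A", "E"],
}

all_names = pipeline_dataset_names(registry)
reporting_names = pipeline_dataset_names(registry, selected=["reporting"])
print(sorted(all_names))        # ['A', 'C', 'D', 'E']
print(sorted(reporting_names))  # ['A', 'E']
```

Defaulting to `None` rather than an empty list keeps "no selection" distinct from "an empty selection", which would return no datasets.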
I had a chat with @ElenaKhaustova this afternoon (thanks again Elena 🙏) and another think about this, and here's the plan @gtauzin. It's a bit trickier than I first thought because the assumed flow on vizro is that you first define a kedro While the kedro
So to match our existing scheme of
Intended usage is like this:
The tests which currently exist for this aren't great so let's start again with them. We should update
In
And then minimal test cases would be:
Phew, that was more code than I expected to write! But this change turned out to be a bit more involved than I had first expected. @gtauzin please do feel free to proceed with the PR now I think we've figured out the correct way we should do this. |
Thank you for the insightful exchange! Sounds like a nice plan, I'll get to the PR very soon. @antonymilne I have a few points:
I think inputs should be
Shouldn't test 1 return
@antonymilne I also wanted to answer the questions on your first message :)
Sounds good, thank you!
Yes, I've seen plotly (and more recently bokeh) moving to narwhals and I use narwhals myself, so I am excited about this change!
Actually, I've been thinking of opening an issue on the kedro issue tracker to discuss this. Here are a few ideas:
Somehow, if you use both together, your dashboards are fed by kedro pipelines and take kedro datasets in. They are part of the same data science project, so it makes sense to me to deal with them together within a monorepo. For now, I put them in a
I organize dependencies using the
If there are more discussions on that topic, I'd love to follow them! Do you have any concrete plan to make the kedro/vizro integration? It seems to me a kedro plugin for vizro would be really nice.
Would love to connect with you and the vizro team if you think that would be useful! |
@gtauzin Just so we don't lose track of this when your PR gets merged, I've split the ongoing discussion off into a new issue: #1008. Will follow up on it all over there!
Amazing! We would love to talk to you some time. Probably the easiest way is to send me a message on the kedro slack channel - I'm not very active there these days but I should see a notification if you ping me there 🙂 Or feel free to drop me an email at [email protected]. |
This was released with |
Which package?
vizro
Package version
0.1.30
Description
When using the kedro integration for data management, it is currently not possible to load datasets that have been generated by dataset factories. This is a known problem that is being addressed by the new KedroDataCatalog in the most recent versions of kedro (>=0.19.9). The issue is discussed here.
How to Reproduce
kedro new --starter=spaceflights-pandas
conf/base/catalog.yml to use the dataset factories syntax. Remove and replace it with:
%load_ext kedro.ipython. The kedro catalog is now defined.
Output
The steps above output a list of the kedro dataset names without "companies" and "reviews". However, as the kedro pipelines' inputs/outputs refer to "companies" and "reviews" and they match the dataset factory defined above, they should also be listed.
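To illustrate why listing only the explicitly defined entries misses factory-generated datasets, here is a minimal, self-contained sketch. It is a toy model written with the standard library, not kedro's actual resolution logic: the catch-all pattern `{name}`, the dataset type strings, and the helpers `pattern_to_regex` and `resolve` are all assumptions made for illustration.

```python
import re
from typing import Optional


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Turn a factory-style pattern such as "{name}" into a regex (toy version)."""
    parts = re.split(r"(\{\w+\})", pattern)
    regex = "".join(
        f"(?P<{part[1:-1]}>.+)" if part.startswith("{") else re.escape(part)
        for part in parts
    )
    return re.compile(f"^{regex}$")


# Hypothetical catalog: one explicit entry and one catch-all factory pattern.
explicit = {"shuttles": "pandas.ExcelDataset"}
patterns = {"{name}": "pandas.CSVDataset"}

# Dataset names that the pipelines refer to as inputs/outputs.
pipeline_datasets = ["companies", "reviews", "shuttles"]


def resolve(name: str) -> Optional[str]:
    """Return the dataset type for `name`: explicit entries first, then patterns."""
    if name in explicit:
        return explicit[name]
    for pattern, dataset_type in patterns.items():
        if pattern_to_regex(pattern).match(name):
            return dataset_type
    return None


# Listing only explicit entries misses the factory-generated datasets ...
listed = sorted(explicit)
# ... whereas resolving every pipeline dataset name finds them.
resolvable = sorted(name for name in pipeline_datasets if resolve(name) is not None)

print(listed)      # ['shuttles']
print(resolvable)  # ['companies', 'reviews', 'shuttles']
```

The gap between `listed` and `resolvable` mirrors the reported behaviour: names that exist only through a pattern never appear in a plain listing of catalog entries.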