Buildout OTDI Pipelines v0.0.1 #85
Note: This will be a prototype for the ideas discussed. The plan is to implement all pipelines in DPK. #12 is for defining the initial list of pipeline requirements.
Although we can use Spark to create a result for Paris, our long-term strategy should be to use data-prep-kit. To make this happen, we need several things:
So we need to address number 3 immediately, as it determines whether we need DPK and/or Spark for this.
I will start an initial implementation of HF dataset data access to assess complexity.
Concerning K8s, I feel it's unnecessary at this time, because we will run relatively simple, sporadic pipelines for a while. If we get to the point where we run nonstop and K8s makes sense, we can introduce it. Ephemeral Ray clusters on AWS (like we used in the ADP project) are the optimal approach for now, IMHO. Thoughts?
For me, question number 3 is the more important one. Do we need Ray/Spark to validate the license? My gut feeling is no. So we need to decide whether we need to process the complete data; that is when we need scalability and consequently Spark/Ray. Otherwise, a simple Python main should suffice.

Assuming the answer to question 3 is yes (we need full data processing and hence scalability), then the K8s vs. standalone Ray cluster question boils down to KFP usage. Do we need complex pipelines? If the answer is yes, we need KFP (or another workflow tool) and K8s. If we are OK with simple sequential execution, then a standalone Ray cluster would suffice.

I would prefer not to introduce additional complexity, but changing the platform down the road can be even more expensive. This is why I am pushing the vision so hard: we need to know what we will need at least for the next several years.
Current state below; we can discuss next steps for this ticket after the standup, both short term and long term. In the event @blublinsky has made enough progress on the DPK / Ray front, we can abandon this approach altogether. Currently implemented and working (screenshot omitted). Notes:
Net result: data stored in HF Hub can be read into a Spark data frame using EMR Serverless (screenshot omitted). Again, we can discuss next steps here later today.
Very nice, but... we still need to decide what kind of processing we plan to do before moving further and investing in all of these things.
#12 is for defining the initial pipeline requirements.
Does this mean that the entire dataset is copied to S3, rather than just streamed through the process?
It does, and it highlights two problems:
We should probably check with HF to see what best practices are here.
Well, it has to do with using dataframes to read all the data: you have to copy all of the data to S3 and then read it. DPK uses a different approach (https://github.com/blublinsky/dpk/blob/dev/data-processing-lib/doc/spark-runtime.md): we only get a file list, and the actual data read is done by the workers. That makes things a lot faster. Take a look here: blublinsky/dpk#2
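To make the contrast concrete, here is a minimal sketch (not the actual DPK code) of the file-list pattern, assuming `huggingface_hub`, `pyarrow`, and PySpark are available on the cluster; the dataset name is a placeholder. The driver only enumerates parquet files, and each worker opens its own share of files directly, so nothing is staged to S3 first.

```python
from huggingface_hub import HfFileSystem
from pyspark.sql import SparkSession
import pyarrow.parquet as pq


def list_parquet_files(dataset_id: str) -> list:
    """Enumerate parquet files in a dataset repo without downloading any data."""
    fs = HfFileSystem()
    return fs.glob(f"datasets/{dataset_id}/**.parquet")


def count_rows(path: str) -> int:
    """Worker-side read of a single file; here we only touch the footer to count rows."""
    fs = HfFileSystem()
    with fs.open(path, "rb") as f:
        return pq.ParquetFile(f).metadata.num_rows


if __name__ == "__main__":
    spark = SparkSession.builder.appName("hf-file-list-sketch").getOrCreate()
    files = list_parquet_files("HuggingFaceFW/fineweb")  # placeholder dataset
    counts = spark.sparkContext.parallelize(files).map(count_rows).collect()
    print(f"{len(files)} files, {sum(counts)} rows")
```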
Interesting situation with having to use
Regarding the license-on-HF ask:
Addressing 3b above, and building on the work being done in #12, the following code demonstrates how we can use the metadata fields (both at the field level and the file level) in a Parquet file to store the license information being compiled in #12. This could be in lieu of, or in addition to, a data card. I will run a check today on HF Hub to see if anyone (at least the big data producers) is doing anything like this already. I can blow the below out into a real workbook, but felt it was short enough to get the point across. We can talk about this at the next standup. Lots and lots of possibilities here.
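The original snippet isn't reproduced in this thread; a minimal reconstruction of the idea with pyarrow, where the license values and the `ai_alliance` key are placeholders rather than an agreed-upon schema, might look like this:

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder license record; the real fields would come from the #12 work.
license_info = {"license": "cdla-permissive-2.0", "source": "example.org"}

# Field-level metadata: annotate an individual column.
text_field = pa.field("text", pa.string(), metadata={"license": license_info["license"]})
schema = pa.schema([text_field, pa.field("id", pa.int64())])

# File-level metadata: attach a JSON blob as a top-level key/value entry.
schema = schema.with_metadata({"ai_alliance": json.dumps(license_info)})

table = pa.table({"text": ["hello"], "id": [1]}, schema=schema)
pq.write_table(table, "licensed.parquet")

# Reading it back:
meta = pq.read_schema("licensed.parquet")
print(meta.field("text").metadata)      # field-level license tag
print(meta.metadata[b"ai_alliance"])    # file-level license blob
```

The field-level annotations travel with the Arrow schema that pyarrow stores in the file, so they survive a write/read round trip alongside the plain key/value entries at the file level.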
Actually, it is much simpler:
The HuggingFace parquet-converter bot seems to write the schema information to the file-level metadata field, under the top-level "huggingface" tag:
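As a hedged illustration, that tag can be inspected with pyarrow; the repo id, revision, and file path below are placeholders (converted files usually live on the `refs/convert/parquet` revision, but the exact paths vary by dataset):

```python
import json
import pyarrow.parquet as pq
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="nyu-mll/glue",              # placeholder dataset
    filename="cola/train/0000.parquet",  # placeholder file path
    repo_type="dataset",
    revision="refs/convert/parquet",     # where the converter bot writes
)

kv = pq.read_schema(path).metadata or {}
if b"huggingface" in kv:
    print(json.dumps(json.loads(kv[b"huggingface"]), indent=2))
else:
    print("No 'huggingface' entry in the file-level metadata.")
```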
The metadata stored here can also be retrieved using the dataset viewer API:
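A sketch of what that could look like against the public datasets-server endpoint; the URL, response shape, and dataset name reflect my understanding of the current API rather than anything pinned down in this thread:

```python
import requests

dataset = "nyu-mll/glue"  # placeholder dataset

resp = requests.get(
    "https://datasets-server.huggingface.co/parquet",
    params={"dataset": dataset},
    timeout=30,
)
resp.raise_for_status()

# Each entry lists the config, split, and a direct URL to a converted parquet file.
for f in resp.json().get("parquet_files", []):
    print(f["config"], f["split"], f["url"])
```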
I believe this is also the endpoint that is being called from the HF web UI to render the "Dataset viewer" view. That being said, there is nothing stopping us from adding an "AI Alliance" block to the file-level metadata, right alongside the "huggingface" block. The "AI Alliance" block can contain the licensing information we want to capture here, either in addition to, or in lieu of, the dataset card. I could also see a sha256 being generated on the parquet file to make sure the licensing information attached to the metadata field is legitimate.
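Building on the writing sketch above, a simple version of the sha256 idea would hash the parquet file and record the digest somewhere external (for example in the dataset card), so the metadata block can later be checked against the file it claims to describe; exactly what the hash should cover is an open design question.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through sha256 so large parquet files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(sha256_of_file("licensed.parquet"))  # file name from the earlier sketch
```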
We are talking about two different things. I am talking about the README, which has to contain a data card with the license field. This is mandatory.
Please read the list of V0.1 features we'll try to implement: #12 (comment). First is parsing the README. A stretch goal, which can be part of V0.2, is to do a uniqueness query on the license column in the parquet files, then look for discrepancies, unacceptable licenses, etc.
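For the stretch goal, the uniqueness query itself is small; here is a sketch in PySpark, where the `license` column name, the S3 path, and the allow-list are all assumptions:

```python
from pyspark.sql import SparkSession

ALLOWED = {"apache-2.0", "mit", "cdla-permissive-2.0"}  # placeholder allow-list

spark = SparkSession.builder.appName("license-check").getOrCreate()
df = spark.read.parquet("s3://my-bucket/dataset/*.parquet")  # placeholder path

# Distinct values of the (assumed) license column, then flag anything unexpected.
licenses = [row["license"] for row in df.select("license").distinct().collect()]
unexpected = sorted(set(licenses) - ALLOWED)

print("distinct licenses:", licenses)
print("not on allow-list:", unexpected)
```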
This ticket is obsolete. @blublinsky reports he has a license validator working under DPK. This was a temporary ticket until we could get clarification on a DPK strategy. It was also opened before @blublinsky joined the team, and he has experience developing under DPK.
The purpose of this task is to make an initial pass at an AWS PySpark EMR job that can