
Buildout OTDI Pipelines v0.0.1 #85

Closed
jolson-ibm opened this issue Jan 16, 2025 · 19 comments
@jolson-ibm

jolson-ibm commented Jan 16, 2025

The purpose of this task is to make an initial pass at an AWS PySpark EMR job that can

  • Take a HuggingFace (HF) data set as an input
  • Develop a job that uses an off-the-shelf library that either calculates some relevant metric on the data set or scores the data set against some metric.
  • Append the results to the data set card, including when the job was executed and either the sha256 or git hash of the HF data set it was run against (a rough sketch of this step follows).
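
A rough sketch of that last card-update step, assuming the huggingface_hub card and info APIs; the repo id and the metric line are placeholders, not the actual job's values:

from datetime import datetime, timezone

from huggingface_hub import DatasetCard, HfApi

REPO_ID = "some-org/some-dataset"  # placeholder dataset id

api = HfApi()
info = api.dataset_info(REPO_ID)   # info.sha is the git revision the job ran against
card = DatasetCard.load(REPO_ID)   # fetches and parses the data set's README.md

# Append a small report section to the card body.
report = (
    "\n\n## OTDI validation\n"
    f"- Executed: {datetime.now(timezone.utc).isoformat()}\n"
    f"- Dataset revision (git sha): {info.sha}\n"
    "- Metric: <name and value computed by the EMR job>\n"
)
card.text += report

# card.push_to_hub(REPO_ID)  # would require write access to the data set repo
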
@jolson-ibm jolson-ibm self-assigned this Jan 16, 2025
@jolson-ibm jolson-ibm converted this from a draft issue Jan 16, 2025
@deanwampler
Contributor

deanwampler commented Jan 16, 2025

Note: This will be a prototype for the ideas discussed. The plan is to implement all pipelines in DPK.

#12 is for defining the initial lists of pipeline requirements.

@jolson-ibm jolson-ibm changed the title OTDI Pipelines v0.0.1 Buildout OTDI Pipelines v0.0.1 Jan 16, 2025
@jolson-ibm
Author

The EMR cluster is up and running and can be turned on / off on demand. It is not completely scripted out as infrastructure code using the Python AWS CDK yet, but I will get there shortly and will check it in to a code repo.

Test job 'word count' submitted and executed successfully. See the screenshot below.

Next goal: pull in a single small data set from HuggingFace, passed in as an argument, and calculate some metric on it (a rough sketch follows the screenshot).

[Screenshot: the 'word count' test job completed successfully on the EMR cluster]
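
A minimal sketch of what that next step could look like, assuming the data set has already been staged to S3 as Parquet; the bucket path and the metrics (row count plus per-column null counts) are placeholders:

import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical invocation: spark-submit metrics_job.py s3://some-bucket/datasets/foo/
input_path = sys.argv[1]

spark = SparkSession.builder.appName("otdi-dataset-metrics").getOrCreate()
df = spark.read.parquet(input_path)

# Placeholder metrics: row count and per-column null counts.
row_count = df.count()
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
).collect()[0].asDict()

print({"rows": row_count, "null_counts": null_counts})
spark.stop()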

@blublinsky
Contributor

blublinsky commented Jan 18, 2025

Although we can use Spark to create a result for Paris, our long-term strategy should be using data-prep-kit. To make this happen, we need several things:

  1. Implement HF data sets support - data access / data access factory
  2. Decide on deployment options for AWS - currently it's based on KubeRay on K8s. Should we pursue this? An additional advantage is KFP support. If we want to go this route, we need an autoscaled K8s cluster. The other option is to install Ray directly on AWS. This will not support KFP, so we need to merge this PR.
  3. We also need to start outlining the pipelines themselves. Do we need full data processing (which will require DPK), or just simple things like checking data set metadata for a license? In the latter case, DPK usage becomes a moot point.

So we need to address number 3 immediately, as it defines whether we need DPK and/or Spark for this.

@blublinsky
Contributor

I will start an initial implementation of HF data set data access to assess complexity.

@deanwampler
Contributor

Concerning K8s, I feel it's unnecessary at this time, because we will run relatively simple, sporadic pipelines for a while. If we get to the point where we run nonstop and K8s makes sense, we can introduce it. Ephemeral Ray clusters on AWS (like we used in the ADP project) are the optimal approach for now, IMHO. Thoughts?

@blublinsky
Contributor

blublinsky commented Jan 18, 2025

For me, question number 3 is the more important one. Do we need Ray/Spark to validate the license? My gut feeling is no. So we need to decide whether we need to do data processing on the complete data. That is when we do need scalability and, consequently, Spark/Ray. Otherwise, a simple Python main should suffice.

Assuming the answer to question 3 is yes - we need full data processing and hence scalability - then the K8s vs. standalone Ray cluster question boils down to KFP usage. Do we need complex pipelines? If the answer is yes, we need KFP (or another workflow tool) and K8s. If we are OK with simple sequential execution, then a standalone Ray cluster would suffice.

I would prefer not to introduce additional complexity, but changing the platform down the road can be even more expensive.

This is the reason why I am pushing the vision so hard. We need to know what we will need, at least for the next several years.
Just saying DPK is the answer is probably wrong. It's the answer to what?

@jolson-ibm
Author

jolson-ibm commented Jan 21, 2025

Current state is below; we can discuss next steps for this ticket after the standup - both short term and long term. In the event @blublinsky has made enough progress on the DPK / Ray front, we can abandon this approach altogether.

Currently implemented and working:

[Diagram: currently implemented architecture]

Notes:

  1. AWS Serverless EMR's easiest mode of operation is to read / write from S3. That's the tradeoff for not having to boot, manage, and pay for a complex Spark cluster.
  2. This means something else has to interact with HuggingFace's (HF) myriad of interfaces to get the data into S3. I built a very simple Docker loader that can be deployed using AWS's serverless container service (ECS / Fargate). The deployed container simply uses git lfs and the AWS CLI to stage the data for the EMR pipeline (a rough Python alternative is sketched after this list).
  3. Both of these are run manually at the moment, but it wouldn't be too difficult to instrument them both using AWS SNS / SQS and wrap all of this in a web-based user interface, allowing anyone to operate it. I'd estimate 40-60 hours of work for this.
  4. The top arrow, the output of the EMR job to the HF data card, is not yet implemented; I would estimate 10-20 hours to get a first pass at that online.
  5. The other need is to code all of this up using the AWS CDK so that it can be treated as infrastructure as code. Estimated time here: 30-40 hours.
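
For reference, a rough sketch of what the staging step could look like in plain Python instead of git lfs + the AWS CLI, assuming huggingface_hub and boto3; the repo id and bucket name are placeholders:

from pathlib import Path

import boto3
from huggingface_hub import snapshot_download

REPO_ID = "some-org/some-dataset"   # placeholder
BUCKET = "otdi-staging-bucket"      # placeholder

# Download a snapshot of the data set repo locally (still a full copy).
local_dir = Path(snapshot_download(repo_id=REPO_ID, repo_type="dataset"))

# Push every file to S3 under a prefix derived from the repo id.
s3 = boto3.client("s3")
for path in local_dir.rglob("*"):
    if path.is_file():
        s3.upload_file(str(path), BUCKET, f"{REPO_ID}/{path.relative_to(local_dir)}")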

Net result: data stored in HF Hub can be read into a Spark data frame using EMR Serverless:

[Screenshot: HF data read into a Spark data frame via EMR Serverless]

Again, we can discuss next steps here later today.

@blublinsky
Contributor

Very nice, but...

We still need to decide what kind of processing we plan to do before moving further and investing in all these things.

@deanwampler
Contributor

...
3. We also need to start outlining the pipelines themselves. Do we need full data processing (which will require DPK), or just simple things like checking data set metadata for a license? In the latter case, DPK usage becomes a moot point.

#12 is for defining the initial pipeline requirements.

  1. This means something else has to interact with HuggingFace's (HF) myriad of interfaces to get the data into S3. I built a very simple Docker loader that can be deployed using AWS's serverless container service (ECS / Fargate). The deployed container simply uses git lfs and the AWS CLI to stage the data for the EMR pipeline.

Does this mean that the entire dataset is copied to S3, rather than just streamed through the process?

@jolson-ibm
Author

Does this mean that the entire dataset is copied to S3, rather than just streamed through the process?

It does, and it highlights two problems:

  1. I don't know of any way to stream data directly out of HuggingFace Hub. The several data access interfaces they have all require a "copy to local" - namely git lfs, their datasets Python SDK, their Parquet REST API, and their Python filesystem SDK (see the sketch after this list).

  2. Since their data is stored via git lfs, any cloud-based, commoditized, scalable compute is going to require a "copy to cloud" step. If you are doing data engineering properly, you do this once, cache the data where you need it, and go from there. This does get expensive with the larger datasets.
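
To illustrate point 1: even the highest-level interface materializes a local copy before anything can be computed. A minimal example with the datasets SDK; the repo id is a placeholder:

from datasets import load_dataset

# This pulls the data files down into the local HF cache before returning;
# nothing is read "in place" on the Hub.
ds = load_dataset("some-org/some-dataset", split="train")   # placeholder repo id
print(ds.num_rows, ds.column_names)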

We should probably check with HF to see what best practices are here.

@blublinsky
Contributor

blublinsky commented Jan 21, 2025

Well, it has to do with you using dataframes to read all the data. That means you have to copy all the data to S3 and then read it. DPK uses a different approach (https://github.com/blublinsky/dpk/blob/dev/data-processing-lib/doc/spark-runtime.md): we only get a file list, and the actual data read is done by the workers. That makes things a lot faster.

Take a look here blublinsky/dpk#2
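
To be clear, the following is not DPK's actual code, just a generic sketch of the pattern being described - the driver only builds a file list and the workers do the reads; the paths are placeholders:

import pyarrow.parquet as pq
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-list-pattern").getOrCreate()
sc = spark.sparkContext

# The driver only enumerates paths; it never touches the data itself.
file_paths = [
    "s3://some-bucket/ds/part-0000.parquet",   # placeholder listing
    "s3://some-bucket/ds/part-0001.parquet",
]

def process_partition(paths):
    # Each worker reads and processes only the files it was handed.
    for p in paths:
        table = pq.read_table(p)
        yield {"path": p, "rows": table.num_rows}

results = sc.parallelize(file_paths, numSlices=len(file_paths)).mapPartitions(process_partition).collect()
print(results)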

@deanwampler
Contributor

Interesting situation with having to use git-lfs. I should mention that we know that not all datasets will be in HF, e.g., Common Crawl.

@jolson-ibm
Author

jolson-ibm commented Jan 23, 2025

Regarding the license on HF ask:

  1. The simplest, lowest-hanging fruit is to see if there is a license file in the dataset filesystem using the HF filesystem API. Another option would be to parse the README.md, if it exists, and see if there is any mention of a license in there (a minimal sketch follows this list).
  2. A variant of the above: use a deployed LLM, send it the URLs of HF data repos and data cards, and ask the LLM if it sees any licensing information.
  3. As far as parsing out license information within the data itself, I see a few possible approaches, depending on the format of the data set itself. An LLM may be useful here, too.
    • If the dataset consists of PDF documents (I don't know if this is actually a case on HF... I'll need to look around), check the .pdf for license info either in the document itself or in the document metadata.
    • If the dataset consists of a set of Parquet files, check to see if licensing is mentioned in each file's Parquet metadata.
    • If the data set is text (CSV, JSON, etc.), check to see if licensing is part of the headers / JSON structure.
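
A minimal sketch of option 1, assuming the huggingface_hub API; the repo id is a placeholder and the keyword scan is deliberately naive:

from huggingface_hub import HfApi, hf_hub_download

REPO_ID = "some-org/some-dataset"   # placeholder

api = HfApi()
files = api.list_repo_files(REPO_ID, repo_type="dataset")

# 1a. Is there an explicit license file in the data set repo?
license_files = [f for f in files if f.lower().startswith(("license", "licence"))]

# 1b. Does the README mention a license at all? (Naive keyword scan.)
readme_mentions_license = False
if "README.md" in files:
    readme_path = hf_hub_download(REPO_ID, "README.md", repo_type="dataset")
    with open(readme_path, encoding="utf-8") as f:
        readme_mentions_license = "license" in f.read().lower()

print(license_files, readme_mentions_license)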

@jolson-ibm
Author

Addressing 3b above, and building on the work being done in #12, the following code demonstrates how we can use the metadata fields in a Parquet file (both at the field level and at the file level) to store the license information being compiled in #12. This could be in lieu of, or in addition to, a data card.

I will run a check today on HF Hub and see if anyone (at least the big data producers) is doing anything like this already.

I can blow the code below out into a real notebook, but I felt it was short enough to get the point across.

We can talk about this at the next standup. Lots and lots of possibilities here.

import pyarrow.csv as pv
import pyarrow.parquet as pq
import pyarrow as pa
import json

# start with some raw test data:
#
# autos.csv:
#
# make,model,year
# Tesla,Cybertruck,2024
# Ford,F150,2023
# Jeep,Grand Cherokee,2024

table = pv.read_csv('autos.csv')

# Now assign whatever metadata we want.
# It can be added at the column level *OR* at the table level.
# Metadata values must be strings or bytes; nested Python dictionaries are not allowed.

licensed_schema = pa.schema([
    pa.field("make", "string", False, metadata={"PII Check":"Pass",
                                                "HAP Check":"Pass", 
                                                "Executed By":"AI Alliance OTDI Validation v1.1.2"}),
    pa.field("model", "string", False, metadata={"PII Check":"Pass",
                                                "HAP Check":"Pass", 
                                                "Executed By":"AI Alliance OTDI Validation v1.1.2"}),   
    pa.field("year", "int64", False, metadata={"PII Check":"Pass",
                                                "HAP Check":"Pass", 
                                                "Executed By":"AI Alliance OTDI Validation v1.1.2"})
    ],
    metadata={
                "license": "cdla-permissive-2.0",
                "name": "Community Data License Agreement – Permissive, Version 2.0",
                "license_link": "https://github.com/The-AI-Alliance/trust-safety-user-guide/blob/main/LICENSE",
                "tags": "autos, AI Alliance",
                "paperswithcode_id": "2408-01800",
                "validation by AI Alliance OTDI ":"True",
                "validation date": "2025-01-23"
            }
    )

licensed_table = table.cast(licensed_schema)
pq.write_table(licensed_table, 'autos.parquet')


read_test = pq.read_table('autos.parquet').schema

read_test.metadata               # returns the entire metadata block
read_test.metadata[b'license']   # => b"cdla-permissive-2.0"


read_test.field('make').metadata[b'PII Check'] # b'Pass'

@blublinsky
Contributor

Actually it is much simpler:

    def get_dataset_card(self, ds_name: str) -> RepoCard:
        """
        Get the Repo card for the data set
        :param ds_name: data set name in the format owner/ds_name
        :return: DS card object
        """
        # get file location
        if ds_name[-1] == "/":
            path = f"datasets/{ds_name[:-1]}/README.md"
        else:
            path = f"datasets/{ds_name}/README.md"
        # read README file
        try:
            with self.fs.open(path=path, mode="r", newline="", encoding="utf-8") as f:
                data = f.read()
        except Exception as e:
            logger.warning(f"Failted to read README file {e}")
            return None
        # convert README to Repo card
        return RepoCard(content=data)

Also see https://github.com/IBM/data-prep-kit/blob/hf-data-access/data-processing-lib/python/src/data_processing/data_access/data_access_hf.py#L242
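
For completeness, the same lookup can be sketched with the stock huggingface_hub card classes (the repo id is a placeholder; this is the high-level equivalent, not the DPK code linked above):

from huggingface_hub import DatasetCard

card = DatasetCard.load("some-org/some-dataset")   # placeholder; fetches and parses README.md
print(card.data.license)                           # license declared in the card's YAML header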

@jolson-ibm
Author

The HuggingFace Parquet converter bot seems to write the schema information to the file-level metadata field, under the top-level "huggingface" tag:

import json

import pyarrow.parquet as pq

schema = pq.read_table('./wine_reviews/data/train-00000-of-00001-258bcfc491c8fd80.parquet').schema
metadata = schema.metadata
metadata_dictionary = json.loads(metadata[b"huggingface"].decode("utf-8"))
print(json.dumps(metadata_dictionary, indent=3))

{
   "info": {
      "features": {
         "country": {
            "dtype": "string",
            "_type": "Value"
         },
         "description": {
            "dtype": "string",
            "_type": "Value"
         },
         "points": {
            "dtype": "int64",
            "_type": "Value"
         },
         "price": {
            "dtype": "float64",
            "_type": "Value"
         },
         "province": {
            "dtype": "string",
            "_type": "Value"
         },
         "variety": {
            "names": [
               "Bordeaux-style Red Blend",
               "Bordeaux-style White Blend",
               [...]          # Removed for brevity
               "White Blend",
               "Zinfandel"
            ],
            "_type": "ClassLabel"
         }
      }
   }
}

The metadata stored here can also be retrieved using the dataset viewer API:

dataset = "james-burton/wine_reviews"
url = f"https://datasets-server.huggingface.co/rows?dataset={dataset}&config=default&split=train&offset=0&length=1"
response = requests.get(url)
print(json.dumps(response.json(), indent = 3))


   "features": [
      {
         "feature_idx": 0,
         "name": "country",
         "type": {
            "dtype": "string",
            "_type": "Value"
         }
      },
      {
         "feature_idx": 1,
         "name": "description",
         "type": {
            "dtype": "string",
            "_type": "Value"
         }
      },
      {
         "feature_idx": 2,
         "name": "points",
         "type": {
            "dtype": "int64",
            "_type": "Value"
         }
      },
      {
         "feature_idx": 3,
         "name": "price",
         "type": {
            "dtype": "float64",
            "_type": "Value"
         }
      },
      {
         "feature_idx": 4,
         "name": "province",
         "type": {
            "dtype": "string",
            "_type": "Value"
         }
      },
      {
         "feature_idx": 5,
         "name": "variety",
         "type": {
            "names": [
               "Bordeaux-style Red Blend",
               "Bordeaux-style White Blend",
                [...]          # Removed for brevity
               "White Blend",
               "Zinfandel"
            ],
            "_type": "ClassLabel"
         }
      }
   ],


I believe this is also the endpoint that is being called from the HF web UI to render the "Dataset viewer" view.

That being said, there is nothing stopping us from adding an "AI Alliance" block to the file-level metadata, right alongside the "huggingface" block. The "AI Alliance" block can contain the licensing information we want to capture here, either in addition to, or in lieu of, the dataset card.

I could also see a sha256 being generated on the Parquet file to make sure the licensing information attached to the metadata field is legitimate.
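
A rough sketch of that idea, merging a hypothetical "AI Alliance" entry into the existing file-level metadata and hashing the resulting file; the key name, fields, and file paths are placeholders:

import hashlib
import json

import pyarrow.parquet as pq

table = pq.read_table("train-00000-of-00001.parquet")   # placeholder file name

# Merge an "AI Alliance" entry in alongside whatever metadata (e.g. "huggingface") is already there.
file_metadata = dict(table.schema.metadata or {})
file_metadata[b"AI Alliance"] = json.dumps({
    "license": "cdla-permissive-2.0",
    "validated_by": "AI Alliance OTDI Validation",
    "validation_date": "2025-01-23",
}).encode("utf-8")

table = table.replace_schema_metadata(file_metadata)
pq.write_table(table, "train-00000-of-00001.validated.parquet")

# sha256 of the rewritten file, to pin the attached licensing metadata to a specific artifact.
with open("train-00000-of-00001.validated.parquet", "rb") as f:
    print(hashlib.sha256(f.read()).hexdigest())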

@blublinsky
Contributor

We are talking about two different things. I am talking about the README, which has to contain a data card with the license field. This is mandatory.
You seem to be talking about individual Parquet files, which may or may not contain a license field; I think most of them do not. You are also talking about the HF Parquet converter, which runs only if the source is not Parquet. Too many ifs for my taste.

@deanwampler
Contributor

Please read the list of V0.1 features we'll try to implement: #12 (comment)

First is parsing the README. A stretch goal, which can be part of V0.2, is to do a uniqueness query on the license column in the Parquet files, then look for discrepancies, unacceptable licenses, etc. (a tiny sketch of that query follows).
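
If that stretch goal gets picked up, the per-file query could be as small as this sketch, assuming the Parquet files actually carry a license column; the column and path names are placeholders:

import pyarrow.compute as pc
import pyarrow.parquet as pq

table = pq.read_table("some-shard.parquet", columns=["license"])   # placeholder path / column
print(pc.unique(table.column("license")).to_pylist())              # flag anything unexpected or inconsistent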

@jolson-ibm
Author

Ticket is obsolete. @blublinsky reports he has a license validator working under DPK.

This was a temporary ticket until we could get clarification on a DPK strategy. This ticket was also opened before @blublinsky joined the team, and he has experience developing under DPK.

@github-project-automation github-project-automation bot moved this from In Progress to Done in FA5: OTDI Tasks Feb 7, 2025