diff --git a/README.md b/README.md index 46ee86dca..f17b58289 100644 --- a/README.md +++ b/README.md @@ -1,330 +1,267 @@
- -

Welcome to the official Hamilton Github Repository

- - Hamilton CircleCI - - - Documentation Status - - Hamilton Slack - Twitter -
- Python supported - PyPi Version - Total Downloads -
- - +

Hamilton - portable & expressive
data transformation DAGs

+ + Documentation Status + + Python supported + + + PyPi Version + + + Total Downloads + + + + + + + Hamilton Slack + + + + + + +
+

-# Hamilton +Hamilton is a lightweight Python library for directed acyclic graphs (DAGs) of data transformations. Your DAG is **portable**; it runs anywhere Python runs, whether it's a script, notebook, Airflow pipeline, FastAPI server, etc. Your DAG is **expressive**; Hamilton has extensive features to define and modify the execution of a DAG (e.g., data validation, experiment tracking, remote execution). -The general purpose micro-orchestration framework for building [dataflows](https://en.wikipedia.org/wiki/Dataflow) from python functions. Express data, ML, LLM pipelines/workflows, and web requests in a simple declarative manner. -Hamilton also comes with a [UI](https://hamilton.dagworks.io/en/latest/concepts/ui) to visualize, catalog, and monitor your dataflows. +To create a DAG, write regular Python functions that specify their dependencies with their parameters. As shown below, it results in readable code that can always be visualized. Hamilton loads that definition and automatically builds the DAG for you! -Hamilton is a novel paradigm for specifying a flow of delayed execution in python. It works on python objects of any type and dataflows of any complexity. Core to the design of Hamilton is a clear mapping of function name to artifact, allowing you to quickly grok the relationship between the code you write and the data you produce. +
+ Create a project +
Functions B() and C() refer to function A in their parameters
+
+
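If the image above does not render for you, here is a minimal sketch of the same idea. The module name `my_dag.py` matches the `import my_dag` used in the tracking example later in this README; the `external_input` parameter and the arithmetic are made-up placeholders, not part of the official example.

```python
# my_dag.py -- each function becomes a node named after the function
def A(external_input: int) -> int:
    """Node A is computed from a value supplied at execution time."""
    return external_input % 7


def B(A: int) -> int:
    """Node B depends on node A simply by declaring a parameter named `A`."""
    return A * 2


def C(A: int, B: int) -> int:
    """Node C depends on both A and B."""
    return A + B
```

Hamilton then assembles and runs the DAG from that module:

```python
# run.py
from hamilton import driver

import my_dag

dr = driver.Builder().with_modules(my_dag).build()
results = dr.execute(["A", "B", "C"], inputs={"external_input": 10})
print(results["C"])  # 9, since A = 10 % 7 = 3 and B = 2 * A = 6

# With the `visualization` extra and Graphviz installed, the DAG can also be rendered, e.g.:
# dr.visualize_execution(["C"], "./my_dag.dot", {"format": "png"}, inputs={"external_input": 10})
```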
-This paradigm makes modifications easy to build and track, ensures code is self-documenting, and makes it natural to unit test your data transformations. When connected together, these functions form a [Directed Acyclic Graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) (DAG), which the Hamilton framework can execute, optimize, and report on. +Hamilton brings modularity and structure to any Python application moving data: ETL pipelines, ML workflows, LLM applications, RAG systems, and BI dashboards. The [Hamilton UI](https://hamilton.dagworks.io/en/latest/concepts/ui) allows you to automatically visualize, catalog, and monitor execution. -Note: Hamilton describes DAGs. If you're looking for something to handle loops or conditional edges (say, for a human-in-the-loop application like a chatbot or agent), you might appreciate [Burr](https://github.com/dagworks-inc/burr) -- it integrates well with any python library (including Hamilton!). +> Hamilton is great for DAGs, but if you need loops or conditional logic to create an LLM agent or a simulation, take a look at our sister library [Burr](https://github.com/dagworks-inc/burr) 🤖. -

- Description1 - Description2 - Description3 -

-

- Optional UI to browse transforms, monitor datasets, and track executions -

+# Installation -## Problems Hamilton Solves -✅ Model a dataflow -- If you can model your problem as a DAG in python, Hamilton is the cleanest way to build it.
-✅ Unmaintainable spaghetti code -- Hamilton dataflows are unit testable, self-documenting, and provide lineage.
-✅ Long iteration/experimentation cycles -- Hamilton provides a clear, quick, and methodical path to debugging/modifying/extending your code.
-✅ Reusing code across contexts -- Hamilton encourages code that is independent of infrastructure and can run regardless of execution setting.
-✅ Collaborating on dataflows & tracking execution + artifacts -- Hamilton comes with an optional [UI](https://hamilton.dagworks.io/en/latest/concepts/ui) to visualize, catalog, and monitor your dataflows, which helps teams operate smoothly. - -## Problems Hamilton Does not Solve -❌ Provisioning infrastructure -- you want a macro-orchestration system (see airflow, kubeflow, sagemaker, etc...).
-❌ Doing your ML for you -- we organize your code, BYOL (bring your own libraries).
- -See the table below for more specifics/how it compares to other common tooling. - -## Full Feature Comparison -Here are common things that Hamilton is compared to, and how Hamilton compares to them. - -| Feature | Hamilton | Macro orchestration systems (e.g. Airflow) | Feast | dbt | Dask | -|-------------------------------------------|:---:|:---------------------------------------------:|:-----:|:---:|:----:| -| Python 3.8+ | βœ… | βœ… | βœ… | βœ… | βœ… | -| Helps you structure your code base | βœ… | ❌ | ❌ | βœ… | ❌ | -| Code is always unit testable | βœ… | ❌ | ❌ | ❌ | ❌ | -| Documentation friendly | βœ… | ❌ | ❌ | ❌ | ❌ | -| Can visualize lineage easily | βœ… | ❌ | ❌ | βœ… | βœ… | -| Is just a library | βœ… | ❌ | ❌ | ❌ | βœ… | -| Runs anywhere python runs | βœ… | ❌ | ❌ | ❌ | βœ… | -| Built for managing python transformations | βœ… | ❌ | ❌ | ❌ | ❌ | -| Can model GenerativeAI/LLM based workflows | βœ… | ❌ | ❌ | ❌ | ❌ | -| Replaces macro orchestration systems | ❌ | βœ… | ❌ | ❌ | ❌ | -| Is a feature store | ❌ | ❌ | βœ… | ❌ | ❌ | -| Can model transforms at row/column/object/dataset level | βœ… | ❌ | ❌ | ❌ | ❌ | - -# Getting Started -If you don't want to install anything to try Hamilton, we recommend trying [www.tryhamilton.dev](https://www.tryhamilton.dev/?utm_source=README). -Otherwise, here's a quick getting started guide to get you up and running in less than 15 minutes. -If you need help join our [slack](https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg) community to chat/ask Qs/etc. -For the latest updates, follow us on [twitter](https://twitter.com/hamilton_os)! - -## Installation -Requirements: - -* Python 3.8+ - -To get started, first you need to install hamilton. It is published to pypi under `sf-hamilton`: -> pip install sf-hamilton - -Note: to use the DAG visualization functionality, you should instead do: -> pip install "sf-hamilton[visualization]" - -While it is installing we encourage you to start on the next section. - -Note: the content (i.e. names, function bodies) of our example code snippets are for illustrative purposes only, and don't reflect what we actually do internally. - -## Hamilton in <15 minutes -Hamilton is a new paradigm when it comes to building datasets (in this case we'll use Hamilton to create columns of a -dataframe as an example. Otherwise hamilton can handle _any_ python object. - -Rather than thinking about manipulating a central object (dataframe in this case), -you instead declare the components (columns in this case)/intermediate results you want to create, and the inputs that are required. There -is no need for you to worry about maintaining this object, meaning you do not need to think about any "glue" code; -this is all taken care of by the Hamilton framework. - -For example, rather than writing the following to manipulate a central dataframe object `df`: -```python -df['col_c'] = df['col_a'] + df['col_b'] +Hamilton supports Python 3.8+. We include the optional `visualization` dependency to display our Hamilton DAG. For visualizations, [Graphviz](https://graphviz.org/download/) needs to be installed on your system separately. + +```bash +pip install "sf-hamilton[visualization]" ``` -you would write -```python -def col_c(col_a: pd.Series, col_b: pd.Series) -> pd.Series: - """Creating column c from summing column a and column b.""" - return col_a + col_b +To use the Hamilton UI, install the `ui` and `sdk` dependencies. 
+ +```bash +pip install "sf-hamilton[ui,sdk]" ``` -In diagram form: -![example](hamiltondag.png) -The Hamilton framework will then be able to build a DAG from this function definition. -So let's create a "Hello World" and start using Hamilton! +To try Hamilton in the browser, visit [www.tryhamilton.dev](https://www.tryhamilton.dev/?utm_source=README) -### Your first hello world. -By now, you should have installed Hamilton, so let's write some code. +# Why use Hamilton? -1. Create a file `my_functions.py` and add the following functions: -```python -import pandas as pd +Data teams write code to deliver business value, but few have the resources to standardize practices and provide quality assurance. Moving from proof-of-concept to production and cross-function collaboration (e.g., data science, engineering, ops) remain challenging for teams, big or small. Hamilton is designed to help throughout a project's lifecycle: -def avg_3wk_spend(spend: pd.Series) -> pd.Series: - """Rolling 3 week average spend.""" - return spend.rolling(3).mean() +- **Separation of concerns**. Hamilton separates the DAG "definition" and "execution" which lets data scientists focus on solving problems and engineers manage production pipelines. -def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series: - """The cost per signup in relation to spend.""" - return spend / signups -``` -The astute observer will notice we have not defined `spend` or `signups` as functions. That is okay, -this just means these need to be provided as input when we come to actually wanting to create a dataframe. +- **Effective collaboration**. The [Hamilton UI provides a shared interface](https://hamilton.dagworks.io/en/latest/hamilton-ui/ui/) for teams to inspect results and debug failures throughout the development cycle. -Note: functions can take or create scalar values, in addition to any python object type. +- **Low-friction dev to prod**. Use `@config.when()` to modify your DAG between execution environments instead of error-prone `if/else` feature flags. The notebook extension prevents the pain of migrating code from a notebook to a Python module. -2. Create a `my_script.py` which is where code will live to tell Hamilton what to do: +- **Portable transformations**. Your DAG is [independent of infrastructure or orchestration](https://blog.dagworks.io/publish/posts/detail/145543927?referrer=%2Fpublish%2Fposts), meaning you can develop and debug locally and reuse code across contexts (local, Airflow, FastAPI, etc.). -```python -import pandas as pd -import my_functions +- **Maintainable DAG definition**. Hamilton [automatically builds the DAG from a single line of code whether it has 10 or 1000 nodes](https://hamilton.dagworks.io/en/latest/concepts/driver/). It can also assemble multiple Python modules into a pipeline, encouraging modularity. -from hamilton import driver +- **Expressive DAGs**. [Function modifiers](https://hamilton.dagworks.io/en/latest/concepts/function-modifiers/) are a unique feature to keep your code [DRY](https://en.wikipedia.org/wiki/Don't_repeat_yourself) and reduce the complexity of maintaining large DAGs. Other frameworks inevitably lead to code redundancy or bloated functions. -# This uses one module, but you are free to pass in multiple -dr = driver.Builder().with_modules(my_functions).build() +- **Built-in coding style**. 
The Hamilton DAG is [defined using Python functions](https://hamilton.dagworks.io/en/latest/concepts/node/), encouraging modular, easy-to-read, self-documenting, and unit testable code. -# This is input data -- you can get it from anywhere -initial_columns = { - 'signups': pd.Series([1, 10, 50, 100, 200, 400]), - 'spend': pd.Series([10, 10, 20, 40, 40, 50]), -} -output_columns = [ - 'spend', - 'signups', - 'avg_3wk_spend', - 'spend_per_signup', -] -df = dr.execute(output_columns, inputs=initial_columns) -print(df) -``` -3. Run my_script.py -> python my_script.py +- **Data and schema validation**. Decorate functions with `@check_output` to validate output properties, and raise warnings or exceptions. Add the `SchemaValidator()` adapter to automatically inspect dataframe-like objects (pandas, polars, Ibis, etc.) to track and validate their schema. -You should see the following output: +- **Built for plugins**. Hamilton is designed to play nice with all tools and provides the right abstractions to create custom integrations with your stack. Our lively community will help you build what you need! - spend signups avg_3wk_spend spend_per_signup - 0 10 1 NaN 10.000 - 1 10 10 NaN 1.000 - 2 20 50 13.333333 0.400 - 3 40 100 23.333333 0.400 - 4 40 200 33.333333 0.200 - 5 50 400 43.333333 0.125 -You should see the following image if you ran `dr.visualize_execution(output_columns, './my-dag.dot', {"format": "png"}, orient="TB")`: +# Hamilton UI -![hello_world_image](hello_world_image.png) -Note: we treat displaying `Inputs` in a special manner for readability in our visualizations. So you'll likely see input -nodes repeated. +You can track the execution of your Hamilton DAG in the [Hamilton UI](https://hamilton.dagworks.io/en/latest/hamilton-ui/ui/). It automatically populates a data catalog with lineage and provides execution observability to inspect results and debug errors. You can run it as a [local server](https://hamilton.dagworks.io/en/latest/hamilton-ui/ui/#local-mode) or a [self-hosted application using Docker](https://hamilton.dagworks.io/en/latest/hamilton-ui/ui/#docker-deployed-mode). -Congratulations - you just created your Hamilton dataflow that created a dataframe! +
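To make two of the bullets above concrete before moving on to the UI, here is a small, hypothetical sketch of `@config.when()` and `@check_output`. The `mode` configuration key, the column names, and the function bodies are invented for illustration; the validator arguments shown are just one of the supported options.

```python
# transforms.py -- illustrative only
import numpy as np
import pandas as pd

from hamilton.function_modifiers import check_output, config


@config.when(mode="prod")
def spend__prod(spend_db: pd.Series) -> pd.Series:
    """In "prod" mode, `spend` comes from a database extract passed in as an input."""
    return spend_db


@config.when(mode="dev")
def spend__dev() -> pd.Series:
    """In "dev" mode, use a tiny hard-coded sample -- no if/else feature flags in the logic."""
    return pd.Series([10.0, 10.0, 20.0, 40.0])


@check_output(data_type=np.float64, importance="warn")  # warn rather than fail on violations
def avg_3wk_spend(spend: pd.Series) -> pd.Series:
    """Rolling 3-week average spend; both `spend` variants above resolve to the same node."""
    return spend.rolling(3).mean()
```

Which `spend` implementation is active is decided when the `Driver` is built, e.g. `driver.Builder().with_modules(transforms).with_config({"mode": "dev"}).build()`.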

+ Description1 + Description2 + Description3 +

+

+ DAG catalog, automatic dataset profiling, and execution tracking +

-### Tracking in the UI -To get started with tracking in the UI, you'll first have to install the `sf-hamilton[ui]` package: +## Get started with the Hamilton UI -```bash -pip install "sf-hamilton[ui,sdk]" -``` +1. To use the Hamilton UI, install the dependencies (see `Installation` section) and start the server with -Then, you can run the following code to start the UI: + ```bash + hamilton ui + ``` -```bash -hamilton ui -# python -m hamilton.cli.__main__ ui # on windows -``` +2. On the first connection, create a `username` and a new project (the `project_id` should be `1`). -This will start the UI at [localhost:8241](https://localhost:8241). You can then navigate to the UI to see your dataflows. -You will next want to create a project (you'll have an empty project page), and remember the project ID (E.G. 2 in the following case). -You will also be prompted to enter a username -- recall that as well! - -To track, we'll modify the driver you wrote above: - -```python -import pandas as pd -import my_functions -from hamilton import driver -from hamilton_sdk import adapters -dr = ( - driver - .Builder() - .with_modules(my_functions) - .with_adapters(adapters.HamiltonTracker( - username="elijah", # replace with your username - project_id=2, - dag_name="hello_world", - )) - .build() -) - -# This is input data -- you can get it from anywhere -initial_columns = { - 'signups': pd.Series([1, 10, 50, 100, 200, 400]), - 'spend': pd.Series([10, 10, 20, 40, 40, 50]), -} -output_columns = [ - 'spend', - 'signups', - 'avg_3wk_spend', - 'spend_per_signup', -] -df = dr.execute(output_columns, inputs=initial_columns) -print(df) -``` -Run this script, navigate back to the UI/select your project, and click on the `runs` -link on the left hand side. You'll see your run! +
+ Create a project +
+
-## Example Hamilton Dataflows -We have a growing list of examples showcasing how one might use Hamilton. You currently have two places to find them: +3. Track your Hamilton DAG by creating a `HamiltonTracker` object with your `username` and `project_id` and adding it to your `Builder`. Now, your DAG will appear in the UI's catalog and all executions will be tracked! -1. The [Hamilton Dataflow Hub](https://hub.dagworks.io/) -- which makes it easy to pull and then modify code. -2. The [`examples/`](https://github.com/dagworks-inc/hamilton/tree/main/examples) folder in this repository. + ```python + from hamilton import driver + from hamilton_sdk.adapters import HamiltonTracker + import my_dag + + # use your `username` and `project_id` + tracker = HamiltonTracker( + username="my_username", + project_id=1, + dag_name="hello_world", + ) + + # adding the tracker to the `Builder` will add the DAG to the catalog + dr = ( + driver.Builder() + .with_modules(my_dag) + .with_adapters(tracker) # add your tracker here + .build() + ) + + # executing the `Driver` will track results + dr.execute(["C"]) + ``` + +# Documentation & learning resources + +* πŸ“š See the [official documentation](https://hamilton.dagworks.io/) to learn about the core concepts of Hamilton. + +* πŸ‘¨β€πŸ« Consult the [examples on GitHub](https://github.com/DAGWorks-Inc/hamilton/tree/main/examples) to learn about specific features or integrations with other frameworks. + +* πŸ“° The [DAGWorks blog](https://blog.dagworks.io/) includes guides about how to build a data platform and narrative tutorials. + +* πŸ“Ί Find video tutorials on the [DAGWorks YouTube channel](https://www.youtube.com/@DAGWorks-Inc) + +* πŸ“£ Reach out via the [Hamilton Slack community](https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg) for help and troubleshooting + + +# How does Hamilton compare to X? + +Hamilton is not an orchestrator ([you might not need one](https://blog.dagworks.io/p/lean-data-automation-a-principal)), nor a feature store ([but you can use it to build one!](https://blog.dagworks.io/p/featurization-integrating-hamilton)). Its purpose is to help you structure and manage data transformations. If you know dbt, Hamilton does for Python what dbt does for SQL. + +Another way to frame it is to think about the different layers of a data stack. Hamilton is at the **asset layer**. It helps you organize data transformations code (the **expression layer**), manage changes, and validate & test data. + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
LayerPurposeExample Tools
OrchestrationOperational system for the creation of assetsAirflow, Metaflow, Prefect, Dagster
AssetOrganize expressions into meaningful units
(e.g., dataset, ML model, table)
Hamilton, dbt, dlt, SQLMesh, Burr
ExpressionLanguage to write data transformationspandas, SQL, polars, Ibis, LangChain
ExecutionPerform data transformationsSpark, Snowflake, DuckDB, RAPIDS
DataPhysical representation of data, inputs and outputsS3, Postgres, file system, Snowflake
+
+ +See our page on [Why use Hamilton?](https://hamilton.dagworks.io/en/latest/get-started/why-hamilton/) and framework [code comparisons](https://hamilton.dagworks.io/en/latest/code-comparisons/) for more information. -For the Hub, this will contain user contributed dataflows, e.g. text_summarization, forecasting, data processing, that will be continually added to. +# πŸ“‘ License -For the [`examples/`](https://github.com/dagworks-inc/hamilton/tree/main/examples) directory, you'll have to copy/fork the repository to run them. -E.g. -* [Hello world](https://github.com/dagworks-inc/hamilton/tree/main/examples/hello_world) -* Scaling on to [Ray](https://github.com/dagworks-inc/hamilton/tree/main/examples/ray), [Dask](https://github.com/dagworks-inc/hamilton/tree/main/examples/dask), or [Pandas on Spark](https://github.com/dagworks-inc/hamilton/tree/main/examples/spark) -* Training [a model with scikit-learn](https://github.com/dagworks-inc/hamilton/tree/main/examples/model_examples) -* Doing [air quality analysis solely in numpy](https://github.com/dagworks-inc/hamilton/tree/main/examples/numpy/air-quality-analysis) +Hamilton is released under the BSD 3-Clause Clear License. See [LICENSE](https://github.com/DAGWorks-Inc/hamilton/blob/main/LICENSE.md) for details. -We also have a docker container that contains some of these examples so you can pull that and run them locally. See the [examples folder README](https://github.com/DAGWorks-Inc/hamilton/blob/main/examples/README.md#running-examples-through-a-docker-image) for details. -# We forked and lost some stars -This repository is maintained by the original creators of Hamilton, who have since founded [DAGWorks inc.](https://dagworks.io/), a company largely dedicated to building and maintaining the Hamilton library. We decided to fork the original because Stitch Fix did not want to transfer ownership to us; we had grown the star count in the original repository to 893: Screen Shot 2023-02-23 at 12 58 43 PM -before forking. +# 🌎 Community +## πŸ‘¨β€πŸ’» Contributing +We're very supportive of changes by new contributors, big or small! Make sure to discuss potential changes by creating an issue or commenting on an existing one before opening a pull request. Good first contributions include creating an example or an integration with your favorite Python library! -For the backstory on how Hamilton came about, see the original Stitch Fix [blog post!](https://multithreaded.stitchfix.com/blog/2021/10/14/functions-dags-hamilton/). + To contribute, checkout our [contributing guidelines](https://github.com/DAGWorks-Inc/hamilton/blob/main/CONTRIBUTING.md), our [developer setup guide](https://github.com/DAGWorks-Inc/hamilton/blob/main/developer_setup.md), and our [Code of Conduct](https://github.com/DAGWorks-Inc/hamilton/blob/main/CODE_OF_CONDUCT.md). -# Slack Community -We have a small but active community on [slack](https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg). Come join us! -# License -Hamilton is released under the [BSD 3-Clause Clear License](https://github.com/DAGWorks-Inc/hamilton/blob/main/LICENSE.md). +## 😎 Used by +Hamilton was started at Stitch Fix before the original creators founded DAGWorks Inc! The library is battle-tested and has been supporting production use cases since 2019. 
-# Used internally by: -* [Stitch Fix](https://www.stitchfix.com/) -* [UK Government Digital Services](https://github.com/alphagov/govuk-feedback-analysis) -* [IBM](https://www.ibm.com/) -* [British Cycling](https://www.britishcycling.org.uk/) -* [PNNL](https://pnnl.gov/) +>Read more about the [origin story](https://multithreaded.stitchfix.com/blog/2021/10/14/functions-dags-hamilton/). + + +* [Stitch Fix](https://www.stitchfix.com/) β€” Time series forecasting +* [UK Government Digital Services](https://github.com/alphagov/govuk-feedback-analysis) β€” National feedback pipeline (processing & analysis) +* [IBM](https://www.ibm.com/) β€” Internal search and ML pipelines +* [Opendoor](https://www.opendoor.com/) β€” Manage PySpark pipelines +* [Lexis Nexis](https://www.lexisnexis.com/en-us/home.page) β€” Feature processing and lineage +* [Adobe](https://www.adobe.com/) β€” Prompt engineering research +* [WrenAI](https://github.com/Canner/WrenAI) β€” async text-to-SQL workflows +* [British Cycling](https://www.britishcycling.org.uk/) β€” Telemetry analysis +* [Oak Ridge & PNNL](https://pnnl.gov/) β€” Naturf project * [ORNL](https://www.ornl.gov/) * [Federal Reserve Board](https://www.federalreserve.gov/) -* [Joby Aviation](https://www.jobyaviation.com/) +* [Joby Aviation](https://www.jobyaviation.com/) β€” Flight data processing * [Two](https://www.two.inc/) -* [Transfix](https://transfix.io/) -* [Railofy](https://www.railofy.com) -* [Habitat Energy](https://www.habitat.energy/) -* [KI-Insurance](https://www.ki-insurance.com/) -* [Ascena Retail](https://www.ascena.com/) -* [Opendoor](https://www.opendoor.com/) +* [Transfix](https://transfix.io/) β€” Online featurization and prediction +* [Railofy](https://www.railofy.com) β€” Orchestrate pandas code +* [Habitat Energy](https://www.habitat.energy/) β€” Time-series feature engineering +* [KI-Insurance](https://www.ki-insurance.com/) β€” Feature engineering +* [Ascena Retail](https://www.ascena.com/) β€” Feature engineering * [NaroHQ](https://www.narohq.com/) * [EquipmentShare](https://www.equipmentshare.com/) * [Everstream.ai](https://www.everstream.ai/) * [Flectere](https://flectere.net/) * [F33.ai](https://f33.ai/) -To add your company, make a pull request to add it here. +## 🀝 Code Contributors +[![Contributors](https://contrib.rocks/image?repo=dagworks-inc/hamilton)](https://github.com/DAGWorks-Inc/hamilton/graphs/contributors) -# Contributing -We take contributions, large and small. We operate via a [Code of Conduct](https://github.com/DAGWorks-Inc/hamilton/blob/main/CODE_OF_CONDUCT.md) and expect anyone -contributing to do the same. -To see how you can contribute, please read our [contributing guidelines](https://github.com/DAGWorks-Inc/hamilton/blob/main/CONTRIBUTING.md) and then our [developer -setup guide](https://github.com/DAGWorks-Inc/hamilton/blob/main/developer_setup.md). 
+## πŸ™Œ Special Mentions & 🦟 Bug Hunters -# Blog Posts -* [Lineage + Hamilton in 10 minutes](https://towardsdatascience.com/lineage-hamilton-in-10-minutes-c2b8a944e2e6) -* [(Organic Content) The perks of creating dataflows with Hamilton by Thierry Jean](https://medium.com/@thijean/the-perks-of-creating-dataflows-with-hamilton-36e8c56dd2a) -* [Developing Scalable Feature Engineering DAGs with Metaflow & Hamilton](https://outerbounds.com/blog/developing-scalable-feature-engineering-dags) -* [Tidy Production Pandas with Hamilton](https://towardsdatascience.com/tidy-production-pandas-with-hamilton-3b759a2bf562) -* [Towards Data Science post on backstory & introduction](https://towardsdatascience.com/functions-dags-introducing-hamilton-a-microframework-for-dataframe-generation-more-8e34b84efc1d). -* [How to use Hamilton with Pandas in 5 minutes](https://medium.com/@stefan.krawczyk/how-to-use-hamilton-with-pandas-in-5-minutes-89f63e5af8f5). -* [How to iterate with Hamilton in a Notebook](https://towardsdatascience.com/how-to-iterate-with-hamilton-in-a-notebook-8ec0f85851ed). -* [Original Stitch Fix Post](https://multithreaded.stitchfix.com/blog/2021/10/14/functions-dags-hamilton/). -* [Extension Stitch Fix Post](https://multithreaded.stitchfix.com/blog/2021/10/14/functions-dags-hamilton/). +Thanks to our awesome community and their active involvement in the Hamilton library. -# Videos of talks -* [Hamilton: a python micro-framework for data/feature engineering at Stitch Fix - 40 mins](https://www.youtube.com/watch?v=PDGIt37dov8&ab_channel=AICamp): -[![Watch the video](https://img.youtube.com/vi/PDGIt37dov8/hqdefault.jpg)](https://youtu.be/PDGIt37dov8) -* [Hamilton: a python micro-framework for tidy scalable pandas - ~20 mins](https://www.youtube.com/watch?v=m_rjCzxQj4c&ab_channel=Ponder): +[Nils Olsson](https://github.com/nilsso), [MichaΕ‚ Siedlaczek](https://github.com/elshize), [Alaa Abedrabbo](https://github.com/AAbedrabbo), [Shreya Datar](https://github.com/datarshreya), [Baldo Faieta](https://github.com/baldofaieta), [Anwar Brini](https://github.com/AnwarBrini), [Gourav Kumar](https://github.com/gms101), [Amos Aikman](https://github.com/amosaikman), [Ankush Kundaliya](https://github.com/akundaliya), [David Weselowski](https://github.com/j7zAhU), [Peter Robinson](https://github.com/Peter4137), [Seth Stokes](https://github.com/sT0v), [Louis Maddox](https://github.com/lmmx), [Stephen Bias](https://github.com/s-ducks), [Anup Joseph](https://github.com/AnupJoseph), [Jan Hurst](https://github.com/janhurst), [Flavia Santos](https://github.com/flaviassantos), [Nicolas Huray](https://github.com/nhuray), [Manabu Niseki](https://github.com/ninoseki), [Kyle Pounder](https://github.com/kpounder), [Alex Bustos](https://github.com/bustosalex1), [Andy Day](https://github.com/adayNU) -[![Watch the video](https://img.youtube.com/vi/m_rjCzxQj4c/hqdefault.jpg)](https://www.youtube.com/watch?v=m_rjCzxQj4c&ab_channel=Ponder) - -# Citing Hamilton +# πŸŽ“ Citations We'd appreciate citing Hamilton by referencing one of the following: -``` +```bibtex @inproceedings{DBLP:conf/vldb/KrawczykI22, - author = {Stefan Krawczyk and Elijah ben Izzy}, - editor = {Satyanarayana R. Valluri and Mohamed Za{\"{\i}}t}, title = {Hamilton: a modular open source declarative paradigm for high level modeling of dataflows}, + author = {Stefan Krawczyk and Elijah ben Izzy}, + editor = {Satyanarayana R. 
Valluri and Mohamed Za{\"{\i}}t}, booktitle = {1st International Workshop on Composable Data Management Systems, CDMS@VLDB 2022, Sydney, Australia, September 9, 2022}, year = {2022}, @@ -333,210 +270,15 @@ We'd appreciate citing Hamilton by referencing one of the following: biburl = {https://dblp.org/rec/conf/vldb/KrawczykI22.bib}, bibsource = {dblp computer science bibliography, https://dblp.org} } - +``` +```bibtex @inproceedings{CEURWS:conf/vldb/KrawczykIQ22, + title = {Hamilton: enabling software engineering best practices for data transformations via generalized dataflow graphs}, author = {Stefan Krawczyk and Elijah ben Izzy and Danielle Quinn}, editor = {Cinzia Cappiello and Sandra Geisler and Maria-Esther Vidal}, - title = {Hamilton: enabling software engineering best practices for data transformations via generalized dataflow graphs}, booktitle = {1st International Workshop on Data Ecosystems co-located with 48th International Conference on Very Large Databases (VLDB 2022)}, pages = {41--50}, url = {https://ceur-ws.org/Vol-3306/paper5.pdf}, year = {2022} } ``` -# πŸ›£πŸ—Ί Roadmap / Things you can do with Hamilton -Hamilton is an ambitious project to provide a unified way to describe any dataflow, independent of where it runs. -You can find currently support integrations and high-level roadmap below. Please reach out via [slack](https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg) -or email (stefan / elijah at dagworks.io) to contribute or share feedback! - -## Object types: -* [x] Any python object type! E.g. Pandas, Spark dataframes, Dask dataframes, Ray datasets, Polars, dicts, lists, primitives, -your custom objects, etc. - -## Workflows: -* [x] data processing -* [x] feature engineering -* [x] model training -* [x] LLM application workflows -* [x] all of them together - -## Data Quality -See the [data quality](https://hamilton.dagworks.io/en/latest/how-tos/run-data-quality-checks.html) docs. -* [x] Ability to define data quality check on an object. -* [x] Pandera schema integration. -* [x] Custom object type validators. -* [ ] Integration with other data quality libraries (e.g. Great Expectations, Deequ, whylogs, etc.) - -## Online Monitoring -* [ ] Open telemetry/tracing plugin. - -## Caching: -* [ ] Checkpoint caching (e.g. save a function's result to disk, independent of input) - [WIP](https://github.com/DAGWorks-Inc/hamilton/pull/195). -* [ ] Finergrained caching (e.g. save a function's result to disk, dependent on input). - -## Execution: -* [x] Runs anywhere python runs. E.g. airflow, prefect, dagster, kubeflow, sagemaker, jupyter, fastAPI, snowpark, etc. - -## Backend integrations: -Specific integrations with other systems where we help you write code that runs on those systems. -### Ray -* [x] Delegate function execution to Ray. -* [ ] Function grouping (e.g. fuse multiple functions into a single Ray task) - -### Dask -* [x] Delegate function execution to Dask. -* [ ] Function grouping (e.g. fuse multiple functions into a single Dask task) - -### Spark -* [x] Pandas on spark integration (via GraphAdapter) -* [x] PySpark native UDF map function integration (via GraphAdapter) -* [ ] PySpark native aggregation function integration -* [ ] PySpark join, filter, groupby, etc. integration - -### Snowpark -* [ ] Packaging functions for Snowpark - -### LLVMs & related -* [ ] Numba integration - -### Custom Backends -* [ ] Generate code to execute on a custom topology, e.g. microservices, etc. 
- -## Integrations with other systems/tools: -* [ ] Generating Airflow | Prefect | Metaflow | Dagster | Kubeflow Pipelines | Sagemaker Pipelines | etc from Hamilton. -* [ ] Plugins for common MLOps/DataOps tools: MLFlow, DBT, etc. - -## Dataflow/DAG Walking: -* [x] Depth first search traversal -* [x] Async function support via AsyncDriver -* [x] Parallel walk over a generator -* [x] Python multiprocessing execution (still in beta) -* [x] Python threading support -* [x] Grouping of nodes into tasks for efficient parallel computation -* [ ] Breadth first search traversal -* [ ] Sequential walk over a generator - -## DAG/Dataflow resolution: -* [x] At Driver instantiation time, using configuration/modules and [`@config.when`](https://hamilton.dagworks.io/en/latest/reference/api-reference/decorators.html#config). -* [x] With [`@resolve`](https://hamilton.dagworks.io/en/latest/reference/api-reference/decorators.html#resolve) during Driver instantiation time. - - -# Prescribed Development Workflow -In general we prescribe the following: - -1. Ensure you understand [Hamilton Basics](https://www.tryhamilton.dev/intro). -2. Familiarize yourself with some of the [Hamilton decorators](https://www.tryhamilton.dev/tutorial-extras/configuration). They will help keep your code DRY. -3. Start creating Hamilton Functions that represent your work. We suggest grouping them in modules where it makes sense. -4. Write a simple script so that you can easily run things end to end. -5. Join our [Slack](https://join.slack.com/t/hamilton-opensource/shared_invite/zt-1bjs72asx-wcUTgH7q7QX1igiQ5bbdcg) community to chat/ask Qs/etc. - -For the backstory on Hamilton we invite you to watch a roughly-9 minute lightning talk on it that we gave at the apply conference: -[video](https://www.youtube.com/watch?v=B5Zp_30Knoo), [slides](https://www.slideshare.net/StefanKrawczyk/hamilton-a-micro-framework-for-creating-dataframes). - -## PyCharm Tips -If you're using Hamilton, it's likely that you'll need to migrate some code. Here are some useful tricks we found -to speed up that process. - -### Live templates -Live templates are a cool feature and allow you to type in a name which expands into some code. - -E.g. For example, we wrote one to make it quick to stub out Hamilton functions: typing `graphfunc` would turn into -> - -```python -def _(_: pd.Series) -> pd.Series: - """""" - return _ -``` - -Where the blanks are where you can tab with the cursor and fill things in. See your pycharm preferences for setting this up. - -### Multiple Cursors -If you are doing a lot of repetitive work, one might consider multiple cursors. Multiple cursors allow you to do things on multiple lines at once. - -To use it hit `option + mouse click` to create multiple cursors. `Esc` to revert back to a normal mode. - -# Usage analytics & data privacy -By default, when using Hamilton, it collects anonymous usage data to help improve Hamilton and know where to apply development -efforts. - -We capture three types of events: one when the `Driver` object is instantiated, one when the `execute()` call on the `Driver` object completes, and one for most `Driver` object function invocations. -No user data or potentially sensitive information is or ever will be collected. The captured data is limited to: - -* Operating System and Python version -* A persistent UUID to indentify the session, stored in ~/.hamilton.conf. -* Error stack trace limited to Hamilton code, if one occurs. 
-* Information on what features you're using from Hamilton: decorators, adapters, result builders. -* How Hamilton is being used: number of final nodes in DAG, number of modules, size of objects passed to `execute()`, the name of the Driver function being invoked. - -If you're worried, see telemetry.py for details. - -If you do not wish to participate, one can opt-out with one of the following methods: -1. Set it to false programmatically in your code before creating a Hamilton driver: - ```python - from hamilton import telemetry - telemetry.disable_telemetry() - ``` -2. Set the key `telemetry_enabled` to `false` in ~/.hamilton.conf under the `DEFAULT` section: - ``` - [DEFAULT] - telemetry_enabled = False - ``` -3. Set HAMILTON_TELEMETRY_ENABLED=false as an environment variable. Either setting it for your shell session: - ```bash - export HAMILTON_TELEMETRY_ENABLED=false - ``` - or passing it as part of the run command: - ```bash - HAMILTON_TELEMETRY_ENABLED=false python NAME_OF_MY_DRIVER.py - ``` - -For the hamilton UI you jmust use the environment variable method prior to running docker compose. - -# Contributors - -## Code Contributors -- Stefan Krawczyk (@skrawcz) -- Elijah ben Izzy (@elijahbenizzy) -- Danielle Quinn (@danfisher-sf) -- Rachel Insoft (@rinsoft-sf) -- Shelly Jang (@shellyjang) -- Vincent Chu (@vslchusf) -- Christopher Prohm (@chmp) -- James Lamb (@jameslamb) -- Avnish Pal (@bovem) -- Sarah Haskins (@frenchfrywpepper) -- Thierry Jean (@zilto) -- MichaΕ‚ Siedlaczek (@elshize) -- Benjamin Hack (@benhhack) -- Bryan Galindo (@bryangalindo) -- Jordan Smith (@JoJo10Smith) -- Roel Bertens (@roelbertens) -- Swapnil Delwalkar (@swapdewalkar) -- Fran Boon (@flavour) -- Tom Barber (@buggtb) -- Konstantin Tyapochkin (@tyapochkin) -- Walber Moreira (@wmoreiraa) - -## Bug Hunters/Special Mentions -- Nils Olsson (@nilsso) -- MichaΕ‚ Siedlaczek (@elshize) -- Alaa Abedrabbo (@AAbedrabbo) -- Shreya Datar (@datarshreya) -- Baldo Faieta (@baldofaieta) -- Anwar Brini (@AnwarBrini) -- Gourav Kumar (@gms101) -- Amos Aikman (@amosaikman) -- Ankush Kundaliya (@akundaliya) -- David Weselowski (@j7zAhU) -- Peter Robinson (@Peter4137) -- Seth Stokes (@sT0v -- Louis Maddox (@lmmx) -- Stephen Bias (@s-ducks) -- Anup Joseph (@AnupJoseph) -- Jan Hurst (@janhurst) -- Flavia Santos (@flaviassantos) -- Nicolas Huray (@nhuray) -- Manabu Niseki (@ninoseki) -- Kyle Pounder (@kpounder) -- Alex Bustos (@bustosalex1) -- Andy Day (@adayNU) diff --git a/docs/_static/abc_highlight.png b/docs/_static/abc_highlight.png new file mode 100644 index 000000000..e3872a38d Binary files /dev/null and b/docs/_static/abc_highlight.png differ diff --git a/requirements-docs.txt b/requirements-docs.txt index 9a96c4cd3..4285d8561 100644 --- a/requirements-docs.txt +++ b/requirements-docs.txt @@ -16,6 +16,7 @@ lxml lz4 mock==1.0.1 # read the docs pins myst-parser==2.0.0 # latest version of myst at this time +numpy < 2.0.0 pandera pillow polars