Build DAG from functions directly #1268

Open
zilto opened this issue Jan 6, 2025 · 4 comments
Labels
core-work Work that is "core". Likely overseen by core team in most cases. question Further information is requested

Comments

@zilto
Collaborator

zilto commented Jan 6, 2025

Summary

Add a first-class way to build a DAG from function objects (in contrast to the current DIY / hacky approaches). The user API could be either:

  • Builder().with_functions()
  • or an intermediary my_module = module_from_functions(*fns) that is then passed to .with_modules()

Goals:

  • simplify codebase
  • foundation for better hierarchical / nested graph structures (i.e., subdags)

Current

At the core of Hamilton, users:

  1. write functions in a Python module ("dataflow code")
  2. load that module in the "driver code" to build a DAG
  3. execute the DAG via the Driver

Problem

  • It is currently possible to build Hamilton DAGs from functions, but we have no official "here's how you do it" that we guarantee we'll support.
    • the name ad_hoc_utils signals exactly the opposite
  • In a notebook, there's no good reason to create a module from a notebook function before passing it to Hamilton (except our own constraints)
  • Python module machinery is complex and adds indirection to the codebase (tests, notebook extension, LSP)
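
To illustrate the indirection described in this list, here is a minimal stdlib-only sketch of wrapping in-memory functions in a throwaway module so that module-based code can consume them. The helper name `module_from_functions` is hypothetical (borrowed from the summary), and this is not a claim about how `ad_hoc_utils` is implemented.

```python
import inspect
import types
import uuid


def module_from_functions(*fns, name=None):
    """Wrap function objects in a throwaway module (hypothetical helper)."""
    # Assumption: a random name is good enough to avoid collisions.
    module = types.ModuleType(name or f"temp_{uuid.uuid4().hex}")
    for fn in fns:
        if not inspect.isfunction(fn):
            raise TypeError(f"expected a function, got {type(fn).__name__}")
        setattr(module, fn.__name__, fn)
    return module


def a() -> int:
    return 1


def b(a: int) -> int:
    return a + 1


# The functions already exist; the module is created only to satisfy
# an API that expects modules.
temp_module = module_from_functions(a, b, name="my_dataflow")
```

Note that `b.__module__` still points at wherever `b` was defined, so the wrapper only changes where the functions are found, not their metadata.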

Benefits

  • greatly simplify many unit tests
  • facilitate marimo integration

Hamilton 2.0 / Broader perspective

There's no well-defined structure or purpose to Hamilton top-level modules (e.g., nodes, graph_types, graph_utils, graph, ad_hoc_utils, base, hamilton.common, models). I propose a structure that matches the Hamilton lifecycle:

  • hamilton.parser: everything that deals with source code: how functions are written, whether type annotations are present (not type matching), collecting functions from modules, converting a notebook cell string to a module, removing comments and docstrings before hashing source code
  • hamilton.compiler: converting code to DAG: structuring the DAG from functions, applying function modifiers, validating types, etc.
  • remove ad_hoc_utils
@zilto zilto added enhancement New feature or request core-work Work that is "core". Likely overseen by core team in most cases. labels Jan 6, 2025
@skrawcz
Collaborator

skrawcz commented Jan 6, 2025

Yes there could be a big refactor for Hamilton 2.0.

Otherwise, a small reason for making things look like a module was so that we wouldn't have to rewire the internals of Hamilton, which assumed a module would be passed in. The larger reason it wasn't enabled from the beginning, and why ad_hoc_utils was named that on purpose, is that we wanted people to curate their code into modules as part of an SDLC; coupling functions with where you execute them makes them less usable and modular, since you likely couple execution imports with the ones for the functions... There have been a lot of lessons learned since then, so there could be improvements here.

Question:

  1. What's the blocker for marimo integration?

@skrawcz skrawcz changed the title Build DAG from functions Build DAG from functions directly Jan 6, 2025
@skrawcz skrawcz added question Further information is requested and removed enhancement New feature or request labels Jan 6, 2025
@skrawcz
Collaborator

skrawcz commented Jan 6, 2025

Otherwise some API things to think through / check:

  1. What is the proposed API?
  2. Where can functions come from? Only the current module? or any module?
  3. How would this interact with modules if they're provided?
  4. Would this impact serialization / deserialization, e.g. for ray parallel...
  5. Would this break any Hamilton UI assumptions?

@Dev-iL
Contributor

Dev-iL commented Jan 6, 2025

  1. What types of callables will be supported? Lambdas? Static methods from classes? How to restrict the user to only the allowed callable types?

@zilto
Collaborator Author

zilto commented Jan 6, 2025

Current main code path (roughly):

  1. functions are defined in a file, let's say dataflow.py
  2. dataflow.py is imported into the module dataflow
  3. hamilton.graph_utils.find_functions() retrieves "hamilton functions" from the module dataflow
  4. find_functions() is called in two places (serving the same purpose): in hamilton.graph.create_function_graph() and @subdag()
  5. we get a FunctionGraph, Driver, etc.
  6. the Node object directly retrieves the __module__ information from the function
  7. Parts of the codebase use Node metadata about the module

In other words, the module abstraction is currently irrelevant for building the DAG. It only matters for downstream usage.

Propositions:

  1. the current hamilton.graph_utils.find_functions() does 2 things: get functions from a module and determine if they are valid "hamilton functions". We need to decouple these two operations for flexibility.
  2. Change create_function_graph() to take in functions instead of modules. The modules passed are irrelevant to this operation
  3. the module metadata attributes could be left empty for dynamic cases (would need to evaluate what are the affected downstream code paths).

Answers

Question:
What's the blocker for marimo integration?

Current API options around "get functions from the current namespace and build a DAG" look hacky and have poor ergonomics. Solving this problem would be the same as providing a smoother "if __name__ == '__main__', run this as a DAG" experience.

What is the proposed API?

The current propositions don't need to involve a user-facing API change. They would make developers' lives easier and would provide a first-class way of passing functions to create a function graph / driver.

Simple user-facing API options:

  • allow .with_modules(...) to take in functions too
  • add .with_functions(...) to the Builder, and create_function_graph() will work on the functions and modules.
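
A rough sketch of the second option, for discussion only. `SketchBuilder` is a toy class, not Hamilton's `Builder`; the duplicate-name check and the dict returned by `build()` are assumptions made for illustration.

```python
import inspect
import types


class SketchBuilder:
    """Toy builder that accepts both modules and functions (hypothetical)."""

    def __init__(self):
        self._functions = {}

    def _add(self, fn):
        existing = self._functions.get(fn.__name__)
        # Assumption: two different functions with the same name are an error.
        if existing is not None and existing is not fn:
            raise ValueError(f"duplicate node name: {fn.__name__}")
        self._functions[fn.__name__] = fn

    def with_modules(self, *modules):
        for module in modules:
            for _, fn in inspect.getmembers(module, inspect.isfunction):
                if not fn.__name__.startswith("_"):
                    self._add(fn)
        return self

    def with_functions(self, *fns):
        for fn in fns:
            self._add(fn)
        return self

    def build(self):
        # Node name -> dependency names, read from the function signature.
        return {
            name: list(inspect.signature(fn).parameters)
            for name, fn in self._functions.items()
        }


def x() -> int:
    return 1


def y(x: int) -> int:
    return x + 1


def z(y: int) -> int:
    return y * 10


dataflow = types.ModuleType("dataflow")
dataflow.x, dataflow.y = x, y

# Modules and loose functions end up in the same pool of nodes.
nodes = SketchBuilder().with_modules(dataflow).with_functions(z).build()
```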

Where can functions come from? Only the current module? or any module?

Could be from anywhere. Doesn't need to come from a file. Our options to maintain compatibility:

  • all "anonymous" functions are put in the same namespace (i.e., module)
  • we tweak downstream paths to accept empty module attribute (my preferred option)
  • users must provide namespace or we automatically assign a uuid (similar to create_temporary_module())

How would this interact with modules if they're provided?

No change to current behavior because the module metadata plays no role in graph building and core Hamilton features.

Would this impact serialization / deserialization, e.g. for ray parallel...

Don't know the details of this. If you have an instantiated function, you must have the pickleable bytes of the instantiated function, and you probably have the source code available (from a .py file or from an interactive session) that you can gather and use to re-instantiate the function remotely. Doesn't sound like a blocker; all orchestrators have to deal with that.
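
For context, a small stdlib-only sketch of the constraint involved: the standard `pickle` module serializes plain functions by reference (module name plus qualified name), so a function living in a module that cannot be imported fails to pickle until that module is resolvable. The module name `dynamic_dataflow_sketch` is made up for this example, and this says nothing about what Hamilton's Ray integration actually does.

```python
import pickle
import sys
import types

# Build a function that lives only in a dynamically created module,
# similar in spirit to code defined in an interactive session.
dynamic = types.ModuleType("dynamic_dataflow_sketch")
exec("def add_one(x):\n    return x + 1\n", dynamic.__dict__)
add_one = dynamic.add_one

# pickle stores functions by reference, so dumping fails while the
# function's module cannot be found by name.
try:
    pickle.dumps(add_one)
    pickled_before_registration = True
except (pickle.PicklingError, ImportError, AttributeError):
    pickled_before_registration = False

# Registering the module makes the reference resolvable in this process.
sys.modules[dynamic.__name__] = dynamic
restored = pickle.loads(pickle.dumps(add_one))
```

A remote worker would not have this module registered, which is why shipping source code or using a by-value serializer are the usual ways around it.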

Would this break any Hamilton UI assumptions?

Don't know the assumptions of the Hamilton UI. If there is a blocking assumption, it's better to change it? The main potential limitation is that we have some UI components that expect a non-empty metadata field.

What types of callables will be supported? Lambdas? Static methods from classes? How to restrict the user to only the allowed callable types?

Lambdas and static methods are not currently supported and are not within the scope of what I intended here. That said, building graphs from functions would introduce a simple way to create nodes and graphs from lambdas and static methods.


3 participants