Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create data structure for dagster SQL views #2264

Closed
Tracked by #2386
bendnorman opened this issue Feb 3, 2023 · 0 comments · Fixed by #2445
Closed
Tracked by #2386

Create data structure for dagster SQL views #2264

bendnorman opened this issue Feb 3, 2023 · 0 comments · Fixed by #2445
Labels
dagster Issues related to our use of the Dagster orchestrator inframundo sqlite Issues related to interacting with sqlite databases
Milestone

Comments

@bendnorman
Copy link
Member

bendnorman commented Feb 3, 2023

We are planning on converting a lot of our output table pandas code to SQL views so the output tables can be persisted in the database. To do this, we'll create dagster assets that depend on upstream assets (normalized tables or other output tables) and execute SQL code. Here is an example:

@asset(
    non_argument_deps={"utilities_entity_eia", "utilities_eia860", "utilities_eia"},
    io_manager_key="pudl_sqlite_io_manager",
    compute_kind="SQL",
)
def utils_eia860() -> str:
    """Create view of all fields from the EIA860 Utilities table."""
    query = """
    CREATE VIEW utils_eia860 AS
    SELECT *
    FROM (
        SELECT *
        FROM utilities_entity_eia
            LEFT JOIN utilities_eia860 USING (utility_id_eia)
    )
    LEFT JOIN (
        SELECT utility_id_eia,
            utility_id_pudl
        FROM utilities_eia
    ) USING (utility_id_eia);"""
    return query

The non_argument_deps argument allows you specify upstream dependencies of the asset without having to actually load the upstream tables as dataframes. We don't need the tables as dataframes because this view asset is just executing a SQL query. It's important to add the upstream dependencies so views aren't created before the component tables are in the database!

This asset returns a string that the pudl_sqlite_io_manager executes as SQL.

Storing hundreds of lines of SQL as block quotes in python is not ideal! We should be saving these view creation statements in .SQL files to take advantage of formatting and syntax highlighting.

We could create a SQL file for each view in a directory in the output sub-package.

Option 1

One option is to use factory function to takes in the view name and the upstream asset names, reads in a view statement from a SQL file and returns the statement for the IO manager to execute.

def sql_view_asset_factory(
    name: str,
    non_argument_deps: set[str],
    io_manager_key: str = "pudl_sqlite_io_manager",
    compute_kind: str = "SQL",
):
    """Factory for creating SQL view assets."""

    @asset(
        name=name,
        non_argument_deps=non_argument_deps,
        io_manager_key=io_manager_key,
        compute_kind=compute_kind,
    )
    def sql_view_asset() -> str:
        """Asset that creates sql view in a database."""
        SQL_DIR = Path().absolute().parent / "sql"
        with open("tox.ini") as reader:
            # Raise a helpful error here if a sql file doesn't exist
            return reader.read(SQL_DIR / f"{name}.sql")

    return sql_view_asset

utils_eia860_asset = sql_view_asset_factory(
    name="utils_eia860",
    non_argument_deps={"utilities_entity_eia", "utilities_eia860", "utilities_eia"},
)

Option 2

We could also store the asset names and non_argument_deps in a pydantic model called SQLViewAsset. This way we could iterate through a set of SQLViewAsset objects to create the dagster assets. Something like this:

class SQLViewAsset(BaseModel):
      name: str
      non_argument_deps: set[str]
      io_manager_key: str = "pudl_sqlite_io_manager",

      def convert_to_asset(self) -> AssetDefinition:
            @asset(
                name=self.name,
                non_argument_deps=self.non_argument_deps,
                io_manager_key=self.io_manager_key,
                compute_kind="SQL",
            )
            def sql_view_asset() -> str:
                """Asset that creates sql view in a database."""
                SQL_DIR = Path().absolute().parent / "sql"
                with open("tox.ini") as reader:
                    # Raise a helpful error here if a sql file doesn't exist
                    return reader.read(SQL_DIR / f"{self.name}.sql")

            return sql_view_asset

SQLViewAsset(name="utils_eia860", non_argument_deps={"utilities_entity_eia", "utilities_eia860", "utilities_eia"},)
@jdangerx jdangerx moved this to 🆕 New in Catalyst Megaproject Feb 7, 2023
@jdangerx jdangerx moved this from 🆕 New to 🔖 Ready in Catalyst Megaproject Feb 7, 2023
@jdangerx jdangerx added the dagster Issues related to our use of the Dagster orchestrator label Feb 7, 2023
@jdangerx jdangerx modified the milestones: Port ETL to Dagster, Persist Outputs in Dagster Feb 7, 2023
@jdangerx jdangerx moved this from 🔖 Backlog to 🥶 Icebox in Catalyst Megaproject Mar 13, 2023
@zaneselvans zaneselvans modified the milestones: Outputs & Analysis Tables in Dagster, 2023Q1, 2023Q2 Mar 14, 2023
@zaneselvans zaneselvans added the sqlite Issues related to interacting with sqlite databases label Mar 22, 2023
@bendnorman bendnorman moved this from 🥶 Icebox to 🔖 Backlog in Catalyst Megaproject Mar 22, 2023
@bendnorman bendnorman linked a pull request Mar 23, 2023 that will close this issue
8 tasks
@bendnorman bendnorman moved this from 🔖 Backlog to 🚧 In progress in Catalyst Megaproject Mar 23, 2023
@github-project-automation github-project-automation bot moved this from 🚧 In progress to ✅ Done in Catalyst Megaproject Apr 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dagster Issues related to our use of the Dagster orchestrator inframundo sqlite Issues related to interacting with sqlite databases
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants