Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for Azure Synapse External Tables #44

Merged
merged 30 commits into from
Nov 29, 2020
Merged

Add support for Azure Synapse External Tables #44

merged 30 commits into from
Nov 29, 2020

Conversation

dataders
Copy link
Collaborator

@dataders dataders commented Oct 28, 2020

Description & motivation

Now users of Azure SQL (using dbt-sqlserver) and Azure Synapse (using dbt-azuresynapse) can use dbt-external-tables to stage external tables.

I mostly copied the redshift implementation, but, because Synapse has no notion of external table "refresh", I removed that.

hanging chads

I left comments in code of ways to make these macros more robust, but can take those comments out if that's not y'alls convention. Basically I think it would be "cool" to be able to set:

  • column nullity based on whether a column has as not_null test associated with it. (tests don't seem to be an attribute of a table node?)
  • the ability to set QUOTED_IDENTIFIER and ANSI_NULLS in the config rather than always being set to ON.

Checklist

  • I have verified that these changes work locally
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)

@dataders
Copy link
Collaborator Author

@jtcohen6 anything I can do to help move this forward? It looks like #42 is also hanging out. FYI our team isn't currently blocked by this as we're referencing my fork in the packages.yml, but we'd like to make this available to other teams using Synapse.

@jtcohen6
Copy link
Collaborator

@swanderz This looks really cool! I have a few issues and PRs in this repo to catch up on. I'm out for a few days, but I plan to get to these by the end of next week.

Copy link
Collaborator

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@swanderz Really neat, sorry for my delay getting back to you, and above all, thank you for wanting to contribute to this package.

I left a few comments:

  • to make this implementation consistent with how the others work, especially regarding var('ext_full_refresh') (which is my hack for --full-refresh)
  • to clarify my own understanding about how external tables work in AzureSQL / Synapse

I added very simple integration tests to this package over the weekend, so as to:

  • help folks contribute small fixes to this package
  • establish a low-level baseline of common functionality across databases

I don't have a way presently to stand up a Synapse instance such that CircleCI could connect to it. Even so, do you want to take a swing at staging the fixture data? (s3://dbt-external-tables-testing, gs://dbt-external-tables-testing)

@dataders dataders changed the title Add support for Synapse & Azure SQL External Tables Add support for Azure Synapse External Tables Nov 26, 2020
@dataders
Copy link
Collaborator Author

dataders commented Nov 26, 2020

@jtcohen6 I took all your suggestions and ran with them.I'm 86% confident there's a typo somewhere, we'll find it as soon as we get the env_vars plugged into CircleCI.

  1. Azure Synapse can:
    1. only read from Azure Storage accounts, so I uploaded the files to an ADLS Gen2 Container and gave anonymous list and read access
    2. NOT read JSON files... yet?
  2. I somehow broke the tests for all the other DWs... 🤷
  3. I shared the Synapse test instance creds separately.

@jtcohen6
Copy link
Collaborator

Nice! I plugged in the Synapse creds to CircleCI, though now it looks like we're getting the error:

False is not of type 'string'

I wonder if encrypt: no is being coerced to falsy?

As far as fixing tests on the other databases, could you add these lines to integration_tests/dbt_project.yml:

sources:
  dbt_external_tables_integration_tests:
    ...
    sqlserver_external:
      +enabled: "{{ target.type == 'sqlserver' }}"

@dataders
Copy link
Collaborator Author

dataders commented Nov 28, 2020

I wonder if encrypt: no is being coerced to falsy?
@jtcohen6 yep I was referencing the master branch of the adapter rather than the current release. fixed now.

a lot of naming shenanigans that caused lots of head+keyboard here...

  • dbt-utils's tests don't work as t-sql doesn't play nicely with three-part table names. I hacked and added copied macros here until we have a long term solution?
  • if you use a '.' in an external data source or external file format name you must bracketize the name (I hard coded them in where they appear, but perhaps there's a neater way)
  • wasbs:// vs abfss://... the former (and older) method was the only thing I could get to work... 🤷

Some potential outstanding items:

  • add parquet files (so at least we can test more functionality?
  • think about how OPENROWSET() might allow for
    • table creating without specifying schema and
    • partitioning by filename and filepath

@dataders dataders requested a review from jtcohen6 November 28, 2020 01:44
@jtcohen6
Copy link
Collaborator

jtcohen6 commented Nov 29, 2020

Heroic work here, @swanderz. This feels like it's in really solid, workable shape. There's some reorganization I'd like to do, which I can manage after merging.

dbt-utils's tests don't work as t-sql doesn't play nicely with three-part table names. I hacked and added copied macros here until we have a long term solution?

I'm surprised this hasn't come up before. The simplest and most thorough way to solve this is by overriding the Relation + include_policy for the dbt-sqlserver + dbt-synapse adapters, similar to what we've got in the dbt-spark adapter here:

@dataclass
class SQLServerIncludePolicy(Policy):
    database: bool = False
    schema: bool = True
    identifier: bool = True

@dataclass(frozen=True, eq=False, repr=False)
class SQLServerRelation(BaseRelation):
    include_policy: SQLServerIncludePolicy = SQLServerIncludePolicy()

If we could add that to the adapters, then we won't need to override the tests in this package.

if you use a '.' in an external data source or external file format name you must bracketize the name (I hard coded them in where they appear, but perhaps there's a neater way)

As far as one-off syntactical weirdnesses go, I think this is nbd.

Let's open new issues for:

  • adding parquet-formatted files, for testing on all databases — I can take care of this one (Add parquet data source for integration testing #55)
  • SQLServer-specific improvements around adding support for abfss:// and OPENROWSET() — would you mind writing up this one?

@jtcohen6 jtcohen6 merged commit 3c322cd into dbt-labs:master Nov 29, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants