dbt-sqlite

A SQLite adapter plugin for dbt (data build tool)

Please read these docs carefully and use at your own risk. Issues and PRs welcome!

The Use Case

SQLite is an embedded SQL database. It comes included with most Python distributions and requires no installation or configuration. It can be a good choice if your project meets any of these criteria:

  • you store the database file on fast, local storage (not on a network drive)
  • the amount of data is relatively small (GBs, not TBs)
  • you're a data team of one with no need to share access to a database
  • your end goal is to export the results of your pipeline(s) into other systems for multi-user access or into BI/viz tools for analysis (i.e. you're doing ETL rather than ELT)
  • your project is a proof of concept, to eventually be moved into another database or data warehouse platform
  • you want others to be able to deploy your data build without the overhead/cost of a full RDBMS or signing up for a data warehouse platform

SQLite can be surprisingly fast, despite the query optimizer not being as sophisticated as other databases and data warehouse platforms. Tip: materialize your models as tables and create indexes in post-hooks to speed up filtering and joins.
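To see what an index buys you, here is a minimal sketch using Python's built-in sqlite3 module, outside of dbt. The `orders` table and the index name are made up for illustration; the `CREATE INDEX` statement is the kind of SQL you would put in a post-hook on a table model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for a model materialized as a table.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)


def query_plan(connection, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[3] for row in rows)


query = "SELECT amount FROM orders WHERE customer_id = 42"
plan_before = query_plan(conn, query)

# The statement you would run from a post-hook after the table is built.
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")
plan_after = query_plan(conn, query)

print("before:", plan_before)
print("after: ", plan_after)
conn.close()
```

The exact wording of the plan varies between SQLite versions, but the first plan should be a full scan of the table and the second should mention the index.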

How to Use This

Use the right version. Starting with the release of dbt-core 1.0.0, versions of dbt-sqlite are aligned to the same major+minor version of dbt-core.

  • versions 1.9.x of this adapter work with dbt-core 1.9.x
  • versions 1.6.x - 1.8.x are not supported: this adapter fell behind, and it's too much work to go back to support these versions
  • versions 1.5.x of this adapter work with dbt-core 1.5.x
  • versions 1.4.x of this adapter work with dbt-core 1.4.x
  • versions 1.3.x of this adapter work with dbt-core 1.3.x
  • versions 1.2.x of this adapter work with dbt-core 1.2.x
  • versions 1.1.x of this adapter work with dbt-core 1.1.x
  • versions 1.0.x of this adapter work with dbt-core 1.0.x
  • versions 0.2.x of this adapter work with dbt 0.20.x and 0.21.x
  • versions 0.1.x of this adapter work with dbt 0.19.x
  • versions 0.0.x of this adapter work with dbt 0.18.x

Install this package:

# run this to install the latest version
pip install dbt-sqlite

# OR run this to install a specific version
pip install dbt-sqlite==1.0.0

Create an entry in your ~/.dbt/profiles.yml file with the following configuration:

dbt_sqlite:

  target: dev
  outputs:
    dev:
      type: sqlite

      # sqlite locks the whole db on writes so anything > 1 won't help
      threads: 1

      # value is arbitrary
      database: "database"

      # value of 'schema' must be defined in schemas_and_paths below. in most cases,
      # this should be 'main'
      schema: 'main'

      # connect schemas to paths: at least one of these must be 'main'
      schemas_and_paths:
        main: '/my_project/data/etl.db'
        dataset: '/my_project/data/dataset_v1.db'

      # directory where all *.db files are attached as schemas, using the base filename
      # as the schema name, and where new schemas are created. this can overlap with the
      # dirs of files in schemas_and_paths as long as there are no conflicts.
      schema_directory: '/my_project/data'

      # optional: list of file paths of SQLite extensions to load. see the
      # "SQLite Extensions" section of this README for more details.
      extensions:
        - "/path/to/sqlean/crypto.so"
        - "/path/to/sqlean/math.so"
        - "/path/to/sqlean/text.so"

Set profile: 'dbt_sqlite' in your project's dbt_project.yml file.
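To make the schema_directory setting more concrete, here is a rough sketch of the idea in plain Python: every *.db file in a directory is attached to the connection, with the base filename used as the schema name. This is not the adapter's actual code; the function name `attach_schema_directory`, its `skip` parameter, and the file names are all assumptions for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def attach_schema_directory(conn, schema_directory, skip=()):
    """Attach every *.db file in schema_directory, one schema per base filename."""
    skip_paths = {Path(p).resolve() for p in skip}
    attached = []
    for db_file in sorted(Path(schema_directory).glob("*.db")):
        if db_file.resolve() in skip_paths:
            continue  # e.g. the file that is already open as 'main'
        schema_name = db_file.stem
        # The file path can be bound as a parameter; the schema name is an
        # identifier, so it is double-quoted (embedded quotes doubled) instead.
        quoted = '"' + schema_name.replace('"', '""') + '"'
        conn.execute(f"ATTACH DATABASE ? AS {quoted}", (str(db_file),))
        attached.append(schema_name)
    return attached


with tempfile.TemporaryDirectory() as data_dir:
    # Create two throwaway database files to attach (hypothetical names).
    for name in ("dataset_v1.db", "reference.db"):
        other = sqlite3.connect(Path(data_dir) / name)
        other.execute("CREATE TABLE t (x)")
        other.commit()
        other.close()

    main_path = Path(data_dir) / "etl.db"
    conn = sqlite3.connect(main_path)  # SQLite names this schema 'main'
    attached = attach_schema_directory(conn, data_dir, skip=[main_path])
    schemas = [row[1] for row in conn.execute("PRAGMA database_list")]
    conn.close()

print(attached)
print(schemas)
```

Note that SQLite limits the number of attached databases per connection (10 by default), so a directory with many *.db files can hit that limit.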

Notes

  • There is no 'database' portion of relation names in SQLite so it gets stripped from the output of ref() and from SQL everywhere. It still needs to be set in the configuration and is used by dbt internally.

  • Schemas are implemented as attached database files. (SQLite conflates databases and schemas.)

    • SQLite automatically assigns 'main' to the file you initially connect to, so this must be defined in your profile. Other schemas defined in your profile get attached when the database connection is created.

    • If dbt needs to create a new schema, it will be created in schema_directory as schema_name.db. Dropping a schema results in dropping all its relations, detaching the database file from the session, and deleting the file.

    • Schema names are stored in view definitions, so when you access a non-'main' database file outside dbt, you'll need to attach it using the same name, or the views won't work.

    • SQLite does not allow views in one schema (i.e. database file) to reference objects in another schema. You'll get this error from SQLite: "view [someview] cannot reference objects in database [somedatabase]". You must set materialized='table' in models that reference other schemas.

  • Materializations are simplified: they drop and re-create the model, instead of building the new model, backing up the old one, and swapping the new one in, as the other dbt database adapters do. This choice was made because SQLite doesn't support DROP ... CASCADE or ALTER VIEW, and doesn't provide information about relation dependencies in anything information_schema-like. These limitations make it really difficult to make the backup-and-swap-in functionality work properly. Given how aggressively SQLite locks the database anyway, it's probably not worth the effort.

  • It's often idiomatic with dbt to use plentiful CASTs. The results of CASTs in SQLite are tricky and depend on how the model is materialized. In a nutshell, using table materializations gives better results.

    • When materialized as a view, the resulting column type from any CAST (or any expression) will always be empty. The SQLite adapter will regard this column type as 'UNKNOWN'.

    • When materialized as a table, a CAST will result in the specified type for INT, REAL, TEXT; casts to NUMERIC and BOOLEAN result in a 'NUM' column type.

  • To get the best fidelity to your seed data, declare all the column types as TEXT in your seed configurations and create a model to do the casts and conversions.
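Two of the notes above (no cross-schema views, and CAST results in table materializations) can be reproduced with Python's built-in sqlite3 module. This is only a sketch outside of dbt: the schema name `dataset`, the table `raw`, and the model name `stg_raw` are hypothetical, and in-memory databases stand in for the database files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite names this schema 'main'
conn.execute("ATTACH DATABASE ':memory:' AS dataset")  # a second schema

# Seed-like source table with every column declared as TEXT.
conn.execute("CREATE TABLE dataset.raw (id TEXT, price TEXT, is_active TEXT)")
conn.execute("INSERT INTO dataset.raw VALUES ('1', '9.99', '1')")

select_sql = """
    SELECT CAST(id AS INTEGER)        AS id,
           CAST(price AS REAL)        AS price,
           CAST(price AS NUMERIC)     AS price_num,
           CAST(is_active AS BOOLEAN) AS is_active,
           CAST(id AS TEXT)           AS id_text
    FROM dataset.raw
"""

# A view in 'main' that reads from 'dataset' is rejected by SQLite.
try:
    conn.execute("CREATE VIEW main.stg_raw AS " + select_sql)
    view_error = None
except sqlite3.OperationalError as exc:
    view_error = str(exc)

# The same SELECT works as a table, and each CAST yields a declared column type.
conn.execute("CREATE TABLE main.stg_raw AS " + select_sql)
column_types = {
    row[1]: row[2] for row in conn.execute("PRAGMA main.table_info('stg_raw')")
}

print(view_error)
print(column_types)
conn.close()
```

The printed error should be the "cannot reference objects in database" message quoted above, and the column types should show INT, REAL, and TEXT for those casts and NUM for the NUMERIC and BOOLEAN casts.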

SQLite Extensions

These modules from SQLean are needed for certain functionality to work:

  • crypto: provides md5 function needed for snapshots
  • math: provides ceil and floor needed for the datediff macro to work
  • text: provides split_part function

Precompiled binaries are available for download from the SQLean GitHub repository page. You can also compile them yourself if you want. Note that some modules depend on other libraries (math, for example, depends on GLIBC); if an extension fails to load, you may want to try building it yourself.

Point to these module files in your profile config as shown in the example above.

Mac OS seems to ship with SQLite libraries that are compiled without support for loading extensions, so this won't work "out of the box." Accordingly, snapshots won't work. If you need snapshot functionality, you'll need to compile SQLite and Python yourself, or find a Python distribution for Mac OS that includes this support.
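A quick way to find out whether your Python build can load extensions at all is to probe the sqlite3 module directly. The helper name `supports_loadable_extensions` is made up for this sketch; it relies on the fact that Python builds without loadable-extension support leave the `enable_load_extension` method off the connection object.

```python
import sqlite3


def supports_loadable_extensions():
    """Return True if this Python's sqlite3 module can load SQLite extensions."""
    conn = sqlite3.connect(":memory:")
    try:
        # Builds without extension support omit this method entirely.
        if not hasattr(conn, "enable_load_extension"):
            return False
        conn.enable_load_extension(True)
        conn.enable_load_extension(False)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()


if supports_loadable_extensions():
    print("extension loading is available")
else:
    print("extension loading is NOT available in this Python build")
```

This only tells you that extension loading is enabled; an individual SQLean module can still fail to load if it was built for a different platform or is missing a library it depends on.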

Development Notes / TODOs

...

Publishing a release to PyPI

Because I forget...

# assumes ~/.pypirc is already set up

workon dbt-sqlite-devel

vi dbt/adapters/sqlite/__version__.py # update version
vi setup.py # update dbt-core dependency if appropriate

# start clean
rm -rf dist/ build/ *.egg-info

# make sure tools are up to date
python -m pip install --upgrade build setuptools wheel twine

# build
python -m build

# upload to PyPI
python -m twine upload dist/*

git commit
git tag vXXX
git push --tags

# go to github and "Draft a new release"

Running Tests

This runs the test suite and cleans up after itself:

./run_tests_docker.sh

To run tests interactively and be able to examine test artifacts:

docker build . -t dbt-sqlite

docker run --rm -it dbt-sqlite bash

# see output for the locations of artifacts
run_tests.sh -s

Credits

Inspired by this initial work by stephen1000: https://github.com/stephen1000/dbt_sqlite

https://github.com/jwills/dbt-duckdb/ - useful for ideas on working with another embedded database

https://github.com/fishtown-analytics/dbt-spark/ - spark also has two-part relation names (no 'database')