dbt-ml-inline-preprocessing

This dbt package implements a number of machine learning preprocessing techniques inline in SQL.

This dbt package contains macros that can be (re)used across dbt projects.

Note: All methods in this package are meant to be used inline within a select statement.
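
For example, a macro call can be dropped directly into a model's select list (the model and column names here are illustrative):

select
    user_id,
    {{ dbt_ml_inline_preprocessing.standardize(column='user_rating') }} as user_rating_std
from {{ ref('my_model') }}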

Currently supported and tested databases include:

  • Postgres
  • Snowflake
  • DuckDB
Method                Postgres   Snowflake   DuckDB
categorical_impute    ✅         ✅          ✅
numerical_impute      ✅         ✅          ✅
random_impute         ✅         ✅          ✅
label_encode          ✅         ✅          ✅
one_hot_encode        ✅         ✅          ✅
rare_category_encode  ✅         ✅          ✅
exponentiate          ✅         ✅          ✅
interact              ✅         ✅          ✅
k_bins_discretize     ✅         ✅          ✅
log_transform         ✅         ✅          ✅
max_absolute_scale    ✅         ✅          ✅
min_max_scale         ✅         ✅          ✅
numerical_binarize    ✅         ✅          ✅
robust_scale          ✅         ✅          ✅
standardize           ✅         ✅          ✅

Installation Instructions

To import this package into your dbt project, add the following to either the packages.yml or dependencies.yml file:

packages:
  - package: "Matts52/dbt_ml_inline_preprocessing"
    version: [">=0.2.0"]

and run a dbt deps command to install the package to your project.

Check the dbt docs for more information on installing packages.

dbt Versioning

This package currently supports dbt versions 1.1.0 through 2.0.0.

Adapter Support

Currently this package supports the Postgres, Snowflake, and DuckDB adapters



Imputation

categorical_impute

(source)

This macro returns imputed categorical data for a column in a model, source, or CTE

Args:

  • column (required): Name of the field that is to be imputed
  • measure (optional): The measure by which to impute the data. It is set to use the 'mode' by default
  • source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from

Usage:

{{ dbt_ml_inline_preprocessing.categorical_impute(
    column='user_type',
    measure='mode',
    source_relation=ref('my_model'),
   )
}}
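
Conceptually, mode imputation fills nulls with the most frequent observed value. A rough inline SQL sketch of the idea (illustrative names, not the macro's actual compiled SQL):

select
    coalesce(
        user_type,
        (select user_type
         from my_model
         where user_type is not null
         group by user_type
         order by count(*) desc
         limit 1)  -- the mode of the column
    ) as user_type_imputed
from my_model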

numerical_impute

(source)

This macro returns imputed numerical data for a column in a model, source, or CTE

Args:

  • column (required): Name of the field that is to be imputed
  • measure (optional): The measure by which to impute the data. It is set to use the 'mean' by default, but also supports 'median' and 'percentile'
  • percentile (optional): If percentile is selected for the measure, this indicates the percentile value to impute into null values
  • source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from

Usage:

{{ dbt_ml_inline_preprocessing.numerical_impute(
    column='purchase_value',
    measure='mean',
    percentile=0.25,
    source_relation=ref('my_model'),
   )
}}
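
For the default 'mean' measure, the effect is equivalent to filling nulls with the column average. An illustrative sketch (not the macro's actual compiled SQL):

select
    coalesce(
        purchase_value,
        avg(purchase_value) over ()  -- window avg ignores nulls
    ) as purchase_value_imputed
from my_model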

random_impute

(source)

This macro returns randomly imputed data for a column in a model or source based on values that already exist within the column

NOTE: This method assumes that at least one value has been observed in the input column

Args:

  • column (required): Name of the field that is to be imputed
  • source_relation (required): a Relation (a ref or source) that contains the list of columns you wish to select from
  • data_type (required): The data type of this column. Either numerical or categorical
  • consider_distribution (optional): Boolean of whether or not the distribution of existing values should be taken into account. Macro will run faster when this is set to false. Default is false

Usage:

{{ dbt_ml_inline_preprocessing.random_impute(
    column='user_type',
    source_relation=ref('my_model'),
    data_type='categorical',
    consider_distribution=true
   )
}}
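
As a loose illustration of the idea only: the sketch below draws a single random observed value for all nulls and ignores consider_distribution, so it is simpler than what the macro does per row.

select
    coalesce(
        user_type,
        (select user_type
         from my_model
         where user_type is not null
         order by random()
         limit 1)  -- one randomly chosen observed value
    ) as user_type_imputed
from my_model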

Encoding

label_encode

(source)

This macro returns the labels encoded with individual integers from 0 to n-1. Note that this has the side effect of re-ordering the dataset.

Args:

  • column (required): Name of the field that is to be encoded

Usage:

{{ dbt_ml_inline_preprocessing.label_encode(
    column='user_type',
   )
}}
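
The effect is comparable to ranking the distinct values and shifting to a zero base. An illustrative sketch (not the macro's actual compiled SQL):

select
    dense_rank() over (order by user_type) - 1 as user_type_encoded
from my_model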

one_hot_encode

(source)

This macro returns one hot encoded fields from a categorical column

NOTE: One hot encoded fields will have the naming convention is_{column_name}_{value}

Args:

  • column (required): Name of the field that is to be one hot encoded
  • source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from

Usage:

{{ dbt_ml_inline_preprocessing.one_hot_encode(
    column='user_type',
    source_relation=ref('my_model'),
   )
}}
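
For a user_type column with (hypothetical) values 'free' and 'premium', the output is comparable to:

select
    case when user_type = 'free' then 1 else 0 end as is_user_type_free,
    case when user_type = 'premium' then 1 else 0 end as is_user_type_premium
from my_model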

rare_category_encode

(source)

This macro encodes rarely occurring categorical values with 'Other', leaving the rest of the categorical column values as is

Args:

  • column (required): Name of the field that is to be rare category encoded
  • cutoff (optional): The percentage value (in decimal) that is to serve as the point where any values occurring with a lesser frequency are rare category encoded. Default is 0.05 (i.e., 5%)

Usage:

{{ dbt_ml_inline_preprocessing.rare_category_encode(
    column='user_type',
    cutoff=0.10,
   )
}}
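
Conceptually, any value whose relative frequency falls below the cutoff is replaced with 'Other'. An illustrative sketch (not the macro's actual compiled SQL):

select
    case
        when count(*) over (partition by user_type) * 1.0 / count(*) over () < 0.10
        then 'Other'
        else user_type
    end as user_type_encoded
from my_model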

Numerical Transformation

exponentiate

(source)

This macro returns the given column after applying an exponential transformation to the numerical data. This is often useful when values are in logged form. By default the base is e (the exponential constant)

Args:

  • column (required): Name of the field that is to be exponentiated
  • base (optional): The base of the exponentiation to apply. Defaults to the exponential constant e

Usage:

{{ dbt_ml_inline_preprocessing.exponentiate(
    column='log_purchase_value',
    base=10,
   )
}}
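
With base 10, the transformation is simply raising the base to the column's value:

select power(10, log_purchase_value) as purchase_value
from my_model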

interact

(source)

This macro creates an interaction term between two numerical columns

Args:

  • column_one (required): Name of the first field in the interaction term
  • column_two (required): Name of the second field in the interaction term
  • interaction (optional): The interaction to apply. Options are "multiplicative", "additive", and "exponential". Default is "multiplicative"

Usage:

{{ dbt_ml_inline_preprocessing.interact(
    column_one='purchase_value',
    column_two='user_rating',
    interaction='multiplicative',
   )
}}
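
A multiplicative interaction is just the product of the two columns:

select purchase_value * user_rating as purchase_value_x_user_rating
from my_model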

k_bins_discretize

(source)

This macro returns the given column after discretizing it into a specified number of bins

Args:

  • column (required): Name of the field that is to be k bins discretized
  • k (required): The number of bins to discretize into
  • strategy (optional): The method by which to discretize the column into bins. Supported options are "quantile" to discretize into equal sized bins and "uniform" to bin into equal width bins. Default is "quantile"

Usage:

{{ dbt_ml_inline_preprocessing.k_bins_discretize(
    column='purchase_value',
    k=5,
    strategy='quantile',
   )
}}
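
The 'quantile' strategy is comparable to an ntile over the column, assigning each row to one of k equal-sized bins. An illustrative sketch (not the macro's actual compiled SQL):

select ntile(5) over (order by purchase_value) as purchase_value_binned
from my_model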

log_transform

(source)

This macro returns the given column after applying a log transformation to the numerical data

Args:

  • column (required): Name of the field that is to be log transformed
  • base (optional): The base of the log function that is transforming the column. Default is 10
  • offset (optional): Value to be added to all values in the column before log transforming. Common use case is when zero values are included in the column. Default is 0

Usage:

{{ dbt_ml_inline_preprocessing.log_transform(
    column='purchase_value',
    base=10,
    offset=1,
   )
}}
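
The transformation computes log base(column + offset). A portable sketch using natural logs (illustrative, not the macro's actual compiled SQL):

select ln(purchase_value + 1) / ln(10) as purchase_value_log
from my_model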

max_absolute_scale

(source)

This macro transforms the given column by dividing each value by the maximum absolute value within the column. This transforms the range of values within the column to be [-1, 1]

Args:

  • column (required): Name of the field that is to be transformed

Usage:

{{ dbt_ml_inline_preprocessing.max_absolute_scale(
    column='user_rating',
   )
}}
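
The transformation divides each value by the column's maximum absolute value. An illustrative sketch:

select user_rating / max(abs(user_rating)) over () as user_rating_scaled
from my_model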

min_max_scale

(source)

This macro transforms the given column to have a specified minimum and maximum, scaling all values to fit that range. This transforms the range of values within the column to [new minimum, new maximum]

Args:

  • column (required): Name of the field that is to be transformed
  • new_min (optional): The new minimum value to scale towards. Default is 0.0
  • new_max (optional): The new maximum value to scale towards. Default is 1.0

Usage:

{{ dbt_ml_inline_preprocessing.min_max_scale(
    column='user_rating',
    new_min=0,
    new_max=10,
   )
}}
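
The underlying formula is new_min + (x - min) / (max - min) * (new_max - new_min). An illustrative sketch for new_min=0, new_max=10 (not the macro's actual compiled SQL):

select
    (user_rating - min(user_rating) over ())
        / (max(user_rating) over () - min(user_rating) over ())
        * (10 - 0) + 0 as user_rating_scaled
from my_model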

numerical_binarize

(source)

This macro transforms the given numerical column into a binary value based on either a specified cutoff value or percentile

Args:

  • column (required): Name of the field that is to be transformed
  • cutoff (optional): The value that serves as the boundary point for the binary variable. For the 'percentile' strategy this should be a value between 0 and 1. Default is 0.5
  • strategy (optional): The method with which to set the boundary point for the binary variable, options are "percentile" and "value". Default is 'percentile'
  • direction (optional): The direction that the 1 value should signify for the binary variable. Options are ">", ">=", "<", and "<=". Default is ">="
  • source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from

Usage:

{{ dbt_ml_inline_preprocessing.numerical_binarize(
    column='input',
    strategy='percentile',
    cutoff=0.2,
    direction='>',
    source_relation=ref('data_numerical_binarize'),
   )
}}
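
With the 'percentile' strategy, values above the given percentile map to 1 and the rest to 0. A Postgres-flavored sketch (illustrative, not the macro's actual compiled SQL):

select
    case
        when input > (select percentile_cont(0.2) within group (order by input)
                      from my_model)  -- the 20th percentile of the column
        then 1 else 0
    end as input_binarized
from my_model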

robust_scale

(source)

This macro transforms the given column into IQR-scaled values to reduce the influence of outlier datapoints

Args:

  • column (required): Name of the field that is to be transformed
  • iqr (optional): The interquantile range to consider and scale by. Expects a number between 0 and 1 exclusive. Default is 0.5, giving an interquantile range stretching from the 25th percentile to the 75th percentile
  • source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from

Usage:

{{ dbt_ml_inline_preprocessing.robust_scale(
    column='user_rating',
    iqr=0.5,
    source_relation=ref('my_model')
   )
}}
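
Robust scaling centers on the median and divides by the interquantile range, here (x - p50) / (p75 - p25). A Postgres-flavored sketch (illustrative, not the macro's actual compiled SQL):

select
    (user_rating - (select percentile_cont(0.50) within group (order by user_rating) from my_model))
    / ((select percentile_cont(0.75) within group (order by user_rating) from my_model)
     - (select percentile_cont(0.25) within group (order by user_rating) from my_model))
    as user_rating_scaled
from my_model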

standardize

(source)

This macro rescales the given column to a target mean and standard deviation. By default it standardizes to a standard normal scale (mean 0, standard deviation 1)

Args:

  • column (required): Name of the field that is to be transformed
  • target_mean (optional): The new mean that the column assumes. Default is 0
  • target_stddev (optional): The new standard deviation that the column assumes. Default is 1

Usage:

{{ dbt_ml_inline_preprocessing.standardize(
    column='user_rating',
    target_mean=0,
    target_stddev=1,
   )
}}
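
The underlying formula is (x - mean) / stddev * target_stddev + target_mean. An illustrative sketch for the defaults (not the macro's actual compiled SQL):

select
    (user_rating - avg(user_rating) over ()) / stddev(user_rating) over ()
        as user_rating_standardized
from my_model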
