dbt-ml-inline-preprocessing

This dbt package implements a number of machine learning preprocessing techniques inline in SQL.

This dbt package contains macros that can be (re)used across dbt projects.

Note: All methods in this package are meant to be used inline within a select statement.
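
For example, a macro call can be dropped directly into a model's select list (the model and column names here are illustrative):

select
    user_id,
    {{ dbt_ml_inline_preprocessing.standardize(column='user_rating') }} as user_rating_std
from {{ ref('my_model') }}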

Currently supported and tested databases include:

  • Postgres
  • Snowflake
  • DuckDB
Method                Postgres   Snowflake   DuckDB
categorical_impute    ✅         ✅          ✅
numerical_impute      ✅         ✅          ✅
random_impute         ✅         ✅          ✅
label_encode          ✅         ✅          ✅
one_hot_encode        ✅         ✅          ✅
rare_category_encode  ✅         ✅          ✅
exponentiate          ✅         ✅          ✅
interact              ✅         ✅          ✅
k_bins_discretize     ✅         ✅          ✅
log_transform         ✅         ✅          ✅
max_absolute_scale    ✅         ✅          ✅
min_max_scale         ✅         ✅          ✅
numerical_binarize    ✅         ✅          ✅
robust_scale          ✅         ✅          ✅
standardize           ✅         ✅          ✅

Installation Instructions

To import this package into your dbt project, add the following to either the packages.yml or dependencies.yml file:

packages:
  - package: "Matts52/dbt_ml_inline_preprocessing"
    version: [">=0.2.0"]

and run a dbt deps command to install the package to your project.

Check the dbt docs for more information on installing packages.

dbt Versioning

This package currently supports dbt versions 1.1.0 through 2.0.0.

Adapter Support

Currently this package supports the Postgres, Snowflake, and DuckDB adapters



Imputation

categorical_impute

(source)

This macro returns imputed categorical data for a column in a model, source, or CTE

Args:

  • column (required): Name of the field that is to be imputed
  • measure (optional): The measure by which to impute the data. It is set to use the 'mode' by default
  • source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from

Usage:

{{ dbt_ml_inline_preprocessing.categorical_impute(
    column='user_type',
    measure='mode',
    source_relation=ref('my_model'),
   )
}}
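
Conceptually, mode imputation fills nulls with the most frequent observed value. A rough inline SQL sketch of the idea (illustrative names, not the macro's actual compiled SQL):

select
    coalesce(
        user_type,
        (select user_type
         from my_model
         where user_type is not null
         group by user_type
         order by count(*) desc
         limit 1)  -- the mode of the column
    ) as user_type_imputed
from my_model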

numerical_impute

(source)

This macro returns imputed numerical data for a column in a model, source, or CTE

Args:

  • column (required): Name of the field that is to be imputed
  • measure (optional): The measure by which to impute the data. It is set to use the 'mean' by default, but also supports 'median' and 'percentile'
  • percentile (optional): If percentile is selected for the measure, this indicates the percentile value to impute into null values
  • source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from

Usage:

{{ dbt_ml_inline_preprocessing.numerical_impute(
    column='purchase_value',
    measure='mean',
    percentile=0.25,
    source_relation=ref('my_model'),
   )
}}
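
For the default 'mean' measure, the effect is equivalent to filling nulls with the column average. An illustrative sketch (not the macro's actual compiled SQL):

select
    coalesce(
        purchase_value,
        avg(purchase_value) over ()  -- window avg ignores nulls
    ) as purchase_value_imputed
from my_model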

random_impute

(source)

This macro returns randomly imputed data for a column in a model or source based on values that already exist within the column

NOTE: This method assumes that at least one value has been observed in the input column

Args:

  • column (required): Name of the field that is to be imputed
  • source_relation (required): a Relation (a ref or source) that contains the list of columns you wish to select from
  • data_type (required): The data type of this column. Either numerical or categorical
  • consider_distribution (optional): Boolean of whether or not the distribution of existing values should be taken into account. Macro will run faster when this is set to false. Default is false

Usage:

{{ dbt_ml_inline_preprocessing.random_impute(
    column='user_type',
    source_relation=ref('my_model'),
    data_type='categorical',
    consider_distribution=true
   )
}}
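
As a loose illustration of the idea only: the sketch below draws a single random observed value for all nulls and ignores consider_distribution, so it is simpler than what the macro does per row.

select
    coalesce(
        user_type,
        (select user_type
         from my_model
         where user_type is not null
         order by random()
         limit 1)  -- one randomly chosen observed value
    ) as user_type_imputed
from my_model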

Encoding

label_encode

(source)

This macro returns the labels encoded with individual integers from 0 to n-1. Note that this has the side effect of re-ordering the dataset.

Args:

  • column (required): Name of the field that is to be encoded

Usage:

{{ dbt_ml_inline_preprocessing.label_encode(
    column='user_type',
   )
}}
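
The effect is comparable to ranking the distinct values and shifting to a zero base. An illustrative sketch (not the macro's actual compiled SQL):

select
    dense_rank() over (order by user_type) - 1 as user_type_encoded
from my_model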

one_hot_encode

(source)

This macro returns one hot encoded fields from a categorical column

NOTE: One hot encoded fields will have the naming convention is_{column_name}_{value}

Args:

  • column (required): Name of the field that is to be one hot encoded
  • source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from

Usage:

{{ dbt_ml_inline_preprocessing.one_hot_encode(
    column='user_type',
    source_relation=ref('my_model'),
   )
}}
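
For a user_type column with (hypothetical) values 'free' and 'premium', the output is comparable to:

select
    case when user_type = 'free' then 1 else 0 end as is_user_type_free,
    case when user_type = 'premium' then 1 else 0 end as is_user_type_premium
from my_model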

rare_category_encode

(source)

This macro encodes rarely occurring categorical values with 'Other', leaving the rest of the categorical column values as is

Args:

  • column (required): Name of the field that is to be rare category encoded
  • cutoff (optional): The percentage value (in decimal) that is to serve as the point where any values occurring with a lesser frequency are rare category encoded. Default is 0.05 (i.e., 5%)

Usage:

{{ dbt_ml_inline_preprocessing.rare_category_encode(
    column='user_type',
    cutoff=0.10,
   )
}}
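
Conceptually, any value whose relative frequency falls below the cutoff is replaced with 'Other'. An illustrative sketch (not the macro's actual compiled SQL):

select
    case
        when count(*) over (partition by user_type) * 1.0 / count(*) over () < 0.10
        then 'Other'
        else user_type
    end as user_type_encoded
from my_model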

Numerical Transformation

exponentiate

(source)

This macro returns the given column after applying an exponential transformation to the numerical data. This is often useful when values are in logged form. By default the base is e (the exponential constant)

Args:

  • column (required): Name of the field that is to be exponentiated
  • base (optional): The base of the exponentiation to apply. Defaults to the exponential constant e

Usage:

{{ dbt_ml_inline_preprocessing.exponentiate(
    column='log_purchase_value',
    base=10,
   )
}}
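
With base 10, the transformation is simply raising the base to the column's value:

select power(10, log_purchase_value) as purchase_value
from my_model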

interact

(source)

This macro creates an interaction term between two numerical columns

Args:

  • column_one (required): Name of the first field in the interaction term
  • column_two (required): Name of the second field in the interaction term
  • interaction (optional): The interaction to apply. Options are "multiplicative", "additive", and "exponential". Default is "multiplicative"

Usage:

{{ dbt_ml_inline_preprocessing.interact(
    column_one='purchase_value',
    column_two='user_rating',
    interaction='multiplicative',
   )
}}
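
A multiplicative interaction is just the product of the two columns:

select purchase_value * user_rating as purchase_value_x_user_rating
from my_model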

k_bins_discretize

(source)

This macro returns the given column after discretizing it into a specified number of bins

Args:

  • column (required): Name of the field that is to be k bins discretized
  • k (required): The number of bins to discretize into
  • strategy (optional): The method by which to discretize the column into bins. Supported options are "quantile" to discretize into equal sized bins and "uniform" to bin into equal width bins. Default is "quantile"

Usage:

{{ dbt_ml_inline_preprocessing.k_bins_discretize(
    column='purchase_value',
    k=5,
    strategy='quantile',
   )
}}
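
The 'quantile' strategy is comparable to an ntile over the column, assigning each row to one of k equal-sized bins. An illustrative sketch (not the macro's actual compiled SQL):

select ntile(5) over (order by purchase_value) as purchase_value_binned
from my_model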

log_transform

(source)

This macro returns the given column after applying a log transformation to the numerical data

Args:

  • column (required): Name of the field that is to be log transformed
  • base (optional): The base of the log function that is transforming the column. Default is 10
  • offset (optional): Value to be added to all values in the column before log transforming. Common use case is when zero values are included in the column. Default is 0

Usage:

{{ dbt_ml_inline_preprocessing.log_transform(
    column='purchase_value',
    base=10,
    offset=1,
   )
}}
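
The transformation computes log base(column + offset). A portable sketch using natural logs (illustrative, not the macro's actual compiled SQL):

select ln(purchase_value + 1) / ln(10) as purchase_value_log
from my_model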

max_absolute_scale

(source)

This macro transforms the given column by dividing each value by the maximum absolute value within the column. This transforms the range of values within the column to be [-1, 1]

Args:

  • column (required): Name of the field that is to be transformed

Usage:

{{ dbt_ml_inline_preprocessing.max_absolute_scale(
    column='user_rating',
   )
}}
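
The transformation divides each value by the column's maximum absolute value. An illustrative sketch:

select user_rating / max(abs(user_rating)) over () as user_rating_scaled
from my_model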

min_max_scale

(source)

This macro transforms the given column to have a specified minimum and maximum, scaling all values to fit that range. This transforms the range of values within the column to [new minimum, new maximum]

Args:

  • column (required): Name of the field that is to be transformed
  • new_min (optional): The new minimum value to scale towards. Default is 0.0
  • new_max (optional): The new maximum value to scale towards. Default is 1.0

Usage:

{{ dbt_ml_inline_preprocessing.min_max_scale(
    column='user_rating',
    new_min=0,
    new_max=10,
   )
}}
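
The underlying formula is new_min + (x - min) / (max - min) * (new_max - new_min). An illustrative sketch for new_min=0, new_max=10 (not the macro's actual compiled SQL):

select
    (user_rating - min(user_rating) over ())
        / (max(user_rating) over () - min(user_rating) over ())
        * (10 - 0) + 0 as user_rating_scaled
from my_model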

numerical_binarize

(source)

This macro transforms the given numerical column into a binary value based on either a specified cutoff value or percentile

Args:

  • column (required): Name of the field that is to be transformed
  • cutoff (optional): The value that serves as the boundary point for the binary variable. For the 'percentile' strategy this should be a value between 0 and 1. Default is 0.5
  • strategy (optional): The method with which to set the boundary point for the binary variable, options are "percentile" and "value". Default is 'percentile'
  • direction (optional): The direction that the 1 value should signify for the binary variable. Options are ">", ">=", "<", and "<=". Default is ">="
  • source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from

Usage:

{{ dbt_ml_inline_preprocessing.numerical_binarize(
    column='input',
    strategy='percentile',
    cutoff=0.2,
    direction='>',
    source_relation=ref('data_numerical_binarize'),
   )
}}
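
With the 'percentile' strategy, values above the given percentile map to 1 and the rest to 0. A Postgres-flavored sketch (illustrative, not the macro's actual compiled SQL):

select
    case
        when input > (select percentile_cont(0.2) within group (order by input)
                      from my_model)  -- the 20th percentile of the column
        then 1 else 0
    end as input_binarized
from my_model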

robust_scale

(source)

This macro transforms the given column into IQR-scaled values to reduce the influence of outlier datapoints

Args:

  • column (required): Name of the field that is to be transformed
  • iqr (optional): The interquantile range to consider and scale by. Expects a number between 0 and 1 exclusive. Default is 0.5, giving an interquantile range stretching from the 25th percentile to the 75th percentile
  • source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from

Usage:

{{ dbt_ml_inline_preprocessing.robust_scale(
    column='user_rating',
    iqr=0.5,
    source_relation=ref('my_model')
   )
}}
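
Robust scaling centers on the median and divides by the interquantile range, here (x - p50) / (p75 - p25). A Postgres-flavored sketch (illustrative, not the macro's actual compiled SQL):

select
    (user_rating - (select percentile_cont(0.50) within group (order by user_rating) from my_model))
    / ((select percentile_cont(0.75) within group (order by user_rating) from my_model)
     - (select percentile_cont(0.25) within group (order by user_rating) from my_model))
    as user_rating_scaled
from my_model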

standardize

(source)

This macro rescales the given column to a target mean and standard deviation. By default it standardizes to a standard normal scale (mean 0, standard deviation 1)

Args:

  • column (required): Name of the field that is to be transformed
  • target_mean (optional): The new mean that the column assumes. Default is 0
  • target_stddev (optional): The new standard deviation that the column assumes. Default is 1

Usage:

{{ dbt_ml_inline_preprocessing.standardize(
    column='user_rating',
    target_mean=0,
    target_stddev=1,
   )
}}
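
The underlying formula is (x - mean) / stddev * target_stddev + target_mean. An illustrative sketch for the defaults (not the macro's actual compiled SQL):

select
    (user_rating - avg(user_rating) over ()) / stddev(user_rating) over ()
        as user_rating_standardized
from my_model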
