This dbt package contains macros that can be (re)used across dbt projects.
Note: All methods in this package are meant to be used inline within a select statement.
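For example, a macro can be dropped straight into a model's select list (the model and column names here are illustrative):

```sql
select
    user_id,
    {{ dbt_ml_inline_preprocessing.max_absolute_scale('user_rating') }} as scaled_user_rating
from {{ ref('my_model') }}
```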
Currently supported and tested databases include:
- Postgres
- Snowflake
- DuckDB
Method | Postgres | Snowflake | DuckDB |
---|---|---|---|
categorical_impute | ✅ | ✅ | ✅ |
numerical_impute | ✅ | ✅ | ✅ |
random_impute | ✅ | ✅ | ✅ |
label_encode | ✅ | ✅ | ✅ |
one_hot_encode | ✅ | ✅ | ✅ |
rare_category_encode | ✅ | ✅ | ✅ |
exponentiate | ✅ | ✅ | ✅ |
interact | ✅ | ✅ | ✅ |
k_bins_discretize | ✅ | ✅ | ✅ |
log_transform | ✅ | ✅ | ✅ |
max_absolute_scale | ✅ | ✅ | ✅ |
min_max_scale | ✅ | ✅ | ✅ |
numerical_binarize | ✅ | ✅ | ✅ |
robust_scale | ✅ | ✅ | ✅ |
standardize | ✅ | ✅ | ✅ |
To import this package into your dbt project, add the following to either the packages.yml or dependencies.yml file:

packages:
  - package: "Matts52/dbt_ml_inline_preprocessing"
    version: [">=0.2.0"]

Then run dbt deps to install the package into your project. See read the docs for more information on installing packages.
This package currently supports dbt versions 1.1.0 through 2.0.0
Currently this package supports the Postgres, Snowflake, and DuckDB adapters
categorical_impute (source)
This macro returns imputed categorical data for a column in a model, source, or CTE
Args:
- column (required): Name of the field that is to be imputed
- measure (optional): The measure by which to impute the data. It is set to use the 'mode' by default
- source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from
Usage:
{{ dbt_ml_inline_preprocessing.categorical_impute(
column='user_type',
measure='mode',
source_relation=ref('my_model'),
)
}}
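Conceptually, mode imputation behaves like the following (a Postgres-style sketch; the SQL the macro actually generates varies by adapter):

```sql
coalesce(
    user_type,
    (select mode() within group (order by user_type) from my_model)
)
```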
numerical_impute (source)
This macro returns imputed numerical data for a column in a model, source, or CTE
Args:
- column (required): Name of the field that is to be imputed
- measure (optional): The measure by which to impute the data. It is set to use the 'mean' by default, but also supports 'median' and 'percentile'
- percentile (optional): If 'percentile' is selected for the measure, this indicates the percentile value to impute into null values
- source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from
Usage:
{{ dbt_ml_inline_preprocessing.numerical_impute(
column='purchase_value',
measure='mean',
percentile=0.25,
source_relation=ref('my_model'),
)
}}
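Conceptually, mean imputation behaves like this sketch (generated SQL varies by adapter):

```sql
coalesce(
    purchase_value,
    (select avg(purchase_value) from my_model)
)
```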
random_impute (source)
This macro returns randomly imputed data for a column in a model or source based on values that already exist within the column
NOTE: This method assumes that at least one value has been observed in the input column
Args:
- column (required): Name of the field that is to be imputed
- source_relation (required): a Relation (a ref or source) that contains the list of columns you wish to select from
- data_type (required): The data type of this column. Either 'numerical' or 'categorical'
- consider_distribution (optional): Boolean of whether or not the distribution of existing values should be taken into account. The macro will run faster when this is set to false. Default is false
Usage:
{{ dbt_ml_inline_preprocessing.random_impute(
column='user_type',
source_relation=ref('my_model'),
data_type='categorical',
consider_distribution=true
)
}}
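As a rough illustration, ignoring the distribution option, random imputation resembles the sketch below; note this simplified form fills every null with a single sampled value, while the macro's actual adapter-specific SQL may sample differently:

```sql
coalesce(
    user_type,
    (select user_type
     from my_model
     where user_type is not null
     order by random()
     limit 1)
)
```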
label_encode (source)
This macro returns the labels encoded with individual integers from 0 to n-1. Note that this has the side effect of re-ordering the dataset.
Args:
- column (required): Name of the field that is to be encoded
Usage:
{{ dbt_ml_inline_preprocessing.label_encode(
column='user_type',
)
}}
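Conceptually, the encoding is similar to a dense rank over the column's values, shifted to start at 0 (a sketch, not necessarily the macro's exact SQL):

```sql
dense_rank() over (order by user_type) - 1
```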
one_hot_encode (source)
This macro returns one hot encoded fields from a categorical column
NOTE: One hot encoded fields will have the naming convention is_{column_name}_{value}
Args:
- column (required): Name of the field that is to be one hot encoded
- source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from
Usage:
{{ dbt_ml_inline_preprocessing.one_hot_encode(
source_relation=ref('my_model'),
column='purchase_value',
)
}}
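For a hypothetical user_type column with values 'admin' and 'guest', the generated fields behave like:

```sql
case when user_type = 'admin' then 1 else 0 end as is_user_type_admin,
case when user_type = 'guest' then 1 else 0 end as is_user_type_guest
```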
rare_category_encode (source)
This macro encodes rarely occurring categorical values with 'Other', leaving the rest of the categorical column values as is
Args:
- column (required): Name of the field that is to be rare category encoded
- cutoff (optional): The percentage value (in decimal form) that serves as the point where any values occurring with a lesser frequency are rare category encoded. Default is 0.05 (i.e. 5%)
Usage:
{{ dbt_ml_inline_preprocessing.rare_category_encode(
column='user_type',
cutoff=0.10,
)
}}
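Conceptually, with a 0.10 cutoff the encoding behaves like:

```sql
case
    when count(*) over (partition by user_type) * 1.0
         / count(*) over () < 0.10 then 'Other'
    else user_type
end
```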
exponentiate (source)
This macro returns the given column after applying an exponential transformation to the numerical data. Often this is useful when values are in logged form. By default the base will be e (the exponential constant).
Args:
- column (required): Name of the field that is to be exponentiated
- base (optional): The base of the exponentiation to apply. By default this is 2.71, indicating that the exponential constant e should be used
Usage:
{{ dbt_ml_inline_preprocessing.exponentiate(
column='log_purchase_value',
base=10,
)
}}
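With base 10, the compiled expression is essentially:

```sql
power(10, log_purchase_value)
```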
interact (source)
This macro creates an interaction term between two numerical columns
Args:
- column_one (required): Name of the first field in the interaction term
- column_two (required): Name of the second field in the interaction term
- interaction (optional): The interaction to apply. Options are "multiplicative", "additive", and "exponential". Default is "multiplicative"
Usage:
{{ dbt_ml_inline_preprocessing.interact(
column_one='purchase_value',
column_two='user_rating',
interaction='multiplicative',
)
}}
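With the default multiplicative interaction, the result is essentially a product of the two columns (names follow the hypothetical usage above):

```sql
purchase_value * user_rating
```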
k_bins_discretize (source)
This macro returns the given column after discretizing it into a specified number of bins
Args:
- column (required): Name of the field that is to be k bins discretized
- k (required): The number of bins to discretize into
- strategy (optional): The method by which to discretize the column into bins. Supported options are "quantile" to discretize into equal sized bins and "uniform" to bin into equal width bins. Default is "quantile"
Usage:
{{ dbt_ml_inline_preprocessing.k_bins_discretize(
column='purchase_value',
k=5,
strategy='quantile',
)
}}
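With the quantile strategy, the result is conceptually an ntile over the column (a sketch; the macro's bin numbering may differ):

```sql
ntile(5) over (order by purchase_value)
```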
log_transform (source)
This macro returns the given column after applying a log transformation to the numerical data
Args:
- column (required): Name of the field that is to be log transformed
- base (optional): The base of the log function that is transforming the column. Default is 10
- offset (optional): Value to be added to all values in the column before log transforming. A common use case is when zero values are included in the column. Default is 0
Usage:
{{ dbt_ml_inline_preprocessing.log_transform(
column='purchase_value',
base=10,
offset=1,
)
}}
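With base 10 and an offset of 1, the transformation is conceptually (two-argument log shown in Postgres style; syntax varies by adapter):

```sql
log(10, purchase_value + 1)
```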
max_absolute_scale (source)
This macro transforms the given column by dividing each value by the maximum absolute value within the column. This transforms the range of values within the column to be [-1, 1]
Args:
- column (required): Name of the field that is to be transformed
Usage:
{{ dbt_ml_inline_preprocessing.max_absolute_scale(
column='user_rating',
)
}}
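Conceptually:

```sql
user_rating / max(abs(user_rating)) over ()
```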
min_max_scale (source)
This macro transforms the given column to have a specified minimum and maximum, scaling all values to fit that range. This transforms the range of values within the column to be [new minimum, new maximum]
Args:
- column (required): Name of the field that is to be transformed
- new_min (optional): The new minimum value to scale towards. Default is 0.0
- new_max (optional): The new maximum value to scale towards. Default is 1.0
Usage:
{{ dbt_ml_inline_preprocessing.min_max_scale(
column='user_rating',
new_min=0,
new_max=10,
)
}}
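With new_min=0 and new_max=10, the scaling is conceptually:

```sql
(user_rating - min(user_rating) over ())
    / (max(user_rating) over () - min(user_rating) over ())
    * (10 - 0) + 0
```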
numerical_binarize (source)
This macro transforms the given numerical column into a binary value based on either a specified cutoff value or percentile
Args:
- column (required): Name of the field that is to be transformed
- cutoff (required): The value that serves as the boundary point for the binary variable. This should be a value between 0 and 1 for percentile cutoffs. Default is 0.5
- strategy (optional): The method with which to set the boundary point for the binary variable. Options are "percentile" and "value". Default is "percentile"
- direction (optional): The direction that the 1 value should signify for the binary variable. Options are ">", ">=", "<", and "<=". Default is ">="
- source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from
Usage:
{{ dbt_ml_inline_preprocessing.numerical_binarize(
'input',
strategy='percentile',
cutoff=0.2,
direction='>',
source_relation=ref('data_numerical_binarize')
)
}}
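With the percentile strategy and the '>' direction, the result is conceptually (Postgres-style sketch; generated SQL varies by adapter):

```sql
case
    when input > (select percentile_cont(0.2) within group (order by input)
                  from data_numerical_binarize)
    then 1
    else 0
end
```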
robust_scale (source)
This macro transforms the given column into IQR scaled values to more effectively deal with outlier data points
Args:
- column (required): Name of the field that is to be transformed
- iqr (optional): The interquantile range to consider and scale by. Expects a number between 0 and 1 exclusive. Default is 0.5, leading to an interquantile range stretching from the 25th percentile to the 75th percentile
- source_relation (required for some databases): a Relation (a ref or source) that contains the list of columns you wish to select from
Usage:
{{ dbt_ml_inline_preprocessing.robust_scale(
column='user_rating',
iqr=0.5,
source_relation=ref('my_model')
)
}}
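Conceptually, with iqr=0.5 each value is centered on the median and scaled by the 25th-to-75th percentile range (a Postgres-style sketch, not the macro's exact SQL):

```sql
(user_rating
    - (select percentile_cont(0.50) within group (order by user_rating) from my_model))
/ ((select percentile_cont(0.75) within group (order by user_rating) from my_model)
    - (select percentile_cont(0.25) within group (order by user_rating) from my_model))
```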
standardize (source)
This macro standardizes the given column, scaling it to a target mean and standard deviation. Transforms to a standard normal distribution (mean 0, standard deviation 1) by default
Args:
- column (required): Name of the field that is to be transformed
- target_mean (optional): The new mean that the column assumes. Default is 0
- target_stddev (optional): The new standard deviation that the column assumes. Default is 1
Usage:
{{ dbt_ml_inline_preprocessing.standardize(
column='user_rating',
target_mean=0,
target_stddev=1,
)
}}
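With the default targets (mean 0, standard deviation 1), the transformation is conceptually:

```sql
(user_rating - avg(user_rating) over ())
    / stddev(user_rating) over ()
-- for other targets, multiply by target_stddev and then add target_mean
```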