Explicitly interpolate only cols 'x' and 'n'; handle categorical and other columns #849

pausz · 2025-01-30T20:59:07Z

What this PR introduces

This draft PR shows how we have been handling categorical columns (and other columns) that may exist in the dataframes passed to calibration components.

The functions that conform the simulated data to the expected time index iterate over the explicit list ['x', 'n'] rather than over all the columns available in actual.
Once the 'x' and 'n' columns of actual have been conformed, we handle the other columns.

I'm not sure my proposed solution is optimal for large dataframes, though.
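To make the idea concrete, here is a minimal sketch of the approach described above. It is not the code in this PR: the helper name `conform_x_n`, the use of `np.interp`, and the nearest-time rule for the non-numeric columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def conform_x_n(expected, actual, numeric_cols=("x", "n")):
    """Conform `actual` to the time index of `expected` (hypothetical helper).

    Only `numeric_cols` are interpolated; every other column of `actual`
    is copied from the nearest simulated time point.
    """
    t_expected = expected.index.to_numpy(dtype=float)
    t_actual = actual.index.to_numpy(dtype=float)  # assumed sorted ascending

    conformed = pd.DataFrame(index=expected.index)
    for col in numeric_cols:
        values = actual[col].to_numpy(dtype=float)
        conformed[col] = np.interp(t_expected, t_actual, values)

    # Assumption: non-numeric columns take the value at the nearest actual time.
    nearest = np.abs(t_actual[None, :] - t_expected[:, None]).argmin(axis=1)
    for col in actual.columns:
        if col not in numeric_cols:
            conformed[col] = actual[col].to_numpy()[nearest]
    return conformed


# Illustrative data only.
actual = pd.DataFrame(
    {"x": [0.0, 10.0, 20.0], "n": [100.0, 100.0, 100.0], "age_bin": ["0-5"] * 3},
    index=pd.Index([2000.0, 2001.0, 2002.0], name="t"),
)
expected = pd.DataFrame({"x": [4.0], "n": [100.0]}, index=pd.Index([2000.5], name="t"))
conformed = conform_x_n(expected, actual)
```

Note that the nearest-time lookup in this sketch builds a `len(expected)` by `len(actual)` array, which is one place where large dataframes could become costly.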

Description

For calibration pipelines using typhoidsim, our 'extraction' functions, which generate the dataframes for calibration components, include a couple of other columns in addition to the expected 'x' and 'n'.

def extract_reference_data():
    ...
    expected_data = pd.DataFrame(data={"n": n,
                                       "x": x,
                                       "age_bin": selected_age_bin,
                                       "year_bin": year_bin.values.flatten()},
                                 index=pd.Index(year_index, name="t"))

For our specific use case, each dataframe has a single row representing a "single time point", which is really a time bin (or year bin). As the time index we use the halfway point of the year/time bin, but it's really useful to have a more informative label with the actual boundaries of the bin. That's what we store in the column 'year_bin'.
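As an illustration of that layout, the following sketch builds one such single-row dataframe. The bin boundaries, the counts, the age bin, and the "start-end" label format are all invented for the example; only the column names and the index name "t" come from the snippet above.

```python
import pandas as pd

# Hypothetical bin boundaries and values, for illustration only.
year_start, year_end = 2010, 2015
year_index = [(year_start + year_end) / 2]  # halfway point of the bin
year_bin = f"{year_start}-{year_end}"       # label format is an assumption

expected_data = pd.DataFrame(
    data={"n": [1000], "x": [42], "age_bin": ["5-9"], "year_bin": [year_bin]},
    index=pd.Index(year_index, name="t"),
)
```

The numeric index keeps the row usable for interpolation against simulated time, while 'year_bin' carries the human-readable boundaries.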

In addition, we have multiple age bins in our target/reference data, and we use one calibration component per age bin.
Storing this categorical information enables us to:

(i) reuse the extraction functions outside the context of a calibration. They are very convenient for extracting a dataframe from a sim in testing scripts.

(ii) concatenate all the dataframes with simulated and expected data after a calibration, and easily plot them using seaborn functions.
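A rough sketch of the concatenation step in (ii), assuming one single-row dataframe per age bin for both expected and simulated data. The helper `make_frame`, the `source` and `proportion` columns, and all values are hypothetical.

```python
import pandas as pd


def make_frame(x, n, age_bin, year_bin, t):
    """Build a single-row dataframe in the layout described above (hypothetical helper)."""
    return pd.DataFrame(
        {"x": [x], "n": [n], "age_bin": [age_bin], "year_bin": [year_bin]},
        index=pd.Index([t], name="t"),
    )


# Illustrative values only: one frame per age bin (i.e. per calibration component).
expected_frames = [
    make_frame(40, 1000, "0-4", "2010-2015", 2012.5),
    make_frame(25, 1000, "5-9", "2010-2015", 2012.5),
]
actual_frames = [
    make_frame(38, 1000, "0-4", "2010-2015", 2012.5),
    make_frame(29, 1000, "5-9", "2010-2015", 2012.5),
]

# Tag each group with its origin, then stack everything into one long table.
combined = pd.concat(
    [
        pd.concat(expected_frames).assign(source="expected"),
        pd.concat(actual_frames).assign(source="simulated"),
    ]
).reset_index()
combined["proportion"] = combined["x"] / combined["n"]
```

A long-format table like this can then be passed to a seaborn function, for example `seaborn.barplot(data=combined, x="age_bin", y="proportion", hue="source")`.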

Checklist

Code commented & docstrings added
New tests were needed and have been added
A new version number was needed & changelog has been updated
A new PyPI version needs to be released

Commit 0d4e998: Explicitly interpolate only cols 'x' and 'n'; handle categorical and other columns.
