Explicitly interpolate only cols 'x' and 'n'; handle categorical and other columns #849

Draft
wants to merge 1 commit into
base: rc2.3_calib

pausz (Contributor) commented Jan 30, 2025

Hi 👋 @daniel-klein

What this PR introduces

This draft PR shows how we have been handling categorical columns (and other columns) that may exist in the dataframes passed to calibration components.

  • The functions that conform the simulated data to the expected time index now iterate over the fixed list ['x', 'n'] rather than over every column available in actual.
  • Once the 'x' and 'n' columns of actual have been conformed, the remaining columns are handled separately.
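
The two steps above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name `conform_linear`, the use of linear interpolation, and the nearest-time rule for carrying over the non-numeric columns are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

NUMERIC_COLS = ["x", "n"]  # the only columns that get interpolated


def conform_linear(expected: pd.DataFrame, actual: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: align `actual` to the time index of `expected`."""
    t_exp = expected.index.to_numpy(dtype=float)
    t_act = actual.index.to_numpy(dtype=float)

    conformed = pd.DataFrame(index=expected.index)

    # Step 1: interpolate only 'x' and 'n' onto the expected time points.
    for col in NUMERIC_COLS:
        conformed[col] = np.interp(t_exp, t_act, actual[col].to_numpy(dtype=float))

    # Step 2 (assumption): categorical and other columns cannot be
    # interpolated, so take each label from the nearest simulated time point.
    nearest = actual.index.get_indexer(expected.index, method="nearest")
    for col in actual.columns:
        if col not in NUMERIC_COLS:
            conformed[col] = actual[col].to_numpy()[nearest]

    return conformed


# Small usage example with made-up numbers.
actual = pd.DataFrame(
    {"x": [0.0, 10.0, 20.0], "n": [100.0, 100.0, 100.0], "age_bin": ["0-4"] * 3},
    index=pd.Index([2000.0, 2001.0, 2002.0], name="t"),
)
expected = pd.DataFrame({"x": [3.0], "n": [90.0]}, index=pd.Index([2000.25], name="t"))
conformed = conform_linear(expected, actual)
```

The nearest-time lookup builds one integer indexer and reuses it for every extra column, so the cost grows with the number of expected time points rather than with the size of actual. That may help with the concern about large dataframes, but it has not been measured here.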

I'm not sure the proposed solution is optimal when the dataframes are large, though.

Description

For calibration pipelines using typhoidsim, the 'extraction' functions that generate dataframes for calibration components include a couple of other columns in addition to the expected 'x' and 'n'.

def extract_reference_data(): ...

        expected_data = pd.DataFrame(data={"n": n,
                                           "x": x,
                                           "age_bin": selected_age_bin,
                                           "year_bin": year_bin.values.flatten()},
                                     index=pd.Index(year_index, name="t"))

In our specific use case, each dataframe has a single row representing a "single time point", which is really a time bin (or year bin). We use the midpoint of the year/time bin as the time index, but it is very useful to also keep a more informative label with the actual boundaries of the bin. That label is what we store in the column 'year_bin'.
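
As a minimal sketch of that layout, the snippet below builds a one-row dataframe whose index is the bin midpoint and whose 'year_bin' column records the bin boundaries. The helper name `make_component_frame`, its arguments, the label format, and the numbers are hypothetical and are not taken from typhoidsim.

```python
import pandas as pd


def make_component_frame(n, x, age_bin, year_lo, year_hi):
    """Hypothetical helper: one-row frame for a single year bin."""
    t_mid = 0.5 * (year_lo + year_hi)  # midpoint of the bin, used as time index
    return pd.DataFrame(
        data={
            "n": [n],
            "x": [x],
            "age_bin": [age_bin],
            "year_bin": [f"{year_lo}-{year_hi}"],  # human-readable bin boundaries
        },
        index=pd.Index([t_mid], name="t"),
    )


# Made-up values, for illustration only.
frame = make_component_frame(n=200, x=12, age_bin="5-9", year_lo=2010, year_hi=2015)
```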

In addition, we have multiple age bins in our target/reference data, and we use one calibration component per age bin.
Storing this categorical information enables us to:

  • (i) reuse the extraction functions outside the context of a calibration. They are very convenient for extracting a dataframe from a sim in testing scripts.

Screenshot from 2025-01-30 19-09-04

  • (ii) concatenate all the dataframes with simulated and expected data after a calibration, and easily plot something like this (using seaborn functions):

Screenshot from 2025-01-30 19-09-33

Screenshot from 2025-01-30 19-11-12
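
A rough sketch of the concatenate-and-plot workflow in (ii) is shown below. The helper names, the 'source' column, the derived 'x_over_n' column, and the choice of `seaborn.catplot` are assumptions for illustration; the plots in the screenshots may have been produced differently.

```python
import pandas as pd


def tidy_results(frames):
    """Hypothetical helper: stack (label, DataFrame) pairs into one long table."""
    parts = []
    for source, df in frames:
        part = df.reset_index()  # bring the 't' index back as a regular column
        part["source"] = source  # e.g. "simulated" or "expected"
        part["x_over_n"] = part["x"] / part["n"]  # assumed quantity of interest
        parts.append(part)
    return pd.concat(parts, ignore_index=True)


def plot_by_age_bin(tidy):
    """Point plot of x/n per age bin, one panel per year bin."""
    import seaborn as sns  # imported lazily so tidy_results works without seaborn

    return sns.catplot(
        data=tidy, x="age_bin", y="x_over_n", hue="source", col="year_bin", kind="point"
    )


# Made-up one-row frames, one per age bin and data source.
def _row(n, x, age_bin):
    return pd.DataFrame(
        {"n": [n], "x": [x], "age_bin": [age_bin], "year_bin": ["2010-2015"]},
        index=pd.Index([2012.5], name="t"),
    )


tidy = tidy_results(
    [
        ("expected", _row(200, 12, "0-4")),
        ("simulated", _row(180, 9, "0-4")),
        ("expected", _row(150, 6, "5-9")),
        ("simulated", _row(160, 8, "5-9")),
    ]
)
```

Because 'age_bin' and 'year_bin' travel with each dataframe, the stacked table needs no extra bookkeeping before it is handed to the plotting function.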

Checklist

  • Code commented & docstrings added
  • New tests were needed and have been added
  • A new version number was needed & changelog has been updated
  • A new PyPI version needs to be released
