DuckDB MIMIC-III concepts implementation #1529

147 changes: 105 additions & 42 deletions mimic-iii/buildmimic/duckdb/README.md
@@ -1,75 +1,78 @@
# DuckDB
# MIMIC-III in DuckDB

The script in this folder creates the schema for MIMIC-IV and
The scripts in this folder create the schema for MIMIC-III and
loads the data into the appropriate tables for
[DuckDB](https://duckdb.org/).

The Python script (`import_duckdb.py`) also includes the option to
add the [concepts views](../../concepts/README.md) to the database.
This makes it much easier to use the concepts views as you do not
have to install and setup PostgreSQL or use BigQuery.

DuckDB, like SQLite, is serverless and
stores all information in a single file.
Unlike SQLite, an OLTP database,
DuckDB is an OLAP database, and therefore optimized for analytical queries.
This will result in faster queries for researchers using MIMIC-IV
This will result in faster queries for researchers using MIMIC-III
with DuckDB compared to SQLite.
To learn more, please read their ["why duckdb"](https://duckdb.org/docs/why_duckdb)
page.

The instructions to load MIMIC-III into a DuckDB database
only require:
1. DuckDB to be installed and
2. Your computer to have a POSIX-compliant terminal shell,
which is already found by default on any macOS, Linux, or BSD installation.

To use these instructions on Windows,
you need a Unix command line environment,
which you can obtain by either installing
[Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
or [Cygwin](https://www.cygwin.com/).

## Set-up

### Quick overview

1. [Install](https://duckdb.org/docs/installation/) the CLI version of DuckDB
2. [Download](https://physionet.org/content/mimiciii/1.4/) the MIMIC-III files
3. Create DuckDB database and load data
## Download MIMIC-III files

### Install DuckDB
[Download](https://physionet.org/content/mimiciii/1.4/)
the CSV files for MIMIC-III by any method you wish.
(These scripts should also work with the much smaller
[demo version](https://physionet.org/content/mimiciii-demo/1.4/#files-panel)
of the dataset.)

Follow instructions on their website to
[install](https://duckdb.org/docs/installation/)
the CLI version of DuckDB.
The easiest way to download them is to open a terminal then run:

You will need to place the `duckdb` binary in a folder on your environment path,
e.g. `/usr/local/bin`.
```
wget -r -N -c -np -nH --cut-dirs=1 --user YOURUSERNAME --ask-password https://physionet.org/files/mimiciii/1.4/
```

### Download MIMIC-III files
Replace `YOURUSERNAME` with your physionet username.

[Download](https://physionet.org/content/mimiciii/1.4/)
the CSV files for MIMIC-III by any method you wish.
This will make your `mimic_data_dir` be `mimiciii/1.4`.

The instructions assume the CSV files are arranged in the following folder structure:
The rest of these instructions assume the CSV files are arranged in the following folder structure:

```
mimic_data_dir
mimic_data_dir/
ADMISSIONS.csv.gz
CALLOUT.csv.gz
...
```

The CSV files can be uncompressed (end in `.csv`) or compressed (end in `.csv.gz`).

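Before running either import method, it can help to confirm that the folder has the layout shown above. The following is a minimal Python sketch, not part of this repository: the helper name `find_mimic_tables` is hypothetical, and it assumes only that table files end in `.csv` or `.csv.gz`.

```python
from pathlib import Path


def find_mimic_tables(mimic_data_dir):
    """Return a dict mapping table name -> path for each CSV file found.

    Hypothetical helper for illustration; it only inspects file names.
    """
    tables = {}
    for path in sorted(Path(mimic_data_dir).iterdir()):
        name = path.name
        lowered = name.lower()
        if lowered.endswith(".csv.gz"):
            table = name[: -len(".csv.gz")]
        elif lowered.endswith(".csv"):
            table = name[: -len(".csv")]
        else:
            continue  # ignore anything that is not a CSV data file
        # If a table appears in both forms, keep the first one seen.
        tables.setdefault(table, path)
    return tables
```

Calling `find_mimic_tables("mimiciii/1.4")` and checking that names such as `ADMISSIONS` and `CALLOUT` are present is a quick way to catch an incomplete download before a long import.
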
The easiest way to download them is to open a terminal then run:

```
wget -r -N -c -np -nH --cut-dirs=1 --user YOURUSERNAME --ask-password https://physionet.org/files/mimiciii/1.4/
```
## Shell script method (`import_duckdb.sh`)

Replace `YOURUSERNAME` with your physionet username.
Using this script to load MIMIC-III into a DuckDB database
requires only:
1. DuckDB to be installed (the `duckdb` executable must be in your PATH), and
2. Your computer to have a POSIX-compliant terminal shell,
which is already found by default on any macOS, Linux, or BSD installation.

This will make your `mimic_data_dir` be `mimiciii/1.4`.
To use these instructions on Windows,
you need a Unix command line environment,
which you can obtain by either installing
[Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10)
or [Cygwin](https://www.cygwin.com/).

### Install DuckDB

Follow instructions on their website to
[install](https://duckdb.org/docs/installation/)
the CLI version of DuckDB.

You will need to place the `duckdb` binary in a folder on your environment path,
e.g. `/usr/local/bin`.

# Create DuckDB database and load data

The last step requires creating a DuckDB database and
loading the data into it.
### Create DuckDB database and load data

You can do all of this with one shell script, `import_duckdb.sh`,
located in this repository.
@@ -102,6 +105,66 @@ The script will print out progress as it goes.
Be patient, this can take minutes to hours to load
depending on your computer's configuration.

## Python script method (`import_duckdb.py`)

This method does not require the DuckDB executable. It only requires the DuckDB Python
module and the [SQLGlot](https://github.com/tobymao/sqlglot) Python module, both of which can be
installed with `pip`.

### Install dependencies

Install the dependencies by using the included `requirements.txt` file:

```sh
python3 -m pip install -r ./requirements.txt
```

### Create DuckDB database and load data

Create the MIMIC-III database with `import_duckdb.py` like so:

```sh
python ./import_duckdb.py /path/to/mimic_data_dir ./mimic3.db
```

Here, `/path/to/mimic_data_dir` is the directory containing the `.csv` or `.csv.gz`
data files downloaded above.

This command will create the `mimic3.db` file in the current directory. Be aware that
for the full MIMIC-III v1.4 dataset the resulting file will be about 34GB in size.
This process will take some time, as with the shell script version.

The default options will create only the tables and load the data, and assume
that you are running the script from the same directory where this README.md
is located. See the full options below if the defaults are insufficient.

### Create the concepts views

In most cases you will want to create the concepts views at the same time as
the database. To do this, add the `--make-concepts` option:

```sh
python ./import_duckdb.py /path/to/mimic_data_dir ./mimic3.db --make-concepts
```

If you want to add the concepts to a database already created without this
option (or created with the shell script version), you can add the
`--skip-tables` option as well:

```sh
python ./import_duckdb.py /path/to/mimic_data_dir ./mimic3.db --make-concepts --skip-tables
```

### Additional options

There are a few additional options for special situations:

| Option | Description
| - | -
| `--skip-indexes` | Don't create additional indexes when creating tables and loading data. This may be useful in memory-constrained systems or to save a little time.
| `--mimic-code-root [path]` | Location of the mimic-code repository files, which is needed to find the concepts SQL files. This is useful if you are running the script from a directory other than the one where this README.md file is located (the default is `../../../`).
| `--schema-name [name]` | Puts the tables and concepts views into a named schema in the database. This is mainly useful to mirror the behavior of the PostgreSQL version of the database, which places objects in a schema named `mimiciii` by default. If you have existing code designed for the PostgreSQL version, this may make migration easier. Note that, like the PostgreSQL version, the `ccs_dx` view is *not* placed in the specified schema, but in the default schema (which is `main` in DuckDB, not `public` as in PostgreSQL).

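If you drive the import from another Python program, the options above can be assembled into a command line. This is a sketch under assumptions: `build_import_command` is a hypothetical helper, not something the script provides, and it uses only the flags documented in this README.

```python
import sys


def build_import_command(data_dir, db_path, make_concepts=False,
                         skip_tables=False, skip_indexes=False,
                         mimic_code_root=None, schema_name=None):
    """Build the argument list for import_duckdb.py (hypothetical helper)."""
    cmd = [sys.executable, "./import_duckdb.py", str(data_dir), str(db_path)]
    if make_concepts:
        cmd.append("--make-concepts")
    if skip_tables:
        cmd.append("--skip-tables")
    if skip_indexes:
        cmd.append("--skip-indexes")
    if mimic_code_root is not None:
        cmd += ["--mimic-code-root", str(mimic_code_root)]
    if schema_name is not None:
        cmd += ["--schema-name", schema_name]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory containing `import_duckdb.py`.
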
# Help

Please see the [issues page](https://github.com/MIT-LCP/mimic-iii/issues) to discuss other issues you may be having.
20 changes: 20 additions & 0 deletions mimic-iii/buildmimic/duckdb/concepts/icustay_hours.sql
Review comment (Member): For consistency, this file would be better placed in `mimic-iii/concepts_duckdb/icustay_hours.sql`. This seems doable with the Python script.

@@ -0,0 +1,20 @@
WITH all_hours AS (
  SELECT
    it.icustay_id, /* ceiling the intime to the nearest hour by adding 59 minutes then truncating */
    DATE_TRUNC('hour', it.intime_hr + INTERVAL '59' minute) AS endtime, /* create integers for each charttime in hours from admission */
    /* so 0 is admission time, 1 is one hour after admission, etc, up to ICU disch */
    GENERATE_SERIES(
      -24,
      CAST(CEIL(EXTRACT(EPOCH FROM it.outtime_hr - it.intime_hr) / 60.0 / 60.0) AS INT)
    ) AS hr
  FROM icustay_times AS it
)
SELECT
  ah.icustay_id,
  unnest(ah.hr) as hr,
  /* endtime now indexes the end time of every hour for each patient */
  unnest(list_transform(hr, ahr -> ah.endtime + ahr*INTERVAL 1 hour)) AS endtime
  --ah.endtime+ hr*INTERVAL 1 hour as endtime
FROM all_hours AS ah
ORDER BY
  ah.icustay_id NULLS LAST
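
To make the hour-series logic in this query easier to check, here is a minimal pure-Python sketch of what it computes for a single ICU stay. It is an illustration under assumptions, not code from this PR: the function name `icustay_hours` and the sample timestamps are hypothetical, and it mirrors the SQL expressions step by step rather than calling DuckDB.

```python
import math
from datetime import datetime, timedelta


def icustay_hours(intime_hr, outtime_hr):
    """Return (hr, endtime) pairs for one stay, mirroring the SQL above."""
    # DATE_TRUNC('hour', intime_hr + INTERVAL '59' minute)
    shifted = intime_hr + timedelta(minutes=59)
    base_endtime = shifted.replace(minute=0, second=0, microsecond=0)
    # CEIL(EXTRACT(EPOCH FROM outtime_hr - intime_hr) / 60.0 / 60.0)
    last_hr = math.ceil((outtime_hr - intime_hr).total_seconds() / 3600.0)
    # GENERATE_SERIES(-24, last_hr) includes both endpoints
    return [(hr, base_endtime + timedelta(hours=hr))
            for hr in range(-24, last_hr + 1)]


rows = icustay_hours(datetime(2101, 10, 20, 19, 8),
                     datetime(2101, 10, 21, 1, 30))
print(len(rows))  # 32 rows: hr runs from -24 through 7
print(rows[24])   # hr 0 ends at 2101-10-20 20:00
```

Each tuple corresponds to one output row of the query for that `icustay_id`: `hr` counts hours relative to the rounded ICU admission time, starting 24 hours before it, and `endtime` is the end of that hour.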