QCEW Data Processing Tool

This tool is part of a collaboration between the University of Puerto Rico, Mayaguez, and Puerto Rico's Planning Board. Its main objective is to convert raw QCEW data into a DuckDB database with a geographic (geom) column that stores the coordinates of businesses.

Overview

The script reads raw Quarterly Census of Employment and Wages (QCEW) data from the data/raw directory, processes it, and stores it in a DuckDB database. Organize the raw data in one subfolder per year (e.g., data/raw/2002, data/raw/2003), with each year folder containing the files for that year's quarters. The resulting database is written to the data directory with a .ddb extension.
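
As an illustration of the expected folder layout, the sketch below walks year-named subfolders and yields every file found inside them. The function name `list_raw_files` is hypothetical and is not taken from the tool's source; the actual script may discover files differently.

```python
from pathlib import Path


def list_raw_files(raw_dir="data/raw"):
    """Yield (year, path) pairs for files inside year-named subfolders.

    Hypothetical helper: folders whose names are not all digits are skipped.
    """
    for year_dir in sorted(Path(raw_dir).iterdir()):
        if year_dir.is_dir() and year_dir.name.isdigit():
            for path in sorted(year_dir.iterdir()):
                if path.is_file():
                    yield int(year_dir.name), path
```

Calling `list(list_raw_files())` on a populated data/raw directory returns the files in year order, which is one simple way to process quarters chronologically.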

This tool also incorporates geospatial data, using latitude and longitude values to create a point geometry (geom) for each business, allowing for geographic queries.
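
To make the coordinate-to-geometry step concrete, here is a minimal sketch that turns a longitude/latitude pair into a WKT point string. The function name `to_wkt_point` and the range checks are assumptions for illustration; the tool itself may build the geom column in another way, for example inside DuckDB.

```python
from typing import Optional


def to_wkt_point(longitude: Optional[float], latitude: Optional[float]) -> Optional[str]:
    """Return a WKT point, or None if a coordinate is missing or out of range."""
    if longitude is None or latitude is None:
        return None
    if not (-180.0 <= longitude <= 180.0 and -90.0 <= latitude <= 90.0):
        return None
    # WKT uses x y order, so longitude comes first, then latitude.
    return f"POINT ({longitude} {latitude})"
```

Note the axis order: swapping latitude and longitude is a common source of misplaced points, so it is worth checking a few known locations after loading.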

Requirements

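To run this tool, you will need the following third-party Python packages:

ibis (published on PyPI as ibis-framework)
duckdb
polars
geopandas
tqdm
requests
pandas

The tool also uses the json and logging modules, which ship with the Python standard library and need no installation.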

You can install the necessary dependencies using:

pip install -r requirements.txt

Alternatively, use uv to install the versions pinned in uv.lock:

uv sync

File Structure

The data should be organized in the following structure:

data/
├── raw/
│   ├── 2002/
│   ├── 2003/
│   └── ...
├── processed/
├── external/
│   └── decode.json
└── data.ddb

data/raw/: This directory contains the raw QCEW data, organized by year and quarter.
data/processed/: This directory is for storing processed data.
data/external/: This directory contains external files, including decode.json, which is required for decoding the raw data files.
data.ddb: The output DuckDB database containing the processed data.
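
The layout above can be prepared ahead of time with a few lines of Python. This is only a sketch: the helper name `ensure_data_dirs` is hypothetical, and the script's own initialization already creates these folders when they are missing.

```python
from pathlib import Path
from typing import List


def ensure_data_dirs(base: str = "data") -> List[Path]:
    """Create the raw, processed, and external folders under `base` if needed."""
    created = []
    for name in ("raw", "processed", "external"):
        path = Path(base) / name
        # exist_ok=True makes the call safe to repeat on an existing tree.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```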

How It Works

Initialization: The script checks for the necessary directories (raw, processed, external) and creates any that do not exist. It also downloads external files such as decode.json if they are not already present; decode.json holds the Census codification of the data.
Data Processing:
- The tool reads the raw data files, cleans them, and extracts the relevant fields using the column widths defined in decode.json.
- The cleaned data includes geographic coordinates (latitude and longitude), which are then transformed into a geom column of type Point.
- This processed data is inserted into a DuckDB database.
Group and Aggregate Data:
- The data is grouped by NAICS code (4-digit), year, and quarter, aggregating information such as total wages and total employment.
- Additional calculations are performed for contributions to Social Security, Medicare, and other funds.
Joining with External Data:
- The tool also allows for joining the QCEW data with external data (e.g., hactable) based on NAICS codes, facilitating further analysis.
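
The fixed-width extraction described under Data Processing can be sketched as follows. The layout shown here (field names mapped to widths) is an assumption made for illustration; the real decode.json schema and field names may differ, and `parse_fixed_width` is a hypothetical helper rather than a function from the tool.

```python
import json


def parse_fixed_width(line: str, widths: dict) -> dict:
    """Slice one fixed-width line into named, whitespace-trimmed fields."""
    record = {}
    start = 0
    for name, width in widths.items():
        record[name] = line[start:start + width].strip()
        start += width
    return record


# Hypothetical layout; the real decode.json schema may differ.
layout = json.loads('{"year": 4, "quarter": 1, "naics_code": 6, "total_wages": 10}')
row = parse_fixed_width("20021541110    125000", layout)
```

Here `row` maps each field name to its text value, so numeric fields still need to be cast before aggregation.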

Key Functions

make_qcew_dataset: Processes all the raw QCEW data and inserts it into the DuckDB database.
clean_txt: Cleans and formats the raw text data, extracting relevant fields and generating geographic information.
group_by_naics_code: Groups data by NAICS code and aggregates the total wages and employment.
unique_naics_code: Joins the grouped QCEW data with external data based on the NAICS code.
pull_file: Downloads external files from a given URL (e.g., decode.json).
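
To show the kind of aggregation group_by_naics_code performs, the sketch below groups records by 4-digit NAICS code, year, and quarter in plain Python. It is a stand-in, not the tool's implementation: the function name `group_by_naics4` and the record keys (`naics_code`, `total_wages`, `employment`) are assumptions.

```python
from collections import defaultdict


def group_by_naics4(records):
    """Sum wages and employment per (4-digit NAICS, year, quarter)."""
    totals = defaultdict(lambda: {"total_wages": 0.0, "total_employment": 0})
    for rec in records:
        # Truncate the NAICS code to its first four digits.
        key = (rec["naics_code"][:4], rec["year"], rec["quarter"])
        totals[key]["total_wages"] += rec["total_wages"]
        totals[key]["total_employment"] += rec["employment"]
    return dict(totals)
```

Truncating to four digits merges establishments such as 541110 and 541120 into the same 5411 industry group.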

Usage

Organize your raw QCEW data by year and quarter in the data/raw/ folder.
Ensure that decode.json is in the data/external/ folder (the script downloads it during initialization if it is missing).
Run the script to process the data:

python main.py

Logging

The script logs key events and warnings to a file called data_process.log. This includes information about successfully processed files, warnings for empty files, and other runtime details.
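
A file logger of this kind can be set up with the standard logging module. The sketch below is an assumption about a reasonable configuration, not the tool's actual setup: the logger name "qcew", the message format, and the helper name `build_logger` are all hypothetical.

```python
import logging


def build_logger(log_path: str = "data_process.log") -> logging.Logger:
    """Return a logger that appends INFO-and-above messages to `log_path`."""
    logger = logging.getLogger("qcew")
    logger.setLevel(logging.INFO)
    # Attach the file handler only once, even if this is called repeatedly.
    if not logger.handlers:
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

With this in place, a call such as `build_logger().warning("empty file: ...")` would record a warning for an empty input file alongside the informational messages.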

License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.

Contributing

Contributions to this tool are welcome. Please fork the repository and submit a pull request with any improvements or bug fixes.

If you have any questions or need further assistance, feel free to reach out!
