The scripts included in this repository can be used to audit the quality of a dataset across seven dimensions. The current focus is on data generated by IoT sensors, specifically in the context of assessing various aspects of smart cities, such as AQM, ITMS, and Bangalore Ambulance Data.
A PDF report is generated as the output, along with a JSON file. These reports contain the results of the evaluation procedure described below.
For each of the dimensions used to quantify the quality of a dataset, the tool provides a metric score between 0 and 1, where 1 is the highest possible score. Currently, the tool is able to assess seven parameters, namely:
- Regularity of Inter-Arrival Time
- Outlier Presence in Inter-Arrival Time
- Sensor Uptime
- Absence of Duplicate Values
- Adherence to Attribute Format
- Absence of Unknown Attributes
- Adherence to Mandatory Attributes
Note that each dataset has a linked JSON Schema that defines the attributes that may be present in the dataset, which attributes (if any) are required, and the units and datatypes these attributes must have. Additionally, each dataset must have a config file associated with it in order to generate the output reports. The config files are included in this repository.
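For orientation, a schema along these lines might look like the minimal example below. The attribute names other than `id` and `observationDateTime` are illustrative assumptions, not taken from the bundled schemas; refer to the schemas folder for the real definitions.

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "id": { "type": "string" },
    "observationDateTime": { "type": "string", "format": "date-time" },
    "pm2p5": { "type": "number", "description": "Illustrative attribute: PM2.5 in ug/m3" }
  },
  "required": ["id", "observationDateTime"],
  "additionalProperties": false
}
```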
The regularity metric of the inter-arrival time conveys how uniform this time interval is for a dataset in relation to the expected behaviour. It is quantified by how tightly the distribution of inter-arrival times is spread around its mode, the mode being taken as the expected publication interval.
The outlier metric of the inter-arrival time is defined in terms of the number of data packets received outside the bounds computed by the interquartile range (IQR) method.
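As a rough illustration of both inter-arrival-time metrics, the sketch below derives inter-arrival times from packet timestamps, scores regularity as the fraction of intervals close to the modal interval, and counts outliers using the standard 1.5 × IQR fences. This is a simplified reading of the two definitions above, not the tool's exact formula; the function names and the `tolerance` parameter are assumptions.

```python
import numpy as np
import pandas as pd

def inter_arrival_times(timestamps: pd.Series) -> np.ndarray:
    """Seconds between consecutive packets, in time order."""
    ts = pd.to_datetime(timestamps).sort_values()
    return ts.diff().dropna().dt.total_seconds().to_numpy()

def regularity_score(iat: np.ndarray, tolerance: float = 0.1) -> float:
    """Fraction of intervals within +/- tolerance of the modal interval."""
    mode = pd.Series(iat).round().mode().iloc[0]  # taken as the expected interval
    near_mode = np.abs(iat - mode) <= tolerance * mode
    return float(near_mode.mean())

def iqr_outlier_count(iat: np.ndarray) -> int:
    """Number of intervals outside the 1.5 * IQR Tukey fences."""
    q1, q3 = np.percentile(iat, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((iat < lower) | (iat > upper)).sum())
```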
Sensor uptime is defined as the duration in which the sensor is actively sending data packets at the expected time intervals.
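One plausible reading of this definition, sketched below: a gap counts toward uptime when it does not exceed the expected interval by more than a tolerance factor, and uptime is the covered fraction of the whole observation window. The `expected_interval` and `tolerance` parameters are assumptions, not values taken from the tool; `iat` is the array produced by `inter_arrival_times` in the previous sketch.

```python
import numpy as np

def uptime_ratio(iat: np.ndarray, expected_interval: float,
                 tolerance: float = 0.5) -> float:
    """Fraction of the observation window covered by normal-sized gaps.

    Gaps longer than expected_interval * (1 + tolerance) are treated
    as downtime; everything else counts as the sensor being up.
    """
    limit = expected_interval * (1 + tolerance)
    up = iat[iat <= limit].sum()
    total = iat.sum()
    return float(up / total) if total > 0 else 0.0
```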
This metric checks two user-selected columns of the dataset for duplicate values. A data packet is considered a duplicate if another packet contains exactly the same values in both columns.
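A minimal sketch of this check using pandas, assuming the two user-selected columns are passed in by name (the function name is illustrative):

```python
import pandas as pd

def duplicate_free_score(df: pd.DataFrame, col_a: str, col_b: str) -> float:
    """1.0 when no two packets share the same (col_a, col_b) pair."""
    dup = df.duplicated(subset=[col_a, col_b], keep="first")
    return float(1.0 - dup.sum() / len(df))

# e.g. for the ITMS sample: duplicate_free_score(df, "observationDateTime", "trip_id")
```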
This metric assesses how well the data adheres to the expected format defined in the data schema. It is quantified by taking the ratio of packets adhering to the expected schema to the total number of data packets.
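This ratio can be sketched with the jsonschema package, validating each packet against the dataset's schema; a minimal example under that assumption:

```python
from jsonschema import Draft7Validator

def format_adherence(packets: list[dict], schema: dict) -> float:
    """Ratio of packets that validate against the schema."""
    validator = Draft7Validator(schema)
    valid = sum(1 for p in packets if validator.is_valid(p))
    return valid / len(packets) if packets else 0.0
```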
This metric checks whether the dataset contains any additional attributes apart from those defined in the schema.
This metric checks whether all the required attributes defined in the schema are present in the dataset.
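Both attribute checks reduce to set comparisons between each packet's keys and the schema's declared properties. A minimal sketch, assuming the schema follows the usual JSON Schema layout with `properties` and `required` keys:

```python
def unknown_attribute_score(packets: list[dict], schema: dict) -> float:
    """Fraction of packets with no attributes outside the schema."""
    known = set(schema.get("properties", {}))
    clean = sum(1 for p in packets if set(p) <= known)
    return clean / len(packets) if packets else 0.0

def required_attribute_score(packets: list[dict], schema: dict) -> float:
    """Fraction of packets containing every required attribute."""
    required = set(schema.get("required", []))
    complete = sum(1 for p in packets if required <= set(p))
    return complete / len(packets) if packets else 0.0
```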
Some additional information about the dataset is also provided in the report. More detailed descriptions and evaluation criteria for all these metrics are provided in the output PDF report.
The first step is to ensure that the IUDX SDK is installed on your computer using the following command:
pip install git+https://github.com/datakaveri/iudx-python-sdk
Next, clone this repo:
git clone https://github.com/novoneel-iudx/data-quality-assessment.git
Once inside the directory where the repo was cloned, run:
pip install .
Finally, from the scripts folder, run the following command to install the remaining package and library dependencies:
pip install -r requirements.txt
The config folder contains a config file in JSON format, with the name of the dataset prepended to the filename. In this file, one must enter the name of the datafile and select the attributes to check for duplicates. For the bundled datasets, the name of the datafile and the appropriate attributes are already included in the file, as below:
- observationDateTime
- id for AQM data & trip_id for ITMS data
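For orientation, a config along these lines might look like the example below; the key names here are hypothetical, so refer to the actual files in the config folder for the exact structure.

```json
{
  "fileName": "aqm_pune_sample.json",
  "duplicateCheckAttributes": ["observationDateTime", "id"]
}
```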
Present in the data folder in the repository is a sample dataset of ITMS data from Surat, as well as a sample dataset of AQM data from Pune. Inside the schemas folder are the corresponding schemas for these datasets. In order to generate the report, simply run the following command:
python3 DQReportGenerator.py
and enter the name of the config file when prompted.
Ensure that the datasets are in JSON format and are located in the data folder.
The output report will be generated in both JSON and PDF formats and saved in the outputReports folder. The plots and visualizations required for the PDF report are stored in the plots folder, which is populated as the script runs. These files are overwritten every time the script is run and do not need to be stored locally long-term.