Data Quality Assessment Framework Documentation

1. Project Overview

The Data Quality Assessment Framework evaluates the quality, readiness, and completeness of both structured (CSV, Parquet, JSON) and unstructured (PDF, images, audio) datasets. It generates detailed JSON reports and a PDF summary, with scores based on metrics such as missingness, variance, format consistency, and documentation presence.

2. Installation & Setup

Prerequisites

Python 3.8+
AWS Credentials (if running in Lambda/S3 mode)
OpenAI API Key (for role inference)

Installation

Clone the repository.
Install dependencies:
```
pip install -r requirements.txt
```

Configuration

Create a .env file in the root directory with the following variables:

# Required for Column/Role Inference
OPENAI_API_KEY=your_openai_api_key

# Required for CAT API Updates (if used)
ELASTIC_ID=your_elastic_id
ELASTIC_PASS=your_elastic_password

# Required for AWS Lambda / S3 Usage
S3_BUCKET_NAME=your_input_bucket_name
S3_REPORTS_BUCKET_NAME=your_output_bucket_name
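
As a minimal sketch of how these variables might be checked at startup, the snippet below groups them by feature and reports which ones are unset. The grouping and the `missing_vars` helper are assumptions made for illustration and are not part of the framework. It also assumes the `.env` values have already been loaded into the process environment.

```python
import os

# Variable names come from the configuration above; the grouping by
# feature and this helper are illustrative assumptions.
REQUIRED_VARS = {
    "role_inference": ["OPENAI_API_KEY"],
    "cat_api": ["ELASTIC_ID", "ELASTIC_PASS"],
    "aws": ["S3_BUCKET_NAME", "S3_REPORTS_BUCKET_NAME"],
}


def missing_vars(features, env=None):
    """Return the names of unset or empty variables for the given features."""
    env = os.environ if env is None else env
    return [
        name
        for feature in features
        for name in REQUIRED_VARS[feature]
        if not env.get(name)
    ]


if __name__ == "__main__":
    missing = missing_vars(["role_inference"])
    if missing:
        print("Missing environment variables:", ", ".join(missing))
```

Checking only the features in use keeps a local run from failing on AWS or CAT API variables it does not need.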

3. Usage

Local Execution

To run the assessment locally, run local_lambda_tester.py. Make sure the folder_key path in the example event points to the dataset folder in your local directory, and that your .env file contains your OpenAI API key.

# Example event (simulate API Gateway event)
event = {
    "body": '{"folder_key": "path/to/local/dataset/folder/"}'
}
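
The event body is a JSON string nested inside a Python dict, which is easy to misquote by hand. One way to avoid that, sketched here with a hypothetical `make_event` helper that is not part of the repository, is to build the body with `json.dumps`:

```python
import json


def make_event(folder_key):
    """Build an API Gateway-style event whose body is a JSON string."""
    return {"body": json.dumps({"folder_key": folder_key})}


event = make_event("path/to/local/dataset/folder/")
print(event["body"])
```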

For Structured Data:

python structured_main.py

The script prompts for a directory path; alternatively, modify main to accept arguments.

For Unstructured Data:

python unstructured_main.py

AWS Lambda Execution

lambda_handler.py is designed to be deployed on AWS Lambda. It runs in response to events (e.g., S3 uploads) or manual invocations.

Event Payload:

{
  "folder_key": "path/to/dataset/folder/"
}
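
The payload above is a plain JSON object, while the local example wraps the same object as a JSON string under "body". A handler that accepts both shapes could look roughly like the sketch below. The `extract_folder_key` name and the error handling are assumptions, and the actual lambda_handler.py may do this differently.

```python
import json


def extract_folder_key(event):
    """Return folder_key from a direct payload or an API Gateway-style event."""
    body = event.get("body")
    if body is not None:
        # API Gateway delivers the body as a JSON string.
        payload = json.loads(body) if isinstance(body, str) else body
    else:
        payload = event
    folder_key = payload.get("folder_key")
    if not folder_key:
        raise ValueError("event is missing 'folder_key'")
    return folder_key
```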

4. Module Descriptions

Core Modules

structured_main.py: Entry point for structured data. Orchestrates loading, inference, metric calculation, and reporting.
unstructured_main.py: Entry point for unstructured data. Handles metadata extraction, then orchestrates role inference, metric calculation, and reporting.
lambda_handler.py: AWS Lambda wrapper. Handles S3 downloads, selects the appropriate main module based on file types, and uploads reports back to S3.
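
To illustrate how a main module might be selected from file types, here is a minimal sketch that classifies object keys by extension. CSV, Parquet, and JSON come from the overview; the specific image and audio extensions, the majority rule, and the `choose_pipeline` name are all assumptions, not the handler's confirmed behavior.

```python
from pathlib import PurePosixPath

# CSV, Parquet, and JSON are named in the overview; the specific image and
# audio extensions below are illustrative assumptions.
STRUCTURED_EXTS = {".csv", ".parquet", ".json"}
UNSTRUCTURED_EXTS = {".pdf", ".png", ".jpg", ".jpeg", ".wav", ".mp3"}


def choose_pipeline(keys):
    """Pick 'structured' or 'unstructured' by majority of recognized files."""
    exts = [PurePosixPath(k).suffix.lower() for k in keys]
    structured = sum(e in STRUCTURED_EXTS for e in exts)
    unstructured = sum(e in UNSTRUCTURED_EXTS for e in exts)
    if structured == 0 and unstructured == 0:
        raise ValueError("no supported files found")
    return "structured" if structured >= unstructured else "unstructured"
```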

Report Modules (`report/`)

input_handler.py: Loads data from directories (supports CSV, Parquet, JSON).
aggregate_structured.py: Runs all structured metrics and compiles the raw report.
aggregate_unstructured.py: Runs all unstructured metrics and compiles the raw report.
scoring_structured.py / scoring_unstructured.py: Computes the final weighted scores and percentages.
json_writer.py: Saves the raw and final reports to JSON.
pdf_writer.py: Generates a visual PDF report from the JSON data.
post_to_cat_api.py: Updates an external API (CAT) with the readiness score.
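
The scoring modules are described as computing weighted scores and percentages. A weighted average is one common way to do that, sketched below. The metric names, weights, and the `weighted_score` function are hypothetical and do not reflect the framework's actual weighting.

```python
def weighted_score(metric_scores, weights):
    """Combine per-metric scores in [0, 1] into a percentage using weights."""
    total_weight = sum(weights[name] for name in metric_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(score * weights[name] for name, score in metric_scores.items())
    return round(100 * weighted / total_weight, 2)


# Hypothetical metric names, scores, and weights, for illustration only.
scores = {"missingness": 0.9, "duplicates": 1.0, "documentation": 0.5}
weights = {"missingness": 2.0, "duplicates": 1.0, "documentation": 1.0}
print(weighted_score(scores, weights))
```

Normalizing by the total weight means the weights do not need to sum to 1.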

Metrics Modules

Structured Metrics (`structured_metrics/`)

quality.py: Checks for missing values (rows/cols) and duplicates.
variance_correctness.py: Analyzes numeric variance and categorical distribution.
standardization.py: Checks file formats and date/timestamp consistency.
relevance_completeness.py: Checks for region coverage.
documentation.py: Checks for the presence of data dictionaries/readmes.
llm_api.py: Uses OpenAI to infer the semantic roles of columns (e.g., "this is a date", "this is a region").
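
As a dependency-free sketch of the kind of check quality.py is described as performing, the function below counts missing cells and duplicate rows over a list of dicts. The real module may operate on DataFrames and define "missing" differently. The `missing_and_duplicates` name and the rule that None and empty strings count as missing are assumptions.

```python
def missing_and_duplicates(rows):
    """Return (fraction of missing cells, number of duplicate rows).

    `rows` is a list of dicts; None and empty strings count as missing.
    """
    columns = sorted({col for row in rows for col in row})
    total_cells = len(rows) * len(columns)
    missing = sum(
        1 for row in rows for col in columns if row.get(col) in (None, "")
    )
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    missing_fraction = missing / total_cells if total_cells else 0.0
    return missing_fraction, duplicates
```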

Unstructured Metrics (`unstructured_metrics/`)

metadata_parser.py: Extracts metadata from files.
file_format_check.py: Validates file extensions.
file_openability.py: Tests if files can be opened/read.
file_duplicates.py: Checks for duplicate files.
consistency.py: Checks if all files in a dataset are of the same type.
llm_api.py: Infers roles from metadata.
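
Two of the checks above, duplicate detection and type consistency, can be sketched with the standard library alone. Comparing SHA-256 digests of file contents and comparing extensions are assumptions about how such checks could work; the function names are hypothetical and the actual modules may use other criteria.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(paths):
    """Group files with identical content, keyed by SHA-256 digest."""
    by_digest = defaultdict(list)
    for path in paths:
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        by_digest[digest].append(str(path))
    return [group for group in by_digest.values() if len(group) > 1]


def is_consistent(paths):
    """True when every file shares the same (case-insensitive) extension."""
    return len({Path(p).suffix.lower() for p in paths}) <= 1
```

Hashing content rather than comparing names catches copies that were renamed.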

5. Output Artifacts

For each dataset, the tool generates:

*_raw_readiness_report.json: Detailed metric results.
*_final_readiness_report.json: Scored and summarized report.
data_readiness_report.pdf: A user-friendly PDF summary.

Data Flow Diagrams

1. High-Level Overview

This diagram shows the general flow from entry points to final outputs.

graph TD
    subgraph Input
        Lambda[Lambda Handler]
        Local[Local Execution]
    end

    subgraph Processing
        S_Main[Structured Main]
        U_Main[Unstructured Main]
    end

    subgraph Output
        JSON[JSON Reports]
        PDF[PDF Report]
        API[CAT API Update]
    end

    Lambda -->|Structured| S_Main
    Lambda -->|Unstructured| U_Main
    Local -->|Structured| S_Main
    Local -->|Unstructured| U_Main

    S_Main --> JSON
    S_Main --> PDF
    S_Main --> API

    U_Main --> JSON
    U_Main --> PDF
    U_Main --> API

2. Structured Data Flow Detail

Detailed flow within the Structured Data processing module.

graph TD
    Start([Start: structured_main.py]) --> Load[Load Data: input_handler]
    Load --> Infer[Infer Column Roles: llm_api]
    Infer --> Raw[Generate Raw Report: aggregate_structured]
    
    subgraph Metrics Calculation
        Raw --> M1[Check Missing/Duplicates]
        Raw --> M2[Check Variance/Coverage]
        Raw --> M3[Check Formats/Encoding]
    end

    M1 --> Score[Compute Aggregate Score: scoring_structured]
    M2 --> Score
    M3 --> Score
    
    Score --> Write[Write Outputs: json_writer]
    Write --> PDF[Generate PDF: pdf_writer]
    PDF --> End([End])
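
The stage order in the diagram above can be expressed as a short orchestration sketch. Each stage is passed in as a callable, so the sketch makes no claim about the real function names or signatures in structured_main.py. The `run_structured_pipeline` name and its parameters are hypothetical.

```python
def run_structured_pipeline(folder, load, infer_roles, build_raw, score, write):
    """Run the stages in the order shown in the diagram above.

    Each stage is passed in as a callable, so this sketch makes no claim
    about the real function names or signatures inside the framework.
    """
    data = load(folder)
    roles = infer_roles(data)
    raw_report = build_raw(data, roles)
    final_report = score(raw_report)
    write(raw_report, final_report)
    return final_report
```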

3. Unstructured Data Flow Detail

Detailed flow within the Unstructured Data processing module.

graph TD
    Start([Start: unstructured_main.py]) --> Meta[Extract Metadata: metadata_parser]
    Meta --> Infer[Infer Roles: llm_api]
    Infer --> Raw[Generate Raw Report: aggregate_unstructured]

    subgraph Metrics Calculation
        Raw --> M1[Check File Duplicates/Types]
        Raw --> M2[Check Openability/Format]
        Raw --> M3[Check Metadata Coverage]
    end

    M1 --> Score[Compute Aggregate Score: scoring_unstructured]
    M2 --> Score
    M3 --> Score

    Score --> Write[Write Outputs: json_writer]
    Write --> PDF[Generate PDF: pdf_writer]
    PDF --> End([End])
