MultiModal Data Mask: Preserving Essential Context in Data Masking through Automated Synthetic Data Generation
Inspiration
In the era of big data and AI, preserving user privacy while maintaining data utility is a significant challenge. Traditional data masking methods often strip away crucial context—such as race, gender, age, and emotions—that is vital for training accurate and fair AI models. This loss of context can lead to models that are biased or perform poorly, ultimately impacting the end-users who rely on these technologies. We wanted to address this issue by developing a solution that preserves essential context while ensuring compliance with Personally Identifiable Information (PII) regulations.
What It Does
Our solution allows researchers and developers to:
- Upload Multimedia Data: Users can upload datasets (e.g., driver dashcam footage) that contain sensitive information.
- Specify Context to Preserve: Users define which vital contexts (like facial expressions, age, gender, etc.) should be maintained in the dataset.
- Generate Synthetic Data: The system processes the input data to produce a new dataset where sensitive identifiers are anonymized, but the specified contexts are preserved.
- Prepare for Model Training: The resulting dataset is compliant with PII standards and ready for use in training machine learning models without sacrificing essential contextual information.
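The workflow above can be pictured as a single masking request. The field names below are purely illustrative, not the tool's actual API:

```python
# Illustrative request spec for one masking job (hypothetical field names):
mask_request = {
    "input": "dashcam_clip_001.mp4",  # uploaded multimedia data
    # vital contexts the synthetic output must retain
    "preserve": ["age", "gender", "facial_expression", "emotion"],
    # PII that must not survive into the output
    "anonymize": ["face_identity", "voice_print", "license_plates"],
}
```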
How We Built It
We developed a pipeline that integrates multiple state-of-the-art AI models and frameworks to achieve our goal:
1. Data Collection and Preprocessing
- Curated Dataset: We collected driver dashcam footage from YouTube, trimming it into 15-30 minute clips to create our test dataset.
- BabitMF Framework: The clips were processed using the BabitMF (BMF) framework, which lets users specify the key contexts to preserve.
2. Dense Video Captioning with PLLaVA
- Extension to Video Data: We utilized PLLaVA (Pooling LLaVA), which extends image-language pre-training models to video data for dense captioning.
- Pooling Strategy: PLLaVA employs a pooling strategy that smooths feature distributions along the temporal dimension, reducing the dominance of certain tokens and improving performance.
- Context Extraction: It provides detailed descriptions of each scene, capturing attributes like age, facial expressions, gender, and more.
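The pooling idea can be illustrated with plain average pooling over the temporal axis. This is a simplified NumPy sketch of the concept, not PLLaVA's actual implementation:

```python
import numpy as np

def temporal_pool(frame_features: np.ndarray, target_frames: int) -> np.ndarray:
    """Average-pool per-frame features down to a fixed temporal length.

    frame_features: (T, D) array of per-frame visual features.
    Returns a (target_frames, D) array whose rows are means over
    contiguous temporal windows, smoothing the feature distribution
    so no single frame's tokens dominate.
    """
    T, _ = frame_features.shape
    # Split the T frames into target_frames near-equal windows.
    bounds = np.linspace(0, T, target_frames + 1).astype(int)
    return np.stack([
        frame_features[bounds[i]:bounds[i + 1]].mean(axis=0)
        for i in range(target_frames)
    ])

# 16 frames of 4-dim features pooled down to 4 temporal slots
feats = np.arange(64, dtype=float).reshape(16, 4)
pooled = temporal_pool(feats, 4)
print(pooled.shape)  # (4, 4)
```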
3. Audio Transcription with Gemini
- Speech Recognition: We used Gemini to transcribe audio from the videos, ensuring that auditory context is also preserved.
- Integration: The transcriptions were combined with the visual captions to create a comprehensive understanding of each scene.
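Combining the two streams amounts to aligning transcript lines with caption time windows. A minimal sketch of that merge step (the data structures are illustrative):

```python
def merge_scene_context(captions, transcripts):
    """Attach each transcript line to the caption whose time window contains it.

    captions:    list of (start_s, end_s, caption_text) from the captioner
    transcripts: list of (time_s, spoken_text) from the transcriber
    Returns one record per scene combining visual and auditory context.
    """
    scenes = []
    for start, end, caption in captions:
        speech = [text for t, text in transcripts if start <= t < end]
        scenes.append({"start": start, "end": end,
                       "caption": caption, "speech": speech})
    return scenes

captions = [(0, 10, "Middle-aged driver, calm expression, hands on wheel"),
            (10, 20, "Driver glances left, appears surprised")]
transcripts = [(3.2, "Traffic is light today."), (12.5, "Whoa, watch out!")]
scenes = merge_scene_context(captions, transcripts)
```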
4. Lifelike Portrait Generation with CosmicMan
- Text-to-Image Generation: Descriptions from PLLaVA were converted into prompts for CosmicMan, a model specialized in generating high-fidelity human images.
- Annotate Anyone Paradigm: CosmicMan uses a new data production approach called "Annotate Anyone", enabling the creation of high-quality data with accurate annotations.
- Daring Training Framework: Introduces Daring (Decomposed-Attention-Refocusing) to effectively model the relationship between dense text descriptions and image pixels.
- Preserving Context: Generates lifelike portraits that retain the specified vital contexts like race, gender, age, and emotions.
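Prompt construction for this step boils down to forwarding only the user-selected attributes into the text prompt. A minimal sketch (function and field names are hypothetical):

```python
def build_portrait_prompt(attrs, preserve):
    """Turn captioner-extracted attributes into a dense text-to-image
    prompt, keeping only the contexts the user chose to preserve.
    Identity-revealing details are deliberately never forwarded."""
    kept = [attrs[k] for k in preserve if k in attrs]
    return "photorealistic portrait, " + ", ".join(kept)

attrs = {"age": "middle-aged", "gender": "man", "emotion": "surprised",
         "name_on_badge": "J. Smith"}  # PII: never part of the prompt
prompt = build_portrait_prompt(attrs, ["age", "gender", "emotion"])
print(prompt)  # photorealistic portrait, middle-aged, man, surprised
```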
5. Video Animation with EchoMimic
- Audio-Driven Animation: The portraits and audio transcriptions were input into EchoMimic, which creates lifelike animations driven by audio.
- Editable Landmark Conditioning: Utilizes editable landmarks to ensure the animated face aligns with the desired expressions and mouth movements.
- Model Architecture:
- Motion Module: Captures motion dynamics from the audio.
- Denoising UNet & Reference UNet: Enhance image quality and ensure consistency.
- Face Locator: Precisely aligns facial features for realistic animations.
- Anonymization: Generates videos that maintain the same audio and expressions but with different (anonymized) faces.
6. Final Output
- Synthetic Dataset: The synthesized videos are compiled into a dataset that is ready for model fine-tuning.
- Compliance and Utility: The dataset meets PII compliance standards while preserving essential contexts, enabling the training of accurate and unbiased models.
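The six steps above chain into a single data flow, which can be sketched as follows. The `*_fn` callables are stand-ins for PLLaVA, Gemini, CosmicMan, and EchoMimic respectively; this shows the flow of context through the pipeline, not the real model interfaces:

```python
def mask_video(clip_path, preserve, caption_fn, transcribe_fn,
               portrait_fn, animate_fn):
    """End-to-end sketch of the masking pipeline."""
    context = caption_fn(clip_path, preserve)   # PLLaVA: attributes to keep
    transcript = transcribe_fn(clip_path)       # Gemini: auditory context
    portrait = portrait_fn(context)             # CosmicMan: anonymized face
    return animate_fn(portrait, transcript)     # EchoMimic: synthetic clip

# Toy stand-ins just to show the data flow:
synthetic = mask_video(
    "clip_001.mp4", ["age", "emotion"],
    caption_fn=lambda path, keep: {"age": "young", "emotion": "calm"},
    transcribe_fn=lambda path: "Traffic is light today.",
    portrait_fn=lambda ctx: f"portrait({ctx['age']}, {ctx['emotion']})",
    animate_fn=lambda face, audio: f"{face} animated to '{audio}'",
)
print(synthetic)
```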
Challenges We Ran Into
- Integration Complexity: Combining multiple advanced AI models with different input and output formats required careful handling and customization.
- Data Privacy vs. Context Preservation: Balancing the anonymization of PII with the preservation of vital context was a significant challenge.
- Resource Intensiveness: Processing high-resolution video data and generating synthetic content is computationally intensive.
Accomplishments That We're Proud Of
- Successful Integration: Managed to integrate PLLaVA, CosmicMan, and EchoMimic into a functional pipeline.
- Context Preservation: Found a solution that maintains essential contextual information without compromising user privacy.
- Scalable Solution: Developed a method that can be extended to other data types and industries.
What We Learned
- Importance of Context: Preserving vital context in data is crucial for training effective AI models.
- Model Compatibility: Gained insights into adapting and integrating different AI models to work together.
- Data Ethics: Deepened our understanding of data privacy regulations and ethical considerations in AI.
What's Next for MultiModal Data Mask
- Expand to Other Domains: Apply the solution to other areas like healthcare, finance, and surveillance where context preservation is critical.
- Performance Optimization: Improve the efficiency of the pipeline to handle larger datasets and reduce processing time.
- User Interface Development: Create a user-friendly interface to make the tool accessible to non-technical users.
- Feedback Incorporation: Gather user feedback to refine features and address additional challenges.
Technical Details
- Programming Language: Python
Frameworks and Libraries:
- PLLaVA: Adapted for video dense captioning using a pooling strategy to extend image-language models to video data.
- CosmicMan: Utilized for generating high-fidelity human images with precise text-image alignment via the Daring training framework.
- EchoMimic: Employed for audio-driven portrait animations using editable landmark conditioning.
- Gemini: Used for accurate audio transcription.
- BabitMF Framework: Facilitated data preprocessing and user input for context preservation.
Inspiration from GitHub Projects
- PLLaVA: Learned from the implementation of parameter-free extensions from images to videos, achieving state-of-the-art performance in video captioning tasks.
- CosmicMan: Adopted techniques from their new data production paradigm and attention refocusing methods to generate high-fidelity images.
- EchoMimic: Integrated concepts of lifelike audio-driven animations and editable landmark conditioning to produce realistic video outputs.
Technologies Used
In our project, we utilized a range of programming languages, frameworks, platforms, and tools to build an integrated solution that preserves essential context in data masking through automated synthetic data generation. Below is a detailed list of the technologies used, incorporating technical details from the GitHub repositories of the projects we integrated.
Programming Languages
- Python: The primary programming language used for development due to its extensive libraries and support for machine learning, data processing, and AI frameworks.
Frameworks and Libraries
Deep Learning Frameworks
- PyTorch: Used as the main deep learning framework for implementing neural networks and machine learning models.
- Documentation: PyTorch Official Documentation
- Transformers (by Hugging Face): Utilized for working with pre-trained transformer models, handling tokenization, and model architectures.
- Documentation: Transformers Documentation
Computer Vision and Image Processing
- OpenCV: Used for image and video processing tasks, such as reading video frames and manipulating images.
- Documentation: OpenCV Documentation
- NumPy: For numerical computations and handling arrays and matrices, essential for image and audio data manipulation.
- Documentation: NumPy Documentation
Audio Processing
- Whisper (by OpenAI): Used for accurate audio transcription and processing within the EchoMimic project.
- GitHub Repository: Whisper GitHub
- Documentation: Whisper Documentation
Data Handling and Utilities
- Pandas: For data manipulation and analysis, particularly useful in handling datasets and preprocessing.
- Documentation: Pandas Documentation
- Matplotlib: For plotting and visualizations during development and debugging.
- Documentation: Matplotlib Documentation
Integrated Projects and Their Technologies
1. PLLaVA (Pooling LLaVA)
- Purpose: Extended image-language pre-training models to video data for dense captioning.
- GitHub Repository: PLLaVA GitHub
Technologies Used in PLLaVA:
- PyTorch: For model implementation and training.
- Hugging Face Transformers: For handling transformer-based models.
- Accelerate (by Hugging Face): For distributed training and hardware acceleration.
- Documentation: Accelerate Documentation
- CUDA: For GPU acceleration during model training and inference.
- Documentation: CUDA Toolkit Documentation
Additional Details from Documentation:
- Requirements:
- Python 3.10
- Torch 2.2.1+cu118 (or appropriate CUDA version)
- Setup Instructions:
- Uses requirements.txt for managing dependencies.
- Provides scripts for model preparation and evaluation.
2. CosmicMan
- Purpose: Specialized in generating high-fidelity human images with meticulous appearance and precise text-image alignment.
- GitHub Repository: CosmicMan GitHub
Technologies Used in CosmicMan:
- Stable Diffusion: Used for high-quality image generation.
- GitHub Repository: Stable Diffusion GitHub
- Diffusers (by Hugging Face): A library for diffusion models, facilitating the use of Stable Diffusion.
- Documentation: Diffusers Documentation
- Gradio: For creating user-friendly web interfaces and demos.
- Documentation: Gradio Documentation
- PyTorch: Core framework for model development.
- CUDA: For GPU acceleration.
Additional Details from Documentation:
- Requirements:
- Python 3.10
- Torch with appropriate CUDA version
- Setup Instructions:
- Install dependencies via requirements.txt.
- Provides inference scripts and training code.
- Models and Checkpoints:
- Uses pre-trained models like CosmicMan-SDXL and CosmicMan-SD available on Hugging Face.
3. EchoMimic
- Purpose: Lifelike audio-driven portrait animations through editable landmark conditioning.
- GitHub Repository: EchoMimic GitHub
Technologies Used in EchoMimic:
- PyTorch: For implementing neural networks and training models.
- Diffusers: For working with diffusion models, enhancing image generation quality.
- Gradio: To build interactive demos and user interfaces.
- Whisper (by OpenAI): For audio transcription and processing.
- FFmpeg: For video and audio processing tasks.
- Official Website: FFmpeg
- CUDA: For accelerating computations on GPUs.
Additional Details from Documentation:
- Requirements:
- Python 3.8 or higher
- Torch with appropriate CUDA version
- Setup Instructions:
- Install dependencies via requirements.txt.
- Download FFmpeg-static and set FFMPEG_PATH.
- Provides detailed instructions for running inference and demos.
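Because EchoMimic locates its FFmpeg binary through the FFMPEG_PATH environment variable, the variable must be set before running inference. A minimal sketch (the path shown is illustrative):

```python
import os

# EchoMimic reads the ffmpeg location from FFMPEG_PATH; point it at
# your own ffmpeg-static download (this path is only an example).
os.environ["FFMPEG_PATH"] = "/opt/ffmpeg-static"
```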
- Models and Checkpoints:
- Pre-trained models are available and need to be downloaded.
- Utilizes denoising_unet.pth, reference_unet.pth, and other checkpoints.
Development Tools and Platforms
- Visual Studio Code (VS Code): Used as the integrated development environment (IDE) for coding, debugging, and managing the project.
- Official Website: Visual Studio Code
- Git and GitHub: For version control and collaboration.
- Git Documentation: Git Documentation
- GitHub: GitHub
- Anaconda/Conda: For managing Python environments and dependencies, ensuring consistency across different development setups.
- Anaconda Distribution: Anaconda
Cloud Services and GPU Acceleration
- NVIDIA CUDA: Leveraged for GPU acceleration, crucial for training and running deep learning models efficiently.
- Documentation: CUDA Toolkit Documentation
- Hugging Face Model Hub: Used for hosting and accessing pre-trained models and datasets.
- Website: Hugging Face Model Hub
APIs and Interfaces
- Gradio: Employed to create web-based interfaces for model demos, allowing interactive testing and showcasing of functionalities.
- Documentation: Gradio Documentation
- Hugging Face Transformers API: For interacting with transformer models, tokenizers, and pipelines.
- Documentation: Transformers API
Data and Datasets
- CosmicMan-HQ 1.0 Dataset: A large-scale dataset used in CosmicMan for training high-fidelity human image generation models.
- BabitMF Framework: Utilized for data preprocessing, allowing users to specify key contexts to preserve during data masking.
- YouTube Videos: Sourced driver dashcam footage from YouTube, processed into clips for creating the test dataset.
Additional Tools and Libraries
- PyYAML: For parsing YAML configuration files, facilitating easy configuration management.
- Documentation: PyYAML Documentation
- FFmpeg-static: A static build of FFmpeg used within EchoMimic for handling video processing tasks without the need for a separate FFmpeg installation.
- TorchVision and TorchAudio: PyTorch libraries for vision and audio utilities, providing datasets, models, and transforms.
- Documentation:
- TorchVision Documentation
- TorchAudio Documentation
- Accelerate (by Hugging Face): Used in PLLaVA for efficient training across multiple GPUs and hardware acceleration.
- Documentation: Accelerate Documentation
Version Control and Collaboration
- Git: For tracking changes in the codebase, enabling collaboration among team members.
- Documentation: Git Documentation
- GitHub: Hosted the repositories for PLLaVA, CosmicMan, and EchoMimic, facilitating collaboration and version control.
- Repositories:
- PLLaVA GitHub
- CosmicMan GitHub
- EchoMimic GitHub
By integrating these technologies, frameworks, and tools, we developed a system capable of processing multimedia data to preserve essential context while ensuring privacy compliance. The combination of advanced AI models, efficient data processing libraries, and user-friendly interfaces allowed us to create a solution that addresses the challenges of traditional data masking methods.
References to Documentation and Repositories
- BabitMF: BMF GitHub
- Documentation: Website
- PLLaVA:
- Repository: PLLaVA GitHub
- Documentation: Included within the repository, detailing installation, usage, and technical details.
- CosmicMan:
- Repository: CosmicMan GitHub
- Documentation: Provided in the repository, with instructions on setup, inference, and training.
- EchoMimic:
- Repository: EchoMimic GitHub
- Documentation: Comprehensive guidelines available in the repository, covering installation, model usage, and demos.
