MultiModal Data Mask: Preserving Essential Context in Data Masking through Automated Synthetic Data Generation
Inspiration
In the era of big data and AI, preserving user privacy while maintaining data utility is a significant challenge. Traditional data masking methods often strip away crucial context—such as race, gender, age, and emotions—that is vital for training accurate and fair AI models. This loss of context can lead to models that are biased or perform poorly, ultimately impacting the end-users who rely on these technologies. We wanted to address this issue by developing a solution that preserves essential context while ensuring compliance with Personally Identifiable Information (PII) regulations.
What It Does
Our solution allows researchers and developers to:
- Upload Multimedia Data: Users can upload datasets (e.g., driver dashcam footage) that contain sensitive information.
- Specify Context to Preserve: Users define which vital contexts (like facial expressions, age, gender, etc.) should be maintained in the dataset.
- Generate Synthetic Data: The system processes the input data to produce a new dataset where sensitive identifiers are anonymized, but the specified contexts are preserved.
- Prepare for Model Training: The resulting dataset is compliant with PII standards and ready for use in training machine learning models without sacrificing essential contextual information.
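The workflow above can be pictured as a single masking request. The field names below are purely illustrative, not the tool's actual API:

```python
# Illustrative request spec for one masking job (hypothetical field names):
mask_request = {
    "input": "dashcam_clip_001.mp4",  # uploaded multimedia data
    # vital contexts the synthetic output must retain
    "preserve": ["age", "gender", "facial_expression", "emotion"],
    # PII that must not survive into the output
    "anonymize": ["face_identity", "voice_print", "license_plates"],
}
```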
How We Built It
We developed a pipeline that integrates multiple state-of-the-art AI models and frameworks to achieve our goal:
1. Data Collection and Preprocessing
- Curated Dataset: We collected driver dashcam footage from YouTube, trimming it into 15-30 minute clips to create our test dataset.
- BabitMF Framework: The clips were processed using the BabitMF (BMF) framework, which lets users specify the key contexts to preserve.
2. Dense Video Captioning with PLLaVA
- Extension to Video Data: We utilized PLLaVA (Pooling LLaVA), which extends image-language pre-training models to video data for dense captioning.
- Pooling Strategy: PLLaVA employs a pooling strategy that smooths feature distributions along the temporal dimension, reducing the dominance of certain tokens and improving performance.
- Context Extraction: It provides detailed descriptions of each scene, capturing attributes like age, facial expressions, gender, and more.
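The pooling idea can be illustrated with plain average pooling over the temporal axis. This is a simplified NumPy sketch of the concept, not PLLaVA's actual implementation:

```python
import numpy as np

def temporal_pool(frame_features: np.ndarray, target_frames: int) -> np.ndarray:
    """Average-pool per-frame features down to a fixed temporal length.

    frame_features: (T, D) array of per-frame visual features.
    Returns a (target_frames, D) array whose rows are means over
    contiguous temporal windows, smoothing the feature distribution
    so no single frame's tokens dominate.
    """
    T, _ = frame_features.shape
    # Split the T frames into target_frames near-equal windows.
    bounds = np.linspace(0, T, target_frames + 1).astype(int)
    return np.stack([
        frame_features[bounds[i]:bounds[i + 1]].mean(axis=0)
        for i in range(target_frames)
    ])

# 16 frames of 4-dim features pooled down to 4 temporal slots
feats = np.arange(64, dtype=float).reshape(16, 4)
pooled = temporal_pool(feats, 4)
print(pooled.shape)  # (4, 4)
```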
3. Audio Transcription with Gemini
- Speech Recognition: We used Gemini to transcribe audio from the videos, ensuring that auditory context is also preserved.
- Integration: The transcriptions were combined with the visual captions to create a comprehensive understanding of each scene.
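Combining the two streams amounts to aligning transcript lines with caption time windows. A minimal sketch of that merge step (the data structures are illustrative):

```python
def merge_scene_context(captions, transcripts):
    """Attach each transcript line to the caption whose time window contains it.

    captions:    list of (start_s, end_s, caption_text) from the captioner
    transcripts: list of (time_s, spoken_text) from the transcriber
    Returns one record per scene combining visual and auditory context.
    """
    scenes = []
    for start, end, caption in captions:
        speech = [text for t, text in transcripts if start <= t < end]
        scenes.append({"start": start, "end": end,
                       "caption": caption, "speech": speech})
    return scenes

captions = [(0, 10, "Middle-aged driver, calm expression, hands on wheel"),
            (10, 20, "Driver glances left, appears surprised")]
transcripts = [(3.2, "Traffic is light today."), (12.5, "Whoa, watch out!")]
scenes = merge_scene_context(captions, transcripts)
```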
4. Lifelike Portrait Generation with CosmicMan
- Text-to-Image Generation: Descriptions from PLLaVA were converted into prompts for CosmicMan, a model specialized in generating high-fidelity human images.
- Annotate Anyone Paradigm: CosmicMan uses a new data production approach called "Annotate Anyone", enabling the creation of high-quality data with accurate annotations.
- Daring Training Framework: Introduces Daring (Decomposed-Attention-Refocusing) to effectively model the relationship between dense text descriptions and image pixels.
- Preserving Context: Generates lifelike portraits that retain the specified vital contexts like race, gender, age, and emotions.
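Prompt construction for this step boils down to forwarding only the user-selected attributes into the text prompt. A minimal sketch (function and field names are hypothetical):

```python
def build_portrait_prompt(attrs, preserve):
    """Turn captioner-extracted attributes into a dense text-to-image
    prompt, keeping only the contexts the user chose to preserve.
    Identity-revealing details are deliberately never forwarded."""
    kept = [attrs[k] for k in preserve if k in attrs]
    return "photorealistic portrait, " + ", ".join(kept)

attrs = {"age": "middle-aged", "gender": "man", "emotion": "surprised",
         "name_on_badge": "J. Smith"}  # PII: never part of the prompt
prompt = build_portrait_prompt(attrs, ["age", "gender", "emotion"])
print(prompt)  # photorealistic portrait, middle-aged, man, surprised
```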
5. Video Animation with EchoMimic
- Audio-Driven Animation: The portraits and audio transcriptions were input into EchoMimic, which creates lifelike animations driven by audio.
- Editable Landmark Conditioning: Utilizes editable landmarks to ensure the animated face aligns with the desired expressions and mouth movements.
- Model Architecture:
- Motion Module: Captures motion dynamics from the audio.
- Denoising UNet & Reference UNet: Enhance image quality and ensure consistency.
- Face Locator: Precisely aligns facial features for realistic animations.
- Anonymization: Generates videos that maintain the same audio and expressions but with different (anonymized) faces.
6. Final Output
- Synthetic Dataset: The synthesized videos are compiled into a dataset that is ready for model fine-tuning.
- Compliance and Utility: The dataset meets PII compliance standards while preserving essential contexts, enabling the training of accurate and unbiased models.
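The six steps above chain into a single data flow, which can be sketched as follows. The `*_fn` callables are stand-ins for PLLaVA, Gemini, CosmicMan, and EchoMimic respectively; this shows the flow of context through the pipeline, not the real model interfaces:

```python
def mask_video(clip_path, preserve, caption_fn, transcribe_fn,
               portrait_fn, animate_fn):
    """End-to-end sketch of the masking pipeline."""
    context = caption_fn(clip_path, preserve)   # PLLaVA: attributes to keep
    transcript = transcribe_fn(clip_path)       # Gemini: auditory context
    portrait = portrait_fn(context)             # CosmicMan: anonymized face
    return animate_fn(portrait, transcript)     # EchoMimic: synthetic clip

# Toy stand-ins just to show the data flow:
synthetic = mask_video(
    "clip_001.mp4", ["age", "emotion"],
    caption_fn=lambda path, keep: {"age": "young", "emotion": "calm"},
    transcribe_fn=lambda path: "Traffic is light today.",
    portrait_fn=lambda ctx: f"portrait({ctx['age']}, {ctx['emotion']})",
    animate_fn=lambda face, audio: f"{face} animated to '{audio}'",
)
print(synthetic)
```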
Challenges We Ran Into
- Integration Complexity: Combining multiple advanced AI models with different input and output formats required careful handling and customization.
- Data Privacy vs. Context Preservation: Balancing the anonymization of PII with the preservation of vital context was a significant challenge.
- Resource Intensiveness: Processing high-resolution video data and generating synthetic content is computationally intensive.
Accomplishments That We're Proud Of
- Successful Integration: Managed to integrate PLLaVA, CosmicMan, and EchoMimic into a functional pipeline.
- Context Preservation: Found a solution that maintains essential contextual information without compromising user privacy.
- Scalable Solution: Developed a method that can be extended to other data types and industries.
What We Learned
- Importance of Context: Preserving vital context in data is crucial for training effective AI models.
- Model Compatibility: Gained insights into adapting and integrating different AI models to work together.
- Data Ethics: Deepened our understanding of data privacy regulations and ethical considerations in AI.
What's Next for MultiModal Data Mask
- Expand to Other Domains: Apply the solution to other areas like healthcare, finance, and surveillance where context preservation is critical.
- Performance Optimization: Improve the efficiency of the pipeline to handle larger datasets and reduce processing time.
- User Interface Development: Create a user-friendly interface to make the tool accessible to non-technical users.
- Feedback Incorporation: Gather user feedback to refine features and address additional challenges.
Technical Details
- Programming Language: Python
Frameworks and Libraries:
- PLLaVA: Adapted for video dense captioning using a pooling strategy to extend image-language models to video data.
- CosmicMan: Utilized for generating high-fidelity human images with precise text-image alignment via the Daring training framework.
- EchoMimic: Employed for audio-driven portrait animations using editable landmark conditioning.
- Gemini: Used for accurate audio transcription.
- BabitMF Framework: Facilitated data preprocessing and user input for context preservation.
Inspiration from GitHub Projects
- PLLaVA: Learned from the implementation of parameter-free extensions from images to videos, achieving state-of-the-art performance in video captioning tasks.
- CosmicMan: Adopted techniques from their new data production paradigm and attention refocusing methods to generate high-fidelity images.
- EchoMimic: Integrated concepts of lifelike audio-driven animations and editable landmark conditioning to produce realistic video outputs.
Technologies Used
In our project, we utilized a range of programming languages, frameworks, platforms, and tools to build an integrated solution that preserves essential context in data masking through automated synthetic data generation. Below is a detailed list of the technologies used, incorporating technical details from the GitHub repositories of the projects we integrated.
Programming Languages
- Python: The primary programming language used for development due to its extensive libraries and support for machine learning, data processing, and AI frameworks.
Frameworks and Libraries
Deep Learning Frameworks
- PyTorch: Used as the main deep learning framework for implementing neural networks and machine learning models.
- Documentation: PyTorch Official Documentation
- Transformers (by Hugging Face): Utilized for working with pre-trained transformer models, handling tokenization, and model architectures.
- Documentation: Transformers Documentation
Computer Vision and Image Processing
- OpenCV: Used for image and video processing tasks, such as reading video frames and manipulating images.
- Documentation: OpenCV Documentation
- NumPy: For numerical computations and handling arrays and matrices, essential for image and audio data manipulation.
- Documentation: NumPy Documentation
Audio Processing
- Whisper (by OpenAI): Used for accurate audio transcription and processing within the EchoMimic project.
- GitHub Repository: Whisper GitHub
- Documentation: Whisper Documentation
Data Handling and Utilities
- Pandas: For data manipulation and analysis, particularly useful in handling datasets and preprocessing.
- Documentation: Pandas Documentation
- Matplotlib: For plotting and visualizations during development and debugging.
- Documentation: Matplotlib Documentation
Integrated Projects and Their Technologies
1. PLLaVA (Pooling LLaVA)
- Purpose: Extended image-language pre-training models to video data for dense captioning.
- GitHub Repository: PLLaVA GitHub
Technologies Used in PLLaVA:
- PyTorch: For model implementation and training.
- Hugging Face Transformers: For handling transformer-based models.
- Accelerate (by Hugging Face): For distributed training and hardware acceleration.
- Documentation: Accelerate Documentation
- CUDA: For GPU acceleration during model training and inference.
- Documentation: CUDA Toolkit Documentation
Additional Details from Documentation:
- Requirements:
- Python 3.10
- Torch 2.2.1+cu118 (or appropriate CUDA version)
- Setup Instructions:
- Uses requirements.txt for managing dependencies.
- Provides scripts for model preparation and evaluation.
2. CosmicMan
- Purpose: Specialized in generating high-fidelity human images with meticulous appearance and precise text-image alignment.
- GitHub Repository: CosmicMan GitHub
Technologies Used in CosmicMan:
- Stable Diffusion: Used for high-quality image generation.
- GitHub Repository: Stable Diffusion GitHub
- Diffusers (by Hugging Face): A library for diffusion models, facilitating the use of Stable Diffusion.
- Documentation: Diffusers Documentation
- Gradio: For creating user-friendly web interfaces and demos.
- Documentation: Gradio Documentation
- PyTorch: Core framework for model development.
- CUDA: For GPU acceleration.
Additional Details from Documentation:
- Requirements:
- Python 3.10
- Torch with appropriate CUDA version
- Setup Instructions:
- Install dependencies via requirements.txt.
- Provides inference scripts and training code.
- Models and Checkpoints:
- Uses pre-trained models like CosmicMan-SDXL and CosmicMan-SD available on Hugging Face.
3. EchoMimic
- Purpose: Lifelike audio-driven portrait animations through editable landmark conditioning.
- GitHub Repository: EchoMimic GitHub
Technologies Used in EchoMimic:
- PyTorch: For implementing neural networks and training models.
- Diffusers: For working with diffusion models, enhancing image generation quality.
- Gradio: To build interactive demos and user interfaces.
- Whisper (by OpenAI): For audio transcription and processing.
- FFmpeg: For video and audio processing tasks.
- Official Website: FFmpeg
- CUDA: For accelerating computations on GPUs.
Additional Details from Documentation:
- Requirements:
- Python 3.8 or higher
- Torch with appropriate CUDA version
- Setup Instructions:
- Install dependencies via requirements.txt.
- Download FFmpeg-static and set FFMPEG_PATH.
- Provides detailed instructions for running inference and demos.
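Because EchoMimic locates its FFmpeg binary through the FFMPEG_PATH environment variable, the variable must be set before running inference. A minimal sketch (the path shown is illustrative):

```python
import os

# EchoMimic reads the ffmpeg location from FFMPEG_PATH; point it at
# your own ffmpeg-static download (this path is only an example).
os.environ["FFMPEG_PATH"] = "/opt/ffmpeg-static"
```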
- Models and Checkpoints:
- Pre-trained models are available and need to be downloaded.
- Utilizes denoising_unet.pth, reference_unet.pth, and other checkpoints.
Development Tools and Platforms
- Visual Studio Code (VS Code): Used as the integrated development environment (IDE) for coding, debugging, and managing the project.
- Official Website: Visual Studio Code
- Git and GitHub: For version control and collaboration.
- Git Documentation: Git Documentation
- GitHub: GitHub
- Anaconda/Conda: For managing Python environments and dependencies, ensuring consistency across different development setups.
- Anaconda Distribution: Anaconda
Cloud Services and GPU Acceleration
- NVIDIA CUDA: Leveraged for GPU acceleration, crucial for training and running deep learning models efficiently.
- Documentation: CUDA Toolkit Documentation
- Hugging Face Model Hub: Used for hosting and accessing pre-trained models and datasets.
- Website: Hugging Face Model Hub
APIs and Interfaces
- Gradio: Employed to create web-based interfaces for model demos, allowing interactive testing and showcasing of functionalities.
- Documentation: Gradio Documentation
- Hugging Face Transformers API: For interacting with transformer models, tokenizers, and pipelines.
- Documentation: Transformers API
Data and Datasets
- CosmicMan-HQ 1.0 Dataset: A large-scale dataset used in CosmicMan for training high-fidelity human image generation models.
- BabitMF Framework: Utilized for data preprocessing, allowing users to specify key contexts to preserve during data masking.
- YouTube Videos: Sourced driver dashcam footage from YouTube, processed into clips for creating the test dataset.
Additional Tools and Libraries
- PyYAML: For parsing YAML configuration files, facilitating easy configuration management.
- Documentation: PyYAML Documentation
- FFmpeg-static: A static build of FFmpeg used within EchoMimic for handling video processing tasks without the need for a separate FFmpeg installation.
- TorchVision and TorchAudio: PyTorch libraries for vision and audio utilities, providing datasets, models, and transforms.
- Documentation:
- TorchVision Documentation
- TorchAudio Documentation
- Accelerate (by Hugging Face): Used in PLLaVA for efficient training across multiple GPUs and hardware acceleration.
- Documentation: Accelerate Documentation
Version Control and Collaboration
- Git: For tracking changes in the codebase, enabling collaboration among team members.
- Documentation: Git Documentation
- GitHub: Hosted the repositories for PLLaVA, CosmicMan, and EchoMimic, facilitating collaboration and version control.
- Repositories:
- PLLaVA GitHub
- CosmicMan GitHub
- EchoMimic GitHub
By integrating these technologies, frameworks, and tools, we developed a system capable of processing multimedia data to preserve essential context while ensuring privacy compliance. The combination of advanced AI models, efficient data processing libraries, and user-friendly interfaces allowed us to create a solution that addresses the challenges of traditional data masking methods.
References to Documentation and Repositories
- BabitMF: BMF GitHub
- Documentation: Website
- PLLaVA:
- Repository: PLLaVA GitHub
- Documentation: Included within the repository, detailing installation, usage, and technical details.
- CosmicMan:
- Repository: CosmicMan GitHub
- Documentation: Provided in the repository, with instructions on setup, inference, and training.
- EchoMimic:
- Repository: EchoMimic GitHub
- Documentation: Comprehensive guidelines available in the repository, covering installation, model usage, and demos.
