Flow-Based Generative Models
📸 Flow animation.

Imagine a cloud of particles drifting through space. At first, this cloud looks like random Gaussian noise. As time flows, the cloud reshapes itself, stretching and bending, until finally it resembles the distribution of real-world data we want to generate.

How can we describe this transformation mathematically? The answer begins with the continuity equation.


The Continuity Equation: Conserving Probability

Probability mass behaves like an incompressible fluid — it cannot just vanish or be created from nothing. The way it moves is governed by the continuity equation:

\[\frac{\partial}{\partial t} p_t(x) + \nabla_x \cdot \big( p_t(x) \, u_t(x) \big) = 0.\]

Here:

  • $p_t(x)$ is the probability density at time $t$,
  • $u_t(x)$ is the velocity field describing how points at $x$ move,
  • $\nabla_x \cdot$ is the divergence operator, capturing how mass flows in or out of a region.

👉 Intuitively, this says: if probability density decreases at some point $x$, it must be because the density is flowing away, carried along by the velocity field $u_t(x)$.

This law of conservation is the backbone of all flow-based models.


The Probability Path: Watching Distributions Evolve

Now let’s step back. We want to morph a simple distribution (like Gaussian noise) into a complicated data distribution (like natural images). To describe this morphing, we introduce the idea of a probability path:

\[\{ p_t(x) \}_{t \in [0,1]}, \quad p_0(x) = p_\text{source}(x), \; p_1(x) = p_\text{target}(x).\]

This path is just a smooth sequence of probability densities indexed by time $t$.

  • At $t=0$, the distribution is pure noise.
  • At $t=1$, it has become the target data distribution.
  • For values of $t$ in between, we see the intermediate “shapes” of the distribution.

Think of it as a movie of the distribution continuously warping from one form into another.
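To make this concrete, here is a tiny sketch of one simple choice of path: linear interpolation between source and target samples. It assumes NumPy and scikit-learn are available and uses the toy two-moons dataset as a stand-in "data" distribution:

import numpy as np
from sklearn.datasets import make_moons

# Target: a toy 2-D "data" distribution; source: standard Gaussian noise.
x1, _ = make_moons(n_samples=1000, noise=0.05)
x0 = np.random.randn(*x1.shape)

# Snapshots along the path X_t = (1 - t) X_0 + t X_1 at a few times t.
for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    xt = (1 - t) * x0 + t * x1
    print(f"t={t:.2f}  mean={xt.mean(0).round(2)}  std={xt.std(0).round(2)}")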


The Velocity Field: Moving Individual Particles

While the probability path describes how the whole distribution changes, we also want to describe how individual particles move inside this evolving cloud. That’s where the velocity field $u_t(x)$ comes in.

For a single particle trajectory $X_t$:

\[\frac{d}{dt} X_t = u_t(X_t).\]

This equation tells us that if a particle is at position $X_t$ at time $t$, its next step is determined by the velocity field at that point.

👉 So, the velocity field is the “microscopic rule” that, when applied to all particles, produces the macroscopic evolution of the probability distribution.
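In code, this ODE can be integrated with simple Euler steps. Below is a minimal sketch (NumPy assumed) with a hand-written straight-line velocity field that transports any starting point to a fixed target by $t=1$:

import numpy as np

def euler_flow(x0, u, n_steps=100):
    # Integrate dX/dt = u(t, X) from t = 0 to t = 1 with Euler steps.
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * u(t, x)
    return x

# Toy field: u_t(x) = (target - x) / (1 - t) moves x along a straight line
# so that it lands on the target exactly at t = 1.
target = np.array([2.0, -1.0])
u = lambda t, x: (target - x) / (1.0 - t)
print(euler_flow(np.zeros(2), u))  # ~ [ 2. -1.]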


The Flow: A Global Transformation

If we follow each particle’s trajectory over time, we get a flow map:

\[\Phi_t : x_0 \mapsto x_t,\]

where $\Phi_t$ is the solution of the ODE:

\[\frac{d}{dt} x_t = u_t(x_t), \quad x_0 \sim p_0.\]

This map $\Phi_t$ tells us how to transport an initial sample $x_0$ (from the source distribution) into a new point $x_t$ at time $t$.

  • The flow of particles created by this map gives rise to the flow of probability densities $p_t(x)$ described earlier.

The continuity equation ensures the two views (microscopic particles and macroscopic densities) stay perfectly in sync.


Putting It All Together

Here’s the full picture:

  1. The continuity equation guarantees probability mass is conserved: \(\frac{\partial}{\partial t} p_t(x) = -\nabla_x \cdot \big(p_t(x) u_t(x)\big).\)

  2. The velocity field $u_t(x)$ determines how individual samples move: \(\frac{d}{dt} X_t = u_t(X_t).\)

  3. Following these trajectories defines the flow $\Phi_t$ that maps the source distribution into the target: \(\{X_t\}_{t \in [0,1]} \;\;\Rightarrow\;\; \{p_t(x)\}_{t \in [0,1]}.\)

In simple words:

  • The probability path shows how distributions change in time.
  • The velocity field shows how points move in time.
  • The flow is the overall transformation carrying one distribution into another.
  • And the continuity equation is the glue that ensures consistency between them.

With this story, we’ve built the mathematical foundation for flow-based generative models: they are nothing more than learning the right velocity field $u_t(x)$ so that the flow transforms Gaussian noise into real data.

Code Part
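As a concrete endpoint, here is a minimal flow-matching-style training sketch in PyTorch. It is only illustrative: sample_data is a hypothetical helper returning a batch of target samples, and the network and hyperparameters are arbitrary choices, not a reference implementation.

import torch
import torch.nn as nn

# Small MLP approximating the velocity field u_t(x); the input is (x, t).
class VelocityNet(nn.Module):
    def __init__(self, dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

model = VelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x1 = sample_data(256)           # hypothetical: a batch of target samples
    x0 = torch.randn_like(x1)       # source: Gaussian noise
    t = torch.rand(x1.shape[0], 1)  # random times in [0, 1]
    xt = (1 - t) * x0 + t * x1      # point on the linear probability path
    target_u = x1 - x0              # velocity of that path at x_t
    loss = ((model(xt, t) - target_u) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

At sampling time, one integrates $\frac{d}{dt} X_t = u_t(X_t)$ with the learned field (for example using the Euler loop shown earlier), starting from Gaussian noise at $t=0$.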

Pages from My IITM Chapter

This is my journey at IIT Madras.

Sri Gurubhyo Namaha

In 2023, when it was time to choose a professor for my MS at IIT Madras, I made the decision to work under Prof. Kaushik Mitra. It wasn’t just his deep expertise in the field of computer vision that drew me to him—it was also his spiritual depth and character. I believed that with him, I could grow not just academically, but also personally.

In today’s world, with the power of the internet, anyone can pick up technical knowledge. But becoming a better person—one with strong character and values—takes more than just online resources. You need to be around people who inspire you, and for me, Prof. Mitra was that person. I saw in him someone I could learn from on multiple levels.

The lab he leads has a strong, positive ecosystem, and I believe that’s a direct reflection of who he is as a person and a mentor. I feel genuinely fortunate to have had the opportunity to work under him, learn from him, and contribute to the lab. Whatever perks or opportunities I’ve gained after completing my MS, I owe a great deal to his guidance and support.

📸 Sarvam Shree Jagannath.

The initial days were tough—managing both education and a job at the same time was one of the biggest challenges I’ve ever faced. I came to IIT with a dream to build something big, and I knew it would take everything I had. Social life took a backseat; hanging out with friends or maintaining a circle just didn’t seem possible at the time.

But along the way, I was fortunate to meet people who became true friends. They influenced me in the best ways and reminded me that friendships are just as important as the degree you earn here. My undergraduate journey had its share of struggles, but IIT gave me a second chance to turn things around—both academically and personally. From education to friendships, this place helped fix what I thought was broken.

The HINDI group

Being a South Indian and an introvert, I never really felt the need to hang out beyond my comfort zone—especially with people from completely different backgrounds. But life had other plans, and that’s how I met Hrithik—my first friend at IIT.

He showed up during both the tough times and the fun ones, constantly reminding me that we’re not just here to study, but also to live a little and enjoy the journey. And honestly, the time we spent together was just lit. Late-night dinners, long walks, random deep talks about life and lab issues—those moments made this place feel like home.

Despite being… well, let’s just say a little stupid, Hrithik is one of the most honest and genuinely good people I’ve met. There’s a fine line between being crazy and being normal—and I’m pretty sure he’s got a foot on either side of that line. But that’s exactly what made it all so fun.

Looking back, I realize that these are the memories that really stick. You’re only in your 20s once, and this kind of chaos—you don’t get to relive it later. I’m just glad I had someone like Hrithik to share it with.

📸 You never live in your 20s again.

I hereby declare this man (Vinayak) a part of my Family

I honestly don’t care what anyone thinks—but if there’s one person who deserves credit for pushing me to finish my MS and making things fall into place, it’s this guy. It all started with a random call where he said, “Let’s cook something in 3D computer vision.” I was hesitant at first. But because of our shared bond over Krishna Bhakti, I said, “Yes, let’s cook.” That decision turned out to be one of the best I’ve ever made.

Since then, he’s become more than a collaborator—he’s like family. He works insanely hard, doesn’t have a big circle (just like me), and we’ve both walked similar paths—facing the same pain, sharing the same ambitions. We’ve worked on multiple projects together and published two papers. But the real win? His family has become like my own, and that connection has added a whole new layer of joy to this journey.

With mutual trust between our families, we even got to travel together to Milan, Italy, to present our work at ECCV. On top of that, we took two unforgettable spiritual trips—one to the North and one to the South—to seek blessings from Lord Vishnu and Lord Shiva. Those moments were nothing short of magical.

What’s even more inspiring is that despite being younger than me, he’s taught me so much—about responsibility, making the right choices in life, and staying grounded in both education and spirituality.

📸 Now Vinayak's family is my family too.

It's not Computer Vision, it's the Computational Imaging Lab

I’ve always loved images—how they’re captured, the kind of cameras used, how to make them look beautiful, and how to fix the blur or other degradations. That’s pretty much what I did at IIT. Nothing too fancy on the outside—it sounds simple. But once you dive into the actual coding and research, you realize how tough it really is.

Whenever I hit roadblocks, I’d look around at my lab and the people in it. The support system here is something special. The team is humble, and that humility starts from the top—with Prof. Kaushik Mitra. His leadership creates an atmosphere where we know we’re in it together, and no matter how difficult things get, we’ll find a way through.

Research, especially in AI these days, isn’t easy. The tough times are many, and the good times come rarely—but they do come. And when they do, they feel earned. What really helped me push my work forward and get my papers accepted at conferences was the constant support from my labmates. I honestly couldn’t have done it without them.

📸 Labmates and willkommen celebrations.

I know this is not the end, but we need to leave heaven (IITM) now.

My journey at IIT Madras began in a lab at HTIC, where Aswath and I used to constantly wonder about the future—Will we make it? Will we graduate together? Those questions were always floating around as we worked side by side.

Vinayak played a key role too—always pushing me, encouraging me to submit my MS thesis on time so that we could all graduate together. And in the end, we did it. We made it.

This day—the day we graduated—is easily one of the most memorable moments of my life. What made it truly special was being able to bring my parents along. Seeing them proud, smiling, and standing beside me on that day—that feeling is something I’ll carry with me forever.

Here’s a small gallery of those graduation memories, moments that remind me not just of academic achievement, but of friendships, teamwork, and the journey that brought us here.

📸 We started from the bottom and we are here together at the end as well.

Hare Krishna

Auto Deep Learning

Basic Training Process

Suppose the goal is to understand the types of cars present in your city by conducting a survey that gathers information such as the car’s brand and certain characteristics (e.g., color, type, size). This can be solved by building two types of deep learning models:

  • Detection model: To detect the presence and location of cars in the images/videos.
  • Classification model: To classify the brand and other characteristics of each detected car.

My work has primarily involved video datasets, with a focus on building detection and classification models. However, this general deep learning development pipeline can be applied across various systems beyond just videos.

The typical process involves the following stages:

1. Data Collection

  • Gather raw video footage from different sources across the city.
  • Since the raw video contains a mixture of useful and irrelevant footage, it is essential to carefully select relevant frames.
  • This often requires manual effort: cherry-picking frames where cars are clearly visible and ignoring noisy or irrelevant data.

2. Annotation

  • After selecting frames, the next step is manual annotation:
    • Drawing bounding boxes around cars (for detection).
    • Labeling brand names and characteristics (for classification).
  • This step is time-consuming and crucial because model performance heavily depends on the quality of the annotations.

3. Model Training

  • Train deep learning models on the annotated dataset.
  • The training process usually involves:
    • Tuning hyperparameters,
    • Performing multiple iterations of training and evaluation,
    • Conducting stress testing under various conditions (e.g., different lighting, angles, occlusions) to ensure robustness.

Data Cleaning and Annotation in Self-Driving Car Datasets

In the context of self-driving cars, not all data collected is relevant for training deep learning models. It is essential to clean the data and discard irrelevant or redundant information. Below, we outline some common scenarios and techniques for data cleaning and annotation.

1. Data Cleaning

In a self-driving car dataset, certain situations may involve irrelevant data that should be discarded. Some examples include:

Example Scenarios

  • Standing at Traffic Signals: If the car is stationary at a traffic light for an extended period, the frames captured during this time may not contain significant changes, rendering them useless for model training.

  • Running on Empty Streets: Similarly, when the car is running on an empty street, there may be little to no interaction with other vehicles or pedestrians. These frames could provide minimal information and should be discarded.

Techniques for Data Cleaning

Several computer vision techniques can assist in cleaning the dataset by identifying and removing redundant or irrelevant frames:

  • Optical Flow: Optical flow can be used to track the movement between consecutive frames. If there is little to no movement between frames, it indicates minimal changes, which could be discarded. This helps in capturing only the “important” frames that contain meaningful visual changes.

  • Embedding Space Analysis:

    • Image Embeddings: Models like ResNet or CLIP can be used to create embeddings for each frame. By comparing the embeddings of consecutive frames, we can identify near-duplicate frames. If two frames are too similar in the embedding space, one of them can be discarded (a minimal sketch follows this list).
    • Text Embeddings via VQA Models: Visual Question Answering (VQA) models can be used to extract textual information from images. By using VQA models and performing a semantic comparison of text embeddings, we can identify frames with repetitive or irrelevant content based on the questions and answers extracted from the frames.
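As a concrete illustration of the embedding-based approach, here is a minimal sketch using a pretrained ResNet-18 from torchvision as the feature extractor. The 0.98 similarity threshold is an arbitrary assumption you would tune per dataset:

import torch
import torchvision.models as models
import torchvision.transforms as T
from torch.nn.functional import cosine_similarity

# Pretrained backbone as a feature extractor (classification head removed).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([T.ToPILImage(), T.Resize((224, 224)), T.ToTensor()])

@torch.no_grad()
def embed(frame):  # frame: HxWx3 uint8 array
    return backbone(preprocess(frame).unsqueeze(0)).squeeze(0)

def prune_near_duplicates(frames, threshold=0.98):
    # Keep a frame only if it differs enough from the last kept frame.
    kept, last_emb = [], None
    for frame in frames:
        emb = embed(frame)
        if last_emb is None or cosine_similarity(emb, last_emb, dim=0) < threshold:
            kept.append(frame)
            last_emb = emb
    return kept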

2. Annotation

Annotation is a critical step in training deep learning models for computer vision tasks, especially when working with image or video datasets. Accurate and consistent annotations are essential for model performance.

Techniques for Annotation

To automate or assist in the annotation process, we can use several advanced models and tools:

  • Grounding Models for Annotation:
    • GDINO and GSAM are grounding models that can automatically annotate images by associating objects or regions of interest within the image with textual descriptions or labels (a stand-in sketch follows this list).
  • VQA Models for Confirmation:
    • Visual Question Answering (VQA) models can be leveraged to confirm or enhance annotations. By asking the model specific questions related to the content of an image (e.g., “What objects are present?” or “Is there a car in the image?”), we can confirm or refine the accuracy of manual annotations.
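As a stand-in for GDINO (which ships with its own repository and API), here is a sketch of the same idea using a zero-shot object detector available through the Hugging Face transformers pipeline; the checkpoint choice, frame path, and label set are assumptions:

from transformers import pipeline

# Any zero-shot detector works here; OWL-ViT stands in for GDINO.
detector = pipeline("zero-shot-object-detection",
                    model="google/owlvit-base-patch32")

results = detector("frame_0001.jpg",  # hypothetical frame path
                   candidate_labels=["car", "pedestrian", "traffic sign"])
for r in results:
    print(r["label"], round(r["score"], 2), r["box"])  # box: xmin/ymin/xmax/ymax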
Automation in Detail

1. Data Pruning

In the realm of self-driving cars, training deep learning models on massive datasets of images and videos can be highly resource-intensive. Not only does it require significant computational power, but it also consumes large amounts of time and money. This raises an important question: Can we reduce the number of training images without compromising model performance? The answer lies in data pruning.

What is Data Pruning?

Data pruning involves selecting a subset of the dataset that is both representative and diverse while discarding irrelevant or redundant images. The goal is to retain a dataset that captures all the key variations and nuances of the real-world environment without the unnecessary bulk. This helps optimize the training process by reducing the computational cost, speeding up training, and, in some cases, preventing overfitting.

How Data Pruning Works

The most common approach to pruning data is by utilizing embedding space techniques. Here’s how this can be applied in practice:

  • Embedding Generation: By generating embeddings of images using models such as VQA (Visual Question Answering) or traditional core vision models like ResNet or CLIP, each image or frame is mapped to a high-dimensional space. These embeddings effectively capture the essential features of each image, enabling a comparison of how similar or diverse the images are with respect to each other.

  • Cluster Sampling: Once the embeddings are generated, the next step is to perform clustering on these embeddings. Clustering algorithms (e.g., k-means, DBSCAN) group similar images together. By analyzing these clusters, we can select samples from only the most representative clusters, ensuring diversity while minimizing redundancy. This is crucial because self-driving cars encounter a wide variety of scenarios, from driving in urban streets, navigating intersections, to detecting pedestrians or cyclists. The diversity of these scenarios must still be captured, but redundant frames from similar situations (such as multiple frames from empty streets) can be pruned.
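A minimal sketch of cluster-based sampling with scikit-learn; the cluster count and samples-per-cluster are arbitrary knobs to tune for your dataset:

import numpy as np
from sklearn.cluster import KMeans

def prune_by_clustering(embeddings, n_clusters=100, per_cluster=5, seed=0):
    # Select a diverse subset: the few samples nearest to each cluster center.
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit(embeddings)
    keep = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        keep.extend(members[np.argsort(dists)[:per_cluster]])
    return np.array(keep)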

Benefits of Data Pruning in Self-Driving Car Models

Data pruning in the context of self-driving car models can provide several key advantages:

  1. Reduced Training Time: By selecting only the most relevant and diverse images from the dataset, we drastically reduce the size of the training data, which in turn speeds up the training process.

  2. Cost Efficiency: With reduced data requirements, the computational cost associated with training decreases, which is essential for maintaining cost-effective operations.

  3. Maintained Accuracy and Diversity: Pruning does not mean simply removing images at random; it ensures that the data selected maintains high representational accuracy. As a result, the model continues to perform well in recognizing and predicting scenarios it may encounter in real-world driving environments (e.g., recognizing cars at various angles, handling occlusions, understanding traffic signs, etc.).

  4. Avoiding Overfitting: Using a large but redundant dataset can lead to overfitting, where the model memorizes the training data rather than generalizing to new, unseen data. Data pruning helps combat this by ensuring that the model is trained on a diverse yet compact set of images that are not too repetitive, allowing the model to generalize better.

By adopting data pruning techniques, we can create an efficient, scalable, and cost-effective training pipeline for self-driving car models, ensuring the model is both accurate and practical for deployment.


2. Training the Model

Training a deep learning model for self-driving car systems follows the traditional deep learning pipeline, but with additional considerations unique to the challenges in autonomous driving. Below is an overview of the typical steps involved:

1. Data Augmentation

Data augmentation is a crucial step in improving the model's ability to generalize. This involves artificially increasing the size of the training dataset by applying transformations to the existing data. For self-driving car models, the following augmentations are commonly used (a few are sketched in code after the list):

  • Motion Blur: Simulating the effect of blurred images due to fast-moving objects or shaky cameras. This helps the model learn to handle real-world conditions where blur might occur due to vehicle movement or external factors.

  • Night-Time Images: Adding variations to simulate driving in low-light conditions, such as night-time or poorly lit environments. This helps the model perform well under different lighting conditions.

  • Pixelation: Simulating lower-resolution images or pixelated frames that may be captured in real-world scenarios (e.g., due to network latency or foggy conditions). This ensures the model can still detect important objects even in degraded image quality.

  • Weather Conditions: Augmenting the data with images affected by different weather conditions such as rain, fog, snow, or sunlight glare. This helps the model learn how to detect objects and navigate the environment under challenging weather conditions, which is a common real-world scenario for self-driving cars.
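A minimal OpenCV sketch of three of these augmentations; the kernel size, darkening factor, and pixelation scale are arbitrary assumptions:

import cv2
import numpy as np

def motion_blur(img, k=9):
    # Horizontal motion blur: convolve with a single-row averaging kernel.
    kernel = np.zeros((k, k), np.float32)
    kernel[k // 2, :] = 1.0 / k
    return cv2.filter2D(img, -1, kernel)

def darken(img, factor=0.3):
    # Crude night-time simulation: scale intensities down.
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def pixelate(img, scale=8):
    # Downsample, then upsample with nearest-neighbour interpolation.
    h, w = img.shape[:2]
    small = cv2.resize(img, (w // scale, h // scale), interpolation=cv2.INTER_LINEAR)
    return cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)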

2. Model Hyperparameter Tuning

Once the model architecture is chosen, the next step is to tune the hyperparameters. Hyperparameter tuning is a critical process in achieving optimal performance. Common hyperparameters include:

  • Learning Rate: Controls the step size during optimization. A learning rate that is too high may lead to unstable training, while a rate that is too low can slow down the learning process.

  • Batch Size: Determines how many training examples are used in each forward/backward pass. Larger batches may speed up training but require more memory, while smaller batches may lead to better generalization.

  • Epochs: The number of times the entire training dataset is passed through the model. More epochs may increase accuracy but could also risk overfitting.

  • Regularization Parameters: Techniques such as Dropout or L2 regularization help prevent overfitting, which is especially important in self-driving car models where generalization to various environments and conditions is critical.

Hyperparameter tuning often involves techniques like grid search, random search, or more advanced methods like Bayesian optimization to find the most optimal set of hyperparameters for the given task.
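For illustration, a brute-force grid search is just a loop over the Cartesian product of hyperparameter values; train_and_evaluate below is a hypothetical helper that trains a model and returns a validation score:

import itertools

grid = {
    "lr": [1e-4, 1e-3, 1e-2],
    "batch_size": [16, 32, 64],
    "weight_decay": [0.0, 1e-4],
}

best_score, best_cfg = float("-inf"), None
for values in itertools.product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    score = train_and_evaluate(**cfg)  # hypothetical training helper
    if score > best_score:
        best_score, best_cfg = score, cfg
print("best config:", best_cfg, "score:", best_score)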

3. Model Testing

After training, the model is evaluated using testing and validation datasets. This step assesses the model’s performance on unseen data and ensures that it generalizes well to real-world scenarios. Testing for self-driving car models may involve:

  • Accuracy: Evaluating how well the model detects and classifies objects (e.g., other vehicles, pedestrians, traffic signs) in various conditions.

  • Performance under Edge Cases: Testing the model under challenging edge cases, such as low visibility (fog, heavy rain), high-speed driving, or complex urban environments.

  • Real-World Simulations: Using simulation tools (e.g., CARLA, NVIDIA DriveSim) to test the model’s decision-making abilities in virtual environments that simulate various real-world driving scenarios.

4. Pushing to Server for Production

Once the model performs well on the test set, the final step is to push it to the production server for deployment. This involves:

  • Deployment: The model is integrated into the self-driving car’s software stack, where it is tested in real-world environments. Continuous monitoring of the model’s performance is essential to ensure safety and accuracy. Any issues identified may lead to further iterations of training and testing.

  • Model Versioning: It’s important to keep track of model versions and updates to ensure that any improvements or bug fixes are properly deployed, and the vehicle is running the most up-to-date and safe model.

By following this process, self-driving car systems can be trained, tested, and deployed effectively, enabling them to handle a wide range of driving scenarios and improve over time.


3. Model Pruning

In real-world applications, especially in resource-constrained environments such as edge devices or embedded systems (e.g., self-driving cars, mobile devices), model size and computation power are significant constraints. To address this, model pruning is employed to reduce the complexity of the model without sacrificing much of its accuracy.

The goal is to shrink the model’s computational footprint, allowing it to run faster and on lower-power devices while maintaining acceptable performance.

Techniques for Model Pruning

Here are several common techniques used for model pruning and optimization:

1. Model Quantization

Model quantization involves reducing the precision of the numbers used in the model (typically from 32-bit floating-point to 16-bit or even 8-bit integers). This technique has the following benefits:

  • Reduced Model Size: Quantizing the weights and activations decreases the model size significantly, allowing it to fit into memory-limited devices.

  • Faster Inference: Lower precision operations are computationally faster, leading to a speed-up in inference time, which is critical in real-time systems like self-driving cars.

  • Power Efficiency: Reduced computation load also leads to lower power consumption, which is crucial for devices that are battery-powered or have limited computational resources.
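A minimal sketch of post-training dynamic quantization in PyTorch; the toy model here is only a placeholder for a trained network:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Dynamic quantization: weights stored as int8, activations quantized
# on the fly at inference time; well suited to Linear/LSTM layers on CPU.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface, smaller weights, faster CPU inference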

2. Model Pruning and Sparsity

Model pruning is the process of removing certain weights from the network that are deemed unnecessary for the model’s performance. This reduces the number of parameters in the model and leads to a more sparse network. Pruning can be done in various ways:

  • Weight Pruning: This involves removing weights that have little effect on the output (often weights with small values are pruned).

  • Neuron Pruning: Involves removing entire neurons or layers from the network that contribute little to the model’s predictions. For example, neurons that are inactive during training or have low activations can be pruned.

  • Structured Pruning: Rather than pruning individual weights or neurons, structured pruning removes entire filters, channels, or blocks. This can lead to significant reductions in computation and memory usage.
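Magnitude-based weight pruning is available out of the box in torch.nn.utils.prune; a minimal sketch (the 50% sparsity level is an arbitrary choice):

import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(128, 64)

# Unstructured pruning: zero out the 50% of weights with smallest magnitude.
prune.l1_unstructured(layer, name="weight", amount=0.5)
print(float((layer.weight == 0).float().mean()))  # ~0.5 sparsity

# Make it permanent: drop the mask and bake the zeros into the weight tensor.
prune.remove(layer, "weight")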

Benefits:

  • Model Size Reduction: Pruning significantly reduces the model’s size, making it more deployable on devices with limited storage.

  • Speedup: The reduced number of parameters means fewer computations, leading to faster inference times.

  • Memory Efficiency: Sparse models can be stored more efficiently, using specialized data structures like sparse matrices to save memory.

3. Neural Network Architecture Search (NAS)

Neural Architecture Search (NAS) is an automated technique used to find the most efficient model architecture for a given task. NAS searches for optimal architectures by exploring different configurations of layers, neurons, and connections, often within predefined constraints like model size or computation.

Benefits:

  • Optimized Model Architecture: NAS helps discover more efficient architectures that require fewer resources while maintaining or improving performance.

  • Automated Search: It automates the design of models, making it easier to find architectures that perform well under computational constraints.

4. Model Distillation

Model distillation involves training a smaller, more efficient model (called the “student”) to mimic the behavior of a larger, more complex model (called the “teacher”). The smaller model learns from the outputs of the larger model, effectively inheriting the knowledge of the teacher while being much lighter.
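The standard distillation objective blends a temperature-softened teacher/student KL term with the usual cross-entropy; a minimal sketch, where the temperature and mixing weight are typical but arbitrary values:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft targets: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients after temperature softening
    # Hard targets: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard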

Benefits:

  • Smaller Models: The student model is typically much smaller than the teacher model, making it suitable for deployment on resource-constrained devices.

  • Retaining Accuracy: Distillation can often lead to a smaller model that retains much of the accuracy of the larger model, making it a highly effective technique for model compression.

  • Faster Inference: The distilled model is designed to be efficient, thus speeding up inference, which is especially useful in real-time systems like autonomous driving.

Combining Techniques

These pruning techniques can be combined to achieve further optimization:

  • Pruning + Quantization: After pruning, quantization can further reduce the size and improve the performance of the model.

  • Distillation + Pruning: A smaller distilled model can be further pruned for even more efficiency.

Conclusion

For systems like self-driving cars, where real-time inference is critical, model pruning and optimization techniques are necessary to ensure the model runs efficiently on embedded systems. By reducing the model size, improving speed, and lowering power consumption, pruning helps strike the right balance between accuracy and computational efficiency.


4. Model Dependability

Model dependability refers to the ability to trust a model’s predictions and understand when the model might fail or give unreliable results. Ensuring dependability is critical, especially in high-stakes applications like self-driving cars where safety and accuracy are paramount.

Methods to Assess Model Dependability

  1. Grad-CAM (Gradient-weighted Class Activation Mapping)

    Grad-CAM is a popular technique that helps understand the areas in an image that a model is focusing on when making predictions. It produces a heatmap that highlights which regions of an image contribute most to the model’s output. This can help in the following ways:

    • Understanding Model Attention: By visualizing which parts of an image the model is attending to, we can gain insights into its decision-making process.
    • Debugging: If the model is focusing on irrelevant or unintended parts of an image (e.g., a shadow instead of the traffic sign), we can identify potential sources of error or bias.
    • Validating Model Trustworthiness: Grad-CAM helps confirm that the model is relying on meaningful features (e.g., car headlights, road signs) rather than noise or unrelated elements.

    Grad-CAM can be particularly useful in self-driving cars, where understanding what the model sees and reacts to can be the difference between safe and unsafe driving decisions (a minimal sketch follows this list).

  2. Out-of-Distribution (OOD) Testing and Morphing Images

    OOD testing is an essential aspect of assessing how well a model handles extreme cases or unseen scenarios that are far from the data distribution on which it was trained. Morphing images or creating synthetic data pushes the model to deal with cases that it might not have encountered during training. The following are common areas of focus during OOD testing:

    • Shape: Altering the shape of objects in the image (e.g., skewing or distorting cars, pedestrians, or traffic signs) to test if the model can generalize to new forms.
    • 3D Pose: Modifying the 3D pose of objects (e.g., rotating cars or pedestrians) to ensure the model can correctly interpret objects in various orientations.
    • Texture: Changing the texture of objects (e.g., altering the texture of road surfaces or vehicles) to check if the model can still recognize objects under different visual conditions.
    • Context: Testing how well the model performs when the context around the object changes (e.g., a car appearing in an unusual background like a foggy environment or a snowy landscape).
    • Weather: Creating scenarios with different weather conditions (e.g., rain, snow, fog) to determine how environmental factors affect the model’s performance. For instance, self-driving cars need to understand traffic signals even in poor visibility conditions.
    • Occlusion: Simulating occlusion (e.g., parts of an object being hidden by another object, like a pedestrian behind a car or a traffic sign partially blocked by a tree) to see how the model handles partial information and still makes correct predictions.

    OOD testing allows us to better understand where and how the model might fail, ensuring robustness and reliability in real-world scenarios, especially when the system encounters unexpected or extreme situations.
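A minimal Grad-CAM sketch using forward/backward hooks on a torchvision ResNet-18; the random input tensor stands in for a preprocessed camera frame:

import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
acts, grads = {}, {}

# Hook the last conv block: save its activations and their gradients.
layer = model.layer4[-1]
layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

x = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed frame
model(x)[0].max().backward()     # back-propagate the top class score

# Grad-CAM: weight each channel by its average gradient, sum, then ReLU.
w = grads["v"].mean(dim=(2, 3), keepdim=True)
cam = torch.relu((w * acts["v"]).sum(dim=1)).squeeze()
cam = cam / (cam.max() + 1e-8)   # normalized 7x7 heat-map to upsample onto the image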

Why is Model Dependability Important?

For critical systems like autonomous driving, medical imaging, or aviation systems, it is essential that the model is dependable and can be relied upon in all situations. If the model cannot perform well under unusual conditions (e.g., rain, snow, night time), it could lead to catastrophic consequences.

By using techniques like Grad-CAM and OOD testing, we can improve model dependability, ensure that the model behaves as expected in diverse real-world scenarios, and minimize the risk of failures.

Building Better Deep Learning Models

In my previous blog, I discussed the AutoDL pipeline, which you can explore here: AutoDL Blog. The final module of this pipeline focuses on model dependability—helping us understand when a model performs well and when it fails. One key insight is that no matter how much you manipulate or augment poor-quality data, it will not improve the model’s output. Therefore, it’s crucial to break free from this cycle and identify where the issue lies if the model’s accuracy plateaus.

Deep learning models are inherently probabilistic and heavily dependent on the data they are trained on. This makes them particularly vulnerable to out-of-distribution data, posing a significant challenge both for the model and for human interpretation.

To overcome these challenges, improving the priors and representations within the model is essential. For instance, in self-driving cars, while RGB cameras may struggle in low-light conditions, thermal cameras can provide valuable insights by detecting heat signatures.

Advancements in model architectures and algorithms can help push the boundaries of performance, enabling more robust handling of edge cases and out-of-distribution scenarios. Incorporating multimodal inputs, such as text, speech, or novel sensors (e.g., thermal cameras), can further enhance model robustness by providing better priors and improving the model’s ability to handle complex, real-world applications.


Overview of Key Technologies

MultiModal Stack

In this approach, new modalities are integrated alongside existing ones, such as text, speech, or advanced cameras, to enhance model performance. The goal is to leverage complementary information from different modalities—information that may not be present in the current modality but exists in another. The model then selects the most relevant information from both sources to make a more informed decision.


On the left, the person is not visible in the low-light RGB image, but is clearly seen in the thermal image. On the right, the SPAD camera captures high-resolution output without read noise due to its hardware, offering enhanced visibility even in low-light conditions, such as at night.

Computer Vision Stack

  1. Lensless Imaging
    Lensless imaging leverages computational methods to reconstruct images without traditional optical lenses. It captures light patterns using a sensor array and processes the data using algorithms to generate high-quality 3D images.

  2. Thermal Cameras
    Thermal cameras detect infrared radiation emitted by objects, converting it into visible images. These cameras are especially useful in low-light conditions and for detecting temperature anomalies, commonly used in medical imaging, surveillance, and night vision.

  3. SPAD Cameras (Single-Photon Avalanche Diode)
    SPAD cameras are highly sensitive sensors that detect single photons, enabling ultra-low light imaging. They are used in applications such as time-of-flight (ToF) imaging, LiDAR systems, and quantum optics, providing high-resolution depth information.

  4. Depth Cameras (LiDAR)
    LiDAR uses laser pulses to measure distances, creating precise 3D maps of environments. It is a key technology in autonomous vehicles, robotics, and any application requiring accurate depth sensing, providing detailed environmental awareness.

Algorithm Stack

Generative AI

  1. GANs (Generative Adversarial Networks)
    GANs consist of two networks—a generator and a discriminator—that work in opposition. The generator creates data, while the discriminator evaluates it. This adversarial process improves the quality of generated content over time, commonly used for image synthesis and enhancement tasks.

  2. Diffusion Models
    Diffusion models generate data by progressively denoising random noise, reversing a process of gradual degradation. They are known for their ability to create high-quality, diverse images and are applied to tasks like image synthesis and inpainting.

  3. Flow-Based Models
    Flow-based models transform a simple distribution (e.g., Gaussian noise) into a complex distribution by learning invertible transformations. They are particularly suited for tasks requiring exact likelihood computation, such as density estimation and image generation.

  4. NeRFs (Neural Radiance Fields)
    NeRFs model 3D scenes by representing the interaction of light within the scene, enabling the generation of highly realistic 3D views from 2D images. They are commonly used in virtual reality, 3D rendering, and photorealistic scene generation.

  5. Gaussian Splatting
    Gaussian splatting involves representing 3D scenes or objects using points with associated Gaussian distributions. This technique provides an efficient and accurate way to synthesize 3D objects, enhancing volumetric rendering quality.

Image Restoration/Cleaning

  1. U-Net Architectures
    U-Net-based architectures are widely used for image restoration and segmentation. They excel in capturing fine spatial details and contextual information. Notable U-Net variants include (a toy one-level U-Net is sketched after this list):
    • Restormer: Optimized for image denoising and deblurring tasks, providing enhanced restoration quality.
    • UFormer: Combines U-Net with transformers to improve feature extraction and restoration accuracy, especially for high-quality image reconstruction.
    • AutoDir: An autoencoder-based U-Net variant designed to perform image restoration in an unsupervised manner.
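To make the encoder-decoder-with-skip idea concrete, here is a toy one-level U-Net in PyTorch. The channel counts are arbitrary, and real variants like Restormer or UFormer are far deeper and use attention blocks:

import torch
import torch.nn as nn

def block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    # One-level U-Net: encode, bottleneck, decode with a skip connection.
    def __init__(self, c=3):
        super().__init__()
        self.enc = block(c, 32)
        self.down = nn.MaxPool2d(2)
        self.mid = block(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = block(64, 32)  # 64 in = 32 (skip) + 32 (upsampled)
        self.out = nn.Conv2d(32, c, 1)

    def forward(self, x):
        e = self.enc(x)
        m = self.mid(self.down(e))
        d = self.dec(torch.cat([e, self.up(m)], dim=1))
        return self.out(d)

print(TinyUNet()(torch.randn(1, 3, 64, 64)).shape)  # -> (1, 3, 64, 64)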

Slides

Deep Learning Stack of Blogs

I am documenting a series of blogs based on what I have learned throughout my journey. These blogs capture the experiments, insights, and concepts I explored. Since my primary focus is computer vision rather than large language models (LLMs), the content leans more towards vision-centric topics.

| Blog Topic | Link |
| --- | --- |
| Auto Deep Learning: Automating training and learning from model failures | AutoDL |
| Building Better Deep Learning Models: Recovering and improving models after failures | Building Better DL Model |

Notes

The following are my personal notes and learning resources. These documents contain my study notes, insights, and references on various topics:

🌍 ECCV 2024 - My Experience in Milan

The European Conference on Computer Vision (ECCV) is one of the most prestigious events in the field of computer vision, held annually in Europe. This year, it was hosted in the charming city of Milan, Italy 🇮🇹, and I had the incredible opportunity to be part of it!

In this blog, I’ll be sharing my experiences at ECCV 2024, and of course, showcasing beautiful moments captured in Italy. 📸


🎉 ECCV Conference Insights

🖥️ Virtual Try-On Workshop

📄 Published Paper on DIVA
At ECCV 2024, I had the honor of presenting my paper, DIVA: Deep Indic Virtual Apparel Try-On, a project that holds deep personal and professional significance. As the world of virtual try-ons continues to evolve, the focus of our work was to tackle a niche yet crucial problem: virtual try-ons for Indic clothing.

The journey began with recognizing a gap in existing virtual try-on systems — most were not designed to handle the diversity and cultural richness of Indian garments. From the graceful sarees to the elaborate lehengas, these garments have unique shapes, drapes, and fabrics that present significant challenges for virtual modeling.

To solve this, we introduced the IndicViton Dataset: a carefully curated collection of high-resolution images (720 × 540) that showcase a variety of Indian apparel in multiple poses and orientations. This dataset plays a critical role in training models that can capture the nuances of these garments.

The heart of our work lies in the DIVA model — a diffusion-based framework that handles the complexities of multi-pose garment images with remarkable visual accuracy. With DIVA, we could ensure that the virtual try-ons not only look realistic but also retain the cultural authenticity of the attire.

At the heart of the paper is a recognition that the world of virtual try-ons must evolve to better reflect the cultural and regional diversity of clothing across the globe. By focusing on Indian fashion, our work hopes to make virtual clothing experiences more inclusive and accurate, catering to the unique needs of the Indian market and beyond.

Key Highlights:

  • IndicViton Dataset: A high-resolution (720 × 540) dataset featuring garments in multiple poses and orientations, helping achieve better results in virtual try-ons.
  • Diffusion-Based Model (DIVA): A cutting-edge model that handles multi-pose garment images with high visual fidelity, ensuring accurate virtual try-on outcomes.
  • Culturally Specific Focus: Tailored to the diverse Indian apparel, including dhoti, sarees, kurtas, lehengas, and more. Usage of scribble maps as priors to achieve this.

Publication
The research was presented at ECCV 2024.
👉 Access the paper here


🤝 Networking with Industry Peers

I had the opportunity to connect with professionals from Amazon and TCS, who are working on virtual try-on solutions for the Indian market. We exchanged ideas, discussed challenges, and explored the growing potential in the virtual try-on space.


🌏 Global Insights

I also interacted with researchers from Korea and China, learning about their innovative methods in virtual try-ons. It was incredible to get a glimpse of their approaches and gain insights into global advancements in this exciting field.

📸 ECCV Photos

Check out some of my favorite moments from ECCV 2024 below!

📸 ECCV Workshop Pictures

Insights on Academic and Research Success

  1. Focus on Quality over Quantity
    Aim to write impactful, high-quality papers, following the example of researchers like Kaiming He, rather than producing a high volume of low-impact work.

  2. Future-Forward Problem Statements
    Select research problems that are ahead of their time, focusing on areas that the industry is unlikely to tackle in the next 5 years. This ensures your work remains relevant and groundbreaking.

  3. Iterative Adaptation in Research
    Approach research topics like developing a taste for coffee—initially challenging, but over time, you’ll adapt to a particular flavor or niche that resonates with you.

  4. Deep Expertise in a Single Area
    Develop deep expertise in one domain rather than spreading efforts thinly across multiple topics. Specialization often yields greater recognition and deeper contributions.

  5. Fundamental Innovations Over Trends
    Focus on addressing fundamental challenges rather than following hyped topics. Unique, foundational contributions often yield long-term impact and recognition.

  6. Admission Processes
    Understand that professors may not have complete autonomy in selecting students; initial screenings and decisions are often made at the university level based on set criteria.

  7. Conferences as Networking Hubs
    Attending conferences like ECCV not only helps in sharing your work but also offers opportunities to network with leading researchers, identify trending topics, and gather feedback to refine your research direction.

  8. Industry-Academia Divide
    Recognize that academia allows freedom to explore novel, uncharted problems, while industry research may focus on immediate applications and profitability. Use this distinction to define your academic contributions.

  9. Patience in Research Journey
    Breakthroughs often take time. Commit to your research journey with patience, persistence, and adaptability, as impactful results may not come immediately.

📸 Prof. Michael J. Black (left) and Prof. Shree Nayar (right)

What to Do at a Conference

  1. Prepare by Reviewing Papers
    Prior to attending, obtain the list of presented papers and review their work in advance. This preparation helps you identify specific researchers or sessions of interest, allowing you to ask insightful questions in person. Avoid going in unprepared, as the abundance of information can be overwhelming and lead to quick saturation. Expect a block of papers related to 3D, 2D, diffusion models, federated learning, etc.

  2. Build a Strong Network
    Make as many connections as possible! Take photos of attendee ID cards to remember individuals and later connect with them. Engage in conversations to learn more about their research and foster potential collaborations. I had a chance to talk with academic cousins, professors, and their students.

  3. Attend Workshops Strategically
    Workshops often provide a condensed overview of state-of-the-art techniques and ideas within an hour. Attending these sessions allows you to absorb key advancements efficiently. Since I am more of a 2D vision guy, I would be interested in 3D vision and solving 3D problems in 2D.

  4. Participate in Poster Sessions
    Poster sessions are excellent for one-on-one interactions with researchers. Use this opportunity to delve deeper into their work, discuss methodologies, and explore potential applications of their research.

  5. Engage in Q&A Sessions
    Ask questions during presentations to clarify concepts, challenge ideas, or spark discussions. This not only helps you understand the work better but also establishes you as an engaged participant.

  6. Take Detailed Notes
    Keep a notebook or digital device handy to jot down key takeaways, interesting ideas, or potential future research directions. Notes will help you revisit and reflect on what you’ve learned after the conference.

  7. Exchange Ideas
    Share your work with peers to get feedback and suggestions. Conversations about your research can lead to new insights or even collaborations.

  8. Follow Up Post-Conference
    After the conference, reach out to the people you met. This follow-up helps solidify connections and keeps the conversation alive for potential collaborations or mentorship. This is where I make most of my connections—the last day is just for that! Say hello to as many people as you can.

  9. Explore Industry Stalls
    If the conference includes industry exhibits, visit booths to learn about the latest tools, datasets, or collaborations that could enhance your research. You can also ask people about referrals or problems they are trying to solve.

  10. Plan Breaks and Downtime
    Conferences can be intense, so schedule short breaks to rest and process information. Use this time to review notes, organize contacts, or prepare for upcoming sessions.

  11. Attend Keynotes and Plenaries
    Keynote speeches often provide a broad perspective on current trends and future directions in the field. Make it a priority to attend these sessions.


🌍 Visiting Nearby Places

📸 Milan
📸 Rome
Introduction to PyBullet

PyBullet: Robots and Cameras

Welcome to the tutorial on using PyBullet, the physics engine for simulating rigid body dynamics. In this blog post, we will be diving into the basics of PyBullet and how to use it to simulate physics in your own projects. Whether you’re a beginner or an experienced developer, this tutorial will provide you with the knowledge and tools you need to get started with PyBullet. So, let’s get started and learn how to create realistic physics simulations with PyBullet!

UR5 robot interacting with a toroid soft body.

In this blog, we will be utilizing a UR5 robot to demonstrate the capabilities of PyBullet. We will guide you through the process of loading the robot into the simulation and show you how to manipulate its movement. Additionally, we will explore the use of cameras to view the robot from different angles, providing a more realistic representation of its motion.

URDF (Unified Robot Description Format) is a file format used to describe the physical structure and kinematics of a robot. It is an XML-based format that is used to define the robot’s links, joints, and sensors. The URDF file contains information about the robot’s geometry, mass properties, joint limits, and other parameters. It is used to define the robot’s model for physics simulation and visualization.

URDF is widely used in robotics, it is used in popular robot simulators like Gazebo and PyBullet, as well as in many robot operating systems like ROS. URDF files can be created manually or generated from CAD models using various tools like xacro, URDF exporter from SolidWorks, Inventor, etc. The URDF file can be loaded into a robot simulator, and the robot’s model can be used for physics simulation, motion planning, and visualization.

PyBullet, a physics simulation engine, also provides support for simulating soft-body dynamics. Soft-bodies are objects that can bend, stretch, and deform, unlike rigid bodies that maintain a fixed shape. Examples of soft-bodies include fabrics, ropes, and other flexible materials.

In PyBullet, soft-bodies are represented using the Bullet Soft Body library. This library allows users to create and simulate soft-bodies using a variety of methods, including cloth, rope, and mesh simulations. The library also provides several parameters that can be used to control the behavior of the soft-body, such as stiffness, damping, and mass.

The user can create a soft-body with the p.loadSoftBody function, which loads a deformable body from a mesh file. It takes several arguments such as the mesh, the body’s mass, and its material and collision parameters. After creating the soft-body, the user can apply forces and torques to it, and the library will simulate its dynamics accordingly.

One of the main advantages of using soft-bodies in PyBullet is the ability to create realistic simulations of flexible materials such as fabrics, ropes, and other flexible materials. This can be useful in many applications such as robotics, animation, and gaming.
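A minimal sketch of loading a soft body follows. The torus mesh path comes from pybullet_data’s deformable examples and may differ across versions, and the spring stiffness values are arbitrary:

import pybullet as p
import pybullet_data

p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())

# Soft bodies require the deformable world instead of the default one.
p.resetSimulation(p.RESET_USE_DEFORMABLE_WORLD)
p.setGravity(0, 0, -10)
p.loadURDF("plane.urdf")

# Load a toroid mesh as a mass-spring soft body (asset path is an assumption).
torus = p.loadSoftBody("torus/torus_textured.obj",
                       basePosition=[0, 0, 1],
                       mass=1.0,
                       useMassSpring=1,
                       useBendingSprings=1,
                       springElasticStiffness=40,
                       springDampingStiffness=0.1,
                       useSelfCollision=0,
                       frictionCoeff=0.5)

for _ in range(240):  # simulate one second at the default 240 Hz
    p.stepSimulation()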

Interaction with the soft body.

Close view of the robot interacting with the toroid.

Python code

Install the dependencies and load the necessary headers

#install the dependencies and load the necessary headers
import pybullet as p
import pybullet_data

#connect to the physics server
p.connect(p.DIRECT)
#allow PyBullet to find the assets (URDF, obj, textures etc)
p.setAdditionalSearchPath(pybullet_data.getDataPath())

p.setGravity(0,0,-10)

#load the ground plane
planeId = p.loadURDF("plane.urdf")

Load the robot

startPos is a list of 3 floating-point numbers that represent the initial position of an object in 3D space. The values in the list correspond to the x, y, and z coordinates respectively. In this case, the initial position is set to the origin (0,0,0) which is the point (x,y,z) = (0,0,0) in the 3D space.

startOrientation is a variable that holds the initial orientation of an object in the form of a quaternion. A quaternion is a 4D mathematical object that can be used to represent rotations in 3D space. In this case, startOrientation is set by calling the p.getQuaternionFromEuler function, which takes a list of 3 floating-point numbers representing the Euler angles (in radians) of the object’s initial orientation. The [0,0,0] passed as the argument corresponds to the yaw, pitch and roll of the object in that order.

Together, startPos and startOrientation define the initial position and orientation of an object in the simulation. These values can be used to specify the initial state of an object when it is added to the simulation.

#place the robot at the base position
startPos = [0,0,0]
startOrientation = p.getQuaternionFromEuler([0,0,0])

#load the robot urdf file
boxId = p.loadURDF("/content/pybullet-works/notebooks/meshes/ur5.urdf",startPos, startOrientation)

Adding a camera to the scene

The variables pitch, roll, and yaw define the rotation of the camera in 3D space. pitch represents the rotation around the x-axis, roll represents the rotation around the y-axis, and yaw represents the rotation around the z-axis. In this case, the pitch is set to -10 degrees, which will make the camera look slightly downwards, the roll is set to 0, and the yaw is swept through several values in the loop below.

upAxisIndex is an integer variable that represents the up axis of the camera. This variable is used to specify which axis of the camera is pointing upwards. In this case, the value is set to 2, which corresponds to the z-axis.

camDistance is a variable that represents the distance of the camera from the target position. In this case, the camera is 1.5 units away from the target position.

pixelWidth and pixelHeight define the resolution of the image captured by the camera. In this case, the image will be 640 pixels wide and 480 pixels tall.

nearPlane and farPlane define the near and far clipping planes of the camera, respectively. Objects closer than the near plane or farther than the far plane will not be visible in the captured image. In this case, the near plane is set to 0.01 units and the far plane is set to 100 units.

fov is the field of view of the camera, measured in degrees. In this case, the field of view is set to 60 degrees.

viewMatrix and projectionMatrix are 4x4 matrices that define the position and configuration of the camera. viewMatrix is computed using the p.computeViewMatrixFromYawPitchRoll function, which takes several arguments such as the target position, distance, yaw, pitch, roll, and up axis index of the camera. projectionMatrix is computed using the p.computeProjectionMatrixFOV function, which takes the field of view, aspect ratio, near and far clipping planes of the camera.

Finally, the p.getCameraImage function is used to capture an image of the simulation. This function takes several arguments such as the width, height, viewMatrix, and projectionMatrix of the camera and returns an image in the form of an array.

%%time
camTargetPos = [0, 0, 0]
cameraUp = [0, 0, 1]
cameraPos = [1, 1, 1]
p.setGravity(0, 0, -10)

from google.colab import widgets
import numpy as np
import random
import time
from matplotlib import pylab
grid = widgets.Grid(2, 2)
yaw = 0
for r in range(2):
  for c in range(2):
    yaw += 60
    with grid.output_to(r, c):
      grid.clear_cell()
      pylab.figure(figsize=(10, 5))
      pitch = -10.0
      roll = 0
      upAxisIndex = 2
      camDistance = 1.5
      pixelWidth = 640
      pixelHeight = 480
      nearPlane = 0.01
      farPlane = 100
      fov = 60
      viewMatrix = p.computeViewMatrixFromYawPitchRoll(camTargetPos, camDistance, yaw, pitch,
                                                                  roll, upAxisIndex)
      aspect = pixelWidth / pixelHeight
      projectionMatrix = p.computeProjectionMatrixFOV(fov, aspect, nearPlane, farPlane)
          
      img_arr = p.getCameraImage(pixelWidth,pixelHeight,viewMatrix,projectionMatrix)
      w = img_arr[0]  #width of the image, in pixels
      h = img_arr[1]  #height of the image, in pixels
      rgb = img_arr[2]  #color data RGB
      dep = img_arr[3]  #depth data
      print("w=",w,"h=",h)
      np_img_arr = np.reshape(rgb, (h, w, 4))
      np_img_arr = np_img_arr * (1. / 255.)
      pylab.imshow(np_img_arr, interpolation='none', animated=True, label="pybullet")

Robot in all views.

Create an animated PNG


!pip install numpngw
from numpngw import write_apng
from IPython.display import Image


frames=[] #frames to create animated png
for r in range(60):
    yaw += 6
    pitch = -10.0
    roll = 0
    upAxisIndex = 2
    camDistance = 1.5
    pixelWidth = 320
    pixelHeight = 200
    nearPlane = 0.01
    farPlane = 100
    fov = 60
    viewMatrix = p.computeViewMatrixFromYawPitchRoll(camTargetPos, camDistance, yaw, pitch,
                                                                roll, upAxisIndex)
    aspect = pixelWidth / pixelHeight
    projectionMatrix = p.computeProjectionMatrixFOV(fov, aspect, nearPlane, farPlane)
        
    img_arr = p.getCameraImage(pixelWidth,pixelHeight,viewMatrix,projectionMatrix)
    w = img_arr[0]  #width of the image, in pixels
    h = img_arr[1]  #height of the image, in pixels
    rgb = img_arr[2]  #color data RGB
    dep = img_arr[3]  #depth data
    #print("w=",w,"h=",h)
    np_img_arr = np.reshape(rgb, (h, w, 4))
    frame = np_img_arr[:, :, :3]
    frames.append(frame)
print("creating animated png, please about 5 seconds")
%time write_apng("example6.png", frames, delay=100)
%time Image(filename="example6.png")

UR5 robot interacting with a toroid soft body.