Brandon’s Academic Website

[COMPSCI 180] Project F: NeRF

2024-12-12T00:00:00-08:00

Implementing NeRF

Introduction

NeRF is a novel scene synthesis technique first proposed by Mildenhall et al., which attempts to visualize complex geometries at high resolutions via three main methodological features:

March camera rays through a scene to first generated sampled set of 3D points.
Using those points and their corresponding 2D viewing directions, represented by extrinsic matrices (camera viewpoints), one can try to predict the density and color of a specific point.
Lastly, to transform the density and colors of pixels into a scene representation, NeRF uses classical volumetric rendering techniques to accumulate the colors and densities into 2-dimensional images.

Several of these approaches are powered by MLPs.

For this precanned final project of COMSPCI 180, we aim to implement an easier version of NeRF.

Methods

Building a 2D Neural Field

A vector field is essentially a function that maps each n-dimensional coordinate to a vector. A neural field, temrinologically similar, is a neural network architecture that maps some n-dimensional coordinate to a particular 3-dimensional datapoint that represents its color. Fundamentally, we train a neural network $F: (x, y) \rightarrow (r, g, b)$ to predict the color of each pixel on an image.

As an exercise, let us discuss how to build such neural fields in the easiest case like for a two-dimensional image. In NeRF, a similar approach is taken: a three-dimensional world coordinate and a two-dimensional representation of the camera viewing direction are used as the input of a neural network to output the color of a pixel (along with its density, but let us abstract this away first). However, as NeRF has mentioned, prior works have failed replicating high-frequency details of an image with this approach. To allow for this variability, we need the field to be constructed with higher-dimensional inputs.

To construct higher-dimensional inputs from low-dimensional points, we introduce positional encoding. Simply put, it’s an expanded representation of some low-dimensional point. In our implementation, we use the following:

\[PE(x) = \{x, \sin(2^0 \pi x), \cos(2^0 \pi x), \cdots, \sin(2^{L-1} \pi x), \cos(2^{L-1} \pi x)\}\]

where the hyperparameter $L$ controls the dimensionality of our expanded representation. This is similar to the samely named idea used in transformer architectures.

Therefore, to train a 2D neural field that predicts the color of any pixel in a 2D image, we will first train the neural field on the 2D image per se, and use the MSE of the predicted and true color of each pixel as the training objective. Additionally, we evaluate our results via the metric Peak Signal-to-Noise Ratio (PSNR), formulated as follows:

\[{\rm PSNR} = 10 \log_{10} \bigg( \frac{1}{\rm MSE} \bigg)\]

Pixel, Camera, World

In the 2D neural field, each pixel is one pair of predictor and response variables. This works similar in 3D neural field. In the 3D neural field, we predict the color of a specific pixel, from an image shot by an arbitrary camera. The arbitrary camera’s viewpoint (and its location) with respect to the world is defined by a rotation and a translation. Therefore, a 3D neural field can be thought of as a 2D neural field simply augmented by camera information. However, it is not easy to introduce these information into a single architecture. First, we discuss how to leverage camera information to help our coming pixel-predicting tasks.

First, let us discuss how to convert the coordinate of a pixel from a camera into a “world-coordinate” that can serve as a shared coordinate system across pictures of all cameras.

The coordinate of a pixel on an image can be transformed into a camera coordinate system via the following relation:

\[\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}\]

The camera-to-world coordinate conversion can be expressed in terms of the rotation and translation of a camera:

\[\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R}_{3 \times 3} & \mathbf{t} \\ 0_{1 \times 3} & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}\]

Without diving too much into the details of these expressions, let us consider a pixel-to-camera-to-world coordinate transformation constructed along these relations.

Ray and Volumetric Rendering

In the training process of a 3D neural field, a ray is synonymous to a pixel. A ray is simply the line of light passing through one particular pixel for an image coming from a specific camera. It is the unit of “datapoint” in our model.

We can sample points along a ray (known to be a world corodinate) by first computing the origin of the ray $\mathbf{r_o}$, then add a parametrized multiple of the ray’s direction $t \mathbf{r_d}$. The particular mathematical operations to compute these components of a parametric line (ray) can be found in the assignment instructions.

With these rays, we sample points along them to obtain the density of datapoints and the colors that a ray observe at the particular datapoint. This allows us to consider the depth of a camera’s sight. Then, we may use volumetric rendering techniques, which takes in a series of densities and color observed along a ray to compute the “expected” color that ray would perceive, which would therefore grant us the color of the corresponding pixel.

Training a 3D Neural Field

To train a 3D neural field, we follow this procedure for each gradient step:

Sample 10,000 rays. Sample 64 points per ray.
For each point, obtain its predicted density and color using the neural field
Use the volumetric rendering technique to predict the color of each ray’s corresponding pixel.
Evaluate our predictions on MSE as we do when training a 2D field.

Bells and Whistles explanation: by simulating an infinitely large density at the end of a ray, we can convince the volumetric rendering principles into thinking there is a board of certain color at the end of the ray, and therefore, whenever the ray predicts not hitting any object or 3D point (originally resulting in black color), we convince it to think it has hit a white or gray board instead. The concrete changes only involve concatenating a large density value (say 100) and a targeted color different than black at the end of model predictions for densities and colors. However, this trick should only be done on evaluation, as we still want to train the model to think empty locations in scenes are dark.

Experiments

We conduct three main experiments:

[Part A] of the project asks us to train a two-dimensional neural field on one given and one self-defined image.
[Part B] of the project asks us to train and evaluate a three-dimensional neural field for a given scene, and reach a PSNR of $23$ at any iteration of our freedom.
[The Bells and Whistles] Change background of spherical rendering video.

Results

Part A: 2D Neural Field

We construct a 2D Neural Field on the following images:

Here are the neural fields produced throughout training

Additionally, we assess the impact of two hyperparameters on the final result of these neural fields: the maximum frequency controller $L$ and the learning rate of the process.

The learning rate of the process had minimal influences:

The maximum frequency controller, on the other hand, had substantial influences. Particularly, with a larger $L$, the maximum permitted frequency in the image increases, and therefore allows for finer details int he image. Vice versa.

Part B and B&W: 3D Neural Field

To train a 2D neural field, we implement the architecture and volumetric rendering functions assigned in the manual, alongside other coordinate system transformations. We also select the following different hyperparameters from the spec, as the spec’s results are somewhat irreproducible:

A learning rate of $8e-4$ with per-epoch decay until $8e-5$.
Training for 100 epoch, where each epoch involves 400 gradient steps, each with a batch of 10,000 rays and 64 samples per ray.

Here is a demonstration of the camera rays, the first image having all rays sampled from a single camera, and the second image having rays sampled across all 100 training cameras.

Here is the resulting training curve of our 3D neural field:

Here is how the validation set images change throughout the training process:

And here is a spherical rendering of our neural field on the test set cameras (transposed):

Its background color can be changed, but there will be dynamic dark noises as shadows are ill-defined in the scenes. The following GIFs result from 7000 gradient steps of training. The resulting dark signal can be thought of as an “overfitting” phenomenon where the model is overly convinced that empty scene locations must be dark.

[COMPSCI 180] Project 5: I Forward, I Denoise, I Diffuse

2024-11-22T00:00:00-08:00

COMPSCI 180 Project 5 Writeup

Project 5A: The Applications of Diffusing

The random seed used at here is always 180.

Setup Procedures

To set up for this project, we need to:

Acquire access to a text-conditioned diffusion model. This assignment uses the DeepFloyd IF diffusion model.
Organize the upsampling code in the notebook.
Precompute the text embedding of several image prompts to be used later.

Here is a demonstration of the prepared model’s capabilities with inference step $20$:

and with inference step $40$:

Preliminaries

Diffusion models are generative models that create samples from a distribution via learning to denoise some noised version of an image.

Particularly, we would denote $x_0$ as an object that is completely clean, such as a picture that we want to generate. Then, the intuition of learning a diffusion process is that we would observe how adding noise destructs the appearance of an image. This is known as a noising process, where we proceed from $x_t$ to $x_{t’ > t}$.

By observing the reversed trajectory of that destruction process, we learn the generation process, going from $x_{t’ > t}$ to $x_t$. The name of this process comes from its literal interpretation– it’s a denoising process.

Specific equations formulate these processes, normally with the help of normal random bariables $\mathcal{N}$.

Implementing the Forward Process [1.1]

The forward process, also known as the noising process, is defined as:

\[q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}} x_0, (1 - \bar{\alpha_t}) \mathbf{I})\]

Be very cautious that $\mathbf{I}$ is a matrix of $1$s, not an identity matrix. This comes from the primitive form of generating $x_t$:

\[x_t = \sqrt{\bar{\alpha}} x_0 + (1 - \bar{\alpha_t}) \epsilon, \epsilon \sim \mathcal{N}(0, 1)\]

Throughout the post, we will be using this test image quite often, so remember its existence…

Implementing this procedure with $t \in [0, 999]$ on images with height and width of $64$, we obtain the following results as we noise the image for a different amount of timesteps:

Classical and One-Step Denoising [1.2, 1.3]

By applying a Gaussian blur, we hope to remove the noise in an image. However, its results are unideal when the image is very nosiy:

But if we instead try to estimate the noise using the UNet of our diffusion model, then we can achieve an effective estimation of the clean image:

Iterative Denoising [1.4]

Having to denoise through $1000$ steps is costly. Recent work has discussed the possibility of one-shot diffusion, but I haven’t read the paper, so here’s an okay cheap alternative to that. We can use strided timesteps, which can scale the amount of timesteps required down by roughly threefold. Particularly, the noisy image at timestep $t$ now becomes:

\[x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt(\alpha_t) (1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_{\sigma}\]

the definitions of these variables can be found in the instructions. This strided timestep works as an interpolation trick.

Here is a view of the noising process:

And here is a comparison of the denoised results via diffusion and classical approaches:

Diffusion is Your Friend with Conditions [1.5, 1.6]

To sample from the distribution of images that our diffusion model models, we can step through the entire denoising trajectory by starting with images composed of purely random noise.

Interestingly, involving a classifier-free guidance by using noise estimates from conditioned and unconditioned prompts can improve our image quality. Let $\epsilon_c$ be the conditioned noise estimate (where there is an existing prompt we’d like), and $\epsilon_u$ be the unconditioned noise estimate (in our context, it is the noise estimate for a null empty prompt). Then, the new estimate is experssed as:

\[\epsilon = (1 - \gamma) \epsilon_u + \gamma \epsilon_c\]

where the hyperparameter $\gamma$ is free of choice and encouraged by the assignment to be $7$. The samples of this method observe a much higher quality:

Image-to-Image Translation [1.7]

From this point forward, we will benefit all of our approaches with the use of CFG. The first application we’d like to try out is image-to-image translation, which demonstrates a process of arbitrary images slowly leaning towards looking alike to an originally provided image:

This procedure is otherwise known as SDEdit.

We can also apply this trick on web images and hand-drawn images to make it high-quality photos in intermediate steps.

Web image:

SDEdit Results:

Hand drawn image A:

SDEdit Results:

Hand drawn image B:

SDEdit Results:

Impainting [1.7]

Impainting is a trick where we only provide diffusion for an unmasked region of the image, allowing for a portion of the image to be altered. To put this into mathematical formulation, at obtaining $x_t$ for each $t$, we apply the following update as well for a mask $\mathbf{m}$ that is $1$ where new contents should occur:

\[x_t \leftarrow \mathbf{m} x_t + (1 - \mathbf{m})~{\rm forward}(x_{orig}, t)\]

Here are some possible results of this technique:

Setup A:

Result A:

Setup B:

Result B:

Setup C:

Result C:

Text-Conditional Image-to-Image Translation [1.7]

We can also let images shift from one prompt’s content to another by changing the conditioned prompt from “a high quality photo” to something else, like “a rocketship”. Here are some example outputs of the technique:

Visual Anagrams [1.8]

Visual anagrams can be produced by mixing class-conditioned noises of each text prompt. The full algorithms is noted at the assignment instruction, although it misses the CFG part of the formulation.

Here are some results:

Prompt: “an oil painting of an old man” flips into “an oil painting of people around a campfire”

Vertically flipped:

Prompt: “a lithograph of waterfalls” flips into “a lithograph of a skull”

Vertically flipped:

Prompt: “an oil painting of a snowy mountain village” flips into “a photo of a dog”

Vertically flipped:

Hybrid Images [1.9]

Back to making hybrid images! To do so, similarly follow the instructions, but instead of having separate unconditioned noise estimate, use one unconditioned noise estimate with the sum of noised estimates to perform CFG-styled noise estimation.

Prompt: At far, it looks like “a lithogram of a skull,” but upon close inspection, it’s actually “a lithogram of waterfalls”.

Prompt: At far, it looks like “tornado in a desert,” but upon close inspection, it’s actually “the face of a programmer”.
Check particualrly generated image 2.

Prompt: At far, it looks like “sunset in a desert,” but upon close inspection, it’s actually “a bar with dim lighting”.
Check particualrly generated image 3.

Project 5B: We Need to Go Deeper

The random seed used at here is always 0.

Implementing the UNet [1.1]

The UNet is an amalgation of several smaller convolution-based blocks. I implemented it. If you believe me, no need for further action. Else, here are some results.

A demonstration for the noising process across choices of $\sigma$:

The training loss (logarithmically transformed) of the model:

Some denoised results of the model:

The performance of our denoise trained on $\sigma=0.5$ when it encounters a higher noise level:

Time-Conditioned UNet [2.1, 2.2, 2.3]

We can condition the UNet on timestep information to inform it how to denoise images provided the temporal situation of the denoising sequence. Particularly, this occurs with a fully connected block that takes timestep as an input.

The training loss of such model, logarithmically transformed, follows:

Its outputs:

Class-Conditioned UNet [2.4, 2.5]

Similar to the theories of CFG, you can use class-conditioned embeddings to help diffusion models gear towards generating images of particular classes.

The training loss of such model, logarithmically transformed, follows:

Its outputs at the 5th epoch:

Its outputs at the 20th epoch:

[COMPSCI 180] Project 4: Sike! A Mosaic! (and I used some slip days!)

2024-10-20T00:00:00-07:00

COMPSCI 180 Project 4 Writeup

Introduction

In this (rushed) blog post, I detail my doings (and now corrected wrongdoings) in Project 4A (one of them being starting late due to all the other businesses in life).

The assignment discusses image warping and mosaicing. Image warping is a business that we have discussed using a large part of our previous post, while mosaicing is a new topic. What is mosaicing? Mosaicing is stitching two pictures of different perspectives together, forming a panorama of some two sights that have resulted from an observer staying at a fixed global coordinate, but perhaps looking at something from a different angle. An example will be quickly shown in our results section.

We have repeatedly mentioned the word “stitching”, but what really is “stitching”? And how is warpping, an elemental operation for image transformation, related to our works (and my wrongdoings)? We will detail the methodological details of these efforts in the Preliminaries sections, then describe their experimental outcomes using the section following Preliminaries.

Preliminaries of Project 4A

Let us consider a toy example of the two following pictures:

Picture A	Picture B

These are different views of Prof. Efros’ office seen from the same global coordinate (I stand at the exact same location when looking at these views), but the door’s overall orientation is quite different, because the angle at which I look at the door is different. Therefore, one door is slanted towards the left, and the other is slanted towards the right! Amazing!

So how do we stitch these two pictures together? Well, there is a commonality between these two pictures that we can merge with: the door. Let’s just overlap the images by where the door is… except we can’t just do that. The door’s shapes are different. We must transform, say, Picture B’s door’s shape into that of Picture A’s when we overlap the images (and naturally, all other components follow the same transformation). This is where warpping comes in.

Projective Mapping

In the last blogpost, we discussed finding an affine transformation between two triangles by finding a 3-by-3 transformation matrix via homogeneous coordinates. Here, we concern a different form of transformation: projective transformation.

\[\begin{bmatrix} wx' \\ wy' \\ w \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\]

With at least 4 pairs of $(x, y), (x’, y’)$ as well as the definite assumption that $gx + hy + 1 = w$, we can construct the following system of linear equations and use least squares algorithm to find the optimal estimators for projective mapping parameters:

\[\forall (x, y, x', y') \in \mathcal{D}, \begin{cases} x' = ax + by + c - gx x' - hy x' \\ y' = dx + ey + f - gx y' - hy y' \end{cases}\]

here, $\mathcal{D}$ is the set of all correspondence points, such as the corners of the door and doorknobs.

Warp2Stitch

Upon obtaining this projection matrix, it’s warping time. The general technique of warpping can be seen at the post for project 3, but I will kindly reiterate it here.

Forward vs. Inverse Warpping

There are two general patterns for warpping: forward-warpping and inverse-warpping.

In forward warpping, once we infer the operator $\mathcal{T}$, we map the value of each pixel at position $(x, y)$ to its transformed equivalent $\mathcal{T}((x, y))$, and non-integer pixel coordinates will have its values distributed among neighboring pixels. However, this pattern of warpping easily leads to “holes” in the resulting product, where certain pixels do not receive any coloration.

Therefore, our assignment proceeds with an alternative method: inverse warpping. In this paradigm, we infer an operator $\mathcal{T}$ from the source image to a target image, and use the inverse operator: $\mathcal{T}^{-1}$ to do pixel mapping as described above. However, since the source image’s pixel values are all known, we can safely interpolate the unknown pixel values with its neighbors. An efficient manner of doing so is bi-linear interpolation, where for each $(x_t, y_t) = \mathcal{T}^{-1}(x, y)$ resulted, its coloring is inferred as a weighted sum of its neighbor. In our assignment, we apply a variation of this logic that will be described below (out of the unclarity of the original assignment’s instruction, we devise a variation of this interpolation function fitting our context).

Constructing the Blending mask

To construct a smooth transitioning blending mask, we will formulate our mask with the following procedure:

Predefine masks whose pixel values are equivalent to the distance between themselves and the nearest edge, then normalize the mask.
Warp them with their corresponding images.
Construct the final mask as an aggregation of mask for image A and B such that each pixel of the mask has intensity $\frac{\alpha_a}{\alpha_a + \alpha_b}$ for the final mask of Image A, and similarly for Image B with $\frac{\alpha_b}{\alpha_a + \alpha_b}$.
Multiply each warpped image by their corresponding final mask, then add the masked images together. Thanks Ryan!

Overall Flow of Methodology for Warping

The overall flow of mosaic would then seem as follows:

Obtain the homography transformation via least squares to match correspondences from Picture B to Picture A
Transform the corners of Picture A onto Picture B, seeing where it will generally land.
Take a polygon mask using the transformed corners, and obtain all integer coordinates for pixels in there.
Inverse warp onto the transformed polygon mask mentioned in (3), and use bilinear interpolation to fill in the mask
Use an infinity-norm based distance transform mask to blend the two images. A Laplacian kernel would be much preferred, but I need to go revise exam logistics and write SoPs.

Preliminaries of Project 4B

Feature Detection

To detect features that are helpful for alignment, we can consider first what kind of features are best as correspondence points between two images. And that is corners. Corners of a door, or a wall, can often represent a distinct boundary between some objects, and is therefore distinctive of the geographical location of something in a picture. Therefore, when automatically detecting features, we don’t want to cut corners, but we want to find corners.

The Harris corner detector described in lecture (which you should go to) fulfills this purpose with the simplified procedure below:

Calculate the gradient of the image as $I_x$, $I_y$.
Construct the structure tensor matrix, known as $M = \begin{bmatrix} I_x^2 & I_x I_y \ I_x I_y & I_y^2 \end{bmatrix}$, which provides us a structural summary of the change at pixel.
With reasons not mentioned here, we learn that the eigenvalues of $M$ provide us knowledge of what’s a corner. A corner occurs at a pixel whose matrix has two large eigenvalues, such that $R = \det (M) - k ({\rm trace}(M))^2 > 0$.

Now, then, we get a set of pixels that can be Harris corners based on peak-finding algorithms.

However, is a pixel a good feature because it is a Harris corner, or is the pixel a Harris corner because it is a good feature?

The answer is, a pixel is a Harris corner because it’s a potentially good feature, but it might not be useful, because it can be a not-so-prominent peak within the space of $R$ values throughout pixels. Therefore, we use an additional technique called Adaptive Non-Maximal Suppresion (ANMS) to filter the Harris corners down to a set of $250$ or $500$ points.

The ANMS algorithm follows as:

For each Harris corner, compute its minimal suppresion radius as $r_i = \min_j d(x_i - x_j)$ where $f(x_i) < c_{robust} f(x_j)$ for an $x_j$ being one of the Harris corners.
Pick the $n$ pixels with the largest values for $r_i$.

Therefore, upon applying ANMS, we know throughout the Harris corners and the suppressed points, these 500 pixels alone are the promising features.

Feature Matching

Patch Descriptor

To describe a feature (a pixel, a corner), we can use an 8x8 patch around it by sampling with stride 5. That is, for a pixel at $(x, y)$, the pixels we sample as its descriptors can be described with the set of points

\[\mathcal{P} = \{(a, b) | a \in [x - 20, x + 20], b \in [y-20, y+20], (a - x) \mod 5 \equiv 0, (b - y) \mod 5 \equiv 0\}\]

Well, this forms a 9x9 patch, but would I re-formulate a piece of working code to make it work again under this subtle imprecision that doesn’t affect my final result? Nah, I’d continue writing the blog post. The patch also needs to be standardized.

Nearest Neighbor Matching

To match features across image A and B, we can first find nearest neighbor matches of them. That is, we first construct a nearest neighbor instance for features in image A and B, then we see what pairs of features across images A and B are each other’s nearest neighbor. Then, among those points, we filter out all pairs whose following statistic is less than $0.5$ (the paper suggests $0.4$, but I made it more generous judging by my data):

\[\frac{d_{first~NN}}{d_{second-NN}}\]

Surprisingly, my implementation doesn’t use a Nearest Neighbor data structure from any existing libraries. I just use a cdist distance matrix between feature patches.

RANSAC

Lastly, the RANSAC algorithm can be described as an iterative process as follows, which helps us find the largest set of inliers that constructs the homography we want for our eventual warping and stitching procedure.

Take a deep breath for the upcoming code you have to write.
initialize best_error, best_inlier_set.
Iterate the following procedure for $n=1000$ times:
1. Select four random pairs without replacement
2. Compute the homography $H_i$ for these four random pairs.
3. Record the set of pairs such that when applied this homography, the distance between the transformed feature of image A and the corresponding feature of image B are less than 2 pixels away.
4. If the error of homography is smaller than best_error and we have encountered the biggest inliner_set up to now, record our homography and inliner_set, and update our best_error.

Then, you may stitch proud, as your features are aligned.

Rectify and Justify

To make sure our warpping implementation is intact, we would like to try rectifying some aspects of an image, such that a specific set of four pairs of correspondence points should form a square or rectangle in a warpped image.

The results are as follows:

Sike, it’s Mosaic

Here are the mosaics.

Original Picture A	Original Picture B

Results:

Original Picture A	Original Picture B

Results:

Original Picture A	Original Picture B

Results:

Generally, the results match in structure. The color differences may be efficiently resolved using a Laplacian kernel, or be more careful when taking pictures.

Automatic Feature Discovery

Feature Detection: Harris Corners and ANMS

The Harris corner filled my entire picture for any instances, so I’ll showcase my ANMS points here instead!

As you can see, they be zoomin. Therefore, we should suppress more of them with the matching procedure.

Feature Descriptor, Matching, RANSAC

Example patches of feature detectors for picture C1 is shown below:

As you can see, the corners on the flag posted on Prof. Efros’ door has brought some attention to our feature detector!

Furthermore, here are the true inliers found by RANSAC on each image after their nearest neighbor matching procedure discussed in the Preliminaries for 4B section above.

The Outcome of Three Slip Days

ITS STITCHING TIME!

What have I learned?

It’s cool to automatically find features (even though the methods are complicated), and now I appreciate representation learning a bit more for finding these by itself.

[COMPSCI 180] Project 3: Nah, Its Morphin Time

2024-09-28T00:00:00-07:00

COMPSCI 180 Project 3 Writeup

Introduction

Transformations are powerful mathematical tools that allow us to transfigure photos into specific structures with specified disentanglements of its “structural details” and “texture”. Particularly, the popular morphing technique applied onto pairs of pictures, can transform the subject of one image into the subject of another in an almostly seamless manner. And, we can also direct the methods of transformation onto faces, where we elicit the particular structural details and texture-ic properties of faces of specific demographics, in turn extrapolating faces towards different demographics or features via linear algebraic manipulations. In this assignment post, we detail all the techniques described above via the tasks prescribed by the assignment per se.

Preliminaries

Affine Transformations

Considering the viewpoint that all images are fundamentally some tensor, the shifting of one pixel into another location can be naturally considered as a matrix multiplication. Paritcularly, the transformations that may occur via a matrix multiplications are parametrized by its operator. A survey of such transformation has been delivered in many lower-division courses, and so will be summarized with the following picture:

All of these transformations are either linear, in that it serves as a change of basis while obeying linearity (the composition of homoegeneity and superposition properties), or are affine. Fundamentally, affine transformations occur by matrix multiplications and additions, in a linear algebraic sense. They also have the following properties (migrated from the lecture slide):

The origin does not necessarily map to origin, due to a possible translation.
Parallel lines remian parallel
The transformation is closed under composition
The transformation models a change of basis.

Using homogeneous coordinates, which is an augmented coordinate system for any-dimensional coordinates such that $(x_1, \dots, x_n, w) = (x_1 / w, \dots, x_n / w)$, we may unify the description of affine transformations as one single matrix operation:

\[\begin{bmatrix} x' \\ y' \\ w \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \\ 0 & 0 & 1 \\ \end{bmatrix} \begin{bmatrix} x \\ y \\ w \end{bmatrix}\]

Particularly, each degree of freedom from the aforementioned parametrization in $a, b, \dots, f$, comes from the possible candidate components of any affine transformations: translation, scaling, rotation, and sheering. Please see this slidework, slide #27 for an in-depth description.

Warpping

Notably, any triangle can be transformed into any other triangle via an affine transformation. For the pacing of this post, let us directly assume we will be transforming triangles into triangles. The fundemental reason of this will be described in the next section, when we discuss morphing.

Finding $\mathcal{T}$

How many “correpondences”, or points on the triangle that definitely correpond to each other, do we need? Since each triangle has three vertices, there are six coordinates (at least) involved in each transformation. Therefore, for three vertices of the triangle $V = {V_n = (x_n, y_n) | n \in {1, 2, 3}}$ and its transformed version $V’$, we find the following system of equations that will enable us to infer the parameters of affine transformation that occurred:

\[\begin{bmatrix} x_1 & y_1 & 0 & 0 & 1 & 0 \\ 0 & 0 & x_1 & y_1 & 0 & 1 \\ x_2 & y_2 & 0 & 0 & 1 & 0 \\ 0 & 0 & x_2 & y_2 & 0 & 1 \\ x_3 & y_3 & 0 & 0 & 1 & 0 \\ 0 & 0 & x_3 & y_3 & 0 & 1 \end{bmatrix} \begin{bmatrix} \hat{a} \\ \hat{b} \\ \hat{d} \\ \hat{e} \\ \hat{c} \\ \hat{f} \end{bmatrix} = \begin{bmatrix} x_1' \\ y_1' \\ x_2' \\ y_2' \\ x_3' \\ y_3' \end{bmatrix}\]

You can arrange the above equation to make the parameter list alphabetically sorted. So, based on the vertices of a transformed triangle, we can infer its transformation operator $\mathcal{T}$.

Forward vs. Inverse Warpping

There are two general patterns for warpping: forward-warpping and inverse-warpping.

Morphing

Cross-Dissolving

Cross-dissolving is a technique that allows us to interpolate the pixels of intersecting, supposedly-“merged” images via linear interpolation (a weighted sum). Particularly, in this method we define a halfway image with a parametrization of $t \in [0, 1]$:

\[(1 - t) \times {image}_1 + t \times {image}_2\]

One conspicuous drawback of this technique is that this only works on aligned images.

Triangulation for Structure

An efficient method to dictate the structure of objects in an image is via defining a triangular gridmap over important landmarks of the object. For example, the picture of one face of a diamond can be defined as several triangles, allowing us to construct a structural map for the prism-ful diamond face. An example of triangular mesh may be seen when we discuss the operational details of this assignment.

In this assignment, we employ the Delaunay triangulation method. In the naive rendition of this triangulation method, we may start with an arbitrary triangulation of our image, and we flip any illegal edge within the triangulation until no more illegal edges exist. Here, an illegal edge refers to an edge that would improve the triangulation by being flipped. A good triangulation is one where the triangles are less narrow. The precise technicalities of these approaches are (somewhat) out of scope for this blog post. Then, the clever rendition of Delaynay triangulation is by solving a dual of the triangulation problem via the use of Voronoi diagrams.

Morphing Sequence

The morphing sequence is defined as a sequence of warpped image over some schedule of $w(t)$ where $t \in [0, 1]$. Particularly, the $k^{th}$ image of a warpping sequence is constructed as:

Decide the warp fraction at timestep $k$, and decide the midway shape of faces as $(1-t) {image}_1 + (t) {image}_2$.
Warp both faces into this shape.
Cross dissolve them at the dissolve fraction at timestep $k$.

This schedule of dissolve and warp fractions can be freely decided as a monotonously increasing function with values starting from $0$ and ending at $1$.

In this assignment, we export such sequence of image as an animated GIF and output it to this website for grading purposes.

Faces

Now, let us discuss your face. My face. As well as many people’s faces. Our faces are interesting. Particularly, our faces can be defined with several landmark features. The Danes dataset of this assignment, for instance, defines 58 landmarks to each of its subject’s face.

Faces can be warpped, morphed into each other via the aforementioned strategies. Interestingly, we can also use the above viewpoints to define populational mean shapes and textures of faces for specific demographics, allowing us to define the “male prototypical” and “female prototypical” face based on the dataset labeling. BASED ON THE DATASET LABELING.

Find the mean shape of one subpopulation’s faces
Warp all faces of that subpopulation into their mean shape, then blend all of them with eqaul weight to obtain the pixel values of a prototypical face.

Upon doing so, we can get the supposed deviation that distinguishes male and female prototypical faces by the following procedure:

Warp the male and female prototypical faces into a halfway shape between them. Let’s call the results “prototypical warps”.
Obtain the “pixelated deviation” of these subpopulations by subtracting the female prototypical warp’s pixel values with the male prototypical warp’s pixel values.

Of course, to simply create a caricature, one can also just subtract the population mean from one’s own picture. This works similarly to an unsharpened mask filter, where the edge component is reinforced by subtracting away the mean of the image at a larger weight. In the sense of creating a caricature, we simply create an image where “our” components are largely reinforced by subtracting away the population face average.

Now, let us head into the assignments.

Defining Correspondences

Correspondences are defined via an online tool from last year’s student, publicized on the instructional website of this assignment. I am summarizing the overall process of picture processing as well as their resulting triangulations via the following image:

Particularly, correspondence points are chosen and selected in specific orders to secure the similarity of triangulations between warpping processes. As mentioned before, this plays a significant role in the successful, undistorted results, and prevents us from morphing our face into abstract art. Particularly, targeting essential structures of faces and preventing collinear triplets of points helps to prevent the deformation of triangular meshes.

I was walked into when I took my photo at my lab’s rotation seat, if you wanted a fun story somewhere in this blog :’)

Computing the Mid-way Face

As mentioned in the methodologies section, computing a midway face involves a three-step procedure:

Computing the average face shape by averaging the correspondence points together– this is why the ordering in correspondence points matters.
Warpping both faces into the average face shape described above, which we perform inverse warpping for.
Average the warpped faces’ pixel values (colors) together.

Particularly, we fill in parts of the faces by constructing polygon masks. This is where a tricky point occurs. If we simply apply our affine transformation on the noted coordinates of polygon masks from original faces onto the midway face, we will find several stripes throughout the mask, which occurs due to integer coordiantes not being covered. This is also a known weakness of forward warpping. Therefore, we adapt an alternative algorithm for warpping, where:

First, let us compute the polygon mask of the warpped triangles, not the original faces’
Then, for each integer coordinate within the polygon, we first obtain a possibly floating-point-involving coordinate that corresponds to a location in the original face’s triangle.
For that triangle, we perform bilinear interpolation to obtain the supposed value of that pixel.

Here, bilinear interpolation involves only the original face, whose pixel values are not to be changed. Therefore, this operation can be vectorized via array arithmetics, and occurs fairly efficiently.

The resulting mid-way face is as follows:

“It’s Morphin’ Time”: The Morph Sequence

The scheduling for the following morphing sequence is specifically tuned to be non-linear such that the morphing aspect of the trajectory is more clearly visualized. This agenda is applied to both the wrap fraction and the dissolve fraction at the same time. The precise methodology should be referred to at the beginning of this post, in the preliminaries section.

Now, let’s look at the morphing sequence GIF:

If the GIF is not running, it’s most likely just a markdown problem. Check Here instead.

Here is a table of all frames involved:

Bill (Zheng) has suggested that I put “Can you feel my heart” onto the GIF. I did that for my friend’s instead, he liked the joke (I think).

The “Mean Face” of A Population

There will be a bit more implementational details written here (because I recently suffer from not seeing any in papers I need to adopt).

The Danes dataset

I’m pretty sure the dataset is not called “Danes”, but it is as linked here. Particularly, we use the dataset with 200+ faces and obtain it from a 2007 capture of webarchive. Talk about finicky.

The Danes dataset has filenames in the format of -.jpg as well as an .asf file of the same signature. Here, the .jpg files are 640x480 image files of faces aligned to the center of each image, while the .asf are point coordinate recording files. For more details, reference their report. In our use case, we designed a parser for .asf files using RegEx. Talk about finicky.

Here are the average geometries for each categories of photos:

Here are some examples of them warpped into the average geometry of their own categories:

Here are the average faces, which you can morph into:

A gallery of morphing GIFs may be reviewed below, where prototypical faces are morphed into one image of the group:

If the GIF is not running, it’s most likely just a markdown problem. Check Here instead.

And finally, I warpped my face into the average geometries (as well as having average faces morph into my geometry):

The pictures above have already experienced a lot of translations for aligning the facial pictures along the dataset’s assets. Talk about finicky :melting_face:.

Caricatures: Extrapolating From The Mean

To produce a caricature, simply apply the method we mentioned at the methodology section, computing the following expression:

\[{\rm my~face} \times (1 - \alpha) + {\rm average~face} \times \alpha\]

Here, a value of $\alpha < 0$ allows us to extrapolate from the population mean and reinforce our features. On the other hand, a value of $\alpha > 1$ performs its opposite, reinforcing the average face’s feature. Below, I provide a table of caricatures across diverse values of $\alpha$ and subpopulations involved in the dataset.

Bells and Whistles: Manipulating the Prototypical Information of Face

In this seciton, we directly apply the gender manipulating method mentioned in our methodolgoy section near the beginning of the post. In the following gender-wise caricature, you can see the male prototype reinforced at the left image, and female prototype reinforced at the right image. Some features, like lighter eyebrows and heavier beards, are conspicuous. Below are some results:

Now I’m suddenly glad I am the way I am (perhaps until I go back and deal with my project).

Conclusion

Roses are red, the assignment is now done.
Now that the blog post is written and I learned a lot about the power of simple linear operations I am impressed and I think I can be gone.

Roses are red, my wandb alerts are booming.
Thank you for reading and grading and now I need to go deal with unsupervised reinforcement learning.

[COMPSCI 180] Project 2: Let The Imagery Blend In

2024-09-21T00:00:00-07:00

COMPSCI 180 Project 2 Writeup

*Note: Although I use the pronoun ``we’’ in this writeup, this is simply becuase I have been converted beyond the past innocence of using “I” at any academic setting due to a recent intoxication of arXiv products. Meanwhile, please bear with the lack of theoretic figures as opposed to in the last post, where some illustrations are provided to explain concepts.

Introduction

In this assignment, we explore a fundamental aspect of signal processing and computer vision– filters. Filters are mathematical objects that allow for the removal of components with certain frequencies within a signal. They are useful when we need to remove a high frequency component of the signal. For example, when a vocal recording has too much high-pitched noise, one can use a low-pass filter to permit all of the low-frequency components in our recording, effectivly excluding the high-frequency noises we mentioned before.

In images, this technique is applied across several purposes, and today we dive mostly into the construction of images and edge detection via use of filters coupled with many familiar mathematical notions: derivatives, gaussian distributions, and convolution. In this writeup, we detail the work we have done across several tasks assigned throughout the assignment.

Preliminaries

This section outlines the preliminary methodologies used within this assignment to fulfill numerous objectives, containing all methodological descriptions of the images’ creation. Experimental details, such as the use of hyperparameter and particular reflections on the product, are detailed in later sections. For external readers, this section serves as an overview and review of techniques; for graders, this section serves as a proof of the writer’s understanding of concepts for relevant grading items. Last but not least, for the writer, this section serves as a trial towards the completion of assignment and a finally coming period of freedom upon the writing’s accomplishment.

Gaussian Kernels

A kernel is a matrix that we can perform cross-correlation with an image on. So, particualrly, a Gaussian kernel is a kernel that can be used to perform cross-correlation, but with its values across the kernel’s matrix distributed based on the Gaussian distribution. That is, following a specific density function, values at the center of the matrix enjoys a much higher weight than those at the edge of the matrix. A demonstration of Gaussian kernels with width $30$ across various standard deviation value selections follows:

Particularly, it follows the following density function:

\[h(u, v) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{u^2 + v^2}{\sigma^2}\right)\]

where $\sigma^2$ serves as a hyperparameter for smoothing– the larger $\sigma$ becomes, the more smoothing involved in the process. Usually, we choose a kernel width of $6 \sigma$. Unless noted otherwise, we use this standard for kernel width throughout the post.

This format allows for outcomes of cross-correlations to be a weighted average of some block of pixels, where the weights are normalized as they are individual kernel values coming from the Gaussian distribution, which should have an area under curve of $1$ as a density function. In such manner, applying a Gaussian kernel onto a picture is essentially equivalent to creating a smoothed version of the image, which preserves the low-frequency elements of an image. In such manner, as the low-frequency images are stored, we call this act of application a “low-pass filter”.

Image Derivative

For an image as a discrete function $f(x, y)$, recall that the derivative of a continuous function could be computed as:

\[\frac{\delta f(x, y)}{\delta x} = \lim_{\epsilon \rightarrow 0} \frac{f(x + \epsilon, y) - f(x, y)}{\epsilon}\]

which brings us to a discrete equivalent:

\[\frac{\delta f(x, y)}{\delta x} \approx \frac{f(x+1, y) - f(x, y)}{1}\]

or potentially,

\[\frac{\delta f(x, y)}{\delta x} \approx \frac{f(x+1, y) - f(x-1, y)}{2}\]

These computations can in fact be presented as convolutions. Particularly, the primary derivative comes with the filter $\begin{bmatrix} -1 & 1 \end{bmatrix}$, while for the latter $\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}$.

Meanwhile, we can simiarly propose the formulation of an image gradient:

\[\nabla f = \begin{bmatrix} \frac{\delta f}{\delta x} & \frac{\delta f}{\delta y} \end{bmatrix}\]

The edge of the image can even be found from the gradient magnitude:

\[\| \nabla f \| = \sqrt{\left( \frac{\delta f}{\delta x} \right)^2 + \left( \frac{\delta f}{\delta y} \right)^2}\]

The direction of the gradient can also indicate the direction of the lighting (although it also hallucinates some structure):

\[\theta = \arctan \left( \frac{\frac{\delta f}{\delta y}}{\frac{\delta f}{\delta x}} \right)\]

Convolution and Filters: A Signal-listic view of Images

Convolution is a mathematical operation defined as follows.

Let $F$ be the image, $H$ be the kernel, and $G$ be the result of this operation. Then,

\[G[i, j] = \sum_{u = -k}^k \sum_{v = -k}^k H[u, v] F[i-u, j-v]\]

Otherwise denoted as: $ G = H \star F $ It is commutative and associative. Therefore, all of the following expressions hold:

\[\begin{align*} a \star b &= b \star a \\ a \star (b * c) &= (a \star b) \star c \\ a \star (b + c) &= a \star b + a \star c \\ \alpha a \star b = a \star \alpha b &= \alpha (a \star b) \end{align*}\]

At the edge of an image, one can choose to partially apply the kernel or not. Different libraries made different calls. In this project, we choose to preserve the original image size, and make convolutions at the edge of an image based on this policy.

Notbaly, while a Gaussian kernel removes high-frequency components from the image, working as a low-pass filter, convolution of Gaussian kernels is another Gaussian kernel. That is, for a Guassian kernel of width $\sigma$, its self-convolution has width $\sigma\sqrt{2}$.

The application of convolution is diverse. Without addressing the theoretical formulations of it, here is a list of operations performed in our assignment:

Vectorizable Image Gradient Calculation: Provided that the derivative of an image at a point is its surrounding block of gradients multiplied by the $\frac{\delta f}{\delta x}$ operator (and $\frac{\delta f}{\delta y}$ alike), we may apply a convolution of these operators as masks/filters onto our image to obtain an image gradient in the operator’s direction.
Gaussian Low-Pass Filter: As mentioned in prior section, the Gaussian kernel can act as a low-pass filter once applied via convolution onto the image.

My Perspective of Your Image

There are two popular formulations for the composition of an image: (1) a pixel-based function and (2) a mathematical entity in a signal-basis space.

A pixel-based function view of the image maintains that one may imagine an image as some sort of function:

\[f(x, y) = \begin{bmatrix} r(x, y) \\ g(x, y) \\ b(x, y) \end{bmatrix}\]

where $r(x, y)$, $g(x, y)$, and $b(x, y)$ are the red, green, and blue intensities at the point $(x, y)$.

The signal-istic view of image, on the other hand, originates from empirical results like Campbell-Robson contrast sensitivity curve, which converge at the idea that humans are sensitive to different frequencies at different orders. In such case, formulating the image as a sum of many waves should enable us to decompose the image more efficiently along the human perception space (and generally, a biological perception space).

Each signal contains a fundamental building block of:

\[A \sin(\omega x + \phi)\]

as a possible member of the ``signal basis’’. And, it is hypothesized that with enough of these blocks added, we can obtain any signal $f(x)$. Note that, however, each of the building block possesses three degrees of freedom: $A$ (amplitude), $\omega$ (frequency), and $\phi$ (phase). Particularly, the frequency constructs and encodes the fine-ness of this signal. Fourier Transform is an oepration that manages to separate one image into several building blocks (as well as obtaining their coefficients).

Application: the Unsharpen Mask Filter

The signal-istic view of an image introduces us to the idea that an image has a low-frequency component and a high-frequency component. High-frequency components refer to the finer details of an image where change occurs rapidly, such as the edges of an image; low-frequency components refer to the coarser details of an image, such as the general silhouette of a portrait. We have learned that the Gaussian filter helps us extract the low-frequency component of an image. Then, since the image is a sum of low-frequency and high-frequency signals, the signal-istic view suggests that subtracting the low-frequency components of an image away ought to provide us only the high-frequency signals: the sharp edges of an image and the subjects within it. Therefore, we can reinforce the edges of an image by adding the high-frequency signals: $f - f \star g$ onto the original image $f$.

However, it may occur that the high-frequency component of the image does not have a large enough magnitude, so we may want to add a scalar multiple of it rather than its original, forming the computation:

\[f + \alpha (f - f \star g)\]

The mathematical properties of convolution allow us to summarize this sharpening action, called an unsharpen mask filter, as an operation involving one single filter:

\[f \star ((1 + \alpha){\rm unit~impulse} - \alpha g)\]

Gaussian and Laplacian Pyramids

In the last assignment, we have worked with a naive image pyramid that downscales images and upscales them back, allowing for a recursive problem-solving structure:

In a Gaussian pyramid, we apply this structure by attaching a Gaussian filter to the construction of each layer downwards, such that each subsequent layer experiences a low-pass filter and is also downscaled. Computation with a prior layer, which contains an image two-times larger, involves the rescaling of a smaller layer. The Gaussian pyramid therefore contains images of differing low-pass results, which selects lower and lower frequency components. Consequently, subtracting each consecutive layer grants us a range of frequency components between those consecutive layers, essentially creating what we call a bandpass filter (bandpass means middle-pass, as opposed to low-pass or high-pass).

Therefore, based on the above theory, we may construct a modified framework called the Laplacian pyramid, where each layer of the Laplacian pyramid is the difference of some respective difference of Gaussian pyramid’s consecutive layers. By propety, Laplacian pyramids have one layer less than Gaussian pyramids, so the final layer of Laplacian pyramids are formualted to be the last layer of Gaussian pyramid.

To summarize, in a Guassian pyramid, there is a cascading of several low-pass filters, which splits the signals by bands in the filter. To obtain a bandpass pyramid rather than a low-pass pyramid, we can subtract each layer and an upsampled version of its previous layer. This variation, which is called a Laplacian pyramid, performs Laplacian filters which accept higher frequency waves.

On the other hand, a Laplacian pyramid then contains images with only the bandpass frequencies. Usually, we find local structures to be stored in the Laplacian pyramid, which becomes coarser as we traverse into deeper layers of the pyramid. And because of the Laplacian pyramid’s property as a consecutive difference, as we add all the images in a Laplacian pyramid and the lowest frequency picture in the Gaussian pyramid, we can recover the original image.

For external readers, please refer to this slidework from COMPSCI 180 of UC Berkeley for brilliantly illustrated intuitions and insights.

Fun with Filters

Overview of Methodology

In this problem, we consider the cameraman image:

and we want to extract the edges of this image. Following the theoretic insights described in the previous section, two natural solutions stand: (1) using a very-high-pass filter, or (2) using an image gradient magnitude image. In this problem, we consider the latter.

Particualrly, we experiment on three different methods: (1) naively using the image gradient magnitude image, (2) applying a Gaussian filter first, then applying the image gradient magnitude treatment, and (3) applying the convolution of Gaussian filter and image derivative operators, then obtaining the image gradient magnitude.

Notably, by the mathematical property of convolution, methods (2) and (3) are theoretic equivalents. Therefore, a portion of the work below will also concern the equivalence of outcomes between methods (2) and (3), beyond the improvement that they can provide beyond (1) in reducing high-frequency noises coming from the grassland background of cameraman.png.

Naive Image Gradient Magnitude

Via methods described in the theory section, we arrive at the following picture:

Here, each subplot concerns a specific cutoff of the image. Particularly, the shown $f_{clip}$ of image gradient magnitude image $f$ is clipped such that all values above the specified $\alpha$ has pixel-values set to 1, and otherwise 0. The original pixel values are shown in the subplot called Original. Empirically, we see that the $\alpha$ treatment helps to make the elicited result more concrete, but also observe the accompanying noises it can bring upon. Therefore, we choose to first smooth out the noises by applying a low-pass filter (enacted as a Gaussian kernel and convolution), then applying our current technique.

Gaussian Filters Goes Brrrr

Here, the Gaussian filters are created as an outer product of two 1-dimensional Gaussian filters, each with a width of 30 and a standard deviation denoted upon the figure.

Method (2)’s outcome is as attached below:

Meanwhile, method (3)’s outcome is as attached below:

and its filters are:

Here, we witness an empirical similarity of results in methods (2) and (3), and also note the significant improvement of edges’ clarity and noise removal at the rightmost column of methods (2) compared to the best outcome in method (1). Therefore, all qualities of the assignment’s problem are satisfied.

Fun with Frequencies: Image Sharpening

Overview of Methodology

We directly apply the aforementioned unsharpened mask filter in the preliminaries section on every investigated image.

Sharpening Blunt Images

Let’s look at the proposed mask’s effect on some soft/blurry images:

Across all alphas, we see a reinforcement of higher-frequency details in the image. This trend persists in sharper images:

Fun with Frequencies: Hybrid Images

Overview of Methodology

Hybrid images are (possibly cursed) pictures that occur to appear as two different images at the same time. This occurs by mixing the high-frequency aspects of one image and the low-frequency aspects of another, which can cause the image to look differently across different viewing distances. Below, we demonstrate such operation with the example of a Derek-Nutmeg (catboy, or Nutrek, if not Dermeg, but preferrably catboy).

The procedure of creating a hybrid image is outlined as follows:

Use the provided code to align the two images that will be mixed together, forming two images with the same dimension.
Extract the low-frequency aspects of the far-viewable image, and the high-frequency of near-viewable. In the case of a catboy, boy is far-view and cat is near-view.
Average the two images together, and hold the image either absurdly close or far to spot a cat and a boy in the catboy.

catboy

Let us first discuss the catboy.

We may examine our catboy closer:

To prove the theoretic assumptions postulated above, let us look at the frequency maps of low-frequeney Derek variant and high-frequency Nutmeg variant, attached below in the order they are addressed above:

In this manner, we indeed observe that high frequency variant of nutmeg entertains a wide range of higher frequencies, while the low frequency variant of the Derek image entertains mostly low frequencies except some beam of higher frequencies.

Notably, low-frequency components are better off colored than not, as high-frequency components being colored will make them more dominant at a further distance.

Failure cases

There are several failure cases we can discuss, but I’d look at this one:

In this one, it didn’t work because the high-frequency picture of the canny face (captured at top row of above plot) is hard to elicit in clarity, and the two images are also hardly align-able.

Exercises that were Left to The Writer, but now to The Reader

With the above techniques, let us compose two more hybrid images.

Choose a coefficient set of your preference from above to form your favorite Clown-rus. Qualitatively, then, Cyrus’s picture is left more adapted to the clown picture rather than the uncanny picture in the prior subsection.

On the other hand, let us picturize our deer Ryan.

Again, choose a coefficient set of your preference from above to form your favorite deer-ryan. This is inspired by the extensively discussed empirical similarity of Ryan and an anime deer (with some necessary rotations to his face):

Now you may understand why the section is titled this way. The images are not only unbearable to create for its difficult processes, but also unbearable to see, worthy to be called an exercise for both of us writers and readers. Notably, Ryan has posted a picture of my face blended onto the “financial support” meme, and on the other hand, Cyrus hasn’t started the assignment yet.

Fun with Frequencies: Multiresolution Blending

Overview of Methodology

In this task, we concern ourselves with creating absurd entities by blending different images into one, via a Laplacian pyramid that effectively interleaves the details of two pictures with the help of a binary mask. The construction of such pyramids are detailed in the preliminary section at the beginning of this writeup. This submission did not use a stack, we used a pyramid.

In all images, we follow the following procedure to produce our products:

Select two pairs of pictures. I already decided that one pair will be about deer and Ryan, so just need to select another pair.
Create a mask that can combine the two by background-removal, using image editing tools like Adobe Express/Powerpoint.
Hand-refine by a stylus pen.

Peeling the Layers of Oraple

The mask of oraple is simply crafted as a horizontal binary mask. Let us first review the layers of Laplacian pyramids and mask Gaussian pyramid below:

The blended image’s Laplacian layers, without summing, are:

Review the subtitle above each subplot to review what each subplot refers to. Notably, Laplacian layers have been normalized for visibility. Meanwhile, here are some results:

Let us consider another trivial case of blended images with a diagonal binary mask, where I mix two minecraft swords into one.

Irregular Images from Irregular Masks

Now, let us consider some cases for irregular masks, where the masks are created based on the apparoach outlined in the first subsection of this section.

Let us first consider a penguin.

As you see, I blended a pen to a guin.

Now, time for deer Ryan again:

Discussion and Future Work

In this project, I learned the importance of a signal-istic view for images and its application onto image processing. Such view holds great potential for detecting and removing noise for diverse forms of data, potentially applicable for settings that require high data quality, such as deep and reinforcement learning.

[COMPSCI 180] Project 1: Color Channel Alignment

2024-09-08T00:00:00-07:00

COMPSCI 180 Project 1 Writeup.

Introduction

Aligning color channels of an image is a prevalent problem for existing image files where the saved image in channels R, G, B are not well aligned. Specifically, an image might be offered as follows, where directly overlapping the three subfilms of the image results in an unideal picture. Several blog posts on the internet have addressed solutions to this matter, and witnessed successful results using classical methods with minimal linear algebra usage. In this blog post, we discuss the exercise of naive methods that can solve this problem, as well as the extent of limitation that these naive methods, extracted from suggestions in the homework instructions main page. Hopefully, the blog post is a quick write and a quick read, so I can go back and debug PPO afterwards.

Background

In this blog post, there are three major piece of knowledge that will be utilized throughout our methods introduced below: (1) the use of cosine similarity (not its extension, NCC), (2) a general method of exhaustive search for finding optimal displacement in smaller images, and (3) an image pyramid.

Cosine Similarity

Cosine similarity is widely used across many literature to compare vectorized representations, and recently largely applied onto computing the semantic similarity of embeddings. A vital application of cosine similarity may be found in attention mechanisms. Witnessing this success in cosine similarity, we will be comparing displaced imagges via cosine similarity.

An extension of cosine similarity, called Normalized Cross-Correlation (NCC), is a much superior choice. The metric is formulated as follows:

\[NCC(img_1, img_2) = cos(img_1 - \mu_{img_1}, img_2 - \mu_{img_2})\]

This metric proves to be useful for most images except emir, while picking the right hyperparameters for a search process with the cosine metric performs a more consistent alignment across images. Therefore, in this post, we only use cosine similarity as the metric for comparing images.

Optimal Displacement Search in Small Images

To find an optimal displacement (two-dimensions) for small images, we can first target a search-space (say, from offsetting an image with -15 pixels to 15 pixels), and try each two-dimensional vector within the range ${-15, \dots, 15} \times {-15, \dots, 15}$. This means the search-space scales quadratically with the amount of offsets we want to investigate per dimension. In such sense, this method essentially solves the following optimization problem via exhaustive search:

\[\max_{x \in [-15, 15], y \in [-15, 15]} \cos({\rm offset}(img_1, x, y), img_2)\]

Here, the ${\rm offset}$ function is implemented as a img1.roll(x, y, axis=(0, 1)) call. A cropping approach was attempted in the middle of the project, but replaced with the above option for solution of better quality.

Image Pyramid

While the above method is proven effective on smaller images (specifically, with each dimensions being less than 400 pixels), it will take a large amount of time on larger images. Therefore, a school of method employs the approach of first solving the displacement finding problem on a small, downsampled version of the image, then adjusting its estimate of the optimal displacement as the base problem is solved and the image is upsampled. Particularly, the method can be phrased as an iterative procedure with the following steps:

Downsample the image until a base problem size, in our setting when either of the image’s dimension is less than 400 pixels.
Solve the optimal displacement search problem on the image, checking offsets in $[-15, 15]$.
Upscale the displaced image by the original downsample factor.
Solve the optimal displacement search problem on the upscaled image, with a smaller problem size. In our case, we check offsets $[-1, 1]$.
Go back to Step 3 until the image is cannot be further upscaled. This is a U-shaped process, where we begin from the original size of the image onto the first, then second, gradually the kth layer of image pyramid where the image is downscaled to the base problem size, then recede along the pyramid to return to the originally-sized image where a displacement is found.

Recursively, we may phrase the final found displacement with the following mathematical expression. Suppose the $(n-1)^{th}$ downscaled image finds a local displacement of $(x_{n-1}, y_{n-1})$, and the local displacement found on the nth downscaled displaced image is $(x_{local, n}, y_{local, n})$, then the true displacement found at the nth layer of this process would be $\left((x_{n-1} / f_{downscale}) + x_{local, n}, (y_{n-1} / f_{downscale}) + y_{local, n}\right)$.

Methods and Results

Optimal Displacement Search on Low-Res Images via Single-Scale Approach

The overall flow of our solution is as illustrated in the following figure:

First, provided a .jpg file containing all three channels of the image separately, we separate the channels of the image by slicing it into three parts with equivalent heights. Then, we crop the edges of the image away to avoid noisy information by taking $10\%$ of each dimension’s end away. Next, we perform the small single-scale alignment procedure over a problem space of $[-15, 15] \times [-15, 15]$ pixels as stated in our Background section. Upon finding the appropriate displacements, we roll the images according to the found solution to recreate the original image.

A survey of the provided image’s outcomes can be seen as follows:

Image Pyramid-Empowered Multi-Scale Approach

The overall flow of our solution is as illustrated in the following figure:

First, we separate the provided .tif image file into three subparts of equivalent height as shown in the single-scale approach. Then, we follow the image pyramid approach with a problem size of $[-20, 20] \times [-20, 20]$ pixels in the base problem, and $[-1, 1] \times [-1, 1]$ pixels for subsequent upscaled layers. The downscaling factor is $0.5$. Note that, an empirical test has been performed for downscaling factors across $0.25$ to $0.95$, and the chosen value performs most consistently across the desired set of images. The results are exhibited as follows:

Then, here is a table of found displacements that produces the following images:

Name	Red Displacement	Green Displacement
Cathedral*	(12, 3)	(5, 2)
Monastery*	(3, 2)	(-3, 2)
Tobolsk*	(6, 3)	(3, 3)
Church**	(66, -7)	(33, 7)
Emir**	(108, 66)	(58, 33)
Harvesters**	(134, 24)	(66, 24)
Icon**	(100, 33)	(49, 24)
Lady**	(125, 9)	(58, 16)
Melons**	(177, 16)	(92, 15)
Onion Church**	(117, 41)	(58, 33)
Sculpture**	(151, -25)	(41, -8)
Self Portrait**	(177, 41)	(84, 33)
Three Generations**	(117, 16)	(58, 24)
Train**	(100, 41)	(49, 16)

*: Obtained via single-scale; **: Obtained via multi-scale.

Discussion and Conclusion

In this blog post, we successfully replicated preliminary methods for color channel alignment. I will now go back and debug PPO :’)