NeRF is a novel scene synthesis technique first proposed by Mildenhall et al., which attempts to visualize complex geometries at high resolutions via three main methodological features:
Several of these approaches are powered by MLPs.
For this precanned final project of COMPSCI 180, we aim to implement a simplified version of NeRF.
A vector field is essentially a function that maps each n-dimensional coordinate to a vector. A neural field, terminologically similar, is a neural network architecture that maps an n-dimensional coordinate to a 3-dimensional datapoint representing its color. Fundamentally, we train a neural network $F: (x, y) \rightarrow (r, g, b)$ to predict the color of each pixel in an image.
As an exercise, let us discuss how to build such neural fields in the easiest case: a two-dimensional image. In NeRF, a similar approach is taken: a three-dimensional world coordinate and a two-dimensional representation of the camera viewing direction are used as the input of a neural network that outputs the color of a pixel (along with its density, but let us abstract this away first). However, as the NeRF paper notes, prior works have failed to replicate high-frequency details of an image with this approach. To capture this variability, we need the field to be constructed with higher-dimensional inputs.
To construct higher-dimensional inputs from low-dimensional points, we introduce positional encoding. Simply put, it’s an expanded representation of some low-dimensional point. In our implementation, we use the following:
\[PE(x) = \{x, \sin(2^0 \pi x), \cos(2^0 \pi x), \cdots, \sin(2^{L-1} \pi x), \cos(2^{L-1} \pi x)\}\]where the hyperparameter $L$ controls the dimensionality of our expanded representation. This is similar to the identically named idea used in transformer architectures.
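As a concrete sketch (the function name and array shapes are our own choice), the encoding can be written in NumPy as:

```python
import numpy as np

def positional_encoding(x, L=10):
    """Expand coordinates x of shape (N, d) into shape (N, d * (2L + 1)):
    the raw values followed by sin/cos pairs at L frequencies."""
    feats = [x]
    for i in range(L):
        feats.append(np.sin(2 ** i * np.pi * x))
        feats.append(np.cos(2 ** i * np.pi * x))
    return np.concatenate(feats, axis=-1)
```

For a 2D input with $L = 10$, each pixel coordinate thus becomes a 42-dimensional vector.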
Therefore, to train a 2D neural field that predicts the color of any pixel in a 2D image, we will first train the neural field on the 2D image per se, and use the MSE of the predicted and true color of each pixel as the training objective. Additionally, we evaluate our results via the metric Peak Signal-to-Noise Ratio (PSNR), formulated as follows:
\[{\rm PSNR} = 10 \log_{10} \bigg( \frac{1}{\rm MSE} \bigg)\]assuming pixel values normalized to $[0, 1]$. In the 2D neural field, each pixel is one pair of predictor and response variables. This works similarly in the 3D neural field, where we predict the color of a specific pixel from an image shot by an arbitrary camera. The arbitrary camera's viewpoint (and its location) with respect to the world is defined by a rotation and a translation. Therefore, a 3D neural field can be thought of as a 2D neural field simply augmented by camera information. However, it is not easy to introduce this information into a single architecture. First, we discuss how to leverage camera information to help our upcoming pixel-predicting tasks.
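With images normalized to $[0, 1]$ (so the peak signal is $1$; that normalization is our assumption), the metric is tiny:

```python
import numpy as np

def psnr(pred, target):
    """PSNR in dB, assuming pixel values are normalized to [0, 1]."""
    mse = np.mean((pred - target) ** 2)
    return 10 * np.log10(1.0 / mse)
```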
First, let us discuss how to convert the coordinate of a pixel from a camera into a “world-coordinate” that can serve as a shared coordinate system across pictures of all cameras.
The coordinate of a pixel on an image can be transformed into a camera coordinate system via the following relation:
\[z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & o_x \\ 0 & f_y & o_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix}\]Note the scaling by the depth $z_c$ on the left, which makes the pixel coordinate properly homogeneous. The world-to-camera coordinate conversion (whose matrix inverse gives the camera-to-world conversion) can be expressed in terms of the rotation and translation of a camera:
\[\begin{bmatrix} x_c \\ y_c \\ z_c \\ 1 \end{bmatrix} = \begin{bmatrix} \mathbf{R}_{3 \times 3} & \mathbf{t} \\ 0_{1 \times 3} & 1 \end{bmatrix} \begin{bmatrix} x_w \\ y_w \\ z_w \\ 1 \end{bmatrix}\]Without diving too much into the details of these expressions, let us consider a pixel-to-camera-to-world coordinate transformation constructed along these relations.
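A minimal sketch chaining the two relations, with a hypothetical intrinsic matrix `K` and an identity camera pose purely for illustration (in the assignment these come with the dataset):

```python
import numpy as np

# Hypothetical intrinsics and pose for illustration only.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
c2w = np.eye(4)  # camera-to-world: the inverse of the [R | t] matrix

def pixel_to_world(u, v, depth):
    """Chain the two relations: pixel -> camera (at a given depth) -> world."""
    x_c = np.linalg.inv(K) @ np.array([u, v, 1.0]) * depth  # camera coords
    x_w = c2w @ np.append(x_c, 1.0)                         # homogeneous
    return x_w[:3]
```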
In the training process of a 3D neural field, a ray is synonymous with a pixel. A ray is simply the line of light passing through one particular pixel of an image coming from a specific camera. It is the unit "datapoint" in our model.
We can sample points along a ray (expressed in world coordinates) by first computing the origin of the ray $\mathbf{r_o}$, then adding a parametrized multiple of the ray's direction $t \mathbf{r_d}$. The particular mathematical operations to compute these components of a parametric line (ray) can be found in the assignment instructions.
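Assuming $\mathbf{r_o}$ and $\mathbf{r_d}$ have already been computed, the sampling itself is a short sketch over a grid of $t$ values; the near/far bounds and sample count below are illustrative, not the assignment's:

```python
import numpy as np

def sample_along_ray(r_o, r_d, near=2.0, far=6.0, n_samples=64, perturb=False):
    """Return (n_samples, 3) world-coordinate points r_o + t * r_d,
    with t evenly spaced in [near, far]."""
    t = np.linspace(near, far, n_samples)
    if perturb:  # jitter within each bin during training to avoid banding
        t = t + np.random.rand(n_samples) * (far - near) / n_samples
    return r_o[None, :] + t[:, None] * r_d[None, :]
```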
With these rays, we sample points along them to obtain the density at each sampled point and the color that the ray observes there. This allows us to consider the depth of a camera's sight. Then, we may use volumetric rendering, which takes in the series of densities and colors observed along a ray to compute the "expected" color that the ray would perceive, which therefore grants us the color of the corresponding pixel.
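A minimal sketch of this discrete volume rendering step, assuming a constant step size between samples (per-sample step sizes work the same way):

```python
import numpy as np

def volrend(sigmas, rgbs, step_size=0.1):
    """Discrete volume rendering along one ray: sigmas (n,), rgbs (n, 3)."""
    alphas = 1.0 - np.exp(-sigmas * step_size)              # opacity per sample
    trans = np.cumprod(np.append(1.0, 1.0 - alphas))[:-1]   # transmittance T_i
    weights = trans * alphas                                # prob. ray stops at i
    return (weights[:, None] * rgbs).sum(axis=0)
```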
To train a 3D neural field, we follow this procedure for each gradient step:
Bells and Whistles explanation: by simulating an infinitely large density at the end of a ray, we can trick the volumetric renderer into thinking there is a backdrop of a certain color at the end of the ray. Therefore, whenever the ray predicts not hitting any object or 3D point (originally resulting in a black color), we convince it that it has hit a white or gray backdrop instead. The concrete changes only involve concatenating a large density value (say 100) and a target color other than black at the end of the model's predicted densities and colors. However, this trick should only be applied at evaluation, as we still want to train the model to treat empty locations in scenes as dark.
We conduct three main experiments:
We construct a 2D Neural Field on the following images:

Here are the neural fields produced throughout training:


Additionally, we assess the impact of two hyperparameters on the final result of these neural fields: the maximum frequency controller $L$ and the learning rate of the process.
The learning rate of the process had minimal influence:


The maximum frequency controller, on the other hand, had substantial influence. Particularly, with a larger $L$, the maximum permitted frequency in the image increases, allowing for finer details in the image; a smaller $L$ correspondingly blurs them away.


To train a 2D neural field, we implement the architecture and volumetric rendering functions assigned in the manual, alongside other coordinate system transformations. We also select the following different hyperparameters from the spec, as the spec’s results are somewhat irreproducible:
Here is a demonstration of the camera rays, the first image having all rays sampled from a single camera, and the second image having rays sampled across all 100 training cameras.


Here is the resulting training curve of our 3D neural field:

Here is how the validation set images change throughout the training process:

And here is a spherical rendering of our neural field on the test set cameras (transposed):

Its background color can be changed, but there will be flickering dark noise, as shadows are ill-defined in the scenes. The following GIFs result from 7000 gradient steps of training. The resulting dark signal can be thought of as an "overfitting" phenomenon where the model is overly convinced that empty scene locations must be dark.

The random seed used here is always 180.
To set up for this project, we need to:
Here is a demonstration of the prepared model’s capabilities with inference step $20$:

and with inference step $40$:

Diffusion models are generative models that create samples from a distribution via learning to denoise some noised version of an image.
Particularly, we denote by $x_0$ an object that is completely clean, such as a picture that we want to generate. Then, the intuition of learning a diffusion process is that we observe how adding noise destroys the appearance of an image. This is known as the noising process, where we proceed from $x_t$ to $x_{t' > t}$.
By observing the reversed trajectory of that destruction process, we learn the generation process, going from $x_{t' > t}$ to $x_t$. The name of this process comes from its literal interpretation: it's a denoising process.
Specific equations formulate these processes, normally with the help of normal random variables $\mathcal{N}$.
The forward process, also known as the noising process, is defined as:
\[q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) \mathbf{I})\]Here $\mathbf{I}$ is the identity matrix, so the noise is applied independently to each pixel. This comes from the primitive form of generating $x_t$:
\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})\]Throughout the post, we will be using this test image quite often, so remember its existence…
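A minimal sketch of this noising step, assuming a precomputed array `alpha_bar` holding the cumulative schedule $\bar{\alpha}_t$ (the schedule itself comes from the model):

```python
import numpy as np

def forward_noise(x0, t, alpha_bar, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0); alpha_bar is the precomputed cumulative
    product of the noise schedule, shape (T,)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps
    return x_t, eps
```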

Implementing this procedure with $t \in [0, 999]$ on images with height and width of $64$, we obtain the following results as we noise the image for a different amount of timesteps:

By applying a Gaussian blur, we hope to remove the noise in an image. However, the results are unideal when the image is very noisy:

But if we instead try to estimate the noise using the UNet of our diffusion model, then we can achieve an effective estimation of the clean image:

Having to denoise through $1000$ steps is costly. Recent work has discussed the possibility of one-shot diffusion, but I haven't read the paper, so here's an okay cheap alternative to that. We can use strided timesteps, which scale the number of timesteps required down by roughly threefold. Particularly, the less-noisy image at timestep $t' < t$ is computed from $x_t$ as:
\[x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t} (1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t} x_t + v_{\sigma}\]The definitions of these variables can be found in the instructions. This strided timestep works as an interpolation trick.
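A sketch of one such step; the schedule arrays and the handling of the variance term $v_{\sigma}$ (passed in precomputed) are our own assumptions:

```python
import numpy as np

def strided_denoise_step(x_t, x0_est, t, t_prime, alphas, alpha_bars, v_sigma=0.0):
    """One interpolation step from timestep t down to t' < t.
    alphas / alpha_bars are hypothetical schedule arrays."""
    beta_t = 1.0 - alphas[t]
    coef_x0 = np.sqrt(alpha_bars[t_prime]) * beta_t / (1.0 - alpha_bars[t])
    coef_xt = np.sqrt(alphas[t]) * (1.0 - alpha_bars[t_prime]) / (1.0 - alpha_bars[t])
    return coef_x0 * x0_est + coef_xt * x_t + v_sigma
```

A sanity check on the interpolation: stepping all the way down to $t' = 0$ (where $\bar{\alpha} = 1$) returns the clean estimate $x_0$ exactly.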
Here is a view of the noising process:

And here is a comparison of the denoised results via diffusion and classical approaches:

To sample from the distribution of images that our diffusion model models, we can step through the entire denoising trajectory by starting with images composed of purely random noise.

Interestingly, involving classifier-free guidance by combining noise estimates from conditioned and unconditioned prompts can improve our image quality. Let $\epsilon_c$ be the conditioned noise estimate (where there is an existing prompt we'd like), and $\epsilon_u$ be the unconditioned noise estimate (in our context, the noise estimate for a null empty prompt). Then, the new estimate is expressed as:
\[\epsilon = (1 - \gamma) \epsilon_u + \gamma \epsilon_c\]where the hyperparameter $\gamma$ is free of choice and encouraged by the assignment to be $7$. The samples of this method observe a much higher quality:
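The update is a single line; note that it is algebraically the same as extrapolating from $\epsilon_u$ toward (and, for $\gamma > 1$, past) $\epsilon_c$:

```python
import numpy as np

def cfg_noise_estimate(eps_u, eps_c, gamma=7.0):
    """Classifier-free guidance: blend the unconditional and conditional
    noise estimates, extrapolating past the conditional one for gamma > 1."""
    return (1.0 - gamma) * eps_u + gamma * eps_c  # == eps_u + gamma * (eps_c - eps_u)
```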

From this point forward, we will use CFG in all of our approaches. The first application we'd like to try out is image-to-image translation, which demonstrates a process of arbitrary images gradually coming to resemble an originally provided image:

This procedure is otherwise known as SDEdit.
We can also apply this trick to web images and hand-drawn images to turn them into high-quality photos at intermediate steps.
Web image:

SDEdit Results:

Hand drawn image A:

SDEdit Results:

Hand drawn image B:

SDEdit Results:

Inpainting is a trick where we run diffusion only on the masked region of the image, allowing a portion of the image to be altered. To put this into mathematical formulation, after obtaining $x_t$ at each $t$, we also apply the following update for a mask $\mathbf{m}$ that is $1$ where new contents should occur:
\[x_t \leftarrow \mathbf{m} x_t + (1 - \mathbf{m})~{\rm forward}(x_{orig}, t)\]Here are some possible results of this technique:
Setup A:

Result A:

Setup B:

Result B:

Setup C:

Result C:

We can also let images shift from one prompt’s content to another by changing the conditioned prompt from “a high quality photo” to something else, like “a rocketship”. Here are some example outputs of the technique:

Visual anagrams can be produced by mixing class-conditioned noises of each text prompt. The full algorithms is noted at the assignment instruction, although it misses the CFG part of the formulation.
Here are some results:
Prompt: “an oil painting of an old man” flips into “an oil painting of people around a campfire”

Vertically flipped:

Prompt: “a lithograph of waterfalls” flips into “a lithograph of a skull”

Vertically flipped:

Prompt: “an oil painting of a snowy mountain village” flips into “a photo of a dog”

Vertically flipped:

Back to making hybrid images! To do so, similarly follow the instructions, but instead of having separate unconditioned noise estimate, use one unconditioned noise estimate with the sum of noised estimates to perform CFG-styled noise estimation.
Prompt: At far, it looks like “a lithogram of a skull,” but upon close inspection, it’s actually “a lithogram of waterfalls”.

Prompt: At far, it looks like “tornado in a desert,” but upon close inspection, it’s actually “the face of a programmer”.
Check particualrly generated image 2.

Prompt: At far, it looks like “sunset in a desert,” but upon close inspection, it’s actually “a bar with dim lighting”.
Check particualrly generated image 3.

The random seed used at here is always 0.
The UNet is an amalgation of several smaller convolution-based blocks. I implemented it. If you believe me, no need for further action. Else, here are some results.
A demonstration for the noising process across choices of $\sigma$:

The training loss (logarithmically transformed) of the model:

Some denoised results of the model:

The performance of our denoise trained on $\sigma=0.5$ when it encounters a higher noise level:

We can condition the UNet on timestep information to inform it how to denoise images provided the temporal situation of the denoising sequence. Particularly, this occurs with a fully connected block that takes timestep as an input.
The training loss of such model, logarithmically transformed, follows:

Its outputs:

Similar to the theories of CFG, you can use class-conditioned embeddings to help diffusion models gear towards generating images of particular classes.
The training loss of such model, logarithmically transformed, follows:

Its outputs at the 5th epoch:

Its outputs at the 20th epoch:


In this (rushed) blog post, I detail my doings (and now corrected wrongdoings) in Project 4A (one of them being starting late due to all the other businesses in life).
The assignment discusses image warping and mosaicing. Image warping is a business that we have discussed using a large part of our previous post, while mosaicing is a new topic. What is mosaicing? Mosaicing is stitching two pictures of different perspectives together, forming a panorama of some two sights that have resulted from an observer staying at a fixed global coordinate, but perhaps looking at something from a different angle. An example will be quickly shown in our results section.
We have repeatedly mentioned the word “stitching”, but what really is “stitching”? And how is warpping, an elemental operation for image transformation, related to our works (and my wrongdoings)? We will detail the methodological details of these efforts in the Preliminaries sections, then describe their experimental outcomes using the section following Preliminaries.
Let us consider a toy example of the two following pictures:
| Picture A | Picture B |
|---|---|
![]() |
![]() |
These are different views of Prof. Efros’ office seen from the same global coordinate (I stand at the exact same location when looking at these views), but the door’s overall orientation is quite different, because the angle at which I look at the door is different. Therefore, one door is slanted towards the left, and the other is slanted towards the right! Amazing!
So how do we stitch these two pictures together? Well, there is a commonality between these two pictures that we can merge with: the door. Let’s just overlap the images by where the door is… except we can’t just do that. The door’s shapes are different. We must transform, say, Picture B’s door’s shape into that of Picture A’s when we overlap the images (and naturally, all other components follow the same transformation). This is where warpping comes in.
In the last blogpost, we discussed finding an affine transformation between two triangles by finding a 3-by-3 transformation matrix via homogeneous coordinates. Here, we concern a different form of transformation: projective transformation.
\[\begin{bmatrix} wx' \\ wy' \\ w \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}\]With at least 4 pairs of $(x, y), (x’, y’)$ as well as the definite assumption that $gx + hy + 1 = w$, we can construct the following system of linear equations and use least squares algorithm to find the optimal estimators for projective mapping parameters:
\[\forall (x, y, x', y') \in \mathcal{D}, \begin{cases} x' = ax + by + c - gx x' - hy x' \\ y' = dx + ey + f - gx y' - hy y' \end{cases}\]here, $\mathcal{D}$ is the set of all correspondence points, such as the corners of the door and doorknobs.
Upon obtaining this projection matrix, it’s warping time. The general technique of warpping can be seen at the post for project 3, but I will kindly reiterate it here.
There are two general patterns for warpping: forward-warpping and inverse-warpping.
In forward warpping, once we infer the operator $\mathcal{T}$, we map the value of each pixel at position $(x, y)$ to its transformed equivalent $\mathcal{T}((x, y))$, and non-integer pixel coordinates will have its values distributed among neighboring pixels. However, this pattern of warpping easily leads to “holes” in the resulting product, where certain pixels do not receive any coloration.
Therefore, our assignment proceeds with an alternative method: inverse warpping. In this paradigm, we infer an operator $\mathcal{T}$ from the source image to a target image, and use the inverse operator: $\mathcal{T}^{-1}$ to do pixel mapping as described above. However, since the source image’s pixel values are all known, we can safely interpolate the unknown pixel values with its neighbors. An efficient manner of doing so is bi-linear interpolation, where for each $(x_t, y_t) = \mathcal{T}^{-1}(x, y)$ resulted, its coloring is inferred as a weighted sum of its neighbor. In our assignment, we apply a variation of this logic that will be described below (out of the unclarity of the original assignment’s instruction, we devise a variation of this interpolation function fitting our context).
To construct a smooth transitioning blending mask, we will formulate our mask with the following procedure:
The overall flow of mosaic would then seem as follows:
To detect features that are helpful for alignment, we can consider first what kind of features are best as correspondence points between two images. And that is corners. Corners of a door, or a wall, can often represent a distinct boundary between some objects, and is therefore distinctive of the geographical location of something in a picture. Therefore, when automatically detecting features, we don’t want to cut corners, but we want to find corners.
The Harris corner detector described in lecture (which you should go to) fulfills this purpose with the simplified procedure below:
Now, then, we get a set of pixels that can be Harris corners based on peak-finding algorithms.
However, is a pixel a good feature because it is a Harris corner, or is the pixel a Harris corner because it is a good feature?
The answer is, a pixel is a Harris corner because it’s a potentially good feature, but it might not be useful, because it can be a not-so-prominent peak within the space of $R$ values throughout pixels. Therefore, we use an additional technique called Adaptive Non-Maximal Suppresion (ANMS) to filter the Harris corners down to a set of $250$ or $500$ points.
The ANMS algorithm follows as:
Therefore, upon applying ANMS, we know throughout the Harris corners and the suppressed points, these 500 pixels alone are the promising features.
To describe a feature (a pixel, a corner), we can use an 8x8 patch around it by sampling with stride 5. That is, for a pixel at $(x, y)$, the pixels we sample as its descriptors can be described with the set of points
\[\mathcal{P} = \{(a, b) | a \in [x - 20, x + 20], b \in [y-20, y+20], (a - x) \mod 5 \equiv 0, (b - y) \mod 5 \equiv 0\}\]Well, this forms a 9x9 patch, but would I re-formulate a piece of working code to make it work again under this subtle imprecision that doesn’t affect my final result? Nah, I’d continue writing the blog post. The patch also needs to be standardized.
To match features across image A and B, we can first find nearest neighbor matches of them. That is, we first construct a nearest neighbor instance for features in image A and B, then we see what pairs of features across images A and B are each other’s nearest neighbor. Then, among those points, we filter out all pairs whose following statistic is less than $0.5$ (the paper suggests $0.4$, but I made it more generous judging by my data):
\[\frac{d_{first~NN}}{d_{second-NN}}\]Surprisingly, my implementation doesn’t use a Nearest Neighbor data structure from any existing libraries. I just use a cdist distance matrix between feature patches.
Lastly, the RANSAC algorithm can be described as an iterative process as follows, which helps us find the largest set of inliers that constructs the homography we want for our eventual warping and stitching procedure.
best_error, best_inlier_set.best_error and we have encountered the biggest inliner_set up to now, record our homography and inliner_set, and update our best_error.Then, you may stitch proud, as your features are aligned.
To make sure our warpping implementation is intact, we would like to try rectifying some aspects of an image, such that a specific set of four pairs of correspondence points should form a square or rectangle in a warpped image.
The results are as follows:


Here are the mosaics.
| Original Picture A | Original Picture B |
|---|---|
![]() |
![]() |
Results:

| Original Picture A | Original Picture B |
|---|---|
![]() |
![]() |
Results:

| Original Picture A | Original Picture B |
|---|---|
![]() |
![]() |
Results:

Generally, the results match in structure. The color differences may be efficiently resolved using a Laplacian kernel, or be more careful when taking pictures.
The Harris corner filled my entire picture for any instances, so I’ll showcase my ANMS points here instead!

As you can see, they be zoomin. Therefore, we should suppress more of them with the matching procedure.
Example patches of feature detectors for picture C1 is shown below:

As you can see, the corners on the flag posted on Prof. Efros’ door has brought some attention to our feature detector!
Furthermore, here are the true inliers found by RANSAC on each image after their nearest neighbor matching procedure discussed in the Preliminaries for 4B section above.

ITS STITCHING TIME!



It’s cool to automatically find features (even though the methods are complicated), and now I appreciate representation learning a bit more for finding these by itself.
]]>
Transformations are powerful mathematical tools that allow us to transfigure photos into specific structures with specified disentanglements of its “structural details” and “texture”. Particularly, the popular morphing technique applied onto pairs of pictures, can transform the subject of one image into the subject of another in an almostly seamless manner. And, we can also direct the methods of transformation onto faces, where we elicit the particular structural details and texture-ic properties of faces of specific demographics, in turn extrapolating faces towards different demographics or features via linear algebraic manipulations. In this assignment post, we detail all the techniques described above via the tasks prescribed by the assignment per se.
Considering the viewpoint that all images are fundamentally some tensor, the shifting of one pixel into another location can be naturally considered as a matrix multiplication. Paritcularly, the transformations that may occur via a matrix multiplications are parametrized by its operator. A survey of such transformation has been delivered in many lower-division courses, and so will be summarized with the following picture:
All of these transformations are either linear, in that it serves as a change of basis while obeying linearity (the composition of homoegeneity and superposition properties), or are affine. Fundamentally, affine transformations occur by matrix multiplications and additions, in a linear algebraic sense. They also have the following properties (migrated from the lecture slide):
Using homogeneous coordinates, which is an augmented coordinate system for any-dimensional coordinates such that $(x_1, \dots, x_n, w) = (x_1 / w, \dots, x_n / w)$, we may unify the description of affine transformations as one single matrix operation:
\[\begin{bmatrix} x' \\ y' \\ w \end{bmatrix} = \begin{bmatrix} a & b & c \\ d & e & f \\ 0 & 0 & 1 \\ \end{bmatrix} \begin{bmatrix} x \\ y \\ w \end{bmatrix}\]Particularly, each degree of freedom from the aforementioned parametrization in $a, b, \dots, f$, comes from the possible candidate components of any affine transformations: translation, scaling, rotation, and sheering. Please see this slidework, slide #27 for an in-depth description.
Notably, any triangle can be transformed into any other triangle via an affine transformation. For the pacing of this post, let us directly assume we will be transforming triangles into triangles. The fundemental reason of this will be described in the next section, when we discuss morphing.
How many “correpondences”, or points on the triangle that definitely correpond to each other, do we need? Since each triangle has three vertices, there are six coordinates (at least) involved in each transformation. Therefore, for three vertices of the triangle $V = {V_n = (x_n, y_n) | n \in {1, 2, 3}}$ and its transformed version $V’$, we find the following system of equations that will enable us to infer the parameters of affine transformation that occurred:
\[\begin{bmatrix} x_1 & y_1 & 0 & 0 & 1 & 0 \\ 0 & 0 & x_1 & y_1 & 0 & 1 \\ x_2 & y_2 & 0 & 0 & 1 & 0 \\ 0 & 0 & x_2 & y_2 & 0 & 1 \\ x_3 & y_3 & 0 & 0 & 1 & 0 \\ 0 & 0 & x_3 & y_3 & 0 & 1 \end{bmatrix} \begin{bmatrix} \hat{a} \\ \hat{b} \\ \hat{d} \\ \hat{e} \\ \hat{c} \\ \hat{f} \end{bmatrix} = \begin{bmatrix} x_1' \\ y_1' \\ x_2' \\ y_2' \\ x_3' \\ y_3' \end{bmatrix}\]You can arrange the above equation to make the parameter list alphabetically sorted. So, based on the vertices of a transformed triangle, we can infer its transformation operator $\mathcal{T}$.
There are two general patterns for warpping: forward-warpping and inverse-warpping.
In forward warpping, once we infer the operator $\mathcal{T}$, we map the value of each pixel at position $(x, y)$ to its transformed equivalent $\mathcal{T}((x, y))$, and non-integer pixel coordinates will have its values distributed among neighboring pixels. However, this pattern of warpping easily leads to “holes” in the resulting product, where certain pixels do not receive any coloration.
Therefore, our assignment proceeds with an alternative method: inverse warpping. In this paradigm, we infer an operator $\mathcal{T}$ from the source image to a target image, and use the inverse operator: $\mathcal{T}^{-1}$ to do pixel mapping as described above. However, since the source image’s pixel values are all known, we can safely interpolate the unknown pixel values with its neighbors. An efficient manner of doing so is bi-linear interpolation, where for each $(x_t, y_t) = \mathcal{T}^{-1}(x, y)$ resulted, its coloring is inferred as a weighted sum of its neighbor. In our assignment, we apply a variation of this logic that will be described below (out of the unclarity of the original assignment’s instruction, we devise a variation of this interpolation function fitting our context).
Cross-dissolving is a technique that allows us to interpolate the pixels of intersecting, supposedly-“merged” images via linear interpolation (a weighted sum). Particularly, in this method we define a halfway image with a parametrization of $t \in [0, 1]$:
\[(1 - t) \times {image}_1 + t \times {image}_2\]One conspicuous drawback of this technique is that this only works on aligned images.
An efficient method to dictate the structure of objects in an image is via defining a triangular gridmap over important landmarks of the object. For example, the picture of one face of a diamond can be defined as several triangles, allowing us to construct a structural map for the prism-ful diamond face. An example of triangular mesh may be seen when we discuss the operational details of this assignment.
In this assignment, we employ the Delaunay triangulation method. In the naive rendition of this triangulation method, we may start with an arbitrary triangulation of our image, and we flip any illegal edge within the triangulation until no more illegal edges exist. Here, an illegal edge refers to an edge that would improve the triangulation by being flipped. A good triangulation is one where the triangles are less narrow. The precise technicalities of these approaches are (somewhat) out of scope for this blog post. Then, the clever rendition of Delaynay triangulation is by solving a dual of the triangulation problem via the use of Voronoi diagrams.
The morphing sequence is defined as a sequence of warpped image over some schedule of $w(t)$ where $t \in [0, 1]$. Particularly, the $k^{th}$ image of a warpping sequence is constructed as:
This schedule of dissolve and warp fractions can be freely decided as a monotonously increasing function with values starting from $0$ and ending at $1$.
In this assignment, we export such sequence of image as an animated GIF and output it to this website for grading purposes.
Now, let us discuss your face. My face. As well as many people’s faces. Our faces are interesting. Particularly, our faces can be defined with several landmark features. The Danes dataset of this assignment, for instance, defines 58 landmarks to each of its subject’s face.
Faces can be warpped, morphed into each other via the aforementioned strategies. Interestingly, we can also use the above viewpoints to define populational mean shapes and textures of faces for specific demographics, allowing us to define the “male prototypical” and “female prototypical” face based on the dataset labeling. BASED ON THE DATASET LABELING.
Upon doing so, we can get the supposed deviation that distinguishes male and female prototypical faces by the following procedure:
Of course, to simply create a caricature, one can also just subtract the population mean from one’s own picture. This works similarly to an unsharpened mask filter, where the edge component is reinforced by subtracting away the mean of the image at a larger weight. In the sense of creating a caricature, we simply create an image where “our” components are largely reinforced by subtracting away the population face average.
Now, let us head into the assignments.
Correspondences are defined via an online tool from last year’s student, publicized on the instructional website of this assignment. I am summarizing the overall process of picture processing as well as their resulting triangulations via the following image:
Particularly, correspondence points are chosen and selected in specific orders to secure the similarity of triangulations between warpping processes. As mentioned before, this plays a significant role in the successful, undistorted results, and prevents us from morphing our face into abstract art. Particularly, targeting essential structures of faces and preventing collinear triplets of points helps to prevent the deformation of triangular meshes.
Someone walked in on me while I took my photo at my lab’s rotation seat, if you wanted a fun story somewhere in this blog :’)
As mentioned in the methodologies section, computing a midway face involves a three-step procedure:
Particularly, we fill in parts of the faces by constructing polygon masks. This is where a tricky point occurs. If we simply apply our affine transformation to the coordinates of polygon masks from the original faces onto the midway face, we will find several stripes throughout the mask, which occur because some integer coordinates are never covered. This is a known weakness of forward warping. Therefore, we adopt an alternative, inverse warping algorithm, where:
Here, bilinear interpolation involves only the original face, whose pixel values are not changed. Therefore, this operation can be vectorized via array arithmetic, and runs fairly efficiently.
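The inverse-warp-plus-bilinear-sampling step can be sketched as follows. This is a minimal numpy sketch under my own naming (the function names and the 2×3 inverse affine matrix layout are assumptions, not the submission’s actual code):

```python
import numpy as np

def bilinear_sample(img, xs, ys):
    """Sample a grayscale image (H, W) at float coordinates via bilinear interpolation."""
    h, w = img.shape[:2]
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    dx, dy = xs - x0, ys - y0
    top = img[y0, x0] * (1 - dx) + img[y0, x0 + 1] * dx
    bot = img[y0 + 1, x0] * (1 - dx) + img[y0 + 1, x0 + 1] * dx
    return top * (1 - dy) + bot * dy

def inverse_warp(src, inv_affine, dest_coords):
    """Map destination pixel coordinates back into the source image and
    sample colors there, so no destination pixel is left uncovered."""
    ones = np.ones((dest_coords.shape[0], 1))
    homog = np.hstack([dest_coords, ones])   # (N, 3) homogeneous coords
    src_xy = homog @ inv_affine.T            # (N, 2) source coordinates
    return bilinear_sample(src, src_xy[:, 0], src_xy[:, 1])
```

Because every destination pixel pulls its own color, the stripe artifacts of forward warping cannot occur.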
The resulting mid-way face is as follows:


The scheduling for the following morphing sequence is specifically tuned to be non-linear such that the morphing aspect of the trajectory is more clearly visualized. This schedule is applied to both the warp fraction and the dissolve fraction at the same time. The precise methodology is described at the beginning of this post, in the preliminaries section.

Now, let’s look at the morphing sequence GIF:

If the GIF is not running, it’s most likely just a markdown problem. Check Here instead.
Here is a table of all frames involved:

Bill (Zheng) has suggested that I put “Can you feel my heart” onto the GIF. I did that for my friend’s instead, and he liked the joke (I think).
There will be a few more implementation details written here (because I have recently suffered from not seeing any in papers I need to adapt).
I’m pretty sure the dataset is not called “Danes”, but it is as linked here. Particularly, we use the version with 200+ faces, obtained from a 2007 Web Archive capture. Talk about finicky.
The Danes dataset has filenames in the format of <person-id>-<category><biological-sex>.jpg as well as an .asf file of the same signature.
Here, the .jpg files are 640x480 images of faces aligned to the center of each frame, while the .asf files record the point coordinates.
For more details, reference their report.
In our use case, we designed a parser for .asf files using RegEx. Talk about finicky.
Here are the average geometries for each category of photos:

Here are some examples of them warped into the average geometry of their own categories:

Here are the average faces, which you can morph into:

A gallery of morphing GIFs may be reviewed below, where prototypical faces are morphed into one image of the group:

If the GIF is not running, it’s most likely just a markdown problem. Check Here instead.
And finally, I warped my face into the average geometries (as well as having average faces morph into my geometry):

The pictures above have already undergone a number of translations to align the facial pictures with the dataset’s assets. Talk about finicky :melting_face:.
To produce a caricature, simply apply the method we mentioned at the methodology section, computing the following expression:
\[{\rm my~face} \times (1 - \alpha) + {\rm average~face} \times \alpha\]Here, a value of $\alpha < 0$ allows us to extrapolate past the population mean and reinforce our own features. On the other hand, a value of $\alpha > 1$ performs the opposite, reinforcing the average face’s features. Below, I provide a table of caricatures across diverse values of $\alpha$ and subpopulations involved in the dataset.
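The expression above is just linear interpolation run outside $[0, 1]$; a one-liner sketch (in practice it is applied to both the keypoint shapes and the warped textures, and the function name here is my own):

```python
import numpy as np

def caricature(my_face, average_face, alpha):
    """Linear (ex|inter)polation between my face and the population mean.
    alpha < 0 exaggerates my features; alpha > 1 exaggerates the mean's."""
    return my_face * (1 - alpha) + average_face * alpha
```

For instance, with $\alpha = -0.5$, every deviation of my face from the mean is amplified by a factor of $1.5$.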

In this section, we directly apply the gender-manipulation method mentioned in our methodology section near the beginning of the post. In the following gender-wise caricatures, you can see the male prototype reinforced in the left image, and the female prototype reinforced in the right image. Some features, like lighter eyebrows and heavier beards, are conspicuous. Below are some results:

Now I’m suddenly glad I am the way I am (perhaps until I go back and deal with my project).
Roses are red, the assignment is now done.
Now that the blog post is written and I learned a lot about the power of simple linear operations I am impressed and I think I can be gone.
Roses are red, my wandb alerts are booming.
Thank you for reading and grading and now I need to go deal with unsupervised reinforcement learning.
*Note: Although I use the pronoun “we” in this writeup, this is simply because I have been converted beyond the past innocence of using “I” in any academic setting due to a recent intoxication with arXiv products. Meanwhile, please bear with the lack of theoretical figures, as opposed to the last post, where some illustrations were provided to explain concepts.
In this assignment, we explore a fundamental aspect of signal processing and computer vision: filters. Filters are mathematical objects that allow for the removal of components with certain frequencies from a signal. For example, when a vocal recording has too much high-pitched noise, one can use a low-pass filter to permit all of the low-frequency components in the recording, effectively excluding the high-frequency noise we mentioned before.
In images, this technique is applied for several purposes, and today we dive mostly into the construction of images and edge detection via filters, coupled with many familiar mathematical notions: derivatives, Gaussian distributions, and convolution. In this writeup, we detail the work we have done across the several tasks assigned throughout the assignment.
This section outlines the preliminary methodologies used within this assignment to fulfill numerous objectives, containing all methodological descriptions of the images’ creation. Experimental details, such as hyperparameter choices and particular reflections on the products, are detailed in later sections. For external readers, this section serves as an overview and review of techniques; for graders, this section serves as a proof of the writer’s understanding of concepts for relevant grading items. Last but not least, for the writer, this section serves as a trial towards the completion of the assignment and a finally arriving period of freedom upon the writing’s completion.
A kernel is a matrix with which we can perform cross-correlation on an image. Particularly, a Gaussian kernel is a kernel whose values are distributed across the matrix based on the Gaussian distribution. That is, following a specific density function, values at the center of the matrix enjoy a much higher weight than those at the edges. A demonstration of Gaussian kernels with width $30$ across various standard deviation values follows:
Particularly, it follows the following density function:
\[h(u, v) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{u^2 + v^2}{2\sigma^2}\right)\]where $\sigma$ serves as a hyperparameter for smoothing: the larger $\sigma$ becomes, the more smoothing is involved in the process. Usually, we choose a kernel width of $6 \sigma$. Unless noted otherwise, we use this standard for kernel width throughout the post.
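Constructing such a kernel can be sketched in a few lines of numpy (the function name and the odd-width adjustment are my own choices; the kernel is normalized to sum to 1, matching the standard 2D Gaussian density):

```python
import numpy as np

def gaussian_kernel(sigma, width=None):
    """2D Gaussian kernel, normalized to sum to 1.
    Default width follows the ~6*sigma rule mentioned above."""
    if width is None:
        width = int(6 * sigma) | 1   # force an odd width so a center exists
    half = width // 2
    u, v = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    k = np.exp(-(u**2 + v**2) / (2 * sigma**2))
    return k / k.sum()
```

Normalizing by the sum (rather than the analytic $1/(2\pi\sigma^2)$ constant) guarantees the discrete weights add to exactly 1, so the filter computes a true weighted average.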
This form makes the outcome of cross-correlation a weighted average of a block of pixels, where the weights are normalized since they come from the Gaussian density, whose area under the curve is $1$. In this manner, applying a Gaussian kernel to a picture is essentially equivalent to creating a smoothed version of the image, which preserves its low-frequency elements. As the low-frequency content is kept, we call this application a “low-pass filter”.
For an image as a discrete function $f(x, y)$, recall that the derivative of a continuous function could be computed as:
\[\frac{\partial f(x, y)}{\partial x} = \lim_{\epsilon \rightarrow 0} \frac{f(x + \epsilon, y) - f(x, y)}{\epsilon}\]which brings us to a discrete equivalent:
\[\frac{\partial f(x, y)}{\partial x} \approx \frac{f(x+1, y) - f(x, y)}{1}\]or potentially,
\[\frac{\partial f(x, y)}{\partial x} \approx \frac{f(x+1, y) - f(x-1, y)}{2}\]These computations can in fact be expressed as convolutions (with the kernel flipped relative to cross-correlation). Particularly, the former corresponds to the filter $\begin{bmatrix} -1 & 1 \end{bmatrix}$, while the latter to $\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}$.
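A quick 1D sanity check of the finite-difference filter (my own toy signal, not one of the assignment’s images); note that convolution flips the kernel, so correlating with $[-1\ 1]$ is convolving with $[1\ -1]$:

```python
import numpy as np

# A 1D "edge" signal: two flat regions with a step between them.
f = np.array([0.0, 0.0, 1.0, 1.0, 1.0])

# Forward difference f[x+1] - f[x] as a convolution; the slice realigns
# the "full" output so dx[i] corresponds to position i of the signal.
dx = np.convolve(f, [1.0, -1.0])[1:len(f) + 1]
# dx fires only at the step (the trailing -1 is a boundary artifact
# of the full convolution running off the end of the signal).
```

The derivative response is zero on the flat regions and nonzero exactly at the step, which is why gradient filters act as edge detectors.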
Meanwhile, we can similarly propose the formulation of an image gradient:
\[\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} \end{bmatrix}\]The edges of the image can then be found from the gradient magnitude:
\[\| \nabla f \| = \sqrt{\left( \frac{\partial f}{\partial x} \right)^2 + \left( \frac{\partial f}{\partial y} \right)^2}\]The direction of the gradient can also indicate the direction of the lighting (although it also hallucinates some structure):
\[\theta = \arctan \left( \frac{\partial f / \partial y}{\partial f / \partial x} \right)\]Convolution is a mathematical operation defined as follows.
Let $F$ be the image, $H$ be the kernel, and $G$ be the result of this operation. Then,
\[G[i, j] = \sum_{u = -k}^k \sum_{v = -k}^k H[u, v] F[i-u, j-v]\]Otherwise denoted as $G = H \star F$. Convolution is commutative, associative, and distributive over addition. Therefore, all of the following expressions hold:
\[\begin{align*} a \star b &= b \star a \\ a \star (b \star c) &= (a \star b) \star c \\ a \star (b + c) &= a \star b + a \star c \\ \alpha a \star b = a \star \alpha b &= \alpha (a \star b) \end{align*}\]At the edge of an image, one can choose whether to partially apply the kernel. Different libraries make different choices. In this project, we choose to preserve the original image size, and compute convolutions at the edge of an image based on this policy.
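These properties, and the size-preserving edge policy, can be checked numerically. A sketch with scipy (the box and cross kernels here are arbitrary examples, and `boundary="symm"` is just one illustrative edge policy):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
img = rng.random((32, 32))
g = np.ones((3, 3)) / 9.0                                    # box blur kernel
h = np.array([[0, 1, 0], [1, 2, 1], [0, 1, 0]], dtype=float) / 6.0

# mode="same" preserves the image size; boundary sets the edge policy.
out = convolve2d(img, g, mode="same", boundary="symm")

# Associativity holds exactly in "full" mode: (img * g) * h == img * (g * h).
lhs = convolve2d(convolve2d(img, g), h)
rhs = convolve2d(img, convolve2d(g, h))
```

Associativity is what later lets us pre-combine a Gaussian with a derivative filter into a single derivative-of-Gaussian filter.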
Notably, while a Gaussian kernel removes high-frequency components from the image, working as a low-pass filter, the convolution of two Gaussian kernels is another Gaussian kernel. That is, for a Gaussian kernel of standard deviation $\sigma$, its self-convolution has standard deviation $\sigma\sqrt{2}$.
The application of convolution is diverse. Without addressing the theoretical formulations of it, here is a list of operations performed in our assignment:
There are two popular formulations for the composition of an image: (1) a pixel-based function and (2) a mathematical entity in a signal-basis space.
A pixel-based function view of the image maintains that one may imagine an image as some sort of function:
\[f(x, y) = \begin{bmatrix} r(x, y) \\ g(x, y) \\ b(x, y) \end{bmatrix}\]where $r(x, y)$, $g(x, y)$, and $b(x, y)$ are the red, green, and blue intensities at the point $(x, y)$.
The signal-istic view of an image, on the other hand, originates from empirical results like the Campbell-Robson contrast sensitivity curve, which converge on the idea that humans are sensitive to different frequencies to different degrees. In that case, formulating the image as a sum of many waves should enable us to decompose the image more efficiently along the human perception space (and generally, a biological perception space).
Each signal is built from fundamental blocks of the form:
\[A \sin(\omega x + \phi)\]as a possible member of the “signal basis”. It is hypothesized that with enough of these blocks added together, we can obtain any signal $f(x)$. Note, however, that each building block possesses three degrees of freedom: $A$ (amplitude), $\omega$ (frequency), and $\phi$ (phase). Particularly, the frequency encodes the fineness of this signal. The Fourier transform is an operation that separates an image into these building blocks (and obtains their coefficients).
The signal-istic view of an image introduces us to the idea that an image has a low-frequency component and a high-frequency component. High-frequency components refer to the finer details of an image where change occurs rapidly, such as the edges of an image; low-frequency components refer to the coarser details of an image, such as the general silhouette of a portrait. We have learned that the Gaussian filter helps us extract the low-frequency component of an image. Then, since the image is a sum of low-frequency and high-frequency signals, the signal-istic view suggests that subtracting the low-frequency components of an image away ought to provide us only the high-frequency signals: the sharp edges of an image and the subjects within it. Therefore, we can reinforce the edges of an image by adding the high-frequency signals: $f - f \star g$ onto the original image $f$.
However, it may occur that the high-frequency component of the image does not have a large enough magnitude, so we may want to add a scalar multiple of it rather than its original, forming the computation:
\[f + \alpha (f - f \star g)\]The mathematical properties of convolution allow us to summarize this sharpening action, called an unsharp mask filter, as an operation involving one single filter:
\[f \star ((1 + \alpha){\rm unit~impulse} - \alpha g)\]In the last assignment, we worked with a naive image pyramid that downscales images and upscales them back, allowing for a recursive problem-solving structure:
In a Gaussian pyramid, we apply this structure by attaching a Gaussian filter to the construction of each layer downwards, such that each subsequent layer experiences a low-pass filter and is also downscaled. Computations involving a prior layer, whose image is twice as large, require rescaling the smaller layer. The Gaussian pyramid therefore contains images under increasingly aggressive low-pass filtering, which select lower and lower frequency components. Consequently, subtracting consecutive layers grants us the range of frequency components between those layers, essentially creating what we call a bandpass filter (bandpass means middle-pass, as opposed to low-pass or high-pass).
Therefore, based on the above theory, we may construct a modified framework called the Laplacian pyramid, where each layer is the difference of the respective consecutive layers of the Gaussian pyramid. By this property, a Laplacian pyramid has one layer fewer than its Gaussian pyramid, so the final layer of the Laplacian pyramid is defined to be the last layer of the Gaussian pyramid.
To summarize, a Gaussian pyramid cascades several low-pass filters, splitting the signal into frequency bands. To obtain a bandpass pyramid rather than a low-pass pyramid, we subtract from each layer an upsampled version of the next (smaller) layer. This variation, called a Laplacian pyramid, applies bandpass filters that retain a band of higher frequencies at each level.
On the other hand, a Laplacian pyramid then contains images with only the bandpass frequencies. Usually, we find local structures stored in the Laplacian pyramid, which become coarser as we traverse into deeper layers. And because of the Laplacian pyramid’s property as consecutive differences, adding all of the images in the Laplacian pyramid to the lowest-frequency picture of the Gaussian pyramid recovers the original image.
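The construction and the recovery property can be sketched as follows. This is a simplified version under my own naming (downsampling by stride-2 slicing and upsampling via `scipy.ndimage.zoom` are illustrative choices, not necessarily the submission’s exact ones):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def laplacian_pyramid(img, levels=4, sigma=1.0):
    """Each layer holds one frequency band; the last layer is the final
    (lowest-frequency) Gaussian layer, so the pyramid sums back to img."""
    pyramid, current = [], img.astype(float)
    for _ in range(levels - 1):
        low = gaussian_filter(current, sigma)
        down = low[::2, ::2]                       # blur, then downscale by 2
        up = zoom(down, 2.0, order=1)[:current.shape[0], :current.shape[1]]
        pyramid.append(current - up)               # bandpass residual
        current = down
    pyramid.append(current)                        # low-pass residual layer
    return pyramid

def reconstruct(pyramid):
    """Collapse the pyramid: upsample and add, from coarse to fine."""
    rec = pyramid[-1]
    for lap in reversed(pyramid[:-1]):
        up = zoom(rec, 2.0, order=1)[:lap.shape[0], :lap.shape[1]]
        rec = up + lap
    return rec
```

Because each band layer stores exactly what the upsampling discarded, the reconstruction is exact up to floating-point error.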
For external readers, please refer to this slide deck from COMPSCI 180 of UC Berkeley for brilliantly illustrated intuitions and insights.
In this problem, we consider the cameraman image:

and we want to extract the edges of this image. Following the theoretical insights described in the previous section, two natural solutions stand out: (1) using a very-high-pass filter, or (2) using the gradient magnitude image. In this problem, we consider the latter.
Particularly, we experiment with three different methods: (1) naively using the gradient magnitude image, (2) applying a Gaussian filter first, then applying the gradient magnitude treatment, and (3) applying the convolution of the Gaussian filter and the image derivative operators, then obtaining the gradient magnitude.
Notably, by the mathematical properties of convolution, methods (2) and (3) are theoretically equivalent. Therefore, a portion of the work below will also concern the equivalence of outcomes between methods (2) and (3), in addition to the improvement both provide over (1) in reducing the high-frequency noise coming from the grassland background of cameraman.png.
Via methods described in the theory section, we arrive at the following picture:

Here, each subplot concerns a specific cutoff of the image. Particularly, the shown $f_{clip}$ of the gradient magnitude image $f$ is clipped such that all values above the specified $\alpha$ have pixel values set to 1, and all others to 0.
The original pixel values are shown in the subplot called Original.
Empirically, we see that the $\alpha$ treatment helps make the elicited result more concrete, but we also observe the noise it brings along. Therefore, we choose to first smooth out the noise by applying a low-pass filter (enacted as a Gaussian kernel and convolution), then apply our current technique.
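The equivalence between blur-then-differentiate (method 2) and a pre-combined derivative-of-Gaussian filter (method 3) can be sketched as below; the small Gaussian here is a hypothetical stand-in for the figure’s actual filters, and the two pipelines agree away from the image border:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(1)
img = rng.random((40, 40))

# A small separable Gaussian built as an outer product of 1D Gaussians.
x = np.arange(-6, 7)
g1 = np.exp(-x**2 / (2 * 2.0**2)); g1 /= g1.sum()
g = np.outer(g1, g1)

dx = np.array([[1.0, -1.0]])        # finite-difference filter (as convolution)

# Method (2): blur first, then differentiate.
blurred_dx = convolve2d(convolve2d(img, g, mode="same"), dx, mode="same")

# Method (3): pre-convolve the filters into one derivative-of-Gaussian (DoG).
dog = convolve2d(g, dx)
dog_dx = convolve2d(img, dog, mode="same")
```

By associativity the two results agree exactly in the interior; only pixels touched by the zero-padded border differ.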
Here, the Gaussian filters are created as an outer product of two 1-dimensional Gaussian filters, each with a width of 30 and a standard deviation denoted upon the figure.
Method (2)’s outcome is as attached below:

Meanwhile, method (3)’s outcome is as attached below:

and its filters are:

Here, we witness an empirical similarity between the results of methods (2) and (3), and also note the significant improvement in edge clarity and noise removal in the rightmost column of method (2) compared to the best outcome of method (1). Therefore, all requirements of the assignment’s problem are satisfied.
We directly apply the unsharp mask filter from the preliminaries section to every investigated image.
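A minimal sketch of this filter, directly following the $f + \alpha (f - f \star g)$ formulation (the function name and default values are my own):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def unsharp_mask(img, alpha=1.0, sigma=2.0):
    """f + alpha * (f - f * g): add back a scaled high-frequency residual."""
    low = gaussian_filter(img.astype(float), sigma)
    return img + alpha * (img - low)
```

Note that a perfectly flat image is left unchanged, since its high-frequency residual is zero.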
Let’s look at the proposed mask’s effect on some soft/blurry images:


Across all alphas, we see a reinforcement of higher-frequency details in the image. This trend persists in sharper images:

Hybrid images are (possibly cursed) pictures that appear as two different images at the same time. This is achieved by mixing the high-frequency component of one image with the low-frequency component of another, which can cause the image to look different at different viewing distances. Below, we demonstrate this operation with the example of a Derek-Nutmeg (catboy, or Nutrek, if not Dermeg, but preferably catboy).
The procedure of creating a hybrid image is outlined as follows:
Let us first discuss the catboy.

We may examine our catboy closer:

To verify the theoretical claims postulated above, let us look at the frequency maps of the low-frequency Derek variant and the high-frequency Nutmeg variant, attached below in the order they are addressed above:

Indeed, we observe that the high-frequency variant of Nutmeg retains a wide range of higher frequencies, while the low-frequency variant of the Derek image contains mostly low frequencies, except for some beams of higher frequencies.
Notably, low-frequency components are better off colored than not, as high-frequency components being colored will make them more dominant at a further distance.
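The hybrid construction itself can be sketched as follows; the two cutoff sigmas are illustrative tuning knobs, not the values used for the figures above:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hybrid(im_low, im_high, sigma_low=6.0, sigma_high=3.0):
    """Low frequencies of one image plus high frequencies of another.
    sigma_low / sigma_high set the two cutoff frequencies."""
    low = gaussian_filter(im_low.astype(float), sigma_low)
    high = im_high.astype(float) - gaussian_filter(im_high.astype(float), sigma_high)
    return low + high
```

In practice the two photos must be aligned (eyes to eyes, roughly) before mixing, or the illusion falls apart, as the failure case below shows.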
There are several failure cases we can discuss, but I’d look at this one:

In this one, it didn’t work because the high-frequency picture of the uncanny face (captured in the top row of the plot above) is hard to elicit with clarity, and the two images are also hard to align.
With the above techniques, let us compose two more hybrid images.
Choose a coefficient set of your preference from above to form your favorite Clown-rus. Qualitatively, Cyrus’s picture ends up adapted more to the clown picture than to the uncanny picture from the prior subsection.
On the other hand, let us picturize our deer Ryan.

Again, choose a coefficient set of your preference from above to form your favorite deer-ryan. This is inspired by the extensively discussed empirical similarity of Ryan and an anime deer (with some necessary rotations to his face):

Now you may understand why the section is titled this way. The images are not only unbearable to create, given their difficult processes, but also unbearable to see, making them a worthy exercise for both us writers and you readers. Notably, Ryan has posted a picture of my face blended onto the “financial support” meme, and on the other hand, Cyrus hasn’t started the assignment yet.
In this task, we concern ourselves with creating absurd entities by blending different images into one, via a Laplacian pyramid that effectively interleaves the details of two pictures with the help of a binary mask. The construction of such pyramids is detailed in the preliminary section at the beginning of this writeup. This submission used a pyramid, not a stack.
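A pyramid-based blend can be sketched as follows: blend the two images’ Laplacian layers under the mask’s Gaussian pyramid, then collapse the result. The helper names and the stride-2/`zoom` resampling choices are my own, not the submission’s exact code:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def _lap_pyramid(img, levels, sigma):
    pyr, cur = [], img.astype(float)
    for _ in range(levels - 1):
        down = gaussian_filter(cur, sigma)[::2, ::2]
        up = zoom(down, 2.0, order=1)[:cur.shape[0], :cur.shape[1]]
        pyr.append(cur - up)          # bandpass layer
        cur = down
    pyr.append(cur)                   # low-pass residual
    return pyr

def _gauss_pyramid(img, levels, sigma):
    pyr, cur = [img.astype(float)], img.astype(float)
    for _ in range(levels - 1):
        cur = gaussian_filter(cur, sigma)[::2, ::2]
        pyr.append(cur)
    return pyr

def blend(im_a, im_b, mask, levels=4, sigma=2.0):
    """Combine the Laplacian layers of the two images under the mask's
    Gaussian pyramid, then collapse the blended pyramid back up."""
    la = _lap_pyramid(im_a, levels, sigma)
    lb = _lap_pyramid(im_b, levels, sigma)
    gm = _gauss_pyramid(mask, levels, sigma)
    blended = [m * a + (1 - m) * b for m, a, b in zip(gm, la, lb)]
    out = blended[-1]
    for layer in reversed(blended[:-1]):
        out = zoom(out, 2.0, order=1)[:layer.shape[0], :layer.shape[1]] + layer
    return out
```

Smoothing the mask per level is what hides the seam: coarse layers get blended over a wide transition band, fine layers over a narrow one.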
In all images, we follow the following procedure to produce our products:
The mask of the oraple is simply crafted as a horizontal binary mask. Let us first review the layers of the Laplacian pyramids and the mask’s Gaussian pyramid below:

The blended image’s Laplacian layers, without summing, are:

Refer to the subtitle above each subplot to see what it depicts. Notably, the Laplacian layers have been normalized for visibility. Meanwhile, here are some results:

Let us consider another simple case of blended images with a diagonal binary mask, where I mix two Minecraft swords into one.


Now, let us consider some cases with irregular masks, where the masks are created based on the approach outlined in the first subsection of this section.
Let us first consider a penguin.


As you see, I blended a pen to a guin.
Now, time for deer Ryan again:


In this project, I learned the importance of a signal-istic view of images and its application to image processing. Such a view holds great potential for detecting and removing noise in diverse forms of data, potentially applicable to settings that require high data quality, such as deep and reinforcement learning.
Aligning the color channels of an image is a prevalent problem for existing image files where the saved R, G, and B channels are not well aligned. Specifically, an image might be offered as follows, where directly overlapping the three subfilms of the image results in a non-ideal picture. Several blog posts on the internet have addressed solutions to this matter, and witnessed successful results using classical methods with minimal linear algebra. In this blog post, we discuss the exercise of naive methods that can solve this problem, as well as the extent of the limitations of these naive methods, which are extracted from suggestions on the homework instructions page. Hopefully, the blog post is a quick write and a quick read, so I can go back and debug PPO afterwards.
In this blog post, there are three major pieces of knowledge that will be utilized throughout the methods introduced below: (1) the use of cosine similarity (not its extension, NCC), (2) a general method of exhaustive search for finding optimal displacements in smaller images, and (3) an image pyramid.
Cosine similarity is widely used across the literature to compare vectorized representations, and has recently been applied at scale to computing the semantic similarity of embeddings. A vital application of cosine similarity may be found in attention mechanisms. Witnessing this success, we will compare displaced images via cosine similarity.
An extension of cosine similarity, called Normalized Cross-Correlation (NCC), is often considered a superior choice. The metric is formulated as follows:
\[NCC(img_1, img_2) = \cos(img_1 - \mu_{img_1}, img_2 - \mu_{img_2})\]This metric proves useful for most images except emir, while picking the right hyperparameters for a search process with the cosine metric yields a more consistent alignment across images. Therefore, in this post, we only use cosine similarity as the metric for comparing images.
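The two metrics differ only by mean subtraction, as a short sketch makes clear (function names are my own):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two images, flattened into vectors."""
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def ncc(a, b):
    """NCC is cosine similarity after subtracting each image's mean."""
    return cos_sim(a - a.mean(), b - b.mean())
```

The mean subtraction makes NCC invariant to per-channel brightness offsets, which is exactly what plain cosine similarity lacks.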
To find an optimal (two-dimensional) displacement for small images, we can first fix a search space (say, offsets from $-15$ to $15$ pixels), and try each two-dimensional vector within the range $\{-15, \dots, 15\} \times \{-15, \dots, 15\}$. This means the search space scales quadratically with the number of offsets we want to investigate per dimension. In this sense, the method essentially solves the following optimization problem via exhaustive search:
\[\max_{x \in [-15, 15], y \in [-15, 15]} \cos({\rm offset}(img_1, x, y), img_2)\]Here, the ${\rm offset}$ function is implemented as an img1.roll(x, y, axis=(0, 1)) call. A cropping approach was attempted in the middle of the project, but replaced with the above option for a solution of better quality.
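The exhaustive search can be sketched as follows, with numpy’s `np.roll` standing in for the offset function (the helper names are my own):

```python
import numpy as np

def _cos(a, b):
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def align_single_scale(im, ref, radius=15):
    """Exhaustive search over [-radius, radius]^2 circular shifts,
    keeping the shift maximizing cosine similarity with the reference."""
    best, best_score = (0, 0), -np.inf
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            score = _cos(np.roll(im, (di, dj), axis=(0, 1)), ref)
            if score > best_score:
                best, best_score = (di, dj), score
    return best
```

The quadratic candidate count is harmless here, but it is precisely what motivates the pyramid approach for the full-resolution `.tif` scans.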
While the above method proves effective on smaller images (specifically, with each dimension being less than 400 pixels), it takes a large amount of time on larger images. Therefore, a school of methods employs the approach of first solving the displacement-finding problem on a small, downsampled version of the image, then adjusting the estimate of the optimal displacement as the base problem is solved and the image is upsampled. Particularly, the method can be phrased as an iterative procedure with the following steps:
Recursively, we may phrase the final found displacement with the following expression. Suppose the $(n-1)^{th}$ downscaled image finds a displacement of $(x_{n-1}, y_{n-1})$, and the local displacement found on the $n^{th}$ downscaled displaced image is $(x_{local, n}, y_{local, n})$; then the true displacement found at the $n^{th}$ layer of this process is $\left((x_{n-1} / f_{downscale}) + x_{local, n}, (y_{n-1} / f_{downscale}) + y_{local, n}\right)$.
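The recursion above, specialized to a downscale factor of $0.5$, can be sketched as follows (the single-scale helper is repeated so the snippet is self-contained; names and the hard-coded base-case size are my own choices):

```python
import numpy as np

def _cos(a, b):
    a, b = a.ravel().astype(float), b.ravel().astype(float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def align_single_scale(im, ref, radius):
    best, best_score = (0, 0), -np.inf
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            score = _cos(np.roll(im, (di, dj), axis=(0, 1)), ref)
            if score > best_score:
                best, best_score = (di, dj), score
    return best

def align_pyramid(im, ref, base_radius=20, refine_radius=1):
    """Coarse-to-fine: solve on a downscaled copy, double the estimate
    (the 1/f_downscale adjustment), then refine locally at this scale."""
    if max(im.shape) <= 64:                       # small enough: brute force
        return align_single_scale(im, ref, base_radius)
    coarse = align_pyramid(im[::2, ::2], ref[::2, ::2],
                           base_radius, refine_radius)
    guess = (2 * coarse[0], 2 * coarse[1])
    local = align_single_scale(np.roll(im, guess, axis=(0, 1)),
                               ref, refine_radius)
    return (guess[0] + local[0], guess[1] + local[1])
```

Each level only searches a tiny $[-1, 1]^2$ refinement window, so the total cost stays near that of the small base problem.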
The overall flow of our solution is as illustrated in the following figure:
First, provided a .jpg file containing all three channels of the image separately, we separate the channels of the image by slicing it into three parts with equivalent heights. Then, we crop the edges of the image away to avoid noisy information by taking $10\%$ of each dimension’s end away. Next, we perform the small single-scale alignment procedure over a problem space of $[-15, 15] \times [-15, 15]$ pixels as stated in our Background section. Upon finding the appropriate displacements, we roll the images according to the found solution to recreate the original image.
A survey of the provided image’s outcomes can be seen as follows:

The overall flow of our solution is as illustrated in the following figure:
First, we separate the provided .tif image file into three subparts of equivalent height, as in the single-scale approach. Then, we follow the image pyramid approach with a problem size of $[-20, 20] \times [-20, 20]$ pixels in the base problem, and $[-1, 1] \times [-1, 1]$ pixels for subsequent upscaled layers. The downscaling factor is $0.5$. Note that an empirical test was performed for downscaling factors from $0.25$ to $0.95$, and the chosen value performed most consistently across the desired set of images.
The results are exhibited as follows:

Here is a table of the found displacements that produce the images above:
| Name | Red Displacement | Green Displacement |
|---|---|---|
| Cathedral* | (12, 3) | (5, 2) |
| Monastery* | (3, 2) | (-3, 2) |
| Tobolsk* | (6, 3) | (3, 3) |
| Church** | (66, -7) | (33, 7) |
| Emir** | (108, 66) | (58, 33) |
| Harvesters** | (134, 24) | (66, 24) |
| Icon** | (100, 33) | (49, 24) |
| Lady** | (125, 9) | (58, 16) |
| Melons** | (177, 16) | (92, 15) |
| Onion Church** | (117, 41) | (58, 33) |
| Sculpture** | (151, -25) | (41, -8) |
| Self Portrait** | (177, 41) | (84, 33) |
| Three Generations** | (117, 16) | (58, 24) |
| Train** | (100, 41) | (49, 16) |
*: Obtained via single-scale; **: Obtained via multi-scale.
In this blog post, we successfully replicated preliminary methods for color channel alignment. I will now go back and debug PPO :’)