<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.2">Jekyll</generator><link href="https://advancedcomputervision.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://advancedcomputervision.github.io/" rel="alternate" type="text/html" /><updated>2022-10-29T16:24:02+00:00</updated><id>https://advancedcomputervision.github.io/feed.xml</id><title type="html">Advanced Computer Vision Meetup</title><subtitle>We meet and discuss CV papers, mainly in the area of ML, and sometimes interesting off-topic papers. We also meet and do interesting projects. You are always welcome to join any of those.
</subtitle><author><name>GitHub User</name><email>your-email@domain.com</email></author><entry><title type="html">Session 14: Instant Neural Graphics Primitives</title><link href="https://advancedcomputervision.github.io/misc/2022/10/27/instant-nerf.html" rel="alternate" type="text/html" title="Session 14: Instant Neural Graphics Primitives" /><published>2022-10-27T00:00:00+00:00</published><updated>2022-10-27T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/10/27/instant-nerf</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/10/27/instant-nerf.html">&lt;h1 id=&quot;instant-neural-graphics-primitives-with-a-multiresolution-hash-encoding&quot;&gt;Instant Neural Graphics Primitives with a Multiresolution Hash Encoding&lt;/h1&gt;

&lt;p&gt;In this session, &lt;a href=&quot;https://www.linkedin.com/in/moritz-hambach-277326b0&quot;&gt;Moritz Hambach&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/in/rishabhraj17/&quot;&gt;Rishabh Raj&lt;/a&gt; will be leading the conversation. They will explain the main parts and logic behind the paper, show some demos, and help moderate the session. These events are informal conversations where everyone has the opportunity to ask questions, request clarifications, or contribute.&lt;/p&gt;

&lt;p&gt;We welcome all levels. We strongly recommend reading the paper beforehand.&lt;/p&gt;

&lt;p&gt;Links:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Project: &lt;a href=&quot;https://github.com/NVlabs/instant-ngp&quot;&gt;GitHub - NVlabs/instant-ngp: Instant neural graphics primitives: lightning fast NeRF and more&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Paper: &lt;a href=&quot;https://nvlabs.github.io/instant-ngp/assets/mueller2022instant.pdf&quot;&gt;https://nvlabs.github.io/instant-ngp/assets/mueller2022instant.pdf&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Slides: &lt;a href=&quot;/assets/instant_nerf/Nvidia_ngp.pdf&quot;&gt;Link&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;about-the-hosts&quot;&gt;About the Hosts&lt;/h1&gt;

&lt;p&gt;Moritz works on Computer Vision and Perception for autonomous vehicles at Kopernikus. He has a PhD in Physics and loves improving his geometric understanding and intuition of Deep Learning, CV, 3D representations, and the corresponding algorithms (ideally unsupervised).&lt;/p&gt;

&lt;p&gt;Rishabh is a Machine Learning and Computer Vision engineer working at Kopernikus Automotive. He holds a master’s degree from TUM and has prior industry and research experience in Detection, Segmentation, Tracking, and Motion Prediction. He has a passion for CV and ML. His interest is in Scene Understanding using Deep Learning and Reinforcement Learning.&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">Instant Neural Graphics Primitives with a Multiresolution Hash Encoding In this session, Moritz Hambach and Rishabh Raj will be leading the conversation. They will explain the main parts and logic behind the paper, show some demos, and help moderate the session. These events are informal conversations where everyone has the opportunity to ask questions, request clarifications, or contribute. We welcome all levels. We strongly recommend reading the paper beforehand. Links: Project: GitHub - NVlabs/instant-ngp: Instant neural graphics primitives: lightning fast NeRF and more Paper: https://nvlabs.github.io/instant-ngp/assets/mueller2022instant.pdf Slides: Link About the Hosts Moritz works on Computer Vision and Perception for autonomous vehicles at Kopernikus. He has a PhD in Physics and loves improving his geometric understanding and intuition of Deep Learning, CV, 3D representations, and the corresponding algorithms (ideally unsupervised). Rishabh is a Machine Learning and Computer Vision engineer working at Kopernikus Automotive. He holds a master’s degree from TUM and has prior industry and research experience in Detection, Segmentation, Tracking, and Motion Prediction. He has a passion for CV and ML. His interest is in Scene Understanding using Deep Learning and Reinforcement Learning.</summary></entry><entry><title type="html">Session 13: EPro-PnP</title><link href="https://advancedcomputervision.github.io/misc/2022/10/25/epro.html" rel="alternate" type="text/html" title="Session 13: EPro-PnP" /><published>2022-10-25T00:00:00+00:00</published><updated>2022-10-25T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/10/25/epro</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/10/25/epro.html">&lt;h1 id=&quot;epro-pnp-generalized-end-to-end-probabilistic-perspective-n-points-for-monocular-object-pose-estimation&quot;&gt;EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation&lt;/h1&gt;

&lt;p&gt;In this session, Matthes Krull will be leading the conversation. In the 2022 paper EPro-PnP, the authors showed how their continuous PnP layer can beat other 6DoF approaches without the use of a fancy network design, and we will dig deeper into the reasons why. The algorithm is highly mathematical, but every step they take is well reasoned and crucial to its outstanding performance, as they’ve shown in the ablation study.&lt;/p&gt;

&lt;p&gt;In brief, several algorithms are combined and applied on top of a backbone, which outputs: A) 2D-3D correspondences and B) weights for those correspondences. First, a decoupled PnP solution is found by a Levenberg-Marquardt (LM)-inspired solver. Second, this intermediate solution is used in a Monte Carlo approach to find a continuous pose distribution around this “local” solution (which is parameterized by our 2D-3D correspondences).&lt;/p&gt;

&lt;p&gt;Here, they also use the Adaptive Multiple Importance Sampling (AMIS) algorithm to iteratively refine that pose distribution. Third and finally, this distribution is used to optimize the network by randomly sampling N poses from it and computing the gradient according to their reprojection errors.&lt;/p&gt;
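The pose-scoring step described above can be sketched roughly as follows. This is our own illustrative Python toy, not the authors' implementation; the pinhole intrinsics f, c and all function names are assumptions we made for the sketch.

```python
# Illustrative toy of the Monte Carlo pose-scoring step described above.
# NOT the authors' code: intrinsics (f, c) and all names are assumptions.
import numpy as np

def reproject(points_3d, pose, f=500.0, c=250.0):
    """Pinhole projection of Nx3 world points with pose = (R, t)."""
    R, t = pose
    cam = points_3d @ R.T + t               # world frame to camera frame
    return f * cam[:, :2] / cam[:, 2:3] + c

def pose_scores(points_3d, points_2d, weights, poses):
    """Unnormalised likelihood of each candidate pose from its
    weighted reprojection error (the quantity AMIS would refine)."""
    scores = []
    for pose in poses:
        err = reproject(points_3d, pose) - points_2d
        cost = np.sum(weights * np.sum(err ** 2, axis=1))
        scores.append(np.exp(-cost))
    return np.asarray(scores)
```

In the paper this scoring is what turns the point estimate of the LM solver into a full distribution: candidate poses drawn near the solution are reweighted by their reprojection likelihood.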

&lt;p&gt;Here are the slides and videos shown in the session:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/epro/epro.png&quot; alt=&quot;Info&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Videos:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://youtu.be/V3ZehIJ9C3E&quot;&gt;https://youtu.be/V3ZehIJ9C3E&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://youtu.be/1pR44nmp0yc&quot;&gt;https://youtu.be/1pR44nmp0yc&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://youtu.be/b9FgOvxFAdg&quot;&gt;https://youtu.be/b9FgOvxFAdg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation. In this session, Matthes Krull will be leading the conversation. In the 2022 paper EPro-PnP, the authors showed how their continuous PnP layer can beat other 6DoF approaches without the use of a fancy network design, and we will dig deeper into the reasons why. The algorithm is highly mathematical, but every step they take is well reasoned and crucial to its outstanding performance, as they’ve shown in the ablation study. In brief, several algorithms are combined and applied on top of a backbone, which outputs: A) 2D-3D correspondences and B) weights for those correspondences. First, a decoupled PnP solution is found by a Levenberg-Marquardt (LM)-inspired solver. Second, this intermediate solution is used in a Monte Carlo approach to find a continuous pose distribution around this “local” solution (which is parameterized by our 2D-3D correspondences). Here, they also use the Adaptive Multiple Importance Sampling (AMIS) algorithm to iteratively refine that pose distribution. Third and finally, this distribution is used to optimize the network by randomly sampling N poses from it and computing the gradient according to their reprojection errors. Here are the slides and videos shown in the session: Videos: https://youtu.be/V3ZehIJ9C3E https://youtu.be/1pR44nmp0yc https://youtu.be/b9FgOvxFAdg</summary></entry><entry><title type="html">Building a Radio controlled car</title><link href="https://advancedcomputervision.github.io/misc/2022/10/15/rc-car.html" rel="alternate" type="text/html" title="Building a Radio controlled car" /><published>2022-10-15T00:00:00+00:00</published><updated>2022-10-15T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/10/15/rc-car</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/10/15/rc-car.html">&lt;p&gt;&lt;img src=&quot;/assets/rccar/rccar.jpeg&quot; alt=&quot;rccar&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;updates&quot;&gt;Updates:&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Session 2 (22.10.22):&lt;/strong&gt; We need to fix:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;NN&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Run new architecture&lt;/li&gt;
      &lt;li&gt;Start with the dataloader&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Sensing&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Run on the Jetson and get 20 FPS with at most 100 ms latency&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Control&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Move the controller to a Raspberry Pi + PS4 controller&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Session 1 (15.10.22):&lt;/strong&gt; Mainly defining who will be doing what&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note 0:&lt;/strong&gt; Thanks to &lt;a href=&quot;https://www.kopernikusauto.com/&quot;&gt;Kopernikus Automotive&lt;/a&gt; for sponsoring this project and the Meetup.&lt;/p&gt;

&lt;p&gt;For more updates, please check the Discord channel.&lt;/p&gt;

&lt;h1 id=&quot;general-description&quot;&gt;General Description&lt;/h1&gt;

&lt;p&gt;BTW, the picture above is the car we’re building.&lt;/p&gt;

&lt;p&gt;We’re thinking about joining the race of the &lt;a href=&quot;https://www.meetup.com/autonomous-robots-berlin/&quot;&gt;autonomous driving + robots meetup berlin&lt;/a&gt;, and for that we need to build an RC car. Everyone in the team has something they want to experiment with, for example, a general end-to-end driving approach capable of working with on-board and off-board cameras. Everyone is welcome; we always need more hands.&lt;/p&gt;

&lt;p&gt;If you’re interested in helping us, please join the event! We have some parts of the car already built, but we still need to finish that + all the perception parts.&lt;/p&gt;

&lt;p&gt;If you’re interested, let us know! We welcome everyone who is interested in learning and/or helping us. We will usually meet in Berlin, but sometimes online, depending on the topic.&lt;/p&gt;

&lt;p&gt;We will be posting updates of progress here.&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html"></summary></entry><entry><title type="html">Session 12: Text to image generation with diffusion models</title><link href="https://advancedcomputervision.github.io/misc/2022/09/15/diffusion_models.html" rel="alternate" type="text/html" title="Session 12: Text to image generation with diffusion models" /><published>2022-09-15T00:00:00+00:00</published><updated>2022-09-15T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/09/15/diffusion_models</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/09/15/diffusion_models.html">&lt;h1 id=&quot;text-to-image-generation-with-diffusion-models&quot;&gt;Text to image generation with diffusion models&lt;/h1&gt;

&lt;p&gt;This talk is slightly different because we will be talking about multiple papers, so there is no need to have read them beforehand.&lt;/p&gt;

&lt;p&gt;In this session &lt;a href=&quot;https://www.linkedin.com/in/afcruzs&quot;&gt;Felipe Cruz&lt;/a&gt; will talk about how diffusion models work, using DALL-E 2 specifically as a use case, and touch on the differences from newer models (Imagen, Stable Diffusion, Parti, etc.).&lt;/p&gt;

&lt;p&gt;Felipe is a research engineer at Aleph Alpha working on novel methods to improve large pre-trained models, both with and without scaling them up to billions of parameters and beyond. Previously, he worked at Microsoft researching how to scale up multilingual models for machine translation, and he was also part of the Cortana team. He got his master’s in computer science from the University of Washington.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slides are here: &lt;a href=&quot;/assets/DM.pdf&quot;&gt;PDF&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">Text to image generation with diffusion models This talk is slightly different because we will be talking about multiple papers, so there is no need to have read them beforehand. In this session Felipe Cruz will talk about how diffusion models work, using DALL-E 2 specifically as a use case, and touch on the differences from newer models (Imagen, Stable Diffusion, Parti, etc.). Felipe is a research engineer at Aleph Alpha working on novel methods to improve large pre-trained models, both with and without scaling them up to billions of parameters and beyond. Previously, he worked at Microsoft researching how to scale up multilingual models for machine translation, and he was also part of the Cortana team. He got his master’s in computer science from the University of Washington. Slides are here: PDF</summary></entry><entry><title type="html">Session 11: YOLOX in depth</title><link href="https://advancedcomputervision.github.io/misc/2022/07/15/yolox_in_depth.html" rel="alternate" type="text/html" title="Session 11: YOLOX in depth" /><published>2022-07-15T00:00:00+00:00</published><updated>2022-07-15T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/07/15/yolox_in_depth</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/07/15/yolox_in_depth.html">&lt;h1 id=&quot;discussing-yolox-in-depth&quot;&gt;Discussing YOLOX in depth&lt;/h1&gt;

&lt;p&gt;This is a second session on the previous paper, to discuss its specific improvements in depth. We discussed the main improvements, the reasons why they work, and their pros and cons.&lt;/p&gt;

&lt;p&gt;The main improvements we touched on were:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Strong data augmentation&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Decoupled heads&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;End-to-end* (removing NMS)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Anchor free&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Multiple positive samples&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Optimal Transport Assignment&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;3x3 Center sampling&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;IoU on regression&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
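Two of the points above, anchor-free box decoding and 3x3 center sampling, can be sketched roughly as follows. This is our own toy numpy illustration with assumed tensor layouts and names, not the official YOLOX code.

```python
# Toy numpy sketch of anchor-free decoding and 3x3 center sampling.
# NOT the official YOLOX code; tensor layout and names are assumptions.
import numpy as np

def decode(pred, stride):
    """pred: (H, W, 4) holding (dx, dy, log_w, log_h) per grid cell."""
    H, W, _ = pred.shape
    ys, xs = np.mgrid[0:H, 0:W]
    cx = (xs + pred[..., 0]) * stride      # box centre in pixels
    cy = (ys + pred[..., 1]) * stride
    w = np.exp(pred[..., 2]) * stride      # size without anchor priors
    h = np.exp(pred[..., 3]) * stride
    return np.stack([cx, cy, w, h], axis=-1)

def center_mask(gt_cx, gt_cy, stride, H, W, radius=1.5):
    """Positive cells: the 3x3 block of cells around the GT centre."""
    ys, xs = np.mgrid[0:H, 0:W]
    dx = np.abs((xs + 0.5) * stride - gt_cx)
    dy = np.abs((ys + 0.5) * stride - gt_cy)
    dist = np.maximum(dx, dy)              # Chebyshev distance to GT
    return (radius * stride - dist) > 0
```

The key point is that each grid cell directly regresses its own box (no anchor shapes to tune), and several cells around each ground-truth centre become positive samples, which enriches the training signal.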

&lt;p&gt;YOLOX is a great paper because it groups a lot of new insights in Single Stage Detectors (SSD), but it is also a bad paper because it does not explain any of the concepts it uses nor propose anything new.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Presentation here: &lt;a href=&quot;/assets/YOLOX.pdf&quot;&gt;PDF&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">Discussing YOLOX in depth This is a second session on the previous paper, to discuss its specific improvements in depth. We discussed the main improvements, the reasons why they work, and their pros and cons. The main improvements we touched on were: Strong data augmentation Decoupled heads End-to-end* (removing NMS) Anchor free Multiple positive samples Optimal Transport Assignment 3x3 Center sampling IoU on regression YOLOX is a great paper because it groups a lot of new insights in Single Stage Detectors (SSD), but it is also a bad paper because it does not explain any of the concepts it uses nor propose anything new. Presentation here: PDF</summary></entry><entry><title type="html">Session 10: Discussing YOLOX and YOLOR</title><link href="https://advancedcomputervision.github.io/misc/2022/03/06/yolox.html" rel="alternate" type="text/html" title="Session 10: Discussing YOLOX and YOLOR" /><published>2022-03-06T00:00:00+00:00</published><updated>2022-03-06T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/03/06/yolox</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/03/06/yolox.html">&lt;h1 id=&quot;discussing-yolox-and-yolor&quot;&gt;Discussing YOLOX and YOLOR&lt;/h1&gt;

&lt;p&gt;Regarding YOLOX, the general agreement is that the paper is great. Its main added value is grouping the latest improvements in object detection for single-shot detection (SSD). We personally think the most important contributions were removing anchors and NMS and the approach of improving the training signal with multiple positives. It is hard to write down all the positive effects that these changes bring.&lt;/p&gt;

&lt;p&gt;Papers that would be interesting for understanding specific points are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;“OTA: Optimal transport assignment for object detection.”&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;“Object detection made simpler by eliminating heuristic NMS.”&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;“FCOS: Fully convolutional one-stage object detection”&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There was mainly one conclusion from YOLOR: the idea is interesting and should be explored further by the community, but the execution was mediocre. Communication was poor, and the paper is missing a lot of important details needed for a proper conversation on this topic.&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">Discussing YOLOX and YOLOR Regarding YOLOX, the general agreement is that the paper is great. Its main added value is grouping the latest improvements in object detection for single-shot detection (SSD). We personally think the most important contributions were removing anchors and NMS and the approach of improving the training signal with multiple positives. It is hard to write down all the positive effects that these changes bring. Papers that would be interesting for understanding specific points are: “OTA: Optimal transport assignment for object detection.” “Object detection made simpler by eliminating heuristic NMS.” “FCOS: Fully convolutional one-stage object detection” There was mainly one conclusion from YOLOR: the idea is interesting and should be explored further by the community, but the execution was mediocre. Communication was poor, and the paper is missing a lot of important details needed for a proper conversation on this topic.</summary></entry><entry><title type="html">Using Group Theory to Solve the Rubik’s cube</title><link href="https://advancedcomputervision.github.io/misc/2022/01/15/Rubik.html" rel="alternate" type="text/html" title="Using Group Theory to Solve the Rubik’s cube" /><published>2022-01-15T00:00:00+00:00</published><updated>2022-01-15T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/01/15/Rubik</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/01/15/Rubik.html">&lt;p&gt;One might think that everything about the Rubik’s cube has been said since its invention in the second half of the 20th century: detailed instructions to follow and solve it, variants of the original \(3\times 3 \times 3\) version, computer-vision aided systems to read in a cube state and solve it physically by making use of robotics, and so on. We want to &lt;em&gt;understand&lt;/em&gt; how it works by describing precisely what happens when a certain operation is applied to it and how operations affect its state in general. In the end, we seek an explicit closed formula or algorithm for solving it that should be a consequence of these properties.&lt;/p&gt;

&lt;p&gt;The implicit geometrical symmetries of the cube seen as an object in space, the cyclic nature of operations, and the fact that cube state transitions are permutations of its forming sub-cubes motivate us to express its properties in the language of group theory. This will also allow us to have a mathematical perspective on the puzzle and apply any classical result from the theory when appropriate. We haven’t been exposed to any similar approach, so we believe our development of the solution here is original and hopefully also useful for any further study.&lt;/p&gt;
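The cyclic nature of operations mentioned above can be made concrete with a tiny toy example, treating a move as a permutation of positions. This is our own illustration, not code from the project.

```python
# Toy illustration (ours): a cube move acts as a permutation of
# positions, and composing it with itself returns to the identity.
def compose(p, q):
    """Apply permutation q first, then p (both as index tuples)."""
    return tuple(p[i] for i in q)

identity = (0, 1, 2, 3)
turn = (3, 0, 1, 2)             # a quarter turn acting as a 4-cycle

state = identity
for _ in range(4):
    state = compose(turn, state)
assert state == identity        # the move has order 4
```

On the real cube each face turn is a much larger permutation of stickers, but the same group-theoretic facts (orders of elements, commutators, generated subgroups) apply unchanged.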

&lt;p&gt;If you are interested in this and in having this kind of discussion, join us and this project on Discord ;)&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">One might think that everything about the Rubik’s cube has been said since its invention in the second half of the 20th century: detailed instructions to follow and solve it, variants of the original \(3\times 3 \times 3\) version, computer-vision aided systems to read in a cube state and solve it physically by making use of robotics, and so on. We want to understand how it works by describing precisely what happens when a certain operation is applied to it and how operations affect its state in general. In the end, we seek an explicit closed formula or algorithm for solving it that should be a consequence of these properties.</summary></entry><entry><title type="html">Tilt5</title><link href="https://advancedcomputervision.github.io/misc/2022/01/15/tilt5.html" rel="alternate" type="text/html" title="Tilt5" /><published>2022-01-15T00:00:00+00:00</published><updated>2022-01-15T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/01/15/tilt5</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/01/15/tilt5.html">&lt;p&gt;&lt;img src=&quot;https://assets-global.website-files.com/60d41442f203961c4600ff57/61afaa2e60f891891470e422_0358-TiltFive_OGImage2.jpg&quot; alt=&quot;Tilt5banner&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;updates&quot;&gt;Updates:&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Note 2 (Oct 2022):&lt;/strong&gt; We have received confirmation that we will be receiving the glasses around January.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note 1 (June 2021):&lt;/strong&gt; We have ordered a Tilt5 headset to understand better how it works. If you are interested in helping us with the project or doing something related to Tilt5, let us know. We are mainly located in Berlin, Germany. You can contact us on our Meetup page or Discord server (links in the footer).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note 0:&lt;/strong&gt; Thanks to &lt;a href=&quot;https://www.kopernikusauto.com/&quot;&gt;Kopernikus Automotive&lt;/a&gt; for sponsoring this project and the Meetup.&lt;/p&gt;

&lt;h1 id=&quot;general-description&quot;&gt;General Description&lt;/h1&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.tiltfive.com/&quot;&gt;Tilt5&lt;/a&gt; system uses very clever light filters to achieve a good FOV for AR glasses using LIDOR projectors and a retro-reflective board. Apart from this, their localization takes advantage of the configuration of the whole system, mainly the planarity of the board, to localize the glasses (projectors), which slightly simplifies localization. In this project we try to replicate their approach from zero, building our own hardware (an experimental version of only the projection part) and our own localization software. We leave hardware optimisation aside because we are mainly interested in computer vision and because we don’t have too many hands.&lt;/p&gt;

&lt;p&gt;We focus on these two aspects (hardware and localization) because we need the first to test the second, and the second is a very important topic in computer vision. Nonetheless, we are also curious about how easy it will be to implement a similar approach and what else it can be used for.&lt;/p&gt;

&lt;p&gt;The following diagram is a simplification of how the projection works:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/tilt5/tilt5.svg&quot; alt=&quot;tilt5&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We will be posting updates of progress here.&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html"></summary></entry><entry><title type="html">Two-week challenge</title><link href="https://advancedcomputervision.github.io/misc/2022/01/13/2week.html" rel="alternate" type="text/html" title="Two-week challenge" /><published>2022-01-13T00:00:00+00:00</published><updated>2022-01-13T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/01/13/2week</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/01/13/2week.html">&lt;p&gt;We took the challenge of reading one paper a day for two weeks, and here is the list of the papers we all read. To be exact, it was around one paper per day on average, and we didn’t need to read the same papers; some people shared papers, but in general we were fully free to choose.&lt;/p&gt;

&lt;p&gt;After the two weeks, we held short meetings to share a very short summary of each of the papers.&lt;/p&gt;

&lt;p&gt;Here is the list of all the papers discussed:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/princeton-vl/droid-slam&quot;&gt;DROID-SLAM&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ISEE-Technology/CamVox&quot; title=&quot;https://github.com/ISEE-Technology/CamVox&quot;&gt;GitHub - ISEE-Technology/CamVox: [ICRA2021] A low-cost SLAM system based on camera and Livox lidar.&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A Review of Visual-LiDAR Fusion based Simultaneous Localization and Mapping&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2111.06377&quot;&gt;Masked Autoencoders are Scalable Vision Learners&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1909.03459&quot;&gt;Blind Geometric Distortion Correction on Images Through Deep Learning&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2110.09482&quot;&gt;Self-supervised Monocular Depth Estimation with internal feature fusion&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2011.12104&quot;&gt;Recurrent Multi-View alignment network for unsupervised surface registration&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2106.11958&quot;&gt;Prototypical cross-attention network for multiple object tracking and segmentation&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.matthewtancik.com/nerf&quot;&gt;NeRF&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2103.15875&quot;&gt;In-place scene labelling and understanding with implicit scene representation&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://nerf-w.github.io/&quot;&gt;NeRF in the Wild&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://users.eecs.northwestern.edu/~asb479/papers/bmvc_2018.pdf&quot;&gt;Recurrent Multiframe single shot detector for video object detection&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://zju3dv.github.io/loftr/&quot;&gt;LoFTR: Detector-Free Local Feature Matching with Transformers&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2104.11487&quot;&gt;Skip-Convolutions for Efficient Video Processing&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://nvlabs.github.io/face-vid2vid/&quot;&gt;One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2003.06409&quot;&gt;Probabilistic Future Prediction for Video Scene Understanding&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1912.00177&quot;&gt;Urban Driving with Conditional Imitation Learning&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://wayve.ai/blog/fiery-future-instance-prediction-birds-eye-view/&quot;&gt;FIERY&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1506.02025&quot;&gt;Spatial Transformer Networks&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/darglein/ADOP&quot;&gt;ADOP&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1811.10119&quot;&gt;Variational End-to-End Navigation and Localization&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1906.08240&quot;&gt;Neural Point-Based Graphics&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;ORB-SLAM2&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1502.00956&quot;&gt;ORB-SLAM: a Versatile and Accurate Monocular SLAM System&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://alexyu.net/plenoctrees/&quot;&gt;Plen-octrees&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2008.05711&quot;&gt;Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2002.08394&quot;&gt;MonoLayout: Amodal scene layout from a single image&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1811.08188&quot;&gt;Orthographic Feature Transform for Monocular 3D Object Detection&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://johnthickstun.com/docs/transformers.pdf&quot;&gt;The Transformer Model in Equations&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2012.00152&quot;&gt;Every Model Learned by Gradient Descent Is Approximately a Kernel Machine&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2003.06754&quot;&gt;MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird’s Eye View Maps&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1912.02908&quot;&gt;Why Having 10,000 Parameters in Your Camera Model is Better Than Twelve&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2103.15691&quot;&gt;ViViT: A Video Vision Transformer&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://iliad-project.eu/wp-content/uploads/papers/iros_se_ndt.pdf&quot;&gt;Semantic-assisted 3D Normal Distributions Transform for scan registration in environments with limited structure&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Discrete Kalman Filter Tutorial&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1806.01261&quot;&gt;Relational inductive biases, deep learning, and graph networks (in progress)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2011.15091&quot;&gt;Inductive Biases for Deep Learning of Higher-Level Cognition&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1911.01547&quot;&gt;On the Measure of Intelligence&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2106.07643&quot;&gt;Unsupervised Learning of Visual 3D Keypoints for Control&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2102.05176&quot;&gt;Transfer learning based few-shot classification using optimal transport mapping from preprocessed latent space of backbone neural network&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1703.03400&quot;&gt;Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://papers.nips.cc/paper/2017/hash/cb8da6767461f2812ae4290eac7cbc42-Abstract.html&quot;&gt;Prototypical Networks for Few-shot Learning&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">We took the challenge of reading one paper a day for two weeks and here is the list of the papers we all read. For the sake of exactness, it was around 1 paper per day on average and we didn’t need to read the same papers, some people shared some papers but in general, we were fully free to choose.</summary></entry><entry><title type="html">Transformers from zero</title><link href="https://advancedcomputervision.github.io/misc/2021/12/20/transformers.html" rel="alternate" type="text/html" title="Transformers from zero" /><published>2021-12-20T00:00:00+00:00</published><updated>2021-12-20T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2021/12/20/transformers</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2021/12/20/transformers.html">&lt;p&gt;This is a quick implementation of transformers that a friend of mine (link) and I wrote from scratch, without guidance from the paper, just to practice and better understand transformers. This is our own explanation of transformers, so the notation and the order of some variables may differ slightly from the paper.&lt;/p&gt;

&lt;p&gt;The code can be found at &lt;a href=&quot;https://github.com/NotAnyMike/transformer&quot;&gt;github.com/NotAnyMike/transformer&lt;/a&gt;.&lt;/p&gt;


&lt;h2 id=&quot;general&quot;&gt;General&lt;/h2&gt;

&lt;p&gt;There are mainly three parts to this. Transformers start with the self-attention mechanism; scaling it into multiple heads gives the multi-head attention, and finally, by stacking these together, we get a transformer. The sections below are organized in those three steps.&lt;/p&gt;

&lt;h2 id=&quot;1-attention-architecture&quot;&gt;1. Attention architecture&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/assets/transformer/attention.png&quot; alt=&quot;drawing&quot; height=&quot;400&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Let’s assume we have an input of \(n\) elements of length \(l\)&lt;/p&gt;

\[Q = V = K = \{q_{i,j}\} \in \mathbb R^{n \times l}\]

\[Q K^T =  S \in \mathbb R^{n \times n}\]

&lt;p&gt;\(S = \{s_{i,j}\}\) is basically the similarity matrix of the elements (how similar element \(i\) is to element \(j\)). The more similar two elements are, the larger their product will be, and vice versa.&lt;/p&gt;

&lt;p&gt;Here element \(j\) means the row \(e_j = \{q_{j,i}\}\) with \(i\in [1,2,3,...,l]\), i.e. one vector of length \(l\).&lt;/p&gt;

&lt;p&gt;Therefore it is the same as&lt;/p&gt;

\[s_{a,b} = e_a \cdot e_b = |e_a||e_b|\cos \theta\]

&lt;p&gt;It is just a way of measuring how alike two vectors are.&lt;/p&gt;
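&lt;p&gt;A quick numeric check of this idea (values chosen for illustration): aligned vectors produce a large dot product, while orthogonal vectors produce zero.&lt;/p&gt;

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 2.0 * a                      # same direction as a: cos(theta) = 1
c = np.array([-2.0, 1.0, 0.0])   # orthogonal to a: cos(theta) = 0

print(a @ b)   # 28.0: large positive similarity
print(a @ c)   # 0.0: no similarity
```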

&lt;p&gt;We turn each entry into a probability with a softmax applied over each row or each column, depending on the order of multiplication we follow; for simplicity, we will apply it over the rows.&lt;/p&gt;

\[\text{softmax}( S)\]

&lt;p&gt;Self-attention is then simply the original values of \(V\) weighted by the result of the softmax&lt;/p&gt;

\[\text{softmax}( S)^T  V\]

&lt;p&gt;One extra small thing we can do to improve training and convergence is to keep the gradient from tending to zero (similar to the vanishing-gradient problem): remember that the softmax saturates as the absolute values of its inputs grow. One way to avoid this is to keep \(S\) from growing too large, which we can do by controlling its variance.&lt;/p&gt;

&lt;p&gt;Let’s assume \(\text{var} (q_{i,j}) = \sigma^2\). Because the matrix multiplication that produces \(S\) sums over \(l\) products of elements, the variance of each entry grows to roughly \(l\) times \(\sigma^2\). To keep the entries from growing with \(l\), we scale \(S\) down by \(\sqrt l\), so the variance is scaled down by the same factor it gained from the matrix multiplication. The scaled attention equation therefore becomes&lt;/p&gt;

\[\text{softmax}\left(\frac{ S}{\sqrt l} \right)^TV\]
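&lt;p&gt;Putting the pieces so far together, a minimal NumPy sketch of scaled self-attention (function names are our own, assuming \(Q = K = V = X\)) might look like:&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Scaled self-attention with Q = K = V = X, where X has shape (n, l)."""
    n, l = X.shape
    S = X @ X.T / np.sqrt(l)   # (n, n) similarity matrix, scaled down by sqrt(l)
    W = softmax(S, axis=-1)    # each row becomes a probability distribution
    return W @ X               # each output is a weighted mix of the n elements

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 64))   # n = 5 elements of length l = 64
A = self_attention(X)
print(A.shape)                     # (5, 64): same shape as the input
```

Without the `np.sqrt(l)` factor, the entries of `S` would grow with `l` and push the softmax into its saturated region, which is exactly the variance argument above.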

&lt;h3 id=&quot;masking-the-future-out&quot;&gt;Masking the future out&lt;/h3&gt;

&lt;p&gt;Sometimes when working with a sequence of values, it is important to avoid passing information about the future (because during inference the model will not have future information, or because we want to estimate the next value), given that \(Q, V, K\) include the whole sequence (all of the \(n\) vectors).&lt;/p&gt;

&lt;p&gt;For this reason, we can simply multiply the scaled similarity matrix elementwise by a mask. Let’s imagine we are at the \(j\)th element and want to predict the \((j+1)\)th element; the mask for that row will then be \([1,1,1,...,0,0,0]\), where all the entries after \(j\) are zero. Since we are allowed to train not just on the \(j\)th element but on every element at once, the mask that hides the posterior (aka future) inputs of each row looks like&lt;/p&gt;

\[M = \{m_{i,j}\} \in \mathbb R^{n\times n} = \left(
\matrix{
1 &amp;amp; 0 &amp;amp; 0 &amp;amp; \cdots &amp;amp; 0\\
1 &amp;amp; 1 &amp;amp; 0 &amp;amp; \cdots &amp;amp; 0\\
1 &amp;amp; 1 &amp;amp; 1 &amp;amp; \cdots &amp;amp; 0 \\
\vdots &amp;amp; &amp;amp; &amp;amp; \ddots &amp;amp; \\
1 &amp;amp; 1 &amp;amp; 1 &amp;amp; \cdots &amp;amp; 1
}
\right)\]

&lt;p&gt;Including the mask in the attention equation (applied elementwise) we have&lt;/p&gt;

\[A = \text{softmax}\left(M \odot \frac{ S}{\sqrt l} \right)^TV\]

&lt;p&gt;(In practice, the masked-out entries are usually set to \(-\infty\) before the softmax, so that they receive exactly zero weight.)&lt;/p&gt;

&lt;p&gt;Therefore, because \(A\) is just the same input weighted differently, we get \(Q,V,K,A \in \mathbb R^{n\times l}\)&lt;/p&gt;
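&lt;p&gt;A sketch of the masked (causal) variant, under the same assumptions as before. Here the mask is applied by setting future entries to a very negative value rather than multiplying by zero, so the softmax assigns them exactly zero weight:&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X):
    """Causal self-attention: element i may only attend to elements 0..i."""
    n, l = X.shape
    S = X @ X.T / np.sqrt(l)
    M = np.tril(np.ones((n, n), dtype=bool))  # lower-triangular mask, as above
    S = np.where(M, S, -np.inf)               # hide the future before the softmax
    return softmax(S, axis=-1) @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
A = masked_self_attention(X)
# The first output row can only attend to the first input element,
# so it reproduces it exactly.
print(np.allclose(A[0], X[0]))   # True
```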

&lt;hr /&gt;

&lt;h2 id=&quot;2-multi-head-attention-architecture&quot;&gt;2. Multi-head attention architecture&lt;/h2&gt;

&lt;p&gt;There are three improvements we can apply to the current self-attention architecture.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/transformer/multihead.png&quot; alt=&quot;drawing&quot; width=&quot;400&quot; /&gt;&lt;/p&gt;

&lt;p&gt;First, to allow the network to transform the input (that is why it is called a transformer), we can improve the architecture by adding one linear layer for each of \(Q\), \(K\) and \(V\), just before the similarity-matrix operation.&lt;/p&gt;

\[Q = f_\theta(X)\\
K = f_\gamma(X)\\
V = f_\beta(X)\]

&lt;p&gt;Where \(\theta, \gamma, \beta\) are the parameters of each layer and \(X\) is the input. Now \(Q,K,V\) may no longer be equal.&lt;/p&gt;

&lt;p&gt;A second improvement is to stack several of these operations in parallel, \(h\) times. One way of doing this is by simply expanding the parameters \(\theta, \gamma\) and \(\beta\) \(h\) times and then reshaping to \(h \times ...\). Therefore&lt;/p&gt;

\[Q' = f_{\theta'}(X) \in \mathbb R^{h \times l \times n} \\
K' = f_{\gamma'}(X) \in \mathbb R^{h \times l \times n}\\
V' = f_{\beta'}(X) \in \mathbb R^{h \times l \times n} \\
\\
S = Q'^T  K'\in \mathbb R^{h \times n \times n}\\
\\
A' = V'\,\text{softmax}\left(\frac{ S}{\sqrt l} \right) \in \mathbb R^{h\times l\times n}\]

&lt;p&gt;To keep the same output dimension as the original self-attention mechanism, we can concatenate the heads of \(A'\) and add a linear layer from \(hl \times n\) to \(l \times n\), so the final equation for the multi-head attention mechanism is&lt;/p&gt;

\[A = f_{\phi}(A_\text{flatten}) \quad \text{with} \quad A_\text{flatten} \in \mathbb R^{hl \times n}\]

&lt;p&gt;As a result we get \(A \in \mathbb R^{l \times n}\), and we can add an extra leading dimension \(b\) for the batch size.&lt;/p&gt;
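&lt;p&gt;A compact NumPy sketch of the multi-head mechanism. The weight matrices here are stand-ins for the learned layers \(f_{\theta'}, f_{\gamma'}, f_{\beta'}, f_\phi\), each head gets an \(l/h\)-dimensional slice rather than a full copy, and the scaling uses the per-head dimension, a common choice:&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (n, l). Wq/Wk/Wv: (l, l) projections split into h heads. Wo: (l, l)."""
    n, l = X.shape
    d = l // h                                  # per-head dimension
    # project, then reshape to (h, n, d): one Q/K/V per head
    Q = (X @ Wq).reshape(n, h, d).transpose(1, 0, 2)
    K = (X @ Wk).reshape(n, h, d).transpose(1, 0, 2)
    V = (X @ Wv).reshape(n, h, d).transpose(1, 0, 2)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d)   # (h, n, n) similarity per head
    A = softmax(S, axis=-1) @ V                 # (h, n, d) attention per head
    # concatenate the heads back to (n, l), then mix them with Wo
    A = A.transpose(1, 0, 2).reshape(n, l)
    return A @ Wo

rng = np.random.default_rng(0)
n, l, h = 5, 16, 4
X = rng.standard_normal((n, l))
Wq, Wk, Wv, Wo = (rng.standard_normal((l, l)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
print(out.shape)   # (5, 16): same shape as the input
```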

&lt;hr /&gt;

&lt;h2 id=&quot;3-transformer-architecture&quot;&gt;3. Transformer architecture&lt;/h2&gt;

&lt;p&gt;The last step to build a transformer is to stack together several multi-head self-attention mechanisms in a specific order \(N\) times.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/transformer/transformer.png&quot; alt=&quot;drawing&quot; width=&quot;400&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Positional encoding and input embeddings are an essential part of the architecture, but here we assume that \(X\) already contains both.&lt;/p&gt;

&lt;p&gt;We connect the output of each encoder block to the input of the next. For the decoder we do the same: the output of each decoder block is the input of the next decoder’s first, masked multi-head attention block. Additionally, the output of the last encoder block is fed into the second multi-head attention block of each of the \(N\) decoders.&lt;/p&gt;
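&lt;p&gt;This wiring can be sketched structurally as follows. It is only a sketch: single-head, unmasked attention stands in for the multi-head blocks, and the residual connections, layer normalization and feed-forward sublayers of the full architecture are omitted.&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention; rows are sequence elements."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    return softmax(S, axis=-1) @ V

def encoder_block(X):
    return attention(X, X, X)            # self-attention over the source

def decoder_block(Y, memory):
    Y = attention(Y, Y, Y)               # masked self-attention in the real model
    return attention(Y, memory, memory)  # cross-attention over the encoder output

N = 3
rng = np.random.default_rng(0)
src = rng.standard_normal((6, 16))       # source embeddings (+ positional encoding)
tgt = rng.standard_normal((4, 16))       # target sequence so far

memory = src
for _ in range(N):                       # the output of each encoder feeds the next
    memory = encoder_block(memory)

out = tgt
for _ in range(N):                       # every decoder attends to the last encoder
    out = decoder_block(out, memory)

logits = out @ rng.standard_normal((16, 100))  # final linear layer, vocab size 100
probs = softmax(logits, axis=-1)               # per-position next-token distribution
print(probs.shape)                             # (4, 100)
```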

&lt;p&gt;To finish, we add a linear layer and a softmax to predict, over the vocabulary, the probability of the next element of the sequence.&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">This is a quick implementation of transformers that a friend of mine (link) and I wrote from scratch, without guidance from the paper, just to practice and better understand transformers. This is our own explanation of transformers, so the notation and the order of some variables may differ slightly from the paper.</summary></entry></feed>