<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.2">Jekyll</generator><link href="https://advancedcomputervision.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://advancedcomputervision.github.io/" rel="alternate" type="text/html" /><updated>2022-10-29T16:24:02+00:00</updated><id>https://advancedcomputervision.github.io/feed.xml</id><title type="html">Advanced Computer Vision Meetup</title><subtitle>We meet and discuss CV papers, mainly in the area of ML, and sometimes interesting off-topic papers. We also meet and do interesting projects. You are always welcome to join any of those.
</subtitle><author><name>GitHub User</name><email>your-email@domain.com</email></author><entry><title type="html">Session 14: Instant Neural Graphics Primitives</title><link href="https://advancedcomputervision.github.io/misc/2022/10/27/instant-nerf.html" rel="alternate" type="text/html" title="Session 14: Instant Neural Graphics Primitives" /><published>2022-10-27T00:00:00+00:00</published><updated>2022-10-27T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/10/27/instant-nerf</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/10/27/instant-nerf.html">&lt;h1 id=&quot;instant-neural-graphics-primitives-with-a-multiresolution-hash-encoding&quot;&gt;Instant Neural Graphics Primitives with a Multiresolution Hash Encoding&lt;/h1&gt;

&lt;p&gt;In this session, &lt;a href=&quot;https://www.linkedin.com/in/moritz-hambach-277326b0&quot;&gt;Moritz Hambach&lt;/a&gt; and &lt;a href=&quot;https://www.linkedin.com/in/rishabhraj17/&quot;&gt;Rishabh Raj&lt;/a&gt; will be leading the conversation. They will explain the main parts and logic behind the paper, show some demos, and help moderate the session. These events are informal conversations where everyone has the opportunity to ask questions, request clarifications, or contribute.&lt;/p&gt;

&lt;p&gt;We welcome all levels. We strongly recommend reading the paper beforehand.&lt;/p&gt;

&lt;p&gt;Links:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Project: &lt;a href=&quot;https://github.com/NVlabs/instant-ngp&quot;&gt;GitHub - NVlabs/instant-ngp: Instant neural graphics primitives: lightning fast NeRF and more&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Paper: &lt;a href=&quot;https://nvlabs.github.io/instant-ngp/assets/mueller2022instant.pdf&quot;&gt;https://nvlabs.github.io/instant-ngp/assets/mueller2022instant.pdf&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Slides: &lt;a href=&quot;/assets/instant_nerf/Nvidia_ngp.pdf&quot;&gt;Link&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h1 id=&quot;about-the-hosts&quot;&gt;About the Hosts&lt;/h1&gt;

&lt;p&gt;Moritz works on Computer Vision and Perception for autonomous vehicles at Kopernikus. He has a PhD in Physics and loves improving his geometric understanding and intuition of Deep Learning, CV, 3D representations, and the corresponding algorithms (ideally unsupervised).&lt;/p&gt;

&lt;p&gt;Rishabh is a Machine Learning and Computer Vision engineer working at Kopernikus Automotive. He holds a master’s degree from TUM and has prior industry and research experience in Detection, Segmentation, Tracking, and Motion Prediction. He has a passion for CV and ML. His interest is in Scene Understanding using Deep Learning and Reinforcement Learning.&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">Instant Neural Graphics Primitives with a Multiresolution Hash Encoding In this session, Moritz Hambach and Rishabh Raj will be leading the conversation. They will explain the main parts and logic behind the paper, show some demos, and help moderate the session. These events are informal conversations where everyone has the opportunity to ask questions, request clarifications, or contribute. We welcome all levels. We strongly recommend reading the paper beforehand. Links: Project: GitHub - NVlabs/instant-ngp: Instant neural graphics primitives: lightning fast NeRF and more Paper: https://nvlabs.github.io/instant-ngp/assets/mueller2022instant.pdf Slides: Link About the Hosts Moritz works on Computer Vision and Perception for autonomous vehicles at Kopernikus. He has a PhD in Physics and loves improving his geometric understanding and intuition of Deep Learning, CV, 3D representations, and the corresponding algorithms (ideally unsupervised). Rishabh is a Machine Learning and Computer Vision engineer working at Kopernikus Automotive. He holds a master’s degree from TUM and has prior industry and research experience in Detection, Segmentation, Tracking, and Motion Prediction. He has a passion for CV and ML. His interest is in Scene Understanding using Deep Learning and Reinforcement Learning.</summary></entry><entry><title type="html">Session 13: EPro-PnP</title><link href="https://advancedcomputervision.github.io/misc/2022/10/25/epro.html" rel="alternate" type="text/html" title="Session 13: EPro-PnP" /><published>2022-10-25T00:00:00+00:00</published><updated>2022-10-25T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/10/25/epro</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/10/25/epro.html">&lt;h1 id=&quot;epro-pnp-generalized-end-to-end-probabilistic-perspective-n-points-for-monocular-object-pose-estimation&quot;&gt;EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation&lt;/h1&gt;

&lt;p&gt;In this session, Matthes Krull will be leading the conversation. In the 2022 paper EPro-PnP, the authors showed how their continuous PnP layer can beat other 6DoF approaches without the use of a fancy network design, and we will dig deeper into the reasons why. The algorithm is highly mathematical, but every step they take is well reasoned and crucial to its outstanding performance, as they’ve shown in the ablation study.&lt;/p&gt;

&lt;p&gt;In brief, several algorithms are combined and applied on top of a backbone, which outputs: A) 2D-3D correspondences and B) weights for those correspondences. First, a decoupled PnP solution is found by a Levenberg-Marquardt (LM)-inspired solver. Second, this intermediate solution is used in a Monte Carlo approach to find a continuous pose distribution around this “local” solution (which is parameterized by our 2D-3D correspondences).&lt;/p&gt;

&lt;p&gt;Here, they also use the Adaptive Multiple Importance Sampling (AMIS) algorithm to iteratively refine that pose distribution. Third and finally, this distribution is used to optimize the network by randomly sampling N poses from it and computing the gradient according to their reprojection errors.&lt;/p&gt;
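The pose-scoring step described above can be sketched roughly as follows. This is our own illustrative Python toy, not the authors' implementation; the pinhole intrinsics f, c and all function names are assumptions we made for the sketch.

```python
# Illustrative toy of the Monte Carlo pose-scoring step described above.
# NOT the authors' code: intrinsics (f, c) and all names are assumptions.
import numpy as np

def reproject(points_3d, pose, f=500.0, c=250.0):
    """Pinhole projection of Nx3 world points with pose = (R, t)."""
    R, t = pose
    cam = points_3d @ R.T + t               # world frame to camera frame
    return f * cam[:, :2] / cam[:, 2:3] + c

def pose_scores(points_3d, points_2d, weights, poses):
    """Unnormalised likelihood of each candidate pose from its
    weighted reprojection error (the quantity AMIS would refine)."""
    scores = []
    for pose in poses:
        err = reproject(points_3d, pose) - points_2d
        cost = np.sum(weights * np.sum(err ** 2, axis=1))
        scores.append(np.exp(-cost))
    return np.asarray(scores)
```

In the paper this scoring is what turns the point estimate of the LM solver into a full distribution: candidate poses drawn near the solution are reweighted by their reprojection likelihood.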

&lt;p&gt;Here are the slides and videos shown in the session:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/epro/epro.png&quot; alt=&quot;Info&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Videos:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://youtu.be/V3ZehIJ9C3E&quot;&gt;https://youtu.be/V3ZehIJ9C3E&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://youtu.be/1pR44nmp0yc&quot;&gt;https://youtu.be/1pR44nmp0yc&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://youtu.be/b9FgOvxFAdg&quot;&gt;https://youtu.be/b9FgOvxFAdg&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation. In this session, Matthes Krull will be leading the conversation. In the 2022 paper EPro-PnP, the authors showed how their continuous PnP layer can beat other 6DoF approaches without the use of a fancy network design, and we will dig deeper into the reasons why. The algorithm is highly mathematical, but every step they take is well reasoned and crucial to its outstanding performance, as they’ve shown in the ablation study. In brief, several algorithms are combined and applied on top of a backbone, which outputs: A) 2D-3D correspondences and B) weights for those correspondences. First, a decoupled PnP solution is found by a Levenberg-Marquardt (LM)-inspired solver. Second, this intermediate solution is used in a Monte Carlo approach to find a continuous pose distribution around this “local” solution (which is parameterized by our 2D-3D correspondences). Here, they also use the Adaptive Multiple Importance Sampling (AMIS) algorithm to iteratively refine that pose distribution. Third and finally, this distribution is used to optimize the network by randomly sampling N poses from it and computing the gradient according to their reprojection errors. Here are the slides and videos shown in the session: Videos: https://youtu.be/V3ZehIJ9C3E https://youtu.be/1pR44nmp0yc https://youtu.be/b9FgOvxFAdg</summary></entry><entry><title type="html">Building a Radio controlled car</title><link href="https://advancedcomputervision.github.io/misc/2022/10/15/rc-car.html" rel="alternate" type="text/html" title="Building a Radio controlled car" /><published>2022-10-15T00:00:00+00:00</published><updated>2022-10-15T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/10/15/rc-car</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/10/15/rc-car.html">&lt;p&gt;&lt;img src=&quot;/assets/rccar/rccar.jpeg&quot; alt=&quot;rccar&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;updates&quot;&gt;Updates:&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Session 2 (22.10.22):&lt;/strong&gt; We need to fix:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;NN&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Run new architecture&lt;/li&gt;
      &lt;li&gt;Start with the dataloader&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Sensing&lt;/strong&gt;:
    &lt;ul&gt;
      &lt;li&gt;Run on the Jetson and get 20 FPS with at most 100 ms latency&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Control&lt;/strong&gt;
    &lt;ul&gt;
      &lt;li&gt;Move the controller to a Raspberry Pi + PS4 controller&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Session 1 (15.10.22):&lt;/strong&gt; Mainly defining who will be doing what&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note 0:&lt;/strong&gt; Thanks to &lt;a href=&quot;https://www.kopernikusauto.com/&quot;&gt;Kopernikus Automotive&lt;/a&gt; for sponsoring this project and the Meetup.&lt;/p&gt;

&lt;p&gt;For more updates, please check the Discord channel.&lt;/p&gt;

&lt;h1 id=&quot;general-description&quot;&gt;General Description&lt;/h1&gt;

&lt;p&gt;BTW, the picture above is the car we’re building.&lt;/p&gt;

&lt;p&gt;We’re thinking about joining the race of the &lt;a href=&quot;https://www.meetup.com/autonomous-robots-berlin/&quot;&gt;autonomous driving + robots meetup berlin&lt;/a&gt;, and for that we need to build an RC car. Everyone in the team has something they want to experiment with, for example, a general end-to-end driving approach capable of working with on-board and off-board cameras. Everyone is welcome; we always need more hands.&lt;/p&gt;

&lt;p&gt;If you’re interested in helping us, please join the event! We have some parts of the car already built, but we still need to finish that + all the perception parts.&lt;/p&gt;

&lt;p&gt;If you’re interested, let us know! We welcome everyone who is interested in learning and/or helping us. We will usually meet in Berlin, but sometimes online, depending on the topic.&lt;/p&gt;

&lt;p&gt;We will be posting updates of progress here.&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html"></summary></entry><entry><title type="html">Session 12: Text to image generation with diffusion models</title><link href="https://advancedcomputervision.github.io/misc/2022/09/15/diffusion_models.html" rel="alternate" type="text/html" title="Session 12: Text to image generation with diffusion models" /><published>2022-09-15T00:00:00+00:00</published><updated>2022-09-15T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/09/15/diffusion_models</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/09/15/diffusion_models.html">&lt;h1 id=&quot;text-to-image-generation-with-diffusion-models&quot;&gt;Text to image generation with diffusion models&lt;/h1&gt;

&lt;p&gt;This talk is slightly different because we will be talking about multiple papers, so there is no need to have read them beforehand.&lt;/p&gt;

&lt;p&gt;In this session &lt;a href=&quot;https://www.linkedin.com/in/afcruzs&quot;&gt;Felipe Cruz&lt;/a&gt; will talk about how diffusion models work, using DALL-E 2 specifically as a use case, and touch on the differences from newer models (Imagen, Stable Diffusion, Parti, etc.).&lt;/p&gt;

&lt;p&gt;Felipe is a research engineer at Aleph Alpha working on novel methods to improve large pre-trained models, both with and without scaling them up to billions of parameters and beyond. Previously, he worked at Microsoft researching how to scale up multilingual models for machine translation, and he was also part of the Cortana team. He got his master’s in computer science from the University of Washington.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slides are here: &lt;a href=&quot;/assets/DM.pdf&quot;&gt;PDF&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">Text to image generation with diffusion models This talk is slightly different because we will be talking about multiple papers, so there is no need to have read them beforehand. In this session Felipe Cruz will talk about how diffusion models work, using DALL-E 2 specifically as a use case, and touch on the differences from newer models (Imagen, Stable Diffusion, Parti, etc.). Felipe is a research engineer at Aleph Alpha working on novel methods to improve large pre-trained models, both with and without scaling them up to billions of parameters and beyond. Previously, he worked at Microsoft researching how to scale up multilingual models for machine translation, and he was also part of the Cortana team. He got his master’s in computer science from the University of Washington. Slides are here: PDF</summary></entry><entry><title type="html">Session 11: YOLOX in depth</title><link href="https://advancedcomputervision.github.io/misc/2022/07/15/yolox_in_depth.html" rel="alternate" type="text/html" title="Session 11: YOLOX in depth" /><published>2022-07-15T00:00:00+00:00</published><updated>2022-07-15T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/07/15/yolox_in_depth</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/07/15/yolox_in_depth.html">&lt;h1 id=&quot;discussing-yolox-in-depth&quot;&gt;Discussing YOLOX in depth&lt;/h1&gt;

&lt;p&gt;This is a second session on the previous paper, to discuss its specific improvements in depth. We discussed the main improvements, the reasons why they work, and their pros and cons.&lt;/p&gt;

&lt;p&gt;The main improvements we touched on were:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Strong data augmentation&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Decoupled heads&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;End-to-end* (removing NMS)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Anchor free&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Multiple positive samples&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Optimal Transport Assignment&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;3x3 Center sampling&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;IoU on regression&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
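Two of the points above, anchor-free box decoding and 3x3 center sampling, can be sketched roughly as follows. This is our own toy numpy illustration with assumed tensor layouts and names, not the official YOLOX code.

```python
# Toy numpy sketch of anchor-free decoding and 3x3 center sampling.
# NOT the official YOLOX code; tensor layout and names are assumptions.
import numpy as np

def decode(pred, stride):
    """pred: (H, W, 4) holding (dx, dy, log_w, log_h) per grid cell."""
    H, W, _ = pred.shape
    ys, xs = np.mgrid[0:H, 0:W]
    cx = (xs + pred[..., 0]) * stride      # box centre in pixels
    cy = (ys + pred[..., 1]) * stride
    w = np.exp(pred[..., 2]) * stride      # size without anchor priors
    h = np.exp(pred[..., 3]) * stride
    return np.stack([cx, cy, w, h], axis=-1)

def center_mask(gt_cx, gt_cy, stride, H, W, radius=1.5):
    """Positive cells: the 3x3 block of cells around the GT centre."""
    ys, xs = np.mgrid[0:H, 0:W]
    dx = np.abs((xs + 0.5) * stride - gt_cx)
    dy = np.abs((ys + 0.5) * stride - gt_cy)
    dist = np.maximum(dx, dy)              # Chebyshev distance to GT
    return (radius * stride - dist) > 0
```

The key point is that each grid cell directly regresses its own box (no anchor shapes to tune), and several cells around each ground-truth centre become positive samples, which enriches the training signal.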

&lt;p&gt;YOLOX is a great paper because it groups a lot of new insights in Single Stage Detectors (SSD), but it is also a bad paper because it does not explain any of the concepts it uses nor propose anything new.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Presentation here: &lt;a href=&quot;/assets/YOLOX.pdf&quot;&gt;PDF&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">Discussing YOLOX in depth This is a second session on the previous paper, to discuss its specific improvements in depth. We discussed the main improvements, the reasons why they work, and their pros and cons. The main improvements we touched on were: Strong data augmentation Decoupled heads End-to-end* (removing NMS) Anchor free Multiple positive samples Optimal Transport Assignment 3x3 Center sampling IoU on regression YOLOX is a great paper because it groups a lot of new insights in Single Stage Detectors (SSD), but it is also a bad paper because it does not explain any of the concepts it uses nor propose anything new. Presentation here: PDF</summary></entry><entry><title type="html">Session 10: Discussing YOLOX and YOLOR</title><link href="https://advancedcomputervision.github.io/misc/2022/03/06/yolox.html" rel="alternate" type="text/html" title="Session 10: Discussing YOLOX and YOLOR" /><published>2022-03-06T00:00:00+00:00</published><updated>2022-03-06T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/03/06/yolox</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/03/06/yolox.html">&lt;h1 id=&quot;discussing-yolox-and-yolor&quot;&gt;Discussing YOLOX and YOLOR&lt;/h1&gt;

&lt;p&gt;Regarding YOLOX, the general agreement is that the paper is great. Its main added value is grouping the latest improvements in object detection for single-shot detection (SSD). We personally think the most important contributions were removing anchors and NMS and the approach of improving the training signal with multiple positives. It is hard to write down all the positive effects that these changes bring.&lt;/p&gt;

&lt;p&gt;Papers that would be interesting for understanding specific points are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;“OTA: Optimal transport assignment for object detection.”&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;“Object detection made simpler by eliminating heuristic NMS.”&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;“FCOS: Fully convolutional one-stage object detection”&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There was mainly one conclusion from YOLOR: the idea is interesting and should be explored further by the community, but the execution was mediocre. Communication was poor, and the paper is missing a lot of important details needed for a proper conversation on this topic.&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">Discussing YOLOX and YOLOR Regarding YOLOX, the general agreement is that the paper is great. Its main added value is grouping the latest improvements in object detection for single-shot detection (SSD). We personally think the most important contributions were removing anchors and NMS and the approach of improving the training signal with multiple positives. It is hard to write down all the positive effects that these changes bring. Papers that would be interesting for understanding specific points are: “OTA: Optimal transport assignment for object detection.” “Object detection made simpler by eliminating heuristic NMS.” “FCOS: Fully convolutional one-stage object detection” There was mainly one conclusion from YOLOR: the idea is interesting and should be explored further by the community, but the execution was mediocre. Communication was poor, and the paper is missing a lot of important details needed for a proper conversation on this topic.</summary></entry><entry><title type="html">Using Group Theory to Solve the Rubik’s cube</title><link href="https://advancedcomputervision.github.io/misc/2022/01/15/Rubik.html" rel="alternate" type="text/html" title="Using Group Theory to Solve the Rubik’s cube" /><published>2022-01-15T00:00:00+00:00</published><updated>2022-01-15T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/01/15/Rubik</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/01/15/Rubik.html">&lt;p&gt;One might think that everything about the Rubik’s cube has been said since its invention in the second half of the 20th century: detailed instructions to follow and solve it, variants of the original \(3\times 3 \times 3\) version, computer-vision aided systems to read in a cube state and solve it physically by making use of robotics, and so on. We want to &lt;em&gt;understand&lt;/em&gt; how it works by describing precisely what happens when a certain operation is applied to it and how operations affect its state in general. In the end, we seek an explicit closed formula or algorithm for solving it that should be a consequence of these properties.&lt;/p&gt;

&lt;p&gt;The implicit geometrical symmetries of the cube seen as an object in space, the cyclic nature of operations, and the fact that cube state transitions are permutations of its forming sub-cubes motivate us to express its properties in the language of group theory. This will also allow us to have a mathematical perspective on the puzzle and apply any classical result from the theory when appropriate. We haven’t been exposed to any similar approach, so we believe our development of the solution here is original and hopefully also useful for any further study.&lt;/p&gt;
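The cyclic nature of operations mentioned above can be made concrete with a tiny toy example, treating a move as a permutation of positions. This is our own illustration, not code from the project.

```python
# Toy illustration (ours): a cube move acts as a permutation of
# positions, and composing it with itself returns to the identity.
def compose(p, q):
    """Apply permutation q first, then p (both as index tuples)."""
    return tuple(p[i] for i in q)

identity = (0, 1, 2, 3)
turn = (3, 0, 1, 2)             # a quarter turn acting as a 4-cycle

state = identity
for _ in range(4):
    state = compose(turn, state)
assert state == identity        # the move has order 4
```

On the real cube each face turn is a much larger permutation of stickers, but the same group-theoretic facts (orders of elements, commutators, generated subgroups) apply unchanged.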

&lt;p&gt;If you are interested in this and in having this kind of discussion, join us and this project on Discord ;)&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">One might think that everything about the Rubik’s cube has been said since its invention in the second half of the 20th century: detailed instructions to follow and solve it, variants of the original \(3\times 3 \times 3\) version, computer-vision aided systems to read in a cube state and solve it physically by making use of robotics, and so on. We want to understand how it works by describing precisely what happens when a certain operation is applied to it and how operations affect its state in general. In the end, we seek an explicit closed formula or algorithm for solving it that should be a consequence of these properties.</summary></entry><entry><title type="html">Tilt5</title><link href="https://advancedcomputervision.github.io/misc/2022/01/15/tilt5.html" rel="alternate" type="text/html" title="Tilt5" /><published>2022-01-15T00:00:00+00:00</published><updated>2022-01-15T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/01/15/tilt5</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/01/15/tilt5.html">&lt;p&gt;&lt;img src=&quot;https://assets-global.website-files.com/60d41442f203961c4600ff57/61afaa2e60f891891470e422_0358-TiltFive_OGImage2.jpg&quot; alt=&quot;Tilt5banner&quot; /&gt;&lt;/p&gt;

&lt;h1 id=&quot;updates&quot;&gt;Updates:&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Note 2 (Oct 2022):&lt;/strong&gt; We have received confirmation that we will be receiving the glasses around January.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note 1 (June 2021):&lt;/strong&gt; We have ordered a Tilt5 headset to understand better how it works. If you are interested in helping us with the project or doing something related to Tilt5, let us know. We are mainly located in Berlin, Germany. You can contact us on our Meetup page or Discord server (links in the footer).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note 0:&lt;/strong&gt; Thanks to &lt;a href=&quot;https://www.kopernikusauto.com/&quot;&gt;Kopernikus Automotive&lt;/a&gt; for sponsoring this project and the Meetup.&lt;/p&gt;

&lt;h1 id=&quot;general-description&quot;&gt;General Description&lt;/h1&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.tiltfive.com/&quot;&gt;Tilt5&lt;/a&gt; system uses very clever light filters to achieve a good FOV for AR glasses using LIDOR projectors and a retro-reflective board. Apart from this, their localization takes advantage of the configuration of the whole system, mainly the planarity of the board, to localize the glasses (projectors), which slightly simplifies localization. In this project we try to replicate their approach from zero, building our own hardware (an experimental version of only the projection part) and our own localization software. We leave hardware optimisation aside because we are mainly interested in computer vision and because we don’t have too many hands.&lt;/p&gt;

&lt;p&gt;We focus on these two aspects (hardware and localization) because we need the first to test the second, and the second is a very important topic in computer vision. Nonetheless, we are also curious about how easy it will be to implement a similar approach and what else it can be used for.&lt;/p&gt;

&lt;p&gt;The following diagram is a simplification of how the projection works:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/tilt5/tilt5.svg&quot; alt=&quot;tilt5&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We will be posting updates of progress here.&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html"></summary></entry><entry><title type="html">Two-week challenge</title><link href="https://advancedcomputervision.github.io/misc/2022/01/13/2week.html" rel="alternate" type="text/html" title="Two-week challenge" /><published>2022-01-13T00:00:00+00:00</published><updated>2022-01-13T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2022/01/13/2week</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2022/01/13/2week.html">&lt;p&gt;We took the challenge of reading one paper a day for two weeks, and here is the list of the papers we all read. To be exact, it was around one paper per day on average, and we didn’t need to read the same papers; some people shared papers, but in general we were fully free to choose.&lt;/p&gt;

&lt;p&gt;After the two weeks, we held short meetings to share a very short summary of each of the papers.&lt;/p&gt;

&lt;p&gt;Here is the list of all the papers discussed:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/princeton-vl/droid-slam&quot;&gt;DROID-SLAM&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/ISEE-Technology/CamVox&quot; title=&quot;https://github.com/ISEE-Technology/CamVox&quot;&gt;GitHub - ISEE-Technology/CamVox: [ICRA2021] A low-cost SLAM system based on camera and Livox lidar.&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A Review of Visual-LiDAR Fusion based Simultaneous Localization and Mapping&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2111.06377&quot;&gt;Masked Autoencoders are Scalable Vision Learners&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1909.03459&quot;&gt;Blind Geometric Distortion Correction on Images Through Deep Learning&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2110.09482&quot;&gt;Self-supervised Monocular Depth Estimation with internal feature fusion&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2011.12104&quot;&gt;Recurrent Multi-View alignment network for unsupervised surface registration&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2106.11958&quot;&gt;Prototypical cross-attention network for multiple object tracking and segmentation&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.matthewtancik.com/nerf&quot;&gt;NeRF&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2103.15875&quot;&gt;In-place scene labelling and understanding with implicit scene representation&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://nerf-w.github.io/&quot;&gt;NeRF in the Wild&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://users.eecs.northwestern.edu/~asb479/papers/bmvc_2018.pdf&quot;&gt;Recurrent Multiframe single shot detector for video object detection&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://zju3dv.github.io/loftr/&quot;&gt;LoFTR: Detector-Free Local Feature Matching with Transformers&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2104.11487&quot;&gt;Skip-Convolutions for Efficient Video Processing&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://nvlabs.github.io/face-vid2vid/&quot;&gt;One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2003.06409&quot;&gt;Probabilistic Future Prediction for Video Scene Understanding&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1912.00177&quot;&gt;Urban Driving with Conditional Imitation Learning&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://wayve.ai/blog/fiery-future-instance-prediction-birds-eye-view/&quot;&gt;FIERY&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1506.02025&quot;&gt;Spatial Transformer Networks&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/darglein/ADOP&quot;&gt;ADOP&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1811.10119&quot;&gt;Variational End-to-End Navigation and Localization&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1906.08240&quot;&gt;Neural Point-Based Graphics&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;ORB-SLAM2&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1502.00956&quot;&gt;ORB-SLAM: a Versatile and Accurate Monocular SLAM System&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://alexyu.net/plenoctrees/&quot;&gt;Plen-octrees&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2008.05711&quot;&gt;Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2002.08394&quot;&gt;MonoLayout: Amodal scene layout from a single image&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1811.08188&quot;&gt;Orthographic Feature Transform for Monocular 3D Object Detection&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://johnthickstun.com/docs/transformers.pdf&quot;&gt;The Transformer Model in Equations&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2012.00152&quot;&gt;Every Model Learned by Gradient Descent Is Approximately a Kernel Machine&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2003.06754&quot;&gt;MotionNet: Joint Perception and Motion Prediction for Autonomous Driving Based on Bird’s Eye View Maps&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1912.02908&quot;&gt;Why Having 10,000 Parameters in Your Camera Model is Better Than Twelve&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2103.15691&quot;&gt;ViViT: A Video Vision Transformer&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://iliad-project.eu/wp-content/uploads/papers/iros_se_ndt.pdf&quot;&gt;Semantic-assisted 3D Normal Distributions Transform for scan registration in environments with limited structure&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Discrete Kalman Filter Tutorial&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1806.01261&quot;&gt;Relational inductive biases, deep learning, and graph networks (in progress)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2011.15091&quot;&gt;Inductive Biases for Deep Learning of Higher-Level Cognition&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1911.01547&quot;&gt;On the Measure of Intelligence&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2106.07643&quot;&gt;Unsupervised Learning of Visual 3D Keypoints for Control&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/2102.05176&quot;&gt;Transfer learning based few-shot classification using optimal transport mapping from preprocessed latent space of backbone neural network&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://arxiv.org/abs/1703.03400&quot;&gt;Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://papers.nips.cc/paper/2017/hash/cb8da6767461f2812ae4290eac7cbc42-Abstract.html&quot;&gt;Prototypical Networks for Few-shot Learning&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">We took the challenge of reading one paper a day for two weeks and here is the list of the papers we all read. For the sake of exactness, it was around 1 paper per day on average and we didn’t need to read the same papers, some people shared some papers but in general, we were fully free to choose.</summary></entry><entry><title type="html">Transformers from zero</title><link href="https://advancedcomputervision.github.io/misc/2021/12/20/transformers.html" rel="alternate" type="text/html" title="Transformers from zero" /><published>2021-12-20T00:00:00+00:00</published><updated>2021-12-20T00:00:00+00:00</updated><id>https://advancedcomputervision.github.io/misc/2021/12/20/transformers</id><content type="html" xml:base="https://advancedcomputervision.github.io/misc/2021/12/20/transformers.html">&lt;p&gt;This is a quick implementation of transformers that a friend of mine (link) and I wrote from scratch, without guidance from the paper, just to practice and better understand transformers. This is our own explanation of transformers, so the notation and the order of some variables may differ slightly from the paper.&lt;/p&gt;

&lt;p&gt;The code can be found at &lt;a href=&quot;https://github.com/NotAnyMike/transformer&quot;&gt;github.com/NotAnyMike/transformer&lt;/a&gt;.&lt;/p&gt;


&lt;h2 id=&quot;general&quot;&gt;General&lt;/h2&gt;

&lt;p&gt;There are mainly three parts to this. Transformers start with the self-attention mechanism; scaling it into multiple heads gives the multi-head attention, and finally, by stacking these together, we get a transformer. The sections below are organized in those three steps.&lt;/p&gt;

&lt;h2 id=&quot;1-attention-architecture&quot;&gt;1. Attention architecture&lt;/h2&gt;

&lt;p&gt;&lt;img src=&quot;/assets/transformer/attention.png&quot; alt=&quot;drawing&quot; height=&quot;400&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Let’s assume we have an input of \(n\) elements of length \(l\)&lt;/p&gt;

\[Q = V = K = \{q_{i,j}\} \in \mathbb R^{n \times l}\]

\[Q K^T =  S \in \mathbb R^{n \times n}\]

&lt;p&gt;\(S = \{s_{i,j}\}\) is basically the similarity matrix of the elements (how similar element \(i\) is to element \(j\)). The more similar two elements are, the larger their product will be, and vice versa.&lt;/p&gt;

&lt;p&gt;Here element \(j\) means the row \(e_j = \{q_{j,i}\}\) with \(i\in [1,2,3,...,l]\), i.e. one vector of length \(l\).&lt;/p&gt;

&lt;p&gt;Therefore it is the same as&lt;/p&gt;

\[s_{a,b} = e_a \cdot e_b = |e_a||e_b|\cos \theta\]

&lt;p&gt;It is just a way of measuring how alike two vectors are.&lt;/p&gt;
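&lt;p&gt;A quick numeric check of this idea (values chosen for illustration): aligned vectors produce a large dot product, while orthogonal vectors produce zero.&lt;/p&gt;

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = 2.0 * a                      # same direction as a: cos(theta) = 1
c = np.array([-2.0, 1.0, 0.0])   # orthogonal to a: cos(theta) = 0

print(a @ b)   # 28.0: large positive similarity
print(a @ c)   # 0.0: no similarity
```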

&lt;p&gt;We turn each entry into a probability with a softmax applied over each row or each column, depending on the order of multiplication we follow; for simplicity, we will apply it over the rows.&lt;/p&gt;

\[\text{softmax}( S)\]

&lt;p&gt;Self-attention is then simply the original values of \(V\) weighted by the result of the softmax&lt;/p&gt;

\[\text{softmax}( S)^T  V\]

&lt;p&gt;One extra small thing we can do to improve training and convergence is to keep the gradient from tending to zero (similar to the vanishing-gradient problem): remember that the softmax saturates as the absolute values of its inputs grow. One way to avoid this is to keep \(S\) from growing too large, which we can do by controlling its variance.&lt;/p&gt;

&lt;p&gt;Let’s assume \(\text{var} (q_{i,j}) = \sigma^2\). Because the matrix multiplication that produces \(S\) sums over \(l\) products of elements, the variance of each entry grows to roughly \(l\) times \(\sigma^2\). To keep the entries from growing with \(l\), we scale \(S\) down by \(\sqrt l\), so the variance is scaled down by the same factor it gained from the matrix multiplication. The scaled attention equation therefore becomes&lt;/p&gt;

\[\text{softmax}\left(\frac{ S}{\sqrt l} \right)^TV\]
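&lt;p&gt;Putting the pieces so far together, a minimal NumPy sketch of scaled self-attention (function names are our own, assuming \(Q = K = V = X\)) might look like:&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Scaled self-attention with Q = K = V = X, where X has shape (n, l)."""
    n, l = X.shape
    S = X @ X.T / np.sqrt(l)   # (n, n) similarity matrix, scaled down by sqrt(l)
    W = softmax(S, axis=-1)    # each row becomes a probability distribution
    return W @ X               # each output is a weighted mix of the n elements

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 64))   # n = 5 elements of length l = 64
A = self_attention(X)
print(A.shape)                     # (5, 64): same shape as the input
```

Without the `np.sqrt(l)` factor, the entries of `S` would grow with `l` and push the softmax into its saturated region, which is exactly the variance argument above.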

&lt;h3 id=&quot;masking-the-future-out&quot;&gt;Masking the future out&lt;/h3&gt;

&lt;p&gt;Sometimes when working with a sequence of values, it is important to avoid passing information about the future (because during inference the model will not have future information, or because we want to estimate the next value), given that \(Q, V, K\) include the whole sequence (all of the \(n\) vectors).&lt;/p&gt;

&lt;p&gt;For this reason, we can simply multiply the scaled similarity matrix elementwise by a mask. Let’s imagine we are at the \(j\)th element and want to predict the \((j+1)\)th element; the mask for that row will then be \([1,1,1,...,0,0,0]\), where all the entries after \(j\) are zero. Since we are allowed to train not just on the \(j\)th element but on every element at once, the mask that hides the posterior (aka future) inputs of each row looks like&lt;/p&gt;

\[M = \{m_{i,j}\} \in \mathbb R^{n\times n} = \left(
\matrix{
1 &amp;amp; 0 &amp;amp; 0 &amp;amp; \cdots &amp;amp; 0\\
1 &amp;amp; 1 &amp;amp; 0 &amp;amp; \cdots &amp;amp; 0\\
1 &amp;amp; 1 &amp;amp; 1 &amp;amp; \cdots &amp;amp; 0 \\
\vdots &amp;amp; &amp;amp; &amp;amp; \ddots &amp;amp; \\
1 &amp;amp; 1 &amp;amp; 1 &amp;amp; \cdots &amp;amp; 1
}
\right)\]

&lt;p&gt;Including the mask in the attention equation (applied elementwise) we have&lt;/p&gt;

\[A = \text{softmax}\left(M \odot \frac{ S}{\sqrt l} \right)^TV\]

&lt;p&gt;(In practice, the masked-out entries are usually set to \(-\infty\) before the softmax, so that they receive exactly zero weight.)&lt;/p&gt;

&lt;p&gt;Therefore, because \(A\) is just the same input weighted differently, we get \(Q,V,K,A \in \mathbb R^{n\times l}\)&lt;/p&gt;
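&lt;p&gt;A sketch of the masked (causal) variant, under the same assumptions as before. Here the mask is applied by setting future entries to a very negative value rather than multiplying by zero, so the softmax assigns them exactly zero weight:&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X):
    """Causal self-attention: element i may only attend to elements 0..i."""
    n, l = X.shape
    S = X @ X.T / np.sqrt(l)
    M = np.tril(np.ones((n, n), dtype=bool))  # lower-triangular mask, as above
    S = np.where(M, S, -np.inf)               # hide the future before the softmax
    return softmax(S, axis=-1) @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
A = masked_self_attention(X)
# The first output row can only attend to the first input element,
# so it reproduces it exactly.
print(np.allclose(A[0], X[0]))   # True
```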

&lt;hr /&gt;

&lt;h2 id=&quot;2-multi-head-attention-architecture&quot;&gt;2. Multi-head attention architecture&lt;/h2&gt;

&lt;p&gt;There are three improvements we can apply to the current self-attention architecture.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/transformer/multihead.png&quot; alt=&quot;drawing&quot; width=&quot;400&quot; /&gt;&lt;/p&gt;

&lt;p&gt;First, to allow the network to transform the input (that is why it is called a transformer), we can improve the architecture by adding one linear layer for each of \(Q\), \(K\) and \(V\), just before the similarity-matrix operation.&lt;/p&gt;

\[Q = f_\theta(X)\\
K = f_\gamma(X)\\
V = f_\beta(X)\]

&lt;p&gt;Where \(\theta, \gamma, \beta\) are the parameters of each layer and \(X\) is the input. Now \(Q,K,V\) may no longer be equal.&lt;/p&gt;

&lt;p&gt;A second improvement is to stack several of these operations in parallel, \(h\) times. One way of doing this is by simply expanding the parameters \(\theta, \gamma\) and \(\beta\) \(h\) times and then reshaping to \(h \times ...\). Therefore&lt;/p&gt;

\[Q' = f_{\theta'}(X) \in \mathbb R^{h \times l \times n} \\
K' = f_{\gamma'}(X) \in \mathbb R^{h \times l \times n}\\
V' = f_{\beta'}(X) \in \mathbb R^{h \times l \times n} \\
\\
S = Q'^T  K'\in \mathbb R^{h \times n \times n}\\
\\
A' = V'\,\text{softmax}\left(\frac{ S}{\sqrt l} \right) \in \mathbb R^{h\times l\times n}\]

&lt;p&gt;To keep the same output dimension as the original self-attention mechanism, we can concatenate the heads of \(A'\) and add a linear layer from \(hl \times n\) to \(l \times n\), so the final equation for the multi-head attention mechanism is&lt;/p&gt;

\[A = f_{\phi}(A_\text{flatten}) \quad \text{with} \quad A_\text{flatten} \in \mathbb R^{hl \times n}\]

&lt;p&gt;As a result we get \(A \in \mathbb R^{l \times n}\), and we can add an extra leading dimension \(b\) for the batch size.&lt;/p&gt;
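&lt;p&gt;A compact NumPy sketch of the multi-head mechanism. The weight matrices here are stand-ins for the learned layers \(f_{\theta'}, f_{\gamma'}, f_{\beta'}, f_\phi\), each head gets an \(l/h\)-dimensional slice rather than a full copy, and the scaling uses the per-head dimension, a common choice:&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (n, l). Wq/Wk/Wv: (l, l) projections split into h heads. Wo: (l, l)."""
    n, l = X.shape
    d = l // h                                  # per-head dimension
    # project, then reshape to (h, n, d): one Q/K/V per head
    Q = (X @ Wq).reshape(n, h, d).transpose(1, 0, 2)
    K = (X @ Wk).reshape(n, h, d).transpose(1, 0, 2)
    V = (X @ Wv).reshape(n, h, d).transpose(1, 0, 2)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d)   # (h, n, n) similarity per head
    A = softmax(S, axis=-1) @ V                 # (h, n, d) attention per head
    # concatenate the heads back to (n, l), then mix them with Wo
    A = A.transpose(1, 0, 2).reshape(n, l)
    return A @ Wo

rng = np.random.default_rng(0)
n, l, h = 5, 16, 4
X = rng.standard_normal((n, l))
Wq, Wk, Wv, Wo = (rng.standard_normal((l, l)) * 0.1 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
print(out.shape)   # (5, 16): same shape as the input
```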

&lt;hr /&gt;

&lt;h2 id=&quot;3-transformer-architecture&quot;&gt;3. Transformer architecture&lt;/h2&gt;

&lt;p&gt;The last step to build a transformer is to stack together several multi-head self-attention mechanisms in a specific order \(N\) times.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/transformer/transformer.png&quot; alt=&quot;drawing&quot; width=&quot;400&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Positional encoding and input embeddings are an essential part of the architecture, but here we assume that \(X\) already contains both.&lt;/p&gt;

&lt;p&gt;We connect the output of each encoder block to the input of the next. For the decoder we do the same: the output of each decoder block is the input of the next decoder’s first, masked multi-head attention block. Additionally, the output of the last encoder block is fed into the second multi-head attention block of each of the \(N\) decoders.&lt;/p&gt;
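&lt;p&gt;This wiring can be sketched structurally as follows. It is only a sketch: single-head, unmasked attention stands in for the multi-head blocks, and the residual connections, layer normalization and feed-forward sublayers of the full architecture are omitted.&lt;/p&gt;

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention; rows are sequence elements."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    return softmax(S, axis=-1) @ V

def encoder_block(X):
    return attention(X, X, X)            # self-attention over the source

def decoder_block(Y, memory):
    Y = attention(Y, Y, Y)               # masked self-attention in the real model
    return attention(Y, memory, memory)  # cross-attention over the encoder output

N = 3
rng = np.random.default_rng(0)
src = rng.standard_normal((6, 16))       # source embeddings (+ positional encoding)
tgt = rng.standard_normal((4, 16))       # target sequence so far

memory = src
for _ in range(N):                       # the output of each encoder feeds the next
    memory = encoder_block(memory)

out = tgt
for _ in range(N):                       # every decoder attends to the last encoder
    out = decoder_block(out, memory)

logits = out @ rng.standard_normal((16, 100))  # final linear layer, vocab size 100
probs = softmax(logits, axis=-1)               # per-position next-token distribution
print(probs.shape)                             # (4, 100)
```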

&lt;p&gt;To finish, we add a linear layer and a softmax to predict, over the vocabulary, the probability of the next element of the sequence.&lt;/p&gt;</content><author><name>GitHub User</name><email>your-email@domain.com</email></author><category term="misc" /><summary type="html">This is a quick implementation of transformers that a friend of mine (link) and I wrote from scratch, without guidance from the paper, just to practice and better understand transformers. This is our own explanation of transformers, so the notation and the order of some variables may differ slightly from the paper.</summary></entry></feed>