This work proposes a new approach to 3D visual grounding, the Multi-View Transformer (MVT), which outperforms state-of-the-art methods. MVT projects the scene into a multi-view space to learn a more view-robust multi-modal representation for 3D visual grounding.
According to the paper, existing works in 3D visual grounding mainly follow a two-stage scheme: first generating all candidate objects in the 3D scene (by classification), then selecting the best-matching one. These two-stage approaches are adapted from 2D visual grounding methods and do not account for the properties unique to 3D data. The paper argues that such methods handle view changes poorly and fail to learn a view-robust representation.
In the 3D visual grounding task, queries fall into two fundamental groups: view-explicit and view-implicit. Taking Nr3D as an illustration, consider two queries for the same nightstand: "if you face the bed, you need to select the nightstand that is on the right" and "nightstand closer to the desk." The first query explicitly indicates the viewing direction, while the second provides no such information.

Figure 1. The Nr3D visualization. There are two different types of query, view-explicit and view-implicit.
The paper states that the motivation for the research is to address the limitations of existing 3D visual grounding methods by making full use of the view property of queries.
The Multi-View Transformer (MVT) technique addresses these limitations through a series of steps.
First, MVT leverages a multi-view space to learn a more resilient multi-modal representation for 3D visual grounding. This strategy removes the dependence on any single original view and aggregates information from all views into a view-agnostic representation. According to the code in their public repository, this is achieved by rotating the entire point cloud four times to generate four sets of coordinates, which are then used together in subsequent training.
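A minimal sketch of this rotation step, assuming the views are evenly spaced rotations about the vertical axis (the function name, tensor shapes, and default of four views are illustrative, not the authors' exact code):

```python
import math

import torch

def build_multi_view_coords(xyz: torch.Tensor, num_views: int = 4) -> torch.Tensor:
    """Rotate coordinates about the z-axis to form a multi-view set.

    xyz: (B, N, 3) point/object coordinates.
    Returns: (B, num_views, N, 3), one rotated coordinate set per view.
    """
    views = []
    for v in range(num_views):
        theta = 2 * math.pi * v / num_views
        c, s = math.cos(theta), math.sin(theta)
        # Rotation matrix about the vertical (z) axis.
        rot = xyz.new_tensor([[c, -s, 0.0],
                              [s,  c, 0.0],
                              [0.0, 0.0, 1.0]])
        views.append(xyz @ rot.T)  # rotate every coordinate
    return torch.stack(views, dim=1)
```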
Second, MVT decouples the computation of 3D object representations by computing point cloud features and object coordinates separately. This separation allows the point cloud features to be shared across views. Based on their publicly available code, the coordinate information is passed through a Multi-Layer Perceptron and the result is added to the object feature matrix to form the new object features.
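A sketch of this decoupled encoding, under the assumption that a shared view-independent feature matrix is broadcast across views and summed with an MLP embedding of the per-view coordinates (the class name and layer sizes are guesses, not the paper's configuration):

```python
import torch
import torch.nn as nn

class ViewAwareObjectEncoder(nn.Module):
    """Point-cloud features are computed once and shared across views;
    per-view coordinates pass through an MLP and are added on top."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.coord_mlp = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )

    def forward(self, point_feats: torch.Tensor, view_coords: torch.Tensor):
        # point_feats: (B, N, D) view-independent features from the backbone
        # view_coords: (B, V, N, 3) object coordinates under each rotated view
        coord_emb = self.coord_mlp(view_coords)      # (B, V, N, D)
        return point_feats.unsqueeze(1) + coord_emb  # (B, V, N, D)
```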
Finally, the multi-modal feature fusion employs a standard transformer decoder, with BERT-encoded language features serving as queries and object features as keys. The decoder output is then reduced by averaging across the view dimension.
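The fusion stage might look roughly like the following, using PyTorch's stock transformer decoder; the dimensions, layer count, and per-view loop are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Language features attend to object features (queries vs. keys/values),
# then the result is averaged over the view dimension.
decoder_layer = nn.TransformerDecoderLayer(d_model=768, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)

B, V, N, L, D = 2, 4, 52, 24, 768
lang_feats = torch.randn(B, L, D)    # BERT-encoded query tokens
obj_feats = torch.randn(B, V, N, D)  # per-view object features

# Run the decoder once per view, then reduce by averaging across views.
fused = torch.stack(
    [decoder(tgt=lang_feats, memory=obj_feats[:, v]) for v in range(V)], dim=1
)                   # (B, V, L, D)
fused = fused.mean(dim=1)  # (B, L, D): view-aggregated multi-modal features
```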

Figure 2. The network structure of Multi-View Transformer.
The MVT approach outperforms state-of-the-art methods on several benchmarks, demonstrating its effectiveness in addressing the gaps of existing methods.
The paper validates the design of MVT with several evaluations. First, it compares MVT against state-of-the-art methods on the Nr3D, Sr3D, and ScanRefer benchmarks, where MVT outperforms all competing methods. Second, an ablation study analyzes the effectiveness of the individual components, showing that each contributes to overall performance. Third, the paper analyzes multi-view modeling by varying the number of views: multi-view modeling learns a representation that benefits different views, and increasing the view number at test time still improves grounding accuracy. The paper also provides visualizations showing MVT's effectiveness at localizing objects in 3D scenes.
The paper does not directly discuss limitations of the proposed MVT technique. It does note, however, that MVT requires computing point cloud features and object coordinates independently, a separation intended to let point cloud features be shared across viewpoints; this design choice might leave the model's performance constrained by the weakest of the modalities involved. Moreover, the paper acknowledges that for intricate queries or scenes, the network's capacity to comprehend the scene and associate objects with queries still needs improvement.
The technique presented in this paper is characterized by its simplicity and remarkable effectiveness. This success serves as an inspiration, prompting us to emphasize the distinct attributes of 3D scenes, rather than solely applying 2D methodologies to tasks involving 3D contexts.
For future work, the paper puts forth several directions. First, it acknowledges the prospect of extending MVT to other tasks, such as 3D object detection and segmentation, which could be valuable in VR or the metaverse. Second, it proposes enhancing MVT with supplementary information, such as object attributes and relationships. Third, it highlights the potential to apply MVT to other modalities, such as audio and haptic data, enabling comprehensive multi-modal grounding. Fourth, it suggests extending MVT to more intricate scenes, such as those involving occlusion and clutter. Lastly, it raises the prospect of using MVT to generate natural-language descriptions of 3D scenes, with applications in virtual assistants and robotics.
We have tried modifying its backbone to SAM, fixing the object mask, extending the position embedding, and applying distance normalization and a Fourier transform to the coordinates, but none of these performed as well as the original.

---

A classic boy-meets-girl story.

A destined tragedy.

Spoiler warning!!

Before Maine died, I thought this was a comedy: the protagonists dancing through a magical-realist world over the corpses of gangsters and corpo dogs, laughing their way to becoming true legends of Night City. Maybe they would walk away in style at the end, maybe the story would close with love fulfilled, maybe they would raise hell at Arasaka Tower.

Unexpectedly, all three visions came true, but "came true" in the wish-twisting manner of Mirror Master.

Lucy realized her dream of reaching the Moon and escaped the Earth that had been hell to her, but she lost David forever, and only in the last few minutes before his death did the two finally confess their struggles to each other. David went further than Maine ever did, becoming Night City's new legend and reaching the top of Arasaka Tower, yet he spent every day on the edge of cyberpsychosis, and in the end Adam Smasher's hammer blew him into fireworks beneath the tower. The other characters died or scattered; only Falco, the tough guy behind the wheel, survived, carrying David's jacket.

It hurts. Ah, it really hurts. Writers, how could you do this to Lucy?

Painful as it is, I have to say this ending is what cyberpunk truly is; Studio Trigger, you clearly understand cyberpunk. In Night City there are no living legends, let alone wishes granted. To the gigantic corporations, every individual who tries to bring them down and resist everything is an ant: even if you kill a squad of corpo dogs, millions more dream of becoming corpo dogs themselves. The corporation is a synonym for power and wealth; only inside one is there a stable future. After all, not everyone can "turn this broken body into blazing fire"; or rather, no one but V ever could. So I am willing to call this ending the "True End". It is what it is.

That said, I still wish I could see Lucy smile again the way she did in the ambulance. I hope the Cyberpunk 2077 DLC has a bit of crossover content; CDPR had better read the room.

---

From Amazon's AI team; the first author is now at ByteDance. The corresponding author is Mu Li, whose paper-reading videos on Bilibili are quite enjoyable.
This paper discusses optimization methods for object detection neural networks. Such networks fall into two categories: single-stage detectors, represented by YOLO, and multi-stage detectors, represented by Faster-RCNN. Building on a collection of existing training tricks for tuning models, the paper examines how to further tune object detection models and which tricks help which type of model.
| Dataset | Single-stage (YOLOv3) | Multi-stage (Faster-RCNN) |
|---|---|---|
| Pascal VOC | Removing data augmentation causes a huge performance loss (16% mAP), showing that single-stage models rely heavily on augmentation to synthesize previously unseen variations and improve predictive ability; the remaining tricks together add about 3.43% mAP, with mixup contributing the most (1.54), further supporting the paper's claim that single-stage models depend heavily on data augmentation. | Removing data augmentation causes only a very slight loss (0.16% mAP), showing that the heavy sampling in the proposal stage effectively substitutes for the random cropping used extensively in single-stage models; the remaining tricks bring a combined 3.55% gain, with the cosine LR schedule contributing the most (1.82). |
| MS COCO | Likewise, applying all the tricks together improves accuracy (4-5.4 points), and the effect is more pronounced at lower input resolutions. The final accuracy is comparable to Faster-RCNN while inference is faster. | The gains are relatively smaller (1-1.7 points), but improvements appear across many categories. |
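For reference, the cosine learning-rate schedule singled out in the table has the standard form below (a generic version without warmup; the base learning rate is an illustrative value, not the paper's):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 0.02) -> float:
    """Decay the learning rate from base_lr to 0 along a half cosine curve."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))
```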
Mixup can be applied at two different stages: pre-training the backbone and training the detection head. Experiments show that mixup helps at each stage individually, and applying it at both stages yields the best results.
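A generic mixup sketch for the backbone pre-training stage (for the detection stage the paper instead blends whole images and takes the union of their boxes; `alpha` here is an assumed hyperparameter):

```python
import numpy as np
import torch

def mixup(images: torch.Tensor, targets: torch.Tensor, alpha: float = 0.2):
    """Blend random pairs of images; losses are later weighted by lam."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    return mixed, targets, targets[perm], lam
```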

---

The author, Jingdong Wang, is now chief architect of the computer vision group at Baidu. He worked at Microsoft Research Asia from 2007 to 2021, and this paper was published during that period (2020).
Most segmentation and detection networks are built on image-classification backbones: they first convolve the image down and then upsample to recover high resolution, a process in which spatial resolution is lost. This paper instead aims to maintain high-resolution representations throughout the entire network.
The idea is to connect high-to-low resolution subnetworks in parallel, maintaining high resolution rather than recovering it from low resolution, thereby preserving the spatial precision of the representation.
The input image first passes through two stride-2 3x3 convolutions, reducing the resolution to 1/4. This resolution is then kept unchanged throughout the main body of the subsequent stages.
As the stages progress, high-to-low resolution streams are gradually added, and the streams at multiple resolutions are connected to one another.
Multi-Resolution Fusion: the process of connecting across resolutions. Low-to-high paths are upsampled, high-to-low paths are downsampled, and the transformed feature maps are summed.
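A two-branch sketch of this fusion rule (channel counts are assumptions; the real HRNet applies it between every pair of resolutions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionFusion(nn.Module):
    """Fuse two parallel streams: upsample the low-res map, strided-conv
    the high-res map, and sum each into the other branch."""

    def __init__(self, c_high: int = 32, c_low: int = 64):
        super().__init__()
        # low -> high: 1x1 conv to match channels, then nearest upsampling
        self.low_to_high = nn.Conv2d(c_low, c_high, kernel_size=1)
        # high -> low: stride-2 3x3 conv halves the resolution
        self.high_to_low = nn.Conv2d(c_high, c_low, kernel_size=3,
                                     stride=2, padding=1)

    def forward(self, x_high: torch.Tensor, x_low: torch.Tensor):
        up = F.interpolate(self.low_to_high(x_low),
                           size=x_high.shape[-2:], mode="nearest")
        down = self.high_to_low(x_high)
        return x_high + up, x_low + down
```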
Representation Head: the authors design three different head structures for different downstream tasks.

The experiments are rich and comprehensive, with extensive comparisons on different datasets (though most exist mainly for leaderboard-chasing); only part of the results is excerpted here.

The source code is clean and complete; an analysis of the core multi-branch creation and fusion code can be found here.

---

Convolutional neural networks were originally designed mainly for image classification: convolutions abstract the image into feature maps and feature vectors, and fully connected layers plus an activation compute the probability of each class. This paper discusses how to perform semantic segmentation and object recognition with convolutional networks.
The authors' fully convolutional network mainly addresses several problems: converting classification networks into fully convolutional ones that accept inputs of arbitrary size, upsampling the coarse outputs back to dense per-pixel predictions, and fusing deep semantic information with shallow appearance information through skip connections.
First set of experiments: combining FCN with network architectures already proven effective on classification, to show that existing CNN architectures can be converted into FCNs for segmentation and detection. Test set: VOC2011.
| Experiment | mean IoU | comment |
|---|---|---|
| FCN-AlexNet | 39.8 | |
| FCN-VGG16 | 56.0 | |
| FCN-GoogLeNet | 42.5 |
The results match expectations, confirming that image classification networks can be converted via FCN into segmentation and detection networks while retaining a reasonable level of accuracy.
Second set of experiments: testing how upsampling performs (the FCN-32s/16s/8s skip variants). Test set: VOC2011.
| Experiment | pixel acc. | mean IoU |
|---|---|---|
| FCN-32s-fixed | 83.0 | 45.4 |
| FCN-32s | 89.1 | 59.4 |
| FCN-16s | 90.0 | 62.4 |
| FCN-8s | 90.3 | 62.7 |
The results again match expectations: the smaller the number in the model name, the more predictions from earlier pooling layers are used, which means more fine-grained detail is recovered. FCN-8s, for example, fuses the stride-8 predictions from pool3, so its final upsampling rate is 8x, retaining more detail and yielding more precise per-pixel predictions. The gain beyond stride 8 is already marginal, suggesting that learning from shallow layers is near its limit and there is little to gain from going shallower.
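A schematic of the FCN-8s skip fusion just described (using bilinear upsampling in place of the paper's learned deconvolutions; channel counts assume a VGG16 backbone):

```python
import torch.nn as nn
import torch.nn.functional as F

class FCNSkipHead(nn.Module):
    """Upsample coarse scores and add finer scores from pool4 and pool3
    before the final 8x upsampling."""

    def __init__(self, num_classes: int = 21):
        super().__init__()
        self.score_pool3 = nn.Conv2d(256, num_classes, 1)
        self.score_pool4 = nn.Conv2d(512, num_classes, 1)

    def forward(self, score_conv7, pool4, pool3, out_size):
        # score_conv7: coarse (stride-32) class scores from the conv head
        s = F.interpolate(score_conv7, scale_factor=2, mode="bilinear",
                          align_corners=False)
        s = s + self.score_pool4(pool4)  # fuse stride-16 predictions
        s = F.interpolate(s, scale_factor=2, mode="bilinear",
                          align_corners=False)
        s = s + self.score_pool3(pool3)  # fuse stride-8 predictions
        return F.interpolate(s, size=out_size,  # final 8x upsampling
                             mode="bilinear", align_corners=False)
```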
Third set of experiments: comparison against earlier methods (baselines). Notably, inference time is also reduced dramatically.
| Method | mean IoU (VOC2011) | mean IoU (VOC2012) | inference time |
|---|---|---|---|
| R-CNN | 47.9 | ||
| SDS | 52.6 | 51.6 | ~50s |
| FCN-8s | 62.7 | 62.2 | ~175ms |
Fourth set of experiments: RGBD, testing four-channel image input with depth information added to RGB. The goal is presumably to test whether the model still performs well with multi-channel input.
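One common way to adapt a pretrained RGB first layer to a four-channel RGBD input is sketched below (a generic recipe, not necessarily the paper's):

```python
import torch
import torch.nn as nn

def widen_first_conv(conv_rgb: nn.Conv2d) -> nn.Conv2d:
    """Widen a 3-channel conv to 4 channels, reusing the RGB weights and
    initializing the depth channel to zero."""
    conv_rgbd = nn.Conv2d(4, conv_rgb.out_channels,
                          kernel_size=conv_rgb.kernel_size,
                          stride=conv_rgb.stride, padding=conv_rgb.padding)
    with torch.no_grad():
        conv_rgbd.weight[:, :3] = conv_rgb.weight  # reuse RGB filters
        conv_rgbd.weight[:, 3:].zero_()            # depth channel starts at zero
        if conv_rgb.bias is not None:
            conv_rgbd.bias.copy_(conv_rgb.bias)
    return conv_rgbd
```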

Fifth set of experiments: SIFT Flow, whose labels include both semantic segmentation information and geometric/background information, used as a multi-task learning test.
