This work proposes a new approach to 3D visual grounding, the Multi-View Transformer (MVT), which outperforms state-of-the-art methods. MVT projects the scene into a multi-view space to learn a more view-robust multi-modal representation for 3D visual grounding.
According to the paper, existing works in 3D visual grounding mainly follow a two-stage scheme: first generating all candidate objects in the 3D scene (by classification), then selecting the best-matching one. These two-stage approaches are adapted from 2D visual grounding methods and do not account for the properties unique to 3D data. The paper argues that such methods handle view changes poorly and fail to learn a view-robust representation.
In the 3D visual grounding task, queries fall into two fundamental groups: view-explicit and view-implicit. Taking Nr3D as an illustration, consider two queries for the same nightstand: "if you face the bed, you need to select the nightstand that is on the right" and "nightstand closer to the desk." The first query explicitly indicates the viewing direction, while the second provides no such information.

Figure 1. The Nr3D visualization. There are two different types of query, view-explicit and view-implicit.
The paper states that the motivation for the research is to address the limitations of existing 3D visual grounding methods by making full use of the view property of queries.
The Multi-View Transformer (MVT) technique addresses these limitations through a series of steps.
First, MVT leverages a multi-view space to learn a more resilient multi-modal representation for 3D visual grounding. This strategy removes the dependence on any single original view and aggregates information from all views into a view-agnostic representation. According to the code in their public repository, this is achieved by rotating the entire point cloud four times to generate four sets of coordinates, which are then used together in subsequent training.
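A minimal sketch of this rotation step, assuming the views are evenly spaced rotations about the vertical axis (the function name, tensor shapes, and default of four views are illustrative, not the authors' exact code):

```python
import math

import torch

def build_multi_view_coords(xyz: torch.Tensor, num_views: int = 4) -> torch.Tensor:
    """Rotate coordinates about the z-axis to form a multi-view set.

    xyz: (B, N, 3) point/object coordinates.
    Returns: (B, num_views, N, 3), one rotated coordinate set per view.
    """
    views = []
    for v in range(num_views):
        theta = 2 * math.pi * v / num_views
        c, s = math.cos(theta), math.sin(theta)
        # Rotation matrix about the vertical (z) axis.
        rot = xyz.new_tensor([[c, -s, 0.0],
                              [s,  c, 0.0],
                              [0.0, 0.0, 1.0]])
        views.append(xyz @ rot.T)  # rotate every coordinate
    return torch.stack(views, dim=1)
```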
Second, MVT decouples the computation of 3D object representations by computing point cloud features and object coordinates separately. This separation allows the point cloud features to be shared across views. Based on their publicly available code, the coordinate information is passed through a Multi-Layer Perceptron and the result is added to the object feature matrix to form the new object features.
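A sketch of this decoupled encoding, under the assumption that a shared view-independent feature matrix is broadcast across views and summed with an MLP embedding of the per-view coordinates (the class name and layer sizes are guesses, not the paper's configuration):

```python
import torch
import torch.nn as nn

class ViewAwareObjectEncoder(nn.Module):
    """Point-cloud features are computed once and shared across views;
    per-view coordinates pass through an MLP and are added on top."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.coord_mlp = nn.Sequential(
            nn.Linear(3, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )

    def forward(self, point_feats: torch.Tensor, view_coords: torch.Tensor):
        # point_feats: (B, N, D) view-independent features from the backbone
        # view_coords: (B, V, N, 3) object coordinates under each rotated view
        coord_emb = self.coord_mlp(view_coords)      # (B, V, N, D)
        return point_feats.unsqueeze(1) + coord_emb  # (B, V, N, D)
```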
Finally, the multi-modal feature fusion employs a standard transformer decoder, with BERT-encoded language features serving as queries and object features as keys. The decoder output is then reduced by averaging across the view dimension.
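The fusion stage might look roughly like the following, using PyTorch's stock transformer decoder; the dimensions, layer count, and per-view loop are assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Language features attend to object features (queries vs. keys/values),
# then the result is averaged over the view dimension.
decoder_layer = nn.TransformerDecoderLayer(d_model=768, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=4)

B, V, N, L, D = 2, 4, 52, 24, 768
lang_feats = torch.randn(B, L, D)    # BERT-encoded query tokens
obj_feats = torch.randn(B, V, N, D)  # per-view object features

# Run the decoder once per view, then reduce by averaging across views.
fused = torch.stack(
    [decoder(tgt=lang_feats, memory=obj_feats[:, v]) for v in range(V)], dim=1
)                   # (B, V, L, D)
fused = fused.mean(dim=1)  # (B, L, D): view-aggregated multi-modal features
```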

Figure 2. The network structure of Multi-View Transformer.
The MVT approach outperforms state-of-the-art methods on several benchmarks, demonstrating its effectiveness in addressing the gaps of existing methods.
The paper validates the design of MVT with several evaluations. First, it compares MVT against state-of-the-art methods on the Nr3D, Sr3D, and ScanRefer benchmarks, where MVT outperforms all competing methods. Second, an ablation study analyzes the effectiveness of the individual components, showing that each contributes to overall performance. Third, the paper analyzes multi-view modeling by varying the number of views: multi-view modeling learns a representation that benefits different views, and increasing the view number at test time still improves grounding accuracy. The paper also provides visualizations showing MVT's effectiveness at localizing objects in 3D scenes.
The paper does not directly discuss limitations of the proposed MVT technique. It does note, however, that MVT requires computing point cloud features and object coordinates independently, a separation intended to let point cloud features be shared across viewpoints; this design choice might leave the model's performance constrained by the weakest of the modalities involved. Moreover, the paper acknowledges that for intricate queries or scenes, the network's capacity to comprehend the scene and associate objects with queries still needs improvement.
The technique presented in this paper is characterized by its simplicity and remarkable effectiveness. This success serves as an inspiration, prompting us to emphasize the distinct attributes of 3D scenes, rather than solely applying 2D methodologies to tasks involving 3D contexts.
For future work, the paper puts forth several directions. First, it acknowledges the prospect of extending MVT to other tasks, such as 3D object detection and segmentation, which could be valuable in VR or the metaverse. Second, it proposes enhancing MVT with supplementary information, such as object attributes and relationships. Third, it highlights the potential to apply MVT to other modalities, such as audio and haptic data, enabling comprehensive multi-modal grounding. Fourth, it suggests extending MVT to more intricate scenes, such as those involving occlusion and clutter. Lastly, it raises the prospect of using MVT to generate natural-language descriptions of 3D scenes, with applications in virtual assistants and robotics.
We have tried modifying its backbone to SAM, fixing the object mask, extending the position embedding, and applying distance normalization and a Fourier transform to the coordinates, but none of these performed as well as the original.

---

A classic boy-meets-girl story.

A destined tragedy.

Spoiler warning!!

Before Maine died, I thought this was a comedy: the protagonists dancing through a magical-realist world over the corpses of gangsters and corpo dogs, laughing their way to becoming true legends of Night City. Maybe they would walk away in style at the end, maybe the story would close with love fulfilled, maybe they would raise hell at Arasaka Tower.

Unexpectedly, all three visions came true, but "came true" in the wish-twisting manner of Mirror Master.

Lucy realized her dream of reaching the Moon and escaped the Earth that had been hell to her, but she lost David forever, and only in the last few minutes before his death did the two finally confess their struggles to each other. David went further than Maine ever did, becoming Night City's new legend and reaching the top of Arasaka Tower, yet he spent every day on the edge of cyberpsychosis, and in the end Adam Smasher's hammer blew him into fireworks beneath the tower. The other characters died or scattered; only Falco, the tough guy behind the wheel, survived, carrying David's jacket.

It hurts. Ah, it really hurts. Writers, how could you do this to Lucy?

Painful as it is, I have to say this ending is what cyberpunk truly is; Studio Trigger, you clearly understand cyberpunk. In Night City there are no living legends, let alone wishes granted. To the gigantic corporations, every individual who tries to bring them down and resist everything is an ant: even if you kill a squad of corpo dogs, millions more dream of becoming corpo dogs themselves. The corporation is a synonym for power and wealth; only inside one is there a stable future. After all, not everyone can "turn this broken body into blazing fire"; or rather, no one but V ever could. So I am willing to call this ending the "True End". It is what it is.

That said, I still wish I could see Lucy smile again the way she did in the ambulance. I hope the Cyberpunk 2077 DLC has a bit of crossover content; CDPR had better read the room.

---

From Amazon's AI team; the first author is now at ByteDance. The corresponding author is Mu Li, whose paper-reading videos on Bilibili are quite enjoyable.
This paper discusses optimization methods for object detection neural networks. Such networks fall into two categories: single-stage detectors, represented by YOLO, and multi-stage detectors, represented by Faster-RCNN. Building on a collection of existing training tricks for tuning models, the paper examines how to further tune object detection models and which tricks help which type of model.
| Dataset | Single-stage (YOLOv3) | Multi-stage (Faster-RCNN) |
|---|---|---|
| Pascal VOC | Removing data augmentation causes a huge performance loss (16% mAP), showing that single-stage models rely heavily on augmentation to synthesize previously unseen variations and improve predictive ability; the remaining tricks together add about 3.43% mAP, with mixup contributing the most (1.54), further supporting the paper's claim that single-stage models depend heavily on data augmentation. | Removing data augmentation causes only a very slight loss (0.16% mAP), showing that the heavy sampling in the proposal stage effectively substitutes for the random cropping used extensively in single-stage models; the remaining tricks bring a combined 3.55% gain, with the cosine LR schedule contributing the most (1.82). |
| MS COCO | Likewise, applying all the tricks together improves accuracy (4-5.4 points), and the effect is more pronounced at lower input resolutions. The final accuracy is comparable to Faster-RCNN while inference is faster. | The gains are relatively smaller (1-1.7 points), but improvements appear across many categories. |
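For reference, the cosine learning-rate schedule singled out in the table has the standard form below (a generic version without warmup; the base learning rate is an illustrative value, not the paper's):

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float = 0.02) -> float:
    """Decay the learning rate from base_lr to 0 along a half cosine curve."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))
```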
Mixup can be applied at two different stages: pre-training the backbone and training the detection head. Experiments show that mixup helps at each stage individually, and applying it at both stages yields the best results.
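A generic mixup sketch for the backbone pre-training stage (for the detection stage the paper instead blends whole images and takes the union of their boxes; `alpha` here is an assumed hyperparameter):

```python
import numpy as np
import torch

def mixup(images: torch.Tensor, targets: torch.Tensor, alpha: float = 0.2):
    """Blend random pairs of images; losses are later weighted by lam."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    return mixed, targets, targets[perm], lam
```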

---

The author, Jingdong Wang, is now chief architect of the computer vision group at Baidu. He worked at Microsoft Research Asia from 2007 to 2021, and this paper was published during that period (2020).
Most segmentation and detection networks are built on image-classification backbones: they first convolve the image down and then upsample to recover high resolution, a process in which spatial resolution is lost. This paper instead aims to maintain high-resolution representations throughout the entire network.
The idea is to connect high-to-low resolution subnetworks in parallel, maintaining high resolution rather than recovering it from low resolution, thereby preserving the spatial precision of the representation.
The input image first passes through two stride-2 3x3 convolutions, reducing the resolution to 1/4. This resolution is then kept unchanged throughout the main body of the subsequent stages.
As the stages progress, high-to-low resolution streams are gradually added, and the streams at multiple resolutions are connected to one another.
Multi-Resolution Fusion: the process of connecting across resolutions. Low-to-high paths are upsampled, high-to-low paths are downsampled, and the transformed feature maps are summed.
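A two-branch sketch of this fusion rule (channel counts are assumptions; the real HRNet applies it between every pair of resolutions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionFusion(nn.Module):
    """Fuse two parallel streams: upsample the low-res map, strided-conv
    the high-res map, and sum each into the other branch."""

    def __init__(self, c_high: int = 32, c_low: int = 64):
        super().__init__()
        # low -> high: 1x1 conv to match channels, then nearest upsampling
        self.low_to_high = nn.Conv2d(c_low, c_high, kernel_size=1)
        # high -> low: stride-2 3x3 conv halves the resolution
        self.high_to_low = nn.Conv2d(c_high, c_low, kernel_size=3,
                                     stride=2, padding=1)

    def forward(self, x_high: torch.Tensor, x_low: torch.Tensor):
        up = F.interpolate(self.low_to_high(x_low),
                           size=x_high.shape[-2:], mode="nearest")
        down = self.high_to_low(x_high)
        return x_high + up, x_low + down
```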
Representation Head: the authors design three different head structures for different downstream tasks.

The experiments are rich and comprehensive, with extensive comparisons on different datasets (though most exist mainly for leaderboard-chasing); only part of the results is excerpted here.

The source code is clean and complete; an analysis of the core multi-branch creation and fusion code can be found here.

---

Convolutional neural networks were originally designed mainly for image classification: convolutions abstract the image into feature maps and feature vectors, and fully connected layers plus an activation compute the probability of each class. This paper discusses how to perform semantic segmentation and object recognition with convolutional networks.
The authors' fully convolutional network mainly addresses several problems: converting classification networks into fully convolutional ones that accept inputs of arbitrary size, upsampling the coarse outputs back to dense per-pixel predictions, and fusing deep semantic information with shallow appearance information through skip connections.
First set of experiments: combining FCN with network architectures already proven effective on classification, to show that existing CNN architectures can be converted into FCNs for segmentation and detection. Test set: VOC2011.
| Experiment | mean IoU | comment |
|---|---|---|
| FCN-AlexNet | 39.8 | |
| FCN-VGG16 | 56.0 | |
| FCN-GoogLeNet | 42.5 |
The results match expectations, confirming that image classification networks can be converted via FCN into segmentation and detection networks while retaining a reasonable level of accuracy.
Second set of experiments: testing how upsampling performs (the FCN-32s/16s/8s skip variants). Test set: VOC2011.
| Experiment | pixel acc. | mean IoU |
|---|---|---|
| FCN-32s-fixed | 83.0 | 45.4 |
| FCN-32s | 89.1 | 59.4 |
| FCN-16s | 90.0 | 62.4 |
| FCN-8s | 90.3 | 62.7 |
The results again match expectations: the smaller the number in the model name, the more predictions from earlier pooling layers are used, which means more fine-grained detail is recovered. FCN-8s, for example, fuses the stride-8 predictions from pool3, so its final upsampling rate is 8x, retaining more detail and yielding more precise per-pixel predictions. The gain beyond stride 8 is already marginal, suggesting that learning from shallow layers is near its limit and there is little to gain from going shallower.
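A schematic of the FCN-8s skip fusion just described (using bilinear upsampling in place of the paper's learned deconvolutions; channel counts assume a VGG16 backbone):

```python
import torch.nn as nn
import torch.nn.functional as F

class FCNSkipHead(nn.Module):
    """Upsample coarse scores and add finer scores from pool4 and pool3
    before the final 8x upsampling."""

    def __init__(self, num_classes: int = 21):
        super().__init__()
        self.score_pool3 = nn.Conv2d(256, num_classes, 1)
        self.score_pool4 = nn.Conv2d(512, num_classes, 1)

    def forward(self, score_conv7, pool4, pool3, out_size):
        # score_conv7: coarse (stride-32) class scores from the conv head
        s = F.interpolate(score_conv7, scale_factor=2, mode="bilinear",
                          align_corners=False)
        s = s + self.score_pool4(pool4)  # fuse stride-16 predictions
        s = F.interpolate(s, scale_factor=2, mode="bilinear",
                          align_corners=False)
        s = s + self.score_pool3(pool3)  # fuse stride-8 predictions
        return F.interpolate(s, size=out_size,  # final 8x upsampling
                             mode="bilinear", align_corners=False)
```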
Third set of experiments: comparison against earlier methods (baselines). Notably, inference time is also reduced dramatically.
| Method | mean IoU (VOC2011) | mean IoU (VOC2012) | inference time |
|---|---|---|---|
| R-CNN | 47.9 | ||
| SDS | 52.6 | 51.6 | ~50s |
| FCN-8s | 62.7 | 62.2 | ~175ms |
Fourth set of experiments: RGBD, testing four-channel image input with depth information added to RGB. The goal is presumably to test whether the model still performs well with multi-channel input.
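One common way to adapt a pretrained RGB first layer to a four-channel RGBD input is sketched below (a generic recipe, not necessarily the paper's):

```python
import torch
import torch.nn as nn

def widen_first_conv(conv_rgb: nn.Conv2d) -> nn.Conv2d:
    """Widen a 3-channel conv to 4 channels, reusing the RGB weights and
    initializing the depth channel to zero."""
    conv_rgbd = nn.Conv2d(4, conv_rgb.out_channels,
                          kernel_size=conv_rgb.kernel_size,
                          stride=conv_rgb.stride, padding=conv_rgb.padding)
    with torch.no_grad():
        conv_rgbd.weight[:, :3] = conv_rgb.weight  # reuse RGB filters
        conv_rgbd.weight[:, 3:].zero_()            # depth channel starts at zero
        if conv_rgb.bias is not None:
            conv_rgbd.bias.copy_(conv_rgb.bias)
    return conv_rgbd
```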

Fifth set of experiments: SIFT Flow, whose labels include both semantic segmentation information and geometric/background information, used as a multi-task learning test.
