Multi-View Transformer for 3D Visual Grounding Paper Reading (2023-08-22)
https://mrtater.github.io/posts/PaperReading/PaperReading-MVT-3DVG
Easy and effective 3D visual grounding.

Multi-View Transformer for 3D Visual Grounding

Abstract

The paper proposes a new approach to 3D visual grounding, the Multi-View Transformer (MVT), which outperforms prior state-of-the-art methods. MVT projects the scene into a multi-view space to learn a multi-modal representation that is robust to viewpoint changes.

Gaps in existing works and motivation

According to the paper, existing works on 3D visual grounding mainly follow a two-stage scheme: first generate all candidate objects in the 3D scene (by classification), then select the best-matching one. These two-stage approaches are adapted from 2D visual grounding methods and ignore a property unique to 3D data: the paper argues that they handle view changes poorly and do not learn a view-robust representation.

In the 3D visual grounding task, queries fall into two fundamental groups: view-explicit and view-implicit. Taking Nr3D as an illustration, consider two utterances querying the same nightstand: "if you face the bed, you need to select the nightstand that is on the right" and "nightstand closer to the desk." The first query explicitly fixes the viewing direction, while the second provides no such information.

Figure 1. The Nr3D visualization. There are two different types of query: view-explicit and view-implicit.

The stated motivation is to address these limitations by making full use of the view information contained in queries.

Methodology

The Multi-View Transformer (MVT) technique addresses the limitations of existing approaches through a series of steps.

Initially, the MVT method leverages a multi-view space to build a view-robust multi-modal representation for 3D visual grounding. This removes the dependence on the initial view and aggregates information from all views into a viewpoint-agnostic representation. According to the code in their public repository, this is achieved by rotating the entire point cloud four times to generate four sets of coordinates, which are all used in the subsequent training phases.
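As a rough illustration of this multi-view construction (a minimal sketch, not the authors' actual code; the function name, tensor shapes, and the choice of the z-axis as rotation axis are assumptions), rotating a point cloud N times about the vertical axis might look like:

```python
import math

import torch

def build_views(xyz: torch.Tensor, num_views: int = 4) -> torch.Tensor:
    """Rotate a point cloud about the z (up) axis to form multiple views.

    xyz: (num_points, 3) coordinates -> (num_views, num_points, 3).
    """
    views = []
    for i in range(num_views):
        theta = 2.0 * math.pi * i / num_views
        c, s = math.cos(theta), math.sin(theta)
        # Rotation matrix about the z-axis by angle theta.
        rot = torch.tensor([[c, -s, 0.0],
                            [s, c, 0.0],
                            [0.0, 0.0, 1.0]], dtype=xyz.dtype)
        views.append(xyz @ rot.T)
    return torch.stack(views, dim=0)
```

Each rotated copy shares the same per-point features; only the coordinates differ between views.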

Subsequently, the MVT approach decouples the computation of 3D object representations: point-cloud features and object coordinates are calculated separately, so the point-cloud features can be shared across viewpoints. Based on their public code, this is accomplished by passing the coordinate information through a multi-layer perceptron and adding the result to the object feature matrix to form the new object features.
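A minimal sketch of this decoupling (layer sizes and names are assumptions, not the repository's actual code): per-view box coordinates pass through an MLP, and the resulting position embedding is added to the view-shared object features:

```python
import torch
import torch.nn as nn

class ViewPositionEncoder(nn.Module):
    """Add per-view coordinate embeddings to view-shared object features."""

    def __init__(self, coord_dim: int = 3, feat_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(coord_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, obj_feats, view_coords):
        # obj_feats:   (batch, num_objects, feat_dim), computed once
        # view_coords: (batch, num_views, num_objects, coord_dim)
        pos = self.mlp(view_coords)          # per-view position embedding
        return obj_feats.unsqueeze(1) + pos  # broadcast over the view axis
```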

Lastly, multi-modal feature fusion employs a conventional transformer decoder in which BERT-encoded language features serve as queries and object features as keys. The decoder output is then reduced by averaging across the view dimension.
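Sketching the fusion step under the same caveats, with PyTorch's stock nn.TransformerDecoder standing in for the paper's decoder (hyperparameters and shapes are assumptions):

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Cross-attend language queries to per-view object features,
    then average the decoder output over the view dimension."""

    def __init__(self, dim: int = 768, num_layers: int = 4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, lang_feats, obj_feats):
        # lang_feats: (batch, num_tokens, dim) -- BERT-encoded queries
        # obj_feats:  (batch, num_views, num_objects, dim) -- keys/values
        b, v, n, d = obj_feats.shape
        lang = lang_feats.repeat_interleave(v, dim=0)  # one copy per view
        objs = obj_feats.reshape(b * v, n, d)
        fused = self.decoder(tgt=lang, memory=objs)    # cross-attention
        fused = fused.reshape(b, v, -1, d)
        return fused.mean(dim=1)                       # average over views
```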

Figure 2. The network structure of the Multi-View Transformer.

The MVT approach outperforms state-of-the-art methods on several datasets, demonstrating its effectiveness in addressing the gaps in existing work.

The paper validates the design of MVT with several evaluations. First, it compares MVT with state-of-the-art methods on three datasets: Nr3D, Sr3D, and ScanRefer; MVT outperforms all other methods on all three. Second, an ablation study analyzes the components of the approach and shows that each contributes to the overall performance. Third, the paper analyzes multi-view modeling by varying the number of views: the results show that multi-view modeling learns a representation that benefits all views, and that increasing the view number at test time still improves grounding accuracy. Visualization results further illustrate MVT's ability to localize objects in 3D scenes.

Constraint

The paper does not directly discuss limitations of the proposed Multi-View Transformer (MVT) technique. It does note, however, that MVT requires computing point-cloud features and object coordinates independently; this separation is what allows point-cloud features to be shared across viewpoints, but it may leave the model's performance bounded by the weakest of the contributing modalities. Moreover, the paper acknowledges that for intricate queries or scenes, the network's capacity to comprehend the scene and associate objects with queries still needs improvement.

Future direction

The technique presented in this paper stands out for its simplicity and effectiveness. Its success is a reminder to exploit the distinct attributes of 3D scenes rather than simply transplanting 2D methodologies into 3D tasks.

The paper suggests several directions for future work. First, the MVT methodology could be extended to other tasks such as 3D object detection and segmentation, which could be extremely useful in VR or the metaverse. Second, the approach could be enhanced with supplementary information such as object attributes and relationships. Third, MVT could be applied to other modalities like audio and haptic data, enabling comprehensive multi-modal grounding. Fourth, the approach could be extended to more intricate scenes, such as those with occlusion and clutter. Lastly, MVT could be leveraged to generate natural-language descriptions of 3D scenes, with applications in virtual assistants and robotics.

Some comments

We have tried swapping its backbone for SAM, fixing the object mask, extending the position embedding, and applying distance normalization and a Fourier transform to the coordinates, but none of these matched the performance of the original model.

*Cyber Punk: Edge Runner* review (2022-10-19)
https://mrtater.github.io/posts/Thoughts/RandomWords-Cyberpunk
Thoughts fade fast if I don't write them down, so they must be carved into stone. Silicon counts as stone.

Cyber Punk: Edge Runner review

A classic boy-meets-girl story
A destined tragedy

Spoiler warning!!

Before Maine died, I thought this was a comedy: the protagonists dancing through magical realism over the corpses of gangers and corpo dogs, laughing their way to becoming true legends of Night City. Maybe they would walk away in style at the end, maybe the story would close with their love fulfilled, maybe they would raise hell in Arasaka Tower.

I never expected all three of those visions to come true, but "come true" in the twisted, Mirror-Master sense.

Lucy fulfilled her dream of reaching the Moon and escaped the Earth that was hell to her, but she lost David forever, and only in the last few minutes before his death did the two finally confess their struggles to each other. David went further than Maine ever did, becoming Night City's new legend and standing at the top of Arasaka Tower, yet every day he teetered on the edge of cyberpsychosis, and in the end Adam Smasher blew him into fireworks at the foot of the tower. The other characters died or scattered; only Falco, the tough-guy driver, survived, carrying David's jacket.

It doesn't sit right. Ah, it really doesn't sit right. Writers, how could you do this to Lucy?

And yet, bitter as it is, I have to admit this ending is what cyberpunk truly is. Trigger, you clearly understand cyberpunk. There are no living legends in Night City, much less wishes granted. To corporations of that scale, every individual who tries to bring them down and rebel against everything is an ant: even if you kill a squad of corpo dogs, millions more dream of becoming corpo dogs themselves. The corporation is a synonym for power and wealth, and joining one is the only stable future. After all, not everyone can "turn this broken body into a raging fire"; or rather, no one but V could ever do it. So I am willing to call this ending the "True End." It is what it is.

That said, I still want to see Lucy smile again the way she did in the ambulance. I hope the 2077 DLC gets a bit of tie-in content; CDPR had better take the hint.

Bag of Freebies for Training Object Detection Neural Networks paper notes (2022-06-24)
https://mrtater.github.io/posts/PaperReading/PaperReading-finetuning
Want performance gains for free? Then you've come to the right place!

Author information

From Amazon's AI team; the first author is now at ByteDance. The corresponding author is Mu Li, whose paper-reading videos on Bilibili are quite enjoyable.

Problem definition

This paper discusses optimization techniques for training object detection neural networks. These networks fall into two categories: single-stage detectors, represented by YOLO, and multi-stage detectors, represented by Faster-RCNN. The optimizations covered include:

  • MixUp
  • label smoothing for the classification head
  • transformations (random crop, etc.)
  • learning-rate schedules
  • synchronized batch normalization
  • random shape training

Related work

Various techniques for tuning models, including:

  • learning rate warmup
  • label smoothing
  • and the two families of object detectors:
    • multi-stage
    • single-stage

Building on these works, the paper discusses how to further tune object detection models, exploring which techniques help which models.

Motivation and approach

The authors investigate how to further tune object detection models, exploring which techniques give which models an additional boost.

Algorithm overview

  • MixUp: a geometry-preserved alignment of mixed images that retains complete information (object labels are merged into a new array) also improves detection accuracy (a sketch follows this list)
  • data preprocessing
    • single-stage (YOLOv3): sensitive to transformations, including standard ones such as random flip, rotate, and crop
    • multi-stage (Faster-RCNN): insensitive to transformations; the paper's explanation is that sampling-based methods already perform heavy cropping on the feature map when extracting ROIs, which substitutes for crop-style preprocessing
  • training schedule
    • a step schedule changes the learning rate too abruptly and forces the optimizer to re-stabilize its momentum
    • a cosine schedule changes the learning rate more smoothly and works better
    • a warmup schedule matters greatly for some detectors, e.g. YOLOv3 (negative examples dominate the gradients at the start, so a large initial learning rate drives the scores of the important samples toward 0); a well-configured cosine schedule combined with warmup lifts overall training quality
  • synchronized batch normalization
    • unsynchronized BN inevitably shrinks the effective batch size and introduces per-device statistics differences (each card normalizes on its own); this can be worse at small batch sizes (e.g. one image per card when training on high-resolution images)
  • randomly sized input images (single-stage networks only)
    • Faster-RCNN already accepts images of many different sizes
    • random input sizes reduce the risk of overfitting and improve generalization
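A minimal sketch of the geometry-preserved MixUp for detection described above, assuming NumPy image arrays and (n, 4) box arrays; the function name and the per-box loss-weight convention are illustrative rather than the paper's exact recipe:

```python
import numpy as np

def detection_mixup(img_a, boxes_a, img_b, boxes_b, alpha=1.5):
    """Blend two images without distorting geometry; keep all boxes."""
    lam = np.random.beta(alpha, alpha)
    h = max(img_a.shape[0], img_b.shape[0])
    w = max(img_a.shape[1], img_b.shape[1])
    # Pad both images into a shared canvas so no box coordinates change.
    mixed = np.zeros((h, w, 3), dtype=np.float32)
    mixed[:img_a.shape[0], :img_a.shape[1]] += lam * img_a
    mixed[:img_b.shape[0], :img_b.shape[1]] += (1.0 - lam) * img_b
    # Labels from both images are concatenated into one new array;
    # each box carries its mixing weight into the loss.
    boxes = np.concatenate([boxes_a, boxes_b], axis=0)
    weights = np.concatenate([np.full(len(boxes_a), lam),
                              np.full(len(boxes_b), 1.0 - lam)])
    return mixed, boxes, weights
```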

Experimental results

| Dataset | Single-stage (YOLOv3) | Multi-stage (Faster-RCNN) |
| --- | --- | --- |
| Pascal VOC | Removing data augmentation costs a huge 16% mAP, showing that single-stage models rely heavily on augmentation to create previously unseen variations and boost predictive power; the remaining tricks together add about 3.43% mAP, with MixUp the largest contributor (1.54), further supporting the claim that single-stage models depend heavily on augmentation. | Removing data augmentation costs only a very slight 0.16% mAP, showing that the heavy sampling in the proposal layer effectively replaces the random cropping used extensively in single-stage training; the other tricks add 3.55% mAP in total, with the cosine LR schedule the largest contributor (1.82). |
| MS COCO | Stacking all the tricks likewise lifts accuracy (4-5.4 points), with larger gains at lower input resolutions; final accuracy is on par with Faster-RCNN while inference is faster. | The gains are smaller (1-1.7 points), but many categories improve. |

MixUp can be applied at two different stages: pre-training the backbone and training the detection head. The experiments show that MixUp helps at each stage on its own, and applying it at both stages yields the best results.

Deep High-Resolution Representation Learning for Visual Recognition paper notes (2022-06-08)
https://mrtater.github.io/posts/PaperReading/PaperReading-HRNet
A culmination of multi-scale fusion, at work across a wide range of segmentation tasks.

Author information

The author Jingdong Wang is now chief architect of the computer vision group at Baidu. He worked at Microsoft Research Asia from 2007 to 2021, during which time this paper was published (2020).

Problem definition

Most segmentation and detection networks build on image-classification architectures: convolve down first, then upsample to recover high resolution, a process in which spatial resolution is lost. This paper proposes a new architecture that keeps a high-resolution representation throughout the whole network.

Related work

Learning low-resolution representations

  • Represented by the fully convolutional network, the progenitor of CNN-based semantic segmentation.

Recovering high-resolution representations

  • High resolution can be recovered gradually through upsampling, roughly divided into:
    • symmetric upsampling: the upsampling path mirrors the convolutional path, e.g. VGGNet-style encoder-decoders
    • asymmetric upsampling

Maintaining high-resolution representations

  • Closely related to this paper, but earlier works (convolutional neural fabrics and interlinked CNNs) had problems:
    • no careful design of when to start the low-resolution parallel streams
    • no carefully designed information exchange between the parallel streams
    • no batch normalization or residual connections
  • Another work, Grid-Net, has two symmetric information-exchange phases: the first passes information from high resolution to low, the second from low to high. This separation limits its segmentation quality.
  • Multi-scale DenseNet receives no information from the low-resolution streams and is therefore unable to learn strong high-resolution representations.

Multi-scale fusion

  • The simplest, most direct approach feeds images at different resolutions into separate networks and merges the output maps.
  • Low-level features obtained from high-to-low downsampling can also be combined, via skip connections, with high-level features obtained from low-to-high upsampling.
  • Pyramid pooling and atrous spatial pyramid pooling, represented by DeepLabv3.
  • This paper's model differs in fusing the outputs of four resolutions rather than one, and in repeating the fusion several times.

Motivation and approach

By connecting the high-to-low subnetworks in parallel, the paper maintains high resolution rather than recovering it from low resolution, preserving the spatial precision of the representation.

Algorithm overview

The input image first passes through two stride-2 3×3 convolutions, which reduce the resolution to 1/4. This 1/4-resolution stream, with channel width C, remains unchanged through the main body's subsequent stages.
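A minimal sketch of this stem in PyTorch (the channel width of 64 is an assumption):

```python
import torch.nn as nn

# Two stride-2 3x3 convolutions reduce the input to 1/4 resolution.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```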

From there, high-to-low streams are added stage by stage, and the streams at different resolutions are connected to one another.

  • The main body has four stages, whose modules are repeated 1, 1, 4, and 3 times respectively.
  • Each module consists of 4 residual units; within each unit, every resolution gets two 3×3 convolutions, each followed by batch normalization and ReLU activation.
  • Multi-resolution fusion: low-to-high branches are upsampled, high-to-low branches are downsampled, and the transformed feature maps are summed (see the sketch after this list).
  • Representation head: the authors design three variants for different tasks:
    • HRNetV1: outputs only the high-resolution features; used for pose estimation.
    • HRNetV2: upsamples the lower-resolution representations and concatenates the maps from all four resolutions; used for segmentation.
    • HRNetV2p: downsamples HRNetV2's high-resolution output into a feature pyramid; used for object detection.
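A two-branch sketch of the repeated multi-resolution fusion (channel counts are assumptions; the real network fuses up to four branches, and this is not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseTwoBranches(nn.Module):
    """Fuse a high-resolution and a low-resolution branch by summation."""

    def __init__(self, c_high: int = 32, c_low: int = 64):
        super().__init__()
        # low -> high: 1x1 conv to match channels, then bilinear upsampling
        self.low_to_high = nn.Conv2d(c_low, c_high, kernel_size=1, bias=False)
        # high -> low: stride-2 3x3 conv halves the resolution
        self.high_to_low = nn.Conv2d(c_high, c_low, kernel_size=3,
                                     stride=2, padding=1, bias=False)

    def forward(self, x_high, x_low):
        up = F.interpolate(self.low_to_high(x_low),
                           size=x_high.shape[-2:], mode="bilinear",
                           align_corners=False)
        down = self.high_to_low(x_high)
        return x_high + up, x_low + down  # element-wise sum per branch
```

Summing after resizing keeps every branch alive at its own resolution, which is what lets HRNet avoid the downsample-then-recover pipeline.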

Experimental results

The experiments are rich and comprehensive, with extensive comparisons across datasets (though most exist for leaderboard-chasing); only part of the results is excerpted here.

Experiment 1: Pose estimation (HRNetV1)

  • Conclusion: pre-training brings a clear improvement; W48 improves on W32, but only slightly, while parameter count and compute multiply (a consequence of the architecture); state of the art.

Experiment 2: Semantic segmentation

  • Cityscapes val: adding OCR brings a 0.5% gain, but GFLOPs nearly double
  • Cityscapes test: HRNetV2-W48 is state of the art
  • PASCAL-Context: adding OCR brings a 2% gain (GFLOPs not reported)

Experiment 3: Object detection

  • State of the art on COCO val

Source code analysis

The source code is clean and complete; an analysis of its core multi-branch creation and fusion code is available here.

Fully Convolutional Networks for Semantic Segmentation paper notes (2022-06-07)
https://mrtater.github.io/posts/PaperReading/PaperReading-FCN
The pioneering work of segmentation, still a widely used head today.

Author information

  • The two co-first authors are Jonathan Long and Evan Shelhamer, both Caffe developers.
  • The corresponding author is Trevor Darrell, a professor at UC Berkeley, advisor to the two above, and also one of the Caffe developers.

Problem definition

Convolutional neural networks were originally designed for image classification: convolutions abstract an image into feature maps and feature vectors, and fully connected layers plus an activation compute the probability of each class. This paper discusses how to use convolutional networks for semantic segmentation and object recognition.

Related work

  • Although convolutional networks appeared many years before this paper, few works used them for object detection, fewer used fully convolutional networks, and fewer still could be trained end-to-end.
    • Among the closest is "Joint training of a convolutional network and a graphical model for human pose estimation," which, in the authors' words, used the idea without understanding it: it offers no analysis or explanation of the architecture's advantages.
    • Another related work, "Spatial pyramid pooling in deep convolutional networks for visual recognition," removes the non-convolutional parts of a classification network and combines spatial pyramid pooling with proposals to obtain localized features, but cannot be trained end-to-end.
  • Another related direction is convolution for dense prediction, where works typically rely on many small tricks
  • small models, with limited capacity and receptive fields
    • patchwise training (important; this paper uses a similar idea)
    • shift-and-stitch (important)
  • proposal algorithms, represented by fine-tuned versions of R-CNN, which are not trained end-to-end

Motivation and approach

The authors' fully convolutional network mainly addresses the following:

  • end-to-end training
  • image inputs of arbitrary size
  • enabling convolutional networks to perform segmentation and detection
  • improved computational efficiency

The approach and implementation are described in the algorithm overview.

Algorithm overview

  • The training pipeline is very direct. The authors do one thing: replace the final fully connected layers with convolutional layers, so the network outputs not a probability vector but coarse output maps (heat maps) that carry label information.
  • The coarse output map is then connected back to every pixel, classifying each pixel individually to obtain a very accurate segmentation. For this coarse-map-to-pixel step the authors compare shift-and-stitch against upsampling (see the sketch after this list):
    • the former involves a compromise: the receptive field size is not reduced, but the filters cannot change size either
    • the latter is what the authors actually use: feature maps from earlier convolution and pooling layers are kept, and backward (transposed) convolution fills back in the details lost to abstraction, finally restoring the image resolution
  • Drawbacks:
    • upsampling inevitably loses image information; details remain somewhat blurry, and the segmentation can only describe rough contours
    • classifying each pixel on its own does not fully account for pixel-to-pixel relationships; it skips the spatial regularization step used in conventional pixel-classification segmentation methods and lacks spatial consistency (from Zhihu; my understanding of this part is still shallow)
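A toy sketch of the two core moves, convolutionalizing the classifier and fusing an upsampled coarse map with a shallower skip feature (channel counts, names, and the single skip are illustrative assumptions, not the paper's VGG-based model):

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """A 1x1 conv replaces the classifier's fully connected layer,
    and a transposed conv upsamples the coarse score map."""

    def __init__(self, in_channels: int = 512, num_classes: int = 21):
        super().__init__()
        # "Convolutionalized" classifier: per-location class scores.
        self.score = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        # Learned 2x upsampling (backward convolution in the paper's terms).
        self.up = nn.ConvTranspose2d(num_classes, num_classes,
                                     kernel_size=4, stride=2, padding=1)
        # Skip connection: score a shallower feature map and add it in.
        self.skip_score = nn.Conv2d(in_channels // 2, num_classes,
                                    kernel_size=1)

    def forward(self, deep_feat, shallow_feat):
        # deep_feat: (b, 512, h, w); shallow_feat: (b, 256, 2h, 2w)
        coarse = self.up(self.score(deep_feat))        # 2x upsample
        return coarse + self.skip_score(shallow_feat)  # fuse finer details
```

Stacking one more such skip-and-upsample step on an even shallower map is what turns FCN-32s into FCN-16s and FCN-8s.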

Experimental results

First experiment set: combine FCN with architectures already proven effective for classification, to show that existing convolutional networks can be converted into FCNs for segmentation and detection. Test set: VOC2011.

| Model | mean IoU | Comment |
| --- | --- | --- |
| FCN-AlexNet | 39.8 | |
| FCN-VGG16 | 56.0 | |
| FCN-GoogLeNet | 42.5 | |

The results match expectations, showing that image-classification networks can be converted via FCN into segmentation and detection networks with a respectable level of accuracy.

Second experiment set: evaluate the behavior of upsampling. Test set: VOC2011.

| Model | pixel acc. | mean IoU |
| --- | --- | --- |
| FCN-32s-fixed | 83.0 | 45.4 |
| FCN-32s | 89.1 | 59.4 |
| FCN-16s | 90.0 | 62.4 |
| FCN-8s | 90.3 | 62.7 |

These results also match expectations: the smaller the number in the name, the more predictions from pooling layers the network uses, which means more extra detail. Take FCN-8s: it uses pool3's stride-8 prediction, so the final upsampling rate is 8x, preserving more detail and giving more precise per-pixel predictions. The gains beyond stride 8 are no longer significant, suggesting shallow-feature learning is near its limit and there is little point in pushing to even shallower features.

Third experiment set: comparison against earlier architectures (baselines). Notably, computation time also drops dramatically.

| Model | mean IoU (VOC2011) | mean IoU (VOC2012) | Inference time |
| --- | --- | --- | --- |
| R-CNN | 47.9 | | |
| SDS | 52.6 | 51.6 | ~50 s |
| FCN-8s | 62.7 | 62.2 | ~175 ms |

Fourth experiment set: RGB-D, a four-channel input test that adds depth information to RGB. The goal is presumably to test whether the model still performs well with more input channels.

Fifth experiment set: SIFT Flow, which contains both segmentation labels and background information; a multi-task learning test.
