Behavior cloning, the dominant approach for training autonomous vehicle (AV) policies, suffers from a fundamental gap: policies trained open-loop on temporally independent samples must operate in closed-loop where actions influence future observations. This mismatch can cause covariate shift, compounding errors, and poor interactive behavior, among other issues. Closed-loop training mitigates the problem by exposing policies to the consequences of their actions during training. However, the recent shift to end-to-end (“sensor to action”) systems has made closed-loop training significantly more complex, requiring costly high-dimensional rendering and managing sim-to-real gaps. This survey presents a comprehensive taxonomy of closed-loop training techniques for end-to-end driving, organized along three axes: action generation (policy rollouts vs. perturbed demonstrations); environment response generation (real-world data collection, AV simulation, generative video and latent world models); and training objectives (closed-loop imitation, reinforcement learning, and their combinations). We analyze key trade-offs along each axis: on-policy vs. on-expert action generation, environment fidelity vs. cost, and expert vs. reward-based training objectives; as well as coupling factors, such as rollout deviation from the policy, expert, and real-world logs; and data type, throughput, and latency requirements. The analysis reveals gaps between current research and industry practice, and points to promising directions for future work.
Autonomous vehicle (AV) stacks have traditionally relied on decomposed approaches, with separate modules handling perception, prediction, and planning. However, this design introduces information loss during inter-module communication, increases computational overhead, and can lead to compounding errors. To address these challenges, recent works have proposed architectures that integrate all components into an end-to-end differentiable model, enabling holistic system optimization. This shift emphasizes data engineering over software integration, offering the potential to enhance system performance by simply scaling up training resources. In this work, we evaluate the performance of a simple end-to-end driving architecture on internal driving datasets ranging in size from 16 to 8192 hours with both open-loop metrics and closed-loop simulations. Specifically, we investigate how much additional training data is needed to achieve a target performance gain, e.g., a 5% improvement in motion prediction accuracy. By understanding the relationship between model performance and training dataset size, we aim to provide insights for data-driven decision-making in autonomous driving development.
Despite the ongoing efforts to digitalize the world, there is still a vast amount of paper documents that need to be processed. Document image dewarping is a crucial step in the digitization process, as it aims to remove the distortions induced by challenging environment settings and document sheet deformations often encountered when using smartphone cameras for image capture. With better dewarping results, subsequent document analysis tasks, such as text recognition, information extraction, and classification, can be performed more accurately. Recently, deep learning-based methods were combined with knowledge about the expected document structure, also known as a template, at inference time to improve the dewarping results. While this approach has shown promising results, its utilization of the template information can be further improved. Our contributions in this work are threefold: (1) we propose a novel document image dewarping approach that leverages the prior knowledge about the document structure effectively by detecting and matching lines between the warped and the template domain; (2) we introduce a novel evaluation metric called matched normalized character error rate (mnCER) to overcome the limitations of existing metrics in evaluating the dewarping process; and (3) we evaluate our approach on the Inv3DReal dataset and show that it outperforms the state-of-the-art methods in terms of visual and text-based metrics, improving upon them by 32.6% in Local Distortion and 40.2% in mnCER. Our code and models are available at https://github.com/FelixHertlein/doc-matcher.
Programming tutorials in the form of coding screencasts play a crucial role in programming education, serving both novices and experienced developers. However, the video format of these tutorials presents a challenge due to the difficulty of searching for and within videos. Addressing the absence of large-scale and diverse datasets for screencast analysis, we introduce the CodeSCAN dataset. It comprises 12,000 screenshots captured from the Visual Studio Code environment during development, featuring 24 programming languages, 25 fonts, and over 90 distinct themes, in addition to diverse layout changes and realistic user interactions. Moreover, we conduct detailed quantitative and qualitative evaluations to benchmark the performance of Integrated Development Environment (IDE) element detection, color-to-black-and-white conversion, and Optical Character Recognition (OCR). We hope that our contributions facilitate more research in coding screencast analysis, and we make the source code for creating the dataset and the benchmark publicly available on this website.
You want to run some code you found on GitHub, but the dependencies (e.g. Ubuntu package versions) sound scary to install, since they might interfere with existing, potentially fragile dependencies? When using Devcontainers, this is no problem at all, since every container is an isolated environment which you can freely and quickly configure. Thus, you can iterate the setup steps without ever affecting your local system.
Your colleague wants to quickly run or test your code, but their OS is very different from yours? You’re wondering how difficult it will be to set up your code on their machine? With Devcontainers the only overhead you have is that you need to wait for the Docker container to build, and it will just work out of the box!
And how does your daily workflow change? It doesn’t, that’s the great part! The only thing that changes is the status bar badge in the lower left that will let you know that you’re inside a Devcontainer!
Okay, I lied ;). There are a few (minor?) things to consider:
- Mount all data (outside your project directory) you want to access into the Devcontainer. But don’t worry, that’s an easy one-liner (see Configuration) and you can organize it however you like!
- Files you create from within the Devcontainer belong to root. This can be fixed by not being root inside the container or with the good old `sudo chown -R USERNAME:USERNAME /PATH/TO/CODE`.
- The container keeps running in the background, similar to `tmux`. However, you should clean up old open Devcontainers from time to time.
- There can be issues with `ssh`, since the container has its own signature. Check this issue for fixes.

You need to install (also see the docs):

- Docker
- VS Code
- the extension `ms-vscode-remote.remote-containers`

Okay, that’s the installation - now how to configure it?
- Create a folder `.devcontainer` in the project root
- Add a `devcontainer.json` (see below) and a `Dockerfile`
├── .devcontainer
│   ├── devcontainer.json
│   └── Dockerfile
The devcontainer.json should look something like this (for details see the docs):
{
"build": {
"dockerfile": "Dockerfile", // location of Dockerfile relative to .devcontainer folder
"args": {
// OPTIONAL: only when using GPUs
"runtime": "nvidia"
}
},
"runArgs": [
// OPTIONAL: only when using GPUs
"--runtime=nvidia"
],
"customizations": {
"vscode": {
"extensions": [ // OPTIONAL: specify which extensions should be installed inside the container
"ms-python.python",
"ms-python.vscode-pylance"
]
}
},
"mounts": [ // OPTIONAL: if you want to mount a dataset or something similar
"source=LOCAL/PATH/TO/DATA,target=/DOCKER/PATH/TO/DATA,type=bind,consistency=cached"
],
    // OPTIONAL: run commands after creation, e.g. automatic download of weights or datasets
    "postCreateCommand": "sh .devcontainer/postcreate.sh" // relative to project root
}
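The `postCreateCommand` above points at a plain shell script that runs once after the container is created; what it does is up to you. A hypothetical sketch of `.devcontainer/postcreate.sh` (the directories and steps below are illustrative, not from the original setup):

```shell
#!/bin/sh
# .devcontainer/postcreate.sh -- runs once after the container is created.
set -e

# Pre-create a directory for downloaded datasets or weights.
mkdir -p data/weights

# Install project dependencies, if the project ships a requirements.txt.
if [ -f requirements.txt ]; then
    pip install -r requirements.txt
fi

# Leave a marker so you can tell the setup ran when debugging rebuilds.
echo "postcreate finished" > data/.setup-complete
```

Since the script runs inside the container, it can install anything without touching your host system.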
Simply press F1 (or CTRL + SHIFT + P) and select Dev Containers: Rebuild and Reopen Container.
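For completeness, the Dockerfile referenced in the `build` section can stay minimal. A hypothetical sketch for a Python project (the base image, packages, and user name are assumptions, not prescriptions):

```dockerfile
# Hypothetical base image; pick whatever your project needs,
# e.g. an nvidia/cuda image when using GPUs.
FROM python:3.11-slim

# System packages you were afraid to install locally go here,
# safely isolated from your host system.
RUN apt-get update && apt-get install -y --no-install-recommends \
        git \
    && rm -rf /var/lib/apt/lists/*

# Optional: use a non-root user to avoid the root-owned-files issue mentioned above.
RUN useradd -m dev
USER dev
WORKDIR /workspace
```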
Due to the steadily rising amount of valuable goods in supply chains, tampering detection for parcels is becoming increasingly important. In this work, we focus on the last-mile delivery use-case, where only a single RGB image is taken and compared against a reference from an existing database to detect potential appearance changes that indicate tampering. We propose a tampering detection pipeline that utilizes keypoint detection to identify the eight corner points of a parcel. This permits applying a perspective transformation to create normalized fronto-parallel views for each visible parcel side surface. These viewpoint-invariant parcel side surface representations facilitate the identification of signs of tampering on parcels within the supply chain, since they reduce the problem to parcel side surface matching with pair-wise appearance change detection. Experiments with multiple classical and deep learning-based change detection approaches are performed on our newly collected TAMpering detection dataset for PARcels, called TAMPAR. We evaluate keypoint and change detection separately, as well as in a unified system for tampering detection. Our evaluation shows promising results for keypoint (Keypoint AP 75.76) and tampering detection (81% accuracy, F1-Score 0.83) on real images. Furthermore, a sensitivity analysis for tampering types, lens distortion and viewing angles is presented. Code and dataset are available at https://a-nau.github.io/tampar.
To facilitate the transition into the digital era, it is necessary to digitize printed documents such as forms and invoices. Due to the presence of diverse lighting conditions and geometric distortions in real-world photographs of documents, document image restoration typically consists of two stages: first, geometric unwarping to remove the displacement distortions and, second, illumination correction to reinstate the original colors. In this work, we tackle the problem of illumination correction for document images and, thereby, enhance downstream tasks, such as text extraction and document archival. Despite the recent state-of-the-art improvements in geometric unwarping, the reliability of those models is limited. Hence, we aim to reduce lighting impurity under the assumption of imperfectly unwarped documents. To reduce the complexity of the task, we incorporate a-priori known visual cues in the form of template images, which offer additional information about the perfect lighting conditions. In this work, we present a novel approach for integrating prior visual cues in the form of document templates. Our extensive evaluation shows a 15.0% relative improvement in LPIPS and 6.3% in CER over the state-of-the-art. We will make all code and data publicly available at https://felixhertlein.github.io/illtrtemplate.
Precisely predicting the future trajectories of surrounding traffic participants is a crucial but challenging problem in autonomous driving, due to complex interactions between traffic agents, map context and traffic rules. Vector-based approaches have recently been shown to achieve among the best performance on trajectory prediction benchmarks. These methods model simple interactions between traffic agents but do not distinguish between relation types or attributes such as their distance along the road. Furthermore, they represent lanes only by sequences of vectors representing center lines and ignore context information like lane dividers and other road elements. We present a novel approach for vector-based trajectory prediction that addresses these shortcomings by leveraging three crucial sources of information: First, we model interactions between traffic agents by a semantic scene graph that accounts for the nature and important features of their relation. Second, we extract agent-centric image-based map features to model the local map context. Finally, we generate anchor paths to constrain multi-modal predictions to permitted trajectories only. Each of these three enhancements shows advantages over the baseline model HoliGraph.
Motion prediction and planning are key components to enable autonomous driving. Although high definition (HD) maps provide important contextual information that constrains the action space of traffic participants, most approaches are not able to fully exploit this heterogeneous information. In this work, we enrich the existing road geometry of the popular nuScenes dataset and convert it into the open-source map framework Lanelet2. This allows easy access to the road topology and thus enables the usage of (1) spatial semantic information, such as agents driving on intersecting roads, and (2) map-generated anchor paths for target vehicles that can help to improve trajectory prediction performance. Further, we present DMAP, a simple, yet effective approach for diverse map-based anchor path generation and filtering. We show that combining DMAP with ground truth velocity profile information yields high-quality motion prediction results on nuScenes (MinADE5=1.09, MissRate5,2=0.18, Offroad rate=0.00). While a comparison against the state-of-the-art is obviously unfair, this shows that our HD map accurately depicts the road geometry and topology. Future approaches can leverage this by focusing on data-driven sampling of map-based anchor paths and estimating velocity profiles. Moreover, our HD map can be used for map construction tasks and supplement perception. Code and data are made publicly available at https://felixhertlein.github.io/lanelet4nuscenes.
We focus on enabling damage and tampering detection in logistics and tackle the problem of 3D shape reconstruction of potentially damaged parcels. As input we utilize single RGB images, which corresponds to use-cases where only simple handheld devices are available, e.g. for postmen during delivery or for clients at delivery. We present a novel synthetic dataset, named Parcel3D, that is based on the Google Scanned Objects (GSO) dataset and consists of more than 13,000 images of parcels with full 3D annotations. The dataset contains intact, i.e. cuboid-shaped, parcels and damaged parcels, which were generated in simulations. We work towards detecting mishandling of parcels by presenting a novel architecture called CubeRefine R-CNN, which combines estimating a 3D bounding box with an iterative mesh refinement. We benchmark our approach on Parcel3D and an existing dataset of cuboid-shaped parcels in real-world scenarios. Our results show that while training on Parcel3D enables transfer to the real world, enabling reliable deployment in real-world scenarios is still challenging. CubeRefine R-CNN yields competitive performance in terms of Mesh AP and is the only model that directly enables deformation assessment by 3D mesh comparison and tampering detection by comparing viewpoint-invariant parcel side surface representations. Dataset and code are available at https://a-nau.github.io/parcel3d.