COINSE

Clotho is accepted to FSE 2026

Mon, 22 Dec 2025 12:00:00 +0000

A nice Christmas present came in the form of an author notification from FSE 2026: a paper titled “Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs”, written by Juyeon, Somin, Robert, and Shin, has been accepted. Congratulations!

We think this is an important breakthrough, at the risk of saying so about our own work. Existing coverage criteria defined for DNNs have been evaluated on LLMs to detect out-of-distribution inputs such as jailbreak attempts, but distribution-based adequacy metrics such as Surprise Adequacy (SA) remained inapplicable to LLMs, simply because it is practically infeasible to measure the similarity between the neuron activations triggered by a new input and those triggered by the training data. This, in turn, is due to two reasons. First, training corpora for pre-trained LLMs are simply too big. Second, even if we could handle a large training corpus, the purpose of pre-training is to reduce auto-regressive loss, not to solve the specific task we are writing the prompt for, so it is not even clear what to measure.

Clotho mitigates this problem by combining a reference set, active learning, and a Gaussian Mixture Model (GMM). Using a reference set means that we measure the similarity between the activations triggered by a new input and those triggered by a set of representative inputs for the task. The GMM models the distribution of these representative inputs. Finally, active learning allows us to choose the reference set efficiently. Combined, these let us compute SA as a test adequacy measure for a specific LLM task.
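To make the idea of a GMM-based surprise score more concrete, here is a minimal sketch. It is not Clotho's implementation: it assumes the mixture has already been fitted on reference activations, it uses diagonal covariances, and it takes the negative log-likelihood as the score. The parameter values and the two-dimensional "activations" are made up for illustration.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def surprise(x, weights, means, variances):
    """Negative log-likelihood of x under the mixture; higher means more surprising."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # log-sum-exp trick for numerical stability
    return -(top + math.log(sum(math.exp(value - top) for value in logs)))

# Hypothetical two-component mixture over 2-D activation summaries (assumed values).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]

familiar = surprise([0.1, -0.2], weights, means, variances)  # close to a component mean
unusual = surprise([10.0, -6.0], weights, means, variances)  # far from both components
```

An input whose activations fall far from every mixture component receives a higher score than one that sits near the reference distribution, which is the property a prioritisation scheme would rely on.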

A surprising finding for us was that the adequacy does generalise. When we computed the adequacy scores using a local SLM and prioritised the inputs for GPT-4o mini, Gemini Flash 2.5 Lite, and Claude Haiku, the prioritisation was still very much meaningful. Compared to 100 randomly selected samples, Clotho can select 126.8% more failing inputs.
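As a rough illustration of what such a prioritisation and comparison could look like, the sketch below ranks inputs by score, keeps a fixed budget, and compares the number of failing inputs found against the expectation for random selection. The data is invented, and reading "X% more" as a relative increase over the random baseline is our assumption here, not a description of the paper's exact evaluation protocol.

```python
def prioritise(inputs, scores, budget):
    """Return the `budget` inputs with the highest adequacy scores."""
    ranked = sorted(zip(scores, inputs), key=lambda pair: pair[0], reverse=True)
    return [item for _, item in ranked[:budget]]

def relative_gain(failures_selected, failures_random):
    """Percentage of additional failing inputs over the random baseline."""
    return 100.0 * (failures_selected - failures_random) / failures_random

# Made-up data: input ids, their scores, and which inputs fail on the target LLM.
inputs = ["a", "b", "c", "d", "e", "f"]
scores = [0.2, 3.1, 0.4, 2.7, 0.1, 1.9]
failing = {"b", "d", "e"}

budget = 2
selected = prioritise(inputs, scores, budget)
found = sum(1 for item in selected if item in failing)

# Expected failing inputs when picking `budget` inputs uniformly at random.
expected_random = budget * len(failing) / len(inputs)
gain = relative_gain(found, expected_random)
```

In this toy setting the two highest-scoring inputs both fail, while random selection would find one failing input on average, so the relative gain is 100%.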

This is an exciting start, as it opens so many doors to interesting analyses of the internal behaviour of LLMs. More to come!

AutoCrashFL is accepted to ICSE SEIP 2026

Fri, 05 Dec 2025 12:00:00 +0000

We are happy to report that our latest collaboration with SAP has been accepted by the SEIP track at ICSE 2026. The paper introduces AutoCrashFL, an LLM-based agent that can localise the root cause of crashes. AutoCrashFL is an adaptation of AutoFL, our original FL agent that works on test failures. The major difference is that AutoCrashFL only requires the core dump generated from the crash. Our empirical evaluation studied 454 real crashes of SAP HANA, an enterprise-level database with 35 MLoC. AutoCrashFL ranks the responsible file at the top for 30% of the studied crashes. The results show that an agentic approach can scale up to industry projects.

AutoSD accepted at EMSE

Tue, 12 Nov 2024 22:11:00 +0000

AutoSD, a project led by Dr. Sungmin Kang, has been accepted into the Journal of Empirical Software Engineering. The preprint is available from here.

Scientific Debugging is a guideline for systematic debugging, originally proposed by Andreas Zeller for human developers. It suggests that developers should follow the process of scientific discovery:

Hypothesize why the failure occurs
Predict the program behaviour based on the hypothesis
Experiment to confirm the prediction
Observe the result
Conclude what the root cause is if the observation confirms the hypothesis; otherwise go back and make another hypothesis
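The five steps above can be sketched as a simple loop. This is a toy illustration only: the hypotheses are hard-coded and the "experiments" are plain function calls on a deliberately buggy function, all of which are hypothetical stand-ins rather than anything taken from the paper.

```python
def buggy_abs(x):
    # Deliberate bug for illustration: negative inputs are returned unchanged.
    return x if x >= 0 else x

def scientific_debugging(hypotheses):
    """Try hypotheses in order until an observation confirms a prediction."""
    for description, experiment, prediction_holds in hypotheses:
        # Steps 1 and 2: the hypothesis and its prediction come with each entry.
        observed = experiment()          # Steps 3 and 4: run the experiment and observe.
        if prediction_holds(observed):   # Step 5: conclude if the prediction is confirmed.
            return description
    return None  # No hypothesis was confirmed; new hypotheses would be needed.

hypotheses = [
    (
        "zero is handled incorrectly",
        lambda: buggy_abs(0),
        lambda out: out != 0,
    ),
    (
        "negative inputs are returned without negation",
        lambda: buggy_abs(-3),
        lambda out: out == -3,
    ),
]

root_cause = scientific_debugging(hypotheses)
```

The first hypothesis is rejected because `buggy_abs(0)` behaves correctly, so the loop moves on to the second one, whose prediction is confirmed by the observation.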

AutoSD is an autonomous LLM agent that performs scientific debugging, using a zero-shot prompt that explains the process of scientific debugging as well as how to use debuggers to run experiments and make observations. We believe this work is significant in two aspects:

Here, an LLM agent is executing a guideline that was originally written for humans. In turn, we expect that the outcome (i.e., the generated patch as well as the debugging process itself) is inherently explainable and better aligned with human understanding, especially compared to existing techniques that directly produce patches.
AutoSD shows how the autocompletion driven by LLMs can be hybridised with more symbolic analysis (here, dynamic analysis using the debugger). We believe that such executability will play a major role in improving the robustness of LLM-generated solutions.

This work was born out of Dr. Kang’s internship at Microsoft Research Asia. Congratulations, everyone!

ICSME 2024 Best Industry Paper Award

Thu, 10 Oct 2024 13:48:00 +0000

The paper titled “Just-in-Time Flaky Test Detection via Abstracted Failure Symptom Matching”, a collaboration between COINSE, SAP (Germany), and SAP Labs (South Korea), won Best Industry Paper Award at ICSME 2024! Congratulations, everyone! We really appreciate the recognition of the simplicity :)

Flaky Symptoms at ICSME 2024

Tue, 03 Sep 2024 13:48:00 +0000

Our paper titled “Just-in-Time Flaky Test Detection via Abstracted Failure Symptom Matching” has been accepted into the industry track of ICSME 2024. This is a collaboration between COINSE, SAP (Germany), and SAP Labs (South Korea). Congratulations!

The paper proposes a very simple idea: when a test fails due to its flakiness, the symptoms (i.e., error messages, logs, etc.) may be different from those observed in non-flaky failures of the same test. Based on this intuition, we first build a database of known flaky symptoms, using the existing human labels at SAP. Subsequently, we decide whether new, incoming test failures are flaky or not by matching the textual symptoms. To make the matching more robust, we process the symptoms via “abstraction” (e.g., masking concrete IP addresses).
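A minimal sketch of this kind of abstraction and matching might look as follows. The masking rules, the placeholder tokens, the example messages, and the use of exact matching after abstraction are all assumptions made for illustration; the actual rule set and matching procedure used at SAP are not described in this post.

```python
import re

# Illustrative abstraction rules (assumed); order matters, most specific first.
ABSTRACTIONS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]

def abstract(symptom):
    """Replace concrete values in a failure symptom with placeholders."""
    for pattern, placeholder in ABSTRACTIONS:
        symptom = pattern.sub(placeholder, symptom)
    return symptom

# Database of abstracted symptoms from failures already labelled as flaky.
known_flaky = {abstract("Connection to 10.0.0.12 timed out after 30 s")}

def is_flaky(symptom):
    """Flag a new failure as flaky if its abstracted symptom is already known."""
    return abstract(symptom) in known_flaky

new_timeout = is_flaky("Connection to 192.168.1.7 timed out after 45 s")
new_assertion = is_flaky("AssertionError: expected 3 rows, got 4")
```

The two timeout messages differ in IP address and duration but collapse to the same abstracted symptom, so the new failure matches the known flaky one, while the assertion failure does not.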

The technique achieves about 96% precision when evaluated against real-world historical test data from SAP, while saving 58% of the machine time that would have been used for reruns. Once again, the results show that simplicity is a huge benefit. SAP is working to integrate this into their CI/CD pipeline.

Dr. Kang's Birthday Present

Mon, 10 Jun 2024 17:00:00 +0000

Sungmin Kang has successfully defended his thesis, titled “Improving the Reliability of Large Language Model-based Software Artifacts via Execution”. Notably, he did it on his birthday :) Congratulations!!

Dr. Kang joined COINSE in 2019, initially as a master’s student, then converting to an integrated programme (MSc + PhD). Sungmin’s very first research attempt was Monte-Carlo Tree Search, which led to Automated Program Repair (APR). Recently he is best known for his work with Large Language Models, including bug reproduction (ICSE 2023), explainable APR (preprint), and fault localization (FSE 2024). Sungmin is also our resident Bayesian expert (ISSTA 2023). Congratulations!!

Is it you, Dr. Gabin An?

Tue, 04 Jun 2024 17:00:00 +0000

Gabin An has successfully defended her thesis, titled “Synergizing Fault Localization and Continuous Integration to Streamline Bug Resolution in Large-Scale Software Systems”. Congratulations!!

Dr. An joined COINSE in 2017, initially as an undergraduate research intern, progressing to master’s and eventually the doctoral programme. Gabin initially worked on Genetic Improvement, releasing PyGGI, which is still widely used. During her PhD, Gabin focused on how fault localization can be practically useful, releasing a series of quality publications at ISSTA 2022, ICSE SEIP 2022, and ICSE 2023. During this, Gabin collaborated extensively with SAP Labs Korea. Congratulations!!

ICST 2014 MBFL Paper about MUSE received Most Influential Paper Award at ICST 2024

Thu, 30 May 2024 18:00:00 +0000

The paper, published back in 2014 at the 7th IEEE International Conference on Software Testing, Verification & Validation, has been awarded the IEEE TCSE Most Influential Paper Award at ICST 2024. This was a collaboration between the SWTV group at KAIST and Shin Yoo, who was still at UCL back in 2014. The paper is titled “Ask the Mutant: Mutating Faulty Programs for Fault Localization”.

MUSE, the technique introduced in the paper, exploits the fact that mutating non-faulty areas of a program that is already faulty will break further tests, whereas mutating exactly the faulty location of the program may result in partial fixes.
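The intuition can be sketched as a per-statement score: reward mutants that turn failing tests into passing ones, and penalise mutants that break previously passing tests. The sketch below is a simplified illustration with invented test results; it treats the balancing weight `alpha` as a plain parameter and does not reproduce the exact weighting used in the paper.

```python
def suspiciousness(mutant_results, failing, passing, alpha=1.0):
    """Average, over the mutants of one statement, of
    (share of failing tests fixed) - alpha * (share of passing tests broken)."""
    total = 0.0
    for now_passing, now_failing in mutant_results:
        fixed = len(failing & now_passing) / len(failing)
        broken = len(passing & now_failing) / len(passing)
        total += fixed - alpha * broken
    return total / len(mutant_results)

# Test outcomes on the original (faulty) program; made-up data.
failing = {"t1"}
passing = {"t2", "t3"}

# Each mutant is described as (tests passing on the mutant, tests failing on the mutant).
mutants_of_faulty_line = [
    ({"t1", "t2", "t3"}, set()),   # a partial fix: the failing test now passes
    ({"t2", "t3"}, {"t1"}),        # no change in test outcomes
]
mutants_of_correct_line = [
    ({"t2"}, {"t1", "t3"}),        # breaks one passing test
    (set(), {"t1", "t2", "t3"}),   # breaks both passing tests
]

faulty_score = suspiciousness(mutants_of_faulty_line, failing, passing)
correct_score = suspiciousness(mutants_of_correct_line, failing, passing)
```

With this data the faulty line scores 0.5 and the correct line scores -0.75, so ranking statements by score places the faulty line first.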

This work led to much further fault localization research, in particular efforts to make mutation-based fault localization ([ISSRE 2021](https://coinse.github.io/publications/pdfs/Kim2021xv.pdf), IST 2023), as well as mutation analysis itself (TOSEM 2022), more efficient.

Our paper "A Quantitative and Qualitative Evaluation of LLM-based Explainable Fault Localization" has been accepted to FSE 2024

Wed, 24 Apr 2024 18:00:00 +0000

We are honored to have a paper accepted to the 32nd International Conference on the Foundations of Software Engineering, titled “A Quantitative and Qualitative Evaluation of LLM-based Explainable Fault Localization”. As many are aware, large language models (LLMs) are showing strong performance in many different domains, but had not yet been applied to fault localization, as providing an entire repository to an LLM is generally infeasible. In this work, we tackle this challenge by providing LLMs with tools so that they can autonomously navigate the repository and find relevant code, within the framework of our tool AutoFL.

Overall, we find that AutoFL could substantially outperform existing fault localization techniques while using only failing tests, unlike other approaches that require more resources.

If you are interested, you can find many more details in our preprint!

Our paper "Intent-Driven Mobile GUI Testing with Autonomous Large Language Model Agents" has been accepted to ICST 2024

Fri, 29 Dec 2023 14:00:00 +0000

We are pleased to introduce our paper “Intent-Driven Mobile GUI Testing with Autonomous Large Language Model Agents”, which will be presented at the International Conference on Software Testing, Verification and Validation (ICST). We propose a novel approach, named DroidAgent, which enables high-level mobile GUI testing driven by artificial user intents.

Unlike conventional GUI testing techniques that focus on generating low-level GUI events, the Planner component of DroidAgent successively generates high-level tasks to be performed on the target app, and the Actor component then generates and executes the GUI events needed to accomplish them. DroidAgent harnesses the power of large language models by using them in multiple of its components. We evaluated DroidAgent on 15 real-world Android apps, and the results show that it can generate viable tasks, accomplish them, and achieve higher screen coverage than the baseline techniques.

If you are interested, check out our preprint.