Highlights (2025-12-30)

2025-2026 Perspective: From Making AI Work to Making It Matter

The past decade of AI was largely driven by one question: how to make large language models work at all. How to scale them, stabilize them, and push their capabilities far enough to be usable.

The turn of this year feels different — because a longer research arc has become clear. It is no longer about whether these systems work, but about what we do with them once they do. How do we integrate them into human workflows responsibly? How do we make them robust, interpretable, and trustworthy under real-world uncertainty? And how do we embed them into science and public institutions in ways that scale—and last?

This shift, from making AI work to deciding how it should work for people and science, is what connects our recent results and what will guide our work over the next 5+ years. Almost everything my group has been building converges on two tightly connected research lines: Socially Aligned Artificial Intelligence, and AI for Accelerating Scientific Discovery, especially in physics.

What connects them is a shared methodological core: robustness, efficiency, and interpretability of foundation models. These properties are not optional. They are what make AI systems trustworthy in social settings and usable as scientific instruments. I am also glad we can embed this work into shared, centralized infrastructures for open AI science at HPC scale.

Socially Aligned Artificial Intelligence

Our Socially Aligned AI research asks how LLM systems model people, social interaction, and society—and how these capabilities can be measured, interpreted, and governed responsibly. A major milestone for our group this year is the award of the ERC Starting Grant “LLMpathy”, which brings five new researchers to the group across personal psychology, LLM reasoning, and agentic simulations. What excites me most is what it enables: social intelligence becomes a measurable research object. With LLMpathy, we can simulate, stress-test, and systematically analyze social reasoning in large language models, rather than relying on anecdotal behavior or narrow benchmarks. This builds directly on our completed junior research group Dynamically Social Discourse Analysis and on recent ACL and EMNLP publications.

Equally important is the community forming around this work. The approval of our Dagstuhl Seminar on “Social Artificial Intelligence” for summer 2026 with colleagues from Harvard, JHU and CMU, and the upcoming ACM CHI workshop “Redefining Empathy” in April signal that the field is ready for deeper, more reflective conversations. For me, socially aligned AI is inseparable from AI safety and long-term resilience: the question is not if AI will shape human systems, but how we design that integration to remain human-centered and sustainable.

AI for Scientific Discovery

Our work on AI for Scientific Discovery connects directly to alignment, as both depend on transparent, uncertainty-aware, and interpretable methods. Physics, in particular, is unforgiving. Models must cope with distribution shifts and integrate into workflows where assumptions are constantly challenged. In that sense, AI for physics is one of the most demanding testbeds for methods that steer, explain, and control transformer-based predictive systems.

A concrete example was our ECML PKDD Challenge “Colliding with Adversaries”, which created a shared evaluation space for machine learning and physics researchers and made conceptual challenges such as correlation attacks and physics-aware attacks visible in a way that mattered to both communities. Looking ahead, we aim to increasingly focus on physics foundation models in HEP, astroparticle physics, and astrophysics, and on combinations of LLM-based scientific agents that assist with analysis workflows and tool-augmented reasoning. My group will grow by four physics researchers this year to explore these opportunities.
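To make concrete what adversarial robustness means here, below is a minimal sketch of the standard fast gradient sign method (FGSM) applied to a toy event classifier. This is a generic baseline attack rather than the challenge's correlation or physics-aware attacks, and the model, feature dimensions, and data are illustrative assumptions.

```python
# Minimal FGSM sketch on a toy "event classifier" (illustrative only; the
# Colliding-with-Adversaries challenge uses more specialized attacks).
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical classifier over 20 low-level event features.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()

def fgsm_perturb(model, x, y, epsilon=0.05):
    """Return an adversarially perturbed copy of x (one FGSM step)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    # Step in the direction that maximally increases the loss.
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

# Synthetic batch standing in for preprocessed collision events.
x = torch.randn(8, 20)
y = torch.randint(0, 2, (8,))
x_adv = fgsm_perturb(model, x, y)

clean_acc = (model(x).argmax(dim=1) == y).float().mean()
adv_acc = (model(x_adv).argmax(dim=1) == y).float().mean()
print(f"accuracy on clean events: {clean_acc:.2f}, after FGSM: {adv_acc:.2f}")
```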

This line of work is anchored in the newly acquired Dynaverse Excellence Cluster, where I serve as PI for AI for Astrophysics, and reinforced by my BMFTR ErUM-Data projects — AISafety, AALearning, and Physics-LLM — linking our group with partners at Bonn, RWTH Aachen, TU Dortmund, DESY, Forschungszentrum Jülich, TUM, and the Leibniz Institute for Astrophysics Potsdam. AISafety and AALearning reframe adversarial learning as a scientific tool for physics-informed simulations, while Physics-LLM explores how large language models can support knowledge organization and reproducibility in large physics collaborations. Complementary DFG-funded work on geometric representation learning, TRA Synergy Bubble AI for Astrophysics, and our new connection to the ELLIS network add methodological depth and international structure.

Open-Source Foundation Models

As chair of Lamarr NLP, I’m happy about a third strand that matured significantly: our work on open-source foundation models.

For me, this is about making values like transparency, reproducibility, and controllability operational. In close collaboration with Fraunhofer IAIS, TU Dortmund, and Hessian.AI, research insights and infrastructure came together. The JQL pipeline paper at EMNLP and the TACL paper on multilingual pruning address data quality, controllability and efficiency—prerequisites for serious open foundation models. Community matters here too. The Polyglot LLM Workshop we co-organize, and the fact that Nicolas Kluge will be joining our group after his award-winning work on Portuguese LLMs, strengthen our commitment to multilingual, open, and culturally grounded models. Looking forward, I see strong potential in combining state-space models with transformer-based foundation models into hybrid architectures, especially for long-horizon reasoning.

Institutionally, this work is embedded in the Lamarr Institute and, via Fraunhofer IAIS, connected to new, broader open-source foundation model (OSFM) initiatives and the Jupiter AI Factory. Shared compute, tooling, and deployment pathways make it realistic to maintain and evolve models beyond individual projects.

People, Transitions, and Research Community

In numbers, this year my group produced 30 paper preprints, hosted 20 international guest speakers and 3 DAAD visiting researchers, secured 6 new research grants, and saw one more of our PhD alumni take up a professorship. But this year also marks a significant transition for the team. With the successful completion of five BMBF projects, several colleagues will be moving on at the end of the year—something that always feels bittersweet, but also reflects the training role of the group. At the same time, we are entering a growth phase. Over the coming months, the group will welcome four new researchers in AI for Science and four new researchers funded through the ERC, complemented by Florian’s junior research group on AI safety, which will work closely with Nicolas.

Alongside research, the past year also emphasized community engagement and outreach. We helped shape international discourse by organizing the INLG conference in Hanoi, strengthened transatlantic exchange through visits to Canadian AI institutes, and actively participated in AI policy and public debates across Germany, from Düsseldorf and Bonn to Berlin. In addition, it was fascinating to meet young AI talents at the Bundes-KI Wettbewerb and the EGOI (European Girls’ Olympiad in Informatics).

Looking Forward

I am convinced that the next phase of AI will not be defined by scale alone, but by whether we can align powerful systems with human judgment, scientific rigor, and institutional responsibility.

With socially aligned AI, physics foundation models, scientific agents, hybrid architectures, and open infrastructures coming together, we are now in a position to ask not just what AI can do, but what it should be trusted to do—and under which conditions. That is the question we are choosing to work on now, while the technology works well enough for the answer to matter.

To our dedicated researchers, collaborators, and supporters: thank you for your unwavering commitment to pushing the boundaries of human knowledge. Here’s to another year of curiosity, innovation, and transformative research!

Highlights (2024-12-31)

Celebrating Our Achievements in 2024: A Year of Innovation, Collaboration, and Growth

As 2024 comes to an end, we’re taking a moment to reflect on the remarkable journey our research team has undertaken over the past twelve months. The excitement, curiosity, and commitment to pushing boundaries have defined our work this year. Here’s a look at some of the highlights that made 2024 extraordinary.

With language generation models more powerful than ever, we began exploring numerous cross-disciplinary areas:

LLMs and Human Behavior:

With LLMs producing seemingly natural conversations, we wonder how exactly their skills differ from those of their human counterparts, and where the two can complement and learn from each other. How do LLMs impact collaboration? How well can they reason about their choices? And what does it mean for LLMs to exhibit social skills or to master the creative art of storytelling? Our multi-institutional paper on collective intelligence was published in Nature Human Behaviour and was covered by numerous media outlets such as Forbes. Our ACL Workshop on Human-Centered LLMs had over 200 in-person attendees in Bangkok. At ACL, we showed how LLMs can improve at taking the perspective of others by learning to generate responses to conflict situations. In our award-winning poster at the WiNLP workshop at EMNLP in Miami, we showed that, to date, LLMs are far from robust in reasoning about the behavior of others. At the NeurIPS Workshop on System 2 Reasoning, we demonstrated how telling better stories in explanations can improve LLM accuracy in scientific question answering. Narratives also play a key role in our Bonn-Melbourne Research Excellence Grant on understanding indoctrination mechanisms at scale.

LLMs for Mental Health:

What is the potential of LLMs for mental health screening? And for treatment? With our psychology colleagues from Stony Brook University, Stanford and Lund, we found that GPT-4’s assessment and explanation of depression severity had high overall convergent validity (r = 0.81 with experts) and internal model consistency that largely aligned with literature and item-level self-assessment via well-established questionnaires. Such pre-screening may allow clinicians to invest more time into carefully tailored treatment plans. In the InVirtuo collaboration, we explore the development of empathetic, socially competent personalized LLMs to empower virtual reality avatars in behavioral therapy.
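For context, convergent validity here is essentially the correlation between the model's severity estimates and expert or questionnaire-based scores. A minimal sketch of that computation on synthetic numbers (not the study's data) might look like this:

```python
# Illustrative only: convergent validity as the Pearson correlation between
# model-assigned and expert-assigned depression severity scores.
# The numbers below are synthetic, not data from the study.
from scipy.stats import pearsonr

model_scores  = [4, 9, 15, 7, 21, 12, 3, 18, 10, 14]   # e.g. GPT-4 severity estimates
expert_scores = [5, 8, 17, 6, 22, 11, 2, 19, 12, 13]   # e.g. clinician / questionnaire scores

r, p_value = pearsonr(model_scores, expert_scores)
print(f"convergent validity: r = {r:.2f} (p = {p_value:.3g})")
```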

LLMs and Physics:

At their core, LLMs are powerful pretrained models for finding complex patterns in long sequences such as time series. What if we apply them to particle physics and astrophysics? Whether searching for high-frequency gravitational waves (TRA Matter / GravNet), analyzing AGN light curves (Lamarr Physics), investigating the adversarial robustness of particle physics models (ErUM-Data AI Safety), or searching for new galaxy clusters with AstroLlama and multimodal representation learning (DynaVerse), we are excited to see what our new projects may reveal!
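As a rough illustration of what "patterns in long sequences" can look like in this setting, the sketch below slices a synthetic 1D signal (think of a light curve or detector stream) into fixed-length patches and feeds them to a small transformer encoder. This is a generic pattern, not the architecture of any of the projects named above, and all dimensions are made up.

```python
# Illustrative sketch: treating a 1D time series (e.g. a light curve) as a
# sequence of patches for a transformer encoder. Generic pattern only.
import torch
import torch.nn as nn

patch_len, d_model = 32, 64
series = torch.randn(1, 4096)                       # one synthetic time series

patches = series.unfold(1, patch_len, patch_len)    # (1, 128, 32): non-overlapping patches
embed = nn.Linear(patch_len, d_model)
tokens = embed(patches)                             # embed each patch as a "token"

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)
hidden = encoder(tokens)                            # (1, 128, 64) contextualized patches
head = nn.Linear(d_model, 1)
score = head(hidden.mean(dim=1))                    # e.g. an anomaly / signal score
print(score.shape)                                  # torch.Size([1, 1])
```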

LLMs and Biomedicine:

Can LLMs accelerate research by harmonizing scientific literature? With our new seed funding from TRA Modelling, we use LLMs to detect patterns in gene regulation dysfunctions associated with cancer, reconstructing gene regulatory networks from published experimental findings.
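Once relations have been extracted from the literature, the reconstruction step itself is conceptually simple: each reported finding becomes a directed, signed edge from a regulator to its target gene. A minimal sketch with hand-written triples standing in for LLM-extracted ones (the genes and relations below are illustrative placeholders, not project results):

```python
# Illustrative sketch: assembling a gene regulatory network from
# (regulator, effect, target) triples. The triples are hand-written
# placeholders standing in for relations extracted from papers by an LLM.
import networkx as nx

extracted_triples = [
    ("TP53", "activates", "CDKN1A"),
    ("MYC",  "represses", "CDKN1A"),
    ("MYC",  "activates", "CCND1"),
]

grn = nx.DiGraph()
for regulator, effect, target in extracted_triples:
    grn.add_edge(regulator, target, effect=effect)

# Example query: which genes are reported to regulate CDKN1A, and how?
for regulator in grn.predecessors("CDKN1A"):
    print(regulator, grn[regulator]["CDKN1A"]["effect"], "CDKN1A")
```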

Within Lamarr NLP, we have established a strong collaboration on LLM research with the Fraunhofer IAIS teams on Foundation Models and Trustworthy AI (joint paper on multilingual alignment), as well as with TU Dortmund on Sustainable AI (joint paper on LLM pruning). With the newly released Teuken-7B model, we are excited about more joint open-source multilingual LLM research to come.

With 2024 being the USA-NRW year, we celebrated our partnerships by organizing a Lamarr AI Event at the German Consulate General in NYC, marking the start of numerous PhD visits. Apart from our traditionally strong collaborations with Stony Brook, NYU, CMU, UPenn and UMich, we are happy about our reinforced ties to Stanford University, where Prof. Flek recently presented our perspectivism work. In 2024 we organized a joint LLM workshop (Diyi Yang), authored a joint mental health paper (Johannes Eichstadt, Betsy Stade), founded the IMPULSE House for Intellectual Innovation and Creativity in Bonn (Sepp Gumbrecht), and Oussama Khatib from the b-it advisory board welcomed our NRW delegation to his Stanford robotics lab.

In numbers, our team hired 7 new FTEs (including a new junior research group on LLM agents), published 17 peer-reviewed papers, organized 4 LLM workshops, gave many talks across three continents, acquired 4 new grants, joined 10 PhD committees in 4 countries, and hosted 4 visiting research fellows and 10 invited speakers from all over the world (UPenn, EPFL, RMIT, Monash, UMich, NYU, SBU, Sheffield, Edinburgh, UPV). One team member accepted a professorship offer and two completed their PhDs, all continuing their work in the field (Canada, US, Germany). With our junior researchers, too, we are off to a great start. We supervised over 20 master's theses, many of which were published in international workshops; two students received the Best Poster Award in Miami, and one work was awarded the BDD Research Award. Last but not least, the first two caisarians are now mothers of beautiful babies.

While 2024 was a great year, it’s only the beginning of our journey. As we enter 2025, we carry forward the momentum generated by our research projects and collaborations.

To our dedicated researchers, collaborators, and supporters: thank you for your unwavering commitment to pushing the boundaries of human knowledge. Here’s to another year of curiosity, innovation, and transformative research!

Florian Mai joins CAISA group (2024-10-01)

CAISA group today welcomes our new member Florian Mai working as a postdoc researcher. More about Florian.

Allison Lahnala successfully defends her PhD dissertation (2024-09-06)

Today, our group member Allison Lahnala successfully defended her PhD dissertation.

Congratulations!

CAISA group welcomes Tianyi Zhang (2024-09-06)

Tianyi Zhang from the University of Pennsylvania is visiting the CAISA group for the next three months. A brief introduction of Tianyi: “I am passionate about building intelligent agents that emulate human understanding and reasoning of world events. In contrast to human learning, which assimilates and accommodates information into brain schemas, a significant challenge with current Language Models (LMs), including the SOTA GPT-4, is their inability to automatically acquire and anchor structured knowledge in the network. This deficiency leads to unreliable reasoning and hallucinations. To alleviate it, my research directs LMs to construct and reason with structured and symbolic representations. These efforts include event extraction, schema induction, entity-state tracking, natural to symbolic language translation and reasoning.” More about Tianyi.

Lamarr NLP Researchers Train Multilingual Large Language Models Mitigating Stereotype Bias (2024-08-15)

Bias in large language models is a well-known and unsolved problem. In our new paper “Do Multilingual Large Language Models Mitigate Stereotype Bias?” we address this challenge by investigating the influence of multilingual training data on model bias reduction.

In the Lamarr NLP research collaboration between Fraunhofer IAIS and the Language Technologies group at the University of Bonn, we have trained six large language models on public data (one for each of Spanish, German, French, Italian and English, and a combined multilingual one), and compared these LLMs to their state-of-the-art counterparts on multilingual bias benchmarks. Our results show that all multilingual models trained on the same number of tokens as the monolingual models are less biased across all languages and benchmarks. In addition, our models are generally less biased than selected open-source LLMs of similar size.

We are very happy to announce that our work, conducted by Shangrui Nie, Michael Fromm, Charles Welch, Rebekka Görge, Akbar Karimi, Joan Plepi, Nazia Afsan Mowmita, Nicolas Flores-Herr, Mehdi Ali and Lucie Flek, has been accepted at the C3NLP workshop (Cross-Cultural Considerations in NLP), co-located with ACL 2024 in Bangkok.

The challenge of bias in large language models

Large language models allow us to quickly and easily build real-world applications in fields as diverse as healthcare, finance and law. Huge amounts of data are used to train these models, resulting in impressive model performance. However, numerous studies have shown that large language models learn biases during training that can lead to discrimination against certain groups of people in downstream applications. Often these biases arise from the training data itself, and they vary between different languages and models.

Multilinguality as a solution approach

To avoid discriminatory harm, research aims to mitigate bias in large language models. Among various approaches, previous research indicates that using multilingual training data to train large language models reduces model bias. The models can benefit from the use of different languages, which differ in semantics and syntax and cover a wider cultural diversity. Our work builds on these findings by specifically investigating the impact of monolingual versus multilingual training data on larger, decoder-based language models.

Our experiments and findings

In our experiments, we train six novel large language models: one each for Spanish, German, French, Italian and English, as well as a multilingual model trained on all five languages but using the same total number of tokens. To compare the bias of the monolingual and multilingual models, we benchmark them against two well-known bias evaluation benchmarks, CrowS-Pairs and BBQ. The former measures the degree to which a model prefers a stereotypical over a less stereotypical sentence, while the latter is a question-answering dataset that requires a model to answer stereotype-related questions about two social groups (see Figure 1). As both benchmarks are originally available only in English, we perform a human-validated automated translation.
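To illustrate how a CrowS-Pairs-style score works for decoder models, the sketch below compares summed token log-probabilities of the more and the less stereotypical sentence in each minimal pair and reports how often the model prefers the stereotype (50% would indicate no preference). This is a common adaptation for causal LMs, not the paper's exact evaluation code; the model name and example pair are placeholders.

```python
# Illustrative sketch of a CrowS-Pairs-style bias score for a causal
# (decoder-only) LM: the fraction of minimal pairs for which the model
# assigns higher likelihood to the more stereotypical sentence.
# Not the paper's evaluation code; the model name is just an example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def sentence_log_prob(sentence: str) -> float:
    """Sum of token log-probabilities of the sentence under the model."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = ids[:, 1:]
    return log_probs.gather(2, target.unsqueeze(-1)).sum().item()

pairs = [  # (more stereotypical, less stereotypical) minimal pairs
    ("The nurse said she would help.", "The nurse said he would help."),
]
stereo_preferred = sum(
    sentence_log_prob(stereo) > sentence_log_prob(anti) for stereo, anti in pairs
)
print(f"stereotype preference: {stereo_preferred / len(pairs):.0%}")
```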

Our results support the initial hypothesis that multilingual training reduces model bias. In our experiments, we find that all multilingual models trained on the same number of tokens as the monolingual models are less biased than the monolingual models for all languages and both benchmarks. We also find that our models are generally less biased than selected open-source large language models of similar size, although they fall short of zero-shot prompt-based approaches with GPT-3.

The translated multilingual bias datasets and the code used for LLM training are published on GitHub.

Prof. Lucie Flek contributed to this article.

Perspective Taking through Generating Responses to Conflict Situations (2024-08-15)

Despite the steadily increasing performance language models achieve on a wide variety of tasks, they continue to struggle with theory of mind, i.e. the ability to understand the mental states of others. Language assists in the development of theory of mind, as it facilitates the exploration of mental states. This ability is central to much of human interaction and could provide many benefits for language models as well: being able to foresee the reactions of others allows us to better decide which action to take next. This could help language models generate responses that are safer, in particular for healthcare applications, or more personalised, e.g. sounding more empathetic or providing targeted explanations. Such systems can generate responses that are not only relevant to users’ queries but also reflect their personal style, thereby creating a more engaging and customised interaction experience.

In fact, there is growing interest in a perspectivist approach to many natural language processing (NLP) tasks, which emphasizes that there is no single ground truth. We construct a corpus to study perspective taking through generating responses to conflict situations. An example from our corpus can be found in Figure 1. We see a user asking whether they did something wrong in a conversation with their girlfriend about whether or not to terminate a pregnancy. On the right, there are two responses from other users with different judgments of the situation (reasoning and verdict, NTA/YTA). On the left, we see self-descriptive statements of each user. Author Y appears to be more family-oriented than Author X, which may impact their judgment of the situation.

We focus on three research questions:

  • RQ1: How should we evaluate perspective taking through the lens of NLG?

We develop a novel evaluation by asking humans to rank the human response, the model output, and a distractor human response, combining approaches from persona consistency evaluation. We find that previous consistency evaluation metrics are inadequate and propose a human ranking evaluation that includes similar human responses. Additionally, we find that our generation model performs competitively with previous work on perspective classification.

  • RQ2: Do tailored, user-contextualized architectures outperform large language models (LLMs) on this task?

We design two novel transformer architectures that embed personal context and compare them to LLMs. Our twin encoder architecture outperforms recent work as well as FlanT5 and Llama2 models (see the sketch after this list).

  • RQ3: What user information is most useful to model perspective taking?

Experiments with varied user context showed that self-disclosure statements semantically similar to the conflict situation were most useful.
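To make the twin-encoder idea behind RQ2 concrete, here is a heavily simplified sketch: one encoder for the conflict situation, one for the responding user's self-disclosure statements, with the two representations fused before a small verdict head. Layer sizes, pooling, and the classification head are illustrative assumptions; the architecture reported in the paper differs and is used for response generation rather than only verdict classification.

```python
# Heavily simplified twin-encoder sketch (illustrative; not the paper's
# architecture): encode the conflict situation and the user's self-disclosure
# statements separately, fuse, and predict a verdict (NTA vs. YTA).
import torch
import torch.nn as nn

class TwinEncoder(nn.Module):
    def __init__(self, vocab_size=30522, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.situation_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.user_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.verdict_head = nn.Linear(2 * d_model, 2)  # NTA vs. YTA

    def forward(self, situation_ids, user_ids):
        s = self.situation_encoder(self.embed(situation_ids)).mean(dim=1)
        u = self.user_encoder(self.embed(user_ids)).mean(dim=1)
        return self.verdict_head(torch.cat([s, u], dim=-1))

model = TwinEncoder()
situation = torch.randint(0, 30522, (1, 64))     # tokenized conflict post
user_context = torch.randint(0, 30522, (1, 32))  # tokenized self-disclosures
print(model(situation, user_context).shape)      # torch.Size([1, 2])
```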

Joan Plepi contributed to this article.

David Kaczér joins CAISA group (2024-08-01)

CAISA group today welcomes our new member David Kaczér working as a PhD student. More about David.

Frederik Labonte joins CAISA group (2024-08-01)

CAISA group today welcomes our new member Frederik Labonte working as a researcher. More about Frederik.

Joan Plepi successfully defends his PhD dissertation (2024-07-25)

Today, our first group member Joan Plepi successfully defended his PhD dissertation.

Congratulations!

More photos here.
