
WildClaims: Conversational Information Access in the Wild(Chat)

This is the repo for our paper: WildClaims: Conversational Information Access in the Wild(Chat). The repository contains:

  • The WildClaims dataset with extracted factual claims and human annotations.
  • The data generation pipeline for preprocessing, filtering, claim extraction, and check-worthiness classification.
  • The analysis scripts used to reproduce the statistics and evaluation results reported in the paper.
  • The prompts used to generate the dataset via LLM-based claim extraction and check-worthiness classification.

What is WildClaims?

  • WildClaims is a dataset designed to study implicit information access in real-world human-system conversations. It centers on a phenomenon we observed: users' information access often occurs implicitly, through check-worthy factual claims made by the system, even when the user's task is not explicitly informational (e.g., creative writing).
  • Derived from the existing WildChat corpus, the dataset contains 121,905 factual claims extracted from 7,587 system utterances across 3,000 conversations. Each claim is annotated for check-worthiness, indicating whether it merits fact-checking.
  • This resource aims to help the community move beyond traditional explicit information access to better understand and address the implicit information access that arises in real-world user-system conversations.

Data Release

The directory annotations/ contains utterance-level results, human annotations, and full claim extractions used in our check-worthiness analysis.

This resource builds on prior work in claim extraction and check-worthiness detection. Specifically, we use Huo et al., 2023 and Song et al., 2024 for claim extraction, and Hassan et al., 2015 and Majer et al., 2024 for check-worthiness classification. See generation/README.md for more details.

  • claims.csv
    Full set of extracted factual claims (~31K with FHuo, ~91K with FSong). Each row corresponds to a claim linked to its source utterance (Selected_Agent_Utterance, Conversation_Hash, Claim_Extr_Method, Individual_Statement) with classifier outputs (Hassan, Majer).

  • human_annotations.csv
    200 human-annotated claims for inter-annotator agreement and classifier evaluation. Includes annotator labels (Human1_Annotation, Human2_Annotation, Check_Worthy), binary CW flags (Human1_CW, Human2_CW, CW_Tie), agreement flags (Human1_Human2_Agree), and automatic classifier outputs (Majer, Hassan, Intersection, Union).

  • analysis.csv
    Utterance-level results for ~3k sampled conversations. Each row corresponds to an agent utterance, with metadata (Conversation_Hash, Turn_Num, Corresponding_User_Question, Selected_Agent_Utterance, Task_Classification, Use) and multiple check-worthiness outputs (Hassan, Majer, Intersection, Union) plus fact counts (*_Fact_Num, *_Fact_Total).

Together, these files enable replication of utterance-level and claim-level statistics, as well as evaluation of human vs. automatic check-worthiness classification.

👉 For detailed schema and column descriptions, see annotations/README.md.
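As a quick start, the released files can be loaded with pandas. This is a minimal sketch, assuming the annotations/ layout above; the column names come from the schema notes in this README, and the classifier columns are assumed to hold binary TRUE/FALSE labels.

```python
# Minimal sketch: load the released CSVs and compute a few headline numbers.
import pandas as pd

claims = pd.read_csv("annotations/claims.csv")
human = pd.read_csv("annotations/human_annotations.csv")
analysis = pd.read_csv("annotations/analysis.csv")

# Claims per extraction method (FHuo vs. FSong).
print(claims["Claim_Extr_Method"].value_counts())

# Share of claims each automatic classifier flags as check-worthy
# (value_counts is robust to bool or TRUE/FALSE string columns).
for clf in ["Hassan", "Majer"]:
    print(clf)
    print(claims[clf].value_counts(normalize=True))

# Distribution of claims per system utterance.
per_utterance = claims.groupby(["Conversation_Hash", "Selected_Agent_Utterance"]).size()
print(per_utterance.describe())
```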

WildClaims Statistics

Table: General statistics of the WildClaims dataset.

| Statistic | Value |
| --- | --- |
| # Conversations | 3,000 |
| Single/multi-turn ratio | 57% : 43% |
| # Utterances | 15,174 |
| # System utterances | 7,587 |
| Avg. utterances per conversation | 2.52 |
| Avg. words per user utterance | 95.70 |
| Avg. words per system utterance | 219.24 |
| # Total extracted factual claims | 121,905 |
| # Automatic check-worthiness annotations | 243,810 |
| # Manual check-worthiness annotations | 200 |

The 243,810 automatic annotations correspond to each of the 121,905 extracted claims being labeled by both the Hassan and Majer classifiers (121,905 × 2).

Data Generation Pipeline

The generation/ directory contains scripts for preparing, labeling, and extracting claims from WildChat conversations before running check-worthiness analysis.

Workflow Summary:

  1. Preprocessing (preprocess_files_for_pipeline.py)
    • Explodes conversations into utterance-level rows (see the sketch after this list).
    • Generates context windows for each system utterance.
  2. Math & Code Filtering (labeling_math_and_code.py)
    • Labels conversations as Math, Coding, or Others to filter out non-relevant domains.
  3. Task Classification (task_classification.py)
    • Categorizes user utterances into high-level task types (information seeking, creative writing, reasoning, etc.).
  4. Claim Extraction
    • FHuo method (f_huo_method.py): extracts factual statements from system responses via the OpenAI Batch API.
    • FSong method (f_song.py): generates JSONL inputs, runs the FSong extraction, maps claims back to their source utterances, and explodes them into one row per claim.
  5. Check-Worthiness Classification (cw.py)
    • Classifies factual statements; supports both the Majer and Hassan prompt variants (sketched after the Details note below).
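To make step 1 concrete, here is a minimal sketch of the explode-and-context idea. The WildChat-style record layout, the window size of one preceding user turn, and the example data are assumptions; the actual script may differ.

```python
# Illustrative sketch only: explode conversations into utterance-level rows
# and attach a context window to each system utterance. The record layout
# (a "conversation" list of {"role", "content"} turns) is an assumption.
import pandas as pd

records = [
    {
        "conversation_hash": "abc123",  # hypothetical example record
        "conversation": [
            {"role": "user", "content": "Write a short story set in Kyoto."},
            {"role": "assistant", "content": "Kyoto, the capital of Japan until 1868, ..."},
        ],
    },
]

rows = []
for rec in records:
    turns = rec["conversation"]
    for i, turn in enumerate(turns):
        if turn["role"] != "assistant":
            continue
        # Context window: here, just the preceding user turn (a window size
        # of 1 is an assumption; the real pipeline may keep a longer window).
        context = turns[i - 1]["content"] if i > 0 else ""
        rows.append({
            "Conversation_Hash": rec["conversation_hash"],
            "Turn_Num": i,
            "Corresponding_User_Question": context,
            "Selected_Agent_Utterance": turn["content"],
        })

df = pd.DataFrame(rows)
print(df)
```

The output column names mirror those documented for analysis.csv above.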

📂 Details: See the generation/README.md for complete pipeline descriptions, usage examples, and command-line arguments.
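For flavor, step 5's prompt-based classification might look like the following hedged sketch. The prompt wording and model name here are placeholders; the actual Majer and Hassan prompts ship with the repository, and cw.py is the authoritative implementation.

```python
# Hypothetical sketch of an LLM-based check-worthiness call; not the released
# cw.py. The prompt below is a placeholder, not the Majer or Hassan prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Decide whether the following factual statement is check-worthy, i.e., "
    "whether a fact-checker would consider verifying it. Answer YES or NO.\n\n"
    "Statement: {claim}"
)

def classify_claim(claim: str, model: str = "gpt-4o-mini") -> bool:
    # The model choice is an assumption for illustration.
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(claim=claim)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

print(classify_claim("Kyoto was the capital of Japan until 1868."))
```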

Analysis

The directory analysis/ contains Python scripts used to analyze the annotations and reproduce results reported in the paper.

  • statistics_3k_conversation.py
    Generates descriptive statistics for the 3,000 sampled conversations. Outputs counts of utterances, unique conversations, turn distributions, average lengths of user questions and agent utterances, and task classification distributions.

  • statistics_fact_claim_extraction_3k.py
    Computes statistics on factual claim extraction across the 3k conversations, comparing FHuo and FSong methods. Reports total claims, average claims per utterance/conversation, and coverage statistics.

  • statistics_human_annotations.py
    Analyzes the 200 human-annotated claims. Provides row counts per extraction method, percentages of TRUE labels (Human1_CW, Human2_CW, Gold), and inter-annotator agreement using Cohen’s κ.

  • effectiveness_automatic_check_worthiness.py
    Evaluates automatic CW classifiers (Hassan, Majer, Intersection, Union) against the human-annotated gold labels. Reports Precision, Recall, F1-score, and Cohen’s κ for each extraction method (a minimal sketch of this evaluation follows the list).

  • prevalence_check_worthy_3k.py
    Estimates the prevalence of CW claims across the 3,000 sampled conversations. Reports percentages of CW claims, utterances with ≥1 CW claim, and conversations with ≥1 CW claim for all classifier–extraction combinations.

    For detailed instructions on running the analysis scripts, see the analysis README.
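As a rough illustration of that evaluation (not the released script), the sketch below scores the automatic classifiers against the human gold labels. Treating Check_Worthy as the gold column is an assumption, and the real script additionally breaks results down per extraction method.

```python
# Illustrative sketch: inter-annotator agreement and classifier effectiveness
# on human_annotations.csv. Assumes the columns documented in this README.
import pandas as pd
from sklearn.metrics import cohen_kappa_score, precision_recall_fscore_support

df = pd.read_csv("annotations/human_annotations.csv")

def to_bool(col):
    # Robust to bool dtype or TRUE/FALSE strings.
    return col.astype(str).str.strip().str.upper().eq("TRUE")

gold = to_bool(df["Check_Worthy"])  # gold-column name is an assumption

# Inter-annotator agreement between the two human annotators.
kappa = cohen_kappa_score(to_bool(df["Human1_CW"]), to_bool(df["Human2_CW"]))
print(f"Human1 vs. Human2 Cohen's kappa: {kappa:.3f}")

# Precision / Recall / F1 and kappa for each automatic classifier.
for clf in ["Hassan", "Majer", "Intersection", "Union"]:
    pred = to_bool(df[clf])
    p, r, f1, _ = precision_recall_fscore_support(
        gold, pred, average="binary", zero_division=0
    )
    print(f"{clf:12s} P={p:.3f} R={r:.3f} F1={f1:.3f} "
          f"kappa={cohen_kappa_score(gold, pred):.3f}")
```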

License

WildClaims is released under the ODC-By 1.0 license, following WildChat; by using WildClaims, you agree to its usage terms.

Citation

@misc{Joko:2025:WildClaims,
      title={WildClaims: Conversational Information Access in the Wild(Chat)}, 
      author={Hideaki Joko and Shakiba Amirshahi and Charles L. A. Clarke and Faegheh Hasibi},
      year={2025},
      eprint={2509.17442},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2509.17442}, 
}

Contact

If you have any questions, please contact Shakiba Amirshahi ([email protected]) or Hideaki Joko ([email protected]).
