This month at Edinburgh ReproducibiliTea we hosted Sean Smith, a Data Analyst in the CAMARADES group at the University of Edinburgh’s Centre for Clinical Brain Sciences. Sean discussed his work with iRISE, an EU project for improving Reproducibility In SciencE. The project seeks to build evidence-based knowledge around what interventions are (and aren’t) effective for improving reproducibility. Sean’s work within this project, iRISE-SOLES (Systematic Online Living Evidence Summaries), aims to produce an overview of the existing interventions to improve reproducibility. You can view the output of Sean’s work on the iRISE-SOLES webpage. During this session, Sean walked us through his workflow for data curation in three steps: identification, screening, and annotation.
1. Identification
iRISE-SOLES identified over 123,000 potentially relevant papers using a string search across multiple databases (including Web of Science, Medline, PsycINFO, and others). About 55,000 duplicates were removed, leaving 68,995 articles to work with.
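Sean didn’t detail how duplicates were detected, but deduplication across databases is often done by comparing normalised record keys. Here’s a minimal, hypothetical sketch in Python (the `normalise` rule and the example records are illustrative, not iRISE-SOLES’s actual method):

```python
import re

def normalise(title):
    """Lowercase and strip punctuation/whitespace so near-identical
    titles from different databases compare equal."""
    return re.sub(r"[^a-z0-9]", "", title.lower())

def deduplicate(records):
    """Keep the first record seen for each normalised title."""
    seen, unique = set(), []
    for rec in records:
        key = normalise(rec["title"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

# Hypothetical records: the same paper indexed by two databases.
papers = [
    {"title": "Improving Reproducibility in Science", "source": "Web of Science"},
    {"title": "Improving reproducibility in science.", "source": "Medline"},
    {"title": "A Different Paper", "source": "PsycINFO"},
]
print(len(deduplicate(papers)))  # → 2
```

Real reference managers typically also compare DOIs, authors, and publication years, since titles alone can collide or vary.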
2. Screening
From the ~69,000 articles identified, 5,000 were randomly selected and manually screened for relevance by two people. The screened papers were then used to train and test a machine learning algorithm. After refining the algorithm, the remaining articles were processed. This resulted in 16,832 relevant articles that evaluated interventions to improve reproducibility.
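Sean didn’t specify which algorithm was used, but the train-on-a-labelled-sample, classify-the-rest pattern can be illustrated with a tiny naive Bayes relevance screener. Everything below (the tokeniser, the labelled examples, the model itself) is a hypothetical sketch, not the iRISE-SOLES classifier:

```python
import math
from collections import Counter

def tokenise(text):
    return text.lower().split()

def train(labelled):
    """labelled: list of (text, is_relevant) pairs from a hand-screened sample."""
    counts = {True: Counter(), False: Counter()}
    totals = Counter()
    for text, label in labelled:
        counts[label].update(tokenise(text))
        totals[label] += 1
    return counts, totals

def predict(model, text):
    """Score each label with log-probabilities and return the better one."""
    counts, totals = model
    vocab = set(counts[True]) | set(counts[False])
    best, best_score = None, -math.inf
    for label in (True, False):
        score = math.log(totals[label] / sum(totals.values()))
        for tok in tokenise(text):
            # Laplace smoothing so unseen words don't zero out the score
            score += math.log((counts[label][tok] + 1) /
                              (sum(counts[label].values()) + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Hypothetical double-screened sample: (article text, relevant?)
sample = [
    ("randomised trial of preregistration intervention", True),
    ("data sharing mandate improves replication", True),
    ("survey of coffee preferences", False),
    ("history of the department building", False),
]
model = train(sample)
print(predict(model, "preregistration improves replication"))  # → True
```

In practice a project at this scale would use an established library and validate the classifier against a held-out portion of the human-screened sample before trusting it on the remaining ~64,000 articles.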
3. Annotation
The iRISE-SOLES project used automated methods to annotate articles for:
- Intervention evaluated
- Intervention provider
- Target population
- Discipline
- Research stage affected
- Location
- Outcomes, and
- Evidence type (controlled or uncontrolled studies).
Normally in the annotation process, regular expressions (RegEx) would be used to tag mentions of particular words and phrases. RegEx wasn’t well suited to this work, though, so the iRISE team decided to use large language models (LLMs) for annotation instead. LLMs are artificial intelligence systems trained to generate human-like responses; they can be used to generate text and code, among other tasks. The iRISE-SOLES project assessed models from OpenAI (ChatGPT-4o), Meta (Llama 3), and Mistral AI, eventually choosing to move forward with ChatGPT-4o. They also compared how much of each paper to include (title and abstract versus title, abstract, and methods) and settled on title, abstract, and methods; evaluating the full texts was too computationally intensive to be practical. They also considered the method of LLM use (querying, fine-tuning, zero-shot learning, few-shot learning, embeddings, etc.), refined their annotation prompts, and evaluated the performance of the LLM.
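To make the LLM approach concrete, here is a sketch of how a few-shot annotation prompt might be assembled for a chat-style API. The category list, prompt wording, and `build_messages` helper are all hypothetical illustrations, not iRISE-SOLES’s actual prompts:

```python
# Illustrative category list, not the project's real annotation scheme.
CATEGORIES = ["preregistration", "data sharing", "reporting guidelines", "training"]

def build_messages(title, abstract, methods, examples):
    """Assemble a few-shot chat prompt: instructions, worked examples,
    then the target paper (title + abstract + methods)."""
    messages = [{
        "role": "system",
        "content": (
            "You annotate papers that evaluate reproducibility interventions. "
            f"Choose the intervention from this list only: {', '.join(CATEGORIES)}."
        ),
    }]
    for ex_paper, ex_label in examples:  # few-shot: show worked examples first
        messages.append({"role": "user", "content": ex_paper})
        messages.append({"role": "assistant", "content": ex_label})
    messages.append({
        "role": "user",
        "content": f"Title: {title}\nAbstract: {abstract}\nMethods: {methods}",
    })
    return messages

msgs = build_messages("T", "A", "M", [("example paper text", "preregistration")])
print(len(msgs))  # → 4: system + one worked example (user/assistant pair) + target
```

The resulting `messages` list would then be sent to the chosen model; the actual API call is omitted here.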
Initial Results
ChatGPT-4o generated better annotations in some categories than in others. It performed particularly well at identifying each paper’s target population and discipline, but struggled with the intervention provider. Looking more deeply into this, iRISE-SOLES found that the model performed better on controlled studies than on uncontrolled ones. They therefore split the dataset into controlled and uncontrolled studies, and found overall improved annotations when using only the 2,441 controlled studies. ChatGPT-4o also performed best with few-shot learning, where a few worked examples are included within the prompt.
The methodology for this type of study is difficult to reproduce, but the iRISE-SOLES team took several steps to mitigate this. They:
- Specified the following parameters in their LLM calls:
- Seed (fixes the randomness of model outputs),
- Temperature (lower values give a more deterministic, less creative response),
- Presence penalty (adjusts how strongly the model favours tokens it hasn’t already used), and
- Model version (used the exact same model version each time).
- Restricted category choice by asking ChatGPT-4o to choose from a list of categories instead of generating its own.
- Evaluated performance by:
- Being transparent at each stage,
- Providing a range of expected results if the study were to be reproduced, and
- Aiming to identify trends in the data rather than to generate perfect annotations.
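Pinning those parameters looks something like the request below, written for an OpenAI-style chat completions API. The specific values, model snapshot name, and prompt text are illustrative, not the team’s actual configuration:

```python
# Sketch of a reproducibility-minded request configuration (values illustrative).
request = {
    "model": "gpt-4o-2024-08-06",  # a dated snapshot, not a floating alias like "gpt-4o"
    "seed": 42,                    # fixes sampling randomness (best-effort determinism)
    "temperature": 0,              # lower = more deterministic, less creative output
    "presence_penalty": 0,         # no extra push toward tokens the model hasn't used yet
    "messages": [
        {"role": "system",
         "content": "Choose one intervention provider from: funder, journal, institution, other."},
        {"role": "user",
         "content": "Title: ...\nAbstract: ...\nMethods: ..."},
    ],
}
# client.chat.completions.create(**request)  # actual API call omitted in this sketch
```

Even with a fixed seed and zero temperature, LLM providers only promise best-effort determinism, which is one reason the team reports a range of expected results rather than a single exact output.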
Next steps
Sean concluded by outlining the next steps for the iRISE-SOLES project. The team plan to add new features to the online app and to make the database ‘living’, with regular automated updates. They also hope to improve and validate LLM performance with updated models. Finally, they hope that SOLES will reach new users who might recommend ways to broaden its scope, and to extend the project beyond a systematic review. The iRISE-SOLES project can be viewed online here, including a dashboard of their data and more detailed information.
This blog post was written by Alex Colety
SOCIALS:
Edinburgh RT Twitter
Edinburgh RT YouTube Channel
Edinburgh RT OSF page
Edinburgh RT mailing list
For any questions/suggestions, please send us an email at [email protected]

