This repository contains the code and data to reproduce the experiments from the paper Enabling natural language analysis for object-centric event logs. Conversational-OCEL2 is a conversational framework designed to facilitate process mining analysis over object-centric event logs following the OCEL 2.0 standard (in JSON). The approach leverages an architecture that combines Large Language Models (LLMs) with Retrieval Augmented Generation (RAG) to handle users' queries about OCEL 2.0 event logs and generate contextually relevant responses in natural language.
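The retrieval-augmented flow described above can be sketched in a few lines. This is a minimal illustration of the general RAG pattern, not the repository's actual pipeline: the `embed` function is a toy placeholder (a real setup would use a sentence-transformers model), and the document list is invented for the example.

```python
import math

def embed(text):
    # Placeholder embedding: normalized character-frequency vector.
    # A real pipeline would call a sentence-transformers model here.
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    counts = [text.lower().count(ch) for ch in alphabet]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def cosine(u, v):
    # Vectors are already normalized, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))

def retrieve(query, documents, k=2):
    # Rank documents by similarity to the query and keep the top-k.
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, documents):
    # Stuff the retrieved context into the prompt handed to the LLM.
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Event e1 of type 'Create Purchase Order' occurred at 2024-01-05.",
    "Object o7 is a material of type 'screw'.",
    "Event e2 of type 'Receive Goods' relates to purchase order o1.",
]
prompt = build_prompt("Which events relate to purchase orders?", docs)
```

The framework applies this pattern to knowledge extracted from the OCEL 2.0 log, so the LLM answers from retrieved log facts rather than from its parametric memory alone.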
Additionally, the repository contains a dataset for evaluating the conversational framework, derived from a standard OCEL 2.0 Procure-to-Pay (P2P) event log. This dataset functions as a benchmark for evaluating the effectiveness of conversational techniques in analyzing such event logs from multiple perspectives.
```
.
├── data/
│   ├── execution            # Knowledge extracted from the event log
│   └── ocel2-p2p.json       # Event log used for the evaluation
├── src/
│   ├── cmd4tests.sh         # Commands for launching evaluations
│   ├── eval.py              # Logic for evaluation
│   ├── main.py              # Main logic for live interaction
│   ├── oracle.py            # Verification oracle for evaluation
│   ├── pipeline.py          # LLM pipeline setup
│   ├── preprocessing.py     # OCEL2 log preprocessing
│   ├── prompts.json         # LLM prompt templates
│   ├── utility.py           # Helper functions
│   └── vector_store.py      # Vector store management with Qdrant
├── tests/                   # Sources for evaluation
│   ├── outputs/             # Outputs of the live conversations
│   ├── test_sets/           # Test sets employed during the evaluation
│   └── validation/          # Evaluation results for each run
├── logs.zip                 # Zipped folder with the tested log (to unzip)
├── .env                     # Environment variables (create/fill this)
├── requirements.txt         # Requirements to install
├── LICENSE                  # License file
└── README.md                # This file
```
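For orientation, an OCEL 2.0 JSON log such as `data/ocel2-p2p.json` is a single document with top-level `objectTypes`, `eventTypes`, `objects`, and `events` collections, where each event carries typed relationships to objects. A minimal sketch of inspecting such a log; the inline sample below is an illustrative fragment, not taken from the actual P2P log:

```python
import json

# Illustrative fragment following the OCEL 2.0 JSON layout; the real
# data/ocel2-p2p.json is far larger.
sample = json.loads("""
{
  "objectTypes": [{"name": "purchase_order", "attributes": []}],
  "eventTypes": [{"name": "Create Purchase Order", "attributes": []}],
  "objects": [{"id": "po1", "type": "purchase_order", "attributes": []}],
  "events": [
    {"id": "e1", "type": "Create Purchase Order",
     "time": "2024-01-05T09:00:00Z",
     "relationships": [{"objectId": "po1", "qualifier": "created"}]}
  ]
}
""")

def summarize(log):
    # Count events per event type -- the kind of "global information"
    # the preprocessing step extracts from the log.
    counts = {}
    for event in log["events"]:
        counts[event["type"]] = counts.get(event["type"], 0) + 1
    return counts

summary = summarize(sample)
```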
For a quick setup and test run:

- Clone the repository:

```bash
git clone https://github.com/angelo-casciani/Conversational-OCEL2
cd Conversational-OCEL2
```

- Create a Python virtual environment.

  Option 1: Using venv:

```bash
python3 -m venv .venv
source .venv/bin/activate
```

  Option 2: Using conda:

```bash
conda create --name xes2pddl python=3.10
conda activate xes2pddl
```

- Install the required dependencies:

```bash
pip install -r requirements.txt
```

- Ensure Docker is installed and running on your system. Download the latest Qdrant image from Docker Hub and run the Qdrant service:

```bash
docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 \
    -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
    qdrant/qdrant
```

- Configure the environment (create a `.env` file):
Create a `.env` file in the root directory and configure the following variables:

```
HF_TOKEN=<your HuggingFace token>
DEEPSEEK_API_KEY=<your DeepSeek API key (if using DeepSeek models)>
GOOGLE_API_KEY=<your Gemini API key (if using Google models)>
OPENAI_API_KEY=<your OpenAI API key (if using OpenAI models)>
QDRANT_URL=127.0.0.0
QDRANT_GRPC_PORT=6334
```

Required configurations:

- `HF_TOKEN`: your HuggingFace token for accessing open-source language models and embedding models
- `QDRANT_URL`: the URL where Qdrant is running (default: `127.0.0.0`)
- `QDRANT_GRPC_PORT`: the gRPC port for Qdrant (default: `6334`)

The other configurations are optional.
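A minimal, stdlib-only sketch of how these variables might be read at startup; this is an illustration, not the repository's actual loading code (which may use a helper such as `python-dotenv`). The defaults mirror the values documented above, and the `load_settings` name is hypothetical:

```python
import os

def load_settings(env=os.environ):
    # Fall back to the documented defaults when a variable is unset.
    return {
        "hf_token": env.get("HF_TOKEN"),  # required for HuggingFace models
        "qdrant_url": env.get("QDRANT_URL", "127.0.0.0"),
        "qdrant_grpc_port": int(env.get("QDRANT_GRPC_PORT", "6334")),
        # Optional provider keys; None means that provider stays disabled.
        "openai_api_key": env.get("OPENAI_API_KEY"),
        "google_api_key": env.get("GOOGLE_API_KEY"),
        "deepseek_api_key": env.get("DEEPSEEK_API_KEY"),
    }

# Example with only the required token set; Qdrant falls back to defaults.
settings = load_settings({"HF_TOKEN": "hf_example"})
```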
- Run the application:

```bash
cd src
python3 main.py --rebuild_db True
```

Please note that this software leverages the open-source LLMs reported in the table:
| Model | HuggingFace Link |
|---|---|
| meta-llama/Meta-Llama-3-8B-Instruct | HF link |
| meta-llama/Meta-Llama-3.1-8B-Instruct | HF link |
| meta-llama/Llama-3.2-1B-Instruct | HF link |
| meta-llama/Llama-3.2-3B-Instruct | HF link |
| mistralai/Mistral-7B-Instruct-v0.2 | HF link |
| mistralai/Mistral-7B-Instruct-v0.3 | HF link |
| mistralai/Mistral-Nemo-Instruct-2407 | HF link |
| mistralai/Ministral-8B-Instruct-2410 | HF link |
| Qwen/Qwen2.5-7B-Instruct | HF link |
| google/gemma-2-9b-it | HF link |
| gpt-4o-mini | OpenAI link |
Request permission in advance to use each Llama model with your HuggingFace account. Retrieve your OpenAI API key to use the supported GPT model.
Please note that each of the selected models has specific requirements in terms of GPU availability. It is recommended to run the software in a GPU-enabled environment that meets at least the minimum requirements of these models.
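Before launching a large model, it can help to verify that a GPU driver is visible at all. A stdlib-only heuristic sketch (a real setup would typically query the framework directly, e.g. `torch.cuda.is_available()`):

```python
import shutil

def gpu_likely_available():
    # Heuristic: the NVIDIA driver ships the nvidia-smi utility, so its
    # presence on PATH suggests a usable GPU. This does not replace a
    # proper check through the deep learning framework itself.
    return shutil.which("nvidia-smi") is not None

print("GPU likely available:", gpu_likely_available())
```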
Navigate to the project directory and run the project in the preferred configuration:

```bash
cd src
python3 main.py
```

The project now includes an enhanced pipeline with better error handling and performance. To use it:

```bash
python3 main.py --modality=live
```

To run evaluations for the different aspects of the OCEL2 analysis:
Global information evaluation:

```bash
python3 main.py --llm_id Qwen/Qwen2.5-7B-Instruct --modality evaluation-global --max_new_tokens 512
```

Events analysis evaluation:

```bash
python3 main.py --llm_id Qwen/Qwen2.5-7B-Instruct --modality evaluation-events --max_new_tokens 512
```

Objects analysis evaluation:

```bash
python3 main.py --llm_id Qwen/Qwen2.5-7B-Instruct --modality evaluation-objects --max_new_tokens 512
```

Timestamps analysis evaluation:

```bash
python3 main.py --llm_id Qwen/Qwen2.5-7B-Instruct --modality evaluation-ts --max_new_tokens 512
```

Complete evaluation (all categories):

```bash
python3 main.py --llm_id Qwen/Qwen2.5-7B-Instruct --modality evaluation-all --max_new_tokens 512
```

If you need to rebuild the vector database (e.g., after changing the OCEL2 log or updating the embeddings):

```bash
python3 main.py --rebuild_db=true
```

The framework supports various configuration parameters:
| Parameter | Default | Description |
|---|---|---|
| `--embed_model_id` | `sentence-transformers/all-MiniLM-L12-v2` | Embedding model identifier |
| `--vector_dimension` | `384` | Vector space dimension (auto-detected if using the enhanced pipeline) |
| `--llm_id` | `meta-llama/Meta-Llama-3.1-8B-Instruct` | LLM model identifier |
| `--model_max_length` | `128000` | Maximum input length (context window) |
| `--num_documents_in_context` | `5` | Number of documents retrieved for context |
| `--max_new_tokens` | `1280` | Maximum number of tokens to generate |
| `--batch_size` | `32` | Batch size for embedding processing |
| `--rebuild_db` | `false` | Whether to rebuild the vector index |
| `--use_enhanced_pipeline` | `true` | Use the enhanced pipeline with better error handling |
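The parameters above suggest a command-line interface along these lines. This `argparse` sketch is hypothetical: the actual flag handling lives in `src/main.py` and may differ, but the names and defaults mirror the table:

```python
import argparse

def str2bool(s):
    # The documented invocations pass booleans as strings (--rebuild_db=true).
    return s.lower() in ("true", "1", "yes")

def build_parser():
    # Defaults mirror the parameter table above.
    p = argparse.ArgumentParser(description="Conversational-OCEL2 (sketch)")
    p.add_argument("--embed_model_id",
                   default="sentence-transformers/all-MiniLM-L12-v2")
    p.add_argument("--vector_dimension", type=int, default=384)
    p.add_argument("--llm_id",
                   default="meta-llama/Meta-Llama-3.1-8B-Instruct")
    p.add_argument("--model_max_length", type=int, default=128000)
    p.add_argument("--num_documents_in_context", type=int, default=5)
    p.add_argument("--max_new_tokens", type=int, default=1280)
    p.add_argument("--batch_size", type=int, default=32)
    p.add_argument("--rebuild_db", type=str2bool, default=False)
    p.add_argument("--use_enhanced_pipeline", type=str2bool, default=True)
    return p

# Example: rebuild the index with a smaller generation budget.
args = build_parser().parse_args(["--rebuild_db=true",
                                  "--max_new_tokens", "512"])
```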
It is possible to upload a different OCEL 2.0 log (in JSON format) in the `data` folder by replacing the provided `ocel2-p2p.json` log. After uploading a new log, rebuild the database:

```bash
python3 main.py --rebuild_db=true --log=your-new-log.json
```

Additional usage examples:

```bash
# Use enhanced pipeline (recommended)
python3 main.py

# Automatic embedding dimension detection
python3 main.py --embed_model_id=sentence-transformers/all-mpnet-base-v2

# Better error recovery and batch processing
python3 main.py --rebuild_db=true --batch_size=16
```

If you use this repository in your research, please cite:
```bibtex
@article{casciani2026enabling,
  title={Enabling natural language analysis for object-centric event logs},
  author={Casciani, Angelo and Bernardi, Mario Luca and Cimitile, Marta and Marrella, Andrea},
  journal={Process Science},
  volume={3},
  number={1},
  pages={5},
  year={2026},
  publisher={Springer}
}
```