# Evaluating the Long-Term Memory of Large Language Models

LoCoGen is an automated pipeline for constructing long-term dialogue datasets to evaluate the long-term memory capabilities of Large Language Models (LLMs). This project implements the methodology described in the paper "Evaluating the Long-Term Memory of Large Language Models".

## Features

- **Automated data generation**: a 5-stage pipeline for creating long, chronologically ordered conversations
- **LOCCO dataset**: 100 users with 3,080 dialogues spanning multiple time periods
- **Memory evaluation**: a comprehensive framework for testing LLM long-term memory
- **Multiple LLM support**: compatible with OpenAI GPT models and local models (InternLM2, Llama, etc.)
- **Modular architecture**: a clean, well-documented, and easily extensible codebase
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/LoCoGen.git
   cd LoCoGen
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up environment variables:

   ```bash
   cp .env.example .env
   # Edit .env and add your API keys
   ```

## Quick Start

```python
from src.api_client import create_client
from src.config import Config

# Initialize LLM client
client = create_client(model_name="gpt-4")

# Generate text
response = client.generate("Your prompt here", max_tokens=500)
print(response)
```

## Project Structure

```
locogen/
├── src/                      # Source code
│   ├── config.py             # Configuration management
│   ├── api_client.py         # Unified LLM API client
│   ├── prompts.py            # Prompt templates
│   ├── utils/                # Utility modules
│   │   ├── json_utils.py     # JSON parsing utilities
│   │   ├── text_utils.py     # Text processing utilities
│   │   └── file_utils.py     # File I/O utilities
│   ├── pipeline/             # Data generation pipeline
│   │   ├── stage1_character_init.py        # Character initialization
│   │   ├── stage2_diary_generation.py      # Diary generation
│   │   ├── stage3_dialogue_generation.py   # Dialogue generation
│   │   ├── stage4_dataset_construction.py  # Dataset construction
│   │   └── stage5_question_generation.py   # Question generation
│   └── evaluation/           # Evaluation modules
│       ├── metrics/          # Evaluation metrics (BLEU, ROUGE, etc.)
│       └── consistency_model.py  # Consistency evaluation
├── data/                     # Data directory
│   ├── raw/                  # Raw input data
│   ├── intermediate/         # Intermediate outputs
│   └── final/                # Final datasets (LOCCO.json, LOCCO_L.json)
├── scripts/                  # Execution scripts
├── notebooks/                # Jupyter notebooks for analysis
├── tests/                    # Unit tests
├── docs/                     # Documentation
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```
## Configuration

Edit the `.env` file to configure:

```env
# OpenAI API
OPENAI_API_KEY=your_api_key_here
OPENAI_BASE_URL=https://api.openai.com/v1

# Model settings
DEFAULT_MODEL=gpt-4
MAX_TOKENS=4096
TEMPERATURE=0.7

# Logging
LOG_LEVEL=INFO
```

## Data Generation Pipeline

The LoCoGen pipeline consists of five stages:
1. **Character initialization**: generate detailed character profiles with MBTI personality types across three time points (1, 3, and 5 years ago).
2. **Diary generation**: create temporal diary entries for each character, maintaining consistency and character development.
3. **Dialogue generation**: convert diary entries into multi-turn user-chatbot dialogues (3-5 rounds per conversation).
4. **Dataset construction**: process the dialogues into time-split training datasets with cloze-mask tasks.
5. **Question generation**: generate memory-test questions that evaluate an LLM's ability to recall historical information.
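The five stages above can be sketched as a simple chain of functions. The stage names mirror the modules under `src/pipeline/`, but every function name, data shape, and the toy cloze-mask helper below are illustrative assumptions, not the project's actual API; a real run would call an LLM at the generation stages.

```python
def make_cloze_mask(utterance: str, answer: str, mask: str = "[MASK]") -> dict:
    """Toy cloze-mask task: hide one factual span from a past dialogue
    turn so the model must later fill it in from memory."""
    return {"prompt": utterance.replace(answer, mask), "answer": answer}

def run_pipeline(profile: dict) -> dict:
    """Illustrative 5-stage chain mirroring stage1..stage5 in src/pipeline/."""
    # Stage 1: character initialization (here: just a pass-through)
    character = {"name": profile["name"], "mbti": profile["mbti"]}
    # Stage 2: diary generation (a real run would prompt an LLM here)
    diary = f"{character['name']} adopted a cat named Milo five years ago."
    # Stage 3: dialogue generation from the diary entry
    dialogue = [("user", diary), ("chatbot", "That sounds lovely!")]
    # Stage 4: dataset construction with a cloze-mask sample
    sample = make_cloze_mask(dialogue[0][1], "Milo")
    # Stage 5: memory-test question generation
    question = f"What was the name of the cat {character['name']} adopted?"
    return {"dialogue": dialogue, "cloze": sample, "question": question}

record = run_pipeline({"name": "Ava", "mbti": "INFJ"})
print(record["cloze"]["prompt"])  # Ava adopted a cat named [MASK] five years ago.
```

The cloze sample and the free-form question test the same underlying fact in two formats, which is the general shape of the stage 4/5 outputs.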
## Research Questions

This project addresses six key research questions:

1. How do LLMs perform on long-term memory tasks?
2. Does memory performance vary as new data is introduced?
3. Do LLMs exhibit memory preferences similar to humans?
4. Do LLMs experience cognitive load like humans?
5. Do LLMs exhibit a forgetting baseline?
6. Can LLMs achieve permanent memory through replay strategies?
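One way to probe the forgetting-related questions empirically is to bucket memory-test questions by how long ago the referenced fact appeared in the conversation history, then compare recall accuracy per bucket. A minimal sketch (the bucketing by whole years and the toy graded results are assumptions):

```python
from collections import defaultdict

def recall_by_elapsed_time(results):
    """results: iterable of (years_elapsed, correct) pairs, e.g. from
    grading a model's answers to memory-test questions.
    Returns recall accuracy per elapsed-time bucket."""
    hits, totals = defaultdict(int), defaultdict(int)
    for years, correct in results:
        totals[years] += 1
        hits[years] += int(correct)
    return {y: hits[y] / totals[y] for y in sorted(totals)}

# Toy grading results for facts from 1, 3, and 5 years ago
graded = [(1, True), (1, True), (3, True), (3, False), (5, False), (5, False)]
print(recall_by_elapsed_time(graded))  # {1: 1.0, 3: 0.5, 5: 0.0}
```

A downward trend across buckets would be evidence of a forgetting curve; a flat, nonzero floor would suggest a forgetting baseline.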
## Evaluation Metrics

The project includes a comprehensive set of evaluation metrics:

- **BLEU**: Bilingual Evaluation Understudy
- **ROUGE**: Recall-Oriented Understudy for Gisting Evaluation
- **METEOR**: Metric for Evaluation of Translation with Explicit ORdering
- **CIDEr**: Consensus-based Image Description Evaluation
- **Consistency model**: a custom model for evaluating response consistency
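To give a sense of what the n-gram metrics measure, here is a simplified sentence-level BLEU (clipped unigram and bigram precision with a brevity penalty). It is a pedagogical sketch only; published results should use a standard implementation such as `sacrebleu` or NLTK's `sentence_bleu`.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. Single reference only."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        c_counts, r_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        precisions.append(overlap / max(1, sum(c_counts.values())))
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages trivially short candidates
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(1, len(cand)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(simple_bleu("the cat sat on the mat", "the cat sat on the mat"), 3))  # 1.0
```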
## Key Findings

- LLMs can retain information from past interactions to a certain extent.
- Memory gradually weakens as time passes.
- Rehearsal strategies enhance memory persistence.
- LLMs exhibit memory preferences across different information categories.
- Excessive rehearsal is not effective for larger models.
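The rehearsal findings can be illustrated with a spaced-replay schedule: instead of re-inserting an old dialogue into the context on every turn (excessive rehearsal), it is replayed at exponentially growing intervals. The schedule below is a toy assumption for illustration, not the paper's actual replay strategy.

```python
def spaced_replay_turns(first_seen: int, horizon: int, base: int = 2) -> list:
    """Turns at which a memory first seen at turn `first_seen` is replayed:
    first_seen + 1, + 2, + 4, + 8, ... up to `horizon` (exponential spacing)."""
    turns, gap = [], 1
    while first_seen + gap <= horizon:
        turns.append(first_seen + gap)
        gap *= base
    return turns

print(spaced_replay_turns(first_seen=0, horizon=20))  # [1, 2, 4, 8, 16]
```

Exponential spacing keeps the total number of replays logarithmic in the conversation length, which sidesteps the diminishing returns of excessive rehearsal noted above.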
## Citation

If you use this code or dataset in your research, please cite:

```bibtex
@article{locogen2024,
  title={Evaluating the Long-Term Memory of Large Language Models},
  author={Jia, Zixi and Liu, Qinghua and Li, Hexiao and Chen, Yuyan and Liu, Jiqiang},
  journal={arXiv preprint arXiv:2309.16609},
  year={2024}
}
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Acknowledgments

- MBTI-S2Conv dataset for character profiles
- OpenAI for GPT models
- Hugging Face for transformer models
## Contact

For questions or issues, please:

- Open an issue on GitHub
- Contact the authors (see the paper for details)