sam234990/BookRAG

BookRAG

This is the repository for "BookRAG: A Hierarchical Structure-aware Index-based Approach for Retrieval-Augmented Generation on Complex Documents".

Our framework builds on MinerU and PDF-extract-kit-1.0 for PDF parsing. If you run into problems related to PDF information extraction during environment setup, please refer to MinerU for more details.

Setup Environment

This project uses MinerU as its PDF parsing method. Please follow MinerU's instructions to install its dependencies first.

The full environment for BookRAG is coming.

Run BookRAG

BookRAG runs in two steps: offline index construction and online querying.

Before running these two steps, select and modify the system and dataset configs. In the dataset config, set the dataset input path and the working directory (example file: dataset_config.yaml). In the system config, set the parameters related to the LLM, VLM, and ... (example file: default.yaml).

Offline Index

We provide a bash script for constructing the Book Index. Make sure it uses the config you set up earlier: index.sh.

bash Script/example-index.sh

Online Retrieval

We provide a bash script for online retrieval on a given dataset. Make sure it uses the config you set up earlier: online.sh.

bash Script/example-rag.sh

Evaluate

We use a powerful LLM to extract answers from the responses of BookRAG or other methods. Please set up the API file first: TXT.

We also provide a bash script for evaluating the answers: eval.sh.

bash Script/example-eval.sh
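
The three stages above can also be chained from a small Python driver. The following is only a sketch: it assumes you run it from the repository root, that the example scripts already point at your configs, and that the API file for evaluation is in place. The helper names `build_commands` and `run_pipeline` are hypothetical and not part of BookRAG.

```python
import subprocess

# Script paths as given in this README, in execution order.
STAGE_SCRIPTS = [
    "Script/example-index.sh",  # offline index construction
    "Script/example-rag.sh",    # online retrieval
    "Script/example-eval.sh",   # answer evaluation
]


def build_commands(scripts=STAGE_SCRIPTS):
    """Return one `bash <script>` command per stage (hypothetical helper)."""
    return [["bash", script] for script in scripts]


def run_pipeline(runner=subprocess.run):
    """Run each stage in order; check=True stops at the first failing stage."""
    for cmd in build_commands():
        runner(cmd, check=True)


# Uncomment to execute all three stages from the repository root:
# run_pipeline()
```

Stopping at the first failure matters here because online retrieval depends on the index built in the first stage, and evaluation depends on the retrieval outputs.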

Dataset format

We use the following datasets:

We then transform these datasets into a unified format:

[
    {
        "question":"THE FIRST QUESTION",
        "answer":"THE ANSWER OF FIRST QUESTION",
        "doc_uuid":"UUID OF THE DOCUMENT PDF",
        "doc_path":"PATH TO THE DOCUMENT PDF",
        "xxx":"other attributes"
    },
    {
        "question":"THE SECOND QUESTION",
        "answer":"THE ANSWER OF SECOND QUESTION",
        "doc_uuid":"UUID OF THE DOCUMENT PDF",
        "doc_path":"PATH TO THE DOCUMENT PDF",
        "xxx":"other attributes"
    }
]

Please see the example preprocessing scripts in Scripts.
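
As a rough illustration of the unified format, a conversion step might look like the sketch below. It is not one of the repository's preprocessing scripts. The raw field names (`q`, `a`, `doc_id`), the `data/pdfs` directory, and the helpers `to_unified` and `validate` are all assumptions made for this example; only the four output keys come from the format above.

```python
import json
from pathlib import PurePosixPath

# Keys every record carries in the unified format shown above.
REQUIRED_KEYS = ("question", "answer", "doc_uuid", "doc_path")


def to_unified(raw_records, pdf_dir):
    """Map hypothetical raw records (q / a / doc_id) to the unified format."""
    unified = []
    for rec in raw_records:
        entry = {
            "question": rec["q"],
            "answer": rec["a"],
            "doc_uuid": rec["doc_id"],
            # Assumes each PDF is stored as <pdf_dir>/<doc_id>.pdf
            "doc_path": str(PurePosixPath(pdf_dir) / f"{rec['doc_id']}.pdf"),
        }
        # Carry over any other attributes unchanged.
        for key, value in rec.items():
            if key not in ("q", "a", "doc_id"):
                entry.setdefault(key, value)
        unified.append(entry)
    return unified


def validate(records):
    """Raise ValueError if any record lacks one of the required keys."""
    for i, rec in enumerate(records):
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            raise ValueError(f"record {i} is missing keys: {missing}")
    return records


raw = [
    {"q": "THE FIRST QUESTION", "a": "THE ANSWER OF FIRST QUESTION",
     "doc_id": "doc-0001", "page": 3},
]
dataset = validate(to_unified(raw, "data/pdfs"))
print(json.dumps(dataset, indent=4))
```

Validating before writing the JSON file catches missing keys early, before index construction or retrieval is run on the dataset.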
