Please follow VGT's documentation.
Under the root directory SHTRAG/:
$ conda env create --file my_env.yml
$ conda activate shtrag
$ pip install -r raptor/requirements.txt
$ pip install --upgrade pymupdf
$ pip install anytree
$ pip install python-dotenv
$ pip install datasets
$ pip install rank_bm25
$ pip install colorlog
$ pip install matplotlib
[issue] If you encounter this error, you may directly remove `cached_download` from the `huggingface_hub` import line:
> cannot import name 'cached_download' from 'huggingface_hub'
or pin the version:
$ pip install --upgrade huggingface_hub==0.25.2
[issue] If you encounter this error when using the batch API:
> AttributeError: 'OpenAI' object has no attribute 'batches'
you may update openai:
$ pip install --upgrade openai --quiet
Create a .env file under the root directory SHTRAG/:
API_KEY=your-openai-key
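For reference, this key can then be loaded at runtime via python-dotenv (installed above); a minimal sketch:

```python
# Minimal sketch: load API_KEY from SHTRAG/.env using python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()                   # reads .env from the current working directory
api_key = os.getenv("API_KEY")  # the OpenAI key defined above
```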
Create your dataset under folder data/. All data will be stored under folder data/your-dataset/. Store the pdf files under folder data/your-dataset/pdf/. Store the queries as a list of dictionaries in data/your-dataset/queries.json.
Example: 01262022-1835.pdf, queries.json
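The exact keys of each query dictionary depend on your dataset, so follow the queries.json example above; purely as an illustration, a file with hypothetical keys could be produced like this:

```python
# Hypothetical queries.json writer. The "pdf" and "query" keys are illustrative,
# not the repo's required schema — match the linked queries.json example instead.
import json

queries = [
    {"pdf": "01262022-1835.pdf", "query": "What does this document announce?"},
]
with open("data/your-dataset/queries.json", "w") as f:
    json.dump(queries, f, indent=2)
```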
To identify the headings of a pdf, we use VGT:
$ curl -X POST -F 'file=@path-to-your-pdf' localhost:5060
VGT returns a json file, classifying the texts of the pdf into ten categories. Store this json file under folder data/your-dataset/heading_identification.
Example: 01262022-1835.json
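Equivalently, the request can be issued from Python (a sketch assuming the `requests` package is available and the VGT server is listening on localhost:5060):

```python
# Send a pdf to the VGT server and store the returned json under
# heading_identification/ (paths are illustrative).
import json
import requests

pdf_path = "data/your-dataset/pdf/01262022-1835.pdf"
with open(pdf_path, "rb") as f:
    resp = requests.post("http://localhost:5060", files={"file": f})
resp.raise_for_status()

out_path = "data/your-dataset/heading_identification/01262022-1835.json"
with open(out_path, "w") as out:
    json.dump(resp.json(), out, indent=2)
```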
After identifying the headings, it is time to build the SHT and incorporate it into Structured-RAG. For implementation details, please read structured_rag/README.md.
$ python run_structured_rag.py --root-dir path-to-your-dataset
For example, you can use data/your-dataset/. It is recommended to always use absolute paths.
The arguments for setting the configuration are listed below:
| flag | required | default | explanation | example |
|---|---|---|---|---|
| root-dir | True | - | (absolute) path to your dataset folder | --root-dir ./data/example/ |
| chunk-size | False | 100 | size of a chunk (i.e., the #tokens of the context of a newly added leaf) | --chunk-size 100 |
| summary-len | False | 100 | length of a recursively generated summary (i.e., the #tokens of the context of an original SHT node) | --summary-len 100 |
| node-embedding-model | False | "sbert" | the embedding model for SHT nodes (choices: "sbert", "dpr", "te3small") | --node-embedding-model "sbert" |
| query-embedding-model | False | "sbert" | the embedding model for the query (choices: "sbert", "dpr", "te3small") | --query-embedding-model "sbert" |
| summarization-model | False | "gpt-4o-mini" | the summarization model (choices: "gpt-4o-mini", "empty" (returns an empty string as the summary)) | --summarization-model "gpt-4o-mini" |
| distance-metric | False | "cosine" | the distance metric in the embedding space (choices: "cosine", "L1", "L2", "Linf") | --distance-metric "cosine" |
| context-hierarchy | False | True | whether to recover hierarchical information in the final context (choices: True, False) | --context-hierarchy True |
| embed-hierarchy | False | True | whether to embed the hierarchical information (choices: True, False) | --embed-hierarchy True |
| context-raw | False | True | whether to retrieve the newly added leaves (i.e., the chunks of the document) for the final context (choices: True, False) | --context-raw True |
| context-len | False | 1000 | the #tokens of the final context | --context-len 1000 |
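For example, a full invocation with explicit flags might look like this (the values are illustrative; use an absolute path for --root-dir):
$ python run_structured_rag.py --root-dir /abs/path/to/SHTRAG/data/your-dataset/ --chunk-size 100 --summary-len 100 --node-embedding-model "sbert" --query-embedding-model "sbert" --summarization-model "gpt-4o-mini" --distance-metric "cosine" --context-len 1000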
After running this command, you will find:
- an SHT for each queried pdf (Example: for 01262022-1835.pdf, an SHT named 01262022-1835.json is generated under each configuration folder, e.g., one built with "empty" as the summarization model and another built with "gpt-4o-mini")
- a visualized SHT for each generated SHT (Example: 01262022-1835.vis)
- an indexing that sorts the SHT nodes in ascending order of their distances to a query in the embedding space (Example: index.jsonl)
- the retrieved context for each query (Example: context.jsonl)
You can then use these contexts to answer queries.
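For instance, a minimal sketch of question-answering over the retrieved contexts (the jsonl keys "query" and "context" are assumptions here; adjust them to match your generated context.jsonl):

```python
# Answer each query with its retrieved context via the OpenAI chat API.
import json
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("API_KEY"))

with open("context.jsonl") as f:
    for line in f:
        record = json.loads(line)  # assumed keys: "query", "context"
        reply = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "Answer the question using the context.\n\n"
                    f"Context:\n{record['context']}\n\n"
                    f"Question: {record['query']}"
                ),
            }],
        )
        print(reply.choices[0].message.content)
```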
Structured-RAG is implemented in folder structured_rag/. For implementation details, please read structured_rag/README.md.
To add a new embedding model:
- In SHTBuilder.py
  - import your new model from `.EmbeddingModels`
  - add your new model to `self.embedder` in `SHTBuilder.__init__`
- In SHTIndexer.py
  - import your new model from `.EmbeddingModels`
  - add your new model to `self.embedder` in `SHTIndexer.__init__`
- In StructuredRAG.py
  - add your new model to `candid_embedding_models` in `Structured_RAG.build_sht`
- In run_structured_rag.py
  - add your new model to the choices of the cmd arguments `--node-embedding-model` and/or `--query-embedding-model`
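Presumably the new model is first defined in EmbeddingModels.py; a hypothetical sketch (the class name and the create_embedding() method name are assumptions — mirror the existing "sbert"/"dpr" models in EmbeddingModels.py for the real interface):

```python
# EmbeddingModels.py — an illustrative new embedding model wrapping a
# sentence-transformers checkpoint. The create_embedding name is an assumption;
# copy the signature of the existing models in this file.
from sentence_transformers import SentenceTransformer


class MyEmbeddingModel:
    def __init__(self, model_name="multi-qa-mpnet-base-cos-v1"):
        self.model = SentenceTransformer(model_name)

    def create_embedding(self, text):
        # Returns a 1-D vector for `text`.
        return self.model.encode(text)
```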
To add a new summarization model:
- In SummarizationModels.py
  - define your new model as a derived class of `BaseSummarizationModel`; you can refer to `BaseGPTSummarizationModel`
- In SHTBuilder.py
  - import your new model from `.SummarizationModels`
  - add your new model to `self.summarizer` in `SHTBuilder.__init__`
- In run_structured_rag.py
  - add your new model to the choices of the cmd argument `--summarization-model`
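For illustration, a new class in SummarizationModels.py might look like the following (a sketch: the summarize() signature is assumed from RAPTOR-style codebases; refer to BaseGPTSummarizationModel for the actual interface):

```python
# SummarizationModels.py — an illustrative subclass of BaseSummarizationModel
# (defined in the same file). The method name and signature are assumptions.
class MySummarizationModel(BaseSummarizationModel):
    def summarize(self, context, max_tokens=100):
        # Placeholder logic: truncate to max_tokens whitespace tokens.
        # A real model would call an LLM here, as BaseGPTSummarizationModel does.
        return " ".join(context.split()[:max_tokens])
```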
Our experiment data is stored under ./data. You can download it from Google Drive (9.14 GB); the unzipped dataset is 39 GB.
data
└── <your-dataset>
├── pdf
├── heading_identification
├── node_clustering
└── queries.json
- `pdf/` stores the queried pdf files.
- `heading_identification/` stores the VGT results for the pdfs.
- `node_clustering/` stores the clustered headings (e.g., SHT nodes) for the pdfs.
- `queries.json` stores all the queries.
Folder baselines/ stores the results for the baselines.
data
└── <your-dataset>
└── baselines
├── <node-embedding-model>.<summarization-model>.c<chunk-size>.s<summarization-len>
│ └── <query-embedding-model>.<distance-metric>.raptor<is-raptor>
│ ├── <context-len>.o<is-ordered>
│ │ ├── context.jsonl
│ │ ├── answer.jsonl
│ │ ├── (qa_job.jsonl)
│ │ ├── (qa_result.jsonl)
│ │ ├── (rating.jsonl)
│ │ ├── (rating_job.jsonl)
│ │ └── (rating_result.jsonl)
│ └── index.jsonl
└── raptor_tree
- `raptor_tree/` stores the trees generated by RAPTOR for the pdfs.
- `index.jsonl` stores the SHT nodes sorted in indexing order for the queries.
- `context.jsonl` stores the generated contexts for the queries.
- `answer.jsonl` stores the answers to the queries using the generated contexts.
- `rating.jsonl` stores the LLM's ratings for the generated answers. Only the Qasper dataset has this file.
- `qa_job.jsonl`, `qa_result.jsonl`, `rating_job.jsonl`, and `rating_result.jsonl` are intermediate files for the OpenAI batch API.
The configurations for the baselines can be inferred from the path to the results.
data
└── <your-dataset>
└── <node-embedding-model>.<summarization-model>.c<chunk-size>.s<summarization-len>
├── <query-embedding-model>.<distance-metric>.h<embed-hierarchy>
│ ├── <context-len>.l<context-raw>.h<context-hierarchy>
│ │ ├── context.jsonl
│ │ ├── answer.jsonl
│ │ ├── (qa_job.jsonl)
│ │ ├── (qa_result.jsonl)
│ │ ├── (rating.jsonl)
│ │ ├── (rating_job.jsonl)
│ │ └── (rating_result.jsonl)
│ └── index.jsonl
├── sht
└── sht_vis
- `sht/` stores the SHTs for the queried pdfs.
- `sht_vis/` stores the visualized SHTs.
The configurations for Structured-RAG can be inferred from the path to the results.
Folder grobid/ stores the results for SHTs generated by GROBID.
data
└── <your-dataset>
└── grobid
├── <node-embedding-model>.<summarization-model>.c<chunk-size>.s<summarization-len>
│ └── <query-embedding-model>.<distance-metric>.h<embed-hierarchy>
│ ├── <context-len>.l<context-raw>.h<context-hierarchy>
│ │ ├── context.jsonl
│ │ ├── answer.jsonl
│ │ ├── (qa_job.jsonl)
│ │ ├── (qa_result.jsonl)
│ │ ├── (rating.jsonl)
│ │ ├── (rating_job.jsonl)
│ │ └── (rating_result.jsonl)
│ └── index.jsonl
├── grobid
└── node_clustering
- `grobid/` stores the raw results returned by GROBID.
- `node_clustering/` stores the intermediate results that help build SHTs from the GROBID results.
Folder intrinsic/ stores the results for the human-labeled intrinsic SHTs.
data
└── <your-dataset>
└── intrinsic
├── <node-embedding-model>.<summarization-model>.c<chunk-size>.s<summarization-len>
│ └── <query-embedding-model>.<distance-metric>.raptor<is-raptor>
│ ├── <context-len>.o<is-ordered>
│ │ ├── context.jsonl
│ │ ├── answer.jsonl
│ │ ├── (qa_job.jsonl)
│ │ ├── (qa_result.jsonl)
│ │ ├── (rating.jsonl)
│ │ ├── (rating_job.jsonl)
│ │ └── (rating_result.jsonl)
│ └── index.jsonl
├── heading_identification
├── human_label
└── node_clustering
- `human_label/` stores the human labels.
- `heading_identification/` and `node_clustering/` are derived from the human labels.
Folder: eval/
`eval_{dataset}.py` evaluates the accuracy, F1, or LLM-as-a-judge ratings for the corresponding dataset.
- run_structured_rag.py generates the Structured-RAG results.
- run_grobid.py generates the GROBID SHTs. Then you can use `StructuredRAG` to generate their indexes and contexts (similar to run_structured_rag.py).
- run_raptor.py generates the RAPTOR trees and their indexes and contexts.
- run_vanilla.py generates the vanilla chunks and their indexes. Then you can use `StructuredRAG` to generate their contexts (similar to run_structured_rag.py).
- run_bm25.py generates the results of Structured-RAG and the baselines (vanilla & RAPTOR) using BM25 as the embedder for nodes and queries.
- compare.py calculates the number of $C$-templatized and well-formatted files (`calc()`). It also calculates the average percentage of SHT nodes that have robust hierarchical information (AVG(%hierarchy-robust nodes)), and the average percentage of SHT nodes whose hierarchical information exactly matches the intrinsic ones (AVG(%hierarchy-intrinsic nodes)) (`count()`).
Folder: batches/
We use the OpenAI batch API for question-answering and LLM-as-a-judge.
- batch.py generates a batch file storing a batch of qa/rating (LLM-as-a-judge) tasks, uploads it to the server, and creates a corresponding batch job. You can also check the status of a batch, retrieve the results of the tasks from the server, and delete the batch job from the server (a sketch of this flow follows the list).
- llm_judge_prompt.txt stores the prompt template for LLM-as-a-judge to rate an answer compared with the gold answer.
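For reference, batch.py wraps the standard OpenAI Batch API flow; a minimal sketch (file names are illustrative):

```python
# Upload a .jsonl of tasks, create a batch job, and fetch results when done.
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()
client = OpenAI(api_key=os.getenv("API_KEY"))

# Upload the .jsonl file of tasks.
batch_file = client.files.create(file=open("qa_job.jsonl", "rb"), purpose="batch")

# Create a batch job over the chat completions endpoint.
job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Poll the status; once completed, download the results.
status = client.batches.retrieve(job.id)
if status.status == "completed":
    content = client.files.content(status.output_file_id)
    with open("qa_result.jsonl", "wb") as out:
        out.write(content.read())
```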
Folder: raptor/
This folder was copied from RAPTOR's codebase.