Legal KG V1.1

Document assumptions

The knowledge graph is modelled on the EUR-Lex document structure. In particular, the following assumptions were made:

- Each act is composed of one or more chapters
- Each chapter is composed of one or more articles
  - In some cases, a chapter is first divided into sections, which in turn contain the articles (Chapter -> Section -> Article)
- Articles may cite other articles or paragraphs
- An article is typically composed of a series of paragraphs
- Case law interprets articles; one or more cases may refer to a specific article or act

Knowledge Graph

Nodes

To keep nodes such as articles or chapters unique (their numbering repeats across different documents), each node ID is a string that combines the act's unique identifier (CELEX), provided by EUR-Lex, with the identifier of the chapter, article, etc.
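
As an illustration of the ID scheme, the sketch below builds a composite identifier from a CELEX number and a local number. The function name `make_node_id`, the `kind` prefixes, and the underscore separator are assumptions made for this example, not the project's actual format; the two CELEX numbers are real but are used here only as sample input.

```python
def make_node_id(celex: str, kind: str, number: str) -> str:
    """Combine an act's CELEX number with a local identifier.

    The separator and the `kind` prefix are hypothetical choices for this sketch.
    """
    if not celex or not number:
        raise ValueError("both celex and number are required")
    return f"{celex}_{kind}_{number}"


# "Article 5" exists in many acts; the CELEX prefix keeps the nodes apart.
first_art5 = make_node_id("32016R0679", "art", "5")
second_art5 = make_node_id("32022R2065", "art", "5")
print(first_art5)   # 32016R0679_art_5
print(second_art5)  # 32022R2065_art_5
```

Because the CELEX number is part of the key, loading a second act never overwrites nodes created for the first one.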

Act

id: String         #CELEX
title: String
eurlex_url: String

Recitals

id: String         #CELEX+num_recital
number: String     #Recital number -> 1, 2
text: String       #actual content of the recital

Chapter

id: String         #CELEX+num_chapter
number: String     #Roman numerals -> I, II, III
title: String

Section

id: String         #CELEX+num_section
title: String

Article

id: String        #CELEX+num_article
title: String
text: String

Paragraph

id: String         #CELEX+num_paragraph
text: String

Case Law (WIP)

id: String           #case law identifier

Document Parsing and Data Retrieval

All data used to populate the KG is retrieved from EUR-Lex. Each act is parsed from its English HTML version into a structured data object, and the act's document information section is parsed to find the associated case law.

Note: the parser ignores ongoing cases and considers only completed case law, i.e. relationships of the form "XXX interprets YYY".

Architecture

Tech Stack

| Layer | Technology | Role |
| --- | --- | --- |
| Language | Python | Entire codebase |
| Graph DB | Neo4j (Docker) | Stores nodes and vector embeddings |
| LLM orchestration | LangChain + `langchain-neo4j` | RetrievalQA chain, Neo4j vector store integration |
| LLM / Embeddings | OpenAI API (`langchain-openai`) | Answer generation and paragraph embeddings |
| Semantic re-ranking | `sentence-transformers` (`all-MiniLM-L6-v2`) | Topic similarity filtering in GraphEnrichedRetriever |
| NLP | NLTK | Tokenization and lemmatization in the ASKE pipeline |
| HTML parsing | BeautifulSoup4 | Parsing EUR-Lex legal documents |
| Web scraping | Playwright (Chromium) | Headless browser downloads bypassing AWS WAF |
| Config | `python-dotenv` | Loads secrets from `.env` |

Pipeline Overview

The system operates in three sequential phases, each with its own entry point:

graph_init.py          →   aske_pipeline.py         →   rag_pipeline.py
Phase 1: Graph Init        Phase 2: Topic Extraction     Phase 3: RAG Query

Step 1: Graph Initialization (`graph_init.py`)

Downloads the four legal HTML documents from EUR-Lex, parses them into a structured knowledge graph, and generates paragraph-level vector embeddings.

1. BrowserFetcher launches a headless Chromium browser, navigates to each EUR-Lex URL, and saves the rendered HTML to docs/. This bypasses the AWS WAF JavaScript challenge that blocks plain HTTP requests
2. EURLexHTMLParser (BeautifulSoup) parses each HTML file into a structured hierarchy: Act → Chapter → Section → Article → Paragraph
3. MetadataParser fetches the document metadata page for each act and extracts case-law "Interpreted by" relationships
4. GraphLoader writes all nodes and edges to Neo4j using parameterized Cypher queries
5. Neo4jGraph.generate_text_embeddings encodes every Paragraph node with all-MiniLM-L6-v2 and stores the vector on the node
6. Neo4jGraph.create_vector_index creates a COSINE similarity vector index on Paragraph.textEmbedding for fast nearest-neighbour lookup
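
To illustrate what "parameterized Cypher" means for the loading step, here is a minimal sketch that prepares a query and its parameters without touching a database. The query text, the `CONTAINS` relationship type, the function name `paragraph_statement`, and the sample IDs are all assumptions for this example; the project's real GraphLoader may be structured differently.

```python
from typing import Any

# Hypothetical query. MERGE makes the load idempotent: re-running it
# matches existing nodes instead of creating duplicates.
MERGE_PARAGRAPH = """
MERGE (a:Article {id: $article_id})
MERGE (p:Paragraph {id: $paragraph_id})
SET p.text = $text
MERGE (a)-[:CONTAINS]->(p)
"""


def paragraph_statement(
    article_id: str, paragraph_id: str, text: str
) -> tuple[str, dict[str, Any]]:
    """Return a (query, parameters) pair; values are never spliced into the query."""
    params = {"article_id": article_id, "paragraph_id": paragraph_id, "text": text}
    return MERGE_PARAGRAPH, params


query, params = paragraph_statement(
    "32016R0679_art_5", "32016R0679_art_5_par_1", "Personal data shall be: ..."
)
print(sorted(params))  # ['article_id', 'paragraph_id', 'text']
```

With the official `neo4j` Python driver, such a pair could be executed as `session.run(query, params)` inside an open session. Keeping the text out of the query string avoids quoting problems with legal text that contains apostrophes or quotation marks.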

Step 2: Topic Extraction (`aske_pipeline.py`)

Runs the ASKE (Automatic Semantic Knowledge Extraction) algorithm over all paragraph nodes, iteratively expanding a set of legal concepts and linking the most relevant topics back to each paragraph.

1. All Paragraph nodes are fetched from Neo4j
2. TextPreprocessor splits each paragraph into sentences, lemmatizes each word, and produces sentence-level chunks (the first sentence is skipped because it usually contains only the paragraph number)
3. ASKETopicExtractor.run_aske_cycle runs for N_GENERATIONS iterations; each generation has four phases:
   - Chunk Classification: cosine similarity is computed between chunk and concept embeddings; chunks above threshold α are assigned to that concept
   - Deactivate Unused: concepts that received zero classifications are marked inactive and excluded from further enrichment
   - Terminology Enrichment: for each active concept, candidate terms are extracted from classified chunks using TF-IDF with bigrams; WordNet definitions are fetched and embedded; terms are scored with a discriminative penalty (sim_to_concept − 0.5 × max_sim_to_others) to down-rank generic terms; the top γ terms above threshold β are added to the concept
   - Concept Derivation: terms within each concept are clustered with Affinity Propagation; each distinct cluster spawns a new concept labelled by its centroid term
4. ASKETopicExtractor.aggregate_topics_by_paragraph selects the top-3 topics per paragraph based on the maximum chunk-level similarity score
5. Neo4jGraph.update_paragraph_topics writes Topic nodes and (Paragraph)-[:RELATED_TO]->(Topic) edges to Neo4j
6. A human-readable report listing every active concept with its associated terms is written to results/aske_report.txt
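
The discriminative penalty used in Terminology Enrichment can be sketched with toy 2-D vectors. The function names, the toy embeddings, and the choice to apply β to the penalized score (rather than to the raw similarity) are assumptions for this example; only the formula sim_to_concept − 0.5 × max_sim_to_others and the β/γ cut-offs come from the description above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_term(term_vec, concept_vec, other_concept_vecs, penalty=0.5):
    """sim_to_concept - penalty * max_sim_to_others."""
    sim_to_concept = cosine(term_vec, concept_vec)
    max_sim_to_others = max(
        (cosine(term_vec, other) for other in other_concept_vecs), default=0.0
    )
    return sim_to_concept - penalty * max_sim_to_others


def select_terms(candidates, concept_vec, other_concept_vecs, beta=0.4, gamma=10):
    """Keep at most `gamma` terms whose penalized score exceeds `beta`, best first."""
    scored = [
        (term, score_term(vec, concept_vec, other_concept_vecs))
        for term, vec in candidates.items()
    ]
    kept = [pair for pair in scored if pair[1] > beta]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:gamma]


concept = [1.0, 0.0]
others = [[0.0, 1.0]]
candidates = {
    "data": [1.0, 1.0],        # equally close to both concepts, i.e. generic
    "controller": [1.0, 0.0],
    "processor": [0.9, 0.1],
}
selected = select_terms(candidates, concept, others)
print([term for term, _ in selected])  # ['controller', 'processor']
```

The term "data" has a raw similarity of about 0.71 to the concept, which would pass β = 0.4 on its own, but the penalty for being equally close to another concept lowers its score to about 0.35, so it is rejected.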

Tunable parameters (top of aske_pipeline.py):

| Parameter | Default | Meaning |
| --- | --- | --- |
| `N_GENERATIONS` | 20 | ASKE iterations |
| `ALPHA` | 0.3 | Chunk-classification similarity threshold |
| `BETA` | 0.4 | Terminology-enrichment acceptance threshold |
| `GAMMA` | 10 | Max new terms added per concept per generation |
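
To make the role of `ALPHA` concrete, the following sketch classifies chunks against concepts and then keeps the top-3 topics per paragraph by maximum chunk-level similarity. It uses toy 2-D vectors; the function names, paragraph IDs, and concept names are hypothetical, and the real pipeline works on sentence-transformer embeddings.

```python
import math
from collections import defaultdict

ALPHA = 0.3  # chunk-classification threshold (default listed above)
TOP_K = 3    # topics kept per paragraph


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_chunks(chunks, concepts, alpha=ALPHA):
    """Yield (paragraph_id, concept, similarity) for every pair above alpha."""
    for paragraph_id, chunk_vec in chunks:
        for name, concept_vec in concepts.items():
            sim = cosine(chunk_vec, concept_vec)
            if sim > alpha:
                yield paragraph_id, name, sim


def top_topics(assignments, k=TOP_K):
    """Per paragraph, rank concepts by their best chunk similarity and keep k."""
    best = defaultdict(dict)
    for paragraph_id, name, sim in assignments:
        best[paragraph_id][name] = max(sim, best[paragraph_id].get(name, 0.0))
    return {
        pid: sorted(scores, key=scores.get, reverse=True)[:k]
        for pid, scores in best.items()
    }


concepts = {"consent": [1.0, 0.0], "transfer": [0.0, 1.0], "security": [1.0, 1.0]}
chunks = [("p1", [1.0, 0.1]), ("p1", [0.2, 1.0]), ("p2", [0.0, 1.0])]
topics = top_topics(classify_chunks(chunks, concepts))
print(topics)
# {'p1': ['consent', 'transfer', 'security'], 'p2': ['transfer', 'security']}
```

Raising `ALPHA` makes classification stricter, so fewer chunks reach each concept and more concepts may be deactivated for lack of classifications.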

Step 3: RAG Query (`rag_pipeline.py`)

Answers user questions by combining topic-aware filtering with vector search, re-ranking results, and passing the top-k paragraphs as grounded context to an OpenAI LLM.

1. The user query is encoded with all-MiniLM-L6-v2 and compared against all Topic node embeddings (cosine similarity threshold: 0.35); the top-5 matching topics are selected
2. GraphEnrichedRetriever fetches all Paragraph nodes linked to those topics via (Paragraph)-[:RELATED_TO]->(Topic)
3. In parallel, a Neo4j COSINE vector search retrieves the Paragraph nodes nearest to the query embedding
4. Both result sets are merged and deduplicated
5. All candidate paragraphs are re-ranked by cosine similarity to the query; the top-5 are returned as context
6. A LangChain RetrievalQA chain passes the context with ANSWER_SYNTHESIS_PROMPT_v2 to an OpenAI LLM (temperature=0); the prompt enforces strict article citation accuracy
7. The answer is written to results/prova.txt
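
The merge, deduplicate, and re-rank steps can be sketched as follows. This is a simplified stand-in with toy 2-D vectors, not the GraphEnrichedRetriever itself; the function name `merge_and_rerank` and the `(paragraph_id, embedding)` tuple shape are assumptions for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_and_rerank(query_vec, topic_hits, vector_hits, top_k=5):
    """Merge two lists of (paragraph_id, embedding), drop duplicates by ID,
    and return the IDs of the top_k paragraphs closest to the query."""
    merged = {}
    for paragraph_id, vec in [*topic_hits, *vector_hits]:
        merged.setdefault(paragraph_id, vec)  # first occurrence wins
    ranked = sorted(
        merged, key=lambda pid: cosine(query_vec, merged[pid]), reverse=True
    )
    return ranked[:top_k]


query = [1.0, 0.0]
topic_hits = [("p1", [1.0, 0.0]), ("p2", [0.5, 0.5])]
vector_hits = [("p2", [0.5, 0.5]), ("p3", [0.9, 0.1]), ("p4", [0.0, 1.0])]
print(merge_and_rerank(query, topic_hits, vector_hits, top_k=3))
# ['p1', 'p3', 'p2']
```

Paragraph `p2` appears in both result sets but is ranked only once, and `p4` falls outside the top 3 because it is orthogonal to the query.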
