The knowledge graph is modelled on the EUR-Lex document structure. In particular, the following assumptions were made:
- Each act is composed of one or more chapters
- Each chapter is composed of one or more articles
- In some cases, a chapter is divided into sections rather than directly into articles (Chapter -> Section -> Article)
- Articles may cite other articles or paragraphs
- An article is typically composed of a series of paragraphs
- Case law interprets articles; one or more case-law decisions may refer to a specific article or act
To keep nodes such as articles and chapters unique across different documents, each node ID is a string that combines the act's unique identifier (CELEX) provided by EUR-Lex with the identifier of the chapter, article, etc.
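This ID scheme can be sketched as a small helper (the separator and exact format are assumptions; the source only states that the two identifiers are combined into one string):

```python
def make_node_id(celex: str, element_id: str) -> str:
    """Combine an act's CELEX identifier with an element identifier.

    Hypothetical format: the project only specifies that the CELEX
    code and the element identifier are joined into a single string.
    """
    return f"{celex}_{element_id}"

# e.g. Article 5 of an act with CELEX 32016R0679
print(make_node_id("32016R0679", "art_5"))  # -> 32016R0679_art_5
```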
**Act**
- id: String
- title: String
- eurlex_url: String

**Recital**
- id: String  # CELEX + num_recital
- number: String  # recital number: 1, 2, ...
- text: String  # actual content of the recital

**Chapter**
- id: String  # CELEX + num_chapter
- number: String  # Roman numerals: I, II, III
- title: String

**Section**
- id: String  # CELEX + num_section
- title: String

**Article**
- id: String  # CELEX + num_article
- title: String
- text: String

**Paragraph**
- id: String  # CELEX + num_paragraph
- text: String

**CaseLaw**
- id: String  # case-law identifier
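The hierarchy of node schemas above can be sketched as Python dataclasses (a minimal illustration; the actual structured data object produced by the parser is not shown in this document, so field and class names are assumptions based on the schema):

```python
from dataclasses import dataclass, field

@dataclass
class Paragraph:
    id: str          # CELEX + num_paragraph
    text: str

@dataclass
class Article:
    id: str          # CELEX + num_article
    title: str
    text: str
    paragraphs: list[Paragraph] = field(default_factory=list)

@dataclass
class Chapter:
    id: str          # CELEX + num_chapter
    number: str      # Roman numeral, e.g. "I"
    title: str
    articles: list[Article] = field(default_factory=list)

@dataclass
class Act:
    id: str          # CELEX identifier
    title: str
    eurlex_url: str
    chapters: list[Chapter] = field(default_factory=list)
```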
All the data used to populate the KG is retrieved from EUR-Lex documents. Specific acts are parsed from the English HTML format into a structured data object, and the document information section is parsed to match the associated case law.
Note: the parser ignores ongoing case law; only completed case law is considered (XXX interprets YYY).
| Layer | Technology | Role |
|---|---|---|
| Language | Python | Entire codebase |
| Graph DB | Neo4j (Docker) | Stores nodes and vector embeddings |
| LLM orchestration | LangChain + langchain-neo4j | RetrievalQA chain, Neo4j vector store integration |
| LLM / Embeddings | OpenAI API (langchain-openai) | Answer generation and paragraph embeddings |
| Semantic re-ranking | sentence-transformers (all-MiniLM-L6-v2) | Topic similarity filtering in GraphEnrichedRetriever |
| NLP | NLTK | Tokenization and lemmatization in the ASKE pipeline |
| HTML parsing | BeautifulSoup4 | Parsing EUR-Lex legal documents |
| Web scraping | Playwright (Chromium) | Headless browser downloads bypassing AWS WAF |
| Config | python-dotenv | Loads secrets from .env |
The system operates in three sequential phases, each with its own entry point:
`graph_init.py` (Phase 1: Graph Init) → `aske_pipeline.py` (Phase 2: Topic Extraction) → `rag_pipeline.py` (Phase 3: RAG Query)
Downloads the four legal HTML documents from EUR-Lex, parses them into a structured knowledge graph, and generates paragraph-level vector embeddings.
- `BrowserFetcher` launches a headless Chromium browser, navigates each EUR-Lex URL, and saves the rendered HTML to `docs/`; this bypasses the AWS WAF JavaScript challenge that blocks plain HTTP requests
- `EURLexHTMLParser` (BeautifulSoup) parses each HTML file into a structured hierarchy: Act → Chapter → Section → Article → Paragraph
- `MetadataParser` fetches the document metadata page for each act and extracts case-law "Interpreted by" relationships
- `GraphLoader` writes all nodes and edges to Neo4j using parameterized Cypher queries
- `Neo4jGraph.generate_text_embeddings` encodes every Paragraph node with `all-MiniLM-L6-v2` and stores the vector on the node
- `Neo4jGraph.create_vector_index` creates a COSINE similarity vector index on `Paragraph.textEmbedding` for fast nearest-neighbour lookup
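As an illustration of the parameterized Cypher writes performed by `GraphLoader`, the sketch below builds a MERGE statement for one Paragraph node and its edge to the parent Article. The relationship type `HAS_PARAGRAPH` and the exact query text are assumptions; the document only states that writes use parameterized Cypher:

```python
def paragraph_merge_query(article_id: str, paragraph_id: str, text: str):
    """Build a parameterized Cypher MERGE for one Paragraph node.

    Hypothetical sketch of a GraphLoader query; property names follow
    the node schemas above, the HAS_PARAGRAPH edge label is assumed.
    """
    query = (
        "MERGE (a:Article {id: $article_id}) "
        "MERGE (p:Paragraph {id: $paragraph_id}) "
        "SET p.text = $text "
        "MERGE (a)-[:HAS_PARAGRAPH]->(p)"
    )
    params = {"article_id": article_id,
              "paragraph_id": paragraph_id,
              "text": text}
    return query, params

# In the real pipeline a query/params pair like this would be executed
# with the Neo4j driver, e.g. session.run(query, params).
```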
Runs the ASKE (Automatic Semantic Knowledge Extraction) algorithm over all paragraph nodes, iteratively expanding a set of legal concepts and linking the most relevant topics back to each paragraph.
- All Paragraph nodes are fetched from Neo4j
- `TextPreprocessor` tokenizes paragraphs into sentences, lemmatizes each word, and produces sentence-level chunks (the first sentence is skipped, as it usually contains only the paragraph number)
- `ASKETopicExtractor.run_aske_cycle` runs for `N_GENERATIONS` iterations; each generation has four phases:
  - Chunk Classification: cosine similarity between chunk and concept embeddings; chunks above threshold `α` are assigned to that concept
  - Deactivate Unused: concepts that received zero classifications are marked inactive and excluded from further enrichment
  - Terminology Enrichment: for each active concept, candidate terms are extracted from classified chunks using TF-IDF with bigrams; WordNet definitions are fetched and embedded; terms are scored with a discriminative penalty (`sim_to_concept − 0.5 × max_sim_to_others`) to down-rank generic terms; the top `γ` terms above threshold `β` are added to the concept
  - Concept Derivation: terms within each concept are clustered with Affinity Propagation; each distinct cluster spawns a new concept labelled by its centroid term
- `ASKETopicExtractor.aggregate_topics_by_paragraph` selects the top-3 topics per paragraph based on maximum chunk-level similarity score
- `Neo4jGraph.update_paragraph_topics` writes Topic nodes and `(Paragraph)-[:RELATED_TO]->(Topic)` edges to Neo4j
- A human-readable report listing every active concept and its associated terms is written to `results/aske_report.txt`
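The chunk-classification and discriminative term-scoring steps can be sketched with plain cosine similarity (a minimal illustration with toy vectors; the real pipeline scores `all-MiniLM-L6-v2` embeddings, and the function names here are hypothetical):

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

ALPHA = 0.3  # chunk-classification threshold (default from the table below)

def classify_chunks(chunk_vecs, concept_vecs):
    """Assign each chunk index to every concept whose similarity exceeds ALPHA."""
    assignments = {name: [] for name in concept_vecs}
    for i, ch in enumerate(chunk_vecs):
        for name, cv in concept_vecs.items():
            if cosine(ch, cv) > ALPHA:
                assignments[name].append(i)
    return assignments

def term_score(term_vec, concept_vec, other_concept_vecs):
    """Discriminative penalty: sim_to_concept - 0.5 * max_sim_to_others."""
    sim = cosine(term_vec, concept_vec)
    max_other = max((cosine(term_vec, o) for o in other_concept_vecs), default=0.0)
    return sim - 0.5 * max_other
```

A term identical to its concept but orthogonal to all others scores 1.0, while a generic term close to several concepts is pushed down by the penalty.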
Tunable parameters (top of aske_pipeline.py):
| Parameter | Default | Meaning |
|---|---|---|
| `N_GENERATIONS` | 20 | ASKE iterations |
| `ALPHA` | 0.3 | Chunk-classification similarity threshold |
| `BETA` | 0.4 | Terminology-enrichment acceptance threshold |
| `GAMMA` | 10 | Max new terms added per concept per generation |
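In code form, the tunables at the top of `aske_pipeline.py` would look roughly like this (names and defaults are taken from the table; the surrounding module is not shown in this document):

```python
# Tunable ASKE parameters (defaults from the table above)
N_GENERATIONS = 20  # ASKE iterations
ALPHA = 0.3         # chunk-classification similarity threshold
BETA = 0.4          # terminology-enrichment acceptance threshold
GAMMA = 10          # max new terms added per concept per generation
```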
Answers user questions by combining topic-aware filtering with vector search, re-ranking results, and passing the top-k paragraphs as grounded context to an OpenAI LLM.
- The user query is encoded with `all-MiniLM-L6-v2` and compared against all Topic node embeddings (cosine similarity threshold: 0.35); the top-5 matching topics are selected
- `GraphEnrichedRetriever` fetches all Paragraph nodes linked to those topics via `(Paragraph)-[:RELATED_TO]->(Topic)`
- In parallel, a Neo4j COSINE vector search retrieves the nearest Paragraph nodes to the query embedding
- Both result sets are deduplicated and merged
- All candidate paragraphs are re-ranked by cosine similarity to the query; the top-5 are returned as context
- A LangChain `RetrievalQA` chain passes the context with `ANSWER_SYNTHESIS_PROMPT_v2` to an OpenAI LLM (temperature=0); the prompt enforces strict article citation accuracy
- The answer is written to `results/prova.txt`
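The merge, deduplication, and re-ranking steps can be sketched as follows (toy vectors stand in for the `all-MiniLM-L6-v2` embeddings, and the `(id, embedding)` candidate records are a hypothetical representation):

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def merge_and_rerank(topic_hits, vector_hits, query_vec, top_k=5):
    """Deduplicate the topic-based and vector-based candidate sets by
    paragraph id, re-rank by cosine similarity to the query, keep top-k."""
    merged = {}
    for pid, emb in topic_hits + vector_hits:
        merged.setdefault(pid, emb)  # first occurrence wins
    ranked = sorted(merged.items(),
                    key=lambda item: cosine(item[1], query_vec),
                    reverse=True)
    return [pid for pid, _ in ranked[:top_k]]
```

With `top_k=5` this mirrors the top-5 context selection described above; in the real pipeline the surviving paragraphs are then handed to the `RetrievalQA` chain as grounded context.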
