This project is an example of GraphRAG, providing a system for processing documents, extracting entities and relationships, and managing them in a graph database. It leverages OpenAI's GPT models for natural language processing tasks and Neo4j for graph database management.
- `app.py`: Main application script that initializes components and runs the document processing and querying workflow.
- `graph_manager.py`: Manages the graph database, including building and reprojecting the graph, calculating centrality measures, and managing graph operations.
- `query_handler.py`: Handles user queries by leveraging the graph data and OpenAI's GPT models for natural language processing.
- `document_processor.py`: Processes documents by splitting them into chunks, extracting entities and relationships, and summarizing them.
- `graph_database.py`: Manages the connection to the Neo4j graph database.
- `logger.py`: Provides a logging utility that logs messages to both console and file with configurable log levels.
- Clone the repository:

  ```sh
  git clone [email protected]:stephenc222/example-graphrag-with-neo4j.git
  cd example-graphrag-with-neo4j
  ```
- Install dependencies:

  ```sh
  pip install -r requirements.txt
  ```
- Set up environment variables: create a `.env` file in the root directory and add the following variables:

  ```sh
  OPENAI_API_KEY=your_openai_api_key
  DB_URL=your_neo4j_db_url
  DB_USERNAME=your_neo4j_username
  DB_PASSWORD=your_neo4j_password
  LOG_LEVEL=INFO  # Optional, default is INFO
  ```
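Once the `.env` file exists, the application presumably reads these values at startup. A minimal sketch of that pattern, using the `dotenv` dependency listed below to load the file (the `try`/`except` fallback just keeps the sketch runnable when the package is absent; the actual code in `app.py` may differ):

```python
import os

# Load .env if python-dotenv is available; plain environment variables work too.
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
DB_URL = os.getenv("DB_URL")
DB_USERNAME = os.getenv("DB_USERNAME")
DB_PASSWORD = os.getenv("DB_PASSWORD")
LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO")  # optional, defaults to INFO
```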
- Build the Neo4j Docker container:

  ```sh
  sh build.sh
  ```
- Start the Neo4j Docker container:

  ```sh
  sh start.sh
  ```
- Run the application:

  ```sh
  python app.py
  ```
- Initial Indexing: The application will first index the initial set of documents defined in `constants.py` as `DOCUMENTS`.
- Querying: After indexing, the application will handle a predefined query to extract themes from the documents. Centrality measures will also be calculated to enhance the query responses.
- Reindexing with New Documents: The application will then add the new documents defined in `constants.py` as `DOCUMENTS_TO_ADD_TO_INDEX` and reindex the graph.
- Second Query: After reindexing, the application will handle another predefined query to extract themes from the updated set of documents.
- Overview: `app.py` acts as the entry point of the application.
- Responsibilities:
- Initializes the components: logger, document processor, graph manager, and query handler.
- Handles the main workflow:
- Performs initial indexing of documents.
- Executes a user query.
- Reindexes the graph with new documents.
- Runs a second user query based on the updated graph.
- Uses the logging utility to track the workflow progress.
- Overview: `graph_manager.py` manages graph-related operations in the Neo4j database.
- Responsibilities:
- Builds the graph from document summaries.
- Reprojects the graph for community and centrality analysis.
- Performs calculations such as degree centrality, betweenness centrality, and closeness centrality.
- Supports reindexing with new documents and recalculating centrality measures.
- Overview: `query_handler.py` handles natural language queries.
- Responsibilities:
- Extracts answers from the graph using centrality measures.
- Uses OpenAI GPT models to provide concise answers based on graph data and centrality results.
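The source doesn't show how centrality results are folded into the GPT prompt, so here is a hedged, pure-Python sketch of one plausible shape; the function name `build_centrality_prompt` and the `{measure: [(entity, score), ...]}` format are illustrative assumptions, not the actual `query_handler.py` API:

```python
def build_centrality_prompt(question, centrality_results):
    """Fold graph centrality results into a prompt for a GPT model.

    `centrality_results` maps a measure name to (entity, score) pairs;
    the real format used by query_handler.py may differ.
    """
    lines = [f"Question: {question}", "", "Key entities by centrality:"]
    for measure, entities in centrality_results.items():
        ranked = ", ".join(f"{name} ({score:.2f})" for name, score in entities)
        lines.append(f"- {measure}: {ranked}")
    lines.append("")
    lines.append("Answer concisely, grounding the answer in these entities.")
    return "\n".join(lines)


prompt = build_centrality_prompt(
    "What are the main themes?",
    {"degree": [("Neo4j", 0.91), ("GraphRAG", 0.85)]},
)
```

The assembled `prompt` string would then be sent as the user message in a chat-completion call.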
- Overview: `document_processor.py` manages the extraction and summarization of entities and relationships from documents.
- Responsibilities:
- Splits documents into chunks.
- Extracts entities and relationships from the chunks using OpenAI GPT models.
- Summarizes the extracted entities and relationships for graph processing.
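The chunking step can be sketched as a simple overlapping splitter; the character-based sizes below are illustrative assumptions (`document_processor.py` may chunk by tokens or sentences instead):

```python
def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into overlapping character chunks.

    Overlap keeps entities that straddle a boundary visible in both
    neighboring chunks. Sizes here are illustrative defaults.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


chunks = split_into_chunks("a" * 1200, chunk_size=500, overlap=50)
```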
- Overview: `graph_database.py` manages the Neo4j database connection.
- Responsibilities:
- Provides utility functions to connect to the Neo4j database.
- Clears the database if necessary.
- Overview: `logger.py` provides a logging utility for the application.
- Responsibilities:
- Logs messages to both console and file.
- Supports a configurable log level via the `LOG_LEVEL` environment variable.
- Ensures logs are created in the correct format.
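A dual-handler logger like this can be built on the standard `logging` module; the handler and format choices below are a sketch of the described behavior, not necessarily what `logger.py` does line for line:

```python
import logging
import os
import sys


def get_logger(name, log_file="app.log"):
    """Create a logger that writes to both console and a file.

    The level comes from the LOG_LEVEL environment variable,
    defaulting to INFO as described above.
    """
    level = os.getenv("LOG_LEVEL", "INFO").upper()
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeat calls
        fmt = logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        for handler in (logging.StreamHandler(sys.stdout),
                        logging.FileHandler(log_file)):
            handler.setFormatter(fmt)
            logger.addHandler(handler)
    return logger


log = get_logger("example")
log.info("logger initialised")
```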
Centrality measures help identify the most important nodes (entities) in a graph based on their structural properties. These measures help in identifying key themes and influential concepts in the documents.
- Degree Centrality: Measures how many connections a node has. Nodes with high degree centrality are the most connected and can represent key topics or ideas in the document set.
- Betweenness Centrality: Identifies nodes that act as bridges between other nodes. Nodes with high betweenness centrality often represent concepts that connect different themes.
- Closeness Centrality: Measures how quickly a node can reach all other nodes. Entities with high closeness centrality are well connected to all other entities and can be key summarizers or connectors of information.
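To make degree centrality concrete, here is a small plain-Python illustration on a toy edge list. Note this is only for intuition: in this project the centrality computation happens inside Neo4j over the projected graph, not in application code.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for an undirected edge list.

    Each node's degree is divided by (n - 1), the maximum possible
    number of neighbors, giving scores in [0, 1].
    """
    degree = defaultdict(int)
    nodes = set()
    for a, b in edges:
        nodes.update((a, b))
        degree[a] += 1
        degree[b] += 1
    n = len(nodes)
    return {node: degree[node] / (n - 1) for node in nodes}


# "A" touches every other node, so it scores 1.0; "D" has one edge.
scores = degree_centrality([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])
```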
```python
# Example for calculating centrality
graph_manager = GraphManager(db_url, db_username, db_password)
graph_manager.calculate_centrality_measures()
```

- Initial Indexing: The system processes an initial set of documents, extracting entities and relationships and storing them in a Neo4j graph.
- Querying: A user query is handled by leveraging the centrality measures calculated from the graph, producing an intelligent answer via the OpenAI GPT model.
- Reindexing: When new documents are added, the system reindexes the graph, recalculates the centrality measures, and processes another user query.
```python
# Query the system after indexing
query = "What are the main themes in these documents?"
answer = query_handler.ask_question_with_centrality(query)
print(f"Answer: {answer}")
```

Each component has its own logger, ensuring that log messages provide insight into the progress of document processing, graph operations, and query handling.
The log level can be configured at runtime via the `LOG_LEVEL` environment variable.
- `openai`: For interacting with OpenAI's GPT models.
- `dotenv`: For loading environment variables from a `.env` file.
- `neo4j`: For interacting with the Neo4j graph database.
- `pickle`: For saving and loading processed data.
- `logging`: For tracking workflow progress across the application.
This project is licensed under the MIT License. See the LICENSE file for details.