Skip to content

Batch generate embeddings in add_nodes / add_edges#2201

Merged
fabubaker merged 13 commits intomasterfrom
feat/batch-create-embeddings
Jul 31, 2025
Merged

Batch generate embeddings in add_nodes / add_edges#2201
fabubaker merged 13 commits intomasterfrom
feat/batch-create-embeddings

Conversation

@fabubaker
Copy link
Contributor

@fabubaker fabubaker commented Jul 29, 2025

What changes were proposed in this pull request?

Currently, GqlMutableGraph.add_nodes / GqlMutableGraph.add_edges repeatedly calls the embedding function for each node/edge. This PR modifies it to use the function just once for an entire batch of nodes/edges.

Why are the changes needed?

Optimizes how embeddings are created, for eg, by using a batch call to OpenAI.

Does this PR introduce any user-facing change? If yes is this documented?

No.

How was this patch tested?

Unit tests were added to mutable_graph.rs.

Are there any further changes required?

No.

@fabubaker fabubaker requested a review from ricopinazo July 29, 2025 20:09
@fabubaker fabubaker self-assigned this Jul 29, 2025
@fabubaker fabubaker changed the title Batch generate embeddings in add_nodes Batch generate embeddings in add_nodes / add_edges Jul 30, 2025
let vectors = self.cache.get_embeddings(docs).await?;

for (id, vector) in ids.iter().zip(vectors) {
self.edge_db.insert_vector(*id, &vector)?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every time you call insert_vector this is going to update the index. We should probably change that function to accept a vector of embeddings instead, so we can write everything to arroy and only in the end update the index just once. The same applies to nodes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@fabubaker fabubaker force-pushed the feat/batch-create-embeddings branch from 7e268b1 to 0dcf33e Compare July 31, 2025 15:44
@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ fabubaker
❌ github-actions[bot]
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Collaborator

@ricopinazo ricopinazo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just left a small comment that you might want to have a look at, but it's not the end of the world

}

// Generate embeddings
let edges: Vec<_> = edges.into_iter().collect::<Result<Vec<_>, _>>()?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if might be posible to collect here just once

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first collect uses ? to return the first error it encounters in edges, so both collect are needed.
That said I did collapse both into one chain and got rid of the intermediate variable.

type GraphAlgorithmPlugin {
shortest_path(source: String!, targets: [String!]!, direction: String): [ShortestPathOutput!]!
pagerank(iterCount: Int!, threads: Int, tol: Float): [PagerankOutput!]!
shortest_path(source: String!, targets: [String!]!, direction: String): [ShortestPathOutput!]!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just letting a comment here to remember we should make sure whatever generates de GraphQL schema file should do it in a predictable way to avoid these kinds of conflicts

@fabubaker fabubaker merged commit 79dab9c into master Jul 31, 2025
5 of 6 checks passed
@fabubaker fabubaker deleted the feat/batch-create-embeddings branch July 31, 2025 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants