improvement(kb): deferred content fetching and metadata-based hashes for connectors #4044
PR Summary
Medium Risk
Overview: Switches several connectors' …
Includes targeted fixes: adds Salesforce …
Reviewed by Cursor Bugbot for commit 44b23c2.
@cursor review
@greptile
Greptile Summary
This PR improves knowledge base connector performance by converting 9 connectors to a deferred content pattern (lightweight stubs during listing, full content fetched only for new/changed docs) and switching all 16 connectors from CPU-intensive SHA-256 content hashing to metadata-based hashes.
Confidence Score: 4/5
Safe to merge with one fix: the Outlook getDocument is missing the focusedOnly filter, which causes getDocument to be called on every sync for affected conversations and defeats the optimization.
15 of 16 connectors are cleanly implemented with correct deferred content and hash patterns. The Salesforce and Reddit bug fixes are correct. The one real issue is in the Outlook connector: the focusedOnly filter applied during listDocuments is not mirrored in getDocument, so the stub hash (based on focused-only messages) will permanently differ from the getDocument hash (based on all non-draft messages) for conversations that contain newer non-focused messages. As a result, getDocument is called on every sync run for those conversations, silently undoing the performance optimization for the subset of users on the default focusedOnly=true config.
File needing attention: apps/sim/connectors/outlook/outlook.ts. getDocument needs the focusedOnly inferenceClassification filter to keep hashes consistent with listDocuments stubs.
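The Outlook mismatch described above can be illustrated with a minimal sketch. It assumes the conversation hash is derived from the newest message timestamp, which the review implies but does not show; the `Message` shape and the helper names `selectMessages` and `conversationHash` are hypothetical and are not taken from outlook.ts. The point is only that listing and fetching should pass through the same filter before hashing.

```typescript
// Hypothetical message shape; field names mirror Microsoft Graph message properties.
interface Message {
  id: string
  lastModifiedDateTime: string // ISO 8601, UTC
  isDraft: boolean
  inferenceClassification: 'focused' | 'other'
}

// One shared predicate, so listDocuments and getDocument see the same message set.
function selectMessages(messages: Message[], focusedOnly: boolean): Message[] {
  return messages.filter(
    (m) => !m.isDraft && (!focusedOnly || m.inferenceClassification === 'focused')
  )
}

// Assumed hash scheme: provider, conversation id, newest modification time.
function conversationHash(conversationId: string, messages: Message[]): string {
  // ISO 8601 UTC strings in the same format sort chronologically as plain strings.
  const latest = messages.map((m) => m.lastModifiedDateTime).sort().pop() ?? ''
  return `outlook:${conversationId}:${latest}`
}
```

If the stub is hashed over `selectMessages(msgs, true)` while the full document is hashed over `selectMessages(msgs, false)`, any conversation with a newer non-focused message yields two different hashes, which is the divergence the review flags.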
| Filename | Overview |
|---|---|
| apps/sim/connectors/outlook/outlook.ts | Introduces deferred Outlook conversations — stubs filtered by focusedOnly, but getDocument fetches all non-draft messages, causing a permanent hash mismatch and redundant getDocument calls on every sync for conversations with newer non-focused messages. |
| apps/sim/connectors/jira/jira.ts | Clean split into issueToStub (no description/comment fields) and issueToFullDocument; listDocuments now requests only lightweight fields; getDocument fetches the full field set including description and comments. Metadata hash is correctly identical between stub and full doc. |
| apps/sim/connectors/salesforce/salesforce.ts | Fixes PublishStatus missing from OBJECT_FIELDS (causing all articles to return null); adds WHERE clause for PublishStatus='Online'; adds a getDocument guard to catch articles that go offline between list and get. Deferred stub pattern correctly implemented. |
| apps/sim/connectors/reddit/reddit.ts | Switches from volatile score/num_comments hash to stable created_utc; introduces deferred content so comments are only fetched for new/changed posts. Accepted tradeoff: comment-only changes won't re-trigger a sync. |
| apps/sim/connectors/google-docs/google-docs.ts | Deferred pattern cleanly implemented using Drive modifiedTime as hash. getDocument fetches Docs API content only for changed files and correctly returns null for trashed or non-Docs mimeType files. |
| apps/sim/connectors/google-sheets/google-sheets.ts | Deferred pattern using spreadsheet-level modifiedTime as hash; all sheets share the same hash so any edit triggers re-fetch of all tabs — an accepted limitation of Google Sheets API granularity. Implementation is correct. |
| apps/sim/connectors/zendesk/zendesk.ts | Articles remain inline (body already present in listing response), tickets are correctly deferred via ticketToStub. getDocument handles both prefixes and fetches ticket comments lazily. |
| apps/sim/connectors/intercom/intercom.ts | Articles remain inline, conversations become deferred. contentHash uses UNIX timestamp for stable change detection. getDocument correctly re-constructs the contentHash identically. |
| apps/sim/connectors/fireflies/fireflies.ts | Listing now omits heavy sentences/summary GraphQL fields; deferred getDocument fetches them lazily. Hash uses date + duration — accepted tradeoff per prior discussion. |
| apps/sim/connectors/asana/asana.ts | Switches to metadata hash using modified_at; content remains inline (no deferred pattern). Simple, correct change. |
| apps/sim/connectors/linear/linear.ts | Switches to updatedAt metadata hash; content remains inline since GraphQL response already contains all issue data. Simple, correct change. |
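As a rough illustration of the stub/full-document split the table describes for Jira, here is a minimal sketch. The names `issueToStub`, `issueToFullDocument`, `contentDeferred` and `contentHash` appear in the review; the `IssueMeta` and `ExternalDocument` field layouts and the `metadataHash` helper are assumptions made for the example and may not match the real connector types.

```typescript
// Assumed document shape; only contentDeferred and contentHash are named in the review.
interface ExternalDocument {
  externalId: string
  title: string
  content: string
  contentDeferred: boolean
  contentHash: string
}

// Hypothetical lightweight fields returned by the listing call.
interface IssueMeta {
  id: string
  title: string
  updated: string
}

// Hypothetical helper: hash built from metadata only, never from the body.
function metadataHash(provider: string, id: string, modified: string): string {
  return `${provider}:${id}:${modified}`
}

// Listing phase: no description or comments, content is deferred.
function issueToStub(issue: IssueMeta): ExternalDocument {
  return {
    externalId: issue.id,
    title: issue.title,
    content: '',
    contentDeferred: true,
    contentHash: metadataHash('jira', issue.id, issue.updated),
  }
}

// Fetch phase: full body attached, but the hash is computed the same way.
function issueToFullDocument(issue: IssueMeta, body: string): ExternalDocument {
  return { ...issueToStub(issue), content: body, contentDeferred: false }
}
```

Because the full document reuses the stub's hash computation, the two cannot drift apart, which is the property the review checks for in each deferred connector.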
Sequence Diagram
sequenceDiagram
participant SE as Sync Engine
participant C as Connector
participant API as External API
participant KB as Knowledge Base
Note over SE,KB: listDocuments phase (all connectors)
SE->>C: listDocuments(cursor?)
C->>API: Fetch lightweight metadata only
API-->>C: id, title, modifiedTime (no body)
C-->>SE: [stubs] contentDeferred=true, contentHash=provider:id:modifiedTime
Note over SE,KB: Change detection
SE->>KB: Compare stub.contentHash vs stored hash
alt Hash unchanged (doc not modified)
SE->>KB: Keep existing content, skip getDocument
else Hash changed or new doc
SE->>C: getDocument(externalId)
C->>API: Fetch full content (body, comments, etc.)
API-->>C: Full document data
C-->>SE: ExternalDocument, contentDeferred=false, same contentHash
SE->>KB: Upsert content + metadata
end
Note over SE,KB: Connectors still inline (no deferral): Asana, Linear, HubSpot, Webflow, WordPress, ServiceNow, Google Calendar
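The change-detection flow in the diagram can be sketched roughly as follows. This is a simplified, synchronous illustration and not the actual sync engine: the `Connector` interface, the `syncOnce` function, and the in-memory `store` are hypothetical stand-ins, and a real connector would presumably be asynchronous and paginated.

```typescript
interface Stub {
  externalId: string
  contentHash: string
  contentDeferred: boolean
}

interface FullDoc extends Stub {
  content: string
}

// Hypothetical, synchronous connector surface for illustration only.
interface Connector {
  listDocuments(): Stub[]
  getDocument(externalId: string): FullDoc | null
}

// One sync pass: fetch full content only for new or changed documents.
// Returns the ids whose content was actually fetched.
function syncOnce(connector: Connector, store: Record<string, FullDoc>): string[] {
  const fetched: string[] = []
  for (const stub of connector.listDocuments()) {
    const existing: FullDoc | undefined = store[stub.externalId]
    // Hash unchanged: keep existing content and skip getDocument.
    if (existing && existing.contentHash === stub.contentHash) continue
    const full = connector.getDocument(stub.externalId)
    // The document may have disappeared between list and get.
    if (full === null) continue
    store[full.externalId] = full
    fetched.push(full.externalId)
  }
  return fetched
}
```

In this sketch a second pass over unchanged data fetches nothing, so the cost of a steady-state sync is one listing call plus string comparisons. It also shows why a stub hash that never matches the stored hash forces a fetch on every pass.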
Reviews (3): Last reviewed commit: "fix(kb): add missing connector sync cron..."
… prevent hash divergence
✅ Bugbot reviewed your changes and found no new issues!
Reviewed by Cursor Bugbot for commit b5e33dd.
The connector sync endpoint existed but had no cron job configured to trigger it, meaning scheduled syncs would never fire.
Co-Authored-By: Claude Opus 4.6
✅ Bugbot reviewed your changes and found no new issues!
Reviewed by Cursor Bugbot for commit 44b23c2.
Summary
- Deferred content: `listDocuments` returns lightweight stubs; content is only fetched via `getDocument` for new/changed docs
- Metadata-based `contentHash` (e.g. `provider:id:modifiedTime`): eliminates CPU-intensive hashing and enables change detection without fetching content
- Salesforce fix: `PublishStatus` was missing from the KnowledgeArticleVersion field list (was causing all articles to return null)
- Reddit fix: volatile hash fields (`score`, `num_comments`) were causing unnecessary re-syncs every run
Type of Change
Testing
Tested manually
Checklist