fix(connectors): contentDeferred pattern + validation fixes across all connectors#3793
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
PR SummaryMedium Risk Overview Improves sync execution safety and reliability. The sync engine now de-dupes Written by Cursor Bugbot for commit fcde037. Configure here. |
Greptile SummaryThis PR converts all 10 knowledge-base connectors from an eager content-download model to a lazy Confidence Score: 4/5PR is safe to merge; all previously flagged critical concerns are resolved and only minor P2 efficiency observations remain. All prior review concerns (Evernote retryOptions, listingCapped propagation, Confluence cloudId, deletion reconciliation) are correctly addressed. The contentDeferred pattern is implemented consistently. Score is 4 rather than 5 because the Gmail historyId concern could cause an invisible performance regression worth a quick confirmation. apps/sim/connectors/gmail/gmail.ts — verify historyId is present in threads.list responses, or the skip-getDocument optimisation never fires for Gmail. Important Files Changed
Sequence DiagramsequenceDiagram
participant SE as sync-engine
participant C as Connector
participant API as External API
participant DB as Database
participant Q as ProcessingQueue
SE->>C: listDocuments(cursor, syncContext)
C->>API: List metadata (no content download)
API-->>C: [{id, title, contentHash_stub}]
C-->>SE: ExternalDocumentList (contentDeferred: true)
SE->>SE: Build pendingOps (add/update via stub hash vs stored hash)
loop SYNC_BATCH_SIZE chunks
SE->>SE: Split into deferredOps + readyOps
par Hydrate deferred docs
SE->>C: getDocument(id, syncContext)
C->>API: Fetch full content for single doc
API-->>C: Full document
C-->>SE: ExternalDocument (contentDeferred: false)
end
SE->>SE: Post-hydration hash check (skip re-embed if hash unchanged)
SE->>DB: addDocument / updateDocument
DB-->>SE: DocumentData[]
SE->>Q: processDocumentsWithQueue(batchDocs)
end
SE->>SE: Deletion reconciliation (skip if listingCapped and not fullSync)
SE->>DB: hardDeleteDocuments(removedIds)
|
32a04dc to
4c19754
Compare
|
@greptile |
|
@cursor review |
…d fix validation issues All 10 connectors now use contentDeferred: true in listDocuments, returning lightweight metadata stubs instead of downloading content during listing. Content is fetched lazily via getDocument only for new/changed documents, preventing Trigger.dev task timeouts on large syncs. Connector-specific fixes from validation audit: - Google Drive: metadata-based contentHash, orderBy for deterministic pagination, precise maxFiles, byte-length size check with truncation warning - OneDrive: metadata-based contentHash, orderBy for deterministic pagination - SharePoint: metadata-based contentHash, byte-length size check - Dropbox: metadata-based contentHash using content_hash field - Notion: code/equation block extraction, empty page fallback to title, reduced CHILD_PAGE_CONCURRENCY to 5, syncContext parameter - Confluence: syncContext caching for cloudId, reduced label concurrency to 5 - Gmail: use joinTagArray for label tags - Obsidian: syncRunId-based stub hash for forced re-fetch, mtime-based hash in getDocument, .trim() on vaultUrl, lightweight validateConfig - Evernote: retryOptions threaded through apiFindNotesMetadata and apiGetNote - GitHub: added contentDeferred: false to getDocument, syncContext parameter Infrastructure: - sync-engine: added syncRunId to syncContext for Obsidian change detection - confluence/utils: replaced raw fetch with fetchWithRetry, added retryOptions - oauth: added supportsRefreshTokenRotation: false for Dropbox - Updated add-connector and validate-connector skills with contentDeferred docs Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
@greptile |
|
@cursor review |
…tions, UTF-8 safety
- Sync engine: merge metadata from getDocument during deferred hydration,
so Gmail/Obsidian/Confluence tags and metadata survive the stub→full transition
- Evernote: pass retryOptions {retries:3, backoff:500} from listDocuments and
getDocument callers into apiFindNotesMetadata and apiGetNote
- Google Drive + SharePoint: safe UTF-8 truncation that walks back to the last
complete character boundary instead of splitting multi-byte chars
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
@greptile |
|
@cursor review |
maxRetries/initialDelayMs instead of retries/backoff to match the RetryOptions interface from lib/knowledge/documents/utils. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…s after hydration - Merge title from getDocument during deferred hydration so Gmail documents get the email Subject header instead of the snippet text - After hydration, compare the hydrated contentHash against the stored DB hash — if they match, skip the update. This prevents Obsidian (and any connector with a force-refresh stub hash) from re-uploading and re-processing unchanged documents every sync Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
@cursor review |
|
@greptile |
…merge sourceUrl Three sync engine gaps identified during audit: 1. Duplicate externalId guard: if a connector returns the same externalId across pages (pagination overlap), skip the second occurrence to prevent unique constraint violations on add and double-uploads on update. 2. Deletion reconciliation: previously required explicit fullSync or syncMode='full', meaning docs deleted from the source accumulated in the KB forever. Now runs on all non-incremental syncs (which return ALL docs). Includes a safety threshold: if >50% of existing docs (and >5 docs) would be deleted, skip and warn — protects against partial listing failures. Explicit fullSync bypasses the threshold. 3. sourceUrl merge: hydration now picks up sourceUrl from getDocument, falling back to the stub's sourceUrl if getDocument doesn't set one. Co-Authored-By: Claude Opus 4.6 <[email protected]>
…e maxFiles guard
- Confluence: use `version?.number` directly (undefined) in metadata instead
of `?? ''` (empty string) to prevent Number('') = 0 passing NaN check in
mapTags. Hash still uses `?? ''` for string interpolation.
- Google Drive: add early return when previouslyFetched >= maxFiles to prevent
effectivePageSize <= 0 which violates the API's pageSize requirement (1-1000).
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
@cursor review |
|
@greptile |
…iation
- Confluence: fetchLabelsForPages now tries both /pages/{id}/labels and
/blogposts/{id}/labels, preventing label loss when getDocument hydrates
blogpost content (previously returned empty labels on 404).
- Sync engine: skip deletion reconciliation when listing was capped
(maxFiles/maxThreads). Connectors signal this via syncContext.listingCapped.
Prevents incorrect deletion of docs beyond the cap that still exist in source.
fullSync override still forces deletion for explicit cleanup.
- Google Drive & Gmail: set syncContext.listingCapped = true when cap is hit.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
@greptile |
|
@cursor review |
… caps OneDrive, Dropbox, SharePoint, Confluence (v2 + CQL), and Notion (3 listing functions) now set syncContext.listingCapped = true when their respective maxFiles/maxPages limit is hit. Without this, the sync engine's deletion reconciliation would run against an incomplete listing and incorrectly hard-delete documents that exist in the source but fell outside the cap window. Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
All review comments have been addressed and resolved:
All threads resolved. Retriggering reviews. |
|
@cursor review |
|
@greptile |
…ebooks All calls to apiListTags and apiListNotebooks in both listDocuments and getDocument now pass retryOptions for consistent retry protection across all Thrift RPC calls. Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
Fixed the Evernote retryOptions gap (fcde037) — all |
|
@cursor review |
|
@greptile |
|
Resolved the Confluence labels comment — label-only edits not being detected is a pre-existing limitation of Confluence's version-based change detection (labels don't increment |
|
@cursor review |
|
@greptile |
Summary
contentDeferred: truepattern —listDocumentsnow returns lightweight metadata stubs (no content download), and content is fetched lazily viagetDocumentonly for new/changed documents. This prevents task timeouts on large syncs (root cause of Google Drive, Fireflies failures).add-connectorandvalidate-connectorskills to document thecontentDeferredbest practice.Connectors converted to contentDeferred
gdrive:${id}:${modifiedTime}onedrive:${id}:${lastModifiedDateTime}sharepoint:${id}:${lastModifiedDateTime}dropbox:${id}:${content_hash}notion:${pageId}:${last_edited_time}confluence:${pageId}:${versionNumber}gmail:${threadId}:${historyId}obsidian:stub:${path}:${syncRunId}, doc:obsidian:${path}:${mtime}evernote:${guid}:${updated}git-sha:${sha}(pre-existing)Infrastructure changes
sync-engine.ts: addedsyncRunIdtosyncContextfor Obsidian change detectionconfluence/utils.ts: replaced rawfetch()withfetchWithRetry, addedretryOptionsparameteroauth.ts: addedsupportsRefreshTokenRotation: falsefor DropboxTest plan
lastSyncDocCountmatches actual KB document count after syncbun run lint— passes cleanbunx tsc --noEmit— no errors in changed files