Skip to content

feat: spill large inputs to disk to prevent IPC frame overflow#2804

Merged
michaeldwan merged 5 commits intomainfrom
md/io-spillover
Mar 6, 2026
Merged

feat: spill large inputs to disk to prevent IPC frame overflow#2804
michaeldwan merged 5 commits intomainfrom
md/io-spillover

Conversation

@michaeldwan
Copy link
Copy Markdown
Member

Summary

  • When an input payload exceeds 6 MiB, the parent spills it to a file under /tmp/coglet/predictions/{id}/inputs/ and sends only the file path over the IPC slot socket. The worker reads the file back, parses it, and deletes it before running the prediction.
  • This mirrors the existing output-spill mechanism and prevents the 8 MiB LengthDelimitedCodec frame limit from being exceeded by large inputs.
  • Unified MAX_INLINE_IPC_SIZE constant (previously separate input/output constants that could drift).

Changes

Rust (crates/coglet/)

  • protocol.rs: SlotRequest::Predict now has optional input/input_file fields with skip_serializing_if. Added prediction_id() accessor and rehydrate_input() method that reads from inline value or spill file. Spill file is cleaned up before JSON parsing (so corrupt files don't leak). 4 unit tests + snapshot.
  • service.rs: New build_slot_request() serializes input, checks against threshold, spills to disk if needed. Prediction dir layout changed from /tmp/coglet/outputs/{id} to /tmp/coglet/predictions/{id}/{inputs,outputs}. 3 unit tests.
  • worker.rs: Uses rehydrate_input() with proper failure handling — sends SlotResponse::Failed back to parent instead of silently dropping the prediction. Removed duplicate MAX_INLINE_OUTPUT_SIZE in favor of shared MAX_INLINE_IPC_SIZE.
  • codec.rs: Updated test fixture for new field layout.
  • Cargo.toml: Added tempfile to dev-dependencies.

Integration tests (integration-tests/)

  • harness.go: cmdCurl now supports @file syntax to POST large bodies from files.
  • coglet_large_input.txtar: New test — generates 7 MiB JSON payload, POSTs via @file, verifies prediction succeeds via webhook.

Test plan

  • mise run test:rust — 165/165 pass
  • mise run fmt:rust:fix — clean
  • mise run lint:rust — clean (clippy)
  • coglet_large_input integration test — PASS
  • coglet_large_output integration test — PASS (regression guard for dir layout change)
  • string_predictor integration test — PASS (inline input path still works)

When an input payload exceeds 6 MiB (MAX_INLINE_IPC_SIZE), the parent
serializes it to a file under /tmp/coglet/predictions/{id}/inputs/ and
sends only the file path over the slot socket. The worker reads the file
back, parses it, and deletes it before running the prediction.

This mirrors the existing output-spill mechanism and prevents the 8 MiB
LengthDelimitedCodec frame limit from being exceeded by large inputs.

Key changes:
- protocol.rs: SlotRequest::Predict now has optional input/input_file
  fields, prediction_id() accessor, and rehydrate_input() method
- service.rs: build_slot_request() checks size and spills to disk;
  prediction dir layout changed to /tmp/coglet/predictions/{id}/
- worker.rs: uses rehydrate_input() with proper failure handling that
  sends SlotResponse::Failed instead of silently dropping predictions
- Unified MAX_INLINE_IPC_SIZE constant (was separate input/output consts)
- Integration test with 7 MiB payload via new harness @file curl syntax
- Axum's default 2 MiB body limit rejected large input payloads with
  HTTP 413 before they could reach the spill-to-disk logic. Set
  DefaultBodyLimit to 100 MiB on the router.
- Fix pre-existing rustfmt comment alignment in truncate_worker_log tests
  that CI's rustfmt version caught.
@michaeldwan michaeldwan marked this pull request as ready for review March 5, 2026 15:06
@michaeldwan michaeldwan requested a review from a team as a code owner March 5, 2026 15:06
@michaeldwan michaeldwan requested a review from markphelps March 5, 2026 15:06
markphelps
markphelps previously approved these changes Mar 5, 2026
Copy link
Copy Markdown
Contributor

@markphelps markphelps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one nit

Comment thread integration-tests/harness/harness.go Outdated
good find

Co-authored-by: Mark Phelps <[email protected]>
Signed-off-by: Michael Dwan <[email protected]>
Comment thread integration-tests/harness/harness.go Fixed
Comment thread integration-tests/harness/harness.go Outdated
markphelps and others added 2 commits March 5, 2026 09:54
…ariable'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Signed-off-by: Mark Phelps <[email protected]>
Copy link
Copy Markdown
Contributor

@tempusfrangit tempusfrangit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to turn off notifications for this repo :P. mark approved, approving to unblock.

@michaeldwan michaeldwan merged commit 8c44a7a into main Mar 6, 2026
33 checks passed
@michaeldwan michaeldwan deleted the md/io-spillover branch March 6, 2026 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants