What we know:
- Users experience save failures
- User retries the save (immediately or later)
- Retry succeeds WITHOUT user changing data
- This happens intermittently (not every time)
What we DON'T know:
- What error message the user sees
- Whether the failure is at the storage layer or elsewhere
- What varies between the failing and succeeding requests
- Whether this is even related to the S3 errors we found in logs
We found these S3 errors in production logs:
- RequestHeaderSectionTooLarge - HTTP headers exceed the 8 KB limit
- KeyTooLongError - S3 object key exceeds the 1024-byte limit
CRITICAL: We have NOT confirmed these are the errors users experience intermittently. These could be:
- Unrelated errors from different operations
- One-time occurrences from specific edge cases
- Not the "intermittent save failure" issue at all
What we need from logging:
- Capture every error that reaches the user
- Know which errors are intermittent vs permanent
What we can't currently see:
- When a save fails, we don't track it
- When a save succeeds, we don't know if it's a retry
- No way to identify "same data, different outcome"
- Only logging failures means we can't see the pattern
What the logs must show:
- Fail → success sequences for the same StoryMap
- How often saves fail vs succeed
Key question:
- What changes between attempt 1 (fail) and attempt 2 (success)?
- Key length? Content size? Timing? Connection state?
Added logging to save operations:
[SAVE_ATTEMPT] save_json: key=... key_len=... content_size=...
[SAVE_SUCCESS] save_json: key=...
[SAVE_ATTEMPT] save_bytes_from_data: key=... key_len=... content_size=... type=...
[SAVE_SUCCESS] save_bytes_from_data: key=...
Purpose:
- See every save attempt (fail or succeed)
- Track key lengths and content sizes
- Identify patterns in successful vs failed saves
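A minimal sketch of what that instrumentation can look like, assuming the save helpers are plain Python functions wrapping boto3 put_object calls. The function names save_json and save_bytes_from_data come from the log lines above; the bucket name, client setup, and logger wiring here are placeholders, not the actual module layout:

import json
import logging

import boto3

logger = logging.getLogger(__name__)

# Bucket name and client construction are assumptions for illustration only.
s3 = boto3.client("s3")
BUCKET = "storymapjs-example-bucket"

def save_json(key, data):
    """Serialize data and upload it, logging every attempt and every success."""
    content = json.dumps(data).encode("utf-8")
    logger.info("[SAVE_ATTEMPT] save_json: key=%s key_len=%d content_size=%d",
                key, len(key), len(content))
    s3.put_object(Bucket=BUCKET, Key=key, Body=content,
                  ContentType="application/json")
    logger.info("[SAVE_SUCCESS] save_json: key=%s", key)

def save_bytes_from_data(key, content, content_type):
    """Upload raw bytes (e.g. an image), with the same attempt/success logging."""
    logger.info("[SAVE_ATTEMPT] save_bytes_from_data: key=%s key_len=%d content_size=%d type=%s",
                key, len(key), len(content), content_type)
    s3.put_object(Bucket=BUCKET, Key=key, Body=content, ContentType=content_type)
    logger.info("[SAVE_SUCCESS] save_bytes_from_data: key=%s", key)

Logging attempts as well as successes (not just failures) is the part that lets us line up retries later.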
Added to exception handler:
[STORAGE_ERROR] ClientError: {error_code} in {function}()
[STORAGE_ERROR] Key: {key}... (len={length})
[STORAGE_ERROR] Content size: {size}
[STORAGE_ERROR] Error response: {full_response}
Purpose:
- Capture EVERY error, not just specific codes
- See full S3 error response
- Understand error distribution (which errors are common?)
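A sketch of that handler, assuming a shared helper is called from every save function's except block. The helper name _log_storage_error is illustrative, not the actual code:

import logging

from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)

def _log_storage_error(err, function_name, key, content_size):
    """Log the full context of any boto3 ClientError, not just the codes we already suspect."""
    error_code = err.response.get("Error", {}).get("Code", "Unknown")
    logger.error("[STORAGE_ERROR] ClientError: %s in %s()", error_code, function_name)
    logger.error("[STORAGE_ERROR] Key: %s... (len=%d)", key[:80], len(key))
    logger.error("[STORAGE_ERROR] Content size: %d", content_size)
    logger.error("[STORAGE_ERROR] Error response: %r", err.response)

# Usage inside a save helper (see the save_json sketch above):
# try:
#     s3.put_object(Bucket=BUCKET, Key=key, Body=content)
# except ClientError as err:
#     _log_storage_error(err, "save_json", key, len(content))
#     raise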
For known problematic errors (KeyTooLongError, RequestHeaderSectionTooLarge):
- Break down key into components
- Show exact byte lengths
- Log boto3 configuration
Purpose: If these ARE the intermittent errors, we'll understand why
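The extra diagnostics for those two codes could look something like this. The storymapjs/<user>/<storymap>/... key layout is inferred from the grep example further down, and reading the effective settings via client.meta.config is an assumption; treat the helper as a sketch:

import logging

import boto3
import botocore

logger = logging.getLogger(__name__)

def _log_key_diagnostics(key, s3_client):
    """Break a failing key into components, report exact byte lengths, and dump client config."""
    key_bytes = key.encode("utf-8")
    logger.error("[STORAGE_ERROR] Key total length: %d bytes (S3 limit is 1024)", len(key_bytes))
    for i, part in enumerate(key.split("/")):
        logger.error("[STORAGE_ERROR]   component %d: %r (%d bytes)",
                     i, part, len(part.encode("utf-8")))
    logger.error("[STORAGE_ERROR] boto3=%s botocore=%s", boto3.__version__, botocore.__version__)
    cfg = s3_client.meta.config  # effective client config; assumed accessible this way
    logger.error("[STORAGE_ERROR] signature_version=%s retries=%s",
                 getattr(cfg, "signature_version", None), getattr(cfg, "retries", None))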
- Identify the error:
grep "\[STORAGE_ERROR\]" /var/log/storymapjs/error.log | tail -50
This shows the actual error code and response
- Find the failed save attempt:
grep "\[SAVE_ATTEMPT\]" /var/log/storymapjs/error.log | grep -B 2 -A 10 "STORAGE_ERROR"
This shows what was being saved when it failed
- Look for retry pattern:
grep "key=storymapjs/USER_ID/STORYMAP_ID" /var/log/storymapjs/error.logReplace USER_ID and STORYMAP_ID from the failed save. This shows all attempts for the same StoryMap
- Compare failure to success: Look at the key_len and content_size values:
  - Are they identical? → Error is not data-dependent
  - Are they different? → Identifies what changed between attempts
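If doing that comparison by eye gets tedious, a small script can pull key_len and content_size out of the matching lines. This is a rough sketch; the log path and line format follow the examples above:

import re
import sys

# Matches the SAVE_ATTEMPT lines emitted by the logging described above.
ATTEMPT = re.compile(r"\[SAVE_ATTEMPT\] (\w+): key=(\S+) key_len=(\d+) content_size=(\d+)")

def attempts(log_path, key_prefix):
    """Yield (function, key, key_len, content_size) for every attempt on a given key prefix."""
    with open(log_path) as fh:
        for line in fh:
            m = ATTEMPT.search(line)
            if m and m.group(2).startswith(key_prefix):
                yield m.group(1), m.group(2), int(m.group(3)), int(m.group(4))

if __name__ == "__main__":
    # e.g. python compare_attempts.py /var/log/storymapjs/error.log storymapjs/USER_ID/STORYMAP_ID
    for func, key, key_len, size in attempts(sys.argv[1], sys.argv[2]):
        print(f"{func}: key_len={key_len} content_size={size} key={key}")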
Pattern 1: Same key/size, different outcomes
[SAVE_ATTEMPT] save_json: key=storymapjs/user/map/draft.json... key_len=856 content_size=12345
[STORAGE_ERROR] ClientError: RequestHeaderSectionTooLarge
[SAVE_ATTEMPT] save_json: key=storymapjs/user/map/draft.json... key_len=856 content_size=12345
[SAVE_SUCCESS] save_json: key=storymapjs/user/map/draft.json...
Conclusion: Error is intermittent even with identical data → Likely boto3/S3/network issue
Pattern 2: Different size, different outcomes
[SAVE_ATTEMPT] save_json: key_len=856 content_size=12345
[STORAGE_ERROR] ClientError: RequestHeaderSectionTooLarge
[SAVE_ATTEMPT] save_json: key_len=745 content_size=10000
[SAVE_SUCCESS]
Conclusion: User changed something (removed image, edited content) → Data-dependent error
Pattern 3: Specific error code dominates
[STORAGE_ERROR] ClientError: RequestHeaderSectionTooLarge (90% of errors)
[STORAGE_ERROR] ClientError: KeyTooLongError (5% of errors)
[STORAGE_ERROR] ClientError: SlowDown (5% of errors)
Conclusion: Focus investigation on the dominant error
Pattern 4: No errors in logs but user reports failure
[SAVE_ATTEMPT] save_json
[SAVE_SUCCESS] save_json
(but user says it failed)
Conclusion: Error is happening outside storage layer (frontend? middleware? nginx?)
Once a real error is captured in the logs:
- Focus on the specific error code identified
- Analyze the pattern (data-dependent vs intermittent)
- Implement targeted fix
If no storage-layer errors appear but users still report failures (Pattern 4):
- Add logging at the API layer, before storage calls (see the sketch below)
- Check nginx/gunicorn logs for upstream issues
- Add frontend error tracking
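A framework-agnostic sketch of API-layer logging: a decorator wrapped around the save endpoint handler so failures above the storage layer also leave a trace. The [API_SAVE_*] tags and the decorator are hypothetical additions, not existing code:

import functools
import logging

logger = logging.getLogger(__name__)

def log_api_save(handler):
    """Wrap an API-level save handler so failures above the storage layer are also visible."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        logger.info("[API_SAVE_ATTEMPT] %s", handler.__name__)
        try:
            result = handler(*args, **kwargs)
        except Exception:
            logger.exception("[API_SAVE_ERROR] %s", handler.__name__)
            raise
        logger.info("[API_SAVE_SUCCESS] %s", handler.__name__)
        return result
    return wrapper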
If the error is intermittent even with identical data (Pattern 1):
- Investigate boto3 retry behavior (see the config sketch after this list)
- Check S3 service status during failure times
- Consider connection pooling issues
- Look at AWS Signature V4 generation variance
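If we land on the intermittent path, one first experiment is to make boto3's retry, pooling, and signing behavior explicit instead of relying on defaults. The values below are placeholders, not tuned recommendations:

import boto3
from botocore.config import Config

# Explicit retry and connection-pool settings so behavior is visible and reproducible.
s3 = boto3.client(
    "s3",
    config=Config(
        retries={"max_attempts": 10, "mode": "standard"},  # standard mode retries throttling/transient errors
        max_pool_connections=50,                           # avoid exhausting the urllib3 connection pool under load
        signature_version="s3v4",                          # pin SigV4 so signing variance is one less variable
    ),
)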
If the error is data-dependent (Pattern 2):
- Implement validation (key length, content size limits; sketched after this list)
- Add user-facing warnings before save
- Provide clearer error messages
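A sketch of pre-save validation. The 1024-byte key limit is S3's documented hard limit; the 5 MB content ceiling below is a placeholder, not a decided number, and the exception name is illustrative:

MAX_KEY_BYTES = 1024                   # hard S3 limit on object key length
MAX_CONTENT_BYTES = 5 * 1024 * 1024    # placeholder application limit

class SaveValidationError(ValueError):
    """Raised before calling S3 so the user gets a clear, actionable message."""

def validate_save(key, content):
    key_len = len(key.encode("utf-8"))
    if key_len > MAX_KEY_BYTES:
        raise SaveValidationError(
            f"Storage key is {key_len} bytes; the maximum is {MAX_KEY_BYTES}. "
            "Try a shorter StoryMap title.")
    if len(content) > MAX_CONTENT_BYTES:
        raise SaveValidationError(
            f"Save payload is {len(content)} bytes; the maximum is {MAX_CONTENT_BYTES}.")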
What we've deployed:
- Comprehensive logging of all save operations
- Full error context capture
- Success tracking for correlation
- Slug length limiting (200 chars) to prevent one class of long keys (sketched below)
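The slug limiting itself is small; something along these lines, where the 200-character cap is the deployed value and the helper name is illustrative:

MAX_SLUG_LENGTH = 200  # cap deployed to keep one class of S3 keys from growing too long

def limit_slug(slug: str) -> str:
    """Truncate a user-derived slug so the resulting S3 key stays well under the 1024-byte limit."""
    return slug[:MAX_SLUG_LENGTH]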
What we're waiting for:
- Real production error occurrence
- Log analysis to identify actual error
- Pattern identification (intermittent vs data-dependent)
No assumptions - just data collection and analysis.