[RFC] Bug in Lambda Instance lifecycle management#360
Open
darkryder wants to merge 1 commit intoopen-lambda:mainfrom
Open
[RFC] Bug in Lambda Instance lifecycle management#360darkryder wants to merge 1 commit intoopen-lambda:mainfrom
darkryder wants to merge 1 commit intoopen-lambda:mainfrom
Conversation
In the current implementation, as LambdaInstances are sent a kill signal, their clean up is sent to a separate buffered cleanup channel. Therefore, when LambdaFunction wants to scale down, it is imperitive that it actually sends idle invocations to be cleaned up. Currently that is not the case, we send the latest invocation to be cleaned up. The simplest solution is to simply send one kill signal to a shared channel, and the idle invocation will pick it up.
darkryder
commented
Nov 10, 2025
| lastScaling = &now | ||
| } else if f.instances.Len() > desiredInstances { | ||
| f.printf("reduce instances to %d", f.instances.Len()-1) | ||
| waitChan := f.instances.Back().Value.(*LambdaInstance).AsyncKill() |
Author
There was a problem hiding this comment.
This was the problematic line -- we chose the most recently spun up LambdaInstance, which need not be the idle one.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
I think there’s a bug in how Lambda instances are scheduled for scale-down.
Currently, there’s a cleanup task in
LambdaFunctionresponsible for managingLambdaInstancelifecycles. However, in the current implementation, it appears that we can enqueue instances that are still running. When the cleanup queue (cleanupChan) reaches its capacity of 32 pending instances, the outerLambdaFunc:Task()method becomes blocked. This, in turn, prevents new invocations from being enqueued.A simple workaround would be to increase the size of the queue (the default
cleaupChanhas 32 slots). However, a more systematic solution might be to use a shared unbuffered channel and implement a notify_one-style mechanism for idle instances — effectively waking one instance at a time that’s ready to be terminated. By definition, if we’re ready to scale down, there must be at least one idle worker — so the Task() loop should never block.There might be cleaner ways to fix this, hence the RFC.
Here are the steps to replicate the bug.
sudo ./ol worker init -p myworker -i ol-minlimits.runtime_sec) to 100s.The workload is basically 48 1second jobs and 48 5second jobs. In the current code, we find that the turnaround time for 1second jobs is 1second for ~30 jobs and 5-15 seconds for the rest 18 jobs.
This PR shows that all 48 1second jobs will indeed see 1second turnaround time.
Approach nits
I had to switch to using
nInstancesinstead of maintaining an explicit list of instances. If we decide to maintain a list, we’ll likely need a handshake mechanism, since we’d want to carefully remove the specificLambdaInstancethat actually received the shutdown signal.