
[RFC] Bug in Lambda Instance lifecycle management#360

Open
darkryder wants to merge 1 commit into open-lambda:main from darkryder:bug-killLambdaInstance

Conversation

@darkryder

I think there’s a bug in how Lambda instances are scheduled for scale-down.

There’s a cleanup task in LambdaFunction responsible for managing LambdaInstance lifecycles. In the current implementation, however, it appears that we can enqueue instances that are still running. When the cleanup queue (cleanupChan) reaches its capacity of 32 pending instances, the outer LambdaFunc.Task() method blocks, which in turn prevents new invocations from being enqueued.

A simple workaround would be to increase the size of the queue (the default cleanupChan has 32 slots). However, a more systematic solution might be to use a shared unbuffered channel and implement a notify_one-style mechanism for idle instances, waking exactly one instance that is ready to be terminated. By definition, if we are ready to scale down, there must be at least one idle instance, so the Task() loop should never block.

There might be cleaner ways to fix this, hence the RFC.

Here are the steps to replicate the bug.

  1. sudo ./ol worker init -p myworker -i ol-min
  2. In config.json, increase timeout per function (limits.runtime_sec) to 100s.
  3. Modify the echo.py example to:
import time

def f(event):
    sleep_for = event.get('sleep_for', 0)
    if sleep_for > 0:
        time.sleep(sleep_for)
    return event
  4. Register echo.py
  5. Spin up the worker
  6. Run the following script, which forces the pathological case of the cleanup queue building up.
#!/bin/bash
# When there are 32 slow functions, the 33rd function's execution will be blocked even if there is capacity.

set -eu

time curl -sS -X POST localhost:5000/run/echo -d '{"hello": "World", "sleep_for": 0}';

# goal is to fill up the cleanup channel
# let's say we quickly vary short and long requests. With each iteration, number of outstanding requests will 
# increase and decrease. When it decreases, the last request will be queued up in the cleanup channel.
# Since each last request is a long request, the cleanup channel will fill in 32 iterations.
# Slow requests after that will see large latency.

# long runtime: 50s
# short runtime: 1s

stagger() {
    (time curl -sS -X POST localhost:5000/run/echo -d '{"job": "test", "sleep_for": 1}'; echo) &
    sleep 0.1
    (time curl -sS -X POST localhost:5000/run/echo -d '{"job": "test", "sleep_for": 50}'; echo) &
}

for i in {0..48}; do
    stagger
    sleep 1
done

wait

The workload is basically 48 one-second jobs and 48 fifty-second jobs. In the current code, the turnaround time for the one-second jobs is one second for ~30 of them and 5-15 seconds for the remaining 18.

With this PR, all 48 one-second jobs do see a one-second turnaround time.

Approach nits

I had to switch to using nInstances instead of maintaining an explicit list of instances. If we decide to keep a list, we’ll likely need a handshake mechanism, since we’d want to remove from the list exactly the LambdaInstance that actually received the shutdown signal.

In the current implementation, as LambdaInstances are sent a kill signal,
their cleanup is sent to a separate buffered cleanup channel. Therefore,
when LambdaFunction wants to scale down, it is imperative that it actually
sends idle instances to be cleaned up. Currently that is not the case: we
send the most recently created instance to be cleaned up. The simplest
solution is to send one kill signal to a shared channel, and an idle
instance will pick it up.
lastScaling = &now
} else if f.instances.Len() > desiredInstances {
f.printf("reduce instances to %d", f.instances.Len()-1)
waitChan := f.instances.Back().Value.(*LambdaInstance).AsyncKill()
Author

This was the problematic line: we chose the most recently spun-up LambdaInstance, which need not be an idle one.
