
[RFC] Bug in Lambda Instance lifecycle management#360

Open
darkryder wants to merge 1 commit into open-lambda:main from darkryder:bug-killLambdaInstance

Conversation

@darkryder

I think there’s a bug in how Lambda instances are scheduled for scale-down.

There’s a cleanup task in LambdaFunction responsible for managing LambdaInstance lifecycles. In the current implementation, however, it appears that we can enqueue instances that are still running. When the cleanup queue (cleanupChan) reaches its capacity of 32 pending instances, the outer LambdaFunc.Task() method blocks, which in turn prevents new invocations from being enqueued.

A simple workaround would be to increase the size of the queue (the default cleanupChan has 32 slots). However, a more systematic solution might be to use a shared unbuffered channel and implement a notify_one-style mechanism for idle instances, waking exactly one instance that is ready to be terminated. By definition, if we are ready to scale down, there must be at least one idle instance, so the Task() loop should never block.

There might be cleaner ways to fix this, hence the RFC.

Here are the steps to replicate the bug.

  1. sudo ./ol worker init -p myworker -i ol-min
  2. In config.json, increase timeout per function (limits.runtime_sec) to 100s.
  3. Modify the echo.py example to:
import time

def f(event):
    sleep_for = event.get('sleep_for', 0)
    if sleep_for > 0:
        time.sleep(sleep_for)
    return event
  4. Register echo.py
  5. Spin up the worker
  6. Run the following script, which forces the pathological case of the cleanup queue building up.
#!/bin/bash
# When there are 32 slow functions, the 33rd function's execution will be blocked even if there is capacity.

set -eu

time curl -sS -X POST localhost:5000/run/echo -d '{"hello": "World", "sleep_for": 0}';

# goal is to fill up the cleanup channel
# let's say we quickly vary short and long requests. With each iteration, number of outstanding requests will 
# increase and decrease. When it decreases, the last request will be queued up in the cleanup channel.
# Since each last request is a long request, the cleanup channel will fill in 32 iterations.
# Slow requests after that will see large latency.

# long runtime: 50s
# short runtime: 1s

stagger() {
    (time curl -sS -X POST localhost:5000/run/echo -d '{"job": "test", "sleep_for": 1}'; echo) &
    sleep 0.1
    (time curl -sS -X POST localhost:5000/run/echo -d '{"job": "test", "sleep_for": 50}'; echo) &
}

for i in {0..48}; do
    stagger
    sleep 1
done

wait

The workload is basically 48 one-second jobs and 48 fifty-second jobs. In the current code, the turnaround time for the one-second jobs is one second for ~30 of them and 5-15 seconds for the remaining 18.

With this PR, all 48 one-second jobs do see a one-second turnaround time.

Approach nits

I had to switch to using nInstances instead of maintaining an explicit list of instances. If we decide to keep a list, we’ll likely need a handshake mechanism, since we’d want to remove from the list exactly the LambdaInstance that actually received the shutdown signal.

In the current implementation, as LambdaInstances are sent a kill signal,
their cleanup is sent to a separate buffered cleanup channel. Therefore,
when LambdaFunction wants to scale down, it is imperative that it actually
sends idle instances to be cleaned up. Currently that is not the case: we
send the most recently created instance to be cleaned up. The simplest
solution is to send one kill signal to a shared channel, and an idle
instance will pick it up.
lastScaling = &now
} else if f.instances.Len() > desiredInstances {
f.printf("reduce instances to %d", f.instances.Len()-1)
waitChan := f.instances.Back().Value.(*LambdaInstance).AsyncKill()
Author

This was the problematic line: we chose the most recently spun-up LambdaInstance, which need not be an idle one.
