Detect and cleanup leaky processes#29196
Conversation
jonahwilliams
left a comment
There was a problem hiding this comment.
Its possible we have several tests that are leaking processes and/or occasionally leaking processes. To make the process of rolling this out somewhat smoother (especially in the event that these are hard to find) we might want to consider some sort of soft-break initially. I'm not sure if that is possible to do though without updating the agent logic.
| final Set<RunningProcessInfo> beforeRunningDartInstances = await getRunningProcesses( | ||
| processName: 'dart$exe', | ||
| ).toSet(); | ||
| beforeRunningDartInstances.forEach(print); |
There was a problem hiding this comment.
Do these prints end up in the devicelab logs? If so we might want to only print on a failure, and in that case make sure we print both sets together so it is easier to found.
There was a problem hiding this comment.
They do. My thought is that this is useful information if anything goes wrong and we need to debug something that's hard to reproduce locally.
| ).toList(); | ||
| afterRunningDartInstances.forEach(print); | ||
| for(final RunningProcessInfo info in afterRunningDartInstances) { | ||
| if (!beforeRunningDartInstances.contains(info)) { |
There was a problem hiding this comment.
Above prints seem redundant with this
There was a problem hiding this comment.
This one is printing out exactly why we thought we had a matching process. Maybe I'm being paranoid, but this should only add ~5-10 lines extra to the logs.
|
I'd be game for a softer failure but we don't really have facilities for that. This issue is already causing failures in the devicelab on Windows - the main difference here is that this would help us not need manual intervention to continue testing. Maybe we should only make it fail on Windows for now? |
6534a71 to
cebc03f
Compare
cebc03f to
6a4b980
Compare
|
Trying to determine if the devicelab can use an existing orange status for this without needing to reimage the whole thing |
| } | ||
|
|
||
| class TaskResultCheckProcesses extends TaskResult { | ||
| TaskResultCheckProcesses() : super.success(null); |
There was a problem hiding this comment.
What does passing null here do?
There was a problem hiding this comment.
Doesn't set any data in the task result - leaves it empty. Other callers already use TaskResult.success(null) if they aren't doing a benchmark and have no other data to report.
| stderrDone.complete(); | ||
| }); | ||
|
|
||
| await Future.wait<void>( |
There was a problem hiding this comment.
Is there any sort of timeout process in the devicelab? I worry that if we change "] For a more detailed help message, press "h". To detach, press "d"; to quit, press "q"'" we're gonna get some odd failures.
There was a problem hiding this comment.
15 minutes.
I'm not sure of a better line to look for - I'm open to suggestions, but we could also just catch this failure and update it worst case.
There was a problem hiding this comment.
As long as it times out eventually this is good for now
jonahwilliams
left a comment
There was a problem hiding this comment.
LGTM modulo question about help message fragility
Description
Today, if a devicelab test leaks a dart process, it isnt' marked as a failure. If it's on Windows, this will cause the subsequent tasks on the bot to fail because they can't clean up files that the dart.exe process is holding.
This PR will check that we don't leak dart processes - if we do, we'll mark the task as failed and kill any dart processes we didn't have to begin with. This will allow subsequent runs to run successfully if they don't leak.
Related Issues
#29141
Checklist
Before you create this PR confirm that it meets all requirements listed below by checking the relevant checkboxes (
[x]). This will ensure a smooth and quick review process.///).flutter analyze --flutter-repo) does not report any problems on my PR.Breaking Change
Does your PR require Flutter developers to manually update their apps to accommodate your change?