Skip to content

Detect and cleanup leaky processes#29196

Merged
dnfield merged 14 commits intoflutter:masterfrom
dnfield:device_lab_leaky_processes
Mar 22, 2019
Merged

Detect and cleanup leaky processes#29196
dnfield merged 14 commits intoflutter:masterfrom
dnfield:device_lab_leaky_processes

Conversation

@dnfield
Copy link
Contributor

@dnfield dnfield commented Mar 12, 2019

Description

Today, if a devicelab test leaks a dart process, it isnt' marked as a failure. If it's on Windows, this will cause the subsequent tasks on the bot to fail because they can't clean up files that the dart.exe process is holding.

This PR will check that we don't leak dart processes - if we do, we'll mark the task as failed and kill any dart processes we didn't have to begin with. This will allow subsequent runs to run successfully if they don't leak.

Related Issues

#29141

Checklist

Before you create this PR confirm that it meets all requirements listed below by checking the relevant checkboxes ([x]). This will ensure a smooth and quick review process.

  • I read the Contributor Guide and followed the process outlined there for submitting PRs.
  • My PR includes tests for all changed/updated/fixed behaviors (See Test Coverage).
  • All existing and new tests are passing.
  • I updated/added relevant documentation (doc comments with ///).
  • The analyzer (flutter analyze --flutter-repo) does not report any problems on my PR.
  • I read and followed the Flutter Style Guide, including Features we expect every widget to implement.
  • I signed the CLA.
  • I am willing to follow-up on review comments in a timely manner.

Breaking Change

Does your PR require Flutter developers to manually update their apps to accommodate your change?

  • Yes, this is a breaking change (Please read Handling breaking changes). - but only in a good way and it doesn't impact the framework 😉
  • No, this is not a breaking change.

@dnfield dnfield added a: tests "flutter test", flutter_test, or one of our tests team: flakes labels Mar 12, 2019
@dnfield dnfield requested a review from jonahwilliams March 12, 2019 01:41
Copy link
Contributor

@jonahwilliams jonahwilliams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its possible we have several tests that are leaking processes and/or occasionally leaking processes. To make the process of rolling this out somewhat smoother (especially in the event that these are hard to find) we might want to consider some sort of soft-break initially. I'm not sure if that is possible to do though without updating the agent logic.

final Set<RunningProcessInfo> beforeRunningDartInstances = await getRunningProcesses(
processName: 'dart$exe',
).toSet();
beforeRunningDartInstances.forEach(print);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these prints end up in the devicelab logs? If so we might want to only print on a failure, and in that case make sure we print both sets together so it is easier to found.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They do. My thought is that this is useful information if anything goes wrong and we need to debug something that's hard to reproduce locally.

).toList();
afterRunningDartInstances.forEach(print);
for(final RunningProcessInfo info in afterRunningDartInstances) {
if (!beforeRunningDartInstances.contains(info)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above prints seem redundant with this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is printing out exactly why we thought we had a matching process. Maybe I'm being paranoid, but this should only add ~5-10 lines extra to the logs.

@dnfield
Copy link
Contributor Author

dnfield commented Mar 12, 2019

I'd be game for a softer failure but we don't really have facilities for that.

This issue is already causing failures in the devicelab on Windows - the main difference here is that this would help us not need manual intervention to continue testing.

Maybe we should only make it fail on Windows for now?

@dnfield dnfield force-pushed the device_lab_leaky_processes branch 2 times, most recently from 6534a71 to cebc03f Compare March 12, 2019 20:41
@dnfield dnfield force-pushed the device_lab_leaky_processes branch from cebc03f to 6a4b980 Compare March 12, 2019 20:46
@dnfield
Copy link
Contributor Author

dnfield commented Mar 12, 2019

Trying to determine if the devicelab can use an existing orange status for this without needing to reimage the whole thing

@dnfield dnfield requested a review from jonahwilliams March 19, 2019 21:19
}

class TaskResultCheckProcesses extends TaskResult {
TaskResultCheckProcesses() : super.success(null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does passing null here do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't set any data in the task result - leaves it empty. Other callers already use TaskResult.success(null) if they aren't doing a benchmark and have no other data to report.

stderrDone.complete();
});

await Future.wait<void>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any sort of timeout process in the devicelab? I worry that if we change "] For a more detailed help message, press "h". To detach, press "d"; to quit, press "q"'" we're gonna get some odd failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

15 minutes.

I'm not sure of a better line to look for - I'm open to suggestions, but we could also just catch this failure and update it worst case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as it times out eventually this is good for now

Copy link
Contributor

@jonahwilliams jonahwilliams left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo question about help message fragility

@dnfield dnfield merged commit ecfdd7e into flutter:master Mar 22, 2019
@dnfield dnfield deleted the device_lab_leaky_processes branch March 22, 2019 21:32
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 7, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

a: tests "flutter test", flutter_test, or one of our tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants