DOCK-2299: Correctly handle thrown Error during push by svonworl · Pull Request #5290 · dockstore/dockstore

svonworl · 2022-12-21T01:11:27Z

Description
This PR modifies the webservice's GitHub apps push-processing code to more gracefully handle an Error thrown by the code within. Previously, Exceptions were handled properly, but Errors slightly not: pushes resulting in Errors were logged as "success"-ful in the GitHub apps logs. We discovered this problem because Ash created a WDL workflow that referenced itself via a http url, which Cromwell happily imported until overflowing the stack, resulting in a StackOverflowError.

The integration test processes a similarly-recursive WDL, which it assumes will trigger a StackOverflowError. The test will break if Cromwell changes that behavior. I tried another approach wherein Errors were triggered via mocks, but it didn't work, and it'd be fragile, anyways...

Currently, to give the user a bit of a clue as to what has happened, this PR includes the short class name of the Error in the github apps log message. For example, StackOverflowError. Could this leak any sensitive information? Please consider.

We catch RuntimeException | Error on purpose, rather than Throwable, so that if we change the inner code so it throws any new check exceptions, we'll know about it.

The codebots really don't like us catching Errors, even though I believe it is the right call in this case. They also want us to update to more modern Java features and constructs. But I didn't see anything really objectionable in their feedback.

I updated TransactionHelper to handle Errors similarly. There is probably other code scattered throughout our system that will skip logging/cleanup operations if an Error is thrown.

Review Instructions
Fork https://github.com/dockstore-testing/recursive-wdl and register it on both qa and staging. Check the github apps logs on each. qa should report that the push failed, and staging should report that the push succeeded.

Issue
https://ucsc-cgl.atlassian.net/browse/DOCK-2299
#5274

Security
There may be related issues, I have contacted the appropriate personnel.

Please make sure that you've checked the following before submitting your pull request. Thanks!

Check that you pass the basic style checks and unit tests by running mvn clean install
Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
Follow the existing JPA patterns for queries, using named parameters, to avoid SQL injection
If you are changing dependencies, check the Snyk status check or the dashboard to ensure you are not introducing new high/critical vulnerabilities
Assume that inputs to the API can be malicious, and sanitize and/or check for Denial of Service type values, e.g., massive sizes
Do not serve user-uploaded binary images through the Dockstore API
Ensure that endpoints that only allow privileged access enforce that with the @RolesAllowed annotation
Do not create cookies, although this may change in the future
If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.

codecov · 2022-12-21T01:11:47Z

Codecov Report

Attention: 18 lines in your changes are missing coverage. Please review.

Comparison is base (fa76e2a) 73.54% compared to head (217c0ea) 73.48%.
Report is 251 commits behind head on develop.

❗ Current head 217c0ea differs from pull request most recent head 9399d9c. Consider uploading reports for the commit 9399d9c to get more accurate results

Files	Patch %	Lines
...ockstore/webservice/helpers/TransactionHelper.java	19.04%	15 Missing and 2 partials ⚠️
...webservice/resources/AbstractWorkflowResource.java	94.11%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@              Coverage Diff              @@
##             develop    #5290      +/-   ##
=============================================
- Coverage      73.54%   73.48%   -0.07%     
+ Complexity      4471     4463       -8     
=============================================
  Files            296      296              
  Lines          16914    16928      +14     
  Branches        1862     1865       +3     
=============================================
  Hits           12440    12440              
- Misses          3590     3597       +7     
- Partials         884      891       +7

Flag	Coverage Δ
bitbuckettests	`28.30% <0.00%> (-0.03%)`	⬇️
integrationtests	`58.34% <52.63%> (-0.02%)`	⬇️
languageparsingtests	`10.82% <0.00%> (-0.01%)`	⬇️
toolintegrationtests	`30.45% <0.00%> (-0.03%)`	⬇️
unit-tests_and_non-confidential-tests	`27.13% <0.00%> (-0.03%)`	⬇️
workflowintegrationtests	`40.65% <0.00%> (-0.04%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

denis-yuen

In general, if the issue is that the cromwell/scala bridge code can throw Errors, shouldn't our catch clauses be closer to the Scala bridge code or just inside it? This approach seems to result in collateral damage or too much "by-catch. For example, this seems to affect updateUserWorkflows as well which has nothing to do with WDL parsing.

I do wonder since

Catching either Throwable or Error will also catch OutOfMemoryError and InternalError, from which an application should not attempt to recover.

could happen in big parts of "our" code too.

The integration test processes a similarly-recursive WDL, which it assumes will trigger a StackOverflowError. The test will break if Cromwell changes that behavior. I tried another approach wherein Errors were triggered via mocks, but it didn't work, and it'd be fragile, anyways...

Sounds ok, in general in favour of a high-level integration test approach without mocks. That said, if cromwell changes their code to avoid a stack overflow, I think I'm ok with that. Just make a code comment

denis-yuen · 2022-12-21T15:30:48Z

            isSuccessful &= createWorkflowsAndVersionsFromDockstoreYml(dockstoreYaml12.getTools(), repository, gitReference, installationId, username, dockstoreYml, AppTool.class, messageWriter);

-        } catch (Exception ex) {
+        } catch (RuntimeException | DockstoreYamlHelper.DockstoreYamlException | Error throwable) {


Ok with the general idea, but if we're breaking with conventions because the Scala/Cromwell bridge code is acting against expectations, shouldn't this be a lot "closer" to the Scala bridge class to avoid "by-catch"?

denis-yuen · 2022-12-21T15:33:04Z

    private static final Logger LOG = LoggerFactory.getLogger(TransactionHelper.class);
    private Session session;
-    private RuntimeException thrown;
+    private Throwable thrown;


By the same token, the changes in this file seem to affect /updateUserWorkflows which should maybe be deleted anyway (we probably don't use it)

svonworl · 2022-12-21T16:26:19Z

In general, if the issue is that the cromwell/scala bridge code can throw Errors, shouldn't our catch clauses be closer to the Scala bridge code or just inside it? This approach seems to result in collateral damage or too much "by-catch. For example, this seems to affect updateUserWorkflows as well which has nothing to do with WDL parsing.

I do wonder since

Catching either Throwable or Error will also catch OutOfMemoryError and InternalError, from which an application should not attempt to recover.

could happen in big parts of "our" code too.

The integration test processes a similarly-recursive WDL, which it assumes will trigger a StackOverflowError. The test will break if Cromwell changes that behavior. I tried another approach wherein Errors were triggered via mocks, but it didn't work, and it'd be fragile, anyways...

Sounds ok, in general in favour of a high-level integration test approach without mocks. That said, if cromwell changes their code to avoid a stack overflow, I think I'm ok with that. Just make a code comment

Any of our code can potentially throw an Error, and we should respond to that eventuality in a reasonable way. This PR modifies the enclosing catch so that when it attempts to log an Error-related failure, it does so correctly. After that, the resource handler ends. It's not trying to recover, but rather provide the user with correct information about the failed push, because what it was reporting previously, if any Error was thrown, was wrong. If we don't change the code, it tells the user something incorrect, and if it simply exits without logging anything, the user won't know what has happened, because this endpoint is invoked behind the scenes via the webhook callback...

Similarly, the TransactionHelper mods are designed to do the same, to attempt some cleanup and to log some information before rethrowing the Error, so that we have more information about what happened. If you feel super strongly about this, I'm fine with reverting the TransactionHelper changes, they're not as essential...

And sure, we can also wrap Cromwell to translate the StackOverflowError into something else. But, I kinda feel like it's a severe enough issue that maybe we let it ripple out and terminate the thread with the other Errors. We still need the enclosing catch to make sure that the failed push is logged correctly...

coverbeck

Good discussion...

Researching it, I lean towards catching this particular StackOverflowError error closer to Cromwell:

Referring to the Joshua Bloch "Effective Java" book, he says, although it encompasses unchecked exceptions as well, which seems overkill: "If a program throws an unchecked exception or an error, it is generally the case that recovery is impossible and continued execution would do more harm than good"
In particular, if we get an OOM error and catch it, I worry that this might put the web service in a indeterminate state.
The JavaDoc for VirtualError, which OOM and StackOverflow subclass, says "Thrown to indicate that the Java Virtual Machine is broken or has run out of resources necessary for it to continue operating."
Confusingly, Dockstore seems to recover just fine from a StackOverflowError in this case.
But, according to a random user on Stack Overflow in this post, it is not always guaranteed you can recover from a StackOverflowError.

Currently, to give the user a bit of a clue as to what has happened, this PR includes the short class name of the Error in the github apps log message. For example, StackOverflowError. Could this leak any sensitive information? Please consider.

I feel it would, a little bit -- the user would have the keys on how to trigger VM exception(s) in our web service.

Writing/researching all the above, I do think it's best to catch the StackOverflowError, only, not errors in general, closer to Cromwell.

denis-yuen · 2022-12-21T18:19:53Z

Any of our code can potentially throw an Error, and we should respond to that eventuality in a reasonable way.

Yes, but my understanding of the Java documentation is that this is usually an OS problem so "we" as a platform can respond to it but not in the Java code.

For example as previously noted

An Error is a subclass of Throwable that indicates serious problems that a reasonable application should not try to catch. Most such errors are abnormal conditions. The ThreadDeath error, though a "normal" condition, is also a subclass of Error because most applications should not try to catch it.

If our belief is that the Scala libraries are indicating a serious problem incorrectly, i.e. a non-serious problem that we can deal with, then reiterating that the Error (or StackOverflowError) catch should occur "closer" to the problem and not accidentally catch Errors that may be thrown by non-Scala code.

e.g.

In particular, if we get an OOM error and catch it, I worry that this might put the web service in a indeterminate state.

denis-yuen · 2022-12-21T18:21:52Z

If we don't change the code, it tells the user something incorrect, and if it simply exits without logging anything, the user won't know what has happened, because this endpoint is invoked behind the scenes via the webhook callback...

This seems like a little of a different problem though. For this, maybe we should log a failure first and then only overwrite it with a success afterwards. Thinking that if the whole application crashes due to a genuine error (or if Scala does have a real problem), then we would leave behind evidence of a problem in the github app logs.

svonworl · 2023-05-04T16:18:25Z

test

denis-yuen · 2023-05-04T17:53:20Z

test

❓

denis-yuen

Old PR with merge conflicts
Looking at stale PRs and noise in the daily reminders, is this PR still alive and/or worth keeping @svonworl ?

denis-yuen · 2024-03-11T18:12:22Z

One other observation @svonworl I think "draft" PRs do not show up on the daily reminder, so if you wanted to keep this for inspiration/cherry-picking, that can be a route

coverbeck

I thought this was fixed, perhaps by a different PR/branch?

svonworl · 2024-03-12T17:27:52Z

I thought this was fixed, perhaps by a different PR/branch?

The original ticket was to address bad registration behavior due to recursive WDLs: specifically, Ash noticed that when he attempted to register a recursive WDL, no entry appeared, and nothing appeared in the app logs. In a different PR, we fixed the original problem by patching the code closer to where the corresponding StackOverflowError occurs, for that specific case.

However, the merged PR didn't address the general case of an Error thrown in the registration code. This PR was intended to remedy that, but there was pushback, and I was not able to merge. I still believe that when an Error is thrown, it is preferable for the surrounding code to attempt some basic cleanup and logging, rather than let the Error pass through, silently, unhandled. Thus, I kept this PR around, in case the team ever subscribes to that point of view.

denis-yuen · 2024-03-12T18:09:07Z

Thus, I kept this PR around, in case the team ever subscribes to that point of view.

In the meantime, though, we get a reminder every day about it.
I'm going to put this back into draft and I think that will (may?) stop that reminder

Edit to add: ok, worked

svonworl added 9 commits December 19, 2022 17:21

rejigger to handle Error

ea07641

adjust TransactionHelper

1d2e1f7

add rough IT

b7dc8a5

revert a change

6949d73

fix oopsie

17583f9

enable mock-maker-inline and enhance IT

266a69a

move and repair test

26dd0fc

use recurive WDL for testing

92845b0

retarget to dockstore-testing repo

217c0ea

svonworl self-assigned this Dec 21, 2022

svonworl requested review from coverbeck, david4096 and denis-yuen December 21, 2022 01:24

denis-yuen requested changes Dec 21, 2022

View reviewed changes

coverbeck reviewed Dec 21, 2022

View reviewed changes

tweak comment [skip ci]

d0a1a12

update comment

9399d9c

denis-yuen requested review from coverbeck and denis-yuen March 11, 2024 17:55

denis-yuen requested changes Mar 11, 2024

View reviewed changes

coverbeck reviewed Mar 12, 2024

View reviewed changes

denis-yuen marked this pull request as draft March 12, 2024 18:09

Conversation

svonworl commented Dec 21, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Dec 21, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

denis-yuen left a comment

Choose a reason for hiding this comment

Uh oh!

denis-yuen Dec 21, 2022

Choose a reason for hiding this comment

Uh oh!

denis-yuen Dec 21, 2022

Choose a reason for hiding this comment

Uh oh!

svonworl commented Dec 21, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coverbeck left a comment

Choose a reason for hiding this comment

Uh oh!

denis-yuen commented Dec 21, 2022

Uh oh!

denis-yuen commented Dec 21, 2022

Uh oh!

svonworl commented May 4, 2023

Uh oh!

denis-yuen commented May 4, 2023

Uh oh!

denis-yuen left a comment

Choose a reason for hiding this comment

Uh oh!

denis-yuen commented Mar 11, 2024

Uh oh!

coverbeck left a comment

Choose a reason for hiding this comment

Uh oh!

svonworl commented Mar 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

denis-yuen commented Mar 12, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

svonworl commented Dec 21, 2022 •

edited

Loading

codecov bot commented Dec 21, 2022 •

edited

Loading

svonworl commented Dec 21, 2022 •

edited

Loading

svonworl commented Mar 12, 2024 •

edited

Loading

denis-yuen commented Mar 12, 2024 •

edited

Loading