DOCK-2299: Correctly handle thrown Error during push#5290
DOCK-2299: Correctly handle thrown Error during push#5290
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## develop #5290 +/- ##
=============================================
- Coverage 73.54% 73.48% -0.07%
+ Complexity 4471 4463 -8
=============================================
Files 296 296
Lines 16914 16928 +14
Branches 1862 1865 +3
=============================================
Hits 12440 12440
- Misses 3590 3597 +7
- Partials 884 891 +7
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
denis-yuen
left a comment
There was a problem hiding this comment.
In general, if the issue is that the cromwell/scala bridge code can throw Errors, shouldn't our catch clauses be closer to the Scala bridge code or just inside it? This approach seems to result in collateral damage or too much "by-catch. For example, this seems to affect updateUserWorkflows as well which has nothing to do with WDL parsing.
I do wonder since
Catching either Throwable or Error will also catch OutOfMemoryError and InternalError, from which an application should not attempt to recover.
could happen in big parts of "our" code too.
The integration test processes a similarly-recursive WDL, which it assumes will trigger a StackOverflowError. The test will break if Cromwell changes that behavior. I tried another approach wherein Errors were triggered via mocks, but it didn't work, and it'd be fragile, anyways...
Sounds ok, in general in favour of a high-level integration test approach without mocks. That said, if cromwell changes their code to avoid a stack overflow, I think I'm ok with that. Just make a code comment
| isSuccessful &= createWorkflowsAndVersionsFromDockstoreYml(dockstoreYaml12.getTools(), repository, gitReference, installationId, username, dockstoreYml, AppTool.class, messageWriter); | ||
|
|
||
| } catch (Exception ex) { | ||
| } catch (RuntimeException | DockstoreYamlHelper.DockstoreYamlException | Error throwable) { |
There was a problem hiding this comment.
Ok with the general idea, but if we're breaking with conventions because the Scala/Cromwell bridge code is acting against expectations, shouldn't this be a lot "closer" to the Scala bridge class to avoid "by-catch"?
| private static final Logger LOG = LoggerFactory.getLogger(TransactionHelper.class); | ||
| private Session session; | ||
| private RuntimeException thrown; | ||
| private Throwable thrown; |
There was a problem hiding this comment.
By the same token, the changes in this file seem to affect /updateUserWorkflows which should maybe be deleted anyway (we probably don't use it)
Any of our code can potentially throw an Similarly, the And sure, we can also wrap Cromwell to translate the |
coverbeck
left a comment
There was a problem hiding this comment.
Good discussion...
Researching it, I lean towards catching this particular StackOverflowError error closer to Cromwell:
- Referring to the Joshua Bloch "Effective Java" book, he says, although it encompasses unchecked exceptions as well, which seems overkill: "If a program throws an unchecked exception or an error, it is generally the case that recovery is impossible and continued execution would do more harm than good"
- In particular, if we get an OOM error and catch it, I worry that this might put the web service in a indeterminate state.
- The JavaDoc for VirtualError, which OOM and StackOverflow subclass, says "Thrown to indicate that the Java Virtual Machine is broken or has run out of resources necessary for it to continue operating."
- Confusingly, Dockstore seems to recover just fine from a StackOverflowError in this case.
- But, according to a random user on Stack Overflow in this post, it is not always guaranteed you can recover from a StackOverflowError.
Currently, to give the user a bit of a clue as to what has happened, this PR includes the short class name of the Error in the github apps log message. For example, StackOverflowError. Could this leak any sensitive information? Please consider.
I feel it would, a little bit -- the user would have the keys on how to trigger VM exception(s) in our web service.
Writing/researching all the above, I do think it's best to catch the StackOverflowError, only, not errors in general, closer to Cromwell.
Yes, but my understanding of the Java documentation is that this is usually an OS problem so "we" as a platform can respond to it but not in the Java code. For example as previously noted
If our belief is that the Scala libraries are indicating a serious problem incorrectly, i.e. a non-serious problem that we can deal with, then reiterating that the e.g.
|
This seems like a little of a different problem though. For this, maybe we should log a failure first and then only overwrite it with a success afterwards. Thinking that if the whole application crashes due to a genuine error (or if Scala does have a real problem), then we would leave behind evidence of a problem in the github app logs. |
|
test |
❓ |
denis-yuen
left a comment
There was a problem hiding this comment.
Old PR with merge conflicts
Looking at stale PRs and noise in the daily reminders, is this PR still alive and/or worth keeping @svonworl ?
|
One other observation @svonworl I think "draft" PRs do not show up on the daily reminder, so if you wanted to keep this for inspiration/cherry-picking, that can be a route |
coverbeck
left a comment
There was a problem hiding this comment.
I thought this was fixed, perhaps by a different PR/branch?
The original ticket was to address bad registration behavior due to recursive WDLs: specifically, Ash noticed that when he attempted to register a recursive WDL, no entry appeared, and nothing appeared in the app logs. In a different PR, we fixed the original problem by patching the code closer to where the corresponding StackOverflowError occurs, for that specific case. However, the merged PR didn't address the general case of an Error thrown in the registration code. This PR was intended to remedy that, but there was pushback, and I was not able to merge. I still believe that when an Error is thrown, it is preferable for the surrounding code to attempt some basic cleanup and logging, rather than let the Error pass through, silently, unhandled. Thus, I kept this PR around, in case the team ever subscribes to that point of view. |
In the meantime, though, we get a reminder every day about it. Edit to add: ok, worked |
Description
This PR modifies the webservice's GitHub apps push-processing code to more gracefully handle an
Errorthrown by the code within. Previously,Exceptions were handled properly, butErrors slightly not: pushes resulting inErrors were logged as "success"-ful in the GitHub apps logs. We discovered this problem because Ash created a WDL workflow that referenced itself via a http url, which Cromwell happily imported until overflowing the stack, resulting in aStackOverflowError.The integration test processes a similarly-recursive WDL, which it assumes will trigger a
StackOverflowError. The test will break if Cromwell changes that behavior. I tried another approach whereinErrors were triggered via mocks, but it didn't work, and it'd be fragile, anyways...Currently, to give the user a bit of a clue as to what has happened, this PR includes the short class name of the
Errorin the github apps log message. For example,StackOverflowError. Could this leak any sensitive information? Please consider.We catch
RuntimeException | Erroron purpose, rather thanThrowable, so that if we change the inner code so it throws any new check exceptions, we'll know about it.The codebots really don't like us catching
Errors, even though I believe it is the right call in this case. They also want us to update to more modern Java features and constructs. But I didn't see anything really objectionable in their feedback.I updated
TransactionHelperto handleErrors similarly. There is probably other code scattered throughout our system that will skip logging/cleanup operations if anErroris thrown.Review Instructions
Fork https://github.com/dockstore-testing/recursive-wdl and register it on both qa and staging. Check the github apps logs on each. qa should report that the push failed, and staging should report that the push succeeded.
Issue
https://ucsc-cgl.atlassian.net/browse/DOCK-2299
#5274
Security
There may be related issues, I have contacted the appropriate personnel.
Please make sure that you've checked the following before submitting your pull request. Thanks!
mvn clean install@RolesAllowedannotation