After some experimentation, it seems that calling `grep` too many times, spread over some "longer" stretch of time in the nspawn test environment, causes `grep` to return 2. This is a workaround to see if calling `grep` only 4 times instead of 28 will make the test less flaky for bionic-arm64.
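For context (a generic illustration with throwaway file names, not the actual test's patterns): grep's exit status distinguishes "no match" from "error", which is why a failure like this shows up as 2 rather than 1. With GNU grep:

```shell
# GNU grep exit status convention: 0 = match found, 1 = no match, 2 = error.
# File names below are made up for illustration.
printf 'hello\n' > /tmp/grep-demo.txt

grep -q hello /tmp/grep-demo.txt
echo "match: $?"      # prints "match: 0"

grep -q nothere /tmp/grep-demo.txt
echo "no match: $?"   # prints "no match: 1"

grep -q hello /tmp/no-such-file 2>/dev/null
echo "error: $?"      # prints "error: 2"
```

So a test asserting `$? -le 1` would tolerate a missing line but still catch grep erroring out.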
Hard to say if it's 100% stable now, but it passed on all of the runs (this time it's TEST-24 failing in bionic-i386).
Hmm, but why was it failing, and why is it passing now? I see that calling grep fewer times might be slightly more efficient, but isn't this just papering over some race condition or other problem?
Oh, it's definitely covering for some other problem. But I've run out of cycles for now to get to the root of it. I was able to consistently reproduce it locally (on Fedora) by inserting
I spent some time looking at this one. My inclination would be to fix this by removing the redirection. However, I didn't stop to look at why output was being redirected to the console in the first place, and I also didn't check why writing to the console fails. One more thing I noticed is that the checks run before normal startup is finished; perhaps this should be set up in a way that normal startup is fully completed first, and only then do the checks run. Here's the strace output I collected from this run. In particular, the last few lines show the error clearly:
cc @evverx who knows this code really well! |
I *think* this was originally added to make it easier to see what was happening in tests. Later we added the functionality to print the journal on failure, so this redirection has stopped being useful.

In systemd#13719 (comment) @filbranden shows that grep tries to write to stdout and fails. In general, we should not assume that writing to the console is always possible. We have special code to handle this in pid1, after all:

```
99 19:22:10.731965 fstat(1, <unfinished ...>
99 19:22:10.731993 <... fstat resumed>{st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
99 19:22:10.732070 write(1, "ExecStartPost={ path=/bin/echo ; argv[]=/bin/echo ${4_four_ex} ; ignore_errors=no ; start_time=[Mon 2019-10-07 19:22:10 PDT] ; stop_time=[Mon 2019-10-07 19:22:10 PDT] ; pid=97 ; code=exited ; status=0 }\n", 203) = -1 EIO (Input/output error)
99 19:22:10.732174 write(2, "grep: ", 6) = -1 EIO (Input/output error)
99 19:22:10.732226 write(2, "write error", 11) = -1 EIO (Input/output error)
99 19:22:10.732263 write(2, ": Input/output error", 20) = -1 EIO (Input/output error)
99 19:22:10.732298 write(2, "\n", 1 <unfinished ...>
99 19:22:10.732325 <... write resumed>) = -1 EIO (Input/output error)
99 19:22:10.732349 exit_group(2) = ?
99 19:22:10.732424 +++ exited with 2 +++
```

Removing the redirection should make the tests less flaky. Replaces systemd#13719. While at it, also drop NotifyAccess=all. I think it was added purposefully in TEST-20-MAINPIDGAMES, and then cargo-culted to newer tests.
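This failure mode is easy to simulate without an nspawn container: redirecting grep's stdout to /dev/full makes every write() fail (with ENOSPC rather than EIO, but GNU grep reacts the same way), and grep exits with status 2 just like in the strace above. A minimal sketch:

```shell
# Make grep's write() to stdout fail: /dev/full rejects all writes with
# ENOSPC, mimicking the EIO seen when the nspawn console goes away.
# grep prints "grep: write error" to stderr and exits with 2.
printf 'match me\n' | grep 'match' > /dev/full
echo "grep exited with: $?"    # prints "grep exited with: 2"
```

Using `grep -q` sidesteps this class of failure entirely, since nothing is written to stdout on a successful match.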
See #13746.
I think my approach in #13746 is better ;)
@filbranden thank you for looking into this! Given that I've never seen the tests fail in VMs and lxc containers due to grep failing to write anything, the issue seems specific to the nspawn environment.

Regarding #13746: that approach is better in the sense that it sweeps the underlying issue under the rug completely, but the issue is still there. Plus, the premise of that PR isn't entirely accurate, because "the functionality to print the journal on failure" has never been implemented on Ubuntu CI. It's true that the journals are kept among the artifacts there, and theoretically it's possible to download and inspect them, but in practice they are just piling up without any structure that would help to figure out what exactly happens there. I'm not sure what's going on on CentOS CI, though. Judging by https://github.com/systemd/systemd-centos-ci/blob/9047b07c210c4e3a63048f2c517fa94bf2b8a190/agent/testsuite-rhel7.sh#L59, it saves the files somewhere (ignoring all the failures along the way).
At the end of each console log there is a landing-page URL, which should provide access to all the necessary artifacts (including the journals), like this one: https://ci.centos.org/job/systemd-pr-build-vagrant/3794/artifact//systemd-centos-ci/index.html. The journals are collected only when the test failed, though, so as not to clutter the logs too much.
Also, the journal-collection mechanism is a little bit different in the upstream testsuite script, as it includes support for coredumpctl-related stuff. The RHEL7 and RHEL8 scripts vary depending on the functionality I had at hand, but the result should be pretty much the same (accessible journals in case something went awry).
The part of CentOS CI where backtraces are collected is really great, I must say! Thank you! Though, as far as I can remember, it was inspired by networkctl crashing on Ubuntu CI on i386 machines, back when I was complaining about how hard it was for me to spin up a VM manually to get the backtrace :-). I was kind of hoping the coredumpctl stuff would make it there too (which has never happened, unfortunately). Anyway, are all the logs saved? If, for example, TEST-24 fails in "unprivileged" nspawn containers only, how easy is it to figure out which of those 20 files correspond to the last boot?
On an absolutely unrelated note, speaking of CentOS CI: I watched @davide125's talk at All Systems Go the other day, where, as far as I can tell, it was mentioned that the Facebook folks are planning to run the scripts from systemd-centos-ci on their infrastructure internally. @anitazha @davide125 @filbranden I'm wondering if it's possible to use that infrastructure upstream. I'm asking because at some point the infrastructure provided by the CentOS project reached its limits, and some tests, like the one running
@evverx the internal infra may not be the best fit for this, but we can probably work something out. Lemme talk to a few people here and get back to you.
@davide125 thank you! FWIW, apart from CentOS CI: I'm not sure if you're interested in fuzzing, but it seems we'll have to turn off Fuzzit next year (which we use for fuzzing systemd when PRs are opened), because we don't have a cluster to run our own instance (which is needed because we apparently don't have the money to cover the cost of the raw resources we use there). So if it would be possible to find 90 CPUs and 180 GB of RAM, that would also be great. Thank you!
@evverx, what are the current capacity requirements for CentOS CI?
That's true :-)
IIRC only the QEMU tests generate persistent journals; the nspawn ones are included in the host journal (and that one is dumped at the end as well, although just in text form). I guess this area could use some improvements.
The CentOS CI situation has come alive somewhat since the conference, mainly in two ways:
Of course, all this is just words for now; the email threads which should back them up should appear in the following days, both to begin the migration (and discuss the process) and to go through the possible ways of improving the CentOS CI resource pool. I wanted to make a more official statement once things were in motion, but we apparently managed to stumble upon the topic earlier than I expected :-)
I think it depends on how the tests will be run there. Recently we discussed that running the test/TEST-* tests one by one (on Ubuntu CI), or more or less in parallel on a single beefy machine (on CentOS CI), isn't going to be suitable for PRs (mostly because it already takes too long for the tests to finish), and we'd most likely have to turn some of the tests off to keep it reasonably responsive. It would be best if we could spread about 120 tests (40 tests * 3 PRs) across a number of machines at the same time. If we're talking about additionally running all the test/TEST-* tests under ASan/UBSan (currently we run TEST-01-BASIC only), we'd need to run 240 large tests simultaneously.
That's definitely good news! I'm generally out of the loop, so apologies for bringing this up once again :-) Regarding Fuzzit: as far as I know, there was an attempt to apply for several grants. I'm hoping it will work out, because I don't think anyone wants to maintain a Kubernetes cluster. I also annoy the OSS-Fuzz team from time to time about regression testing on their side, but given that OSS-Fuzz generally revolves around the Chromium workflow, which is different from that of most projects using GitHub and PRs, it's unlikely to happen :-)
@mrc0mmand I'm wondering if it would be possible to move whatever is run on Travis CI, Semaphore CI and Azure Pipelines there. My understanding is that a lot of different services are used because the systemd project has never been able to afford to buy anything (up until now, as far as I understand), and at the same time it tries to somehow make do with free services that are meant mostly for relatively small projects (as opposed to
By the way, @anitazha, would it be OK if I invited you to the systemd organization on Coverity Scan so that you could view reports at https://scan.coverity.com/projects/350? Another option would be to press the "add me to the project" button and wait for someone to get round to approving the request. We also have a list of people who have access to the systemd project on OSS-Fuzz at https://github.com/google/oss-fuzz/blob/master/projects/systemd/project.yaml (though I still have no idea how to make non-gmail accounts work properly there). I (or anybody else on that list) could add your email address there if you're interested. It's also possible to subscribe to reports from Fuzzit by signing up at https://app.fuzzit.dev/login and adding the email address linked to your GitHub account there. I'm sorry for bringing up several different topics here; it's just that these days I'm rarely on GitHub, so I'm trying to discuss everything in one fell swoop while I'm at it.
I'd say it should be possible (well, it should definitely be possible from the technological point of view), but it's hard to say for sure without knowing the resource limits (for now). It would be great, though :-)
@mrc0mmand and one last question: is it going to be
Ah, thanks, that's something I forgot to mention. The thing is that the migration won't have any effect on the machines we use for the actual testing, as we're going to migrate only the scheduler, so it's still going to be
OK. FWIW, Travis CI announced that it's now possible to test on arm: https://blog.travis-ci.com/2019-10-07-multi-cpu-architecture-support. I think if it clogs up at some point, it should probably be possible to move the
Excellent. Also, @xnox mentioned during ASG that
Well, I'm glad Travis CI is getting more useful. Given that a lot of things apparently get discussed at ASG, I'm wondering whether it was mentioned there when the "ppc64le" Ubuntu CI webhook will be turned on. I haven't heard from anyone on this matter for months.
What I completely forgot to mention is that I think (apart from fixing "StandardOutput=tty" in nspawn containers, if it's broken somehow) it would be better to replace
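As a sketch of the direction hinted at here (the drop-in path and unit name below are hypothetical, not taken from the actual test suite): routing the test service's output to the journal instead of the console would mean a vanished tty can no longer turn a passing check into an EIO failure:

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/testsuite.service.d/journal.conf
[Service]
# Log via the journal rather than StandardOutput=tty, so a write() to a
# dead console can't fail with EIO and break the test.
StandardOutput=journal
StandardError=journal
```

The journal copy is then retrievable after the fact with journalctl, which also fits the print-the-journal-on-failure approach mentioned above.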
@evverx sure, I am interested in all the things. Luckily I've been using a Gmail account for this stuff, so if you could add the.anitazha at gmail dot com where appropriate, that would be swell.
@anitazha I've added the email address everywhere. If I didn't slip up anywhere, you should have access to reports on Coverity Scan at https://scan.coverity.com/projects/350 (the "view defects" tab) and the Fuzzit dashboard at https://app.fuzzit.dev/orgs/systemd/dashboard. Let me know if those aren't working. Just in case, Fuzzit is going to send alerts from [email protected]. https://oss-fuzz.com should start working once google/oss-fuzz#2935 is merged.
@evverx Coverity and oss-fuzz seem to be working. Not sure about Fuzzit... Whenever I log in to the site it prompts me to install the Fuzzit app in GitHub. Isn't it already installed for the systemd org?
@anitazha No, it isn't. When Fuzzit was turned on, the GitHub integration didn't exist. As far as I can remember, the idea was, among other things, to make it possible for Fuzzit to report the status here on GitHub, as LGTM and Azure Pipelines (which are installed for the systemd organization) do (which is much more conventional than receiving alerts and going to the site). Anyway, @yevgenypats, could you help me out? Could we somehow avoid installing the app for contributors who are already members of the systemd organization?
@evverx - exactly as you said, the current systemd organisation is not connected to GitHub, so it asks to install the application if the user has not been added manually to the systemd organisation. Would you like any systemd contributor to be added automatically to the systemd organisation? This is possible. Also, if you add the user manually, it shouldn't ask them to install the application.
Let me think about it :-)
I seem to have put the email address on the list of people receiving notifications only, and completely forgot to add @anitazha to the organization there. I'll fix it shortly. Thank you!