
test: Make TEST-40 less flaky#13719

Closed
anitazha wants to merge 1 commit into systemd:master from anitazha:stabilize_test_40

Conversation

@anitazha
Member

@anitazha anitazha commented Oct 3, 2019

After some experimentation, it seems that calling `grep` too much and
spaced over some "longer" length of time in the nspawn test environment
causes `grep` to return 2. This is a workaround to see if calling
`grep` only 4 times vs 28 will make the test less flaky for bionic-arm64.
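For context (not part of the original report): `grep` distinguishes "no lines matched" from "a real error" in its exit status, which is why a return value of 2 here points at an I/O problem rather than at missing output. A minimal sketch, using Python's `subprocess` for illustration:

```python
import subprocess

# Exit status 1 means "no lines matched" -- the pattern simply wasn't there.
no_match = subprocess.run(["grep", "-q", "needle", "/dev/null"])
print(no_match.returncode)  # 1

# Exit status 2 means "an error occurred" -- e.g. an unreadable file,
# or (as in this test) a failed write to stdout/stderr.
error = subprocess.run(
    ["grep", "-q", "needle", "/nonexistent-file"],
    stderr=subprocess.DEVNULL,
)
print(error.returncode)  # 2
```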
@anitazha
Member Author

anitazha commented Oct 4, 2019

Hard to say if it's 100% stable now, but it passed on all of the runs (this time it's TEST-24 failing in bionic-i386).

@keszybz
Member

keszybz commented Oct 4, 2019

Hmm, but why was it failing and why is it passing now? I see that calling `grep` fewer times might be slightly more efficient, but isn't this just covering up some race condition or other problem?

@anitazha
Member Author

anitazha commented Oct 4, 2019

> Hmm, but why was it failing and why is it passing now? I see that calling grep less times might be slightly more efficient, but isn't this just covering some race condition or other problem?

Oh, it's definitely covering for some other problem. But I've run out of cycles for now to get to the root of it. I was able to consistently reproduce it locally (on Fedora) by inserting `sleep 5` into the second loop, or by adding additional sets of loops that run `systemctl` and `grep` its output, and then running the test (if someone else wanted to take a stab at it...).

@filbranden
Contributor

I spent some time looking at this one and it turns out grep does fail trying to write to the tty.

I think my inclination would be to fix this by removing StandardOutput=tty and StandardError=tty from the generated testsuite.service, since I don't really see any problems with sending that output to the journal instead.

However, I didn't stop to look at why output was being redirected to the tty. (I noticed the same is true in many other TESTs; this one was probably copy-pasta cargo-culting, though I'm not sure why the others were doing that.)

I also didn't check why writing to the tty was failing. I noticed the recent PR #12758, which is related to tty/console in nspawn. So maybe that's related? Haven't checked yet...

One more thing I noticed is that testsuite.service starts through a testsuite.target (which default.target points to), but a lot of the startup happens in parallel with testsuite.service running. Running this test interactively, I noticed systemd-logind was starting around the time it was failing, as well as some console services. So maybe that's related too?

Perhaps this should be set up in a way that normal startup fully completes first, and only then testsuite.service starts? Having it race with the rest of the startup seems like it could cause other similar trouble as well...

Here's the strace output I collected from this run.

In particular, the last few lines show the error clearly:

99    19:22:10.731965 fstat(1,  <unfinished ...>
99    19:22:10.731993 <... fstat resumed>{st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
99    19:22:10.732070 write(1, "ExecStartPost={ path=/bin/echo ; argv[]=/bin/echo ${4_four_ex} ; ignore_errors=no ; start_time=[Mon 2019-10-07 19:22:10 PDT] ; stop_time=[Mon 209-10-07 19:22:10 PDT] ; pid=97 ; code=exited ; status=0 }\n", 203) = -1 EIO (Input/output error)
99    19:22:10.732174 write(2, "grep: ", 6) = -1 EIO (Input/output error)
99    19:22:10.732226 write(2, "write error", 11) = -1 EIO (Input/output error)
99    19:22:10.732263 write(2, ": Input/output error", 20) = -1 EIO (Input/output error)
99    19:22:10.732298 write(2, "\n", 1 <unfinished ...>
99    19:22:10.732325 <... write resumed>) = -1 EIO (Input/output error)
99    19:22:10.732349 exit_group(2)     = ?
99    19:22:10.732424 +++ exited with 2 +++
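The EIO above can be reproduced outside the test suite. On Linux, once the master side of a pseudo-terminal is closed, the slave tty is hung up (much like a console backing a container going away), and subsequent writes to it fail with EIO. A minimal sketch, assuming a Linux pty:

```python
import errno
import os
import pty

# Allocate a pseudo-terminal pair, then hang it up by closing the master,
# mimicking a console that has gone away under the writer.
master, slave = pty.openpty()
os.close(master)

try:
    os.write(slave, b"hello\n")
except OSError as e:
    # A hung-up tty rejects writes with EIO, just like grep's write(1, ...)
    # in the strace output above.
    print(e.errno == errno.EIO)
finally:
    os.close(slave)
```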

@filbranden
Contributor

cc @evverx who knows this code really well!

keszybz added a commit to keszybz/systemd that referenced this pull request Oct 8, 2019
I *think* this was originally added to make it easier to see what was happening
in tests. Later we added the functionality to print the journal on failure, so
this redirection has stopped being useful.

In systemd#13719 (comment)
@filbranden shows that grep tries to write to stdout and fails. In general,
we should not assume that writing to the console is always possible. We have
special code to handle this in pid1 after all:

99    19:22:10.731965 fstat(1,  <unfinished ...>
99    19:22:10.731993 <... fstat resumed>{st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0), ...}) = 0
99    19:22:10.732070 write(1, "ExecStartPost={ path=/bin/echo ; argv[]=/bin/echo ${4_four_ex} ; ignore_errors=no ; start_time=[Mon 2019-10-07 19:22:10 PDT] ; stop_time=[Mon 209-10-07 19:22:10 PDT] ; pid=97 ; code=exited ; status=0 }\n", 203) = -1 EIO (Input/output error)
99    19:22:10.732174 write(2, "grep: ", 6) = -1 EIO (Input/output error)
99    19:22:10.732226 write(2, "write error", 11) = -1 EIO (Input/output error)
99    19:22:10.732263 write(2, ": Input/output error", 20) = -1 EIO (Input/output error)
99    19:22:10.732298 write(2, "\n", 1 <unfinished ...>
99    19:22:10.732325 <... write resumed>) = -1 EIO (Input/output error)
99    19:22:10.732349 exit_group(2)     = ?
99    19:22:10.732424 +++ exited with 2 +++

Removing the redirection should make the tests less flaky.

Replaces systemd#13719.

While at it, also drop NotifyAccess=all. I think it was added purposefully in
TEST-20-MAINPIDGAMES, and then cargo culted to newer tests.
@keszybz
Member

keszybz commented Oct 8, 2019

See #13746.

@keszybz
Member

keszybz commented Oct 8, 2019

I think my approach in #13746 is better ;)

@evverx
Contributor

evverx commented Oct 8, 2019

@filbranden thank you for looking into this! Given that I've never seen the tests fail in VMs and lxc containers due to `grep` failing to write anything when StandardOutput is set to `tty`, it indeed seems the console is somehow broken in nspawn containers.

Regarding #13746, that approach is better in the sense that it sweeps the underlying issue under the rug completely, but the issue is still there. Plus, the premise of that PR isn't entirely accurate, because "the functionality to print the journal on failure" has never been implemented on Ubuntu CI. It's true that the journals are kept among the artifacts there, and theoretically it's possible to download and inspect them, but in practice they are just piling up without any structure that would help to figure out what exactly happens there. I'm not sure what's going on on CentOS CI, though. Judging by https://github.com/systemd/systemd-centos-ci/blob/9047b07c210c4e3a63048f2c517fa94bf2b8a190/agent/testsuite-rhel7.sh#L59, it saves the files somewhere (ignoring all the failures along the way).

@mrc0mmand
Member

> I'm not sure what's going on on CentOS CI though. Judging by https://github.com/systemd/systemd-centos-ci/blob/9047b07c210c4e3a63048f2c517fa94bf2b8a190/agent/testsuite-rhel7.sh#L59, it saves the files somewhere (ignoring all the failures along the way).

At the end of each console log is a landing page URL, which should provide access to all the necessary artifacts (including the journals), like this one: https://ci.centos.org/job/systemd-pr-build-vagrant/3794/artifact//systemd-centos-ci/index.html

The journals are collected only when a test fails, though, to avoid cluttering the logs too much.

@mrc0mmand
Member

mrc0mmand commented Oct 8, 2019

Also, the journal-collection mechanism is a little bit different in the upstream testsuite script, as it includes support for coredumpctl-related stuff. The RHEL7 and RHEL8 scripts vary depending on the functionality I had at hand, but the result should be pretty much the same (accessible journals in case something goes awry).

@evverx
Contributor

evverx commented Oct 8, 2019

The part of CentOS CI where backtraces are collected is really great, I must say! Thank you! Though, as far as I can remember, it was inspired by networkctl crashing on Ubuntu CI on i386 machines; back when I was complaining about how hard it was for me to spin up a VM manually to get the backtrace :-), I was kind of hoping the coredumpctl stuff would make it there (which has never happened, unfortunately).

Anyway, are all the logs saved? If, for example, TEST-24 fails in "unprivileged" nspawn containers only, how easy is it to figure out which of those 20 files corresponds to the last boot?

@evverx
Contributor

evverx commented Oct 8, 2019

On an absolutely unrelated note, speaking of CentOS CI: I watched @davide125's talk at All Systems Go the other day, where, as far as I can tell, it was mentioned that the Facebook folks are planning to run the scripts from systemd-centos-ci on their infrastructure internally. @anitazha @davide125 @filbranden I'm wondering if it's possible to use that infrastructure upstream. I'm asking because at some point the infrastructure provided by the CentOS project reached its limits, and some tests, like the one running systemd-networkd built with gcc under ASan/UBsan, were turned off to keep it more or less responsive when PRs are opened. But as far as I know, if we decide to run all the components of systemd under ASan+UBsan one day, it won't be possible, because the 30 machines (@mrc0mmand, please correct me if I'm wrong) are already barely moving, so to speak.

@davide125
Member

@evverx the internal infra may not be the best fit for this, but we can probably work something out. Lemme talk to a few people here and get back to you.

@evverx
Contributor

evverx commented Oct 8, 2019

@davide125 thank you!

FWIW, apart from CentOS CI: I'm not sure if you're interested in fuzzing, but it seems we'll have to turn off Fuzzit next year (which we use for fuzzing systemd when PRs are opened), because we don't have a cluster to run our own instance (which is needed because we apparently don't have the money to cover the cost of the raw resources we use there). So if it were possible to find 90 CPUs and 180 GB of RAM, that would also be great. Thank you!

@davide125
Member

@evverx, what are the current capacity requirements for the CentOS CI?

@mrc0mmand
Member

> Though, as far as I can remember, it was inspired by networkctl crashing on Ubuntu CI on i386 machines and when I was complaining about how hard it was for me to spin up a VM manually to get the backtrace :-)

That's true :-)

> Anyway, are all the logs saved? If, for example, TEST-24 fails in "unprivileged" nspawn containers only, how easy is it to figure out which of those 20 files corresponds to the last boot?

IIRC only QEMU tests generate persistent journals; the nspawn ones are included in the host journal (which is dumped at the end as well, although just in text form). I guess this could use some improvement.

> On an absolutely unrelated note, speaking of CentOS CI, I watched @davide125's talk at All Systems Go the other day where as far as I can tell it was mentioned that facebook folks are planning to kind of run the scripts from systemd-centos-ci on their infrastructure internally. @anitazha @davide125 @filbranden I'm wondering if it's possible to use that infrastructure upstream. I'm asking because at some point the infrastructure provided by the CentOS project reached its limits and some tests like the one running systemd-networkd built with gcc under ASan/UBsan were turned off to keep it more or less responsive when PRs are opened. But as far as I know if we decide to run all the components of systemd under ASan+UBsan one day, it won't be possible because the 30 machines (@mrc0mmand please correct me if I'm wrong) are already barely moving so to speak.

The CentOS CI situation has picked up since the conference, mainly in two ways:

  1. we had a discussion with @davide125 at ASG! and there's a possibility of Facebook sponsoring resources in the CentOS CI

  2. we (the majority of the plumbers team in Brno) had an internal call yesterday with the CentOS CI admin (where we also discussed the first point), and the outcome is to move systemd's Jenkins master to the existing OpenShift instance in the CentOS infra, which should make it more flexible and thus able to allocate more resources for our needs.

Of course, all this is just words for now - the email threads which should support them should appear in the following days, both to begin the migration (and discuss the process) and to go through the possible ways of improving the CentOS CI resource pool.

I wanted to make a more official statement once things were in motion, but we apparently managed to stumble upon it earlier than I expected :-)

@evverx
Contributor

evverx commented Oct 8, 2019

> @evverx, what's the current capacity requirements for the CentOS CI?

I think it depends on how the tests will be run there. Recently we discussed that running the `test/TEST-*` tests one by one (on Ubuntu CI) or sort of in parallel on a single beefy machine (on CentOS CI) isn't going to be suitable for PRs (mostly because it already takes too long for the tests to finish), and we'd most likely have to turn some of the tests off to keep it more or less responsive. It would be best if we could spread about 120 tests (40 tests * 3 PRs) across a number of machines at the same time. If we're talking about additionally running all the `test/TEST-*` tests under ASan/UBsan (currently we run TEST-01-BASIC only), we'd need to run 240 large tests simultaneously.

@evverx
Contributor

evverx commented Oct 8, 2019

> The CentOS CI situation got somewhat alive since the conference

That's definitely good news! I'm generally out of the loop so apologies for bringing this up once again :-)

Regarding Fuzzit, as far as I know, there was an attempt to apply for several grants. I'm hoping it will work out, because I don't think anyone wants to maintain a Kubernetes cluster. I also annoy the OSS-Fuzz team from time to time about regression testing on their side, but given that OSS-Fuzz generally revolves around the Chromium workflow, which is different from that of most projects using GitHub and PRs, it's unlikely to happen :-)

@evverx
Contributor

evverx commented Oct 8, 2019

> the outcome is to move the systemd's Jenkins master to the existing OpenShift instance in the CentOS infra, which should make it more flexible and thus be able to allocate more resources for our needs.

@mrc0mmand I'm wondering if it would be possible to move whatever is run on Travis CI, Semaphore CI and Azure Pipelines there. My understanding is that a lot of different services are used because the systemd project has never been able to afford to buy anything (up until now, as far as I understand), and at the same time tries to somehow use free services that are meant mostly for relatively small projects (as opposed to systemd, which is a giant umbrella project that eventually hits all the limits of the free tiers). I'd be happy to stop receiving numerous notifications from all those services and somewhat centralize the CI :-) I can count how many resources are used there if it helps.

@evverx
Contributor

evverx commented Oct 9, 2019

By the way, @anitazha would it be OK if I invited you to the systemd organization on Coverity Scan so that you could view reports at https://scan.coverity.com/projects/350? Another option would be to press the "add me to the project" button and wait for someone to get round to approving the request. We also have a list of people who have access to the systemd project on OSS-Fuzz at https://github.com/google/oss-fuzz/blob/master/projects/systemd/project.yaml (though I still have no idea how to make non-gmail accounts work properly there). I (or anybody else from that list) could add your email address there if you're interested. It's also possible to subscribe to reports from Fuzzit by signing up at https://app.fuzzit.dev/login and adding the email address linked to the github account there.

I'm sorry for bringing up several different topics here. It's just that these days I'm rarely here on GitHub, and I'm trying to discuss everything in one fell swoop while I'm at it.

@mrc0mmand
Member

> the outcome is to move the systemd's Jenkins master to the existing OpenShift instance in the CentOS infra, which should make it more flexible and thus be able to allocate more resources for our needs.

> @mrc0mmand I'm wondering if it would be possible to move whatever is run on Travis CI, Semaphore CI and Azure Pipelines there. My understanding is that a lot of different services are used because the systemd project has never been able to afford to buy anything (up until now as far as I understand) and at the same time tries to somehow use free services that are supposed to be used mostly by relatively small projects (as opposed to systemd, which is a giant umbrella project hitting all the limits on free tiers eventually). I'd be happy to stop receiving numerous notifications from all those services and somewhat centralize the CI :-) I can count how many resources are used there if it helps.

I'd say it should be possible (well, it should definitely be possible from the technological point of view), but it's hard to say for sure without knowing the resource limits (for now). However, it would be great :-)

@evverx
Contributor

evverx commented Oct 9, 2019

@mrc0mmand and one last question: is it going to be x86_64 only, or will other architectures be supported as well?

@mrc0mmand
Member

> @mrc0mmand and one last question. Is it going to be x86_64 only or will other architectures be supported as well?

Ah, thanks, that's something I forgot to mention. The thing is that the migration won't have any effect on the machines we use for the actual testing, as we're going to migrate only the scheduler, so it's still going to be x86_64 only (as that's all the CentOS CI pool actually supports). I'm not sure if there are any plans for alternative architectures in the future. It would be good to explore that a little bit further, though.

@evverx
Contributor

evverx commented Oct 9, 2019

OK. FWIW, Travis CI announced that it's now possible to test on arm: https://blog.travis-ci.com/2019-10-07-multi-cpu-architecture-support. I think if it clogs up at some point, it should probably be possible to move the x86_64 bits to the OpenShift instance, where they belong according to the documentation :-)

@mrc0mmand
Member

> OK. FWIW Travis CI announced that now it's possible to test on arm: https://blog.travis-ci.com/2019-10-07-multi-cpu-architecture-support. I think if it clogs up at some point, it should probably be possible to move the x86_64 bits to the OpenShift instance where they belong according to the documentation :-)

Excellent. Also, @xnox mentioned during ASG that ppc64le support should land in Travis CI as well, so moving the x86_64 parts somewhere else definitely makes sense.

@evverx
Contributor

evverx commented Oct 9, 2019

Well, I'm glad Travis CI is getting more useful. I'm wondering, given that a lot of things are apparently discussed at ASG, whether it was mentioned there when the "ppc64le" Ubuntu CI webhook will be turned on. I haven't heard from anyone on this matter for months.

@evverx
Contributor

evverx commented Oct 9, 2019

What I completely forgot to mention is that I think (apart from fixing `StandardOutput=tty` in nspawn containers, if it's broken somehow) it would be better to replace `StandardOutput=tty` with `StandardOutput=journal+console` instead of sending everything to the journal (which sometimes isn't saved, and generally isn't the easiest thing to find on both Ubuntu CI and CentOS CI). This way journald would additionally try to forward logs to the console (which should work most of the time, and when it doesn't, journald just ignores any errors), making the CI a little bit more debuggable.
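In unit-file terms, the proposal above would look roughly like this (a sketch of a hypothetical testsuite.service fragment, not the actual generated file):

```ini
[Service]
# Current setup: write directly to the console; a hung-up tty makes
# grep & co. fail with EIO.
#StandardOutput=tty
#StandardError=tty

# Proposed: log to the journal, and let journald *try* to forward to the
# console as well, silently ignoring console write errors.
StandardOutput=journal+console
StandardError=journal+console
```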

@anitazha
Member Author

anitazha commented Oct 9, 2019

> By the way, @anitazha would it be OK if I invited you to the systemd organization on Coverity Scan so that you could view reports at https://scan.coverity.com/projects/350? Another option would be to press the "add me to the project" button and wait for someone to get round to approving the request. We also have a list of people who have access to the systemd project on OSS-Fuzz at https://github.com/google/oss-fuzz/blob/master/projects/systemd/project.yaml (though I still have no idea how to make non-gmail accounts work properly there). I (or anybody else from that list) could add your email address there if you're interested. It's also possible to subscribe to reports from Fuzzit by signing up at https://app.fuzzit.dev/login and adding the email address linked to the github account there.

@evverx sure, I'm interested in all the things. Luckily I've been using a Gmail account for this stuff, so if you could add the.anitazha at gmail dot com where appropriate, that would be swell.

@evverx
Contributor

evverx commented Oct 9, 2019

@anitazha I've added the email address everywhere. If I didn't slip up anywhere, you should have access to reports on Coverity Scan at https://scan.coverity.com/projects/350 (the "view defects" tab) and the Fuzzit dashboard at https://app.fuzzit.dev/orgs/systemd/dashboard. Let me know if those aren't working. Just in case, Fuzzit is going to send alerts from [email protected].

https://oss-fuzz.com should start working once google/oss-fuzz#2935 is merged.

@anitazha anitazha deleted the stabilize_test_40 branch October 9, 2019 23:47
@anitazha
Member Author

@evverx Coverity and OSS-Fuzz seem to be working. Not sure about Fuzzit... Whenever I log in to the site, it prompts me to install the Fuzzit app in GitHub. Isn't it already installed for the systemd org?

@evverx
Contributor

evverx commented Oct 10, 2019

> Not sure about fuzzit... Whenever I login to the site it prompts me to install the fuzzit app in GitHub. Isn't it already installed for the systemd org?

@anitazha No, it isn't. When Fuzzit was turned on, the GitHub integration didn't exist. As far as I can remember, the idea was, among other things, to make it possible for Fuzzit to report its status here on GitHub the way LGTM and Azure Pipelines (which are installed for the systemd organization) do, which is much more conventional than receiving alerts and going to the site. Anyway, @yevgenypats, could you help me out? Could we somehow avoid installing the app for contributors who are already members of the systemd organization?

@yevgenypats
Contributor

@evverx - exactly like you said: the current systemd organisation is not connected to GitHub, thus it asks to install the application if the user is not added manually to the systemd organisation. Would you like any contributor to systemd to be added automatically to the systemd organisation? This is possible. Also, if you add the user manually, it shouldn't ask them to install the application.

@evverx
Contributor

evverx commented Oct 10, 2019

> would you like any contributor for systemd to be added automatically to the systemd organisation?

Let me think about it :-)

> Also if you add the user manually it shouldn't ask him to install the application.

I seem to have put the email address only on the list of people receiving notifications, and completely forgot to add @anitazha to the organization there. I'll fix it shortly. Thank you!
