*** DSpider has quit IRC | 01:19 | |
*** iurygregory has quit IRC | 01:41 | |
*** ysandeep|sick is now known as ysandeep | 02:27 | |
*** ykarel has joined #opendev | 08:40 | |
*** DSpider has joined #opendev | 08:54 | |
*** ykarel has quit IRC | 09:16 | |
*** ykarel has joined #opendev | 09:34 | |
*** ykarel has quit IRC | 11:27 | |
*** tosky has joined #opendev | 12:14 | |
*** hamalq has joined #opendev | 12:48 | |
*** knikolla has quit IRC | 13:17 | |
*** knikolla has joined #opendev | 13:19 | |
*** hamalq has quit IRC | 14:49 | |
*** hamalq has joined #opendev | 15:27 | |
*** fressi has joined #opendev | 16:02 | |
*** hamalq_ has joined #opendev | 16:15 | |
*** hamalq has quit IRC | 16:19 | |
*** fressi has quit IRC | 16:20 | |
*** tosky has quit IRC | 16:29 | |
*** fressi has joined #opendev | 16:43 | |
*** Alex_Gaynor has joined #opendev | 17:06 | |
Alex_Gaynor | Hey, on pyca, we're seeing intermittent (but somewhat frequent) network errors from arm64 machines, for example https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_a27/5410/ae72d6b91239c5262ed0b28792f76c7449a42ec6/check/pyca-cryptography-ubuntu-bionic-py36-arm64/a27b5f5/job-output.txt (search "Clone wycheproof") | 17:07 |
*** hamalq_ has quit IRC | 17:28 | |
*** tosky has joined #opendev | 17:28 | |
fungi | Alex_Gaynor: mmm, yeah the nodes in that cloud only have unique global ipv6 addresses so share an ipv4 nat for reaching v4-only sites like github.com, in the past we've seen similar issues with the nat table getting overrun by too many simultaneous tracked states. i wonder if it could be the same situation this time | 17:32 |
fungi | one workaround would be to declare those repositories as required projects in the zuul jobs, so that our zuul executors cache and push them onto the job nodes, then you're only cloning locally from that copy on the filesystem | 17:33 |
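A rough sketch of what that required-projects declaration could look like in the Zuul job definition (the job name is taken from the failing build linked above; the rest of the pyca configuration layout is assumed):

    - job:
        name: pyca-cryptography-ubuntu-bionic-py36-arm64
        required-projects:
          # assumes google/wycheproof has been added to the pyca tenant config
          - name: github.com/google/wycheproof

Zuul then places its cached copy of the repository under src/github.com/google/wycheproof on the job node, so the playbook can use the local checkout instead of cloning over the network.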
fungi | we've also seen that wrapping remote network operations like that in a retry helps if the problem is reasonably random | 17:34 |
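A minimal Ansible sketch of that retry pattern, assuming the clone is done with the git module (the task name, destination path, and retry counts here are illustrative, not taken from the actual pyca playbooks):

    - name: Clone wycheproof, retrying on transient network errors
      ansible.builtin.git:
        repo: https://github.com/google/wycheproof
        dest: "{{ ansible_user_dir }}/wycheproof"
      register: wycheproof_clone
      retries: 3
      delay: 10
      until: wycheproof_clone is succeeded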
fungi | also cloning those in a pre phase playbook instead of the run phase would cause zuul to just automatically and silently retry the build (up to three times by default) if that failed | 17:35 |
Alex_Gaynor | How do I put something in a pre-phase playbook? | 17:36 |
* fungi looks at that job definition real fast | 17:36 | |
Alex_Gaynor | Ah, `pre-run` key it looks like. Let me try this | 17:37 |
fungi | yeah, looks like that's all happening in the .zuul.playbooks/playbooks/tox/main.yaml called from the run phase of pyca-cryptography-base but you could move a lot of those tasks into a different playbook called in pre-run | 17:39 |
fungi | basically any job setup should generally be done in pre-run, and then the tasks you actually expect to fail for a bad patch would be all you put in the run phase | 17:40 |
fungi | that way things like network blips hitting the pre-run setup for the build would just cause it to be retried | 17:41 |
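A sketch of how the job could be split, assuming a new pre.yaml playbook is added next to the existing one (the pre-run path is an assumption; main.yaml is the run playbook mentioned above):

    - job:
        name: pyca-cryptography-base
        pre-run: .zuul.playbooks/playbooks/tox/pre.yaml   # hypothetical setup playbook
        run: .zuul.playbooks/playbooks/tox/main.yaml
    # setup tasks (cloning repos, installing toolchains, etc.) move into pre.yaml;
    # main.yaml keeps only the test invocation that a bad patch is expected to fail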
Alex_Gaynor | Ok, PR here https://github.com/pyca/cryptography/pull/5644 let's see if it helps! | 17:41 |
fungi | looks like ianw set up the initial jobs there so may have additional input once he's awake and around (should be nearly his monday morning now, though i don't recall if he was planning to be at the computer this week) | 17:41 |
Alex_Gaynor | Good news, the retries appear to work, at least 😬 | 17:43 |
fungi | so, yeah, that ought to make the builds more robust (lather rinse repeat for other stuff you want going on in pre-run instead of run), but we also need to look into what's going on with that cloud | 17:44 |
fungi | we've got a static node in a separate tenant there so i ought to at least be able to tell whether their ipv4 routing on the whole is busted | 17:44 |
Alex_Gaynor | Yeah, I'm a bit concerned that what we're going to learn is that the failure rate on this clone is high enough that even 3 retries doesn't make it robust. But hopefully this at least helps. | 17:44 |
fungi | and yes, if all three retries fail, the build will report a "retry_limit" failure result | 17:45 |
Alex_Gaynor | fungi: can `pre-run` playbooks access things from `vars.`? | 17:46 |
fungi | should be able to, yes | 17:46 |
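A small sketch showing a job variable consumed from a pre-run playbook (the variable name wycheproof_ref and the playbook path are hypothetical):

    - job:
        name: pyca-cryptography-base
        vars:
          wycheproof_ref: master   # hypothetical job variable
        pre-run: .zuul.playbooks/playbooks/tox/pre.yaml

    # in pre.yaml the same variable is available as a normal Ansible var:
    - hosts: all
      tasks:
        - name: Show the job variable
          ansible.builtin.debug:
            msg: "cloning wycheproof at {{ wycheproof_ref }}"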
*** fdegir has quit IRC | 17:46 | |
Alex_Gaynor | Great, thanks much! | 17:46 |
fungi | on saturday they noticed keepalived had stopped across all of their control plane, which killed the cloud apis so we weren't booting anything there. possible there's something else generally broken there at the moment too | 17:47 |
*** fdegir has joined #opendev | 17:47 | |
Alex_Gaynor | Ooof | 17:49 |
fungi | so the good news is that our mirror node there (which has a 1:1 ipv4 "floating ip" nat assigned) is able to clone from github | 17:51 |
fungi | so it's not total ipv4 routing failure there at least | 17:51 |
fungi | leading me to increasingly suspect the many:1 nat shared by the job nodes | 17:51 |
fungi | kevinz takes care of that environment, but isn't in here at the moment (it's also very early in his part of the world right now) | 17:52 |
fungi | our max-servers there is only 40, so in theory there are at most that many nodes sharing the same v4 nat, but at the moment utilization is low and it looks like you're probably the only one using nodes there so an overload of the nat table seems unlikely: https://grafana.opendev.org/d/pwrNXt2Mk/nodepool-linaro?orgId=1 | 17:55 |
fungi | i'll e-mail kevinz now so he'll hopefully see it once he wakes up | 17:57 |
fungi | #status log e-mailed kevinz about apparent nat problem in linaro-us cloud, cc'd infra-root inbox | 18:00 |
openstackstatus | fungi: finished logging | 18:00 |
fungi | we could take that provider offline in our nodepool config for now, but it's the only one providing arm64 nodes (and those are all it provides) so any arm64 builds would just queue indefinitely until it's returned to service | 18:02 |
Alex_Gaynor | From our perspective that'd be strictly worse; things are succeeding at a high enough rate that we can pass PRs with the retries. | 18:03 |
*** fdegir has quit IRC | 18:06 | |
*** cgoncalves has quit IRC | 18:41 | |
openstackgerrit | James E. Blair proposed openstack/project-config master: Add google/wycheproof to pyca Zuul tenant https://review.opendev.org/c/openstack/project-config/+/766864 | 19:40 |
corvus | Alex_Gaynor, fungi, ianw: i agree https://github.com/pyca/cryptography/pull/5644 should help (and is objectively the better way to write the job anyway). i think we can go a step further with https://review.opendev.org/c/openstack/project-config/+/766864 and actually have zuul do all the cloning ahead of time; that would reduce the amount of public internet traffic from the job, which may avoid retries | 19:43 |
corvus | (to be clear, that's step 1; if that lands, i'll write the step 2 change and propose it to pyca/cryptography) | 19:43 |
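For context, the tenant-side change corvus describes would look roughly like this in the pyca tenant definition (the tenant and connection names are assumptions based on the review linked above):

    - tenant:
        name: pyca
        source:
          github:
            untrusted-projects:
              - google/wycheproof

Once the project is in the tenant, the step 2 change can list it under required-projects in the job, and Zuul pushes its own cached copy of the repo to the node before any playbooks run.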
Alex_Gaynor | I'm also doing https://github.com/pyca/cryptography/pull/5645 | 19:53 |
corvus | Alex_Gaynor: ++ those are all great to have in a pre-run | 19:54 |
ianw | o/ i'll have to go through my notes but i have a vague feeling we might have seen ipv4 issues in that cloud before | 21:04 |
ianw | corvus: it seemed everyone was positive about the new zuul summary plugin repo, what's the next step? | 21:06 |
*** slaweq has quit IRC | 21:29 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: WIP: initalize gerrit in testing https://review.opendev.org/c/opendev/system-config/+/765224 | 21:29 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: system-config-run-review: remove review-dev server https://review.opendev.org/c/opendev/system-config/+/766867 | 21:29 |
*** cgoncalves has joined #opendev | 21:44 | |
*** DSpider has quit IRC | 22:50 | |
*** tkajinam has joined #opendev | 23:00 | |
*** prometheanfire has quit IRC | 23:46 | |
*** Green_Bird has joined #opendev | 23:54 |