*** DSpider has quit IRC | 01:19 | |
*** iurygregory has quit IRC | 01:41 | |
*** ysandeep|sick is now known as ysandeep | 02:27 | |
*** ykarel has joined #opendev | 08:40 | |
*** DSpider has joined #opendev | 08:54 | |
*** ykarel has quit IRC | 09:16 | |
*** ykarel has joined #opendev | 09:34 | |
*** ykarel has quit IRC | 11:27 | |
*** tosky has joined #opendev | 12:14 | |
*** hamalq has joined #opendev | 12:48 | |
*** knikolla has quit IRC | 13:17 | |
*** knikolla has joined #opendev | 13:19 | |
*** hamalq has quit IRC | 14:49 | |
*** hamalq has joined #opendev | 15:27 | |
*** fressi has joined #opendev | 16:02 | |
*** hamalq_ has joined #opendev | 16:15 | |
*** hamalq has quit IRC | 16:19 | |
*** fressi has quit IRC | 16:20 | |
*** tosky has quit IRC | 16:29 | |
*** fressi has joined #opendev | 16:43 | |
*** Alex_Gaynor has joined #opendev | 17:06 | |
Alex_Gaynor | Hey, on pyca, we're seeing intermittent (but somewhat frequent) network errors from arm64 machines, for example https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_a27/5410/ae72d6b91239c5262ed0b28792f76c7449a42ec6/check/pyca-cryptography-ubuntu-bionic-py36-arm64/a27b5f5/job-output.txt (search "Clone wycheproof") | 17:07 |
*** hamalq_ has quit IRC | 17:28 | |
*** tosky has joined #opendev | 17:28 | |
fungi | Alex_Gaynor: mmm, yeah the nodes in that cloud only have unique global ipv6 addresses so share an ipv4 nat for reaching v4-only sites like github.com, in the past we've seen similar issues with the nat table getting overrun by too many simultaneous tracked states. i wonder if it could be the same situation this time | 17:32 |
fungi | one workaround would be to declare those repositories as required projects in the zuul jobs, so that our zuul executors cache and push them onto the job nodes, then you're only cloning locally from that copy on the filesystem | 17:33 |
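A rough sketch of what that required-projects declaration could look like in the Zuul job definition (the job name is taken from the failing build linked above; the rest of the pyca configuration layout is assumed):

    - job:
        name: pyca-cryptography-ubuntu-bionic-py36-arm64
        required-projects:
          # assumes google/wycheproof has been added to the pyca tenant config
          - name: github.com/google/wycheproof

Zuul then places its cached copy of the repository under src/github.com/google/wycheproof on the job node, so the playbook can use the local checkout instead of cloning over the network.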
fungi | we've also seen that wrapping remote network operations like that in a retry helps if the problem is reasonably random | 17:34 |
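A minimal Ansible sketch of that retry pattern, assuming the clone is done with the git module (the task name, destination path, and retry counts here are illustrative, not taken from the actual pyca playbooks):

    - name: Clone wycheproof, retrying on transient network errors
      ansible.builtin.git:
        repo: https://github.com/google/wycheproof
        dest: "{{ ansible_user_dir }}/wycheproof"
      register: wycheproof_clone
      retries: 3
      delay: 10
      until: wycheproof_clone is succeeded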
fungi | also cloning those in a pre phase playbook instead of the run phase would cause zuul to just automatically and silently retry the build (up to three times by default) if that failed | 17:35 |
Alex_Gaynor | How do I put something in a pre-phase playbook? | 17:36 |
* fungi looks at that job definition real fast | 17:36 | |
Alex_Gaynor | Ah, `pre-run` key it looks like. Let me try this | 17:37 |
fungi | yeah, looks like that's all happening in the .zuul.playbooks/playbooks/tox/main.yaml called from the run phase of pyca-cryptography-base but you could move a lot of those tasks into a different playbook called in pre-run | 17:39 |
fungi | basically any job setup should generally be done in pre-run, and then the tasks you actually expect to fail for a bad patch would be all you put in the run phase | 17:40 |
fungi | that way things like network blips hitting the pre-run setup for the build would just cause it to be retried | 17:41 |
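A sketch of how the job could be split, assuming a new pre.yaml playbook is added next to the existing one (the pre-run path is an assumption; main.yaml is the run playbook mentioned above):

    - job:
        name: pyca-cryptography-base
        pre-run: .zuul.playbooks/playbooks/tox/pre.yaml   # hypothetical setup playbook
        run: .zuul.playbooks/playbooks/tox/main.yaml
    # setup tasks (cloning repos, installing toolchains, etc.) move into pre.yaml;
    # main.yaml keeps only the test invocation that a bad patch is expected to fail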
Alex_Gaynor | Ok, PR here https://github.com/pyca/cryptography/pull/5644 let's see if it helps! | 17:41 |
fungi | looks like ianw set up the initial jobs there so may have additional input once he's awake and around (should be nearly his monday morning now, though i don't recall if he was planning to be at the computer this week) | 17:41 |
Alex_Gaynor | Good news, the retries appear to work, at least 😬 | 17:43 |
fungi | so, yeah, that ought to make the builds more robust (lather rinse repeat for other stuff you want going on in pre-run instead of run), but we also need to look into what's going on with that cloud | 17:44 |
fungi | we've got a static node in a separate tenant there so i ought to at least be able to tell whether their ipv4 routing on the whole is busted | 17:44 |
Alex_Gaynor | Yeah, I'm a bit concerned that what we're going to learn is that the failure rate on this clone is high enough that even 3 retries doesn't make it robust. But hopefully this at least helps. | 17:44 |
fungi | and yes, if all three retries fail, the build will report a "retry_limit" failure result | 17:45 |
Alex_Gaynor | fungi: can `pre-run` playbooks access things from `vars.`? | 17:46 |
fungi | should be able to, yes | 17:46 |
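A small sketch showing a job variable consumed from a pre-run playbook (the variable name wycheproof_ref and the playbook path are hypothetical):

    - job:
        name: pyca-cryptography-base
        vars:
          wycheproof_ref: master   # hypothetical job variable
        pre-run: .zuul.playbooks/playbooks/tox/pre.yaml

    # in pre.yaml the same variable is available as a normal Ansible var:
    - hosts: all
      tasks:
        - name: Show the job variable
          ansible.builtin.debug:
            msg: "cloning wycheproof at {{ wycheproof_ref }}"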
*** fdegir has quit IRC | 17:46 | |
Alex_Gaynor | Great, thanks much! | 17:46 |
fungi | on saturday they noticed keepalived had stopped across all of their control plane, which killed the cloud apis so we weren't booting anything there. possible there's something else generally broken there at the moment too | 17:47 |
*** fdegir has joined #opendev | 17:47 | |
Alex_Gaynor | Ooof | 17:49 |
fungi | so the good news is that our mirror node there (which has a 1:1 ipv4 "floating ip" nat assigned) is able to clone from github | 17:51 |
fungi | so it's not total ipv4 routing failure there at least | 17:51 |
fungi | leading me to increasingly suspect the many:1 nat shared by the job nodes | 17:51 |
fungi | kevinz takes care of that environment, but isn't in here at the moment (it's also very early in his part of the world right now) | 17:52 |
fungi | our max-servers there is only 40, so in theory there are at most that many nodes sharing the same v4 nat, but at the moment utilization is low and it looks like you're probably the only one using nodes there so an overload of the nat table seems unlikely: https://grafana.opendev.org/d/pwrNXt2Mk/nodepool-linaro?orgId=1 | 17:55 |
fungi | i'll e-mail kevinz now so he'll hopefully see it once he wakes up | 17:57 |
fungi | #status log e-mailed kevinz about apparent nat problem in linaro-us cloud, cc'd infra-root inbox | 18:00 |
openstackstatus | fungi: finished logging | 18:00 |
fungi | we could take that provider offline in our nodepool config for now, but it's the only one providing arm64 nodes (and those are all it provides) so any arm64 builds would just queue indefinitely until it's returned to service | 18:02 |
Alex_Gaynor | From our perspective that'd be strictly worse; things are succeeding at a high enough rate that we can pass PRs with the retries. | 18:03 |
*** fdegir has quit IRC | 18:06 | |
*** cgoncalves has quit IRC | 18:41 | |
openstackgerrit | James E. Blair proposed openstack/project-config master: Add google/wycheproof to pyca Zuul tenant https://review.opendev.org/c/openstack/project-config/+/766864 | 19:40 |
corvus | Alex_Gaynor, fungi, ianw: i agree https://github.com/pyca/cryptography/pull/5644 should help (and is objectively the better way to write the job anyway). i think we can go a step further with https://review.opendev.org/c/openstack/project-config/+/766864 and actually have zuul do all the cloning ahead of time; that would reduce the amount of public internet traffic from the job, which may avoid retries | 19:43 |
corvus | (to be clear, that's step 1; if that lands, i'll write the step 2 change and propose it to pyca/cryptography) | 19:43 |
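For context, the tenant-side change corvus describes would look roughly like this in the pyca tenant definition (the tenant and connection names are assumptions based on the review linked above):

    - tenant:
        name: pyca
        source:
          github:
            untrusted-projects:
              - google/wycheproof

Once the project is in the tenant, the step 2 change can list it under required-projects in the job, and Zuul pushes its own cached copy of the repo to the node before any playbooks run.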
Alex_Gaynor | I'm also doing https://github.com/pyca/cryptography/pull/5645 | 19:53 |
corvus | Alex_Gaynor: ++ those are all great to have in a pre-run | 19:54 |
ianw | o/ i'll have to go through my notes but i have a vague feeling we might have seen ipv4 issues in that cloud before | 21:04 |
ianw | corvus: it seemed everyone was positive about the new zuul summary plugin repo, what's the next step? | 21:06 |
*** slaweq has quit IRC | 21:29 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: WIP: initalize gerrit in testing https://review.opendev.org/c/opendev/system-config/+/765224 | 21:29 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: system-config-run-review: remove review-dev server https://review.opendev.org/c/opendev/system-config/+/766867 | 21:29 |
*** cgoncalves has joined #opendev | 21:44 | |
*** DSpider has quit IRC | 22:50 | |
*** tkajinam has joined #opendev | 23:00 | |
*** prometheanfire has quit IRC | 23:46 | |
*** Green_Bird has joined #opendev | 23:54 |