*** ryohayakawa has joined #opendev | 00:06 | |
openstackgerrit | Merged opendev/system-config master: Don't install the track-upstream cron on review-test https://review.opendev.org/739840 | 00:26 |
gouthamr | fungi: o/ sorry - i think it's been an issue since last week, https://zuul.opendev.org/t/openstack/builds?job_name=manila-tempest-plugin-lvm | 00:29 |
gouthamr | i now see only rax-ord failures, so maybe i am confused, let me double check and compile some findings | 00:32 |
ianw | https://zuul.opendev.org/t/openstack/build/60671a686fec409e9f9226707ab03916/log/job-output.txt seems to be a dfw failure | 00:32 |
gouthamr | ianw: ack, just looked at the last ten runs and they've failed 5 times in rax-iad, 2 times in rax-ord and once in rax-dfw, and two other failures are test failures that occur sporadically; this job was quite reliable prior to this | 00:42 |
gouthamr | a reboot event in the logs looks like this: https://zuul.opendev.org/t/openstack/build/8b50366a1f3c4203985deba7a1d43605/log/controller/logs/screen-n-api.txt#3382 | 00:45 |
ianw | gouthamr: tbh i wouldn't think it was rax specific; we really haven't changed any settings | 00:47 |
gouthamr | ianw: ty, good to know - i'll check if i can twiddle some node and test settings | 00:48 |
ianw | ansible_memfree_mb: 7747 | 00:48 |
ianw | that's about where i'd expect it | 00:48 |
gouthamr | ^^ i was looking for something to indicate that, thanks ianw :) | 00:49 |
ianw | https://zuul.opendev.org/t/openstack/build/8b50366a1f3c4203985deba7a1d43605/log/zuul-info/host-info.controller.yaml | 00:49 |
gouthamr | does that get collected prior to devstack though? | 00:49 |
ianw | during devstack you've got | 00:50 |
ianw | https://zuul.opendev.org/t/openstack/build/8b50366a1f3c4203985deba7a1d43605/log/controller/logs/screen-dstat.txt | 00:50 |
ianw | right before the "reboot" there you've got | 00:52 |
ianw | 4608k 8186M | 00:52 |
ianw | so it doesn't seem the host is under memory pressure at that point | 00:52 |
gouthamr | ianw: it does, no? (used free buff cach) 5755M 126M 148M 1954M | 00:56 |
ianw | that's still ~2gb in cache there, the number before was swap so it's not heavily into swap | 00:57 |
ianw | i wouldn't say it's *not* OOM, but it doesn't feel like a smoking gun at that point of the job | 00:58 |
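A rough sketch of how those dstat columns line up (the exact flags used by devstack's screen-dstat service may differ; the figures are the ones quoted above):

    $ dstat --time --mem --swap 5
    ----system---- ------memory-usage----- ----swap---
         time     | used  free  buff  cach| used  free
    08-07 00:52:00|5755M  126M  148M 1954M|4608k 8186M

i.e. roughly 2GB still sitting in page cache and only ~4.6MB of swap in use, which is why this doesn't read as memory pressure at that point.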
gouthamr | ianw: ah, thanks; it's consistently the same set of tests failing, i'll check if there's something weird we're doing.. | 01:00
*** dtroyer has quit IRC | 01:03 | |
openstackgerrit | Merged opendev/system-config master: gitea: install proxy https://review.opendev.org/739865 | 01:08 |
openstackgerrit | Ian Wienand proposed opendev/base-jobs master: promote-deployment: use upload-afs-synchronize https://review.opendev.org/739876 | 02:30 |
ianw | change to deploy the reverse proxy is queued now | 02:43 |
*** dtroyer has joined #opendev | 02:58 | |
ianw | hrm, did i not open port 3081? ... | 03:27 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add host keys on bridge https://review.opendev.org/739414 | 03:46 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea: open port 3081 https://review.opendev.org/739884 | 03:55 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add host keys on bridge https://review.opendev.org/739414 | 03:57 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gita-lb : update listeners to reverse proxy https://review.opendev.org/739889 | 04:07 |
ianw | mordred: https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/console feels like a real problem with 726263/ | 04:09 |
ianw | raise Exception("you need a C compiler to build uWSGI") | 04:10 |
ianw | i wonder if there's a new uwsgi release and it's not a wheel? | 04:10 |
ianw | ... no that's not it | 04:10 |
ianw | it almost looks like a race condition, like uwsgi was building before dpkg finished | 04:13 |
*** ykarel|away is now known as ykarel | 04:14 | |
mordred | Weird | 04:14 |
mordred | I'll investigate in the morning if you haven't figured it out yet | 04:15 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: write-inventory: add per-host variables https://review.opendev.org/739892 | 05:03 |
*** raukadah is now known as chandankumar | 05:09 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys to inventory; give host key in launch-node script https://review.opendev.org/739412 | 05:10 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add host keys on bridge https://review.opendev.org/739414 | 05:10 |
*** marios has joined #opendev | 05:17 | |
*** ryo_hayakawa has joined #opendev | 05:19 | |
*** ryo_hayakawa has quit IRC | 05:19 | |
*** ryo_hayakawa has joined #opendev | 05:19 | |
*** tobiash has joined #opendev | 05:20 | |
*** ryohayakawa has quit IRC | 05:21 | |
*** DSpider has joined #opendev | 05:54 | |
openstackgerrit | Merged opendev/system-config master: gitea: open port 3081 https://review.opendev.org/739884 | 05:57 |
ianw | clarkb / fungi : ^ i've run out of time to put that into production -- you should be able to manually choose a host to redirect on gitea-lb for initial testing, then https://review.opendev.org/739889 would redirect all | 06:38 |
*** ianw is now known as ianw_pto | 06:38 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add host keys on bridge https://review.opendev.org/739414 | 06:51 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add host keys on bridge https://review.opendev.org/739414 | 07:17 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: write-inventory: add per-host variables https://review.opendev.org/739892 | 07:21 |
*** tosky has joined #opendev | 07:33 | |
*** fressi has joined #opendev | 07:34 | |
*** zbr has joined #opendev | 07:35 | |
*** hashar has joined #opendev | 07:39 | |
openstackgerrit | Merged zuul/zuul-jobs master: fetch-coverage-output: direct link to coverage data https://review.opendev.org/739099 | 07:59 |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:02 | |
*** dtantsur|afk is now known as dtantsur | 08:10 | |
*** AJaeger has quit IRC | 08:13 | |
*** jaicaa_ has quit IRC | 08:14 | |
*** tkajinam has quit IRC | 08:23 | |
*** bolg has joined #opendev | 08:27 | |
*** zbr7 has joined #opendev | 08:30 | |
*** diablo_rojo has quit IRC | 08:41 | |
*** jaicaa has joined #opendev | 09:00 | |
*** AJaeger has joined #opendev | 09:00 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add host keys on bridge https://review.opendev.org/739414 | 09:28 |
ianw_pto | mordred: ^ might be of interest to you as it ties in with launching nodes. with the small stack of ^, we won't forget to add host keys when creating new servers, and it also keeps host keys committed, which seems good too | 09:30
ianw_pto | https://review.opendev.org/739412 adds the host keys for all our existing servers | 09:30 |
*** ryo_hayakawa has quit IRC | 09:33 | |
*** rpittau has joined #opendev | 09:53 | |
openstackgerrit | Merged openstack/diskimage-builder master: Fix DIB scripts python version https://review.opendev.org/739844 | 10:06 |
*** bhagyashris is now known as bhagyashris|brb | 10:42 | |
*** bhagyashris|brb is now known as bhagyashris | 10:59 | |
*** ysandeep is now known as ysandeep|brb | 11:11 | |
*** ysandeep|brb is now known as ysandeep | 11:25 | |
*** hashar is now known as hasharNap | 12:02 | |
*** xiaolin has joined #opendev | 12:19 | |
*** ysandeep is now known as ysandeep|mtg | 12:29 | |
*** roman_g has joined #opendev | 12:29 | |
roman_g | Hello team. Could someone look at NODE_FAILURE's we are experiencing with 16GB and 32GB nodes, please? Example: https://zuul.opendev.org/t/openstack/builds?job_name=airship-airshipctl-32GB-gate-test | 12:32 |
roman_g | And here is another example: https://zuul.opendev.org/t/openstack/builds?job_name=airship-airshipctl-gate-test | 12:34 |
*** hasharNap is now known as hashar | 12:57 | |
*** rpittau has quit IRC | 12:57 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 12:58 |
*** rpittau has joined #opendev | 13:00 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 13:01 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 13:07 |
*** fressi has quit IRC | 13:08 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 13:09 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 13:14 |
AJaeger | roman_g: one of our clouds is down - see https://review.opendev.org/739605 | 13:16 |
*** lpetrut_ has joined #opendev | 13:21 | |
roman_g | AJaeger all right, thank you. I'm turning off voting on our jobs, as some runs still succeed. Where could I keep an eye on the cloud getting back up and running? Mailing list? | 13:27
*** ysandeep|mtg is now known as ysandeep | 13:30 | |
AJaeger | roman_g: I don't think we announce this normally, let's see whether fungi has more info ^ | 13:33 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Define maintain-github-openstack-mirror job https://review.opendev.org/738228 | 13:40 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Define maintain-github-openstack-mirror job https://review.opendev.org/738228 | 13:42 |
AJaeger | infra-root, if the change above by ttx merges, the closing of open github issues can be stopped - do we have a change for that open? Or can anybody do one, please? | 13:48 |
mordred | AJaeger: I've got a phone call in about 10 minutes - but I can work on a change for that after the call | 13:50 |
ttx | AJaeger: fungi told me it's already ripped out, it just needs to be commented out on the running server (won't be automatically put back) | 13:50
mordred | yeah - I agree | 13:51 |
mordred | it's no longer there | 13:51 |
*** ysandeep is now known as ysandeep|afk | 13:52 | |
ttx | Also does not hurt to have some overlap. I'm pretty sure the script will fail for one reason or another | 13:52 |
AJaeger | ttx: overlap is fine - I wanted to make sure we don't forget to switch it off ;) | 13:52
*** rpittau_ has joined #opendev | 13:56 | |
*** rpittau has quit IRC | 13:58 | |
*** ysandeep|afk is now known as ysandeep | 14:00 | |
*** bhagyashris is now known as bhagyashris|afk | 14:07 | |
*** bhagyashris|afk is now known as bhagyashris | 14:17 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 14:22 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 14:25 |
*** lpetrut_ has quit IRC | 14:28 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: mirror-workspace-git-repos: enable pushing to windows host https://review.opendev.org/739977 | 14:35 |
*** fressi has joined #opendev | 14:36 | |
fungi | AJaeger: roman_g: sorry, delayed start today (ironically, dealing with hvac contractors at the house)... the openedge cloud which was providing those nonstandard node flavors is offline, yes. we make sure to have plenty of redundant sources for our standard 8gb flavor, but finding more donors to supply those 16gb and 32gb flavors would be necessary for better availability. no eta on openedge returning, could | 14:37 |
fungi | be as long as october | 14:37 |
fungi | (the reason my dealing with hvac contractors is ironic is that openedge is down because the air handler for that privately-run environment gave up the ghost) | 14:38
openstackgerrit | Tobias Henkel proposed zuul/zuul-jobs master: DNM: Add unified synchronize-repos role that works with linux and windows https://review.opendev.org/740005 | 14:39 |
*** dmsimard3 has joined #opendev | 14:41 | |
*** dmsimard has quit IRC | 14:41 | |
*** dmsimard3 is now known as dmsimard | 14:41 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed zuul/zuul-jobs master: DNM: test ensure-podman https://review.opendev.org/740006 | 14:41 |
*** fressi has quit IRC | 14:46 | |
*** priteau has joined #opendev | 14:48 | |
*** mlavalle has joined #opendev | 14:50 | |
*** fressi has joined #opendev | 14:51 | |
*** ysandeep is now known as ysandeep|away | 14:55 | |
*** fressi has quit IRC | 15:00 | |
AJaeger | fungi, could you review 738228 , please? | 15:02 |
*** ykarel is now known as ykarel|away | 15:04 | |
*** priteau has quit IRC | 15:04 | |
openstackgerrit | Sorin Sbarnea (zbr) proposed zuul/zuul-jobs master: DNM: test ensure-podman https://review.opendev.org/740006 | 15:11 |
fungi | AJaeger: you bet | 15:13 |
roman_g | fungi all right, thanks. | 15:13 |
mordred | fungi: the other provider of those right now is city, yes? | 15:14 |
roman_g | And kna | 15:14 |
fungi | kna is the region in citycloud, yes | 15:16 |
roman_g | Oh, yes | 15:17 |
roman_g | >> as long as october - anything I can help with? | 15:18 |
mordred | roman_g: I think for openedge it's mostly about getting his HVAC system replaced, so I don't think there's much we can do to help - unless someone wants to buy him a new HVAC :) | 15:20 |
roman_g | All right =) | 15:20 |
AJaeger | so, if we have two more regions for those nodes - should roman_g really see NODE_FAILURES? | 15:21 |
mordred | we might want to search out ways to get large flavors from other providers ... perhaps there's a source of funding somewhere that could be given to vexxhost or one of the other clouds to provide some. not 100% sure what's the best path there | 15:21 |
roman_g | Still seeing NODE_FAILUREs, yes. | 15:21 |
roman_g | Probably citycloud can't schedule that many large instances. | 15:21 |
roman_g | Any logs I can check on this regard by myself? | 15:22 |
roman_g | So that I could have something in hand to go with it to Ericsson/CityCloud. | 15:22 |
corvus | node_failure should never be seen; it means an unexpected error occurred | 15:23 |
corvus | i can look into that in a few minutes | 15:24 |
fungi | yeah, i want to say the last time this came up, citycloud has available quota for the instances but they can't place them due to unavailability of suitable hypervisor host capacity (something like that) | 15:24
roman_g | fungi, yep, I remember | 15:24 |
roman_g | corvus thank you | 15:24 |
fungi | and yeah, if that's (still) the case then our nodepool launcher logs should indicate nova api errors | 15:25
*** Eighth_Doctor is now known as Conan_Kudo | 15:28 | |
*** Conan_Kudo is now known as Eighth_Doctor | 15:28 | |
smcginnis | Folks probably have seen this, but Gerrit has been donated to the new Open Usage Commons. | 15:30 |
smcginnis | https://openusage.org/news/introducing-the-open-usage-commons/ | 15:30 |
corvus | zuul scheduler is out of disk space | 15:34 |
mordred | oh look - wendar is on that board | 15:34 |
mordred | corvus: :( | 15:34 |
corvus | looks like a spike at 6:25 | 15:34 |
mordred | corvus: 23G in /var/log/zuul - current debug log is 5G | 15:36 |
corvus | mordred: did it get rotated successfully? | 15:38 |
mordred | corvus: yes | 15:38 |
mordred | corvus: -rw-rw-rw- 1 zuuld zuuld 906501839 Jul 8 06:25 debug.log.1.gz | 15:38 |
mordred | -rw-rw-rw- 1 zuuld zuuld 5386439154 Jul 8 15:38 debug.log | 15:38 |
corvus | /root/.bup is very large; 25G ? | 15:39 |
mordred | hrm. are we missing an exclusion? | 15:40 |
corvus | mordred: the logs are on their own partition, so we can discount them | 15:41 |
mordred | nod | 15:41 |
mordred | corvus: /var/log is not in bup-excludes | 15:41 |
corvus | seems like the 25G of bup out of 39G is a big fish | 15:42 |
mordred | yeah | 15:42 |
fungi | we recently removed a huge backup of old zuul from ~root there, right? maybe bup was tracking that previously? | 15:43 |
fungi | #status log commented out old githubclosepull cronjob for github user on review.o.o | 15:48 |
openstackstatus | fungi: finished logging | 15:48 |
corvus | i'm confused; "bup ls" shows nothing | 15:50 |
fungi | `bup ls` also returns nothing on the review server | 15:51 |
corvus | git log returns fatal: your current branch 'master' does not have any commits yet | 15:52 |
fungi | same on review.o.o, so maybe that's normal? | 15:53 |
corvus | 'git log root' also doesn't work | 15:53 |
corvus | fungi: or none of our backups work at all. it's probably been 5 years since we did a restore test? | 15:53 |
fungi | yeah, and `git ls-files` is similarly empty | 15:53 |
fungi | trying to remember, i had to restore something fairly recently... what was it | 15:53 |
fungi | oh, was a config backup on lists.o.o | 15:54 |
fungi | checking | 15:54 |
corvus | also, the backup server is close to full | 15:54 |
corvus | 'git log root' on the backup server works | 15:55 |
corvus | i have no idea what the .bup directory on the client is for | 15:55 |
fungi | and yeah, same observed behaviors on lists.o.o, so either this has broken in the last few months (since i did a restore there successfully) or it's how bup works | 15:56 |
mordred | corvus: ~/.bup/index-cache is where all of the data is in that | 15:57 |
corvus | i'm not sure if we can prune that and still have it work, or if we'd need to rotate the remote backup too | 15:58 |
openstackgerrit | jacky06 proposed openstack/diskimage-builder master: Remove glance-registry https://review.opendev.org/739796 | 15:58 |
*** hashar has quit IRC | 15:58 | |
fungi | aha, `bup ls -al` | 15:59 |
fungi | manpage ftw | 15:59 |
corvus | that's not making anything more clear :( | 15:59 |
fungi | strangely it contains empty .commit and .tag dirs | 15:59 |
fungi | yeah | 16:00 |
fungi | and nothing else? | 16:00 |
mordred | I agree - and also don't know what that means | 16:01 |
*** dtantsur is now known as dtantsur|afk | 16:01 | |
corvus | i'm inclined to just blow away the client side of this, and see if backups still work | 16:02 |
corvus | and probably it's also time to rotate the server side anyway, so if they don't work that would fix it | 16:02 |
mordred | corvus: the client side of this doesn't seem to be providing any value | 16:02
corvus | mordred: well, i don't know what it does | 16:02 |
corvus | it may be part of how it deduplicates stuff | 16:03 |
corvus | or it may be nothing | 16:03 |
mordred | nod | 16:03 |
fungi | `bup index -p` provides the file index | 16:03 |
fungi | and seems to work | 16:03 |
*** diablo_rojo has joined #opendev | 16:04 | |
corvus | fungi: that produces no output for me on zuul01 | 16:04
fungi | well, works on lists.o.o | 16:04 |
fungi | but yeah, empty for me on zuul.o.o | 16:04 |
fungi | ahh, empty on review.o.o too | 16:05 |
corvus | we don't usually use index, we just do split and join, right? | 16:05 |
fungi | so i think it only prints paths because `bup index` was manually run there previously | 16:05 |
corvus | yeah, that makes sense | 16:05 |
fungi | and yes, our docs rely on bup join for restoration | 16:05 |
corvus | so given our typical split/join usage of bup, i still don't know what the client side .bup dir is for | 16:05 |
fungi | and it's how i managed to restore successfully on lists.o.o previously | 16:05 |
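For reference, the split/join flow being referred to looks roughly like this (hostnames, repo path and branch name here are illustrative, not the exact ones in our backup scripts):

    # backup: stream a tarball into bup, stored under a named branch on the backup server
    tar -czf - /etc /var/lib/important | bup split -r bup@backup01.example.org:/opt/backups/zuul01 -n zuul01-filesystem

    # restore: reassemble the stream from the backup server and unpack it
    bup join -r bup@backup01.example.org:/opt/backups/zuul01 zuul01-filesystem | tar -xzf -

With only split/join in use, the client-side ~/.bup mostly holds an index cache, which would be consistent with `bup ls` coming back empty locally.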
*** marios is now known as marios|out | 16:06 | |
fungi | i concur, it doesn't seem like it's providing anything | 16:06 |
fungi | i wonder if it's providing some local cache to speed up incremental backup | 16:06 |
openstackgerrit | Merged openstack/project-config master: Define maintain-github-openstack-mirror job https://review.opendev.org/738228 | 16:07 |
corvus | https://groups.google.com/forum/#!topic/bup-list/EP19diB3ADk | 16:10 |
corvus | apparently we're not supposed to notice it, and we can delete it | 16:11 |
corvus | fungi, mordred: i think i should rm -rf ~/.bup on zuul01 and then run bup init | 16:12 |
mordred | I agree | 16:13 |
mordred | ++ | 16:13 |
fungi | corvus: we've failed to not notice it i suppose. i concur with this plan | 16:13 |
corvus | #status log removed zuul01:/root/.bup and ran 'bup init' to clear the client-side bup cache which was very large (25G) | 16:14 |
openstackstatus | corvus: finished logging | 16:14 |
mordred | corvus: there is much more spaces now | 16:14 |
corvus | so many spaces | 16:15 |
fungi | all the spaces | 16:15 |
fungi | do we need to restart the scheduler, or is it not impacted | 16:15 |
corvus | i think the only impact would be failing to generate a key for a new project; i think we can just restart if someone complains | 16:16 |
corvus | the time database may be out of date, but that should recover without a restart | 16:19 |
*** roman_g has quit IRC | 16:22 | |
*** gouthamr has quit IRC | 16:26 | |
corvus | 2020-07-08 14:39:59,163 ERROR nodepool.NodeLauncher: [e: 7d1d1ee806274202b8fba3d14450c712] [node_request: 300-0009826852] [node: 0017913446] Detailed node error: No valid host was found. There are not enough hosts available. | 16:27 |
corvus | fungi: ^ that's what you were expecting? | 16:27
fungi | yep, that's what we were getting from them in the past | 16:28 |
corvus | roman_g has left | 16:29 |
fungi | i believe the explanation was they either lacked actual capacity on the compute nodes, or lacked sufficient capacity to place instances of the requested flavor on any | 16:29 |
corvus | anyway, i think that's it for data collection; i'm not sure what we should do about that | 16:29 |
corvus | other that perhaps remove the node types if we don't have any clouds that can reliably provide them | 16:30 |
fungi | i think it's always been that way, which is why openedge started providing capacity for that size instances so at least they could be filled from somewhere | 16:30 |
corvus | if we take a high level step back: i think our choices are either: 1) find a way to reliably provide the nodes; 2) don't offer the nodes; 3) keep retry-bashing the one provider that does offer them and mask the high failure rate from end users | 16:32 |
corvus | exactly how we achieve any of those is tactics, but i think those are the strategies | 16:32
openstackgerrit | Merged openstack/project-config master: Add nested-virt-centos-8 label https://review.opendev.org/738161 | 16:33 |
fungi | any indication as to whether we are actually ever getting any of them at all in citycloud? | 16:33 |
corvus | iiuc, this link from roman_g earlier suggests "yes sometimes" https://zuul.opendev.org/t/openstack/builds?job_name=airship-airshipctl-32GB-gate-test | 16:33 |
corvus | 49% of the time | 16:34 |
fungi | oh, okay. i suspect that artificially lowering the quota won't help, it's probably a binpacking problem on the compute nodes | 16:39 |
fungi | in theory there's sufficient resources, but finding a single compute node with the necessary capacity fails on placement | 16:40 |
mordred | not enough hosts available seems like a form of a quota issue | 16:40 |
mordred | like - from an end-user perspective - except with a worse error message | 16:40 |
mordred | because it's not "this request is impossible" it's "this request is impossible right now" | 16:41 |
mordred | I think that's me arguing for a form of 3 | 16:41 |
fungi | well, if they're running at 50% utilization in the host aggregate but evenly distributing the load across all their compute nodes and the nodes themselves can only handle up to 64gb ram, then they could be way underutilized but still unable to find a single compute node which can host a 32gb vm | 16:41 |
mordred | totally | 16:42 |
mordred | which is why it isn't Actually a quota issue nor would it necessarily be fixed by adjusting quota | 16:42 |
fungi | anyway, no clue if that's the exact problem, lots of ifs in there, but we've reported it to them previously | 16:42 |
mordred | but as an API consumer it's similar - a valid request for resources is rejected by the cloud | 16:42 |
mordred | and that same request, retried later, might work | 16:43
fungi | i suspect the 16gb requests are succeeding more often than the 32gb requests | 16:43 |
mordred | I'd also suspect that | 16:43 |
*** chandankumar has quit IRC | 16:44 | |
*** ykarel|away has quit IRC | 16:46 | |
corvus | if we want to retry-bash (3), then what tactic should we follow? | 16:46 |
corvus | we currently try to predict whether a request will succeed, only try it if we think it will, and if it fails 3 times, give up | 16:47 |
corvus | we can't predict this | 16:47 |
corvus | do we want to special case the error message: Detailed node error: No valid host was found. There are not enough hosts available. | 16:47 |
corvus | and discount that so it never counts against the retry attempts? | 16:47 |
*** hashar has joined #opendev | 16:47 | |
corvus | and are we sure that's never a permanent error? | 16:48 |
corvus | (like, what if we get that error for some other reason (data center loses power); are we okay sitting in a retry loop for that?) | 16:48 |
fungi | i have a feeling there are cases where it is an effectively permanent error | 16:50 |
fungi | like provider has a flavor which can never actually be placed on any of their infrastructure | 16:50 |
fungi | i suppose the question is whether node_error and a manual recheck is preferable to having builds and their respective changes sit indefinitely in the queue | 16:51 |
fungi | as roman_g speaks for the primary consumer of this flavor, i'm inclined to leave it this way until we can get his feedback on it | 16:53 |
fungi | though more generally, i don't really know if it's better to have changes stuck in queue effectively forever, or to emit node_failure results when clouds are having placement issues | 16:54 |
corvus | there is a launch-retries option, so we could bump it up to 1000 attempts or something, but unfortunately, that just means we're going to sit there trying the same api request over and over until it succeeds | 16:57 |
corvus | given some of our jobs take hours, we might have to bump it up to 10,000 retry attempts to mask it. | 16:58 |
corvus | it's possible our providers may notice that. | 16:58 |
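For context, launch-retries is a per-provider knob in the nodepool launcher config; raising it looks roughly like this (provider, label and flavor names here are illustrative):

    providers:
      - name: citycloud-kna1            # hypothetical provider entry
        driver: openstack
        cloud: citycloud
        launch-retries: 3               # bumping this just replays the same server-create call
        pools:
          - name: main
            max-servers: 10
            labels:
              - name: airship-32GB
                flavor-name: 32GB-flavor
                diskimage: ubuntu-bionic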
*** fressi has joined #opendev | 17:00 | |
corvus | maybe we could do something where if we detect that error, we say: (a) don't count this as a retry attempt; and (b) don't retry this until another server has been deleted | 17:00 |
corvus | basically, don't count it as a retry and pause the provider | 17:00 |
corvus | i think i could see that working and being sustainable. it may not be a simple change to nodepool though. | 17:00
mordred | yeah | 17:01 |
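Roughly the decision being proposed, as a hypothetical sketch (nodepool's real launcher code is not structured like this):

    # Hypothetical sketch of the "don't charge a retry, pause instead" idea.
    NO_VALID_HOST = "No valid host was found. There are not enough hosts available."

    def on_launch_error(error_message, attempts_used, max_retries=3):
        """Return (count_this_attempt, pause_provider, give_up)."""
        if NO_VALID_HOST in error_message:
            # Capacity/placement failure: don't burn a retry; pause the
            # provider until another server is deleted, then try again.
            return False, True, False
        give_up = attempts_used + 1 >= max_retries
        return True, False, give_up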
fungi | provider pausing does sound interesting | 17:05 |
fungi | though i guess it would be a label-specific pause? | 17:05 |
fungi | in situations like this the provider may have no trouble placing 8gb requests but still fail a majority of 32gb requests | 17:06 |
*** chandankumar has joined #opendev | 17:11 | |
*** marios|out has quit IRC | 17:12 | |
corvus | that would be counter-productive for us though, so i think we would want to completely pause and wait for it to drain | 17:13 |
corvus | (this is generally how nodepool providers work -- one request at a time) | 17:14 |
corvus | and of course, we would only want to do this if we knew we were the last remaining provider that could handle that request | 17:14 |
*** rh-jelabarre has quit IRC | 17:17 | |
*** roman_g has joined #opendev | 17:28 | |
roman_g | fungi corvus I read the logs. Thanks for the analysis. I've got rid of the 32GB jobs, and we only have 16GB jobs left as of now (waiting for Zuul to merge the PS with WF +1; other PSes need to be rebased on the future master). | 17:37
*** hashar has quit IRC | 17:43 | |
corvus | roman_g: ok. that seems likely to reduce the number of incidents; will be interesting to confirm | 17:59 |
roman_g | We impose quite a load on nodepool in the airshipctl project. For example, today alone 28 patch sets have been updated, many of them more than once. We used 6x 8GB, 2x 32GB, 2x 16GB. Now we are down to 6x 8GB & 1x 16GB. I think I can also squash 2 of the 8GB jobs into one. | 18:04
*** sshnaidm|ruck is now known as sshnaidm|afk | 18:25 | |
mordred | https://www.perl.com/article/announcing-perl-7/ | 18:34 |
*** hashar has joined #opendev | 18:36 | |
fungi | but i had only just gotten used to perl 5? | 18:43 |
*** gouthamr has joined #opendev | 18:53 | |
*** rpittau_ has quit IRC | 19:01 | |
*** roman_g has quit IRC | 19:07 | |
mordred | fungi: perl7 is apparently perl5 | 19:13 |
mordred | (but with a few of the common pragmas turned on by default) | 19:13 |
mordred | poor perl 6 | 19:13 |
mordred | corvus: I've got to AFK for a bit - but it looks like https://review.opendev.org/#/c/726263/ has hit the nefarious "incorrect arch" issue | 19:14 |
fungi | perl6 was a fun experiment, i suppose | 19:14 |
mordred | corvus: you can see it in the console error log with the amd64 apt repos being used in the arm64 builder | 19:14 |
*** rh-jelabarre has joined #opendev | 19:15 | |
mordred | (the symptom which causes it to fail is going to be that we build wheels for one arch in the builder image and then copy them to the wrong arch causing pip to not find wheels and want to build them - except it's the final image so there's no compiler - but the root cause is going to be the arch mismatch) | 19:16 |
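The two-stage image pattern being discussed is roughly the following (approximate; the actual Dockerfile in 726263 may differ in detail):

    ARG PYTHON_VERSION=3.7
    # stage 1: has a compiler, builds wheels for the project and its dependencies
    FROM docker.io/opendevorg/python-builder:${PYTHON_VERSION} as builder
    COPY . /tmp/src
    RUN assemble

    # stage 2: no compiler; only installs the wheels produced above
    FROM docker.io/opendevorg/python-base:${PYTHON_VERSION}
    COPY --from=builder /output/ /output
    RUN /output/install-from-bindep

If buildx pairs an arm64 builder stage with an amd64 final stage (or vice versa), the copied wheels are for the wrong architecture, pip falls back to building from sdist, and the "you need a C compiler to build uWSGI" error is what surfaces.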
noonedeadpunk | hey everyone... need some help with zuul multinode jobs. For some reason the ovs ip isn't pingable here https://zuul.opendev.org/t/vexxhost/build/ebd87dde2ffb4bf2835c21e53e5be32e/log/job-output.txt#730 | 19:17
*** gmann_ has joined #opendev | 19:17 | |
*** gmann_ is now known as gmann | 19:18 | |
fungi | noonedeadpunk: what's the overlay there? vxlan or gre? | 19:24 |
fungi | i think we've generally defaulted to vxlan in opendev because a number of our providers lack the conntrack plugin to allow gre to cross between compute nodes | 19:25 |
corvus | mordred: i think i need an eli5 for that | 19:26 |
noonedeadpunk | fungi: vxlan I guess - it's the default one | 19:27
noonedeadpunk | yeah, previous job has ip a output. https://zuul.opendev.org/t/vexxhost/build/28580dda752044e695992546e11171e3/log/job-output.txt#765 | 19:28 |
fungi | ahh, okay | 19:28 |
corvus | mordred: i'm having particular trouble with "copy them to the wrong arch" | 19:28 |
mordred | corvus: it's possible I said too many words | 19:29 |
mordred | corvus: look at https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1994-2000 for the actual issue | 19:29 |
mordred | the other words were a description of why the c compiler errors are misleading | 19:29 |
mordred | but see on 1994 we're in linux/arm64 and then on line 2000 we're pulling from buster amd64 | 19:29 |
*** donnyd has joined #opendev | 19:31 | |
mordred | corvus: I don't have the whole picture - but the mismatch there matches the reported behavior from ianw from a few weeks ago | 19:31 |
corvus | mordred: the most recent from line is "FROM docker.io/opendevorg/python-base:${PYTHON_VERSION}" are you saying that pulled the wrong image? | 19:31 |
*** aannuusshhkkaa has joined #opendev | 19:32 | |
corvus | mordred: (i'm assuming "install-from-bindep" doesn't have any arch stuff on it, it's just saying "apt-get update", so an arch mismatch like that must be because the current image is wrong, yeah?) | 19:32 |
mordred | corvus: that's the hypothesis - that somehow that's pulling the incorrect arch | 19:32 |
*** mnaser has joined #opendev | 19:32 | |
mordred | we have yet to duplicate that behavior though | 19:32 |
mordred | corvus: yeah. | 19:32 |
mordred | corvus: I wonder if there is an issue at play with builder images here | 19:33 |
mordred | since a multi-stage build dockerfile is a more complicated construct | 19:33 |
mordred | but - honestly - since we've never seen this mistake from buildx in our other testing - I honestly have no clue | 19:34 |
corvus | this output is exceedingly difficult to read with exceeding difficulty | 19:35 |
corvus | it looks like it's constantly reprinting entire stanzas just to make incremental changes | 19:35 |
mordred | ooh! | 19:36 |
mordred | corvus: look at this: | 19:36 |
mordred | corvus: https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1484-1485 | 19:36 |
mordred | that's in the builder image stage | 19:36 |
corvus | oh good find, i was focused on the python-base image | 19:36 |
mordred | which is ostensibly an amd64 build which is deciding to install arm64 packages | 19:37 |
mordred | corvus: and then the arm builder is also building arm: https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1492-1493 | 19:37 |
corvus | mordred: so we've seen it crossed in both directions? arm->amd and amd->arm? | 19:38 |
mordred | so - it seems like docker is using an arm version of python-builder for both arm and amd versions of the build stage - and then it's using the amd version of python-base in the second stage for both arm and amd | 19:39
mordred | yup | 19:39 |
*** rpittau has joined #opendev | 19:40 | |
corvus | mordred: https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1847-1861 | 19:41 |
corvus | mordred: that series appears to show the same shas being used for both arches | 19:41 |
corvus | mordred: moreover https://zuul.opendev.org/t/openstack/build/53a7d27cb2cd4ee2830b384a2d91b125/log/job-output.txt#1087-1093 (and the items above it) appears to show only one download for each image | 19:42 |
corvus | all on arm64 | 19:42 |
mordred | corvus: I agree | 19:42 |
mordred | corvus: https://zuul.opendev.org/t/openstack/build/700f3fa8cb5a49f1b75df928e61004e7/log/job-output.txt#1073 is from a working build that also shows same shas | 19:42 |
fungi | noonedeadpunk: so the failure there is in the frrouting role you've added, i guess? | 19:43 |
noonedeadpunk | yep... so I'm trying to add bgp, but need internal network for that... | 19:43 |
*** jentoio has joined #opendev | 19:43 | |
corvus | mordred: i'm a little confused about how, if we downloaded only the arm images, we ended up with amd python-base | 19:43 |
fungi | noonedeadpunk: oh, i see, it's tasks after frrouting | 19:44 |
mordred | corvus: yah | 19:44 |
mordred | corvus: well - are we sure we only downloaded arm for each? we seem to certainly have only downloaded one arch for each | 19:45 |
mordred | but maybe what looks like arm downloading is actually downloading an amd? | 19:45 |
corvus | mordred: not at all; i assumed that the arm64 thingy would download the arm64 arch | 19:45 |
mordred | (because it certainly seems like we have amd python-base when we use it) | 19:46 |
noonedeadpunk | fungi: when I launch 2 instances in our public cloud (which is added for CI as well), it works nicely | 19:47 |
noonedeadpunk | So maybe there are some extra variables set in the environment that make things work differently... | 19:48
*** jrosser has joined #opendev | 19:48 | |
corvus | mordred: i'm going back to the python-base build for this: https://zuul.opendev.org/t/openstack/build/05bd1a9f1f804f4e936e601ebad2e5b3/log/job-output.txt#1438-1447 | 19:48 |
corvus | mordred: 5f98 is the python-base we got | 19:48 |
corvus | mordred: the way i read that is that we build one arch (5f98), then another arch (eebf), then the manifest list to bind them together (e4bd) | 19:48
corvus | mordred: if that's right, then 5f98 indicates which arch we got (ie, it's not merely the manifest list) | 19:49 |
mordred | I read that that way too | 19:50 |
mordred | corvus: https://zuul.opendev.org/t/openstack/build/05bd1a9f1f804f4e936e601ebad2e5b3/log/job-output.txt#1093 | 19:50 |
mordred | push arch specific layers one at a time | 19:50 |
*** hillpd has joined #opendev | 19:51 | |
fungi | noonedeadpunk: a network diagram for this would help, but based on the get routes and get ip a task outputs from 28580dda752044e695992546e11171e3 i infer that "primary" has both 172.24.4.1/23 and 172.24.4.2/23 bound to br-infra while "secondary" has 172.24.4.3 on its br-infra interface, then in ebd87dde2ffb4bf2835c21e53e5be32e "primary" is trying to ping 172.24.4.3 on "secondary" across br-int and seems to | 19:51 |
fungi | have no entry in its arp table for that? (dumping the arp table might help clarify here) | 19:51 |
*** zbr_ has joined #opendev | 19:51 | |
fungi | er, across br-infra not br-int | 19:52 |
corvus | mordred: i think that tells us that 5f98 is amd because we pushed it first, yeah? that jives with your observed error that we got amd for -base | 19:52 |
mordred | yeah | 19:52 |
mordred | corvus: I wonder if it's as simple as taking out that one-at-a-time behavior which we shouldn't need any more because you added locking in zuul-registry | 19:52
corvus | i'm not there yet | 19:53 |
mordred | kk | 19:53 |
*** rm_work has joined #opendev | 19:53 | |
corvus | mordred: moving to the -builder image; we saw sha bb4f in our error build. this looks like it was the second one pushed: https://zuul.opendev.org/t/openstack/build/86101c17f7fe4a23840142ed44ad0fab/log/job-output.txt#2756 | 19:54 |
corvus | mordred: that should mean bb4f is the arm image, which also matches what we observed | 19:54 |
mordred | yah | 19:55 |
corvus | mordred: so i think we've confirmed that we did indeed download the amd base and arm builder images | 19:55 |
*** jbryce has joined #opendev | 19:55 | |
corvus | mordred: if we assume that the job is using only one of the arch images in the multi-stage build, how does it ever work? | 19:56 |
corvus | mordred: shouldn't there always be an erroneous case? | 19:56 |
noonedeadpunk | fungi: ok, I think the multinode role is doing some weird things.... | 19:56
mordred | corvus: not if there is a matching wheel for the package on pypi - it might always be "broken" but we might not notice | 19:58
noonedeadpunk | fungi: this port is incorrect.. https://zuul.opendev.org/t/vexxhost/build/7e604332e91b4e38a418efc4414c20b0/log/job-output.txt#858 | 19:58 |
noonedeadpunk | wondering why it's created... | 19:58 |
mordred | nevermind. there are no pypi wheels for uwsgi | 19:58 |
fungi | noonedeadpunk: also worth noting, broadcast traffic (including arp who-has) across vxlan requires multicast be passed through the underlying infrastructure, which at least some of our providers disallow | 19:59 |
mordred | corvus: so - yeah- I'm confused as to why all of the builds didn't fail :( | 19:59 |
corvus | mordred: i feel like we have to entertain the theory that maybe there are 4 images on the builder despite us not having output indicating that? | 20:00 |
noonedeadpunk | oh, but in sandbox I have the same port... | 20:00 |
noonedeadpunk | so yeah, maybe the problem is that such a config is not allowed by some providers... | 20:00
corvus | mordred: (and sometimes those 4 are [base-amd, base-arm, builder-amd, builder-arm] and sometimes they are [base-amd, base-amd, builder-arm, builder-arm]?) | 20:01
noonedeadpunk | fungi: can I ask for a hold? I guess I've rechecked enough times and I should probably land at least once on a provider that allows multicast... | 20:03
fungi | noonedeadpunk: i want to say that the multi-node-bridge role establishes all ports as point-to-point to avoid needing to use broadcast traffic, so i'm not sure if it's really arp not making it through for that reason, but sure a hold would be a good way to explore this | 20:03 |
fungi | noonedeadpunk: so an autohold for the vexxhost tenant, vexxhost/ansible-role-frrouting project, ffrouting-deploy job, 739717 change? | 20:05 |
noonedeadpunk | yes, exactly) | 20:05 |
*** knikolla has joined #opendev | 20:05 | |
fungi | noonedeadpunk: it's set now, recheck at your convenience and let me know when you've got a failure so i can find the nodes and add your ssh key to them | 20:07 |
fungi | this fails early enough i expect that should be pretty quick | 20:07 |
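For the record, a hold like that is created with something along these lines (the reason string and count are illustrative; tenant, project, job and change are the values mentioned above):

    zuul autohold --tenant vexxhost \
        --project vexxhost/ansible-role-frrouting \
        --job ffrouting-deploy \
        --change 739717 \
        --reason "noonedeadpunk debugging multinode vxlan/arp failure" \
        --count 1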
corvus | mordred: why are you thinking the 'push one at a time' isn't relevant? | 20:09 |
*** johnsom_ has joined #opendev | 20:10 | |
*** mwhahaha has joined #opendev | 20:10 | |
corvus | mordred: (shouldn't the end state end up correct? and the job that fails isn't going to start until everything is pushed including the manifest list) | 20:10 |
corvus | mordred: or rather, why *is* it relevant | 20:10 |
mordred | corvus: I think I was thinking that maybe the subsequent pushes weren't actually overwriting things properly? | 20:11 |
mordred | yeah | 20:11 |
mordred | (I read isn't as is : )) | 20:11 |
mordred | corvus: but - I don't have any specific output to indicate that - it was more a hunch from the sequence in how we built them matching up with our failure case shas | 20:12 |
mordred | but - if that were actually the case I'd expect all of them to fail the same way | 20:12
corvus | ack | 20:13 |
corvus | mordred: do you think it would help to hold a node and see if we can poke at the builders and see what images they have? | 20:14 |
corvus | maybe we can confirm whether there's 2 or 4 or... i'm not familiar with how much we can poke around inside the buildx builders | 20:14 |
mordred | corvus: probably? although now I'm a little worried about what happens if we don't hit it - so maybe we should remove the approval just in case - and might have to recheck it a couple of times | 20:15 |
corvus | mordred: yes, i'll write some notes in a review and WIP it | 20:15 |
mordred | ultimately those builders should just be docker containers - so I'd hope we can exec into them | 20:15
mordred | cool | 20:15 |
*** wendallkaters has joined #opendev | 20:16 | |
*** seongsoocho has joined #opendev | 20:17 | |
corvus | mordred: take a look at my comments on https://review.opendev.org/726263 | 20:21 |
noonedeadpunk | fungi: it has already failed:) my key is here https://launchpad.net/~noonedeadpunk/+sshkeys | 20:21 |
corvus | kevinz: are you still using these held nodes from march? http://paste.openstack.org/show/795676/ | 20:24 |
fungi | noonedeadpunk: ssh root@158.69.65.199 and root@158.69.70.221 | 20:26 |
noonedeadpunk | many thanks | 20:26 |
fungi | sure, and please let me know once you're done with them | 20:26 |
corvus | mordred: autohold in place and rechecked | 20:26 |
mordred | corvus: lgtm | 20:27 |
mordred | corvus: does the +A from ianw need to be removed? or will the -W be good enough? | 20:27 |
fungi | -w will block merge | 20:27 |
mordred | cool | 20:28 |
mordred | corvus: I'm truly fascinated to find out what the situation is here :) | 20:29 |
corvus | ayup | 20:29 |
corvus | we're goin' fishin' | 20:29 |
*** Open10K8S has joined #opendev | 20:30 | |
noonedeadpunk | fungi: so yeah, arp is not passing... `172.24.4.2 (incomplete) br-infra` | 20:31 |
fungi | glad to know all those years as a network engineer weren't for nothing | 20:32 |
fungi | noonedeadpunk: if you set up static arp entries can they communicate? | 20:33 |
noonedeadpunk | fungi: nope | 20:35 |
noonedeadpunk | so I guess it's smth in vxlan and ovs.. | 20:35 |
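A few things to check on the held nodes to narrow down where the traffic stops (interface and addresses as used in this job; the MAC is a placeholder):

    # neighbour table for the overlay bridge ("(incomplete)" means ARP never resolved)
    ip neigh show dev br-infra

    # static neighbour entry to separate ARP problems from tunnel problems
    sudo ip neigh replace 172.24.4.3 lladdr 00:11:22:33:44:55 dev br-infra nud permanent
    ping -c 3 172.24.4.3

    # inspect the vxlan port and flows ovs actually set up
    sudo ovs-vsctl show
    sudo ovs-ofctl dump-flows br-infra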
fungi | noonedeadpunk: so... we're using this role successfully in devstack multinode jobs, though i don't know if we have ovs versions (i think the default one is using linuxbridge?) | 20:36 |
noonedeadpunk | btw in ovs log there's this `jsonrpc|WARN|unix#6: receive error: Connection reset by peer` | 20:37 |
noonedeadpunk | ok. need to dig deeper with that | 20:37
fungi | but given neutron's default for victoria is ml2/ovs i would be surprised if there isn't one | 20:37 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Update synchronize-repos https://review.opendev.org/740110 | 20:49 |
*** mnaser is now known as mnaser|ic | 20:52 | |
*** hashar has quit IRC | 20:55 | |
*** mnaser|ic has quit IRC | 21:08 | |
*** mnaser|ic has joined #opendev | 21:08 | |
*** mnaser|ic has quit IRC | 21:08 | |
*** mnaser|ic has joined #opendev | 21:08 | |
*** mnaser|ic is now known as vexxhost | 21:08 | |
*** gouthamr has quit IRC | 21:09 | |
*** gouthamr has joined #opendev | 21:11 | |
*** vexxhost is now known as mnaser | 21:14 | |
*** mnaser is now known as mnaser|ic | 21:14 | |
*** gouthamr_ has joined #opendev | 21:15 | |
*** johnsom_ is now known as johnsom | 21:19 | |
*** johnsom has joined #opendev | 21:19 | |
*** sshnaidm|afk has quit IRC | 22:03 | |
*** slittle1 has quit IRC | 22:17 | |
*** tosky has quit IRC | 22:30 | |
*** slittle1 has joined #opendev | 22:31 | |
*** mnaser has joined #opendev | 22:38 | |
*** mlavalle has quit IRC | 22:54 | |
*** slittle1 has quit IRC | 22:56 | |
*** slittle1 has joined #opendev | 22:58 | |
*** tkajinam has joined #opendev | 22:58 | |
kevinz | corvus: Not yet, please help to unlock it, thanks | 23:10
*** Dmitrii-Sh has quit IRC | 23:14 | |
*** Dmitrii-Sh has joined #opendev | 23:21 | |
*** DSpider has quit IRC | 23:35 | |
*** mnaser has quit IRC | 23:38 | |
*** mnaser|ic has quit IRC | 23:38 | |
*** mnaser|ic has joined #opendev | 23:38 | |
*** mnaser|ic has quit IRC | 23:38 | |
*** mnaser|ic has joined #opendev | 23:38 | |
*** mnaser|ic is now known as mnaser | 23:38 |