opendevreview | Shnaidman Sagi (Sergey) proposed openstack/diskimage-builder master: Improve DIB for building CentOS 9 stream https://review.opendev.org/c/openstack/diskimage-builder/+/806819 | 00:22 |
---|---|---|
ianw | clarkb / fungi : I have put in a draft note at the end of https://etherpad.opendev.org/p/gerrit-upgrade-3.3 about dashboards and attention sets as discussed in the meeting. please feel free to edit and send as you see fit | 02:53 |
Clark[m] | ianw: I made two small edits but lgtm if you want to send it | 02:58 |
ianw | Clark[m]: were you thinking openstack-discuss or just service-discuss? | 03:03 |
Clark[m] | service-discuss. More people seem to be watching our lists and getting it out there will hopefully percolate through the places | 03:05 |
ianw | will do. going to take a quick walk before rain and will come back to it :) | 03:06 |
*** ysandeep|out is now known as ysandeep | 04:04 | |
*** ykarel|away is now known as ykarel | 04:32 | |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: Update centos element for 9-stream https://review.opendev.org/c/openstack/diskimage-builder/+/806819 | 04:53 |
ianw | sshnaidm: i have updated the change with the testing we should be doing, and tried to explain more clearly what's going on in https://review.opendev.org/c/openstack/diskimage-builder/+/806819/comment/87f51fd0_0ee1505c/ | 05:04 |
ianw | hopefully that can get us on the same page | 05:05 |
*** bhagyashris is now known as bhagyashris|rover | 05:21 | |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: Update centos element for 9-stream https://review.opendev.org/c/openstack/diskimage-builder/+/806819 | 06:49 |
opendevreview | Bhagyashri Shewale proposed zuul/zuul-jobs master: [DNM] Handle TypeError while installing the any sibling python packages https://review.opendev.org/c/zuul/zuul-jobs/+/813749 | 06:50 |
opendevreview | Bhagyashri Shewale proposed zuul/zuul-jobs master: [DNM] Handle TypeError while installing the any sibling python packages https://review.opendev.org/c/zuul/zuul-jobs/+/813749 | 06:59 |
opendevreview | Bhagyashri Shewale proposed zuul/zuul-jobs master: [DNM] Handle TypeError while installing the any sibling python packages https://review.opendev.org/c/zuul/zuul-jobs/+/813749 | 07:09 |
opendevreview | Dong Zhang proposed zuul/zuul-jobs master: Implement role for limiting zuul log file size https://review.opendev.org/c/zuul/zuul-jobs/+/813034 | 07:18 |
opendevreview | Bhagyashri Shewale proposed zuul/zuul-jobs master: Handled TypeError while installing any sibling python packages https://review.opendev.org/c/zuul/zuul-jobs/+/813749 | 07:25 |
*** jpena|off is now known as jpena | 07:29 | |
lourot | o/ has anyone got a moment for https://github.com/openstack-charmers/test-share/pull/21 ? thanks! | 07:45 |
lourot | wrong channel, sorry | 07:46 |
frickler | infra-root: lots of post_failures. I've heard rumors of OVH having issues, but can't dig right now. might be log uploads failing | 08:05 |
ianw | yep, OVH | 08:20 |
ianw | WARNING:keystoneauth.identity.generic.base:Failed to discover | 08:20 |
ianw | available identity versions when contacting https://auth.cloud.ovh.net/. | 08:20 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: Disable log upload to OVH https://review.opendev.org/c/opendev/base-jobs/+/813780 | 08:24 |
ianw | status page gives "Active issue" | 08:24 |
ianw | https://status.us.ovhcloud.com/ | 08:25 |
frickler | didn't we use to have a third log provider? maybe we should actively try to get some more redundancy again | 08:28 |
ttx | Looks like OVH is having network issues right now | 08:29 |
ianw | frickler: think we should fast merge it? | 08:29 |
ttx | I was disconnected for an hour, just came back | 08:29 |
frickler | ianw: actually I think the indentation might be broken with your patch? | 08:29 |
ttx | (my bouncer is on a OVH node) | 08:29 |
ianw | frickler: the lines are commented, i think the other lines remain the same? | 08:30 |
frickler | ianw: but don't comments need to match the indentation of their surroundings? | 08:31 |
ianw | i don't believe so, but if you'd prefer i can delete the lines and we can just put a revert in | 08:32 |
frickler | anyway, it seems to be working again just now, at least accessing the cloud from bridge is now returning things again | 08:32 |
ianw | yeah, i can get to auth.cloud.ovh.net:5000 too | 08:33 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: Disable log upload to OVH https://review.opendev.org/c/opendev/base-jobs/+/813780 | 08:33 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: Revert "Disable log upload to OVH" https://review.opendev.org/c/opendev/base-jobs/+/813783 | 08:33 |
ianw | there's the stack if we have issues | 08:34 |
ianw | i'm afraid i'm rapidly reaching burnout point here | 08:34 |
frickler | ianw: np, thanks for your help, I can check from time to time and see if it stays stable | 08:35 |
*** arxcruz is now known as arxcruz|rover | 08:48 | |
opendevreview | yatin proposed openstack/diskimage-builder master: [WIP] Add support for CentOS Stream 9 in DIB https://review.opendev.org/c/openstack/diskimage-builder/+/811392 | 09:22 |
*** ykarel is now known as ykarel|lunch | 09:23 | |
*** ysandeep is now known as ysandeep|mtg | 09:25 | |
frickler | hmm, still seeing failures, checking logs | 09:48 |
frickler | I expected to see errors in the executor logs, but can't find anything there. also zuul didn't vote on 813780 but I'm also not seeing any job for that | 09:57 |
frickler | and while looking for strange things, https://review.opendev.org/800445 seems to be stuck in check for 44h | 09:58 |
frickler | also some tobiko periodic jobs for even longer | 10:10 |
frickler | I also don't see any current POST_FAILURES, so will leave the upload config as is for now | 10:15 |
opendevreview | Shnaidman Sagi (Sergey) proposed zuul/zuul-jobs master: Include podman installation with molecule https://review.opendev.org/c/zuul/zuul-jobs/+/803471 | 10:18 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Changing no of days for query from 14 to 7 https://review.opendev.org/c/opendev/elastic-recheck/+/813795 | 10:27 |
*** ykarel|lunch is now known as ykarel | 10:37 | |
*** ysandeep|mtg is now known as ysandeep | 10:57 | |
*** dviroel|out is now known as dviroel | 11:17 | |
*** jpena is now known as jpena|lunch | 11:24 | |
*** ysandeep is now known as ysandeep|afk | 11:31 | |
*** ysandeep|afk is now known as ysandeep | 12:01 | |
*** jpena|lunch is now known as jpena | 12:24 | |
*** ykarel__ is now known as ykarel | 12:37 | |
ysandeep | Folks o/ Is there a way to set the hashtag(s) via the CLI? For example we can set the topic with -t <topic> in git-review | 13:06 |
*** dviroel is now known as dviroel|rover | 13:33 | |
*** arxcruz|rover is now known as arxcruz | 13:37 | |
*** bhagyashris|rover is now known as bhagyashris | 13:39 | |
fungi | ysandeep: feel free to push up a change implementing that in git-review, though you can probably also do it as a second command straight to gerrit's ssh api... checking the documentation | 13:40 |
fungi | ysandeep: i'm not finding it, looks like they didn't implement any controls for hashtags in the ssh cli, at least not yet | 13:47 |
ysandeep | fungi: ack, no worries, thank your for checking | 13:48 |
ysandeep | you* | 13:48 |
fungi | ysandeep: looks like it could be set at push, similar to how git-review does topics at push: https://review.opendev.org/Documentation/user-upload.html#hashtag | 13:51 |
Clark[m] | They are part of the push ref options | 13:51 |
fungi | yeah, it's just that the ssh cli also has a set-topic command, so i was hoping there might be a similar set-hashtag | 13:52 |
* ysandeep checking documentation | 13:56 | |
fungi | yeah, in git_review/cmd.py you could probably just extend the command line options with one for hashtags and then append to push_options like happens with --topic | 13:58 |
fungi | oh, interesting, it looks like you can only set one hashtag at push time, not a list of them | 13:58 |
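As a rough illustration of the approach fungi describes (hypothetical option name and wiring, shown as a standalone sketch rather than the actual git_review/cmd.py code):

```python
import argparse

# Hypothetical sketch of a --hashtag option for git-review, modeled on how
# --topic gets appended to the push options; names here are assumptions.
parser = argparse.ArgumentParser()
parser.add_argument("-t", "--topic", default=None)
parser.add_argument("--hashtag", default=None,
                    help="hashtag to apply to the change at push time")
options = parser.parse_args(["--hashtag", "my-tag"])

push_options = []
if options.topic:
    push_options.append("topic=%s" % options.topic)
if options.hashtag:
    # Gerrit accepts hashtag=... as a %-option on the push ref; as noted
    # above, only a single hashtag can be set this way per push.
    push_options.append("hashtag=%s" % options.hashtag)

refspec = "HEAD:refs/for/master"
if push_options:
    refspec += "%" + ",".join(push_options)
print(refspec)  # HEAD:refs/for/master%hashtag=my-tag
```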
ysandeep | fungi, thanks i was able to set hashtag with push ref options | 14:09 |
ysandeep | fungi, I will give a shot implementing that in git-review | 14:12 |
Tengu | ysandeep: cool! thanks :) | 14:13 |
Tengu | fungi: thanks as well :) | 14:13 |
fungi | ysandeep: feel free to ping me here when you push up the git-review feature and i'll be happy to review it | 14:21 |
ysandeep | fungi++ thanks! I will try to implement that as my weekend python project.. So will probably ping you in week after PTG | 14:23 |
fungi | whenever, have fun! | 14:23 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: WIP: ER bot with opensearch for upstream https://review.opendev.org/c/opendev/elastic-recheck/+/813250 | 14:23 |
johnsom | FYI, the ptg page is down. I'm getting a cloudflare error 523 "origin is unreachable" page going to openstack.org/ptg page. | 14:36 |
fungi | johnsom: yes, there's some network incident in vexxhost impacting some systems there, but ptg.opendev.org is still up | 14:39 |
johnsom | Ah, bummer, I wish them luck! | 14:40 |
fungi | i'm sure they'll have it cracked shortly | 14:40 |
*** ysandeep is now known as ysandeep|dinner | 14:55 | |
*** ykarel is now known as ykarel|away | 15:12 | |
clarkb | I guess the OVH stuff corrected itself before we had to worry about landing and then reverting any changes | 15:26 |
fungi | seems so | 15:27 |
clarkb | apparently I'm somehow still identified with oftc too. Neat | 15:29 |
fungi | did you set up cert auth? | 15:29 |
fungi | i never have to identify on reconnect | 15:29 |
fungi | it's just done as part of the tls setup with the client key | 15:30 |
clarkb | I don't think I did based on my nickserv status | 15:30 |
clarkb | but also it seems to have been magically handled by weechat? so meh? | 15:31 |
*** marios is now known as marios|out | 15:33 | |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: WIP: ER bot with opensearch for upstream https://review.opendev.org/c/opendev/elastic-recheck/+/813250 | 15:48 |
clarkb | fungi: thinking out loud here should we hold off on https://review.opendev.org/c/opendev/system-config/+/813534/ and children until after renaming is done just to avoid any issues with config when gerrit starts up again in that process? Or go for it and maybe restart gerrit today/tomorrow? | 15:49 |
clarkb | similar question for https://review.opendev.org/c/opendev/system-config/+/813716 | 15:49 |
*** ysandeep|dinner is now known as ysandeep | 16:10 | |
clarkb | fungi: https://review.opendev.org/c/opendev/gerritlib/+/813710 is a super easy review too (adds python39 testing to gerritlib as we run jeepyb which uses gerritlib on python39 now) | 16:11 |
fungi | johnsom: problem seems to have been fixed, if you needed to get to the site for something | 16:13 |
johnsom | fungi Thank you! | 16:13 |
fungi | clarkb: i like the quick restart later today or tomorrow idea, just to make sure we're as prepped as we can be | 16:14 |
gthiemonge | Hey Folks, one of my patches is stuck in zuul https://zuul.openstack.org/status#698450 | 16:14 |
gthiemonge | is there any ways to kill it? | 16:14 |
fungi | also i just realized i booked an appointment for a vehicle inspection friday after the openstack release team meeting, but i should be back well before we're starting the rename maintenance | 16:15 |
clarkb | fungi: in that case feel free to carefully review and approve those changes I guess :) | 16:16 |
clarkb | gthiemonge: hrm we should probably inspect why that happened first | 16:16 |
clarkb | corvus: ^ fyi there are two changes in openstack's check pipeline that have gotten stuck. Likely due to lost node requests? I feel like that is what happened before. I'll start trying to find logs for them | 16:17 |
gthiemonge | octavia-v2-dsvm-scenario-centos-8-stream is still queued, and my patch updates this job | 16:17 |
fungi | clarkb: frickler noted some stuck changes earlier | 16:17 |
clarkb | fungi: I'm guessing it is these changes since they are old enough to have been seen back when frickler's work day was happening | 16:18 |
fungi | 800445,16 has an openstack-tox-py36 build queued for 50 hours and counting | 16:18 |
fungi | also there's a few periodic jobs waiting since days | 16:19 |
clarkb | 810f631ad5494c9ba7bc892d1c3f430f is the event associated with that enqueue I think | 16:21 |
clarkb | there are two other events for child changes | 16:21 |
clarkb | 2021-10-13 07:41:31,801 DEBUG zuul.Pipeline.openstack.check: [e: 810f631ad5494c9ba7bc892d1c3f430f] Adding node request <NodeRequest 300-0015752422 ['nested-virt-centos-8-stream']> for job octavia-v2-dsvm-scenario-centos-8-stream to item <QueueItem 7f41db5ea67847b887881460f0b7b2b5 for <Change 0x7f62645d9e80 openstack/octavia-tempest-plugin 698450,19> in check> <- is the last thing the | 16:22 |
clarkb | scheduler logs for that job on that event | 16:22 |
clarkb | now to hunt down that node requests | 16:22 |
clarkb | nested-virt-centos-8-stream <- the job uses a special node type... | 16:23 |
fungi | i notice all the long-queued builds in periodic are for fedora | 16:24 |
fungi | so we might be looking at multiple causes | 16:24 |
clarkb | ya gthiemonge's change is stuck because only ovh can build nested-virt-centos-8-stream and that request happened during ovh's outage. I think the reason we haven't node failured is we must've leaked launcher registrations in zookeeper again so nodepool thinks there are other providers still to decline that request | 16:27 |
clarkb | let me check on the registrations really quickly, but in gthiemonge's case I think the easiest thing is to abandon/restore and push a new patchset | 16:27 |
clarkb | or wait, I think if I restart the node with the registrations it will notice and node failure it | 16:28 |
fungi | we seem to be unable to launch fedora-34 nodes, but i have a feeling something similar befell the three periodic builds which wanted one | 16:28 |
clarkb | hrm I don't see any extra registrations | 16:29 |
fungi | similarly that openstack-tox-py36 job would have wanted an ubuntu-bionic node and we're probably booting fewer of them these days | 16:29 |
fungi | statistically speaking, as none of the 5 stuck examples use our most popular node label, it's possible they're all representatives of a similar problem | 16:30 |
*** jpena is now known as jpena|off | 16:30 | |
clarkb | the linaro provider has not declined that request yet | 16:32 |
clarkb | it cannot provide that node type so it should decline it. Now to look at why it hasn't yet | 16:33 |
clarkb | linaro reports not enough quota to satisfy the request which will cause it to enter pause mode and not decline requests. | 16:36 |
clarkb | I think that may be starving its ability to get through and decline requests it cannot satisfy at all due to being the wrong label type | 16:36 |
clarkb | there are leaked instances in that cloud which I am trying to clean up now. We'll see what happens | 16:37 |
clarkb | fungi: the tobiko changes have been pathological for a while. I suspect some weird configuration issue as they have a ton of errors iirc | 16:40 |
clarkb | fungi: neutron is probably related to whatever is causing fedora issues | 16:40 |
fungi | i looked at a few and it was suds-jurko failing to build | 16:40 |
clarkb | fungi: but that wouldn't cause them to be stuck in zuul? | 16:40 |
fungi | no, talking about your suggestion that the tobiko jobs have been pathologically going into retry_limit | 16:41 |
clarkb | gthiemonge: I think your best bet may be to push a new patchset or abandon and restore the current patchset. The issue is nodepool isn't declining the request because it can't get to those requests because a cloud is failing very early :/ | 16:41 |
clarkb | fungi: ah | 16:41 |
clarkb | I'm going to write an email to kevinz about cleaning up these leaked instances in the linaro cloud now | 16:42 |
fungi | the fedora-34 situation is a little odd too. there's one in airship-kna1 which has been in a "ready" state for more than 3 days | 16:44 |
fungi | it should have gotten assigned to a build long before now | 16:44 |
clarkb | linaro email sent | 16:45 |
clarkb | I think that we can set the linaro quota to 0 if this persists and we notice more stuck changes due to it | 16:46 |
fungi | i wonder if there's a good way to determine why nl02 hasn't assigned 0026879768 to a node request yet | 16:46 |
clarkb | fungi: that cloud is also probably near or at its quota most of the time so it has a hard time working through requests | 16:48 |
fungi | oh, maybe | 16:48 |
clarkb | nodepool.exceptions.ConnectionTimeoutException: Timeout waiting for connection to 149.202.161.123 on port 22 <- seems to be the general issue here | 16:48 |
clarkb | with fedora-34 launches I mean | 16:48 |
clarkb | we'll probably need to launch one out of band and inspect it. /me starts trying to do that | 16:48 |
fungi | yeah, but 0026879768 has been in a ready state for days there | 16:49 |
fungi | and grepping the debug log, the last mention was when it came ready and was unlocked (2021-10-10 08:52:00,927) | 16:49 |
clarkb | fungi: yes, but if the airship cloud is perpetually paused it won't ever get a chance to scan all the requests and find the few fedora requests to assign that node | 16:50 |
clarkb | the process here is: when the provider is not doing an action it proceeds to grab the next request, locks it, checks quota, pauses if at quota, and once no longer at quota attempts to launch the node. | 16:51 |
clarkb | it can shortcut that if it already has the node, but that depends on it finding a random request for a fedora-34 node | 16:51 |
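As simplified pseudocode of the flow clarkb describes (illustrative only, not the actual nodepool launcher code), the starvation comes from the paused worker never walking the queue again, so it neither declines requests it can never serve nor hands out nodes it already has ready:

```python
# Illustrative pseudocode of the pool worker behaviour described above;
# the real nodepool launcher differs in detail.
def run_pool_worker(provider, request_queue):
    paused = False
    while True:
        if paused:
            # While paused the worker only waits for quota to free up; it
            # does not look at further requests, so requests it could just
            # decline (wrong label) or satisfy from a ready node sit idle.
            paused = not provider.quota_available()
            continue
        request = request_queue.next_unhandled()
        if request is None:
            continue
        if request.label not in provider.labels:
            request.decline(provider)          # can never satisfy this
        elif provider.has_ready_node(request.label):
            request.fulfill(provider.pop_ready_node(request.label))
        elif provider.quota_available():
            provider.launch_node_for(request)
        else:
            paused = True                      # at quota: stop processing
```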
gthiemonge | clarkb: ok thanks, i'll try | 16:51 |
fungi | got it, so the problem is that we want the pause to pause cloud provider api interactions, not just everything | 16:52 |
fungi | pausing declining nodes, or assigning nodes which are available and ready, results in a deadlock | 16:53 |
fungi | but i guess nodepool performs a server list, which is a cloud provider api call | 16:53 |
fungi | so is that where it's blocking? | 16:54 |
clarkb | it does an internal wait for a node to be deleted iirc as it knows there will be free quota after that | 16:54 |
opendevreview | Merged opendev/gerritlib master: Add python39 testing https://review.opendev.org/c/opendev/gerritlib/+/813710 | 16:54 |
clarkb | fungi: and ya it seems like we could have it continue to process and decline things it has no hope of ever fulfilling, as well as fulfilling things it already has resources allocated to like the fedora-34 node | 16:55 |
clarkb | clarkb-test-fedora-34 is booting in ovh bhs1. that is a region I noticed fedora-34 boot problems in | 16:56 |
clarkb | fungi: I think it is failing in rax too, otherwise I would blame potentially bad uploads due to ovh's network problems | 16:57 |
clarkb | the byte count for this image seems to match up what we have on nb02 at least | 16:58 |
clarkb | this is spicy the console log is just one giant kernel panic | 16:59 |
fungi | i'm looking at a build which succeeded on a fedora-34 node in airship-citycloud yesterday, so apparently we booted one there for it even though there was one ready for a couple days | 17:00 |
fungi | in the same provider | 17:00 |
clarkb | fungi: were they different pools? | 17:00 |
clarkb | anyway I think fedora-34 is completely hosed based on the console log on the test instance i booted in ovh | 17:00 |
fungi | provider: airship-kna1 | 17:00 |
clarkb | I'm going to try and boot in other clouds and see if we get different results | 17:01 |
clarkb | fungi: ya but that provider has multiple pools. I don't think it will share across pools | 17:01 |
fungi | so same pool | 17:01 |
clarkb | fungi: we can do multiple pools per provider now | 17:03 |
fungi | nl02.opendev.org-PoolWorker.airship-kna1-main-43d321e2735c45e5a17fc8d90e8ac674 logged for both nodes 0026879768 (the one that's been ready for days) and 0026901606 (the one booted yesterday for node request 300-0015737905) | 17:05 |
fungi | why would the pool worker build a new node for an incoming node request when it already had one with the same label available? | 17:06 |
clarkb | I do not know. That seems like a bug | 17:06 |
clarkb | the iweb boot doesn't seem to panic. That makes me think potentially corrupt image in ovh. However the iweb test node doesn't seem to allow me in via ssh either | 17:07 |
fungi | could be cpu flags? | 17:07 |
fungi | or something hypervisor-related? | 17:07 |
clarkb | fungi: well the image upload happened around ovh's crisis. I really suspect it is as simple as a bad image upload there | 17:09 |
clarkb | on the iweb side of things it appears that we only configure the lo interface with glean? | 17:09 |
clarkb | at least I don't see anything logged for other interfaces + glean in the console log | 17:09 |
clarkb | I think we should consider pausing fedora-34 image builds then delete today's build | 17:10 |
clarkb | and then hopefully those that understand fedora can look into why glean + network manager seem to be unhappy with it | 17:10 |
opendevreview | Merged zuul/zuul-jobs master: Handled TypeError while installing any sibling python packages https://review.opendev.org/c/zuul/zuul-jobs/+/813749 | 17:11 |
opendevreview | Clark Boylan proposed openstack/project-config master: Pause fedora-34 to debug network problems https://review.opendev.org/c/openstack/project-config/+/813876 | 17:12 |
clarkb | I'm going to boot a test on yesterday's image to see if it acts different | 17:12 |
fungi | i guess it's just provider launches we can pause from the cli? i always forget | 17:13 |
clarkb | I didn't realize there is a command line option; I'll check after I test yesterday's image | 17:14 |
fungi | i'm looking it up in the docs now | 17:14 |
clarkb | hrm yesterday's image may be no better | 17:15 |
clarkb | in which case we're in a more roll forward state | 17:15 |
fungi | https://zuul-ci.org/docs/nodepool/operation.html#command-line-tools "image-pause: pause an image" | 17:16 |
clarkb | ya I think rolling back to the previous image isn't going to help us | 17:16 |
opendevreview | Merged opendev/system-config master: Replace testing group vars with host vars for review02 https://review.opendev.org/c/opendev/system-config/+/813534 | 17:16 |
clarkb | fungi: https://review.opendev.org/c/zuul/zuul-jobs/+/813749 that merged which needed a fedora-34 node. So there must be working fedora-34 somewhere /me looks | 17:19 |
clarkb | ha that ran in airship. did it use your old node? | 17:20 |
clarkb | fedora-34-airship-kna1-0026921358 was the hostname | 17:20 |
clarkb | Failed to start Network Manager Wait Online then See 'systemctl status NetworkManager-wait-online.service' for details. on the iweb images | 17:21 |
clarkb | ovh kernel panics but could just be a bad image | 17:21 |
fungi | airship-kna1 for the check pipeline build too | 17:22 |
fungi | so it's like we're only getting fedora-34 nodes in airship-kna1, which also has a fedora-34 ready node it's been ignoring for days | 17:23 |
clarkb | maybe its image is just old | 17:23 |
fungi | i wonder if image uploads have been failing there and it's got an old... | 17:23 |
fungi | yeah | 17:23 |
clarkb | doesn't appear to be old | 17:24 |
clarkb | it could be luck that whatever is causing NM + glean to fail in iweb isn't an issue in airship | 17:24 |
clarkb | I am going to try and delete the image in ovh that is panicking to force a reupload | 17:24 |
clarkb | then maybe we'll see ovh do what iweb is doing or function like airship | 17:24 |
fungi | yeah, airship is currently using image 7900 uploaded 13.5 hours ago | 17:25 |
fungi | er, airship-kna1 is | 17:25 |
fungi | maybe network setup in citycloud is different than everywhere else we try to boot fedora-34? | 17:26 |
fungi | different virtual interface type which the f34 image's kernel is actually recognizing? | 17:27 |
clarkb | I think it uses dhcp like many clouds. rax is static | 17:27 |
clarkb | ya that might explain it | 17:28 |
clarkb | https://bodhi.fedoraproject.org/updates/FEDORA-2021-ffda3d6fa1 is a recent update and they already have https://bodhi.fedoraproject.org/updates/FEDORA-2021-385f3aebfd proposed too | 17:29 |
fungi | ens3 is the detected interface on the successful builds there | 17:29 |
fungi | also citycloud is using rfc-1918 addressing with floating-ip for global access | 17:30 |
clarkb | this is curious: booting the previous image in ovh produces no console log. But I also cannot ssh in | 17:31 |
fungi | and no global ipv6 | 17:31 |
clarkb | at least it isn't kernel panicking? | 17:31 |
clarkb | I'm going to clean up my test instances in ovh and iweb and boot some on rax and vexxhost and see if they are any different | 17:34 |
clarkb | vexxhost can boot the fedora-34 image successfully too, but we don't launch the fedora image there because we only do the special larger instances in vexxhost now | 17:45 |
opendevreview | Merged opendev/system-config master: Switch test gerrit hostname to review99.opendev.org https://review.opendev.org/c/opendev/system-config/+/813671 | 17:47 |
fungi | so something about the image works in vexxhost and citycloud, is unreachable in iweb, crashes during boot in ovh... | 17:49 |
clarkb | fungi: yes though the crashes during boot in ovh may be unrelated and a result of us trying to upload images there during their network crisis | 17:50 |
clarkb | rax test node also appears to be sad. This is noteworthy because rax uses static configuration and not dhcp. Implies the issue is independent of dhcp | 17:51 |
fungi | "sad" as in boots but is unreachable over the network, or crashing like in ovh? | 17:51 |
clarkb | unreachable over network. They don't support cli console logs so I didn't bother to check that | 17:52 |
clarkb | it is possible that it crashes but that requires me to dig out credentials and do more work, but I need to context switch to other stuff | 17:52 |
fungi | yeah, you can console url show and then stick that in a browser | 17:52 |
clarkb | oh neat | 17:52 |
fungi | shouldn't need credentials, the url is just meant to be unguessable | 17:52 |
clarkb | ok let me relaunch in rax and see what it says | 17:53 |
fungi | that's been my fallback on the providers who don't implement console log show | 17:53 |
fungi | annoying, but better than nothing | 17:53 |
clarkb | but I think we're fast approaching the bit where I say "people who care about this platform and understand it should really take a look" because I'm still arguing we should delete all fedora and use stream which seems to be a fair bit more stable for our purposes | 17:54 |
fungi | if memory serves, ianw did a fair bit of work on the f34 networking stuff, so may have a better idea of where we should be looking for root cause | 17:55 |
clarkb | ya my hunch is something to do with ordering of services. Like udev isn't finding the device properly before we run glean or similar | 17:55 |
clarkb | It wouldn't be so problematic if fedora didn't update so frequently with so many big explosions :) | 17:56 |
clarkb | basically the reason we don't do intermediate ubuntu releases | 17:56 |
clarkb | I'm going to abandon the f34 pause change since that won't help | 17:56 |
fungi | i suppose we could temporarily add f34 labels to vexxhost and remove them from everywhere else except citycloud | 17:57 |
clarkb | ya the risk there is the flavor there is huge so the fedora jobs might end up needing that memory | 17:58 |
clarkb | but if it is a short lived change the risk of that should be low | 17:58 |
fungi | also not sure if i should delete this f34 ready node in citycloud which nodepool seems to just be ignoring and wasting quota with | 17:58 |
clarkb | might be worth keeping around if anyone has time to dig into why the launcher isn't using it but instead booting new nodes | 17:58 |
fungi | yeah, also i have a feeling that if i do delete it, a new ready node will be booted and ignored instead | 17:59 |
clarkb | oh ya since it is the only cloud that can satisfy the min-ready of 1 for f34 right now | 17:59 |
clarkb | everything else will fail and eventually the airship cloud should get it | 18:00 |
clarkb | fungi: the rax boot enters an emergency shell and I can't seem to get any scrollback to understand why that happens better | 18:09 |
clarkb | "unknown key released" when I hit page up | 18:10 |
clarkb | certainly seems that something to do with fedora 34 is new or different and causing some clouds problems | 18:10 |
fungi | i wonder when this started | 18:11 |
clarkb | I'll try to emergency boot this and see if the initramfs sos report is present | 18:11 |
clarkb | (I kinda doubt it will be there because I don't think it is on persistent storage but doesn't hurt to check) | 18:12 |
fungi | yeah, if it didn't get far enough to pivot from the initramfs to the real rootfs | 18:14 |
clarkb | LE job failed for https://review.opendev.org/c/opendev/system-config/+/813534 so the jobs behind it didn't run | 18:17 |
*** ysandeep is now known as ysandeep|out | 18:19 | |
clarkb | fatal: unable to access 'https://github.com/Neilpang/acme.sh/': Failed to connect to github.com port 443: Connection timed out <- that repo redirects to https://github.com/acmesh-official/acme.sh now but is generally accessible. I guess this is just a random internet is sad occurrence | 18:19 |
clarkb | but this means that service-review didn't run after that change landed so not sure if we want to manually run it | 18:19 |
fungi | though we could stand to update the url anyway, i guess | 18:20 |
fungi | also retry downloads maybe | 18:20 |
clarkb | ++ | 18:21 |
clarkb | for that ord f34 instance I can't get to the rescue instance | 18:21 |
clarkb | does it rescue with its own image by default? | 18:21 |
fungi | yes | 18:21 |
clarkb | ugh | 18:22 |
fungi | that's probably not going to work | 18:22 |
clarkb | ya | 18:22 |
fungi | for... reasons which should previously have been obvious to me, sorry | 18:22 |
clarkb | I didn't expect rescuing to give us any new info anyway. I'll just unrescue and delete the instance | 18:22 |
clarkb | people who have had trouble booting f34 VMs have pointed to https://fedoraproject.org/wiki/Changes/UnifyGrubConfig on the internets. I'm booting a new instance in vexxhost so that I can check its grub configs | 18:27 |
clarkb | fungi: when the current set of deploy jobs finishes maybe we should run service-review manually to be sure there are no unexpected updates? | 18:28 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Retry acme.sh cloning https://review.opendev.org/c/opendev/system-config/+/813880 | 18:28 |
fungi | no idea if that's the way to do it, was reading random examples and trying to piece together from the docs | 18:28 |
clarkb | fungi: I left a comment on it, close, but not quite | 18:29 |
fungi | thanks | 18:30 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Retry acme.sh cloning https://review.opendev.org/c/opendev/system-config/+/813880 | 18:32 |
clarkb | [Wed Oct 13 18:29:05 2021] Unknown command line parameters: nofb BOOT_IMAGE=/boot/vmlinuz-5.14.10-200.fc34.x86_64 gfxpayload=text | 18:34 |
clarkb | the vexxhost node's dmesg reports that. I half wonder if it isn't able to find the kernel as a result on some system | 18:34 |
clarkb | though maybe not | 18:35 |
clarkb | since the kernel is already running at this point | 18:35 |
clarkb | and we're just telling the kernel about itself | 18:35 |
fungi | yeah, it's got to be finding the kernel if it's into the initrd | 18:36 |
clarkb | /boot/efi/EFI/fedora/ exists but is empty. We do all of our x86 images as grub images. I half wonder if the other clouds might be seeing the efi dir and attempting efi, failing due to the lack of an efi config and not falling back to grub? | 18:37 |
clarkb | I don't know how that all works with openstack, nova, kvm, and qemu | 18:37 |
clarkb | the grub config and fstab and the device label all lgtm | 18:42 |
clarkb | The actual grub menu entry uses the device uuid not its label, but both the label in the grub /etc/default/grub config and the uuid in the /boot/grub2/grub.cfg menu entry match /dev/vda1 | 18:43 |
clarkb | I don't think vexxhosts's kvm had to do any magic to properly boot this | 18:43 |
clarkb | ok I really need to page out the f34 stuff. I'm going to delete the vexxhost test node as it didn't show me anything super useful other than "it should work". But then i need to do lunch and then we should manually run service-review.yaml, check that didn't make any unexpected changes to gerrit. Then test node exporter on trusty. Then prep stuff for the project renaming | 18:47 |
fungi | looking into the gitea metadata automation, it's the gitea_create_repos.Gitea.update_gitea_project_settings() method we want to call, and that already takes a project as a posarg, we're calling it from a loop in the make_projects() method | 18:48 |
clarkb | fungi: iirc there is a force flag | 18:48 |
clarkb | and if the project is new or the force flag is set then the metadata is updated | 18:48 |
fungi | though looking closer at how we call it from ansible, it may be simpler to add a project filter as a library argument and filter it that way | 18:48 |
clarkb | maybe we can make the force flag a list of names to force? | 18:49 |
clarkb | if force is not empty then if project in force type deal | 18:49 |
fungi | yeah, we already parameterize that | 18:49 |
fungi | always_update: "{{ gitea_always_update }}" | 18:49 |
fungi | right now we just set it to true or let it default to false | 18:50 |
fungi | but we could overload it as a trinary? | 18:50 |
fungi | or make it a regex | 18:50 |
clarkb | well you could set it to structured data like a list | 18:50 |
clarkb | false/[] don't force update or [ foo/bar, bar/foo] force update those projects | 18:51 |
fungi | though if we also still want a way to be able to force it to do all projects, we'd have to list thousands | 19:00 |
fungi | but yeah, a trinary of falsey/[some,list]/true could fit and remain backward-compatible | 19:01 |
fungi | oh, though ansible seems to like these arguments to be declared with only one type | 19:13 |
fungi | oh, maybe not declaring a type in the AnsibleModule argument_spec is fine | 19:14 |
fungi | we don't seem to declare it for all of them | 19:15 |
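A minimal sketch of the falsey/list/true handling being discussed (helper name made up here, not the actual gitea_create_repos module; leaving the type off the AnsibleModule argument_spec, as noted above, would let either a boolean or a list through):

```python
# Sketch only: decide whether to force a gitea metadata update for one
# project when always_update may be a bool or a list of project names.
def should_force_update(project, always_update):
    if always_update is True:
        return True                      # old behaviour: update everything
    if not always_update:                # False, None, or an empty list
        return False
    return project in always_update      # e.g. ["openstack/foo", "x/bar"]


assert should_force_update("openstack/foo", True)
assert not should_force_update("openstack/foo", False)
assert should_force_update("openstack/foo", ["openstack/foo"])
assert not should_force_update("openstack/bar", ["openstack/foo"])
```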
clarkb | I've quickly consumed some food. fungi I'll start a root screen on bridge and run the service-review.yaml playbook? | 19:18 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Allow gitea_create_repos always_update to be list https://review.opendev.org/c/opendev/system-config/+/813886 | 19:19 |
fungi | clarkb: sounds good, thanks | 19:19 |
fungi | also there's a start on the metadata update project filtering, though i haven't touched the testing yet | 19:19 |
clarkb | alright starting that playbook now | 19:20 |
fungi | i'm attached to the root screen | 19:20 |
clarkb | review02.opendev.org : ok=66 changed=0 unreachable=0 failed=0 skipped=9 rescued=0 ignored=0 | 19:23 |
fungi | looks good | 19:23 |
clarkb | it looked as we hoped, no changes | 19:23 |
clarkb | yup, I think we're good, the var movement didn't cause any unexpected updates | 19:23 |
clarkb | I'll go ahead and exit the screen? | 19:23 |
fungi | yep, go ahead | 19:23 |
clarkb | cool, I think we should still do a restart because we haven't done one since the quoting changes happened | 19:24 |
clarkb | but this is good news on its own | 19:24 |
fungi | i should be around for a gerrit restart later today if you want | 19:25 |
clarkb | ok, lets see where the day continues to go :) I am still planning to get the rename input file pushed and review the related changes and start an etherpad | 19:26 |
clarkb | oh and I wanted to test node exporter on wiki. | 19:26 |
clarkb | fungi: for ^ it is basically `wget https://github.com/prometheus/node_exporter/releases/download/v1.2.2/node_exporter-1.2.2.linux-amd64.tar.gz` then confirm the sha256, then extract and run the binary to see that it starts and doesn't crash | 19:27 |
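For the checksum step, a small helper along these lines could confirm the tarball before it gets run (the expected digest below is a placeholder, not the real node_exporter 1.2.2 value; the published digest comes from the release's sha256sums.txt):

```python
import hashlib

# Compare a downloaded release tarball against its published SHA256 before
# extracting and running the binary. EXPECTED is a placeholder here.
EXPECTED = "replace-with-the-published-sha256"
PATH = "node_exporter-1.2.2.linux-amd64.tar.gz"

sha = hashlib.sha256()
with open(PATH, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha.update(chunk)

if sha.hexdigest() != EXPECTED:
    raise SystemExit("checksum mismatch, refusing to run this binary")
print("checksum ok")
```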
clarkb | fungi: any objections to me doing that on wiki now? | 19:27 |
clarkb | it all runs as my own user | 19:27 |
clarkb | I went ahead and fetched it, checked the hash and extracted it since that is all pretty safe | 19:30 |
clarkb | fungi: ^ I await your ACK before running the binary out of an abundance of caution | 19:31 |
clarkb | fungi: re testing of the gitea project stuff you should be able to hack the existing system-config-run-gitea job for that since it creates projects and then does another pass of them to ensure it noops (but in this case we could hack it to force updates for some projects) | 19:33 |
fungi | clarkb: no objections | 19:34 |
clarkb | cool it ran successfully. I ran it in the foreground and killed it. But if you want to double check it didn't fork into a daemon it was listening on port 9100 and process was called node_exporter | 19:37 |
clarkb | I can mark that done and we should be good to land that spec tomorrow | 19:37 |
fungi | yep, lgtm. nothing listening on 9100 (though it does have listeners on 9200 and 9300/tcp on the loopback) | 19:51 |
clarkb | those ports are for ES that is used to do text search | 19:54 |
clarkb | they are expected iirc | 19:54 |
fungi | yep | 19:59 |
ianw | clarkb: sorry, reading scrollback now | 20:38 |
clarkb | ianw: I don't think it is super urgent but its super weird and going to be a pain to resolve I bet :/ | 20:39 |
ianw | the kernel starting and things going blank could very well be a sign that root=/dev/... is missing, i have seen that before | 20:39 |
ianw | that said, i think it is passing in the devstack boot tests ... it should hit there too if that's it | 20:40 |
clarkb | well it works in citycloud and vexxhost | 20:40 |
clarkb | which is why I suspect this is an odd one | 20:40 |
ianw | hrm, yeah i have no immediate thoughts :/ | 20:44 |
ianw | bib just have to sort out some things | 20:45 |
opendevreview | Andrii Ostapenko proposed zuul/zuul-jobs master: Add retries for docker image upload to buildset registry https://review.opendev.org/c/zuul/zuul-jobs/+/813894 | 20:49 |
clarkb | do we know ^'s IRC nick? | 21:44 |
clarkb | similar to goneri's related update we should make sure there aren't problems with the registry or local networking since that upload should always be local to the cloud | 21:45 |
clarkb | Its a huge warning flag to me that people are retrying those requests and points to an underlying issue that we should probably fix instead | 21:45 |
clarkb | fungi: the openinfra renames are renaming projects like openstackid which need to be retired. I guess we retire them in the new name location? | 21:48 |
fungi | there's a foundation profile associated with the gerrit account e-mail, but it doesn't have any irc field filled in | 21:48 |
fungi | clarkb: yeah, retiring them in the new location is fine, also gets rid of the old namespace that way | 21:49 |
clarkb | well the old namespace will stick around for redirects but it empties it | 21:49 |
fungi | right, that | 21:50 |
fungi | the project list will no longer include the old namespace | 21:50 |
fungi | does https://zuul.opendev.org/t/openstack/build/a758b4b433b7433aa3574ebbd3d77c21 look to anyone else like our conftest is bitrotten? | 21:55 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Allow gitea_create_repos always_update to be list https://review.opendev.org/c/opendev/system-config/+/813886 | 21:58 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: More yaml.safe_load() in testinfra/conftest.py https://review.opendev.org/c/opendev/system-config/+/813900 | 21:58 |
opendevreview | Clark Boylan proposed openstack/project-config master: Move ansible-role-refstack-client from x/ to openinfra/ https://review.opendev.org/c/openstack/project-config/+/765787 | 21:58 |
clarkb | Pushed that to resolve a conflict between two of the renaming changes | 21:59 |
fungi | thanks | 21:59 |
clarkb | fungi: looks like pyyaml updated and we need to update to match? | 22:00 |
clarkb | safe flag? | 22:00 |
opendevreview | Clark Boylan proposed opendev/project-config master: Record renames being performed on October 15, 2021 https://review.opendev.org/c/opendev/project-config/+/813902 | 22:03 |
clarkb | and there is our input file and recording of the changes | 22:03 |
fungi | clarkb: yeah, for now i updated the remaining call in that script to match the other one which was already using safe_load | 22:07 |
fungi | but there are probably a bunch more which need changing | 22:07 |
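For context on the safe_load change (a generic illustration, not the conftest.py diff itself): newer PyYAML requires an explicit Loader for yaml.load(), and safe_load only constructs plain Python types rather than arbitrary objects:

```python
import yaml

doc = "run_all: true\nhosts:\n  - review02.opendev.org\n"

# Preferred form: builds only plain Python types (dicts, lists, strings).
data = yaml.safe_load(doc)

# Equivalent explicit form when a Loader argument is needed anyway.
same = yaml.load(doc, Loader=yaml.SafeLoader)

assert data == same == {"run_all": True, "hosts": ["review02.opendev.org"]}
```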
corvus | i'd like to restart zuul now. any objections? | 22:17 |
clarkb | corvus: looking | 22:20 |
fungi | should we time the gerrit restart to coincide? | 22:20 |
clarkb | fungi: if you'd like. I don't see any release activity but will warn the release team. The tripleo team may appreciate waiting that 14 minutes to see if those changes at the top of their queue end up merging | 22:21 |
clarkb | I've warned the release team | 22:21 |
corvus | i can afk for 20 minutes and try again if you like | 22:21 |
fungi | i can handle the gerrit restart in the middle of the zuul down/up | 22:22 |
clarkb | corvus: considering how long their changes can take that might be a good thing | 22:22 |
clarkb | just to avoid another set of 4 hour round trips for each of them | 22:22 |
clarkb | infra-root I've put https://etherpad.opendev.org/p/project-renames-2021-10-15 together for the rename on Friday | 22:22 |
corvus | clarkb: okay. my own thought is that the last time we waited 5 minutes it took an hour. there's no good time and therefore no bad time to restart. | 22:23 |
clarkb | corvus: fair enough, I'm happy to proceed now too | 22:23 |
corvus | but it's fine. i have something else to do that takes 20m so it's no big deal to me. | 22:23 |
clarkb | I'll let you decide if you'd rather do it now or in 20 minutes :) | 22:23 |
clarkb | I'll be around for both | 22:23 |
corvus | let's come back in 20m. (mostly just don't want to establish too much of a precedent :) | 22:24 |
clarkb | fungi: re gerrit restart the big thing it will be checking is the gerrit.config quoting updates | 22:24 |
fungi | yep | 22:25 |
fungi | if i need to hand edit the config to get it to restart, i can do that really quickly too | 22:25 |
clarkb | fungi: /home/gerrit2/tmp/clarkb/gerrit.config.20211013.pre-group-mangling <- is a copy I made of that file on review02 earlier today when those other changes were merging | 22:25 |
clarkb | we can use that to compare delta post restart | 22:25 |
fungi | status notice Both Gerrit and Zuul services are being restarted briefly for minor updates, and should return to service momentarily; all previously running builds will be reenqueued once Zuul is fully started again | 22:40 |
fungi | that work? | 22:40 |
clarkb | lgtm | 22:40 |
fungi | i have a root screen session started on the gerrit server in case we need to coordinate anything there, and the docker-compose down command is queued | 22:41 |
clarkb | I'll join it | 22:41 |
fungi | that tripleo job has been uploading logs for almost 5 minutes, so should end any time hopefully | 22:43 |
clarkb | but also we gave it a chance we can proceed when ready I think | 22:44 |
fungi | yep | 22:44 |
clarkb | corvus: you'll do a stop, then we can restart gerrit, then a start? | 22:44 |
fungi | that's how we did it last time, at least | 22:45 |
fungi | oh, the test job wrapped up, now the paused registry job is closing out | 22:47 |
corvus | sounds good | 22:47 |
corvus | are we waiting still, or calling it good enough? | 22:48 |
clarkb | I'm happy calling it good enough. We gave it a real chance. | 22:48 |
fungi | good enough | 22:48 |
fungi | #status notice Both Gerrit and Zuul services are being restarted briefly for minor updates, and should return to service momentarily; all previously running builds will be reenqueued once Zuul is fully started again | 22:49 |
opendevstatus | fungi: sending notice | 22:49 |
corvus | okay. i'm re-pulling images to get ianw's 400 change | 22:49 |
-opendevstatus- NOTICE: Both Gerrit and Zuul services are being restarted briefly for minor updates, and should return to service momentarily; all previously running builds will be reenqueued once Zuul is fully started again | 22:49 | |
corvus | will take just a few seconds longer | 22:49 |
corvus | stopping zuul | 22:50 |
corvus | fungi: you can proceed with gerrit restart | 22:50 |
fungi | downing gerrit | 22:50 |
fungi | upping | 22:51 |
corvus | waiting for signal from fungi to start zuul | 22:51 |
fungi | webui is loading for me | 22:51 |
clarkb | yup loads for me too and the config diff is empty | 22:52 |
fungi | [2021-10-13T22:51:33.954Z] [main] INFO com.google.gerrit.pgm.Daemon : Gerrit Code Review 3.3.6-44-g48c065f8b3-dirty ready | 22:52 |
fungi | corvus: all clear | 22:52 |
clarkb | ++ | 22:52 |
corvus | starting zuul | 22:53 |
clarkb | fungi: I detached from the screen. feel free to close it whenever you like | 22:53 |
fungi | thanks, don | 22:53 |
fungi | e | 22:53 |
clarkb | its ok you can call me Don | 22:54 |
corvus | just don call me shirley | 22:54 |
fungi | i picked the wrong day to stop sniffing glue | 22:55 |
clarkb | what continues to try and get a kata-containers tenant? we removed it right? Maybe the cronjobs to dump queues? | 22:58 |
clarkb | Thats a not today question I think | 22:58 |
opendevreview | Ian Wienand proposed opendev/system-config master: ptgbot: have apache cache backend https://review.opendev.org/c/opendev/system-config/+/813910 | 23:01 |
ianw | fungi: ^ i'd probably consider you domain expert in that -- i'd not really intended the little static server to be demand-facing, so having apache cache it would be good for reliability i think | 23:01 |
fungi | oh, yep | 23:03 |
corvus | re-enqueing | 23:05 |
corvus | #status log restarted all of zuul on commit 3066cbf9d60749ff74c1b1519e464f31f2132114 | 23:05 |
opendevstatus | corvus: finished logging | 23:05 |
clarkb | and in an hour we should see the znode count fall again? | 23:07 |
clarkb | I think we expect it in the 80-90k range? | 23:07 |
corvus | yeah. it's hard for me to say if 110 might be okay though -- so i probably wouldn't assume we have a leak until it's over 120k sustained. | 23:10 |
clarkb | corvus: a zuul/zuul change showed up in the openstack tenant release-approval pipeline briefly. I'm surprised we evaluate zuul changes in openstack at all? | 23:11 |
corvus | it's in the projects list | 23:12 |
clarkb | huh I didn't expect that but that is expected behavior then | 23:12 |
clarkb | oh i bet it is there for the zuul_return testing in system-config/project-config/etc? | 23:12 |
clarkb | we might be able to clean that up now as zuul_return has a mock or something now iirc | 23:12 |
corvus | i think it may have been to try to trigger opendev deployments on zuul changes or something. | 23:13 |
corvus | not sure if currently used | 23:13 |
corvus | but it looks like jobs are loaded too, so may be some job inheritance going on | 23:13 |
corvus | re-enqueue complete | 23:15 |
clarkb | fungi: I approved the safe_load fix | 23:18 |
clarkb | ianw: if we ignore the kernel panic in ovh because maybe that was due to their outage coinciding with our upload we're left with two failure modes. The rax emergency initramfs shell and the iweb failure to get network | 23:24 |
clarkb | It might be easier to debug the iweb failure case first? like maybe do a build that hard codes dhcp without glean or something and see if that boots and work back from that? | 23:24 |
clarkb | and maybe if we get lucky fixing that will give us clues to fixing the rax problem | 23:25 |
ianw | clarkb: yeah, i think all will be revealed if we can get a serial output | 23:28 |
clarkb | fungi: re the gitea metadata. I'm thinking we can just do a copy of playbooks/sync-gitea-projects.yaml but then replace the gitea_always_update var with our list and then be good? You should be able to test this by calling that copy of sync-gitea-projects.yaml in the system-config-run-gitea job | 23:30 |
clarkb | that job runs playbooks/test-gitea.yaml <- should be easy to run the playbook from there? | 23:30 |
clarkb | note the import_playbook in test-gitea.yaml you should be able to run sync-gitea-projects that way | 23:31 |
fungi | yup, will give it a shot tomorrow between meetings | 23:33 |
ianw | clarkb: sorry my attention is slightly divided, i'm just trying to see if we can get these 9-stream image-based builds testing in 806819 | 23:35 |
clarkb | ianw: ya no worries, I don't think this is urgent yet. Worst case we can add f34 to vexxhost like fungi suggested and that will give us enough capacity for the label to limp along while we debug further | 23:36 |
clarkb | ianw: https://zuul.opendev.org/t/openstack/build/e336bb93987042a18a5acc44fb818b1e/log/logs/centos_8-build-succeeds.FAIL.log#933-935 that seems odd considering the other centos builds succeeded in that job. Has epel already started removing centos 8 stuff? | 23:41 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] testing centos 8 image builds https://review.opendev.org/c/openstack/diskimage-builder/+/813912 | 23:43 |
ianw | clarkb: ^ i hope to find out :) | 23:43 |
clarkb | ha ok | 23:44 |
ianw | i don't like these image-based jobs but clearly people still use them | 23:45 |
clarkb | ianw: ya I'll admit I didn't even consider that that might be what people were trying to do there | 23:45 |
clarkb | the minimal builds are far more reliable because you don't have the upstream image changing daily on you in the case of ubuntu for example | 23:45 |
opendevreview | Merged opendev/system-config master: More yaml.safe_load() in testinfra/conftest.py https://review.opendev.org/c/opendev/system-config/+/813900 | 23:46 |
clarkb | big znode drop from ~140k to 108k. corvus' estimate of ~110k may have been spot on | 23:53 |
corvus | 90 may be idle and 110 may be busy-ish ? | 23:56 |