mnasiadka | morning | 07:47 |
mnasiadka | https://meetings.opendev.org/#Containers_Team_Meeting - where can I change the url it's pointing to? Magnum team has moved to using "magnum" as the meeting name, so that link is pointing to very old archives | 07:47 |
mnasiadka | (I mean url for meeting logs) | 07:47 |
frickler | mnasiadka: https://opendev.org/opendev/irc-meetings/src/branch/master/meetings/containers-team-meeting.yaml | 07:50 |
mnasiadka | frickler: thanks :) | 07:53 |
jrosser | could i get a hold on job openstack-ansible-deploy-aio_capi-ubuntu-jammy for change 893240 | 09:49 |
frickler | jrosser: I can do that in a moment | 10:09 |
frickler | corvus: can you have a look at https://zuul.opendev.org/t/openstack/build/52f01074b2eb487993ede049d858a660 please? nova certainly does have a master branch, doesn't it? | 10:10 |
frickler | note that this is for stable/pike and I'm deleting the whole .zuul.yaml there now anyway, I'm just not sure how to interpret that error | 10:11 |
frickler | jrosser: done, recheck triggered | 10:14 |
frickler | elodilles: infra-root: I've done https://review.opendev.org/c/openstack/blazar/+/893846 now, but (of course) now with no job running at all it also cannot be merged. would you prefer adding a noop job like we do for newly created projects or just force merge? | 10:37 |
elodilles | frickler: if noop does the trick, then it's OK to me | 10:51 |
jrosser | frickler: thankyou | 10:54 |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Add charms-purestorage group for purestorage charms https://review.opendev.org/c/openstack/project-config/+/893377 | 11:36 |
fungi | we'll probably be able to take advantage of this by the time we get mailman auth wired up to our keycloak server: https://github.com/pennersr/django-allauth/commit/ab70468 (that's just merged to the auth lib we're using) | 11:51 |
*** dviroel_ is now known as dviroel | | 12:27 |
*** dhill is now known as Guest2025 | | 12:32 |
*** d34dh0r5- is now known as d34dh0r53 | | 12:33 |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Add the Pure Storage Flashblade charm-manila subordinate https://review.opendev.org/c/openstack/project-config/+/893910 | 13:08 |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Add the Pure Storage Flashblade charm-manila subordinate https://review.opendev.org/c/openstack/project-config/+/893910 | 13:08 |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Add the Pure Storage Flashblade charm-manila subordinate https://review.opendev.org/c/openstack/project-config/+/893910 | 13:13 |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Switch charm-cinder-purestorage acl https://review.opendev.org/c/openstack/project-config/+/893912 | 13:16 |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Switch charm-cinder-purestorage acl https://review.opendev.org/c/openstack/project-config/+/893912 | 13:16 |
opendevreview | Alex Kavanagh proposed openstack/project-config master: Add the Pure Storage Flashblade charm-manila subordinate https://review.opendev.org/c/openstack/project-config/+/893910 | 13:22 |
opendevreview | Harry Kominos proposed openstack/diskimage-builder master: feat: Add new fail2ban elemenent https://review.opendev.org/c/openstack/diskimage-builder/+/892541 | 13:25 |
fungi | corvus: whenever you have a moment, no urgency at all, but this is an example of a leaked image in iad from yesterday. oddly, i can't seem to find a mention of the image name in either of our builders: https://paste.opendev.org/show/bIzGJD8QKpW13Dqg4LpH/ | 13:27 |
fungi | there's a list of 24 leaked image uuids in ~fungi/iad.leakedimages on bridge | 13:29 |
fungi | in case you need more examples | 13:29 |
fungi | oh, right, because the image names the builders log are in the new build id suffix format rather than the serial suffix that gets uploaded | 13:35 |
fungi | i'm not sure how to reverse the serial suffix name to a build id in order to find the exact upload attempts | 13:35 |
fungi | since the upload gives up before it gets a uuid for the image | 13:37 |
fungi | and the builder no longer mentions the serial at all | 13:37 |
corvus | frickler: bug in zuul. fix in https://review.opendev.org/c/zuul/zuul/+/893925 | 14:02 |
frickler | corvus: ah, cool, thx | 14:05 |
frickler | meh, I fixed blazar, now zuul complains about blazar-dashboard https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/891628 . corvus is there a chance zuul could report all possibly affected projects at once? | 14:06 |
opendevreview | Merged openstack/project-config master: Reduce frequency of image rebuilds https://review.opendev.org/c/openstack/project-config/+/893588 | 14:09 |
opendevreview | Bernhard Berg proposed zuul/zuul-jobs master: prepare-workspace-git: Add ability to define synced pojects https://review.opendev.org/c/zuul/zuul-jobs/+/887917 | 14:10 |
corvus | frickler: that could get very large; it only reports the first to avoid leaving multi-GB messages. normally i'd suggest looking at codesearch, but i guess these are on unindexed branches? | 14:11 |
opendevreview | Merged openstack/project-config master: Add charms-purestorage group for purestorage charms https://review.opendev.org/c/openstack/project-config/+/893377 | 14:17 |
corvus | fungi: well, that sure doesn't look like it has nodepool metadata attached to it. also, neither does any other image in rackspace. | 14:31 |
corvus | oh wait i take that back | 14:31 |
corvus | i see nodepool_build_id='5fefa50bae64483b9e6f8b6a21664758', nodepool_provider_name='rax-dfw', nodepool_upload_id='0000000001' in one of the other images | 14:32 |
corvus | but i don't see those in the paste | 14:32 |
corvus | so it does look like they were not added to the leaked image | 14:32 |
fungi | i expect that metadata gets attached to images after they're imported, so if the sdk gives up waiting for the image to show up then it never gets around to attaching the metadata to it? but i'm really not familiar enough with the business logic in the sdk to know for sure. it could also just be some sort of internal timeout between services inside rackspace responsible for handing off the metadata, i suppose | 14:36 |
fungi | thanks for looking, corvus! i wonder if we can more conveniently script something that just checks our tenant for private images lacking nodepool metadata and cleans those up directly, rather than comparing uuids between lists from zk and glance | 14:38 |
fungi | looks like `openstack image list` allows filtering by property (key=value), so might be able to fetch a list of them that way | 14:40 |
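A minimal sketch of the cleanup approach fungi floats here, using openstacksdk rather than the CLI: list the tenant's private images and flag any that lack the nodepool_* properties corvus mentioned above. The cloud name, the reliance on image.properties for custom glance metadata, and the commented-out delete call are assumptions/placeholders, not a vetted script.

```python
import openstack

# "rax-iad" is a placeholder clouds.yaml entry name.
conn = openstack.connect(cloud="rax-iad")

candidates = []
for image in conn.image.images(visibility="private"):
    # nodepool normally stamps uploads with nodepool_build_id,
    # nodepool_provider_name and nodepool_upload_id; here we assume custom
    # glance metadata surfaces in image.properties.
    props = getattr(image, "properties", None) or {}
    if "nodepool_build_id" not in props:
        candidates.append(image)

for image in candidates:
    print(f"possible leak: {image.id} {image.name}")
    # conn.image.delete_image(image)  # only after manual review
```

Note that `openstack image list --property key=value` filters on a property being present with a given value; filtering on the absence of a property is the part that seems to need a small script along these lines.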
corvus | fungi: https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/image/v2/_proxy.py#L755 | 14:42 |
corvus | yes, it looks like when using the v2 task api, the sdk attaches image metadata after the image is successfully imported. | 14:42 |
corvus | i think that's the api that is relevant here? | 14:42 |
fungi | i agree | 14:43 |
fungi | i guess a longer timeout in the image create call would counter that to some extent | 14:43 |
corvus | do you know where the task api is documented? | 14:43 |
fungi | hah | 14:43 |
fungi | sorry, it's a running joke | 14:43 |
fungi | the task api hails from the bad old days of "vendor extensible apis" so it can be/do just about anything the provider wants | 14:44 |
corvus | so i guess there would be "rackspace task api documentation" then? | 14:44 |
fungi | as i understand it, yes | 14:45 |
corvus | i'm wondering if it would be valid to include the metadata in https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/image/v2/_proxy.py#L732 | 14:45 |
corvus | alternatively, i wonder if glance_task.result['image_id'] is available and valid in the case of an exception, and could be used to attach metadata in the exception handler | 14:46 |
fungi | i think this might be the current api doc for tasks, but of course rackspace is also running something ancient which may not be entirely glance either: https://docs.openstack.org/api-ref/image/v2/index.html#tasks | 14:47 |
fungi | i've observed test uploads to their glance, and the basic process is that the image create call returns an "import" task id (uuid) then the task is inspected and, once it completes, its data includes the uuid of the image that was created | 14:48 |
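Purely to illustrate the flow fungi describes here (and corvus's "idea #2" of attaching metadata from the task result), a rough sketch; the proxy method names, the six-hour polling window, the cloud name, the image URL and the property values are assumptions or placeholders, not the actual sdk or nodepool code:

```python
import time

import openstack

conn = openstack.connect(cloud="rax-iad")  # placeholder cloud name

# Create the glance "import" task; per the provider docs corvus links just
# below, image_properties may only carry the image name at this stage.
task = conn.image.create_task(
    type="import",
    input={
        "import_from": "https://example.test/images/ubuntu-jammy.vhd",
        "image_properties": {"name": "ubuntu-jammy-0000000123"},
    },
)

# Poll the task well past the point the builder currently gives up; the
# image can appear hours after the upload itself finished.
deadline = time.time() + 6 * 3600
while time.time() < deadline:
    task = conn.image.get_task(task)
    if task.status in ("success", "failure"):
        break
    time.sleep(60)

if task.status == "success":
    # This is the glance_task.result['image_id'] corvus refers to; once we
    # have it we can attach the nodepool metadata ourselves, even if an
    # earlier wait timed out.
    image_id = task.result["image_id"]
    conn.image.update_image(
        image_id,
        nodepool_build_id="5fefa50bae64483b9e6f8b6a21664758",  # example values
        nodepool_provider_name="rax-dfw",
        nodepool_upload_id="0000000001",
    )
```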
corvus | https://docs-ospc.rackspace.com/cloud-images/v2/api-reference/image-task-operations#task-to-import-image | 14:49 |
corvus | warning:: Name is the only property that can be included in image-properties. Including any other property will cause the operation to fail. | 14:49 |
corvus | so that's a no on idea #1 | 14:49 |
corvus | fungi: yeah, but since it's leaving an image object around, perhaps it includes the image id in the failure case too | 14:50 |
corvus | so idea #2 might work if that's the case | 14:50 |
fungi | well, we're not getting task failures. the sdk is just giving up waiting for the image to be ready | 14:50 |
fungi | once the image is ready, the task object does include the uuid of the image | 14:50 |
fungi | regardless of whether the sdk is still looking for it at that point | 14:51 |
fungi | it's just that it might be 6 hours after the image was uploaded | 14:51 |
corvus | oh | 14:51 |
corvus | it looks like the timeout comes from the create_image call, so i think we can pass that in from nodepool | 14:52 |
fungi | yes, we can. frickler made a temporary patch that hard-coded a nondefault timeout value in the builder's invocation of the sdk method | 14:52 |
fungi | and afterward you indicated that adding a config option in nodepool for that would be acceptable if we wanted it | 14:53 |
fungi | we were just pursuing other avenues first before determining whether it was necessary | 14:53 |
corvus | we used to have a 6 hour timeout, but apparently that disappeared | 14:54 |
corvus | probably lost in one of the sdk refactors | 14:55 |
corvus | https://opendev.org/zuul/nodepool/src/branch/master/nodepool/builder.py#L43 | 14:55 |
corvus | but there's the constant; unused. | 14:55 |
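For context, roughly the shape such a fix could take: re-wire that unused six-hour value (or a per-provider setting) into the sdk call. This is a sketch only, not the content of the change corvus posts below; the image_upload_timeout config attribute is hypothetical, and the create_image kwargs are assumed to match the sdk's cloud-layer signature.

```python
# Fall back to a six-hour wait (the old default discussed above) unless the
# provider diskimage config supplies its own timeout. The config attribute
# name is made up for illustration.
DEFAULT_IMAGE_UPLOAD_TIMEOUT = 6 * 60 * 60


def upload_image(cloud, image_name, filename, provider_image, meta):
    timeout = (
        getattr(provider_image, "image_upload_timeout", None)
        or DEFAULT_IMAGE_UPLOAD_TIMEOUT
    )
    # openstacksdk's cloud-layer create_image() takes wait/timeout kwargs;
    # with wait=True it polls until the image is active or the timeout hits.
    return cloud.create_image(
        name=image_name,
        filename=filename,
        wait=True,
        timeout=timeout,
        meta=meta,
    )
```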
fungi | oh wow | 14:56 |
fungi | looks like it was last used in snapshot image updates, removed by https://review.openstack.org/396719 in 2016? | 14:58 |
fungi | and removed from upload waiting by https://review.openstack.org/226751 (presumably similarly set in shade back then) | 15:00 |
fungi | though i'm not immediately finding it in shade's history if so | 15:01 |
fungi | so anyway, i take this to mean we haven't had that 6-hour timeout in production for about 7 years | 15:02 |
fungi | wasn't a recent change | 15:02 |
corvus | remote: https://review.opendev.org/c/zuul/nodepool/+/893933 Add an image upload timeout to the openstack driver [NEW] | 15:05 |
fungi | thanks! | 15:05 |
fungi | clarkb: does https://zuul.opendev.org/t/openstack/build/cfe503710fa543618da6bf513c7576ba mean that opensuse dropped the libvirt-python package? i see a python3-libvirt-python in /afs/openstack.org/mirror/opensuse/distribution/leap/15.2/repo/oss/INDEX.gz but i have no idea if i'm looking in the right place | 15:07 |
clarkb | fungi: https://software.opensuse.org/package/libvirt-python "there is no official package for ALL distributions" | 15:14 |
clarkb | looking at https://software.opensuse.org/search?baseproject=ALL&q=libvirt-python I think you may need different package names for different suse releases | 15:15 |
clarkb | I would drop it from the bindep test | 15:15 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Drop libvirt-python from suse in bindep fallback https://review.opendev.org/c/openstack/project-config/+/893935 | 15:19 |
fungi | clarkb: ^ thanks! | 15:19 |
fungi | apparently, git-review takes roughly 5 minutes to give up trying to reach gerrit over ipv6 | 15:20 |
clarkb | I think that will be based on your system tcp settings? | 15:21 |
clarkb | zanata didn't rotate mysql sessions overnight and the error still occurs. I see there is a wildfly systemd unit running on the server which I will restart now to see if that changes things | 15:22 |
clarkb | ianychoi[m]: the DB sql_mode update to the 5.6 default plus a restart of the zanata service appears to have your api request working | 15:24 |
corvus | i'm going to restart the schedulers now to get the default branch bugfix | 15:28 |
fungi | thanks corvus! | 15:28 |
fungi | clarkb: yeah, i know i can tweak the protocol fallback in the kernel, just hoping whatever the v6 routing issue is between here and there clears up soon | 15:29 |
corvus | i'm restarting web too; not necessary, but just to keep schedulers ~= web | 15:30 |
fungi | my outbound traceroute to the server transits eqix-ash.bdr01.12100sunrisevalleydr01.iad.beanfield.com and gets as far as 2607:f0c8:2:1::e (no reverse dns but whois puts that somewhere in beanfield as well), but the return route to me from the gerrit server seems to never escape the vexxhost border (bouncing back and forth between 2604:e100:1::1 and 2604:e100:1::2) | 15:32 |
fungi | guilhermesp_____: mnaser: ^ any idea if there's a routing table issue in ca-ymq-1 which could explain that? can you see where packets for 2600:6c60:5300:375:96de:80ff:feec:f9e7 are trying to go? | 15:34 |
fungi | it's been like this for at least a week, so doesn't seem to be clearing up on its own | 15:34 |
fungi | maybe my isp's announcements are being filtered at the edge or by a peer | 15:35 |
fungi | i haven't had trouble reaching other things over ipv6 though, i can get to servers in vexxhost's sjc1 region over ipv6 with no problem, for example | 15:36 |
corvus | #status log restarted zuul schedulers/web to pick up default branch bugfix | 15:37 |
opendevstatus | corvus: finished logging | 15:37 |
corvus | fungi: who needs to review https://review.opendev.org/893792 (openstack-zuul-jobs) ? | 15:38 |
fungi | corvus: i can, meant to look earlier, thanks! | 15:38 |
corvus | oh cool, thanks :) | 15:38 |
clarkb | https://23.253.22.116/ <- is a held gerrit running in a bookworm container with java 17 | 15:40 |
corvus | we might still see some default branch errors since those values are cached. they should clear as config changes are merged and the cache is updated; but we may still want to do a zuul-admin delete-state this weekend just to make sure all traces are gone. or if it becomes a real problem, we can do that earlier, but that's an outage, so i'd like to avoid it. | 15:40 |
clarkb | Gerrit + bookworm + java 17 seems to generally work. I think we can plan for a short gerrit outage to restart on that container if others agree | 15:42 |
fungi | seems fine, i suppose we could time that to coincide with the zuul restart? | 15:42 |
opendevreview | Merged openstack/project-config master: Drop libvirt-python from suse in bindep fallback https://review.opendev.org/c/openstack/project-config/+/893935 | 15:44 |
clarkb | the zuul restart is done? | 15:45 |
fungi | clarkb: the full down delete-state restart | 15:48 |
fungi | not the rolling restart | 15:48 |
clarkb | ah, I think that restart depends on gerrit being up, but yes we could do gerrit first and then zuul | 15:49 |
frickler | corvus: those issues are all on stable/pike it seems, and I can understand the concern about error size, ack. so I'll work my way through them one by one and hope that'll be a finite task | 15:52 |
clarkb | frickler: might be able to do a git grep for the negative lookahead regex and catch a lot of them upfront? | 15:54 |
frickler | clarkb: that's not about regexes, but about removing some old project templates https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/891628 | 15:56 |
frickler | and I don't have most of those repos locally | 15:57 |
clarkb | ah | 15:57 |
frickler | except now, when I need to create a patch for them | 15:57 |
corvus | another option to consider would be to force-merge the removal, then cleanup based on the new static config errors. you'd have to be pretty sure that isn't going to break branches you care about, because it will definitely break project configs on affected branches. not advocating it, just brainstorming. :) | 15:59 |
frickler | afaict all of those branches are only waiting to be eoled, because they predate release automation. so that's a good idea. let me try to do two more projects manually, and if more show up then we can take that path | 16:03 |
frickler | likely no one would even care if they stay broken until they're eoled | 16:03 |
frickler | I have a quiet fear that blazar might not be the only odd project, but rather just early in the alphabetical list of more to come | 16:04 |
opendevreview | Clark Boylan proposed openstack/project-config master: Set fedora labels min-ready to 0 https://review.opendev.org/c/openstack/project-config/+/893961 | 16:08 |
clarkb | infra-root ^ that one should be safe to merge now | 16:08 |
frickler | that sounds like a good first step, ack | 16:09 |
opendevreview | Clark Boylan proposed openstack/project-config master: Remove fedora-35 and fedora-36 from nodepool providers https://review.opendev.org/c/openstack/project-config/+/893963 | 16:17 |
opendevreview | Clark Boylan proposed openstack/project-config master: Remove fedora image builds https://review.opendev.org/c/openstack/project-config/+/893964 | 16:17 |
clarkb | These two should land on monday with some oversight to ensure things clean up sufficiently | 16:17 |
opendevreview | Merged openstack/project-config master: Set fedora labels min-ready to 0 https://review.opendev.org/c/openstack/project-config/+/893961 | 16:24 |
corvus | the next nodepool restarts should happen on bookworm-based images | 16:28 |
corvus | so heads up for any unexpected behavior changes there | 16:29 |
frickler | corvus: do you want to restart now/soon or wait for the usual cycle? | 16:33 |
corvus | i was assuming we just wanted the usual cycle; but i don't have a strong opinion | 16:42 |
frickler | if we see the probability of issues as low I'm fine with waiting, otherwise a restart where we can more closely watch might be better | 17:03 |
*** ykarel is now known as ykarel|away | | 17:29 |
clarkb | the nodepool services have updated fwiw | 17:29 |
clarkb | live disk images are huge these days. | 17:37 |
fungi | going to pop out for a late lunch, bbiab | 17:50 |
frickler | ah, I was confused earlier, thinking that nodepool would be updated together with zuul on the weekly schedule. sorry if what I was saying didn't make much sense | 17:58 |
clarkb | frickler: ah ya nodepool updates ~hourly | 18:00 |
clarkb | my laptop exhibits the same behavior under ubuntu jammy's desktop livecd, which runs linux 6.2, a kernel version that definitely worked back when I ran it. Unfortunately that makes me pretty confident I have a hardware issue | 18:00 |
guilhermesp_____ | fungi: huuuum i think to be able to filter then i would need to check which hv the vm lives on... we did have a minor issue with core switches this week but it was in the amsterdam region... nothing suspicious in mtl | 18:19 |
*** ralonsoh is now known as ralonsoh_away | | 18:55 |
frickler | corvus: zuul seems to be reporting the errors twice on https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/891628 each time, I didn't notice that earlier, but seems to have been happening since the first result | 19:09 |
frickler | also now some charm repo is broken. | 19:09 |
frickler | will continue looking at that tomorrow | 19:09 |
corvus | frickler: likely due to being in two tenants | 19:13 |
corvus | (or could just be 2 pipelines) | 19:14 |
fungi | guilhermesp_____: i see the same problem to both 16acb0cb-ead1-43b2-8be7-ab4a310b4e0a (review02.opendev.org) and a064b2e3-f47c-4e70-911d-363c81ccff5e (mirror01.ca-ymq-1.vexxhost.opendev.org) | 19:15 |
fungi | guilhermesp_____: from the cloud side it looks like a routing loop, but it's hard to tell with the limited amount of visibility i see | 19:16 |
corvus | frickler: it's because it's in check and check-arm64 | 19:17 |
corvus | clarkb: fungi maybe gerrit + zuul reboot saturday morning pst? | 20:04 |
fungi | i should be around, sgtm | 20:05 |
fungi | happy to do some/all of it | 20:05 |
clarkb | I can be around for that too | 20:14 |
corvus | i wonder how long it's been since we actually had zuul offline. it's been a pretty good run. :) | 20:16 |
clarkb | we should probably merge the gerrit change on friday then since we won't auto restart on the new image | 20:17 |
opendevreview | Merged zuul/zuul-jobs master: Remove the nox-py27 job https://review.opendev.org/c/zuul/zuul-jobs/+/892408 | 20:46 |
fungi | this discussion seems hauntingly familiar: https://discuss.python.org/t/33122 | 21:50 |
guilhermesp_____ | fungi: hum ok... im going out for holidays tomorrow. Maybe if you could just open up a ticket for us to keep track and we can investigate that ( or someone else from my team tomorrow when im back ) -- just send an email to support@vexxhost.com | 21:58 |
fungi | guilhermesp_____: sure, will do. thanks! | 22:32 |