Thursday, 2022-04-28

opendevreview	Merged openstack/diskimage-builder master: Fix dhcp-all-interfaces on debuntu systems https://review.opendev.org/c/openstack/diskimage-builder/+/839080	00:05
opendevreview	Merged openstack/diskimage-builder master: Switch to release-notes-jobs-python3 https://review.opendev.org/c/openstack/diskimage-builder/+/839599	00:49
*** rlandy\|bbl is now known as rlandy		00:59
*** rlandy is now known as rlandy\|out		01:00
*** ysandeep\|out is now known as ysandeep		01:23
opendevreview	Merged openstack/diskimage-builder master: Set machine-id to uninitialized to trigger first boot https://review.opendev.org/c/openstack/diskimage-builder/+/837251	01:35
*** ysandeep is now known as ysandeep\|breakfast		03:16
*** ysandeep\|breakfast is now known as ysandeep		04:19
*** ysandeep is now known as ysandeep\|afk		06:06
frickler	infra-root: hrw: it looks like the c9s wheel job is broken and thus blocking wheel updates. https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/836829 https://zuul.opendev.org/t/openstack/builds?job_name=publish-wheel-cache-centos-9-stream	06:41
ianw	frickler: yeah, looking at it now. it comes back to the openafs publishing	06:41
ianw	the promote job has failed so we don't have the rpms, i think we need to untangle that first	06:42
ianw	https://zuul.opendev.org/t/openstack/build/a0c2266464534e2b8267559564f2f609/logs was the last run	06:42
ianw	but all logs are gone	06:42
*** pojadhav- is now known as pojadhav		06:44
frickler	ianw: ah, o.k., I just noticed that this looked weird when checking the AFS graphs, go ahead, then ;)	06:46
ianw	yeah i noticed yesterday when looking how much space we'd saved with the src file removal too	06:46
*** ysandeep\|afk is now known as ysandeep		06:48
*** jpena\|off is now known as jpena		06:58
ianw	2022-04-28 06:59:10.630370 \| centos-7 \| error: Failed build dependencies:	07:01
ianw	2022-04-28 06:59:10.630649 \| centos-7 \| kernel-devel-x86_64 = 3.10.0-1160.59.1.el7 is needed by openafs-1.8.8.1-1.el7.x86_64	07:01
ianw	i bet we have build issues with centos-7 images; this happens when our images get out of sync with the mirror	07:02
frickler	Cannot find a valid baseurl for repo: base/$releasever/x86_64	07:03
frickler	https://nb01.opendev.org/centos-7-0000263671.log	07:03
frickler	centos-7 images look to be 22 days old, quite similar to the wheel age ...	07:05
ianw	that's definitely part of it. the images get out of sync, and then the kernel version the images have disappears from the mirrors, and then we can't find the -devel packages for the kernel it's running, and so can't build openafs, so can't publish wheels ...	07:06
ianw	this is the first yum call in the chroot	07:09
ianw	https://3312f1bac6b015e072cd-6f4fdaa50c9ffb2ee70643e96aea629f.ssl.cf1.rackcdn.com/838863/7/check/dib-nodepool-functional-openstack-centos-7-src/0cbbe24/nodepool/builds/test-image-0000000001.log	07:12
ianw	is a good gate build	07:12
ianw	it looks pretty much the same. i'm going to have to investigate this more tomorrow	07:13
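The chain ianw describes at 07:06 (stale image, kernel version rotated off the mirror, no matching -devel package, no openafs build) can be checked mechanically. A minimal sketch, assuming hypothetical helpers `needed_kernel_devel` and `can_build` and a made-up set of versions still on the mirror; only the error line comes from the log above.

```python
import re

# Error line as pasted at 07:01.
ERROR_LINE = (
    "2022-04-28 06:59:10.630649 | centos-7 | kernel-devel-x86_64 = "
    "3.10.0-1160.59.1.el7 is needed by openafs-1.8.8.1-1.el7.x86_64"
)

def needed_kernel_devel(line):
    """Return the kernel-devel version the build asked for, or None."""
    match = re.search(r"kernel-devel-\S+ = (\S+) is needed by", line)
    return match.group(1) if match else None

def can_build(line, mirror_versions):
    """True when the mirror still carries the required kernel-devel."""
    needed = needed_kernel_devel(line)
    return needed is not None and needed in mirror_versions

# Hypothetical mirror contents: the image's kernel has been rotated out.
mirror_versions = {"3.10.0-1160.62.1.el7"}
print(needed_kernel_devel(ERROR_LINE))
print(can_build(ERROR_LINE, mirror_versions))
```

A check like this only detects the mismatch; the fix is still a fresh image whose kernel matches what the mirror carries.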
hrw	have you considered moving building from VM to containers? that way host can run one OS, has AFS mounted and then in container you have other OS with AFS already mounted	08:00
*** ysandeep is now known as ysandeep\|lunch		08:13
*** pojadhav is now known as pojadhav\|lunch		08:21
ianw	hrw: yeah, a lot of this was written before containers were really a consideration :) the other thing we could do that's potentially less impact is to copy through the zuul executor. i've had that on my todo list for a long time, there might even be a spec about it	08:34
ianw	but yes, in 2022 it's a good option :)	08:35
hrw	ianw: less OS related stuff to handle as you can run infra on one distro and use other ones in containers to build stuff	08:35
hrw	but then it would be to decide 'move from DIB to dockerfiles' or 'add container building into DIB' probably?	08:37
*** pojadhav\|lunch is now known as pojadhav		09:11
*** ysandeep\|lunch is now known as ysandeep		09:40
*** rlandy\|out is now known as rlandy		10:25
*** ysandeep is now known as ysandeep\|afk		10:56
frickler	Warning: Change 826541 in project zuul/nodepool does not share a change queue with 826543 in project openstack/openstacksdk	11:03
frickler	that's a weird way for zuul to say that the dependency hasn't merged yet	11:03
frickler	also seems that that nodepool quota unit test has developed some persitent failure	11:05
frickler	persistent even	11:05
*** dviroel\|rover\|out is now known as dviroel\|rover		11:23
opendevreview	Merged openstack/project-config master: Add the cinder-three-par to Openstack charms https://review.opendev.org/c/openstack/project-config/+/837782	11:38
*** pojadhav is now known as pojadhav\|afk		11:42
*** ysandeep\|afk is now known as ysandeep		11:45
*** pojadhav\|afk is now known as pojadhav		12:45
*** pojadhav is now known as pojadhav\|afk		13:42
*** ysandeep is now known as ysandeep\|out		14:10
*** jpena is now known as jpena\|off		14:14
*** pojadhav\|afk is now known as pojadhav		15:17
clarkb	hrw: ianw: note I'm not sure containers would help in this case because the issue is a mismatch between expected kernel dev headers in userspace and the running kernel of the system. Containers would only make that problem worse as ubuntu and centos and debian all run completely different kernels	15:21
clarkb	also the problem is orthogonal to building the VM images with DIB (but dib does support the containerfile element now if you wish to use that). This is about building an openafs client for the running kernel to write out the wheel packages to the filesystem	15:22
hrw	clarkb: host OS loads openafs kernel module, mounts afs volumes. then container runs and gets afs volume already mounted by host os	15:22
clarkb	sure, but that doesn't solve the problem of having an openafs client, you just externalize it	15:23
clarkb	the problem continues to exist either way	15:23
hrw	clarkb: run hostOS vm, mount afs, run container with guest OS, do builds etc, exit, sync afs in hostOS, shutdown VM?	15:24
clarkb	hrw: yes but where does hostOS vm get its openafs client?	15:24
clarkb	thats the problem here :) and its an issue on all the platforms	15:24
hrw	clarkb: you already have that covered for several host OSes	15:24
clarkb	they just break at different times due to different pace of the different host platforms	15:24
hrw	just choose one where it works	15:24
clarkb	hrw: only because we've solved this problem	15:24
clarkb	with a fair bit of effort is my point and containers don't solve that	15:24
hrw	this way you sort it once per 2 years and have it done	15:25
clarkb	well once per $afs_breaks_time	15:25
clarkb	but ya it would reduce the problem space.	15:25
hrw	grab ubuntu 20.04 for example as host OS, run it for 5y and then migrate to 24.04 for another 5y?	15:26
clarkb	that doesn't work because the ubuntu openafs client doesn't work which is my point	15:26
clarkb	we solve this same problem on ubuntu and debian too	15:26
clarkb	they just don't break at the same time as centos	15:26
hrw	or any other host OS supported by AFS upstream	15:26
clarkb	there are none	15:26
clarkb	this is the biggest drawback to using openafs	15:27
hrw	move to git with lfs?	15:27
clarkb	I'm just saying its easy to say "use a container" when the real problem is we have to build our own openafs packages for the kernels we run against	15:27
hrw	clarkb: understood	15:27
clarkb	git + lfs is not a globally distributed filesystem	15:27
clarkb	its unfortunate that there aren't any more modern alternatives to afs because it solves the mirroring problem quite elegantly	15:28
clarkb	its basically a filesystem with built in CDN	15:28
clarkb	We get to maintain one (really two) copies of the data then all readers see that content cohesively at roughly the same time when updates are made. This means we don't need 2TB of disk in every cloud to manage mirrors, we need 10% of that for caches	15:29
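clarkb's estimate above can be turned into rough arithmetic. A sketch only: the 2 TB mirror size, the 10% cache figure and the two central copies come from the message, while the region count is a hypothetical number chosen for illustration.

```python
# Figures from the log: a full mirror is about 2 TB, a cache about 10% of that.
FULL_MIRROR_GB = 2000
CACHE_GB = FULL_MIRROR_GB // 10

def full_mirror_total(regions):
    """Disk needed if every region held its own complete mirror."""
    return regions * FULL_MIRROR_GB

def afs_total(regions, copies=2):
    """Disk needed with `copies` central AFS copies plus one cache per region."""
    return copies * FULL_MIRROR_GB + regions * CACHE_GB

regions = 10  # hypothetical region count, not from the log
print(full_mirror_total(regions), afs_total(regions))
```

The gap widens with every added region, since each one costs a cache rather than a full copy.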
hrw	still - having one host OS to worry about instead of several ones?	15:30
clarkb	we could definitely engineer something like a 404 handler that queries against the pristine copy and manage caching more directly. But so far afs has done well enough	15:30
fungi	i think the issue at hand had to do with wheel builds, and the suggestion was to have openafs in an ubuntu vm but then perform wheel builds in a centos container/chroot?	15:30
clarkb	fungi: yes. I'm saying that just punts the problem of the openafs packaging	15:31
clarkb	I don't really think running the centos build on a centos machine to make centos wheels is significantly more effort than managing the PPA for ubuntu packages then coming up with an indirection layer to run stuff in containers on top of that	15:31
clarkb	(and also I wanted to make sure that it was clear DIB isn't involved in this as the two things got conflated. Dib supports the functionality that has been suggested there )	15:32
fungi	yes, i agree there. i wasn't sure where dib's containerfile backend was coming into the discussion, but i'm trying to follow three conversations simultaneously so i probably skimmed poorly	15:32
clarkb	copying through the executor and not thinking too hard about where the wheels are actually built is probably the most resilient thing long term and ianw made that suggestion	15:34
clarkb	since the executors already do the data shuffling all day long	15:34
clarkb	its a well exercised method that doesn't require any new tooling be built. Just a switch to the existing tooling	15:35
fungi	yeah, we've been talking about that as an improvement for a while now	15:36
fungi	though it does involve additional data copying	15:36
clarkb	for wheels specifically I still think it would be helpful to have our upstream deps publish wheels. One of the biggest offenders is libvirt-python and we know the people there but apparently pypi license terms are unfavorable to them? I've never understood the argument	15:36
fungi	much less now that we don't unnecessarily include wheels which are already available on pypi	15:36
clarkb	The problem with our wheel cache is that others don't have it and we semi frequently discover that people outside of CI running our software are broken as a result	15:37
clarkb	(I don't understand the libvirt-python pypi concern because they do publish sdists)	15:37
fungi	wheels would need to have libvirt itself vendored in	15:39
fungi	since it's not a base lib for manylinux	15:39
fungi	so it may be that they object to distributing built libvirt binaries on pypi	15:39
clarkb	wouldn't it only need the linker info to be present? I suppose if that changes drastically between libvirt versions then you'd be in the same boat	15:40
clarkb	but iirc they said they could do it except for some licensing terms they didn't like	15:41
clarkb	which confused me because they are ok with the sdists	15:41
fungi	it would only need the linker info if you were installing libvirt from some separate place	15:41
fungi	but generally, wheels embed pre-built c libraries if they're outside the set expected to be available on most systems	15:42
clarkb	right if you install libvirt-python they could assume you have a libvirt installed separately rather than bundling it. But that likely gets into the trouble of libvirt proper changing its interfaces	15:42
clarkb	fungi: is that what cryptography does?	15:42
clarkb	(they seem to be at the forefront of python linking to external resources and building wheels for all the things)	15:42
fungi	yes	15:43
*** dviroel\|rover is now known as dviroel\|rover\|lunch		16:02
*** dviroel\|rover\|lunch is now known as dviroel\|rover		16:41
clarkb	TIL about pinky(1)	16:42
clarkb	Luca has warned people about upgrading to 3.5 and to carefully monitor performance and heap utilization	16:44
clarkb	I think the change I pushed may just be one piece of that puzzle	16:44
clarkb	Luca sent me a gerrithub outage analysis doc that I need to read through when I've got time to digest it talking about the issues they have seen	16:45
clarkb	Just a heads up that we might need to take the 3.5 upgrade carefully and with extra testing	16:45
clarkb	ianw: if timing works out today I think I'd like to land https://review.opendev.org/c/opendev/system-config/+/839621 and then can query you for info on the reprepro stuff that needs to be done?	16:57
clarkb	ianw: in scrollback two different commands are mentioned `clearvanished` and `deleteunreferenced`. Is the process roughly: grab the volume lock on mirror-update, then run the reprepro clearvanished and deleteunreferenced commands against that repo using the krb credentials, then finally vos release?	16:59
fungi	huh, i have pinky installed courtesy of coreutils	17:04
fungi	seems functionally similar to who/w	17:05
opendevreview	Gage Hugo proposed openstack/project-config master: End project gating for openstack-helm-docs https://review.opendev.org/c/openstack/project-config/+/839103	17:15
fungi	infra-root: rackspace support ticket #220427-ord-0001314 is warning us that there will be a block storage maintenance 2022-05-11 03:00-04:00 utc impacting these volumes: afs01.ord.openstack.org/main02 backup01.ord.rax.opendev.org/main02 mirror01.ord.rax.opendev.org/main01	17:15
fungi	it looks like they've set some anti-affinity scheduling so if we clone or otherwise replace volumes in advance we won't be impacted	17:17
fungi	that might be a good idea at least for afs01.ord	17:17
clarkb	the ord afs server doesn't serve that much stuff	17:17
opendevreview	Gage Hugo proposed openstack/project-config master: Retire openstack-helm-docs repo, step 3.3 https://review.opendev.org/c/openstack/project-config/+/839427	17:17
clarkb	but if that helps prevent fallout with vos releases may be worthwhile	17:17
fungi	i'm more worried about the afs sync time if we have to rebuild its filesystem from scratch due to corruption	17:17
clarkb	for the backup server we can just unmount it I think?	17:18
fungi	yeah, or shut the server down temporarily	17:18
fungi	we could shut down afs01.ord temporarily as well, and let things fail over	17:19
opendevreview	Gage Hugo proposed openstack/project-config master: Retire openstack-helm-docs repo, step 3.3 https://review.opendev.org/c/openstack/project-config/+/839427	17:20
opendevreview	Merged openstack/project-config master: End project gating for openstack-helm-docs https://review.opendev.org/c/openstack/project-config/+/839103	17:39
fungi	one up-side to replacing those cinder volumes ahead of the maintenance is that we don't have to remember to turn anything off right before. it'll be happening well into the evening for most of us, so could result in hung afs volumes, corrupt backups, job failures because the mirror went away, et cetera which we won't necessarily spot until the next day. and it's taking place in the middle of	17:44
fungi	the week (tuesday night/wednesday morning my time)	17:44
clarkb	ya swapping them out definitely seems advantageous and lvm makes it easy	17:45
fungi	i'll get started on that for the afs server in a bit, then backup and mirror as time permits	17:47
*** rlandy is now known as rlandy\|mtg		19:05
*** rlandy\|mtg is now known as rlandy		19:23
*** artom__ is now known as artom		19:32
*** dviroel\|rover is now known as dviroel\|rover\|brb		20:31
ianw	clarkb: yep, what you said is about it, you need --nokeepunrefrenced i think to deleteunreferenced just ... because	20:39
clarkb	ianw: ok cool. Back from lunch and going to look into that if we can land that change	20:47
clarkb	oh looks like it is approved perfect	20:48
clarkb	I've grabbed the debian-docker flock on mirror-update in a new window in the root screen you started	20:52
clarkb	once the change lands I'll run those commands	20:52
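Holding the flock by hand, as clarkb does above, keeps a scheduled mirror run from starting while maintenance is in progress. The behaviour can be sketched with an advisory lock on a scratch file; the lock path below is hypothetical (the real path on mirror-update is not shown in the log), `try_lock` is an illustrative helper, and the sketch assumes a Unix system where `fcntl` is available.

```python
import fcntl
import os
import tempfile

def try_lock(path):
    """Try to take an exclusive flock on path; return the open file or None."""
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle

# Hypothetical lock file in a scratch directory.
lock_path = os.path.join(tempfile.mkdtemp(), "debian-docker.lock")

held = try_lock(lock_path)      # manual maintenance takes the lock
second = try_lock(lock_path)    # a mirror run starting now is refused
print(held is not None, second is None)

held.close()                    # closing the file releases the lock
third = try_lock(lock_path)     # the next run can proceed
print(third is not None)
third.close()
```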
clarkb	I seem to have crashed firefox	20:54
clarkb	and now it won't start again	20:54
clarkb	and `firefox --ProfileManager` isn't helping	20:54
clarkb	ok its because / remounted ro due to ext4 errors :/	20:57
clarkb	I'm going to reboot and hope it is ok	20:57
opendevreview	Merged opendev/system-config master: Stop mirroring source packages for debian-docker https://review.opendev.org/c/opendev/system-config/+/839621	21:03
fungi	note that the debian-docker cleanup is likely a no-op	21:12
fungi	i think they didn't serve any source packages for us to mirror	21:12
fungi	though i'll admit i didn't look too hard either	21:13
clarkb	ya but I don't know if reprepro will complain.	21:16
clarkb	I figured go through the steps and figure them out	21:16
clarkb	also re sad ext4 I gave up trying to fsck as I can't find a way to do that with my tumbleweed install without booting off usb drive	21:16
clarkb	if it happens again I guess I should debug harder	21:16
clarkb	ianw: the command history in window 0 of the screen doesn't seem to have those commands?	21:20
clarkb	I guess reprepro-mirror-update doesn't need to aklog? Anyway I see the reprepro commands using k5start there so I should be able to run reprepro in the same way but passing the clearvanished and deleteunreferenced commands. Then run reprepro-mirror-update to do a vos release	21:25
clarkb	just waiting for the config update to deploy then I'll do that	21:26
clarkb	`k5start -t -f /etc/reprepro.keytab service/reprepro -- reprepro --confdir /etc/reprepro/debian-docker-xenial clearvanished`	21:32
fungi	yeah, all the krb auth is baked into the script	21:32
clarkb	I will run that then replace clearvanished with deleteunreferenced then run the script	21:32
clarkb	then do bionic then focal	21:34
fungi	sounds great	21:34
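The plan agreed above (clearvanished, then deleteunreferenced, for xenial, then bionic, then focal, all with the lock held) can be written out as a list of commands. A minimal sketch: the keytab, principal and xenial confdir are taken from the command pasted at 21:32, while the bionic and focal confdirs are assumed to follow the same naming pattern, and `reprepro_cmd` and `cleanup_plan` are illustrative helpers.

```python
# Values taken from the command pasted at 21:32.
KEYTAB = "/etc/reprepro.keytab"
PRINCIPAL = "service/reprepro"

def reprepro_cmd(release, action):
    """Build the k5start-wrapped reprepro call for one debian-docker release."""
    # Assumption: every release uses the debian-docker-<release> confdir pattern.
    confdir = f"/etc/reprepro/debian-docker-{release}"
    return ["k5start", "-t", "-f", KEYTAB, PRINCIPAL, "--",
            "reprepro", "--confdir", confdir, action]

def cleanup_plan(releases=("xenial", "bionic", "focal")):
    """clearvanished then deleteunreferenced, one release at a time."""
    return [reprepro_cmd(release, action)
            for release in releases
            for action in ("clearvanished", "deleteunreferenced")]

for cmd in cleanup_plan():
    print(" ".join(cmd))
```

The sketch only prints the commands; the regular mirror update script, which handles the vos release, would still be run afterwards as described in the log.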
clarkb	ok looks like deploy is done let me quickly check the config	21:42
clarkb	yup lgtm running that command above with the lock held	21:43
clarkb	fungi: it appears to have been a very large noop for xenial	21:47
clarkb	for all of the commands. But I'll proceed with bionic and focal just to be sure they aren't different	21:47
clarkb	and done	21:50
clarkb	do we want to proceed with landing the ubuntu change? it will definitely not noop	21:50
clarkb	I've released the debian docker flock	21:51
clarkb	https://review.opendev.org/c/opendev/system-config/+/839622 that change. Not sure how long the clearvanished and deleteunreferenced steps are expected to take in a large set of mirrors like that	21:52
fungi	i'm up for it	22:16
fungi	#status log Replaced block storage volume afs01.ord.openstack.org/main02 with main03 in order to avoid service disruption from upcoming provider maintenance activity	22:18
opendevstatus	fungi: finished logging	22:18
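The status entry above records swapping main02 for main03. The log does not show the commands used; one common LVM sequence for this kind of online swap is sketched below, with a hypothetical volume group name and hypothetical device paths.

```python
def lvm_swap_plan(vg, old_pv, new_pv):
    """Return LVM commands that migrate a volume group off old_pv onto new_pv."""
    return [
        ["pvcreate", new_pv],          # initialise the new device as a PV
        ["vgextend", vg, new_pv],      # add it to the volume group
        ["pvmove", old_pv, new_pv],    # move allocated extents while online
        ["vgreduce", vg, old_pv],      # drop the now-empty old PV from the VG
        ["pvremove", old_pv],          # wipe the LVM label from the old device
    ]

# Hypothetical names; the log gives neither the VG name nor the device paths.
for cmd in lvm_swap_plan("main", "/dev/xvdb1", "/dev/xvdc1"):
    print(" ".join(cmd))
```

Because `pvmove` works on a mounted filesystem, nothing has to be shut down for the swap, which matches the motivation fungi gives for doing it ahead of the maintenance window.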
clarkb	fungi: ok I approved it	22:20
clarkb	a currently running ubuntu reprepro run has the flock	22:21
clarkb	I'll try to grab it	22:21
clarkb	Not sure how long those runs typically take. It started ~6 minutes ago	22:21
fungi	is there any reason why the vgs on the borg servers used raw block instead of partitions?	22:26
clarkb	fungi: does the vicepa system need raw block?	22:27
fungi	for borg?	22:27
clarkb	oh sorry I read it as afs servers for some reason	22:27
fungi	this is all at the lvm layer	22:27
clarkb	I don't think borg cares it is all on top of your fs not doing direct device manipulation	22:27
fungi	right, i'm following https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#cinder-volume-management and it has us partition the attached cinder volume and then add the partition as a pv in the vg, but wondering if there was a specific reason that was avoided when the borg server was built	22:29
clarkb	ianw may know as he did most of the setup on those	22:29
clarkb	as for why to use partitions, I think it makes it clear that the volume has been used	22:29
clarkb	but when it is raw it is harder to make that distinction?	22:30
clarkb	though lsblk may tell you either way	22:30
fungi	right, partitioning lets you set the partition type to lvm, which may aid in scanning, i don't know if it still does these days	22:30
fungi	and also possibly with block alignment	22:32
fungi	anyway, one of the pvs in the main-202010 vg is now a partition while the other two are raw devices. lvm doesn't really care	22:32
fungi	we can replace the others over time as opportunity arises	22:33
fungi	for consistency with our documented process and our other existing servers	22:33
clarkb	++	22:33
clarkb	ubuntu flock is still held. /me tries to be patient	22:41
*** rlandy is now known as rlandy\|out		22:45
clarkb	it is vos releasing now so hopefully soon it will be done	22:55
opendevreview	Merged opendev/system-config master: Stop mirroring source packages for ubuntu https://review.opendev.org/c/opendev/system-config/+/839622	22:55
opendevreview	Steve Baker proposed openstack/diskimage-builder master: Parse block device lvm lvs size attributes https://review.opendev.org/c/openstack/diskimage-builder/+/839829	22:56
opendevreview	Steve Baker proposed openstack/diskimage-builder master: Make centos reset-bls-entries behave the same as rhel https://review.opendev.org/c/openstack/diskimage-builder/+/839830	22:56
clarkb	woot I have the lock now	22:57
ianw	sorry had to run around, back now	22:58
ianw	i guess i must have done them as raw partitions. i usually make a partition and set its type to lvm, so i'm not sure what happened	22:59
clarkb	ianw: no worries, pushing along with the ubuntu source repo removal. The change just merged and I have the lock. I'll run clearvanished and deleteunreferenced then the main script in that order once the configs update	22:59
ianw	clarkb: hrm, i guess the bash history disappears after you run bash under flock? those are the bits missing from history	22:59
clarkb	ianw: let me know if you think we should wait for some reason	22:59
clarkb	ianw: ya I was wondering if it was in a subshell once I exit'd the subshell that held the debian docker lock	23:00
ianw	i don't see any reason to wait, from what i could tell of codesearch i couldn't see anyone setting them up	23:01
clarkb	ok the config update is done. Running clearvanished now	23:01
clarkb	lots of stuff like 'There are still packages in 'focal-updates\|main\|source', not removing (give --delete to do so)!'	23:02
clarkb	ianw: fungi: Do I want to pass --delete to clearvanished?	23:03
clarkb	or maybe do the deleteunreferenced first then clearvanished again?	23:03
clarkb	I'll try deleteunreferenced first	23:04
ianw	yes, you want delete	23:06
clarkb	heh 'Error: packages database contains unused 'bionic-backports\|main\|source' database.' and clearvanished said 'There are still packages in 'bionic-backports\|main\|source', not removing (give --delete to do so)!' so ya I need to --delete	23:06
ianw	then after clearvanished, we want the deleteunreferenced	23:06
clarkb	ianw: is it reprepro --delete clearvanished or reprepro clearvanished --delete? maybe it doesn't matter. I find this tool quite obtuse. at least it yells at you and gives you hints :)	23:06
ianw	i feel like i put it last (after clearvanished) but i also think it doesn't matter	23:06
clarkb	it does matter. It has to go in the front	23:07
ianw	yes, it is certainly "interesting" to interact with	23:07
clarkb	but it yells again	23:07
clarkb	so I guess the good news is eventually it yells enough that we figure it out	23:07
ianw	that could be right too :)	23:07
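As clarkb found above, `--delete` has to come before the action name. A small sketch of assembling the argument list so options always precede the action; `reprepro_args` is an illustrative helper, not part of any existing script.

```python
def reprepro_args(action, global_options=(), confdir=None):
    """Assemble a reprepro argv with every option ahead of the action."""
    argv = ["reprepro"]
    if confdir is not None:
        argv += ["--confdir", confdir]
    argv += list(global_options)
    argv.append(action)
    return argv

print(" ".join(reprepro_args("clearvanished", ["--delete"])))
```

Building the list this way makes the ordering mistake impossible to repeat by hand.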
ianw	oh did i just kill your screen?	23:08
clarkb	ianw: no you belled me but I'm in another window	23:08
clarkb	window 2 for ubuntu work	23:08
ianw	sorry i should have exited the session now	23:09
clarkb	ok clearvanished is done and now deleteunreferenced is running. it has like 280k files to prune	23:10
ianw	excellent	23:11
ianw	... ok, so back to last night's issue ... i have no idea why centos7 is failing on the builders but not in the dib gate	23:12
clarkb	ianw: does one use our mirror and the other not?	23:14
clarkb	and our mirror is stale because $reason?	23:14
ianw	"Cannot find a valid baseurl for repo: base/$releasever/x86_64"	23:15
ianw	it does have shades of being a mirror issue, but I think that in the chroot here we're not using our mirror in both	23:15
clarkb	ianw: http://mirror.ord.rax.opendev.org/centos/7/os/x86_64/ base/$releasever/x86_64 doesn't seem to align with that	23:16
clarkb	its centos/$releasever/$content/x86_64	23:16
ianw	yeah i'm wondering if we've got a different .repo file in there, or we've sed-ed something the wrong way	23:18
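The layout clarkb describes can be expanded to show what a working URL on the mirror looks like. A sketch using plain string templating: `$releasever` and `$basearch` are real yum variables, `$content` is clarkb's placeholder for the repo directory (for example `os`), and `expand_baseurl` is an illustrative helper.

```python
from string import Template

# Layout from the log: centos/$releasever/$content/x86_64
MIRROR_LAYOUT = Template(
    "http://mirror.ord.rax.opendev.org/centos/$releasever/$content/$basearch/"
)

def expand_baseurl(releasever, content="os", basearch="x86_64"):
    """Fill in the placeholders of the mirror layout."""
    return MIRROR_LAYOUT.substitute(
        releasever=releasever, content=content, basearch=basearch
    )

print(expand_baseurl("7"))
```

The error text still shows a literal `$releasever`, so comparing the .repo file in the failing chroot against an expanded URL like this one is a reasonable next step.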
clarkb	ok delete unreferenced is done. Running the regular sync now	23:18
clarkb	there were errors related to deb packages for libreoffice things so we didn't vos release	23:34
clarkb	I'm going to rerun the script now to see if those are not persistent	23:35
clarkb	ok it is still failing	23:49
clarkb	'Unable to forget unknown filekey'	23:50
clarkb	should I try running clearvanished and deleteunreferenced again?	23:50
clarkb	I guess it can't hurt any more than we've already done and it may help /me tries it	23:50
clarkb	oh I wonder if this is due to our state tracking to remove packages?	23:51
clarkb	ya 2022-04-28 23:48:50 \| Cleaning up files made unreferenced on the last run	23:52
*** dviroel\|rover\|brb is now known as dviroel\|rover		23:52
clarkb	so I think the error is that we're telling it to forget files that the deleteunreferenced already cleared out	23:52
clarkb	can I just move /var/run/reprepro/mirror.ubuntu.ubuntu.unreferenced-files aside?	23:53
clarkb	ya I think so. We generate a new version of that file on the next pass	23:54
clarkb	I'm going to move that file aside into my homedir on the server	23:54
clarkb	and now rerunning the main script	23:55
ianw	yeah, that sounds familiar	23:55
ianw	there might be something about it in the afs recovery docs, but it's been a long time (thankfully!)	23:56
clarkb	ya reading the logs and the script we basically have the old pre source removal list of things to remove which doesn't work because I removed them and more	23:59
clarkb	but moving that file aside should get things moving	23:59
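Moving the stale list aside rather than deleting it keeps a copy for later inspection while letting the next run regenerate the file. A sketch of that step, run against a scratch directory; the file name comes from the log, and the destination directory, the file contents and the `move_aside` helper are made up for illustration.

```python
import os
import shutil
import tempfile

def move_aside(state_file, dest_dir):
    """Move a stale state file into dest_dir and return its new path."""
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, os.path.basename(state_file))
    return shutil.move(state_file, target)

# Demo in a scratch directory; on the server the file named in the log is
# /var/run/reprepro/mirror.ubuntu.ubuntu.unreferenced-files.
scratch = tempfile.mkdtemp()
state = os.path.join(scratch, "mirror.ubuntu.ubuntu.unreferenced-files")
with open(state, "w") as handle:
    handle.write("pool/main/e/example/example_1.0.dsc\n")  # made-up entry

moved = move_aside(state, os.path.join(scratch, "home"))
print(os.path.exists(state), os.path.exists(moved))
```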
ianw	"2022-04-28 23:56:11.052 \| base/$releasever/x86_64 CentOS-$releasever - Base 0"	23:59
fungi	sorry, stepped away to make dinner, but back around now. sounds like you've got it worked out?	23:59
