clarkb | Just about meeting time | 18:59 |
---|---|---|
ianw | o/ | 19:00 |
clarkb | We do have a fairly large agenda so I'll try to keep things moving | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Oct 25 19:01:16 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/pipermail/service-discuss/2022-October/000369.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | No announcements so we can dive right in | 19:01 |
clarkb | #topic Bastion Host Changes | 19:02 |
clarkb | ianw: you've made a bunch of progress on this both with the zuul console log files and the virtualenv and upgrade work | 19:02 |
ianw | yep in short there is one change that is basically s/bridge.openstack.org/bridge01.opendev.org/ -> https://review.opendev.org/c/opendev/system-config/+/861112 | 19:03 |
ianw | the new host is ready | 19:03 |
clarkb | ianw: at this point do we expect that we won't have any console logs written to the host? we updated the base jobs repo and system-config? Have we deleted the old files? | 19:03 |
ianw | oh, in terms of the console logs in /tmp -- yep they should be gone and i removed all the old files | 19:03 |
clarkb | I guess that is less important for bridge as we're replacing the host. But for static that is important | 19:04 |
clarkb | also great | 19:04 |
ianw | on bridge and static | 19:04 |
clarkb | For the bridge replacement I saw there were a couple of struggles with the overlap between testing and prod. Are any of those worth digging into? | 19:04 |
ianw | not at this point -- it was all about trying to minimise the number of places we hardcode literal "bridge.openstack.org" | 19:05 |
ianw | i think I have it down to about the bare minimum; so 861112 is basically it | 19:05 |
clarkb | For the new server the host vars and group vars and secrets files are moved over? | 19:06 |
clarkb | (since that requires a manual step) | 19:06 |
ianw | no, so i plan on doing that today if no objections | 19:06 |
ianw | there's a few manual steps -- copying the old secrets, and setting up zuul login | 19:06 |
ianw | and i am 100% sure there is something forgotten that will be revealed when we actually try it | 19:06 |
clarkb | ya I think the rough order of operations should be copying that content over, ask other roots to double check things and then land https://review.opendev.org/c/opendev/system-config/+/861112 ? | 19:07 |
ianw | but i plan to keep notes and add a small checklist for migrating bridge to system-config docs | 19:07 |
clarkb | ++ | 19:07 |
clarkb | if we want we can do a pruning pass of that data first too (since we may have old hosts var files or similar) | 19:07 |
ianw | yep, that is about it | 19:07 |
clarkb | but that seems less critical and can be done on the new host afterwards too | 19:08 |
clarkb | ok sounds good to me | 19:08 |
ianw | yeah i think at this point i'd like to get the migration done and prod jobs working on it -- then we can move over any old ~ data and prune, etc. | 19:08 |
clarkb | anything else on this topic? | 19:09 |
ianw | nope, hopefully next week it won't be a topic! :) | 19:09 |
corvus | (and try to reconstruct our venvs! ;) | 19:09 |
fungi | that'll be the hardest part! | 19:09 |
clarkb | it may be worth keeping the old bridge around for a bit too just in case | 19:09 |
clarkb | thank you for pushing this along. Great progress | 19:09 |
ianw | corvus: i've got a change out for us to have launch node in a venv setup by system-config, so we don't need separate ones | 19:09 |
corvus | ++ | 19:09 |
ianw | #link https://review.opendev.org/c/opendev/system-config/+/861284 | 19:09 |
clarkb | #topic Upgrading Bionic Servers | 19:09 |
clarkb | We have our first jammy server in production. gitea-lb02 which fronts opendev.org | 19:10 |
clarkb | This server was booted in vexxhost which does/did not have a jammy image already. I took ubuntu's published image and converted it to raw and uploaded that to vexxhost | 19:10 |
clarkb | I did the raw conversion for maximum compatibility with vexxhost ceph | 19:11 |
clarkb | That seems to be working fine. But did require a modern paramiko in a venv to do ssh as jammy ssh seems to not want to do rsa + sha1 | 19:11 |
clarkb | I thought about updating launch node to use an ed25519 key instead but paramiko doesn't have key generation routines for that key type like it does rsa | 19:11 |
clarkb | Anyway it mostly works except for the paramiko thing. I don't think there is much to add to this other than that ianw's bridge work should hopefully mitigate some of this | 19:12 |
clarkb | Otherwise I think we can go ahead and launch jammy nodes | 19:12 |
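
A rough sketch of the convert-and-upload step described above, assuming Ubuntu's published jammy cloud image; the file and image names here are illustrative rather than the exact ones used:

```sh
# Convert Ubuntu's published qcow2 cloud image to raw (for compatibility with
# the ceph-backed storage) and upload it to the cloud; names are illustrative.
wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
qemu-img convert -f qcow2 -O raw \
  jammy-server-cloudimg-amd64.img jammy-server-cloudimg-amd64.raw
openstack image create --disk-format raw --container-format bare \
  --file jammy-server-cloudimg-amd64.raw ubuntu-jammy
```
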
clarkb | #topic Removing snapd | 19:12 |
ianw | ++ the new bridge is jammy too | 19:12 |
clarkb | When doing the new jammy node I noticed that we don't remove snapd which is something I thought we were doing. Fungi did some excellent git history investigating and discovered we did remove snapd at one time but stopped so that we could install the kubectl snap | 19:13 |
clarkb | We aren't currently using kubectl for anything in production and even if we were I think we could find a different install method. This makes me wonder if we should go back to removing snapd? | 19:13 |
clarkb | I don't think we need to make a hard decision here in the meeting but wanted to call it out as something to think about and if you have thoughts I'm happy for them to be shared | 19:14 |
fungi | also we only needed to stop removing it from the server(s) where we installed kubectl | 19:14 |
fungi | and also there now seem to be more sane ways of installing an updated kubectl anyway | 19:14 |
ianw | i hit something tangentially related with the screenshots -- i wanted to use firefox on the jammy hosts but the geckodriver bits don't work because firefox is embedded in a snap | 19:15 |
ianw | which -- i guess i get why you want your browser sandboxed. but it's also quite a departure from the traditional idea of a packaged system | 19:16 |
clarkb | probably something that deserves a bit more investigation to understand its broader impact then | 19:16 |
clarkb | I'll try to make time for that. One thing that might be good is listing snaps for which there aren't packages that we might end up using like kubectl or firefox | 19:16 |
clarkb | And then take it from there | 19:17 |
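
If we do go back to removing snapd, a minimal hand-run sketch of what that could look like; the pinning file is an assumption about how we'd keep it from coming back, not what the current base-server role does:

```sh
# Purge snapd (and any installed snaps with it), then pin it so a package
# dependency can't quietly pull it back in.
sudo apt-get purge -y snapd
printf 'Package: snapd\nPin: release *\nPin-Priority: -1\n' | \
  sudo tee /etc/apt/preferences.d/no-snapd
```
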
clarkb | #topic Mailman 3 | 19:17 |
clarkb | Moving along so we don't run out of time | 19:17 |
clarkb | fungi: I think our testing is largely complete at this point. Are we ready to boot a new jammy server and if so have we decided where it should live? | 19:18 |
fungi | if folks are generally satisfied with our forked image strategy, yeah i guess next steps are deciding where to boot it and then booting it and getting it excluded from blocklists if needed | 19:18 |
clarkb | at this point I still haven't heard from the upstream image maintainer. I do think we should probably accept that we'll need to maintain our own images at least for now | 19:19 |
fungi | once we have ip addresses for the server, we can include those in communications around migration planning for lists.opendev.org and lists.zuul-ci.org as our first sites to move | 19:19 |
clarkb | re hosting location it occurred to me that we can't easily get reverse dns records outside of rax which makes me think rax is the best location for a mail server | 19:20 |
clarkb | But I think we could also host it in vexxhost if mnaser doesn't have concerns with email flowing through his IPs and he is willing to edit dns records for us | 19:21 |
fungi | perhaps, but rackspace also preemptively places their netblocks on the sbl | 19:21 |
fungi | which makes them less great for it | 19:21 |
fungi | er, on the pbl i mean | 19:21 |
clarkb | ya so maybe step 0 is send a feeler to mnaser about it | 19:21 |
fungi | (spamhaus policy blocklist) | 19:21 |
clarkb | to figure out how problematic the dns records and email traffic would be | 19:21 |
corvus | i think that's normal/expected behavior | 19:22 |
corvus | and removal from pbl is easy? | 19:22 |
fungi | exclusion from pbl used to be easier | 19:22 |
corvus | i think vexxhost can do reverse dns by request | 19:22 |
fungi | now they require you to periodically renew your pbl exclusion and there's no way to find out when it will run out that i can find | 19:22 |
corvus | is it not easy? i thought it was click-and-done | 19:22 |
corvus | ah :( | 19:23 |
ianw | for review02 we did have to ask mnaser, but it was also easy :) | 19:23 |
clarkb | from our end being able to set reverse dns records was what came to mind. Sounds like pbl is also worth considering | 19:23 |
ianw | so there's already a lot of mail coming out of that | 19:23 |
fungi | corvus: at least i recall spotting that change recently, looking now for a clear quote i can link | 19:24 |
corvus | (either place seems good to me; seems like nothing's perfect) | 19:25 |
clarkb | I guess the two todos are for people to weigh in on whether or not we're comfortable with forked images and specify if they have a strong preference for hosting location | 19:26 |
clarkb | I agree sounds like we'll just deal with different things in either location | 19:26 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/860157 Change to fork upstream mailman3 docker images | 19:26 |
fungi | i concur | 19:26 |
clarkb | maybe drop your thoughts there? | 19:26 |
fungi | also i suppose merging those changes will be a prerequisite to booting the new server | 19:27 |
fungi | there's a series of several | 19:27 |
clarkb | fungi: we can boot the new server first, it just won't do much until changes land | 19:27 |
clarkb | but I don't think the boot order is super important here | 19:27 |
fungi | good point | 19:27 |
clarkb | ok lets move on. Please leave thoughts on the change otherwise I expect we'll proceed | 19:28 |
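
Once the server has addresses, confirming the reverse DNS setup is a quick check like the following; the IP and hostname here are placeholders:

```sh
# Check that the PTR record for the new server's address resolves to the
# expected name, and that the forward record agrees; values are placeholders.
dig +short -x 203.0.113.10
dig +short lists.opendev.org A
```
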
clarkb | #topic Switching our base job nodeset to Jammy | 19:28 |
clarkb | #link https://review.opendev.org/c/opendev/base-jobs/+/862624 | 19:29 |
clarkb | today is the day we said we would make this swap | 19:29 |
fungi | yeah, we can merge it after the meeting wraps up | 19:29 |
clarkb | ++ Mostly a heads up that this is changing and to be on the lookout for fallout | 19:29 |
clarkb | I did find a place in zuul-jobs that would likely break which was python3.8 jobs running without a nodeset specifier | 19:30 |
fungi | if anyone else wants to review that three-line change before we approve it, you have roughly half an hour | 19:30 |
clarkb | I expect that sort of thing to be the bulk of what we run into | 19:30 |
clarkb | #topic Updating our base python images to use pip wheel | 19:31 |
clarkb | About a week ago Nodepool could no longer build its container images. The issue was that we weren't using wheels built by the builder in the prod image | 19:31 |
clarkb | after a bunch of debugging it basically came down to pip 22.3 changed the location it caches wheels compared to 22.2.2 and prior | 19:32 |
clarkb | I think this is actually a pip bug (because it reduces the file integrity assertions that existed previously) | 19:32 |
fungi | or rather the layout of the cache directory | 19:32 |
fungi | changed | 19:32 |
clarkb | ya | 19:32 |
clarkb | #link https://github.com/pypa/pip/issues/11527 | 19:32 |
clarkb | #link https://github.com/pypa/pip/pull/11538 | 19:33 |
clarkb | I filed an issue upstream and wrote a patch. The patch is currently not passing CI due to a different git change (that zuul also ran into) that impacts their test suite. They've asked if I want to write a patch for that too but I haven't found time yet | 19:33 |
clarkb | Anyway part of the fallout from this is that pip says we shouldn't use the cache that way as it's more of an implementation detail for pip, which is a reasonable position | 19:33 |
clarkb | Their suggestion is to use `pip wheel` instead and explicitly fetch/build wheels and use them that way | 19:34 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/862152 | 19:34 |
clarkb | that change updates our base images to do this. I've tested it with a change to nodepool and diskimage builder which helps to exercise that the modifications actually work without breaking sibling installs and extras installs | 19:35 |
clarkb | This shouldn't actually change our images much, but should make our build process more reliable in the future | 19:35 |
clarkb | reviews and concerns appreciated. | 19:35 |
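
In rough terms, the pattern the change moves the base images to looks like this; a simplified sketch of the builder/final split with an illustrative wheel directory, not the literal Dockerfile contents:

```sh
# In the builder image (which has compilers and -dev packages): build or
# fetch wheels for everything explicitly instead of relying on pip's
# internal cache directory.
python3 -m pip wheel --wheel-dir /output/wheels -r requirements.txt

# In the final image, after /output/wheels has been copied across from the
# builder: install only from those wheels, so no build toolchain is needed.
python3 -m pip install --no-index --find-links /output/wheels -r requirements.txt
```
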
fungi | tonyb is looking into doing something similar with rewrites of the wheel cache builder jobs, i think | 19:35 |
fungi | and constraints generation jobs more generally | 19:36 |
clarkb | The other piece of feedback that came out of this is that other people do similar but instead of creating a wheel cache and copying that and performing another install on the prod image they do a pip install --user on the builder side then just copy over $USER/.local to the prod image | 19:36 |
clarkb | this has the upside of not needing wheel files in the final image which reduces the final image size | 19:36 |
fungi | as long as the path remains the same, right? | 19:37 |
clarkb | I think we should consider doing that as well, but all of our consuming images would need to be updated to find the executables in the local dir or a virtualenv | 19:37 |
clarkb | fungi: yes it only works if the two sides stay in sync for python versions (something we already attempt to do) and paths | 19:37 |
ianw | that would be ... interesting | 19:37 |
ianw | i think most would not | 19:37 |
fungi | well, but also venvs aren't supposed to be relocatable | 19:37 |
clarkb | its the and paths bit that makes it difficult for us to transition as we'd need to update the consuming images | 19:37 |
ianw | (find things in a venv) | 19:37 |
clarkb | fungi: yes, except in this case they aren't relocating; as far as they are concerned everything stays in the same spot | 19:38 |
fungi | the path of the venv inside the container image would need to be the same as where they're copied from on the host where they're built? | 19:38 |
fungi | or maybe i'm misunderstanding how docker image builds work | 19:38 |
clarkb | fungi: the way it works today is we have a builder image and a base image that becomes the prod image | 19:39 |
corvus | i think the global install has a lot going for it and prefer that to a user/venv install | 19:39 |
clarkb | the builder image makes wheels using compile time deps. We copy the wheels to the base prod image and install there which means we don't need build time deps in the prod image | 19:39 |
fungi | okay, so you're saying create the venv in the builder image but then copy it to the base image | 19:39 |
clarkb | in the venv case you'd make the venv on the builder and copy it to base | 19:39 |
fungi | in that case the paths would be identical, right | 19:39 |
clarkb | corvus: ya it would definitely be a lot of effort to switch considering existing assumptions so we better really like the smaller images | 19:40 |
corvus | why would they be smaller? | 19:40 |
clarkb | anyway I bring it up as it was mentioned and I do think it is a good idea if the tiniest image is the goal. I don't think we should shelve the pip wheel work in favor of that as its a lot more effort | 19:40 |
clarkb | corvus: because we copy the wheels from the builder to the base image which increases the base image by the aggregate size of all the wheels. You don't have this step in the venv case | 19:40 |
fungi | corvus: because pip will cache the wheels while installing | 19:40 |
fungi | or otherwise needs a local copy of them | 19:41 |
corvus | we can just remove the wheel cache after installing them? | 19:41 |
clarkb | corvus: that doesn't reduce the size of the image unfortunately | 19:41 |
corvus | i think there are tools/techniques for that | 19:42 |
clarkb | because they are copied in using a docker COPY directive we get a layer with that copy. Then any removals are just another layer delta saying the files don't exist anymore. But the layer is still there with the contents | 19:42 |
clarkb | anyway we don't need to debug the sizes here. I just wanted to call it out as another alternative to what my changes propose. But one I think would require significantly more effort which is why I didn't change direction | 19:42 |
fungi | the recent changes to image building aren't making larger images than what we did before anyway | 19:43 |
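
For comparison, the venv-copy alternative discussed above would look roughly like this; the /opt/app prefix is illustrative, and as noted the python version and path have to match exactly between the builder and final images:

```sh
# In the builder image: install straight into an isolated prefix.
python3 -m venv /opt/app
/opt/app/bin/pip install .

# In the final image the whole prefix is copied over unchanged (e.g.
# COPY --from=builder /opt/app /opt/app), so no wheel files ever land in the
# final layers -- but every consumer then has to call /opt/app/bin/<tool>
# or put that directory on PATH.
```
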
clarkb | #topic Dropping python3.8 base docker images | 19:44 |
clarkb | related but not really is removing python3.8 base docker images to make room for yesterday's python3.11 release | 19:44 |
clarkb | #link https://review.opendev.org/q/status:open+(topic:use-new-python+OR+topic:docker-cleanups) | 19:44 |
clarkb | at this point we're ready to land the removal. I didn't +A it earlier since docker hub was having trouble but sounds like that may be over | 19:44 |
clarkb | then we should also look at updating python3.9 things to 3.10/3.11 but there is a lot more stuff on 3.9 than 3.8 | 19:45 |
clarkb | Thank you for all the reviews and moving this along | 19:46 |
clarkb | #topic iweb cloud going away by the end of the year | 19:46 |
clarkb | leaseweb acquired iweb which was spun out of inap | 19:46 |
clarkb | leaseweb is a cloud provider but not primarily an openstack cloud provider. | 19:46 |
clarkb | They have told us that the openstack environment currently backing our iweb provider in nodepool will need to go away by the end of the year. But they said we could keep using it until then and to let them know when we stop using it | 19:47 |
clarkb | that pool gives us 200 nodes which is a fair bit. | 19:48 |
fungi | around 20-25% of our theoretical total quotas, i guess | 19:48 |
clarkb | The good news is that they were previously open to the idea of providing us test resources via cloudstack | 19:48 |
clarkb | this would require a new nodepool driver. I've got a meeting on friday to talk to them about whether or not this is still something they are interested in | 19:49 |
clarkb | I don't think we need to do anything today. And I should make a calendar reminder for mid december to shut down that provider in our nodepool config | 19:49 |
clarkb | And now you all know what I know :) | 19:50 |
clarkb | #topic Etherpad container log growth | 19:50 |
clarkb | During the PTG last week we discovered the etherpad server's root fs was filling up over time. It turned out to be the container log itself as there hasn't been an etherpad release to upgrade to in a while so the container has run for a while | 19:50 |
clarkb | To address that we docker-compose down'd then up'd the service which made a new container and cleared out the old large log file | 19:51 |
clarkb | My question here is if we would expect ianw's container syslogging stuff to mitigate this. If so we should convert etherpad to it | 19:51 |
clarkb | my understanding is that etherpad writes to stdout/stderr and docker accumulates that into a log file that never rotates | 19:52 |
ianw | it seems like putting that in /var/log/containers and having normal logrotate would help in that situation | 19:53 |
clarkb | ya I thought it would, but didn't feel like paging in how all that works in order to write a change before someone else agreed it would :) | 19:53 |
clarkb | sounds like something we should try and get done | 19:54 |
ianw | it should all be gate testable in that the logfile will be created, and you can confirm the output of docker-logs doesn't have it too | 19:54 |
clarkb | good point | 19:54 |
ianw | i can take a todo to update the etherpad installation | 19:54 |
clarkb | ianw: that would be great (I don't mind doing it either just to page in how it all works, but with the pip things and mailman things and so on time is always an issue) | 19:55 |
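
In the meantime the runaway log is easy to confirm by hand; docker's default json-file driver keeps one unrotated file per container, and something like the following (container name assumed) shows where it lives and how big it has grown:

```sh
# Locate the json-file log docker keeps for the etherpad container and check
# its size; the container name here is an assumption.
sudo docker inspect --format '{{.LogPath}}' etherpad_etherpad_1 | xargs sudo du -h
```
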
clarkb | #topic Open Discussion | 19:55 |
clarkb | Somehow I have more things to bring up that didn't make it to the agenda | 19:55 |
clarkb | corvus discovered we're underutilizing our quota in the inmotion cloud | 19:55 |
clarkb | I believe this to be due to leaked placement allocations in the placement service for that cloud | 19:55 |
clarkb | https://docs.openstack.org/nova/latest/admin/troubleshooting/orphaned-allocations.html | 19:55 |
clarkb | That is nova docs on how to deal with it and this is something melwitt has helped with in the past | 19:55 |
clarkb | I've got that on my todo list to try and take a look but if anyone wants to look at nova debugging I'm happy to let some one else look | 19:56 |
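
For whoever picks this up, the linked nova doc boils down to roughly the following; it assumes admin credentials, the osc-placement client plugin, and access to a nova controller, so treat it as a sketch rather than a runbook:

```sh
# Compare what placement thinks is allocated with what really exists.
openstack resource provider list
openstack resource provider show <provider-uuid> --allocations

# On a nova controller: report orphaned allocations, then delete them once
# we're confident they really are leaked.
nova-manage placement audit --verbose
nova-manage placement audit --verbose --delete
```
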
clarkb | And finally, the foundation has sent email to various project mailing lists asking for feedback on the potential for a PTG colocated with the Vancouver summit. There is a survey you can fill out to give them your thoughts | 19:56 |
clarkb | Anything else? | 19:57 |
ianw | one minor thing is | 19:57 |
ianw | #link https://review.opendev.org/c/zuul/zuul-sphinx/+/862215 | 19:57 |
ianw | see the links inline, but works around what i think is a docutils bug (no response on that bug from upstream, not sure how active they are) | 19:58 |
ianw | #link https://review.opendev.org/q/topic:ansible-lint-6.8.2 | 19:58 |
ianw | is also out there -- but i just noticed that a bunch of the jobs stopped working because it seems part of the testing is to install zuul-client, which must have just dropped 3.6 support maybe? | 19:59 |
clarkb | ianw: yes it did. That came out of feedback for my docker image updates to zuul-client | 19:59 |
ianw | anyway, i'll have to loop back on some of the -1's there on some platforms to figure that out, but in general the changes can be looked at | 19:59 |
clarkb | and we are at time | 20:00 |
fungi | thanks clarkb! | 20:00 |
clarkb | Thank you everyone. Sorry for the long agenda. I guess that is what happens when you skip due to a ptg | 20:00 |
clarkb | #endmeeting | 20:00 |
opendevmeet | Meeting ended Tue Oct 25 20:00:37 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2022/infra.2022-10-25-19.01.html | 20:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-10-25-19.01.txt | 20:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2022/infra.2022-10-25-19.01.log.html | 20:00 |