opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [wip] f37 https://review.opendev.org/c/openstack/diskimage-builder/+/876482 | 00:22 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [wip] f37 https://review.opendev.org/c/openstack/diskimage-builder/+/876482 | 00:50 |
clarkb | ianw: thank you for the reviews on the gitea stack. I'll fix that last one tomorrow morning and try to review the acl stack again while I'm letting those changes make their way through to production | 00:58 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update : drop Fedora 35 https://review.opendev.org/c/opendev/system-config/+/876486 | 03:49 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update: Add Fedora 37 https://review.opendev.org/c/opendev/system-config/+/876487 | 03:49 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update : drop Fedora 35 https://review.opendev.org/c/opendev/system-config/+/876486 | 04:27 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update: Add Fedora 37 https://review.opendev.org/c/opendev/system-config/+/876487 | 04:27 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update: stop mirroring old atomic version https://review.opendev.org/c/opendev/system-config/+/876488 | 04:27 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update: drop Fedora 35 https://review.opendev.org/c/opendev/system-config/+/876486 | 04:31 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update: Add Fedora 37 https://review.opendev.org/c/opendev/system-config/+/876487 | 04:31 |
*** jpena|off is now known as jpena | 08:06 | |
opendevreview | Maksim Malchuk proposed openstack/diskimage-builder master: Add swap support https://review.opendev.org/c/openstack/diskimage-builder/+/869270 | 10:55 |
opendevreview | Maksim Malchuk proposed openstack/diskimage-builder master: Add swap support https://review.opendev.org/c/openstack/diskimage-builder/+/869270 | 11:19 |
opendevreview | Maksim Malchuk proposed openstack/diskimage-builder master: Add swap support https://review.opendev.org/c/openstack/diskimage-builder/+/869270 | 11:27 |
opendevreview | Maksim Malchuk proposed openstack/diskimage-builder master: Add swap support https://review.opendev.org/c/openstack/diskimage-builder/+/869270 | 11:45 |
opendevreview | Maksim Malchuk proposed openstack/diskimage-builder master: Add swap support https://review.opendev.org/c/openstack/diskimage-builder/+/869270 | 11:47 |
opendevreview | Maksim Malchuk proposed openstack/diskimage-builder master: Add swap support https://review.opendev.org/c/openstack/diskimage-builder/+/869270 | 11:54 |
fungi | looking at the nodepool graphs for rax-ord, there's something pretty wrong in there | 13:49 |
fungi | from what i can piece together, nodepool is getting a bunch of launch failures with "Timeout waiting for instance creation" and then proceeds to ask the cloud to delete the node and immediately deletes the znode, so nodepool is no longer tracking those, but they're hanging around in an active state in the cloud consuming quota for ages | 13:50 |
fungi | i'm watching one which had its deletion requested over half an hour ago and is still in an active state according to openstack server show | 13:51 |
fungi | anyway, the end result is that we're averaging something like 5% effective utilization of the quota we have there | 13:52 |
fungi | and since it's the largest quota of any region we have access to, that's a huge chunk of our aggregate quota we can't use (more than 25%) | 13:54 |
fungi | node 0033377524 is one of the examples i'm looking at | 13:55 |
fungi | 2023-03-06 13:19:22,943 INFO nodepool.StateMachineNodeDeleter.rax-ord: [node: 0033377524] Deleting ZK node id=0033377524, state=deleting, external_id=None | 13:55 |
fungi | corresponding server instance c8f2f797-004a-4f26-8883-798ed0561926 finally disappeared moments ago, roughly 37 minutes after nl01 issued the delete | 13:57 |
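One way to watch such a stuck instance (a minimal sketch; the `--os-cloud` name is an assumption, the UUID is the one quoted above):

    watch -n 60 "openstack --os-cloud rax-ord server show \
      c8f2f797-004a-4f26-8883-798ed0561926 -c status -c OS-EXT-STS:vm_state"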
fungi | also nl01 is logging a bunch of tracebacks checking the quota | 13:58 |
fungi | File "/usr/local/lib/python3.11/site-packages/nodepool/driver/utils.py", line 355, in estimatedNodepoolQuotaUsed | 13:59 |
fungi | if node.type[0] not in provider_pool.labels: | 13:59 |
fungi | IndexError: list index out of range | 13:59 |
fungi | i don't see any other launchers logging that exception, so could be something specific to rackspace's api responses i suppose | 13:59 |
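A quick way to confirm which launchers log that traceback (the log path is an assumption based on a typical nodepool launcher deployment):

    # show surrounding context for each occurrence of the quota exception
    grep -B 5 'IndexError: list index out of range' \
      /var/log/nodepool/launcher-debug.log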
frickler | this also doesn't look good https://paste.opendev.org/show/bZ1r1HWRmIQacGhpVhN8/ | 14:33 |
frickler | not sure whether we have created a busy loop of lots of creations leading to long startup times leading to lots of timeouts | 14:34 |
frickler | we could consider bumping the launch-timeout. or lower the quota for some time to see if it recovers | 14:35 |
frickler | maybe there's also another bug in the new state machine code. seems we don't have good test coverage for that | 14:37 |
fungi | yeah, i suppose if we're deleting the node from zk immediately and relying on the quota checking to keep us honest, but then can't actually check the utilization because of that exception and fall back on max-servers and the assumption that the nodes it knows about are the only ones that exist, then we could be trying to boot over quota too | 14:41 |
fungi | might even be leading us to hammer the api, kicking api rate limits into effect, slowing our calls down even more and creating a vicious cycle | 14:50 |
fungi | need to go run some errands, but should be back within the hour | 14:57 |
clarkb | the rax ord thing appears to have been going on for months. | 16:14 |
clarkb | I suspect it is something to do with the cloud itself given that and the lack of issues for the other two regions | 16:14 |
clarkb | I would probably start by increasing boot timeout? | 16:16 |
clarkb | since it appears to be booting things successfully but deciding they aren't coming up fast enough? | 16:16 |
*** gthiemon1e is now known as gthiemonge | 16:24 | |
clarkb | oof launch timeout is already 10 minutes | 16:25 |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: Correct boot path to cover FIPS usage cases https://review.opendev.org/c/openstack/diskimage-builder/+/876192 | 16:31 |
opendevreview | Clark Boylan proposed opendev/system-config master: Switch borg backup from gitea01 to gitea09 https://review.opendev.org/c/opendev/system-config/+/876471 | 16:35 |
clarkb | infra-root if you get a chance to look at the new gitea servers I think https://review.opendev.org/c/opendev/system-config/+/876448 is ready to go | 16:36 |
clarkb | re rax ord maybe the thing to do is set max-servers to 0 and let it clean up after itself | 16:52 |
clarkb | then increase the number slowly and see if the timeouts persist | 16:53 |
clarkb | the number == max-servers | 16:53 |
fungi | yeah, i'm looking through the config, we already set boot-timeout: 120 and launch-timeout: 600 across all rackspace regions | 16:54 |
fungi | which is pretty lengthy | 16:54 |
clarkb | fungi: do we know which timeout we are hitting? | 16:54 |
fungi | though this is boot-timeout it's running into i think | 16:55 |
fungi | "Timeout waiting for instance creation" | 16:55 |
clarkb | boot-timeout is the timeout waiting for openstack to report a node ready iirc. and then launch timeout is time to be able to ssh in | 16:55 |
fungi | yeah, so maybe i'll up that to 300 and see if it helps | 16:55 |
fungi | just in ord | 16:55 |
clarkb | wfm | 16:55 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Increase boot-timeout for rax-ord https://review.opendev.org/c/openstack/project-config/+/876592 | 16:58 |
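A hedged sketch of the provider stanza that change presumably touches (the exact file layout in project-config is assumed):

    providers:
      - name: rax-ord
        boot-timeout: 300    # raised from 120
        launch-timeout: 600  # unchanged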
fungi | in theory the launcher should clean up after itself anyway within an hour, if everything else is working. toggling max-servers to 0 and back is probably not going to change that | 16:59 |
clarkb | well it was mostly an idea to start small and ramp up with clean data to see if we're our own worst enemy there | 17:00 |
fungi | granted my samples were random, but it seemed like the launcher was cleaning up behind itself anyway | 17:01 |
clarkb | fungi: if you have time to review https://review.opendev.org/c/opendev/system-config/+/876448/ I'd love to get that in today. | 17:07 |
clarkb | I'm thinking I will also drop 01-04 from haproxy manually to see what load looks like on the four new servers | 17:08 |
fungi | sure, sounds great | 17:12 |
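Manually dropping backends is done against the haproxy admin socket; a sketch, with the socket path and the backend/server names as assumptions:

    # repeat for gitea01 through gitea04
    echo 'disable server balance_git_https/gitea01.opendev.org' | \
      socat stdio /var/haproxy/run/stats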
*** jpena is now known as jpena|off | 17:13 | |
opendevreview | daniel.pawlik proposed zuul/zuul-jobs master: Provide deploy-microshift role https://review.opendev.org/c/zuul/zuul-jobs/+/876081 | 17:22 |
opendevreview | Merged openstack/project-config master: Increase boot-timeout for rax-ord https://review.opendev.org/c/openstack/project-config/+/876592 | 17:38 |
fungi | that deployed about 10 minutes ago, so hopefully we'll see the graph there smooth out by 19:00z | 17:55 |
clarkb | ianw: I left some comments on the acl stack. Let me know what you think about the submit-requirement implied mapping problem | 17:58 |
opendevreview | Merged opendev/system-config master: Replace gitea05-07 with gitea10-12 in haproxy https://review.opendev.org/c/opendev/system-config/+/876448 | 18:30 |
fungi | clarkb: mirror.iad3.inmotion.opendev.org seems to be offline again, powered off since 2023-03-03T18:15:28Z (3 days ago), but non-impacting since we zeroed max-servers for that provider. what's the best way to go about figuring out what's broken in there? | 18:40 |
fungi | this is three times in two weeks, so something is definitely repeatedly killing the instance | 18:40 |
fungi | i guess i can ssh into the nova controller and look at the service logs for clues? | 18:41 |
clarkb | fungi: the first thing I would check is if the nova api (server show) lists any errors | 18:41 |
clarkb | you can run that as our normal user and as admin. I think admin may get more info | 18:41 |
fungi | server show doesn't report an error condition, no | 18:42 |
clarkb | ok. In that case I'd probably find the hypervisor and see what nova compute logs and virsh/libvirt/qemu have to say about it | 18:42 |
clarkb | there should be instance logs in /var/run/libvirt/something/or/other iirc | 18:42 |
fungi | just that the power_state is Shutdown, vm_state is stopped, status is SHUTOFF | 18:42 |
fungi | ah, as admin. i'll see if we have credentials for that in clouds.yaml already | 18:43 |
clarkb | fungi: well if it doesn't show an error then admin probably won't | 18:44 |
clarkb | fungi: we don't have them in bridge clouds.yaml but they are in a clouds.yaml or an openrc on the hosts themselves | 18:44 |
fungi | no, no error whatsoever, just looks as though someone logged into the server and issued a poweroff | 18:44 |
clarkb | ya so unlikely to be any different listing things as admin | 18:44 |
fungi | i vaguely remember something similar happening to our mirror in the older linaro deployment | 18:45 |
fungi | or might have been the builder | 18:45 |
clarkb | ya I would look in the libvirt/qemu/nova compute logs | 18:45 |
clarkb | those are actually two separate things but I would see if they have any hints | 18:46 |
jrosser | you would get something like that if the OOM killer terminated the VM? | 18:47 |
clarkb | jrosser: yup or if the VM hit some sort of nested virt failure (we have nested virt enabled but that vm doesn't do virt, but maybe its tripping it anyway) | 18:49 |
fungi | sure, i suppose dmesg on the compute hosts would be a good first thing to check | 18:49 |
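A minimal sketch of that check once logged into a hypervisor (enumerating the hosts is worked out below):

    # human-readable timestamps, match OOM-killer activity
    dmesg -T | grep -iE 'out of memory|oom-killer'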
fungi | how do i find the names/addresses of all the compute hosts? | 18:49 |
clarkb | fungi: I think you login to the control panel with the secrets file infos and that gives you a listing. Its also the first three IPs after the api endpoint iirc | 18:50 |
clarkb | there are a couple of extra hypervisors now too and I don't recall if they are in order too or not | 18:50 |
fungi | oh, there's a control panel? i probably knew that at one point and then forgot | 18:51 |
clarkb | ya there is the baremetal control panel which is separate from the openstack horizon stuff. The details for both are in secrets iirc | 18:52 |
clarkb | but that control panel should list all the hypervisors and their IPs | 18:52 |
fungi | thanks. i'll see if i have an opportunity to take a look there in a bit | 18:52 |
fungi | i guess it's been a while since we tried that. the credentials we have on file are giving me "Invalid e-mail address and/or password" | 18:55 |
clarkb | hrm | 18:55 |
fungi | unrelated, looks like infra-prod-base failed on deploy for the gitea-lb update, likely due to the inmotion mirror being offline again | 18:56 |
fungi | so maybe not so unrelated i guess | 18:56 |
clarkb | ya I'm not too worried about that | 18:56 |
clarkb | it applied the lb update anyway | 18:56 |
clarkb | fungi: maybe try resetting the password? it should go to the shared email inbox. I just looked and there is/was email sent there | 18:57 |
clarkb | otherwise we may need to reach out to them for help. For logging into hypervisors they are the three ips after the api endpoint though | 18:57 |
clarkb | (also you can ssh to the api endpoint and you get load balanced to a random one) | 18:57 |
clarkb | jimmy might be of some help there too since I think things got split into two companies? | 18:58 |
fungi | yuris was in here for a while too, i thought | 18:59 |
clarkb | I think yuris didn't maintain a persistent irc client | 18:59 |
clarkb | but ya has been in and out but isn't here now | 18:59 |
fungi | okay, so the old login url we had in our notes goes to a completely different system now. using the correct (new) url, i'm able to get into the webui there | 19:01 |
fungi | manage->assets shows ip addresses for servers, though it's unclear what roles they play as they have generic types and randomly generated names | 19:03 |
fungi | i'm guessing the three with more ram are the compute hosts? | 19:03 |
clarkb | fungi: the way the deployment was made was a converged set of three hosts doing everything. Then to get the max-servers count out I think a couple of compute only hosts were added | 19:04 |
clarkb | fungi: but ya the automated deployment system doesn't name things with helpful hints | 19:04 |
fungi | the assets list has 3x servers with 16 cores and 128gb ram, also 3x servers with 40 cores and 510gb ram | 19:05 |
fungi | so yeah, i suppose the smaller servers were the ones added to get the additional /28 netblock assignments | 19:05 |
clarkb | I'm not sure which are which. You'll probably need to use the nova api to help you sort out which host that vm was on | 19:05 |
clarkb | server show will give you a host id | 19:06 |
clarkb | then there is some nova services listing that will give you the host ids mapped to useful things | 19:06 |
fungi | does horizon have admin bits, or is it strictly for non-admin features? | 19:06 |
clarkb | fungi: `openstack compute service list` | 19:06 |
clarkb | I don't know. I avoid using horizon as much as possible >_> | 19:07 |
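Putting those two steps together as admin (the host attribute is admin-only; the instance name is the one cited above and may differ on the nova side):

    # which compute host is the mirror on?
    openstack server show mirror.iad3.inmotion.opendev.org \
      -c hostId -c OS-EXT-SRV-ATTR:host
    # map host names to compute services
    openstack compute service list --service nova-compute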
fungi | i was assuming i'd need to use it to get the correct clouds.yaml values, but i guess those don't change for admin context anyway so i can reuse most of what's in our existing clouds.yaml | 19:08 |
clarkb | fungi: the clouds.yaml is on those hosts | 19:08 |
clarkb | if you ssh into one of them you'll have access to what is necessary to use openstack client | 19:08 |
fungi | catch 22, i don't have ssh keys for them | 19:09 |
clarkb | you should. Pretty sure we added your keys when this was deployed | 19:09 |
fungi | oh? i didn't even think to try | 19:09 |
fungi | hah | 19:09 |
fungi | mmm, not as fungi, but root worked! | 19:10 |
clarkb | yes we don't have specific users on these hosts | 19:10 |
clarkb | (because all of the deployment is automated by that control panel) | 19:10 |
fungi | [Fri Mar 3 18:03:51 2023] Out of memory: Killed process 4017237 (qemu-kvm) total-vm:10455440kB, anon-rss:8495744kB, file-rss:0kB, shmem-rss:4kB, UID:42436 pgtables:17752kB oom_score_adj:0 | 19:12 |
clarkb | cool. I don't think we are oversubscribing memory in openstack. But we are hyperconverged or whatever | 19:12 |
clarkb | that means that maybe the openstack services are using more memory and that is impacting our VM? | 19:12 |
fungi | jrosser wins a cookie | 19:12 |
jrosser | \o/ | 19:13 |
clarkb | We might be able to deal with that by tuning max-servers down a bit and also telling openstack to use less resources? Oh also maybe we need to look for leaked resources | 19:13 |
clarkb | leaked VMs just hanging out might be consuming too much memory or something | 19:13 |
jrosser | i saw something that felt similar here when we tried to fit exactly 2 giant GPU VMs per host | 19:13 |
jrosser | and there was not quite enough memory spare with 2 instances plus $everything-else | 19:14 |
fungi | clarkb: so of the 6 servers in the assets list, one of the smaller type has a slew of oom errors in dmesg and the other 5 are clean | 19:14 |
jrosser | so the second instance to boot killed the first one | 19:14 |
clarkb | fungi: ya I think the smaller ones are the control plane | 19:14 |
clarkb | fungi: another option may be to move the mirror to one of the larger nodes | 19:14 |
clarkb | since they are just compute nodes iirc | 19:14 |
jrosser | if there are things running on some nodes that nova doesnt know about you can use `reserved_host_memory_mb` | 19:16 |
clarkb | jrosser: I think this deployment is already doing that, but ya memory needs may have expanded beyond that existing value | 19:16 |
fungi | checking memory utilization, right now mysqld, glance, nova, cinder, ceph et al are taking up around 2/3 of the available memory on this machine. i don't see any qemu processes (unsurprising since we're not booting nodes there and the mirror is presently down), but i think that rules out lost virtual machines taking up excess ram | 19:17 |
clarkb | ++ | 19:17 |
clarkb | we could also try rebooting/restarting services so that they give back to the operating system | 19:18 |
clarkb | but then we run the risk of rabbit getting angry | 19:18 |
clarkb | but the cloud isn't in use so the risk to us if that happens is basically nil | 19:18 |
fungi | i wonder if we can exclude this server from use for job nodes? | 19:21 |
fungi | it has sufficient memory to run the mirror vm, but if we boot more than a few job nodes it won't | 19:21 |
clarkb | fungi: we could run the mirror there and then set the value jrosser pointed to | 19:21 |
clarkb | fungi: that seems like a good thing to try | 19:22 |
clarkb | basically set it so that nova won't try to run anything else there because the mirror is already consuming those resources | 19:22 |
jrosser | i wonder how much value there is actually in trying to run VM on those smaller nodes at all | 19:24 |
clarkb | #status log Manually disabled gitea01-04 in haproxy to force traffic to go to the new larger gitea servers. This can be undone if the larger servers are not large enough to handle the load. | 19:26 |
opendevstatus | clarkb: finished logging | 19:26 |
clarkb | infra-root ^ fyi | 19:26 |
clarkb | jrosser: well we're severely resource constrained. So the value is in getting as much as we can out of the system | 19:27 |
fungi | the smaller servers each represent 1/15 (approximately 7%) of our overall memory capacity | 19:29 |
fungi | i'm guessing this one is where most of the central openstack services are parked though, hence much of the overhead being not virtual machines | 19:30 |
fungi | though oddly no, looking through each of the servers, free reports approximately 24gb available on each of the smaller servers, and 432gb available on each of the larger servers | 19:32 |
fungi | so there's around 80-100gb occupied by running services on each of those 6 servers | 19:33 |
clarkb | ya so best thing may be to simply edit the reservation that nova avoids | 19:34 |
clarkb | and possibly even prevent nova from launching test vms on those nodes alltogether | 19:34 |
fungi | i guess if i look at strictly used (no shmem or buffers/cache) it's more like 50-90gb overhead on each server | 19:35 |
fungi | well, strangely it's only the server with the mirror booted on it which was getting into an oom situation | 19:35 |
clarkb | fungi: probably because it is the only one with a long lived VM on it | 19:35 |
clarkb | so its going to end up with more memory pressure over time on average? | 19:36 |
fungi | probably. also i wonder if adding a gb or two of swap to these would be a terrible idea, just so the kernel can page out infrequently accessed data to free up more for cache | 19:36 |
fungi | right now none of them has any swap memory at all | 19:37 |
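Adding a small swapfile would be something like this sketch (size per the discussion, not tuned):

    fallocate -l 2G /swapfile
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    # persist across reboots
    echo '/swapfile none swap sw 0 0' >> /etc/fstab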
fungi | clarkb: do you happen to know whether there's a specific reason that was avoided? | 19:38 |
clarkb | fungi: no, its all done by their tooling | 19:38 |
clarkb | the only thing we select is the number of nodes. We don't select partition layouts, IPs, hostnames, etc | 19:38 |
fungi | i suppose they tried adding small swap partitions and determined they weren't much help | 19:39 |
fungi | or created some sort of problem | 19:40 |
clarkb | or they subscribe to swap is bad and never swap | 19:40 |
fungi | in good news, the rax-ord graph looks a bit healthier since 19:00z, but won't really know until the next round of daily periodic jobs kicks off | 19:43 |
clarkb | the giteas seem to be working after I reduced their number to 4 | 19:48 |
clarkb | I'll leave things like this. If this keeps up I think we should consider reducing total giteas to 4 or maybe 6 | 19:49 |
fungi | agreed, we can observe and see if they end up being more/less loaded | 19:54 |
ianw | clarkb: that's a good point about label-name != submit-requirement, it could be confusing. i agree we should match on the label in the s-r. i'll rework that | 20:17 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: A new diskimage-builder command for yaml image builds https://review.opendev.org/c/openstack/diskimage-builder/+/876245 | 20:26 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Switch run_functests.sh from disk-image-create to diskimage-builder https://review.opendev.org/c/openstack/diskimage-builder/+/876479 | 20:26 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Document diskimage-builder command https://review.opendev.org/c/openstack/diskimage-builder/+/876633 | 20:26 |
clarkb | the vast majority of the gitea demand is on 09 right now. It seems to be keeping up which is cool | 20:33 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/876449 is the next gitea task. This removes 05-07 from gerrit replication. If you're happy with the new servers so far I think we can land this one | 20:34 |
fungi | yeah, seems they're managing the current load | 20:43 |
clarkb | https://gerrit-review.googlesource.com/c/gerrit/+/362054 is the change I promised the gerrit community meeting I would write | 20:49 |
fungi | clarkb: so if we were to set reserved_host_memory_mb in the scheduler config, i guess that would go on whichever server is running nova's controller service? or just on all of them? any idea if the deployment tooling for this has somewhere to register/persist config overrides like that so they survive redeployment? | 20:51 |
fungi | or is there a scheduler configured on each hypervisor host? | 20:52 |
JayF | clarkb: I don't have an account there, but s/safetey/safety/ on LN 5427 | 20:52 |
clarkb | fungi: you would apply it to the server that runs the mirror so that nova doesn't schedule too much workload on that host causing OOMs. I suspect that will go into the nova scheduler/placement databases and no there isn't anything to persist that | 20:52 |
clarkb | JayF: thanks | 20:52 |
JayF | clarkb: and thank you <3 documenting undocumented things | 20:52 |
fungi | okay, so each compute host has a scheduler config? i'm a little lost wading through the nova docs | 20:53 |
clarkb | fungi: oh does it go into the config file? | 20:53 |
clarkb | I expected that would have been a runtime thing :/ | 20:53 |
clarkb | fungi: the way the cloud is deployed there is a three node hyperconverged set of nodes. This means they run everything including the control plane, ceph, nova compute and VMs | 20:54 |
clarkb | then there are the additional nodes that only run the compute services | 20:54 |
fungi | well, web searching turned up references in newton documentation, i'll try to refine my searching | 20:54 |
clarkb | ya looks like it is a compute (not scheduler) config option | 20:55 |
clarkb | to modify that I think what you are supposed to do is edit the kolla config and do a kolla deployment | 20:55 |
fungi | so it probably moved to be host-specific after newton: https://docs.openstack.org/newton/config-reference/compute/schedulers.html | 20:56 |
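If we did go the kolla route, the override would presumably look like this sketch (the /etc/kolla/config merge path and the value are assumptions):

    # /etc/kolla/config/nova.conf on the deployment host
    [DEFAULT]
    reserved_host_memory_mb = 8192  # example value, not tuned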
opendevreview | Merged opendev/system-config master: Remove gitea05-07 from Gerrit replication https://review.opendev.org/c/opendev/system-config/+/876449 | 20:56 |
clarkb | I don't want to have to page all that in. I think instead what we might be able to do is turn on the mirror then disable the compute service on the node the mirror runs on | 20:56 |
clarkb | https://docs.openstack.org/nova/latest/admin/scheduling.html#compute-disabled-status-support | 20:57 |
fungi | i guess otherwise we need different host aggregates for the two accounts? | 20:57 |
clarkb | no the memory reservation should be global. But deploying it requires editing the kolla deployment there and while doable that could lead to a whole bunch of work | 20:58 |
fungi | we could set one of the small servers in one aggregate and everything else in the other, then make it so the control plane account only creates servers in the small dedicated aggregate and the nodepool account uses the aggregate that contains the other 5 servers | 20:58 |
fungi | is what i meant | 20:58 |
clarkb | oh ya that would be another option | 20:59 |
fungi | that way we'd never try to boot job nodes on the (smaller) server where the mirror is booted | 20:59 |
clarkb | fungi: https://wiki.openstack.org/wiki/OpsGuide-Maintenance-Compute#Planned_Maintenance | 20:59 |
fungi | unfortunately i only know about enough to throw that word salad together, not actually how it's done | 20:59 |
clarkb | I feel like simply disabling the nova compute there is the easiest thing | 21:00 |
clarkb | its a bit hacky but it should work | 21:00 |
clarkb | just the first step there is what we would need | 21:01 |
fungi | agreed. it's also reminding me why we prefer not to run our own openstack clouds | 21:01 |
clarkb | fungi: I think you need to start the mirror node before you do that though | 21:01 |
fungi | if we set the host into maintenance mode, will that block us from (re)booting the mirror there? | 21:01 |
clarkb | basically start the mirror node, disable the compute on that node, then tell nodepool it can use thing again | 21:01 |
fungi | yeah, that's what i was wondering | 21:02 |
clarkb | fungi: maybe? I don't know if it completely breaks the ability to manage a running instance | 21:02 |
clarkb | the docs there show migrate commands being valid so maybe not | 21:02 |
clarkb | melwitt: ^ would likely know | 21:02 |
fungi | but if it stops the instance later for some reason, starting it again may involve toggling that | 21:02 |
clarkb | ya worst case we just turn it back on I guess | 21:02 |
clarkb | I don't think we should overthink this since we should likely consider redeploying that cloud for other reasons anyway. Do the simple thing that works then try to incorporate what we've learned when we start over | 21:03 |
clarkb | fungi: another option would be to migrate it to one of the bigger nodes then see if the test nodes are ok on that host | 21:09 |
clarkb | but that likely requires more observation. The plan to just dedicate that node to the mirror seems simplest | 21:09 |
clarkb | oh darn the gerrit replication update failed to deploy due to the base job failures. I think thats fine and we can wait for our daily run to update it and then make sure things are happy tomorrow | 21:14 |
clarkb | assuming we fix the mirror in the meantime. Otherwise maybe we add the mirror to the emergency file so that it is skipped for now | 21:14 |
melwitt | clarkb: not sure I got the exact question but disabling a compute service should only prevent any new instance being scheduled to it. it should not affect instances that are running there already | 21:15 |
clarkb | melwitt: that is what we needed to know. Thanks! Basically we've got a compute + control plane node that has limited memory due to the control plane. We want to run a single long lived VM there and force other things to boot elsewhere. Disabling the compute service there seems like an easy straightforward way to do that | 21:15 |
clarkb | there are other more elegant tools but they are all a bit more involved and require us to understand running and deploying the cloud better | 21:16 |
melwitt | clarkb: ah gotcha. I think that should work | 21:16 |
fungi | awesome. i'll see what i can do to make that happen | 21:18 |
fungi | mmm, i get "Failed to set service status to disabled" but it doesn't give me a reason | 21:43 |
fungi | i tried with the hostname listed in the assets, and also with its public ip address | 21:44 |
fungi | possible i'm not specifying it correctly, since if i put random garbage in place of the hostname i get the same error | 21:45 |
fungi | and i can't `compute service list` as it returns a policy rejection | 21:45 |
fungi | (this is all with the admin creds listed in our notes) | 21:46 |
fungi | `compute agent list` is similarly rejected by policy | 21:46 |
fungi | might be due to clouds.yaml including a project_name, project_id and user_domain_name which are probably irrelevant for admin? but commenting them out i get a message that the service catalog is empty | 21:49 |
fungi | hypervisor list also disallowed by policy | 21:51 |
clarkb | fungi: have you tried using the creds on the host? Maybe they differ in some way that is important | 22:10 |
clarkb | also check the history on the nodes for what I've done in the past? I seem to recall some things are weird about admin and you have to be explicit about it | 22:11 |
clarkb | fungi: `source /etc/kolla/admin-openrc.sh` | 22:16 |
clarkb | and `source /opt/kolla-ansible/victoria/bin/activate` to get the built in openstack client | 22:16 |
clarkb | I'm able to run `openstack compute service list` after doing that | 22:16 |
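So the disable step, end to end, would be roughly this (the compute host name is a placeholder; the two source commands are the ones quoted above):

    source /etc/kolla/admin-openrc.sh
    source /opt/kolla-ansible/victoria/bin/activate
    # stop nova scheduling new instances onto the mirror's host
    openstack compute service set --disable \
      --disable-reason 'reserved for mirror VM' <compute-host> nova-compute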
ianw | clarkb: if you have a sec to double check the syntax in https://review.opendev.org/q/topic:f37 system-config stuff, i can monitor. pretty mechanical, just swapping the mirroring from 35/36 to 36/37 | 22:33 |
clarkb | ianw: re https://review.opendev.org/c/opendev/system-config/+/876488/1 I feel like every time we look this up someone is still using it? | 22:34 |
clarkb | I mean they really shouldn't but... | 22:35 |
ianw | clarkb: i think that the problem used to be old branches, but now victoria isn't even using it | 22:35 |
ianw | it's switched to coreos or whatever it's called now | 22:35 |
clarkb | right but ussuri and stuff still exist? | 22:36 |
clarkb | fwiw I want to delete it because its one of my biggest complaints with magnum as a user. It relies on ancient tools by the time you actually get deployed in production | 22:36 |
ianw | no i think it's all retired now, at least the branches aren't there in gitea | 22:36 |
clarkb | oh huh I guess the openstack branch cleanups finally got rid of those | 22:37 |
clarkb | ok ya if the older branches are gone then this should be fine. And really if we end up forcing the issue for any stragglers thats probably a good thing at this point | 22:37 |
ianw | so i don't think it's an issue for CI at least | 22:37 |
fungi | oh, got it, i was using the creds from our notes. i'll revisit with what's on the servers | 22:38 |
clarkb | fungi: its possible what is in our notes was from an early iteration of the cloud that got wiped and replaced? That happened a couple of times and I don't recall when those creds were written relative to that | 22:39 |
fungi | makes sense, yeah | 22:39 |
clarkb | ianw: related: I think we can drop xenial-* from the ubuntu ports mirror pretty safely | 22:42 |
ianw | yeah, i'm not sure we ever published a xenial image | 22:42 |
clarkb | for some reason we have like 1.4GB of thunderbird packages in the centos 8 stream repo | 22:45 |
clarkb | shouldn't it be clearing out old versions of that? | 22:45 |
clarkb | https://mirror.bhs1.ovh.opendev.org/centos/8-stream/core/x86_64/centos-plus/Packages/t/ ? | 22:45 |
clarkb | oh its closer to 3GB due to aarch64 almost doubling it | 22:46 |
clarkb | debian-security needs a quota bump | 22:47 |
ianw | it is the same as upstream, at least, http://mirror.centos.org/centos/8-stream/core/x86_64/centos-plus/Packages/t/ | 22:47 |
clarkb | weird | 22:47 |
clarkb | re debian-security stretch is still in there. I think we can drop that now too? | 22:48 |
fungi | yeah, should be able to, maybe it missed a manual cleanup step | 22:48 |
clarkb | I'm trying to remember what the process is to drop things from reprepro. You pull them from the config then do manual syncing? | 22:48 |
fungi | also keep in mind we're probably a little over a month from the bookworm release | 22:48 |
clarkb | stretch is still in the regular debian repo too | 22:49 |
clarkb | I think we removed stretch from our reprepro configs but then didn't remove it from the mirror? | 22:50 |
clarkb | though its not clear to me if the packages are in the pool or not | 22:51 |
clarkb | spot checking the 0ad package I don't see stretch's version in the pool. Its just listed under dists so I think that is not going to reclaim a bunch of space | 22:52 |
ianw | i do feel like that got cleared ... | 22:52 |
clarkb | ianw: ya I think this is just stale since the reprepro cleanup won't remove the indexes | 22:52 |
clarkb | I'm checking the -security side next | 22:52 |
ianw | 2021-11-12 debian-stretch : merge and babysit removal changes | 22:52 |
ianw | manual cleanup reprepro for debian/debian-security | 22:53 |
ianw | from my notes, so it would have happened about then | 22:53 |
clarkb | yup both seem to lack stretch packages in the pools | 22:54 |
clarkb | its just the index stubbed out | 22:54 |
clarkb | we're probably going to need to bump the quota for debian-security in that case | 22:54 |
clarkb | I guess packages that go into -security eventually end up in the regular repos but don't get cleaned out of -security? | 22:58 |
clarkb | fungi: ^ | 22:58 |
clarkb | ianw: hrm I think xenial in ubuntu-ports is in the same situation | 23:00 |
clarkb | the indexes are stubbed out but we don't seem to configure reprepro to mirror it and if I cross check Packages and pool the packages are not present | 23:01 |
ianw | 2021-10-29 : remove ubuntu-ports xenial | 23:01 |
ianw | `<https://review.opendev.org/c/opendev/system-config/+/815914/>`__ | 23:01 |
clarkb | do we just need to rm the pool/ dirs? | 23:01 |
clarkb | er sorry the dists/ dirs | 23:01 |
ianw | this is what i would have done ... https://review.opendev.org/c/opendev/system-config/+/815920/2/doc/source/reprepro.rst | 23:02 |
clarkb | it won't save a lot of space but could cut down on confusion | 23:02 |
ianw | it may well be that the clearvanished doesn't remove those | 23:02 |
clarkb | ya I think I ran into this with something else | 23:02 |
clarkb | I think clearvanished only cleans up pool/ but not dists/. fungi do you recall and is simply rm'ing those dirs out of the dists/ dir the right step? | 23:03 |
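A sketch of that cleanup (AFS paths assumed; clearvanished is the documented reprepro command for dropping database traces of removed distributions):

    reprepro -b /afs/.openstack.org/mirror/debian clearvanished
    # stale index trees under dists/ may still need manual removal
    rm -rf /afs/.openstack.org/mirror/debian/dists/stretch*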
clarkb | separately, the growth on these mirrors over the last 3 months is surprising for distros that are not quite static but fairly stable | 23:03 |
ianw | clarkb: did you have thoughts on https://paste.opendev.org/show/bmy36m7TO2b4cpndO5gO/ , which is the All-Projects update? | 23:08 |
fungi | clarkb: i don't know, not sure we've ever tried that | 23:08 |
clarkb | ianw: sorry I missed that because it isn't in a change. Was there a docs change though? maybe not in that stack? | 23:09 |
fungi | as for growth, i wonder if it's divergence between sid and bookworm since the freeze started? but we don't mirror either of those to my knowledge, so likely unrelated | 23:10 |
ianw | clarkb: yeah, basically the same thing to the bootstrap docs -> https://review.opendev.org/c/opendev/system-config/+/876237/ & https://review.opendev.org/c/opendev/system-config/+/876236/1 | 23:10 |
clarkb | ianw: copyCondition = changekind:NO_CODE_CHANGE OR changekind:TRIVIAL_REBASE OR is:MIN <- is the only thing that jumps out since previously we were only copying min and trivial rebases. Not no code change. I think the difference is that we'll preserve votes if the commit message changes and we probably shouldn't do that? | 23:11 |
clarkb | ianw: also All-Projects is probably somethingto update after the openstack release since it will have broad impact? | 23:12 |
ianw | well i guess the theory is it either has no impact or we roll it back immediately | 23:12 |
ianw | certainly not a push and walk away thing | 23:13 |
clarkb | I left a note on the change about the copycondition | 23:14 |
clarkb | ya thats a good point | 23:14 |
ianw | good point i'm wondering why i put no_code_change | 23:14 |
clarkb | I'm putting the agenda together. ACL things are on there (though I may need to update some links). fungi do you want me to add the rax-ord and inmotion mirror stuff? | 23:15 |
opendevreview | Ian Wienand proposed opendev/system-config master: doc/gerrit : update copyCondition https://review.opendev.org/c/opendev/system-config/+/876236 | 23:23 |
opendevreview | Ian Wienand proposed opendev/system-config master: doc/gerrit : update to submit-requirements https://review.opendev.org/c/opendev/system-config/+/876237 | 23:23 |
clarkb | ianw: oh heh I just left a comment about the overrides on https://review.opendev.org/c/opendev/system-config/+/876237 too | 23:24 |
clarkb | is it only infra-specs that does that? If so I think we can "break" things for infra-specs and force us to leave a code review and a rollcall | 23:24 |
ianw | i think there's two that do that -- but i think it's copied from infra-specs, let me check | 23:25 |
ianw | yeah -> https://review.opendev.org/c/openstack/project-config/+/875804/4/gerrit/acls/openstack/governance.config | 23:25 |
clarkb | ianw: if its just those two I think we can ask the openstack tc to see if they are ok leaving a +2 as well | 23:26 |
clarkb | I like not having overrides for unnecessary special behaviors :) | 23:27 |
opendevreview | Merged opendev/system-config master: mirror-update: stop mirroring old atomic version https://review.opendev.org/c/opendev/system-config/+/876488 | 23:27 |
ianw | if we want to override, I think the submit-requirement needs to be called "Code-Review" | 23:27 |
clarkb | I'm not sure the actual submit-requirement names mean anything? | 23:28 |
clarkb | but maybe what they mean by override in this case is replacing a named submit-requirement and not changing the submittableIf? | 23:28 |
ianw | once again the docs are a bit unclear | 23:28 |
clarkb | agreed :) | 23:28 |
ianw | https://gerrit-review.googlesource.com/Documentation/config-submit-requirements.html#inheritance | 23:28 |
ianw | "administrators can redefine the requirement with the same name in the child project and set the applicableIf expression to is:false" | 23:29 |
clarkb | aha so ya I guess the name is used to know what to override | 23:29 |
clarkb | X is replaced by X' | 23:29 |
ianw | that was what made me think it looks at the name and basically overwrites | 23:29 |
clarkb | but you could have two different submit requirements that look at code-review submittableif conditions and only override one or the other | 23:30 |
ianw | i'm not sure it would match that. i feel like it would treat each s-r totally separately? | 23:32 |
clarkb | ya I think so | 23:32 |
clarkb | its just confusing because labels and submit requirements don't have a 1:1 relationship but overrides do | 23:33 |
ianw | what i can do is send a doc update change that explains it the way i think it works. which can either be accepted or rejected, assuming anyone wants to review it | 23:33 |
clarkb | and the docs don't really go into a ton of depth here :/ | 23:33 |
ianw | I think i'm in agreement that the best way to avoid problems overriding code-review is to just not play the game. i'll double check what's doing that, and i think we can probably propose to change those ACL's as a first step | 23:34 |
clarkb | ++ | 23:35 |
clarkb | ianw: we control acls through centralized code review which makes overrides pretty safe. But ya I agree | 23:40 |
fungi | also we have a linter written in a turing-complete language which can essentially enforce whatever policies we want to enact around that | 23:44 |
ianw | this is true, but i wondered how far to go with the normalization script with this | 23:45 |
ianw | i mean i could make it convert what we have to submit-requirements, but it seems like overkill | 23:45 |
clarkb | if its really only a small number of situations then I think simplifying is a good thing | 23:45 |
clarkb | its not the end of the world to leave a +2 and a +1 | 23:46 |
clarkb | fungi: re inmotion I'm still ssh'd in there. should I disable the compute service on the node running the mirror, then we can start it again? | 23:46 |
ianw | there's actually 4 that do it | 23:48 |
ianw | opendev/infra-specs.config openinfra/transparency-policy openstack/governance.config starlingx/governance.config | 23:48 |
fungi | clarkb: oh, if you want to please go for it. i hadn't freed back up sufficiently to revisit it yet and my evening is encroaching | 23:49 |
fungi | i can probably get to it in my morning tomorrow otherwise | 23:49 |
clarkb | I'm hoping to get it done today so that the base job runs successfully | 23:49 |
clarkb | let me give it a go | 23:49 |
fungi | oh, i didn't think about it blocking the base job, we can also add the mirror there to the emergency disable list temporarily so it will be skipped | 23:50 |
clarkb | ya but this should go quickly once I identify the compute node hosting the mirror | 23:51 |
fungi | .130 (parakeet) was the one with all the oom events | 23:51 |
ianw | https://review.opendev.org/c/openstack/project-config/+/185785 has the thinking behind it | 23:52 |
ianw | given the context there, it being a thought-out approach to comments on TC issues, i'm not sure i'd want to argue for not overriding Code-Review | 23:55 |
clarkb | is there no good way to go from a hostId to the compute list? | 23:56 |
clarkb | this seems like it should be trivial and yet | 23:56 |
fungi | hostid is hashed by the tenant/project for privacy reasons | 23:57 |
fungi | so the same host has a different hostid depending on which project the user querying it belongs to | 23:58 |
clarkb | how are you supposed to use it then? | 23:59 |
fungi | i remember this coming up when we originally asked the nova devs to expose a host identifier to normal users | 23:59 |
fungi | i have to believe there's an admin function to convert it or look it up | 23:59 |
fungi | but i've never been on that end of the situation | 23:59 |
clarkb | I can't show the instance as admin because the project stuff is all wrong too it seems like | 23:59 |