opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [wip] f37 https://review.opendev.org/c/openstack/diskimage-builder/+/876482 | 00:22 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [wip] f37 https://review.opendev.org/c/openstack/diskimage-builder/+/876482 | 00:50 |
clarkb | ianw: thank you for the reviews on the gitea stack. I'll fix that last one tomorrow morning and try to review the acl stack again while I'm letting those changes make their way through to production | 00:58 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update : drop Fedora 35 https://review.opendev.org/c/opendev/system-config/+/876486 | 03:49 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update: Add Fedora 37 https://review.opendev.org/c/opendev/system-config/+/876487 | 03:49 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update : drop Fedora 35 https://review.opendev.org/c/opendev/system-config/+/876486 | 04:27 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update: Add Fedora 37 https://review.opendev.org/c/opendev/system-config/+/876487 | 04:27 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update: stop mirroring old atomic version https://review.opendev.org/c/opendev/system-config/+/876488 | 04:27 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update: drop Fedora 35 https://review.opendev.org/c/opendev/system-config/+/876486 | 04:31 |
opendevreview | Ian Wienand proposed opendev/system-config master: mirror-update: Add Fedora 37 https://review.opendev.org/c/opendev/system-config/+/876487 | 04:31 |
*** jpena|off is now known as jpena | 08:06 | |
opendevreview | Maksim Malchuk proposed openstack/diskimage-builder master: Add swap support https://review.opendev.org/c/openstack/diskimage-builder/+/869270 | 10:55 |
opendevreview | Maksim Malchuk proposed openstack/diskimage-builder master: Add swap support https://review.opendev.org/c/openstack/diskimage-builder/+/869270 | 11:19 |
opendevreview | Maksim Malchuk proposed openstack/diskimage-builder master: Add swap support https://review.opendev.org/c/openstack/diskimage-builder/+/869270 | 11:27 |
opendevreview | Maksim Malchuk proposed openstack/diskimage-builder master: Add swap support https://review.opendev.org/c/openstack/diskimage-builder/+/869270 | 11:45 |
opendevreview | Maksim Malchuk proposed openstack/diskimage-builder master: Add swap support https://review.opendev.org/c/openstack/diskimage-builder/+/869270 | 11:47 |
opendevreview | Maksim Malchuk proposed openstack/diskimage-builder master: Add swap support https://review.opendev.org/c/openstack/diskimage-builder/+/869270 | 11:54 |
fungi | looking at the nodepool graphs for rax-ord, there's something pretty wrong in there | 13:49 |
fungi | from what i can piece together, nodepool is getting a bunch of launch failures with "Timeout waiting for instance creation" and then proceeds to ask the cloud to delete the node and immediately deletes the znode, so nodepool is no longer tracking those, but they're hanging around in an active state in the cloud consuming quota for ages | 13:50 |
fungi | i'm watching one which had its deletion requested over half an hour ago and is still in an active state according to openstack server show | 13:51 |
fungi | anyway, the end result is that we're averaging something like 5% effective utilization of the quota we have there | 13:52 |
fungi | and since it's the largest quota of any region we have access to, that's a huge chunk of our aggregate quota we can't use (more than 25%) | 13:54 |
fungi | node 0033377524 is one of the examples i'm looking at | 13:55 |
fungi | 2023-03-06 13:19:22,943 INFO nodepool.StateMachineNodeDeleter.rax-ord: [node: 0033377524] Deleting ZK node id=0033377524, state=deleting, external_id=None | 13:55 |
fungi | corresponding server instance c8f2f797-004a-4f26-8883-798ed0561926 finally disappeared moments ago, roughly 37 minutes after nl01 issued the delete | 13:57 |
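One way to watch such a stuck instance (a minimal sketch; the `--os-cloud` name is an assumption, the UUID is the one quoted above):

    watch -n 60 "openstack --os-cloud rax-ord server show \
      c8f2f797-004a-4f26-8883-798ed0561926 -c status -c OS-EXT-STS:vm_state"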
fungi | also nl01 is logging a bunch of tracebacks checking the quota | 13:58 |
fungi | File "/usr/local/lib/python3.11/site-packages/nodepool/driver/utils.py", line 355, in estimatedNodepoolQuotaUsed | 13:59 |
fungi | if node.type[0] not in provider_pool.labels: | 13:59 |
fungi | IndexError: list index out of range | 13:59 |
fungi | i don't see any other launchers logging that exception, so could be something specific to rackspace's api responses i suppose | 13:59 |
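A quick way to confirm which launchers log that traceback (the log path is an assumption based on a typical nodepool launcher deployment):

    # show surrounding context for each occurrence of the quota exception
    grep -B 5 'IndexError: list index out of range' \
      /var/log/nodepool/launcher-debug.log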
frickler | this also doesn't look good https://paste.opendev.org/show/bZ1r1HWRmIQacGhpVhN8/ | 14:33 |
frickler | not sure whether we have created a busy loop of lots of creations leading to long startup times leading to lots of timeouts | 14:34 |
frickler | we could consider bumping the launch-timeout. or lower the quota for some time to see if it recovers | 14:35 |
frickler | maybe there's also another bug in the new state machine code. seems we don't have good test coverage for that | 14:37 |
fungi | yeah, i suppose if we're deleting the node from zk immediately and relying on the quota checking to keep us honest, but then can't actually check the utilization because of that exception and fall back on max-servers and the assumption that the nodes it knows about are the only ones that exist, then we could be trying to boot over quota too | 14:41 |
fungi | might even be leading us to hammer the api, kicking api rate limits into effect, slowing our calls down even more and creating a vicious cycle | 14:50 |
fungi | need to go run some errands, but should be back within the hour | 14:57 |
clarkb | the rax ord thing appears to have been going on for months. | 16:14 |
clarkb | I suspect it is something to do with the cloud itself given that and the lack of issues for the other two regions | 16:14 |
clarkb | I would probably start by increasing boot timeout? | 16:16 |
clarkb | since it appears to be booting things successfully but deciding they aren't coming up fast enough? | 16:16 |
*** gthiemon1e is now known as gthiemonge | 16:24 | |
clarkb | oof launch timeout is already 10 minutes | 16:25 |
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: Correct boot path to cover FIPS usage cases https://review.opendev.org/c/openstack/diskimage-builder/+/876192 | 16:31 |
opendevreview | Clark Boylan proposed opendev/system-config master: Switch borg backup from gitea01 to gitea09 https://review.opendev.org/c/opendev/system-config/+/876471 | 16:35 |
clarkb | infra-root if you get a chance to look at the new gitea servers I think https://review.opendev.org/c/opendev/system-config/+/876448 is ready to go | 16:36 |
clarkb | re rax ord maybe the thing to do is set max-servers to 0 and let it clean up after itself | 16:52 |
clarkb | then increase the number slowly and see if the timeouts persist | 16:53 |
clarkb | the number == max-servers | 16:53 |
fungi | yeah, i'm looking through the config, we already set boot-timeout: 120 and launch-timeout: 600 across all rackspace regions | 16:54 |
fungi | which is pretty lengthy | 16:54 |
clarkb | fungi: do we know which timeout we are hitting? | 16:54 |
fungi | though this is boot-timeout it's running into i think | 16:55 |
fungi | "Timeout waiting for instance creation" | 16:55 |
clarkb | boot-timeout is the timeout waiting for openstack to report a node ready iirc. and then launch timeout is time to be able to ssh in | 16:55 |
fungi | yeah, so maybe i'll up that to 300 and see if it helps | 16:55 |
fungi | just in ord | 16:55 |
clarkb | wfm | 16:55 |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Increase boot-timeout for rax-ord https://review.opendev.org/c/openstack/project-config/+/876592 | 16:58 |
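A hedged sketch of the provider stanza that change presumably touches (the exact file layout in project-config is assumed):

    providers:
      - name: rax-ord
        boot-timeout: 300    # raised from 120
        launch-timeout: 600  # unchanged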
fungi | in theory the launcher should clean up after itself anyway within an hour, if everything else is working. toggling max-servers to 0 and back is probably not going to change that | 16:59 |
clarkb | well it was mostly an idea to start small and ramp up with clean data to see if we're our own worst enemy there | 17:00 |
fungi | granted my samples were random, but it seemed like the launcher was cleaning up behind itself anyway | 17:01 |
clarkb | fungi: if you have time to review https://review.opendev.org/c/opendev/system-config/+/876448/ I'd love to get that in today. | 17:07 |
clarkb | I'm thinking I will also drop 01-04 from haproxy manually to see what load looks like on the four new servers | 17:08 |
fungi | sure, sounds great | 17:12 |
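Manually dropping backends is done against the haproxy admin socket; a sketch, with the socket path and the backend/server names as assumptions:

    # repeat for gitea01 through gitea04
    echo 'disable server balance_git_https/gitea01.opendev.org' | \
      socat stdio /var/haproxy/run/stats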
*** jpena is now known as jpena|off | 17:13 | |
opendevreview | daniel.pawlik proposed zuul/zuul-jobs master: Provide deploy-microshift role https://review.opendev.org/c/zuul/zuul-jobs/+/876081 | 17:22 |
opendevreview | Merged openstack/project-config master: Increase boot-timeout for rax-ord https://review.opendev.org/c/openstack/project-config/+/876592 | 17:38 |
fungi | that deployed about 10 minutes ago, so hopefully we'll see the graph there smooth out by 19:00z | 17:55 |
clarkb | ianw: I left some comments on the acl stack. Let me know what you think about the submit-requirement implied mapping problem | 17:58 |
opendevreview | Merged opendev/system-config master: Replace gitea05-07 with gitea10-12 in haproxy https://review.opendev.org/c/opendev/system-config/+/876448 | 18:30 |
fungi | clarkb: mirror.iad3.inmotion.opendev.org seems to be offline again, powered off since 2023-03-03T18:15:28Z (3 days ago), but non-impacting since we zeroed max-servers for that provider. what's the best way to go about figuring out what's broken in there? | 18:40 |
fungi | this is three times in two weeks, so something is definitely repeatedly killing the instance | 18:40 |
fungi | i guess i can ssh into the nova controller and look at the service logs for clues? | 18:41 |
clarkb | fungi: the first thing I would check is if the nova api (server show) lists any errors | 18:41 |
clarkb | you can run that as our normal user and as admin. I think admin may get more info | 18:41 |
fungi | server show doesn't report an error condition, no | 18:42 |
clarkb | ok. In that case I'd probably find the hypervisor and see what nova compute logs and virsh/libvirt/qemu have to say about it | 18:42 |
clarkb | there should be instance logs in /var/run/libvirt/something/or/other iirc | 18:42 |
fungi | just that the power_state is Shutdown, vm_state is stopped, status is SHUTOFF | 18:42 |
fungi | ah, as admin. i'll see if we have credentials for that in clouds.yaml already | 18:43 |
clarkb | fungi: well if it doesn't show an error then admin probably won't | 18:44 |
clarkb | fungi: we don't have them in bridge clouds.yaml but they are in a clouds.yaml or an openrc on the hosts themselves | 18:44 |
fungi | no, no error whatsoever, just looks as though someone logged into the server and issued a poweroff | 18:44 |
clarkb | ya so unlikely to be any different listing things as admin | 18:44 |
fungi | i vaguely remember something similar happening to our mirror in the older linaro deployment | 18:45 |
fungi | or might have been the builder | 18:45 |
clarkb | ya I would look in the libvirt/qemu/nova compute logs | 18:45 |
clarkb | those are actually two separate things but I would see if they have any hints | 18:46 |
jrosser | you would get something like that if the OOM killer terminated the VM? | 18:47 |
clarkb | jrosser: yup or if the VM hit some sort of nested virt failure (we have nested virt enabled but that vm doesn't do virt, but maybe its tripping it anyway) | 18:49 |
fungi | sure, i suppose dmesg on the compute hosts would be a good first thing to check | 18:49 |
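A minimal sketch of that check once logged into a hypervisor (enumerating the hosts is worked out below):

    # human-readable timestamps, match OOM-killer activity
    dmesg -T | grep -iE 'out of memory|oom-killer'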
fungi | how do i find the names/addresses of all the compute hosts? | 18:49 |
clarkb | fungi: I think you login to the control panel with the secrets file infos and that gives you a listing. Its also the first three IPs after the api endpoint iirc | 18:50 |
clarkb | there are a couple of extra hypervisors now too and I don't recall if they are in order too or not | 18:50 |
fungi | oh, there's a control panel? i probably knew that at one point and then forgot | 18:51 |
clarkb | ya there is the baremetal control panel which is separate from the openstack horizon stuff. The details for both are in secrets iirc | 18:52 |
clarkb | but that control panel should list all the hypervisors and their IPs | 18:52 |
fungi | thanks. i'll see if i have an opportunity to take a look there in a bit | 18:52 |
fungi | i guess it's been a while since we tried that. the credentials we have on file are giving me "Invalid e-mail address and/or password" | 18:55 |
clarkb | hrm | 18:55 |
fungi | unrelated, looks like infra-prod-base failed on deploy for the gitea-lb update, likely due to the inmotion mirror being offline again | 18:56 |
fungi | so maybe not so unrelated i guess | 18:56 |
clarkb | ya I'm not too worried about that | 18:56 |
clarkb | it applied the lb update anyway | 18:56 |
clarkb | fungi: maybe try resetting the password? it should go to the shared email inbox. I just looked and there is/was email sent there | 18:57 |
clarkb | otherwise we may need to reach out to them for help. For logging into hypervisors they are the three ips after the api endpoint though | 18:57 |
clarkb | (also you can ssh to the api endpoint and you get load balanced to a random one) | 18:57 |
clarkb | jimmy might be of some help there too since I think things got split into two companies? | 18:58 |
fungi | yuris was in here for a while too, i thought | 18:59 |
clarkb | I think yuris didn't maintain a persistent irc client | 18:59 |
clarkb | but ya has been in and out but isn't here now | 18:59 |
fungi | okay, so the old login url we had in our notes goes to a completely different system now. using the correct (new) url, i'm able to get into the webui there | 19:01 |
fungi | manage->assets shows ip addresses for servers, though it's unclear what roles they play as they have generic types and randomly generated names | 19:03 |
fungi | i'm guessing the three with more ram are the compute hosts? | 19:03 |
clarkb | fungi: the way the deployment was made was a converged set of three hosts doing everything. Then to get the max-servers count out I think a couple of compute only hosts were added | 19:04 |
clarkb | fungi: but ya the automated deployment system doesn't name things with helpful hints | 19:04 |
fungi | the assets list has 3x servers with 16 cores and 128gb ram, also 3x servers with 40 cores and 510gb ram | 19:05 |
fungi | so yeah, i suppose the smaller servers were the ones added to get the additional /28 netblock assignments | 19:05 |
clarkb | I'm not sure which are which. You'll probably need to use the nova api to help you sort out which host that vm was on | 19:05 |
clarkb | server show will give you a host id | 19:06 |
clarkb | then there is some nova services listing that will give you the host ids mapped to useful things | 19:06 |
fungi | does horizon have admin bits, or is it strictly for non-admin features? | 19:06 |
clarkb | fungi: `openstack compute service list` | 19:06 |
clarkb | I don't know. I avoid using horizon as much as possible >_> | 19:07 |
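Putting those two steps together as admin (the host attribute is admin-only; the instance name is the one cited above and may differ on the nova side):

    # which compute host is the mirror on?
    openstack server show mirror.iad3.inmotion.opendev.org \
      -c hostId -c OS-EXT-SRV-ATTR:host
    # map host names to compute services
    openstack compute service list --service nova-compute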
fungi | i was assuming i'd need to use it to get the correct clouds.yaml values, but i guess those don't change for admin context anyway so i can reuse most of what's in our existing clouds.yaml | 19:08 |
clarkb | fungi: the clouds.yaml is on those hosts | 19:08 |
clarkb | if you ssh into one of them you'll have access to what is necessary to use openstack client | 19:08 |
fungi | catch 22, i don't have ssh keys for them | 19:09 |
clarkb | you should. Pretty sure we added your keys when this was deployed | 19:09 |
fungi | oh? i didn't even think to try | 19:09 |
fungi | hah | 19:09 |
fungi | mmm, not as fungi, but root worked! | 19:10 |
clarkb | yes we don't have specific users on these hosts | 19:10 |
clarkb | (because all of the deployment is automated by that control panel) | 19:10 |
fungi | [Fri Mar 3 18:03:51 2023] Out of memory: Killed process 4017237 (qemu-kvm) total-vm:10455440kB, anon-rss:8495744kB, file-rss:0kB, shmem-rss:4kB, UID:42436 pgtables:17752kB oom_score_adj:0 | 19:12 |
clarkb | cool. I don't think we are oversubscribing memory in openstack. But we are hyperconverged or whatever | 19:12 |
clarkb | that means that maybe the openstack services are using more memory and that is impacting our VM? | 19:12 |
fungi | jrosser wins a cookie | 19:12 |
jrosser | \o/ | 19:13 |
clarkb | We might be able to deal with that by tuning max-servers down a bit and also telling openstack to use less resources? Oh also maybe we need to look for leaked resources | 19:13 |
clarkb | leaked VMs just hanging out might be consuming too much memory or something | 19:13 |
jrosser | i saw something that felt similar here when we tried to fit exactly 2 giant GPU VMs per host | 19:13 |
jrosser | and there was not quite enough memory spare with 2 instances plus $everything-else | 19:14 |
fungi | clarkb: so of the 6 servers in the assets list, one of the smaller type has a slew of oom errors in dmesg and the other 5 are clean | 19:14 |
jrosser | so the second instance to boot killed the first one | 19:14 |
clarkb | fungi: ya I think the smaller ones are the control plane | 19:14 |
clarkb | fungi: another option may be to move the mirror to one of the larger nodes | 19:14 |
clarkb | since they are just compute nodes iirc | 19:14 |
jrosser | if there are things running on some nodes that nova doesnt know about you can use `reserved_host_memory_mb` | 19:16 |
clarkb | jrosser: I think this deployment is already doing that, but ya memory needs may have expanded beyond that existing value | 19:16 |
fungi | checking memory utilization, right now mysqld, glance, nova, cinder, ceph et al are taking up around 2/3 of the available memory on this machine. i don't see any qemu processes (unsurprising since we're not booting nodes there and the mirror is presently down), but i think that rules out lost virtual machines taking up excess ram | 19:17 |
clarkb | ++ | 19:17 |
clarkb | we could also try rebooting/restarting services so that they give back to the operating system | 19:18 |
clarkb | but then we run the risk of rabbit getting angry | 19:18 |
clarkb | but the cloud isn't in use so the risk to us if that happens is basically nil | 19:18 |
fungi | i wonder if we can exclude this server from use for job nodes? | 19:21 |
fungi | it has sufficient memory to run the mirror vm, but if we boot more than a few job nodes it won't | 19:21 |
clarkb | fungi: we could run the mirror there and then set the value jrosser pointed to | 19:21 |
clarkb | fungi: that seems like a good thing to try | 19:22 |
clarkb | basically set it so that nova won't try to run anything else there because the mirror is already consuming those resources | 19:22 |
jrosser | i wonder how much value there is actually in trying to run VM on those smaller nodes at all | 19:24 |
clarkb | #status log Manually disabled gitea01-04 in haproxy to force traffic to go to the new larger gitea servers. This can be undone if the larger servers are not large enough to handle the load. | 19:26 |
opendevstatus | clarkb: finished logging | 19:26 |
clarkb | infra-root ^ fyi | 19:26 |
clarkb | jrosser: well we're severely resource constrained. So the value is in getting as much as we can out of the system | 19:27 |
fungi | the smaller servers each represent 1/15 (approximately 7%) of our overall memory capacity | 19:29 |
fungi | i'm guessing this one is where most of the central openstack services are parked though, hence much of the overhead being not virtual machines | 19:30 |
fungi | though oddly no, looking through each of the servers, free reports approximately 24gb available on each of the smaller servers, and 432gb available on each of the larger servers | 19:32 |
fungi | so there's around 80-100gb occupied by running services on each of those 6 servers | 19:33 |
clarkb | ya so best thing may be to simply edit the reservation that nova avoids | 19:34 |
clarkb | and possibly even prevent nova from launching test vms on those nodes alltogether | 19:34 |
fungi | i guess if i look at strictly used (no shmem or buffers/cache) it's more like 50-90gb overhead on each server | 19:35 |
fungi | well, strangely it's only the server with the mirror booted on it which was getting into an oom situation | 19:35 |
clarkb | fungi: probably because it is the only one with a long lived VM on it | 19:35 |
clarkb | so its going to end up with more memory pressure over time on average? | 19:36 |
fungi | probably. also i wonder if adding a gb or two of swap to these would be a terrible idea, just so the kernel can page out infrequently accessed data to free up more for cache | 19:36 |
fungi | right now none of them has any swap memory at all | 19:37 |
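Adding a small swapfile would be something like this sketch (size per the discussion, not tuned):

    fallocate -l 2G /swapfile
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile
    # persist across reboots
    echo '/swapfile none swap sw 0 0' >> /etc/fstab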
fungi | clarkb: do you happen to know whether there's a specific reason that was avoided? | 19:38 |
clarkb | fungi: no, its all done by their tooling | 19:38 |
clarkb | the only thing we select is the number of nodes. We don't select partition layouts, IPs, hostnames, etc | 19:38 |
fungi | i suppose they tried adding small swap partitions and determined they weren't much help | 19:39 |
fungi | or created some sort of problem | 19:40 |
clarkb | or they subscribe to swap is bad and never swap | 19:40 |
fungi | in good news, the rax-ord graph looks a bit healthier since 19:00z, but won't really know until the next round of daily periodic jobs kicks off | 19:43 |
clarkb | the giteas seem to be working after I reduced their number to 4 | 19:48 |
clarkb | I'll leave things like this. If this keeps up I think we should consider reducing total giteas to 4 or maybe 6 | 19:49 |
fungi | agreed, we can observe and see if they end up being more/less loaded | 19:54 |
ianw | clarkb: that's a good point about label-name != submit-requirement, it could be confusing. i agree we should match on the label in the s-r. i'll rework that | 20:17 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: A new diskimage-builder command for yaml image builds https://review.opendev.org/c/openstack/diskimage-builder/+/876245 | 20:26 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Switch run_functests.sh from disk-image-create to diskimage-builder https://review.opendev.org/c/openstack/diskimage-builder/+/876479 | 20:26 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Document diskimage-builder command https://review.opendev.org/c/openstack/diskimage-builder/+/876633 | 20:26 |
clarkb | the vast majority of the gitea demand is on 09 right now. It seems to be keeping up which is cool | 20:33 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/876449 is the next gitea task. This removes 05-07 from gerrit replication. If you're happy with the new servers so far I think we can land this one | 20:34 |
fungi | yeah, seems they're managing the current load | 20:43 |
clarkb | https://gerrit-review.googlesource.com/c/gerrit/+/362054 is the change I promised the gerrit community meeting I would write | 20:49 |
fungi | clarkb: so if we were to set reserved_host_memory_mb in the scheduler config, i guess that would go on whichever server is running nova's controller service? or just on all of them? any idea if the deployment tooling for this has somewhere to register/persist config overrides like that so they survive redeployment? | 20:51 |
fungi | or is there a scheduler configured on each hypervisor host? | 20:52 |
JayF | clarkb: I don't have an account there, but s/safetey/safety/ on LN 5427 | 20:52 |
clarkb | fungi: you would apply it to the server that runs the mirror so that nova doesn't schedule too much workload on that host causing OOMs. I suspect that will go into the nova scheduler/placement databases and no there isn't anything to persist that | 20:52 |
clarkb | JayF: thanks | 20:52 |
JayF | clarkb: and thank you <3 documenting undocumented things | 20:52 |
fungi | okay, so each compute host has a scheduler config? i'm a little lost wading through the nova docs | 20:53 |
clarkb | fungi: oh does it go into the config file? | 20:53 |
clarkb | I expected that would have been a runtime thing :/ | 20:53 |
clarkb | fungi: the way the cloud is deployed there is a three node hyperconverged set of nodes. This means they run everything including the control plane, ceph, nova compute and VMs | 20:54 |
clarkb | then there are the additional nodes that only run the compute services | 20:54 |
fungi | well, web searching turned up references in newton documentation, i'll try to refine my searching | 20:54 |
clarkb | ya looks like it is a compute (not scheduler) config option | 20:55 |
clarkb | to modify that I think what you are supposed to do is edit the kolla config and do a kolla deployment | 20:55 |
fungi | so it probably moved to be host-specific after newton: https://docs.openstack.org/newton/config-reference/compute/schedulers.html | 20:56 |
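If we did go the kolla route, the override would presumably look like this sketch (the /etc/kolla/config merge path and the value are assumptions):

    # /etc/kolla/config/nova.conf on the deployment host
    [DEFAULT]
    reserved_host_memory_mb = 8192  # example value, not tuned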
opendevreview | Merged opendev/system-config master: Remove gitea05-07 from Gerrit replication https://review.opendev.org/c/opendev/system-config/+/876449 | 20:56 |
clarkb | I don't want to have to page all that in. I think instead what we might be able to do is turn on the mirror then disable the compute service on the node the mirror runs on | 20:56 |
clarkb | https://docs.openstack.org/nova/latest/admin/scheduling.html#compute-disabled-status-support | 20:57 |
fungi | i guess otherwise we need different host aggregates for the two accounts? | 20:57 |
clarkb | no the memory reservation should be global. But deploying it requires editing the kolla deployment there and while doable that could lead to a whole bunch of work | 20:58 |
fungi | we could set one of the small servers in one aggregate and everything else in the other, then make it so the control plane account only creates servers in the small dedicated aggregate and the nodepool account uses the aggregate that contains the other 5 servers | 20:58 |
fungi | is what i meant | 20:58 |
clarkb | oh ya that would be another option | 20:59 |
fungi | that way we'd never try to boot job nodes on the (smaller) server where the mirror is booted | 20:59 |
clarkb | fungi: https://wiki.openstack.org/wiki/OpsGuide-Maintenance-Compute#Planned_Maintenance | 20:59 |
fungi | unfortunately i only know about enough to throw that word salad together, not actually how it's done | 20:59 |
clarkb | I feel like simply disabling the nova compute there is the easiest thing | 21:00 |
clarkb | its a bit hacky but it should work | 21:00 |
clarkb | just the first step there is what we would need | 21:01 |
fungi | agreed. it's also reminding me why we prefer not to run our own openstack clouds | 21:01 |
clarkb | fungi: I think you need to start the mirror node before you do that though | 21:01 |
fungi | if we set the host into maintenance mode, will that block us from (re)booting the mirror there? | 21:01 |
clarkb | basically start the mirror node, disable the compute on that node, then tell nodepool it can use thing again | 21:01 |
fungi | yeah, that's what i was wondering | 21:02 |
clarkb | fungi: maybe? I don't know if it completely breaks the ability to manage a running instance | 21:02 |
clarkb | the docs there show migrate commands being valid so maybe not | 21:02 |
clarkb | melwitt: ^ would likely know | 21:02 |
fungi | but if it stops the instance later for some reason, starting it again may involve toggling that | 21:02 |
clarkb | ya worst case we just turn it back on I guess | 21:02 |
clarkb | I don't think we should overthink this since we should likely consider redeploying that cloud for other reasons anyway. Do the simple thing that works then try to incorporate what we've learned when we start over | 21:03 |
clarkb | fungi: another option would be to migrate it to one of the bigger nodes then see if the test nodes are ok on that host | 21:09 |
clarkb | but that likely requires more observation. The plan to just dedicate that node to the mirror seems simplest | 21:09 |
clarkb | oh darn the gerrit replication update failed to deploy due to the base job failures. I think thats fine and we can wait for our daily run to update it and then make sure things are happy tomorrow | 21:14 |
clarkb | assuming we fix the mirror in the meantime. Otherwise maybe we add the mirror to the emergency file so that it is skipped for now | 21:14 |
melwitt | clarkb: not sure I got the exact question but disabling a compute service should only prevent any new instance being scheduled to it. it should not affect instances that are running there already | 21:15 |
clarkb | melwitt: that is what we needed to know. Thanks! Basically we've got a compute + control plane node that has limited memory due to the control plane. We want to run a single long lived VM there and force other things to boot elsewhere. Disabling the compute service there seems like an easy straightforward way to do that | 21:15 |
clarkb | there are other more elegant tools but they are all a bit more involved and require us to understand running and deploying the cloud better | 21:16 |
melwitt | clarkb: ah gotcha. I think that should work | 21:16 |
fungi | awesome. i'll see what i can do to make that happen | 21:18 |
fungi | mmm, i get "Failed to set service status to disabled" but it doesn't give me a reason | 21:43 |
fungi | i tried with the hostname listed in the assets, and also with its public ip address | 21:44 |
fungi | possible i'm not specifying it correctly, since if i put random garbage in place of the hostname i get the same error | 21:45 |
fungi | and i can't `compute service list` as it returns a policy rejection | 21:45 |
fungi | (this is all with the admin creds listed in our notes) | 21:46 |
fungi | `compute agent list` is similarly rejected by policy | 21:46 |
fungi | might be due to clouds.yaml including a project_name, project_id and user_domain_name which are probably irrelevant for admin? but commenting them out i get a message that the service catalog is empty | 21:49 |
fungi | hypervisor list also disallowed by policy | 21:51 |
clarkb | fungi: have you tried using the creds on the host? Maybe they differ in some way that is important | 22:10 |
clarkb | also check the history on the nodes for what I've done in the past? I seem to recall some things are weird about admin and you have to be explicit about it | 22:11 |
clarkb | fungi: `source /etc/kolla/admin-openrc.sh` | 22:16 |
clarkb | and `source /opt/kolla-ansible/victoria/bin/activate` to get the built in openstack client | 22:16 |
clarkb | I'm able to run `openstack compute service list` after doing that | 22:16 |
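So the disable step, end to end, would be roughly this (the compute host name is a placeholder; the two source commands are the ones quoted above):

    source /etc/kolla/admin-openrc.sh
    source /opt/kolla-ansible/victoria/bin/activate
    # stop nova scheduling new instances onto the mirror's host
    openstack compute service set --disable \
      --disable-reason 'reserved for mirror VM' <compute-host> nova-compute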
ianw | clarkb: if you have a sec to double check the syntax in https://review.opendev.org/q/topic:f37 system-config stuff, i can monitor. pretty mechanical, just swapping the mirroring from 35/36 to 36/37 | 22:33 |
clarkb | ianw: re https://review.opendev.org/c/opendev/system-config/+/876488/1 I feel like every time we look this up someone is still using it? | 22:34 |
clarkb | I mean they really shouldn't but... | 22:35 |
ianw | clarkb: i think that the problem used to be old branches, but now victoria isn't even using it | 22:35 |
ianw | it's switched to coreos or whatever it's called now | 22:35 |
clarkb | right but ussuri and stuff still exist? | 22:36 |
clarkb | fwiw I want to delete it because its one of my biggest complaints with magnum as a user. It relies on ancient tools by the time you actually get deployed in production | 22:36 |
ianw | no i think it's all retired now, at least the branches aren't there in gitea | 22:36 |
clarkb | oh huh I guess the openstack branch cleanups finally got rid of those | 22:37 |
clarkb | ok ya if the older branches are gone then this should be fine. And really if we end up forcing the issue for any stragglers thats probably a good thing at this point | 22:37 |
ianw | so i don't think it's an issue for CI at least | 22:37 |
fungi | oh, got it, i was using the creds from our notes. i'll revisit with what's on the servers | 22:38 |
clarkb | fungi: its possible what is in our notes was from an early iteration of the cloud that got wiped and replaced? That happened a couple of times and I don't recall when those creds were written relative to that | 22:39 |
fungi | makes sense, yeah | 22:39 |
clarkb | ianw: related: I think we can drop xenial-* from the ubuntu ports mirror pretty safely | 22:42 |
ianw | yeah, i'm not sure we ever published a xenial image | 22:42 |
clarkb | for some reason we have like 1.4GB of thunderbird packages in the centos 8 stream repo | 22:45 |
clarkb | shouldn't it be clearing out old versions of that? | 22:45 |
clarkb | https://mirror.bhs1.ovh.opendev.org/centos/8-stream/core/x86_64/centos-plus/Packages/t/ ? | 22:45 |
clarkb | oh its closer to 3GB due to aarch64 almost doubling it | 22:46 |
clarkb | debian-security needs a quota bump | 22:47 |
ianw | it is the same as upstream, at least, http://mirror.centos.org/centos/8-stream/core/x86_64/centos-plus/Packages/t/ | 22:47 |
clarkb | weird | 22:47 |
clarkb | re debian-security stretch is still in there. I think we can drop that now too? | 22:48 |
fungi | yeah, should be able to, maybe it missed a manual cleanup step | 22:48 |
clarkb | I'm trying to remember what the process is to drop things from reprepro. You pull them from the config then do manual syncing? | 22:48 |
fungi | also keep in mind we're probably a little over a month from the bookworm release | 22:48 |
clarkb | stretch is still in the regular debian repo too | 22:49 |
clarkb | I think we removed stretch from our reprepro configs but then didn't remove it from the mirror? | 22:50 |
clarkb | though its not clear to me if the packages are in the pool or not | 22:51 |
clarkb | spot checking the 0ad package I don't see stretch's version in the pool. Its just listed under dists so I think that is not going to reclaim a bunch of space | 22:52 |
ianw | i do feel like that got cleared ... | 22:52 |
clarkb | ianw: ya I think this is just stale since the reprepro cleanup won't remove the indexes | 22:52 |
clarkb | I'm checking the -security side next | 22:52 |
ianw | 2021-11-12 debian-stretch : merge and babysit removal changes | 22:52 |
ianw | manual cleanup reprepro for debian/debian-security | 22:53 |
ianw | from my notes, so it would have happened about then | 22:53 |
clarkb | yup both seem to lack stretch packages in the pools | 22:54 |
clarkb | its just the index stubbed out | 22:54 |
clarkb | we're probably going to need to bump the quota for debian-security in that case | 22:54 |
clarkb | I guess packages that go into -security eventually end up in the regular repos but don't get cleaned out of -security? | 22:58 |
clarkb | fungi: ^ | 22:58 |
clarkb | ianw: hrm I think xenial in ubuntu-ports is in the same situation | 23:00 |
clarkb | the indexes are stubbed out but we don't seem to configure reprepro to mirror it and if I cross check Packages and pool the packages are not present | 23:01 |
ianw | 2021-10-29 : remove ubuntu-ports xenial | 23:01 |
ianw | `<https://review.opendev.org/c/opendev/system-config/+/815914/>`__ | 23:01 |
clarkb | do we just need to rm the pool/ dirs? | 23:01 |
clarkb | er sorry the dists/ dirs | 23:01 |
ianw | this is what i would have done ... https://review.opendev.org/c/opendev/system-config/+/815920/2/doc/source/reprepro.rst | 23:02 |
clarkb | it won't save a lot of space but could cut down on confusion | 23:02 |
ianw | it may well be that the clearvanished doesn't remove those | 23:02 |
clarkb | ya I think I ran into this with something else | 23:02 |
clarkb | I think clearvanished only cleans up pool/ but not dists/. fungi do you recall and is simply rm'ing those dirs out of the dists/ dir the right step? | 23:03 |
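A sketch of that cleanup (AFS paths assumed; clearvanished is the documented reprepro command for dropping database traces of removed distributions):

    reprepro -b /afs/.openstack.org/mirror/debian clearvanished
    # stale index trees under dists/ may still need manual removal
    rm -rf /afs/.openstack.org/mirror/debian/dists/stretch*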
clarkb | separately, the growth on these mirrors over the last 3 months is surprising for distros that are not quite static but fairly stable | 23:03 |
ianw | clarkb: did you have thoughts on https://paste.opendev.org/show/bmy36m7TO2b4cpndO5gO/ , which is the All-Projects update? | 23:08 |
fungi | clarkb: i don't know, not sure we've ever tried that | 23:08 |
clarkb | ianw: sorry I missed that because it isn't in a change. Was there a docs change though? maybe not in that stack? | 23:09 |
fungi | as for growth, i wonder if it's divergence between sid and bookworm since the freeze started? but we don't mirror either of those to my knowledge, so likely unrelated | 23:10 |
ianw | clarkb: yeah, basically the same thing to the bootstrap docs -> https://review.opendev.org/c/opendev/system-config/+/876237/ & https://review.opendev.org/c/opendev/system-config/+/876236/1 | 23:10 |
clarkb | ianw: copyCondition = changekind:NO_CODE_CHANGE OR changekind:TRIVIAL_REBASE OR is:MIN <- is the only thing that jumps out since previously we were only copying min and trivial rebases. Not no code change. I think the difference is that we'll preserve votes if the commit message changes and we probably shouldn't do that? | 23:11 |
clarkb | ianw: also All-Projects is probably somethingto update after the openstack release since it will have broad impact? | 23:12 |
ianw | well i guess the theory is it either has no impact or we roll it back immediately | 23:12 |
ianw | certainly not a push and walk away thing | 23:13 |
clarkb | I left a note on the change about the copycondition | 23:14 |
clarkb | ya thats a good point | 23:14 |
ianw | good point i'm wondering why i put no_code_change | 23:14 |
clarkb | I'm putting the agenda together. ACL things are on there (though I may need to update some links). fungi do you want me to add the rax-ord and inmotion mirror stuff? | 23:15 |
opendevreview | Ian Wienand proposed opendev/system-config master: doc/gerrit : update copyCondition https://review.opendev.org/c/opendev/system-config/+/876236 | 23:23 |
opendevreview | Ian Wienand proposed opendev/system-config master: doc/gerrit : update to submit-requirements https://review.opendev.org/c/opendev/system-config/+/876237 | 23:23 |
clarkb | ianw: oh heh I just left a comment about the overrides on https://review.opendev.org/c/opendev/system-config/+/876237 too | 23:24 |
clarkb | is it only infra-specs that does that? If so I think we can "break" things for infra-specs and force us to leave a code review and a rollcall | 23:24 |
ianw | i think there's two that do that -- but i think it's copied from infra-specs, let me check | 23:25 |
ianw | yeah -> https://review.opendev.org/c/openstack/project-config/+/875804/4/gerrit/acls/openstack/governance.config | 23:25 |
clarkb | ianw: if its just those two I think we can ask the openstack tc to see if they are ok leaving a +2 as well | 23:26 |
clarkb | I like not having overrides for unnecessary special behaviors :) | 23:27 |
opendevreview | Merged opendev/system-config master: mirror-update: stop mirroring old atomic version https://review.opendev.org/c/opendev/system-config/+/876488 | 23:27 |
ianw | if we want to override, I think the submit-requirement needs to be called "Code-Review" | 23:27 |
clarkb | I'm not sure the actual submit-requirement names mean anything? | 23:28 |
clarkb | but maybe what they mean by override in this case is replacing a named submit-requirement and not changing the submittableIf? | 23:28 |
ianw | once again the docs are a bit unclear | 23:28 |
clarkb | agreed :) | 23:28 |
ianw | https://gerrit-review.googlesource.com/Documentation/config-submit-requirements.html#inheritance | 23:28 |
ianw | "administrators can redefine the requirement with the same name in the child project and set the applicableIf expression to is:false" | 23:29 |
clarkb | aha so ya I guess the name is used to know what to override | 23:29 |
clarkb | X is replaced by X' | 23:29 |
ianw | that was what made me think it looks at the name and basically overwrites | 23:29 |
clarkb | but you could have two different submit requirements that look at code-review submittableif conditions and only override one or the other | 23:30 |
ianw | i'm not sure it would match that. i feel like it would treat each s-r totally separately? | 23:32 |
clarkb | ya I think so | 23:32 |
clarkb | its just confusing because labels and submit requirements don't have a 1:1 relationship but overrides do | 23:33 |
ianw | what i can do is send a doc update change that explains it the way i think it works. which can either be accepted or rejected, assuming anyone wants to review it | 23:33 |
clarkb | and the docs don't really go into a ton of depth here :/ | 23:33 |
ianw | I think i'm in agreement that the best way to avoid problems overriding code-review is to just not play the game. i'll double check what's doing that, and i think we can probably propose to change those ACL's as a first step | 23:34 |
clarkb | ++ | 23:35 |
clarkb | ianw: we control acls through centralized code review which makes overrides pretty safe. But ya I agree | 23:40 |
fungi | also we have a linter written in a turing-complete language which can essentially enforce whatever policies we want to enact around that | 23:44 |
ianw | this is true, but i wondered how far to go with the normalization script with this | 23:45 |
ianw | i mean i could make it convert what we have to submit-requirements, but it seems like overkill | 23:45 |
clarkb | if its really only a small number of situations then I think simplifying is a good thing | 23:45 |
clarkb | its not the end of the world to leave a +2 and a +1 | 23:46 |
clarkb | fungi: re inmotion I'm still ssh'd in there. should I disable the compute service on the node running the mirror, then we can start it again? | 23:46 |
ianw | there's actually 4 that do it | 23:48 |
ianw | opendev/infra-specs.config openinfra/transparency-policy openstack/governance.config starlingx/governance.config | 23:48 |
fungi | clarkb: oh, if you want to please go for it. i hadn't freed back up sufficiently to revisit it yet and my evening is encroaching | 23:49 |
fungi | i can probably get to it in my morning tomorrow otherwise | 23:49 |
clarkb | I'm hoping to get it done today so that the base job runs successfully | 23:49 |
clarkb | let me give it a go | 23:49 |
fungi | oh, i didn't think about it blocking the base job, we can also add the mirror there to the emergency disable list temporarily so it will be skipped | 23:50 |
clarkb | ya but this should go quickly once I identify the compute node hosting the mirror | 23:51 |
fungi | .130 (parakeet) was the one with all the oom events | 23:51 |
ianw | https://review.opendev.org/c/openstack/project-config/+/185785 has the thinking behind it | 23:52 |
ianw | given the context there, it being a thought-out approach to comments on TC issues, i'm not sure i'd want to argue for not overriding Code-Review | 23:55 |
clarkb | is there no good way to go from a hostId to the compute list? | 23:56 |
clarkb | this seems like it should be trivial and yet | 23:56 |
fungi | hostid is hashed by the tenant/project for privacy reasons | 23:57 |
fungi | so the same host has a different hostid depending on which project the user querying it belongs to | 23:58 |
clarkb | how are you supposed to use it then? | 23:59 |
fungi | i remember this coming up when we originally asked the nova devs to expose a host identifier to normal users | 23:59 |
fungi | i have to believe there's an admin function to convert it or look it up | 23:59 |
fungi | but i've never been on that end of the situation | 23:59 |
clarkb | I can't show the instance as admin because the project stuff is all wrong too it seems like | 23:59 |