opendevreview | Merged opendev/system-config master: Fix rax reverse DNS setup in launch https://review.opendev.org/c/opendev/system-config/+/879388 | 00:22 |
---|---|---|
*** Trevor is now known as Guest10786 | 00:46 | |
opendevreview | Ian Wienand proposed opendev/system-config master: dns: move tsig_key into common group variable https://review.opendev.org/c/opendev/system-config/+/880252 | 01:23 |
opendevreview | Ian Wienand proposed opendev/system-config master: dns: move tsig_key into common group variable https://review.opendev.org/c/opendev/system-config/+/880252 | 01:38 |
opendevreview | Merged opendev/system-config master: Refactor adns variables https://review.opendev.org/c/opendev/system-config/+/876936 | 02:31 |
ianw | openstack --os-cloud=openstackci-rax server list | 03:38 |
ianw | Version 2 is deprecated, use alternative version 3 instead. | 03:38 |
ianw | great | 03:38 |
opendevreview | Ian Wienand proposed opendev/system-config master: launch: use apt to update packages https://review.opendev.org/c/opendev/system-config/+/880262 | 04:19 |
ianw | looks like the rdns setup in the launch node worked | 04:20 |
ianw | but *still* i don't see the ssh keys printed out for adding | 04:20 |
ianw | oh i see it. we're not passing the full domain | 04:48 |
opendevreview | Ian Wienand proposed opendev/system-config master: launch: refactor https://review.opendev.org/c/opendev/system-config/+/880264 | 05:28 |
frickler | ianw: can you consider a new dib release with the latest openeuler fix? | 05:28 |
frickler | also reviews on https://review.opendev.org/c/openstack/project-config/+/879196 would be nice to enable full debian testing in kolla | 05:29 |
ianw | frickler: sure | 05:36 |
opendevreview | Merged openstack/project-config master: Remove gerritbot from #openstack-charms https://review.opendev.org/c/openstack/project-config/+/879976 | 05:37 |
opendevreview | Merged opendev/zone-zuul-ci.org master: Set default ttl to one hour https://review.opendev.org/c/opendev/zone-zuul-ci.org/+/880213 | 05:45 |
opendevreview | Merged openstack/project-config master: Add nested-virt-debian-bullseye label to nodepool https://review.opendev.org/c/openstack/project-config/+/879196 | 05:49 |
ianw | frickler: pushed 3.29.0 | 05:55 |
frickler | ianw: thx | 06:00 |
opendevreview | Ian Wienand proposed opendev/system-config master: launch: refactor to work https://review.opendev.org/c/opendev/system-config/+/880264 | 07:17 |
opendevreview | Ian Wienand proposed opendev/system-config master: launch: refactor to work https://review.opendev.org/c/opendev/system-config/+/880264 | 07:23 |
*** amoralej|off is now known as amoralej | 07:25 | |
ianw | clarkb: ^ i finally got sick of this never working. it mostly comes down to us not waiting for the host to come back up so we can scan its keys. but there was a fair bit of room for cleaning up. i think that makes things better, and the output is nicer | 07:25
ianw | https://paste.opendev.org/show/b1MjiTvYr4E03GTeP56w/ is a sample | 07:25 |
opendevreview | Merged openstack/project-config master: Retire patrole project: end project gating https://review.opendev.org/c/openstack/project-config/+/880013 | 07:34 |
opendevreview | Merged openstack/diskimage-builder master: Update satellite_repo labels + add env var https://review.opendev.org/c/openstack/diskimage-builder/+/879137 | 08:32 |
opendevreview | Martin Kopec proposed opendev/irc-meetings master: Update Interop meeting details https://review.opendev.org/c/opendev/irc-meetings/+/880302 | 13:05 |
genekuo | clarkb: I will take the python container image task, already marked my name on it. | 13:58 |
Clark[m] | genekuo: sounds good. Let us know if you have any questions. Thank you! | 14:02 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update gitea to 1.19.1 https://review.opendev.org/c/opendev/system-config/+/877541 | 14:58 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node https://review.opendev.org/c/opendev/system-config/+/848181 | 14:58 |
clarkb | I've removed my old 1.19 hold and created a new one for this newer patchset | 14:59 |
chandankumar | Hello infra | 15:06 |
chandankumar | https://zuul.opendev.org/t/openstack/status jobs on multiple reviews are queued for more than 23 hours | 15:07 |
chandankumar | can someone take a look? what is going on there | 15:07 |
chandankumar | thank you :-) | 15:08 |
clarkb | corvus: ^ possible fallout from nodepool's zk cache changes? | 15:09 |
clarkb | it isn't specific to a single node type which rules out a problem with an image. That's an easy thing to check just from the zuul status page | 15:10
clarkb | 300-0020957863 and 300-0020957869 are the node requests that appear to belong to 877242 builds that are queued | 15:12 |
clarkb | oh that info is exposed in the web ui now neat | 15:12 |
opendevreview | Merged zuul/zuul-jobs master: Update promote-container-image to copy from intermediate registry https://review.opendev.org/c/zuul/zuul-jobs/+/878538 | 15:13 |
clarkb | nl04 appears to have processed 300-0020957869 trying to make sense of the logs now | 15:14 |
clarkb | yes I think this is a nodepool bug. working on a paste of relevant logs now | 15:16
clarkb | corvus: https://paste.opendev.org/show/bBEKYQz0E2v7gZvYtvg6/ I think the cache implementation created the 0000 lock event but then saw that path wasn't present so it created a deleted event for that path. But the 0001 lock is/was held and we essentially deleted an unlocked unready node out from under the node creation process | 15:18
clarkb | corvus: that must just be excellent timing between the deleted node worker and our view of locks? | 15:19
clarkb | In theory we can get things moving again by manually removing the lock? | 15:19 |
clarkb | but I'm not sure what sort of data consistency problems that might expose particularly for leaking nodes in zk for nodes that don't exist in clouds (I think those may be cleaned up though) | 15:20 |
clarkb | oh! this occurs during a nodepool launcher restart | 15:22 |
clarkb | I think the issue is that our consistency view of cache event ordering isn't atomic enough at startup for tasks like node deletions. Do we need to hold off on processing until the cache is coherent? I suspect that is the fix | 15:23 |
clarkb | chandankumar: ^ tldr this is very likely a bug in nodepool itself due to a change in how caching was done (which should reduce zk load and be quicker overall but likely exposed us to this situation) | 15:23
clarkb | I hesitate to blindly restart nodepool though since restarts seem to have triggered this problem | 15:24 |
chandankumar | arxcruz: ^^ | 15:25 |
chandankumar | clarkb: thank you :-) | 15:25 |
clarkb | infra-root if I hear no objections in the next little bit (I'm catching up on morning stuff right now) I intend to delete static01.opendev.org and merge https://review.opendev.org/c/opendev/zone-opendev.org/+/879781 to clean up its DNS records. static02 has been serving all the content for almost a week at this point and I think it is happy | 15:33
clarkb | corvus: looking at that node in question it is currently in a locked deleting state | 15:36 |
clarkb | it seems to be stuck there which is why another provider isn't attempting to take it over. I'm not sure why it hasn't been deleted though if it is locked. Maybe the lock holder is not the deleter and they have mutually locked each other out of progressing through their respective state machines? | 15:36 |
clarkb | I need to step away for a bit but I suspect the next steps are: one, looking at preventing the locking issues when restarting in the first place, and two, inspecting the zk nodes to determine how to get these particular requests moving forward again (possibly by removing locks and issuing manual delete requests?) | 15:37
corvus | clarkb: i'm around; i can look into that | 15:42 |
corvus | clarkb: i have an alternate read of those log lines: lock 0 was the old building lock and disappeared due to the restart. on reconnect, the cache generated a NONE event to cause it to refresh the node; we don't know the result of that (zk either said it exists or not; i don't think it's important). then we get a real DELETED event from zk (that probably means zk thought it did exist, and suggests that the restart sequence may have involved an | 15:57 |
corvus | unclean shutdown so zk held the ephemeral node of lock 0 for the full 30 second timeout; by that time, the launcher had restarted and established a new watch. so when the timeout happens and zk deletes the lock 0 ephemeral node, we get a real deleted event). now we have an unlocked building node. the launcher notices that and locks it, creating lock 1. it then marks it for deletion, and starts the delete state machine. | 15:57 |
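A minimal sketch of the ZooKeeper behaviour described above, assuming the kazoo client library and an illustrative lock path (not nodepool's real znode layout): an ephemeral lock node created by the old launcher's session survives an unclean shutdown until the session timeout expires, at which point a watch set by the restarted launcher sees a real DELETED event.

```python
from kazoo.client import KazooClient

# Session timeout governs how long ZooKeeper keeps ephemeral nodes alive
# after an unclean disconnect (the ~30 second window mentioned above).
zk = KazooClient(hosts='zk01.opendev.org:2181', timeout=30)
zk.start()

# Illustrative path only; the real nodepool lock contender paths differ.
LOCK_PATH = '/nodepool/nodes/0033718669/lock/contender-0000000000'

# In the real scenario this node was created by the *old* launcher process;
# here one client plays both roles for the sake of the sketch.
zk.create(LOCK_PATH, b'', ephemeral=True, makepath=True)

# The restarted launcher watches the node. Nothing happens until ZooKeeper
# expires the dead session and removes the ephemeral node, at which point
# the watch fires a genuine DELETED event and the building node looks unlocked.
@zk.DataWatch(LOCK_PATH)
def on_change(data, stat, event):
    if event is not None and event.type == 'DELETED':
        print('lock contender gone; building node now appears unlocked')
```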
corvus | clarkb: assuming all that, then i think the next question is why is that node stuck deleting? i didn't see any delete server api errors in the log. | 15:59 |
Clark[m] | Oh that makes sense since we do remove building nodes on restart. | 15:59 |
Clark[m] | It also doesn't seem to be logging it is reattempting any delete actions which I would expect if there were API errors | 16:00
corvus | agreed | 16:01 |
corvus | i'm going to see if i can figure out the right way to get an openstack cli prompt today :) | 16:01 |
corvus | apparently --os-cloud doesn't work even though the error message i get when i try to use it says it should | 16:02 |
corvus | env vars work | 16:03 |
corvus | OS_CLOUD=openstackjenkins-ovh OS_REGION_NAME=GRA1 openstack | 16:03 |
corvus | it looks like 0033718669 does not exist | 16:04 |
corvus | i mean the openstack server for that node | 16:05 |
corvus | assuming OS_CLOUD=openstackjenkins-ovh OS_REGION_NAME=GRA1 openstack server list|grep 0033718669 is the right place to be looking | 16:05 |
Clark[m] | Yes I think that is the correct location | 16:05 |
corvus | then i think we may want to look for an error in the server cache related to deletion... | 16:06 |
corvus | (though obviously it's not completely broken...) | 16:06 |
Clark[m] | Do the new np prefixed node names have the full ID number as the suffix? Might also search by uuid or double check the output of that listing without grep | 16:08 |
corvus | yeah full number | 16:08 |
corvus | example: np0033717123 | 16:08 |
corvus | i just double checked the uuid, no dice | 16:10 |
Clark[m] | Ok very likely it was actually deleted then | 16:10 |
corvus | i wonder if the lazyexecutorttlcache for servers on gra1 is stuck | 16:11
corvus | yeah i don't think it's turning over any servers | 16:12 |
corvus | looks like it stopped doing anything at 15:56 on april 12 | 16:13 |
corvus | so shortly after this restart we're looking at... or possibly even earlier | 16:14 |
corvus | i'm going to get a thread dump | 16:15 |
corvus | i don't see any stuck api calls. i do see 10 executor workers all waiting for work. also 2 extra ones, possibly from a previous unclean executor shutdown? i don't know what's going on there, but i doubt they are harmful. that makes me suspect a bug related to the cache invalidation in https://opendev.org/zuul/nodepool/src/branch/master/nodepool/driver/utils.py#L498 | 16:25
corvus | yeah, there was an internal stop/start cycle around 06:53:43 on april 13, that's probably why we have extra executor threads. i think that means we have 2 adapters running. that should be okay -- the old one should continue to run until all its state machines are done. | 16:38 |
corvus | whatever the problem is, it started before that internal reload anyway | 16:38 |
corvus | and yeah, i see the stop thread running and waiting, so that confirms that. | 16:39 |
corvus | there are 3 runDeleteStateMachine threads as expected (1 for the new gra1 and bhs1 adapters, and 1 for the old gra1 adapter) | 16:41 |
corvus | now we're getting somewhere: | 16:42
corvus | File "/usr/local/lib/python3.11/site-packages/nodepool/zk/zookeeper.py", line 2337, in unlockNode | 16:42 |
corvus | with node._thread_lock: | 16:42 |
corvus | i think there's a deadlock between the cleanupworker and the deletedstatemachinerunner | 16:50 |
corvus | one of them holds the python local node._thread_lock while trying to unlock the zk node lock, and the other holds the zk node lock while trying to acquire the python local node._thread_lock | 16:51 |
clarkb | that local node._thread_lock was recently added right? | 16:52 |
corvus | yeah a few weeks ago | 16:53 |
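For reference, a minimal sketch of the lock inversion described above; the worker names and lock objects are simplified stand-ins for illustration, not nodepool's actual code:

```python
import threading

thread_lock = threading.Lock()  # stands in for the local node._thread_lock
zk_lock = threading.Lock()      # stands in for the ZooKeeper node lock

def cleanup_worker():
    # Path A: holds the local thread lock, then needs the ZK lock
    # (modelled here as a second acquire) to finish unlocking the node.
    with thread_lock:
        with zk_lock:
            pass

def delete_state_machine_runner():
    # Path B: holds the ZK lock, then tries to take the local thread lock.
    with zk_lock:
        with thread_lock:
            pass

# If the threads interleave so that A holds thread_lock while B holds
# zk_lock, each blocks forever on the lock the other owns: a classic
# inverted lock-acquisition-order deadlock.
threading.Thread(target=cleanup_worker).start()
threading.Thread(target=delete_state_machine_runner).start()
```

The usual remedy is to make both code paths acquire the two locks in the same order, or to drop the outer lock before touching the other one.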
clarkb | re static01 I've grown a sudden paranoia that we might host things there that are either not in ansible or afs. However looking at all of the apache config files and /var/www I think we're ok to delete that server as it basically serves as a frontend to fileservers and doesn't store any data itself (other than a few index.html files that ansible writes out). Debating if I should have | 16:55 |
clarkb | ianw ACK that before deletion since ianw did a fair bit of the setup on that host | 16:55 |
corvus | operationally, i think we are okay to just fix this by restarting the launcher. if it's not urgent, maybe let's wait a few more minutes in case i have any more questions | 16:56 |
clarkb | corvus: it's been a day that changes have been stuck. I think we can wait a few more minutes | 16:56
*** amoralej is now known as amoralej|off | 17:18 | |
clarkb | ianw: the launch node change lgtm but I did have one small nit/cleanup | 17:26 |
clarkb | ianw: and left a concern on https://review.opendev.org/c/openstack/project-config/+/880115 | 17:38 |
clarkb | https://158.69.65.113:3081/opendev/system-config looks like a functional gitea 1.19.1 installation | 17:46 |
opendevreview | Merged zuul/zuul-jobs master: containers : update test variable https://review.opendev.org/c/zuul/zuul-jobs/+/878175 | 18:02 |
opendevreview | Merged zuul/zuul-jobs master: container role docs : clarify requirements https://review.opendev.org/c/zuul/zuul-jobs/+/878176 | 18:05 |
clarkb | I am testing db transplant from etherpad01 to etherpad02 now. This should give us rough timing for how long the process will take too | 18:06
clarkb | Currently running an online db dump of 01 which can be restored in 02. This isn't super fast so the downtime will be longer than I had previously anticipated | 18:07 |
clarkb | When we do the actual prod move we'll do it with services stopped on both servers and dns can update while we do the data migration | 18:07
corvus | clarkb: i think we're good to restart | 18:29 |
corvus | i still have a term open there, i can do it | 18:29 |
clarkb | thanks | 18:33 |
clarkb | I'll review the change you pushed just as soon as I finish this etherpad02 data migration test | 18:33 |
clarkb | dealing with largish databases is not fast | 18:44 |
clarkb | if anyone is wondering | 18:44 |
*** dhill is now known as Guest10899 | 18:47 | |
fungi | catching up mid-vacation, a quick skim says the new contributor call was productive (lots of notes in the pad at least?) and nothing that needs urgent attention... yeah? | 18:57 |
fungi | the nodepool thing today seems mostly sorted out? | 18:58 |
clarkb | fungi: nothing urgent. I was going to delete static01 then got paranoid we might host actual data there but I checked and can't find evidence of that | 18:58 |
clarkb | Am hoping another root can double check those assumptions though. I'm happy to wait for ianw on that. I'm also testing the data migration from etherpad01 to 02 in preparation for that move (probably next week at this point) | 18:58
clarkb | looks like it will be at least an hour downtime | 18:58 |
clarkb | enjoy your vacation nothing demands attention right now | 18:59 |
fungi | cool. i have a couple hours before i need to go meet friends for margaritas, so just catching up on e-mail and irc backlog | 18:59
fungi | i don't recall us having any locally-served content on static01; it replaced the old static.o.o (which did have local content) with a migration to the afs-only hosting model for files.o.o, which it also replaced | 19:01
clarkb | and ya for nodepool there is a change up I need to review. I've pulled it up but trying to get etherpad done in one go so I don't make a silly distracted mistake | 19:01 |
clarkb | fungi: ya there are a couple of index.html files that ansible writes out (so is also on static02) and other than that I can't find local content | 19:01 |
clarkb | everything else seems to be redirects or afs | 19:01 |
fungi | etherpad migration is going to be slow because of moving from trove to local sql db? putting it on a cinder volume or rootfs? | 19:02 |
clarkb | fungi: it is local sql db to local sql db. On different cinder volume on each side. We could theoretically move the cinder volume instead, but I ruled against that because the current cinder volume is a bit small so the new host got a new larger volume instead | 19:02
clarkb | the time is in dumping, copying, restoring the 30gb of sql database | 19:03 |
fungi | sounds good to me. announcing an outage for pad migration should be fine | 19:03 |
clarkb | I'm hoping I'll have the db running under 02 for functionality testing of the process in about 10 minutes | 19:05 |
clarkb | this is effectively a fork though and shouldn't be relied on. | 19:05 |
clarkb | I have used https://etherpad.opendev.org/p/opendev-contributor-bootstrap-202304 for testing. If you set /etc/hosts to point etherpad.opendev.org to etherpad02's IP address then you'll see newer edits that aren't there in prod | 19:18 |
clarkb | as far as I can tell this is working so now just need to pick a time and announce it. I think it will take at least 90 minutes to do the data migration | 19:19 |
clarkb | it's like 30 minutes to dump the db, a little longer to restore it with some time in the middle to copy and validate data as you go | 19:19
clarkb | I will need to convert my notes into a document that can be shared for the process too (not an etherpad though I'll probably use paste) | 19:22 |
opendevreview | James E. Blair proposed opendev/base-jobs master: Set default job vars for container image promote https://review.opendev.org/c/opendev/base-jobs/+/880362 | 19:58 |
clarkb | infra-root https://paste.opendev.org/show/brRuhPssVLSi4UnF5hcN/ there is the draft for production migration of etherpad data based on my notes testing this | 20:05 |
*** Trevor is now known as Guest10911 | 20:15 | |
ianw | clarkb: afaik there's nothing on static that's either not pushed from git or on afs | 20:59 |
ianw | i've certainly always considered it as liable to disappear at any moment :) | 20:59 |
ianw | don't think it's in the backup roster either, so lgtm for removal | 20:59 |
opendevreview | Ian Wienand proposed opendev/system-config master: launch: refactor to work https://review.opendev.org/c/opendev/system-config/+/880264 | 21:05 |
ianw | clarkb: ^ thanks, that one should split the dns better | 21:08 |
opendevreview | Merged openstack/diskimage-builder master: fix ifupdown pkg map for dhcp-all-interfaces of redhat family https://review.opendev.org/c/openstack/diskimage-builder/+/879537 | 21:46 |
opendevreview | Merged opendev/base-jobs master: Set default job vars for container image promote https://review.opendev.org/c/opendev/base-jobs/+/880362 | 21:49 |
opendevreview | Merged opendev/system-config master: dns: move tsig_key into common group variable https://review.opendev.org/c/opendev/system-config/+/880252 | 22:13 |
clarkb | ianw: thank you for confirming I'll delete static01 momentarily | 22:58 |
clarkb | corvus: fyi you have a shell still on static01. I think this is from earlier this week when we debugged why it was apparently still serving zuul content | 23:00
clarkb | also double checking it has no volumes attached (another indication it might be serving data outside of afs) that would need to be cleaned up | 23:03
corvus | clarkb: shell closed | 23:04 |
clarkb | corvus: thanks just in time for me to delete it | 23:04 |
clarkb | #status log Deleted static01.opendev.org (ae2fe734-cf8f-4ead-91bf-5e4e627c8d2c) as it has been replaced by static02.opendev.org | 23:06 |
opendevstatus | clarkb: finished logging | 23:06 |
clarkb | https://review.opendev.org/c/opendev/zone-opendev.org/+/879781 has been approved as well to reflect that in dns | 23:07 |
opendevreview | Merged opendev/zone-opendev.org master: Remove old static01 records https://review.opendev.org/c/opendev/zone-opendev.org/+/879781 | 23:09 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Update etherpad.o.o to point at etherpad02 https://review.opendev.org/c/opendev/zone-opendev.org/+/880168 | 23:11 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Cleanup etherpad DNS records https://review.opendev.org/c/opendev/zone-opendev.org/+/880169 | 23:11 |
clarkb | those are just rebases needed due to the previous change merging and updating the serial. I'll WIP them again | 23:11 |
clarkb | ianw: it's probably a bit late to try the gitea upgrade today, but if you can review that change https://review.opendev.org/c/opendev/system-config/+/877541 (held node at https://158.69.65.113:3081/opendev/system-config) maybe we can upgrade gitea early next week | 23:16
clarkb | and then I'll probably look to swap etherpad servers ~tuesday or wednesday next week if that works for others. Then I can send email about the outage a couple days in advance | 23:16 |
ianw | ++ will do | 23:21 |
clarkb | side note the server show output is super verbose now. Not sure I'm a fan | 23:32 |
ianw | i still see things like "Munch('...')" which i'm pretty sure aren't really supposed to be user-visible object details | 23:35
opendevreview | Merged opendev/system-config master: launch: refactor to work https://review.opendev.org/c/opendev/system-config/+/880264 | 23:51 |
opendevreview | Merged opendev/system-config master: launch: use apt to update packages https://review.opendev.org/c/opendev/system-config/+/880262 | 23:56 |