Thursday, 2023-04-13

opendevreviewMerged opendev/system-config master: Fix rax reverse DNS setup in launch
*** Trevor is now known as Guest1078600:46
opendevreviewIan Wienand proposed opendev/system-config master: dns: move tsig_key into common group variable
opendevreviewIan Wienand proposed opendev/system-config master: dns: move tsig_key into common group variable
opendevreviewMerged opendev/system-config master: Refactor adns variables
ianwopenstack --os-cloud=openstackci-rax server list03:38
ianwVersion 2 is deprecated, use alternative version 3 instead.03:38
opendevreviewIan Wienand proposed opendev/system-config master: launch: use apt to update packages
ianwlooks like the rdns setup in the launch node worked04:20
ianwbut *still* i don't see the ssh keys printed out for adding04:20
ianwoh i see it.  we're not passing the full domain04:48
opendevreviewIan Wienand proposed opendev/system-config master: launch: refactor
fricklerianw: can you consider a new dib release with the latest openeuler fix?05:28
frickleralso reviews on would be nice to enable full debian testing in kolla05:29
ianwfrickler: sure05:36
opendevreviewMerged openstack/project-config master: Remove gerritbot from #openstack-charms
opendevreviewMerged opendev/ master: Set default ttl to one hour
opendevreviewMerged openstack/project-config master: Add nested-virt-debian-bullseye label to nodepool
ianwfrickler: pushed 3.29.005:55
fricklerianw: thx06:00
opendevreviewIan Wienand proposed opendev/system-config master: launch: refactor to work
opendevreviewIan Wienand proposed opendev/system-config master: launch: refactor to work
*** amoralej|off is now known as amoralej07:25
ianwclarkb: ^ i finally got sick of this never working.  it mostly comes down to us not waiting for the host to come back up so we can scan its keys.  but there was a fair bit of room for cleaning up.  i think that makes things better, and the output is nicer07:25
ianw is a sample07:25
opendevreviewMerged openstack/project-config master: Retire patrole project: end project gating
opendevreviewMerged openstack/diskimage-builder master: Update satellite_repo labels + add env var
opendevreviewMartin Kopec proposed opendev/irc-meetings master: Update Interop meeting details
genekuoclarkb: I will take the python container image task, already marked my name on it.13:58
Clark[m]genekuo: sounds good. Let us know if you have any questions. Thank you!14:02
opendevreviewClark Boylan proposed opendev/system-config master: Update gitea to 1.19.1
opendevreviewClark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node
clarkbI've removed my old 1.19 hold and created a new one for this newer patchset14:59
chandankumarHello infra15:06
chandankumar jobs on multiple reviews are queued for more than 23 hours15:07
chandankumarcan someone take a look? what is going on there15:07
chandankumarthank you :-)15:08
clarkbcorvus: ^ possible fallout from nodepool's zk cache changes?15:09
clarkbit isn't specific to a single node type which rules out a problem with an image. That's an easy thing to check just from the zuul status page15:10
clarkb300-0020957863 and 300-0020957869 are the node requests that appear to belong to 877242 builds that are queued15:12
clarkboh that info is exposed in the web ui now neat15:12
opendevreviewMerged zuul/zuul-jobs master: Update promote-container-image to copy from intermediate registry
clarkbnl04 appears to have processed 300-0020957869 trying to make sense of the logs now15:14
clarkbyes I think this is a nodepool bug working on a paste of relevant logs now15:16
clarkbcorvus: I think the cache implementation created the 0000 lock event but then saw that path wasn't present so it created a deleted event for that path. But the 0001 lock is/was held and we essentially deleted an unlocked unready node out from under the node creation process15:18
clarkbcorvus: that must just be excellent timing between the deleted node worker and our view of the locks?15:19
clarkbIn theory we can get things moving again by manually removing the lock?15:19
clarkbbut I'm not sure what sort of data consistency problems that might expose particularly for leaking nodes in zk for nodes that don't exist in clouds (I think those may be cleaned up though)15:20
clarkboh! this occurs during a nodepool launcher restart15:22
clarkbI think the issue is that our view of cache event ordering isn't consistent enough at startup for tasks like node deletions. Do we need to hold off on processing until the cache is coherent? I suspect that is the fix15:23
clarkbchandankumar: ^ tldr this is very likely a bug in nodepool itself due to a change in how caching was done (which should reduce zk load and be quicker overall but likely exposed us to this situation)15:23
clarkbI hesitate to blindly restart nodepool though since restarts seem to have triggered this problem15:24
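The "hold off on processing until the cache is coherent" idea clarkb floats above can be sketched in plain Python: events that arrive before the initial cache sync are buffered and only replayed once the cache signals it has caught up, so a stale DELETED event can't race ahead of the node's real state. The class and method names here are illustrative assumptions, not nodepool's actual API, and this single-threaded sketch ignores the locking a real launcher would need.

```python
import queue
import threading

class EventProcessor:
    """Buffer cache events until the initial sync completes."""

    def __init__(self):
        self._synced = threading.Event()
        self._pending = queue.Queue()
        self.handled = []

    def on_event(self, event):
        if not self._synced.is_set():
            self._pending.put(event)      # pre-sync: buffer, don't act
        else:
            self.handled.append(event)    # post-sync: process normally

    def cache_synced(self):
        # Replay buffered events only once the cache is coherent.
        while not self._pending.empty():
            self.handled.append(self._pending.get())
        self._synced.set()

proc = EventProcessor()
proc.on_event("DELETED /nodes/0033718669")  # arrives during startup: buffered
proc.cache_synced()
proc.on_event("CHANGED /nodes/0033718670")  # after sync: handled directly
print(proc.handled)
```

Both events end up handled, but the pre-sync DELETED is not acted on until the cache has a complete picture, which is the property the startup race above lacked.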
chandankumararxcruz: ^^15:25
chandankumarclarkb: thank you :-)15:25
clarkbinfra-root if I hear no objections in the next little bit (I'm catching up on morning stuff right now) I intend on deleting and merging to clean up its DNS records. static02 should be serving all the content at this point for almost a week I think it is happy15:33
clarkbcorvus: looking at that node in question it is currently in a locked deleting state15:36
clarkbit seems to be stuck there which is why another provider isn't attempting to take it over. I'm not sure why it hasn't been deleted though if it is locked. Maybe the lock holder is not the deleter and they have mutually locked each other out of progressing through their respective state machines?15:36
clarkbI need to step away for a bit but I suspect the next steps are one looking at preventing the locking issues when restarting in the first place and two inspecting the zk nodes to determine how to get these particular requests moving forward again (possibly by removing locks and issuing manual delete requests?)15:37
corvusclarkb: i'm around; i can look into that15:42
corvusclarkb: i have an alternate read of those log lines: lock 0 was the old building lock and disappeared due to the restart.  on reconnect, the cache generated a NONE event to cause it to refresh the node; we don't know the result of that (zk either said it exists or not; i don't think it's important).  then we get a real DELETED event from zk (that probably means zk thought it did exist, and suggests that the restart sequence may have involved an15:57
corvusunclean shutdown so zk held the ephemeral node of lock 0 for the full 30 second timeout; by that time, the launcher had restarted and established a new watch.  so when the timeout happens and zk deletes the lock 0 ephemeral node, we get a real deleted event).  now we have an unlocked building node.  the launcher notices that and locks it, creating lock 1.  it then marks it for deletion, and starts the delete state machine.15:57
corvusclarkb: assuming all that, then i think the next question is why is that node stuck deleting?  i didn't see any delete server api errors in the log.15:59
Clark[m]Oh that makes sense since we do remove building nodes on restart.15:59
Clark[m]It also doesn't seem to be logging that it is reattempting any delete actions, which I would expect if there were API errors16:00
corvusi'm going to see if i can figure out the right way to get an openstack cli prompt today :)16:01
corvusapparently --os-cloud doesn't work even though the error message i get when i try to use it says it should16:02
corvusenv vars work16:03
corvusOS_CLOUD=openstackjenkins-ovh OS_REGION_NAME=GRA1 openstack16:03
corvusit looks like 0033718669 does not exist16:04
corvusi mean the openstack server for that node16:05
corvusassuming OS_CLOUD=openstackjenkins-ovh OS_REGION_NAME=GRA1 openstack server list|grep 0033718669 is the right place to be looking16:05
Clark[m]Yes I think that is the correct location16:05
corvusthen i think we may want to look for an error in the server cache related to deletion...16:06
corvus(though obviously it's not completely broken...)16:06
Clark[m]Do the new np prefixed node names have the full ID number as the suffix? Might also search by uuid or double check the output of that listing without grep16:08
corvusyeah full number16:08
corvusexample: np003371712316:08
corvusi just double checked the uuid, no dice16:10
Clark[m]Ok very likely it was actually deleted then16:10
corvusi wonder if the lazyexecutorttlcache for servers on gra1 is stuck16:11
corvusyeah i don't think it's turning over any servers16:12
corvuslooks like it stopped doing anything at 15:56 on april 1216:13
corvusso shortly after this restart we're looking at... or possibly even earlier16:14
corvusi'm going to get a thread dump16:15
corvusi don't see any stuck api calls.  i do see 10 executor workers all waiting for work.  also 2 extra ones, possibly from a previous unclean executor shutdown?  i don't know what's going on there, but i doubt they are harmful.  that makes me suspect a bug related to the cache invalidation in
corvusyeah, there was an internal stop/start cycle around 06:53:43 on april 13, that's probably why we have extra executor threads.  i think that means we have 2 adapters running.  that should be okay -- the old one should continue to run until all its state machines are done.16:38
corvuswhatever the problem is, it started before that internal reload anyway16:38
corvusand yeah, i see the stop thread running and waiting, so that confirms that.16:39
corvusthere are 3 runDeleteStateMachine threads as expected (1 for the new gra1 and bhs1 adapters, and 1 for the old gra1 adapter)16:41
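The per-thread stack dump corvus reads here can be approximated with the stdlib `faulthandler` module. Nodepool's real mechanism is wired to a signal handler, so this is only a hedged sketch of the technique, but the output is the same kind of listing used above to spot the extra executor threads and the stuck runDeleteStateMachine threads.

```python
import faulthandler
import tempfile
import threading

def dump_all_threads() -> str:
    """Return a traceback listing for every live thread in this process."""
    with tempfile.TemporaryFile("w+") as tmp:
        # faulthandler writes directly to the file descriptor.
        faulthandler.dump_traceback(file=tmp, all_threads=True)
        return

# Start a parked worker thread so the dump shows more than one stack.
stop = threading.Event()
worker = threading.Thread(target=stop.wait, name="worker")
worker.start()

dump = dump_all_threads()
print(dump.splitlines()[0])  # one stanza per live thread follows

stop.set()
worker.join()
```

Each stanza names a thread and lists its frames most-recent-first, which is how a blocked `with node._thread_lock:` frame like the one quoted below the traceback shows up.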
corvusnow we're getting somewhere:16:42
corvus  File "/usr/local/lib/python3.11/site-packages/nodepool/zk/", line 2337, in unlockNode16:42
corvus    with node._thread_lock:16:42
corvusi think there's a deadlock between the cleanupworker and the deletedstatemachinerunner16:50
corvusone of them holds the python local node._thread_lock while trying to unlock the zk node lock, and the other holds the zk node lock while trying to acquire the python local node._thread_lock16:51
clarkbthat local node._thread_lock was recently added right?16:52
corvusyeah a few weeks ago16:53
clarkbre static01 I've grown a sudden paranoia that we might host things there that are not in either ansible or afs. However looking at all of the apache config files and /var/www I think we're ok to delete that server as it basically serves as a frontend to fileservers and doesn't store any data itself (other than a few index.html files that ansible writes out). Debating if I should have16:55
clarkbianw ACK that before deletion since ianw did a fair bit of the setup on that host16:55
corvusoperationally, i think we are okay to just fix this by restarting the launcher.  if it's not urgent, maybe let's wait a few more minutes in case i have any more questions16:56
clarkbcorvus: its been a day that changes have been stuck. I think we can wait a few more minutes16:56
*** amoralej is now known as amoralej|off17:18
clarkbianw: the launch node change lgtm but I did have one small nit/cleanup17:26
clarkbianw: and left a concern on
clarkbhttps:// looks like a functional gitea 1.19.1 installation17:46
opendevreviewMerged zuul/zuul-jobs master: containers : update test variable
opendevreviewMerged zuul/zuul-jobs master: container role docs : clarify requirements
clarkbI am testing db transplant from etherpad01 to etherpad02 now. This should give us rough timing for how long the process will take too18:06
clarkbCurrently running an online db dump of 01 which can be restored in 02. This isn't super fast so the downtime will be longer than I had previously anticipated18:07
clarkbWhen we do the actual prod move we'll do it with services stopped on both servers and dns can update while we do the data migration18:07
corvusclarkb: i think we're good to restart18:29
corvusi still have a term open there, i can do it18:29
clarkbI'll review the change you pushed just as soon as I finish this etherpad02 data migration test18:33
clarkbdealing with largish databases is not fast18:44
clarkbif anyone is wondering18:44
*** dhill is now known as Guest1089918:47
fungicatching up mid-vacation, a quick skim says the new contributor call was productive (lots of notes in the pad at least?) and nothing that needs urgent attention... yeah?18:57
fungithe nodepool thing today seems mostly sorted out?18:58
clarkbfungi: nothing urgent. I was going to delete static01 then got paranoid we might host actual data there but I checked and can't find evidence of that18:58
clarkbAm hoping another root can double check those assumptions though. I'm happy to wait for ianw on that. I'm als testing the data migration from etherpad01 to 02 in preparation for that move (probably next week at this point)18:58
clarkblooks like it will be at least an hour downtime18:58
clarkbenjoy your vacation nothing demands attention right now18:59
fungicool. i have a couple hours before i need to go meet friends for margaritas, so just catching up on e-mail and irc backlog18:59
fungii don't recall us having any locally-served content on static01, it replaced the old static.o.o which did have local content with a migration to the afs-only hosting model for files.o.o which it also replaced19:01
clarkband ya for nodepool there is a change up I need to review. I've pulled it up but trying to get etherpad done in one go so I don't make a silly distracted mistake19:01
clarkbfungi: ya there are a couple of index.html files that ansible writes out (so is also on static02) and other than that I can't find local content19:01
clarkbeverything else seems to be redirects or afs19:01
fungietherpad migration is going to be slow because of moving from trove to local sql db? putting it on a cinder volume or rootfs?19:02
clarkbfungi: it is local sql db to local sql db. On different cinder volume on each side. We could theoretically move the cinder volume instead, but I ruled against that because the current cinder volume is a bit small so the new host got a new larger volume instead19:02
clarkbthe time is in dumping, copying, restoring the 30gb of sql database19:03
fungisounds good to me. announcing an outage for pad migration should be fine19:03
clarkbI'm hoping I'll have the db running under 02 for functionality testing of the process in about 10 minutes19:05
clarkbthis is effectively a fork though and shouldn't be relied on.19:05
clarkbI have used for testing. If you set /etc/hosts to point to etherpad02's IP address then you'll see newer edits that aren't there in prod19:18
clarkbas far as I can tell this is working so now just need to pick a time and announce it. I think it will take at least 90 minutes to do the data migration19:19
clarkbit's like 30 minutes to dump the db, a little longer to restore it, with some time in the middle to copy and validate data as you go19:19
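The dump / copy / restore sequence discussed above can be sketched as command builders. The database name, hostnames, and file path here are illustrative assumptions, not the actual values used on etherpad01/02; only the mysqldump flags are real.

```python
# Hedged sketch of the online-dump -> copy -> restore flow. All names are
# hypothetical stand-ins for the real etherpad hosts and database.

def dump_cmd(db="etherpad"):
    # --single-transaction takes a consistent InnoDB snapshot without
    # locking tables, which is what allows dumping a live pad database;
    # --quick streams rows instead of buffering whole tables in memory.
    return ["mysqldump", "--single-transaction", "--quick", db]

def copy_cmd(src="etherpad01", dst="etherpad02", path="/root/etherpad.sql"):
    return ["scp", f"{src}:{path}", f"{dst}:{path}"]

def restore_cmd(db="etherpad"):
    # The restore side reads the dump file on stdin: mysql etherpad < dump
    return ["mysql", db]

print(" ".join(dump_cmd()))
```

For a ~30 GB database each stage dominates the outage window, which matches the 90-minute estimate above; the actual prod move is done with services stopped on both ends so no edits land mid-copy.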
clarkbI will need to convert my notes into a document that can be shared for the process too (not an etherpad though I'll probably use paste)19:22
opendevreviewJames E. Blair proposed opendev/base-jobs master: Set default job vars for container image promote
clarkbinfra-root there is the draft for production migration of etherpad data based on my notes testing this20:05
*** Trevor is now known as Guest1091120:15
ianwclarkb: afaik there's nothing on static that's either not pushed from git or on afs20:59
ianwi've certainly always considered it as liable to disappear at any moment :)20:59
ianwdon't think it's in the backup roster either, so lgtm for removal20:59
opendevreviewIan Wienand proposed opendev/system-config master: launch: refactor to work
ianwclarkb: ^ thanks, that one should split the dns better21:08
opendevreviewMerged openstack/diskimage-builder master: fix ifupdown pkg map for dhcp-all-interfaces of redhat family
opendevreviewMerged opendev/base-jobs master: Set default job vars for container image promote
opendevreviewMerged opendev/system-config master: dns: move tsig_key into common group variable
clarkbianw: thank you for confirming I'll delete static01 momentarily22:58
clarkbcorvus: fyi you have a shell still on static01. I think this is from earlier this week when we debugged why it was apparently still serving zuul content23:00
clarkbalso to double check it has no volumes attached (another indication it might be serving data outside of afs) that will need to be cleaned up23:03
corvusclarkb: shell closed23:04
clarkbcorvus: thanks just in time for me to delete it23:04
clarkb#status log Deleted (ae2fe734-cf8f-4ead-91bf-5e4e627c8d2c) as it has been replaced by static02.opendev.org23:06
opendevstatusclarkb: finished logging23:06
clarkb has been approved as well to reflect that in dns23:07
opendevreviewMerged opendev/ master: Remove old static01 records
opendevreviewClark Boylan proposed opendev/ master: Update etherpad.o.o to point at etherpad02
opendevreviewClark Boylan proposed opendev/ master: Cleanup etherpad DNS records
clarkbthose are just rebases needed due to the previous change merging and updating the serial. I'll WIP them again23:11
clarkbianw: it's probably a bit late to try the gitea upgrade today, but if you can review that change (held node at maybe we can upgrade gitea early next week23:16
clarkband then I'll probably look to swap etherpad servers ~tuesday or wednesday next week if that works for others. Then I can send email about the outage a couple days in advance23:16
ianw++ will do23:21
clarkbside note the server show output is super verbose now. Not sure I'm a fan23:32
ianwi still see things like "Munch('...')" which i'm pretty sure isn't really supposed to be user-visible object-details23:35
opendevreviewMerged opendev/system-config master: launch: refactor to work
opendevreviewMerged opendev/system-config master: launch: use apt to update packages

Generated by 2.17.3 by Marius Gedminas - find it at!