Thursday, 2025-04-17

clarkbmy intention today is to reboot the server and resync data and ensure that reboots behave as expected and syncing is still about as quick as the last time I did it. If that checks out then I'd like to reduce the dns ttl today14:46
clarkbok last warning re the review03 reboot. In particular it will kill the screen session there. I'll plan to start that in ~10-15 minutes15:21
fungino objection from me15:21
fungithough i need to step away to run a couple of quick errands in a few minutes15:21
clarkbya I don't expect problems mostly just warning people about losing that context which isn't the end of the world15:22
fungiokay, headed out now, should be back shortly15:32
clarkband review03 is rebooting15:33
clarkbwhen I rebooted the containers were running. When it comes back I expect them to not be running15:34
clarkbwow that reboot was very fast. And confirmed `docker ps -a` shows the containers have exited and are not running15:34
clarkbI'll leave things shut down for the moment as I'm going to resync data before starting again so that I can collect more sync timing data15:35
clarkbI think I figured out why the first index sync was so large/slow. We (gerrit really) seem to keep old index versions around. So we had to copy all of the old data and the active data. Now that the old data is synced it isn't updated and we ignore it on subsequent rsyncs15:46
opendevreviewJames E. Blair proposed openstack/project-config master: Temporarily stop loading nodesets from zuul-providers  https://review.opendev.org/c/openstack/project-config/+/94760515:46
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Add a copy of the nodepool nodesets  https://review.opendev.org/c/opendev/zuul-providers/+/94760615:47
opendevreviewJames E. Blair proposed openstack/project-config master: Use zuul-providers for nodesets in opendev/zuul tenants  https://review.opendev.org/c/openstack/project-config/+/94760715:49
corvusclarkb: if you have a sec to review those, that would be great.  the plan is in the first commit msg15:49
clarkbcorvus: yup just started on that. I left a comment on the second one with a question15:50
clarkbthe two bookend changes lgtm and make sense. But the one in the middle has a data mapping thing I'm not sure about15:51
corvusreplied... and i'm writing a change that may make it more clear when you see it in action.  will have that in a min.15:53
clarkbcorvus: aha that explains it. Thanks15:53
clarkbbecause on the node request side everything is going into the same queue and niz will handle it if it can and nodepool will handle it if it can and the labels are disjoint right now so they don't fight over labels just over quota15:54
corvusyep.  and we'll keep the labels disjoint until nodepool is completely retired.15:54
corvusbut we're now at a stage where, if we want to switch tenants back and forth, we should no longer keep the nodesets disjoint15:54
corvus(because disjoint nodesets means way too many changes to individual projects)15:55
corvusby changing the meaning of the "ubuntu-noble" nodeset, we can switch over a whole bunch of jobs in a tenant at once15:56
corvus(we won't catch jobs that define their own nodesets, but we'll get a lot)15:56
corvuswhen we're ready for everything to use niz, then we switch the labels over15:56
corvus(and even that can still be done one at a time)15:57
clarkbyup15:57
clarkbreview03 has been synced with additional timing data added to the etherpad. I have also started containers again and things continue to appear to be happy15:57
clarkbI think we are ready to land https://review.opendev.org/c/opendev/zone-opendev.org/+/947136 and also do the associated record update in the openstack.org zone cc fungi for when you get back15:58
clarkbbut feel free to ssh in look at the server, update your /etc/hosts to point review.o.o at the server and login and perform read/write operations through the web ui or even push a change if you want15:59
clarkbI'm going to keep doing that sort of testing through the day, but I've been trying to keep the laptop as the only host with /etc/hosts overridden so that I don't confuse myself and I'm not on the laptop right now15:59
clarkbfungi: I've also thought about the dns problem for the actual switchover and I really like that we can use the dns update as a test of replication config before we actually cut over. I think the risk with rolling back an update to that repo is low because dns uses a serial number on the zone file contents making this an excellent guinea pig16:01
clarkbbasically even if something goes wrong we'll continue to be in a known state rolling forward and can sort out from there16:02
clarkbthe downside is as you point out a longer outage16:02
clarkbbut I think that it may be worthwhile in this case?16:02
clarkbok I synced indexes after git repos and that creates problems. I thought this would be the correct way but I guess gerrit looks at the index then expects any info in there to be present in the git repos16:09
clarkband when they aren't you get exceptions. I'm going to stop gerrit again. And resync git repos so that the index is older and get these exceptions to clean up (that way we don't have the noise while testing)16:09
fungiokay, back and catching up16:11
clarkbgerrit on 03 is happier now that I've done things in the other order. I'll update the etherpad to note that the old assumed correct way is wrong16:15
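
The ordering lesson above (sync the index first, then the git repos, so the index is never newer than the repos it refers to) could be sketched as follows. This is only an illustration: the function name, the paths, and the choice of `rsync -a --delete` are assumptions, not the commands recorded in the etherpad, and nothing here is executed.

```python
def sync_commands(src_host, index_dir, git_dir):
    """Build rsync command lines in the order that kept Gerrit happy.

    Paths and flags are hypothetical; the point is only the ordering:
    index data first, git repositories last.
    """
    def rsync(path):
        # Trailing slashes copy directory contents rather than the directory itself.
        return ["rsync", "-a", "--delete", f"{src_host}:{path}/", f"{path}/"]

    # The index is copied before the repos, so anything the index mentions
    # should already be (or soon be) present in the git data copied after it.
    return [rsync(index_dir), rsync(git_dir)]


if __name__ == "__main__":
    for cmd in sync_commands("review02.opendev.org", "/srv/example/index", "/srv/example/git"):
        print(" ".join(cmd))
```

Returning the commands as lists rather than running them keeps the sketch safe to try and easy to feed into `subprocess.run` later.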
fungion using the dns change as a canary, there's always the option of making a separate no-op change ahead of time that we can test with immediately following cut-over16:16
clarkbtrue, I guess my thought was dns is resilient to rolling forward and back if things go wrong. But we could pick a noop change in a low impact project too (eg not project-config or system-config. Maybe bindep or similar)16:19
fungialso speaking of replication it just dawned on me to check the iptables rules on gitea backends, but looks like they allow ssh from everywhere and aren't limited to gerrit servers16:19
fungiwe could even still use the dns zone repo, just increment the soa serial with no other records changed16:20
clarkboh thats a good thing to check. But ya I think we rely on auth for that16:20
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Move some nodepool nodesets to niz  https://review.opendev.org/c/opendev/zuul-providers/+/94760916:20
clarkbfungi: oh I like that. dig would still show the soa serial value and we could look at gitea repo head16:20
fungiyeah, gives us end-to-end test of a simple deploy sequence too16:22
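
A no-op serial bump like the one discussed above could be computed with a small helper. This is a minimal sketch that assumes the common YYYYMMDDNN serial convention; the log does not say which convention the zone actually uses, and `bump_serial` is a hypothetical name.

```python
from datetime import date


def bump_serial(current: int, today: date) -> int:
    """Return the next SOA serial, assuming a YYYYMMDDNN layout.

    The serial must always increase, so the result is either today's
    first serial or one past the current value, whichever is larger.
    """
    base = int(today.strftime("%Y%m%d")) * 100
    return max(base, current + 1)


if __name__ == "__main__":
    print(bump_serial(2025041700, date(2025, 4, 17)))
```

After such a change deploys, comparing the serial reported by dig against the gitea repo head gives the end-to-end check described above.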
clarkbif we go that route the plan would basically be put review02 and 03 in the emergency file, do a pre sync of index and git data, then just before 1600 UTC merge the dns update (maybe force merge it so that we get in before the hourly jobs), once deployed shutdown gerrit on both 02 and 03, rerun syncs, copy replication config from 02 to 03, start gerrit on 03, quickly force merge the16:28
clarkbserial bump change and confirm replication is happy, this should also theoretically automatically deploy if zuul has reconnected. Then followup with cleanups and so on16:28
clarkbfungi: ^ does that roughly sound right to you? I'll work on updating the etherpad if so16:28
opendevreviewMerged openstack/project-config master: Temporarily stop loading nodesets from zuul-providers  https://review.opendev.org/c/openstack/project-config/+/94760516:30
fungiclarkb: yeah, that seems like the best option we have for minimizing the duration of the overall outage16:32
clarkbfungi: ok I updated the etherpad. If you have a moment a quick skim of that would be great16:36
clarkbit's a bit more detailed than my notes above16:36
clarkbfungi: and then if you think we're safe to reduce TTLs I think today is a good day to do that since you're out tomorrow and the review.openstack.org record will need your intervention16:39
fungiclarkb: the dns update for review.openstack.org currently on line 108 could move up to between lines 95 and 96?16:40
fungithat'll give it more time to propagate, similar to the review.opendev.org record16:40
clarkb++16:40
fungianyway, plan as outlined there lgtm, i'll get started on the advance dns updates16:44
corvusi sent the email drafted tues/wed about images to the service-discuss list16:45
fungithanks!16:45
corvushttps://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/ZLZ7OUFAOAZ7OS2PO2MHGJJKOBYVWB3G/16:46
opendevreviewMerged opendev/zone-opendev.org master: Reduce the review.o.o record TTL  https://review.opendev.org/c/opendev/zone-opendev.org/+/94713616:46
opendevreviewMerged opendev/zuul-providers master: Add a copy of the nodepool nodesets  https://review.opendev.org/c/opendev/zuul-providers/+/94760616:51
corvuswhat's changing about review.openstack.org?16:52
clarkbcorvus: it points at review02.opendev.org like review.opendev.org does and we need to switch the CNAME record to point to review03.opendev.org as part of the outage and swap.16:53
corvusisn't it just cname?16:53
clarkbcorvus: but openstack.org is hosted in cloudflare now so fungi needs to do it16:53
clarkboh wait maybe it points at review.opendev.org?16:53
corvusit's not a cname to review.opendev.org16:53
clarkbok sorry for all the confusion it does point to review.opendev.org which points to review02.opendev.org16:54
clarkbso I guess we don't have to do anything for it16:54
corvus++16:54
clarkbfungi: ^ fyi16:54
fungigood point16:54
fungiyeah, i'll just set the ttl on it back to "auto"16:54
clarkbI thought I looked this up and it was not this way but I must've misread the dig output since both are listed16:54
clarkbfungi: thanks16:54
clarkbI'll update the etherpad to drop openstack.org mentions16:54
corvusyeah dig being helpful i think16:54
fungiit didn't even dawn on me that it would be a cname to a cname, since that was highly discouraged back in ye olden times16:54
fungiparts of my mind are still stuck in the dawn of the internet16:55
corvusa good choice in this instance though i think16:55
clarkb++16:56
clarkbI see the new ttl showing up in dns too so I'll mark those steps done on the etherpad as well16:56
fungiyeah, i don't think it's really problematic these days with ~everyone relying on caching resolvers that helpfully include records in advance that they think you're going to query next16:56
fungiso any more if you make a query for review.openstack.org your resolver is probably going to tell you it's a cname to review.opendev.org and also mention that review.opendev.org is a cname to review02.opendev.org and even go so far as to include the a and aaaa records for review02.opendev.org while at it16:57
fungiand you end up making only one dns query instead of 316:58
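
The CNAME-to-CNAME chain described above can be modelled with a toy resolver. This is only a sketch: the record table is hypothetical, the addresses are documentation placeholders rather than the real ones, and no DNS queries are made.

```python
# Hypothetical record set mirroring the chain described in the discussion.
RECORDS = {
    "review.openstack.org": ("CNAME", "review.opendev.org"),
    "review.opendev.org": ("CNAME", "review02.opendev.org"),
    "review02.opendev.org": ("A", "203.0.113.2"),  # placeholder address
    "review03.opendev.org": ("A", "203.0.113.3"),  # placeholder address
}


def resolve_chain(name, records, max_hops=8):
    """Follow CNAME records until an address record is reached.

    Returns the list of names visited and the final address.
    """
    chain = [name]
    for _ in range(max_hops):
        rtype, value = records[chain[-1]]
        if rtype != "CNAME":
            return chain, value
        chain.append(value)
    raise RuntimeError("CNAME chain too long or looping")


if __name__ == "__main__":
    print(resolve_chain("review.openstack.org", RECORDS))
```

Because review.openstack.org points at review.opendev.org rather than directly at a server, changing the single review.opendev.org record is enough to move both names, which is why no separate openstack.org update turned out to be needed.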
opendevreviewMerged openstack/project-config master: Use zuul-providers for nodesets in opendev/zuul tenants  https://review.opendev.org/c/openstack/project-config/+/94760716:59
clarkbthat change is deploying now. Got in just before the hourlies17:00
clarkbinfra-root https://etherpad.opendev.org/p/opendev_newsletter was meant to go out last month but got delayed for reasons. I've done a quick edit to make it relevant to this month and I expect it to go out this month17:09
clarkblet me know if you see any problems with my edits17:09
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Move some nodepool nodesets to niz  https://review.opendev.org/c/opendev/zuul-providers/+/94760917:11
corvusclarkb: ^ that one makes the switch17:11
corvusnewsletter lgtm17:12
clarkbcorvus: +217:13
corvushaha look that change is building new images -- because we switched the underlying images for those jobs from nodepool to niz!17:13
clarkbcorvus: we can recheck https://review.opendev.org/c/opendev/system-config/+/947165 after 947609 lands and that should use the jammy nodes17:13
clarkboh except we use custom nodesets in the system-config-run jobs17:13
clarkbso maybe not17:13
corvusi find that really amusing, but also, wonderfully correct!17:14
clarkbbindep and git-review are probably better17:14
clarkbcorvus: but doesn't niz already have those images?17:14
corvusclarkb: all the system-config jobs run in the openstack tenant, right?17:15
clarkbcorvus: yup I just realized the tenant is wrong too17:15
clarkbcorvus: bindep and git review are better canaries17:15
corvusclarkb: it does, but because they use the nodesets, that change is switching the *label* they use17:15
clarkbcorvus: oh! the image build jobs have a new label and therefore we need to rebuild images because the job updated17:16
clarkbnow I understand17:16
corvusif we wanted to insulate the build jobs from this (perhaps so we could more easily flip things back and forth) we could take a one-time hit to move those jobs to specifying labels rather than using the nodesets, then we can change the nodesets instantly without affecting the jobs17:16
corvusi think i may do that.17:16
clarkbmore iterations on things from the testing is probably not a bad idea. But your described plan may also better represent a transition in other zuuls17:17
clarkbas you'd bootstrap with image A then switch to image B to build images later17:17
corvuswhat i like about it is that it lets us instantly revert to nodepool if we see a problem17:17
corvusotherwise i wouldn't care too much17:18
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Move some nodepool nodesets to niz  https://review.opendev.org/c/opendev/zuul-providers/+/94760917:21
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Temporarily pin image build job labels  https://review.opendev.org/c/opendev/zuul-providers/+/94761317:21
corvusclarkb: what do you think of that?  if i'm reading that right, neither of those should trigger image builds now.17:21
clarkbcorvus: there is a typo error in the base one17:22
clarkbbut I think that looks right to me17:22
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Temporarily pin image build job labels  https://review.opendev.org/c/opendev/zuul-providers/+/94761317:22
opendevreviewJames E. Blair proposed opendev/zuul-providers master: Move some nodepool nodesets to niz  https://review.opendev.org/c/opendev/zuul-providers/+/94760917:22
clarkbhrm its still rebuilding.17:23
clarkbmaybe because the nodeset itself is changing from foo to anonymous?17:23
clarkband once we're anonymous it will stop rebuilding?17:23
corvushrm, yeah, i guess the comparison is done before dereferencing that17:23
corvusif that's the case... then maybe we should drop that change17:24
corvusno... hrm.  i'm puzzled.17:25
corvusoh, it's probably the fact that the anonymous nodeset doesn't have a name, so it still looks different.17:27
clarkbya that is what I wondered about. The name changes even if the labels don't so we get a delta detected. Once this change is in then we should stabilize and be able to flip back and forth?17:28
corvusokay, i think we should stick with this stack.  i think we're likely to take the one-time hit to switch to the anonymous nodeset, then -- yes, exactly -- we should be able to change the named ones without affecting it17:28
corvus(and when we switch the image jobs back to using named nodesets, one more image build round)17:28
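
The theory above (a named nodeset and an anonymous one with identical labels still compare as different) can be illustrated with a toy model. This is not Zuul's code or data model; `Nodeset`, `job_changed`, and the label value are hypothetical stand-ins for whatever comparison Zuul actually performs.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Nodeset:
    # An anonymous nodeset is modelled here as having no name.
    name: Optional[str]
    labels: Tuple[str, ...]


def job_changed(old: Nodeset, new: Nodeset) -> bool:
    """Report a delta if any field differs, including the name."""
    return old != new


if __name__ == "__main__":
    named = Nodeset("ubuntu-noble", ("example-label",))
    anonymous = Nodeset(None, ("example-label",))
    print(job_changed(named, anonymous))
```

Under this model the first switch from named to anonymous registers as a change even though the labels match, and later edits to the named nodeset no longer touch a job that carries its own anonymous copy.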
corvusi've gone ahead and approved those.  they can do their image builds in gate.17:29
clarkbsounds good17:29
clarkbno response to my question about sigint vs sighup handling in gerrit on discord yet17:31
clarkbI think my rough plan for today is to switch to the laptop after lunch and start more targeted testing of gerrit functionality that way17:31
clarkbotherwise I think I just need to get that SOA record edit pushed up so it is ready for us and I'm feeling fairly ready17:32
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Update the SOA serial  https://review.opendev.org/c/opendev/zone-opendev.org/+/94761417:34
clarkbfungi: ^ that is what you had in mind right?17:35
fungiyeah, though when i read the plan i thought you were also wanting to test pushing a change to gerrit, so figured it would be created at that time17:36
clarkbI want to do both things but I don't want to couple them together. The primary reason is that I can test everything but replication ahead of time so I'm less worried about that stuff17:42
fungihttps://lists.openstack.org/archives/list/openstack-discuss@lists.openstack.org/message/O4THT732BMZWACYFZKVRDFHAYOG2G44Z/ mentions the test node image call for help, to see if we can get some attention on it from openstack quarters17:42
clarkbso I want to get replication tested as quickly as possible17:42
fungimakes sense17:42
clarkband in theory replication should work as we manage host keys for giteas in ansible and that was historically the main snag17:43
clarkbwe can do another soa serial bump as a followup to test the other bits if replication is happy17:45
fungisure, simple enough17:45
clarkbI skipped breakfast today. Going to figure out early lunch now then maybe do some quick yardwork before looking at review03 testing from my laptop18:00
NeilHanlono/18:05
* NeilHanlon answers the batman signal for image help18:05
corvusfungi: thanks!18:06
corvusalso nice that worked :)18:06
corvusNeilHanlon: i think my message had some links at the bottom to get started.  please let me/us know if you have any questions!18:07
NeilHanloncorvus++ yep, thank you :) gonna see if I can drag some coworkers in, too :D 18:08
corvusgreat!18:08
JayFfungi: to be clear re: the email; will we be unable to run arm unit tests for openstack if nobody volunteers to update the image?18:41
fungiJayF: basically, yes18:44
fungiand help with ongoing maintenance tasks from time to time18:45
JayFack. interestingly enough, it would *not* cause Ironic's aarch64 guest tempest testing to fail18:46
opendevreviewDmitriy Rabotyagov proposed openstack/project-config master: Move OSA sync to integrated repository  https://review.opendev.org/c/openstack/project-config/+/94762819:37
opendevreviewDmitriy Rabotyagov proposed openstack/project-config master: Deprecate openstack-ansible-tests repository  https://review.opendev.org/c/openstack/project-config/+/94762919:37
fungiinfra-root: oh, also a reminder, i'm afk all day tomorrow19:51
clarkbok back. My plan is to create a change on review03. Then I'm going to shutdown gerrit on review03, resync git and indexes and delete caches and make sure that when I turn it back on again there are no problems with that change number being looked up after being updated from 0220:34
clarkbbecause in theory I'm going to create a change id collision and my sync process should clear that out and I want to check that20:34
opendevreviewMerged opendev/zuul-providers master: Temporarily pin image build job labels  https://review.opendev.org/c/opendev/zuul-providers/+/94761320:36
opendevreviewMerged opendev/zuul-providers master: Move some nodepool nodesets to niz  https://review.opendev.org/c/opendev/zuul-providers/+/94760920:36
corvusi rechecked a change in the zuul tenant, so that should be using mostly niz nodes now20:37
corvusclarkb: sounds logical20:37
clarkbcompare https://review03.opendev.org/c/opendev/bindep/+/947611 to https://review.opendev.org/c/openstack/governance/+/94761120:41
clarkbI think that shows you can push changes to the new server and that works.20:42
clarkbI'll leave that as is for a few so that yall can look if you like then proceed with cleaning that up on 03 and ensuring its happy20:42
corvusclarkb: a strange thing happened... 1 sec20:51
corvushttps://imgur.com/a/r64rSpG20:52
corvusfirst image is what happened when i clicked on the diff for test_depends.py the first time.  i opened a new empty tab, went to the change url, clicked the file again and got that.20:52
clarkbcorvus: and this was using review03 urls both times (so we probably don't need to worry about caching oddities with review.o.o?)20:53
corvusmaybe my browser had a sad.  maybe the internet did.  or maybe gerrit had an error?20:53
corvuscorrect, haven't looked at the governance change yet20:53
corvusi opened the devtools panel on the bad tab, and it said it crashed due to oom.  the good tab works fine.20:54
clarkbweird. fwiw that did not happen to me either via review03 urls or my /etc/hosts overridden urls on laptop.20:54
clarkbI just tried in incognito too20:54
clarkbthat looks similar to what happens when caches are stale on gerrit startup though20:55
clarkbexcept in this case you got line numbers (the stale cache gets you one line I think)20:55
corvusi'd like to say this is probably just local browser problem.  like 95% confidence based on evidence so far.  but hey, we're testing, so i mention it.20:55
clarkbI'm going to look at server logs just to see if there is anything interesting there but I agree20:55
corvusi will now quit my browser and restart to free the memories.  :)20:56
clarkbcorvus: do you want me to leave review03 up or are you done?20:56
clarkbnothing in the logs that stands out20:56
corvusdone. i checked one more time with a fresh browser and everything looks good.20:57
clarkbok I'm going to restore and resync and then 03 should load the governance change instead20:57
clarkbhttps://review03.opendev.org/c/openstack/governance/+/947611 now loads21:05
clarkband https://review03.opendev.org/q/project:opendev/bindep has no knowledge of my previous test change21:06
clarkbso I think this works out. I did clear caches too which other than git and index would be the only place I expect problems21:07
clarkband the log is clear after that last restart21:09
clarkbat this point I feel like doing more and more testing is likely to have diminishing returns21:11
clarkbthe synchronization process seems reliable and basic functionality has been confirmed afterwards21:11
clarkbI'm thinking about testing sighup vs sigint in gerrit. I don't think strace will do what I want. I need the java version of strace but if I had that I could compare the two and be confident that sigint is ok21:46
fungiwhy wouldn't strace work? it should at least indicate the engagement of the top-level signal handler in that process, right?21:47
fungior you want to see what happens internally to confirm it's still a graceful shutdown?21:48
clarkbright I want to see that the code running in the jvm is executing graceful shutdown21:53
clarkbthat is entirely opaque to strace I think21:53
clarkblooks like bcc javaflow.sh is what I want21:53
clarkblet me hold a node then I can try that21:53
opendevreviewClark Boylan proposed opendev/system-config master: DNM Forced fail on Gerrit to test sigint vs sighup  https://review.opendev.org/c/opendev/system-config/+/89357121:57
clarkbcan also run strace at the same time and see if it does expose useful info21:58
clarkbbut what I want to see in particular is that the jvm calls gerrit internal shutdown method for graceful stopping whether sigint or sighup is used21:59
clarkband can confirm that it doesn't happen with sigkill21:59
clarkbhttps://github.com/iovisor/bcc/blob/master/tools/lib/uflow_example.txt should be able to do it based on that documentation21:59
fungiyeah, makes sense22:00
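
The expectation above (SIGINT and SIGHUP can trigger a shutdown routine, SIGKILL cannot) follows from how the OS treats these signals. A minimal Python sketch of that OS-level behaviour, assuming a Unix system and Python 3.8 or newer; it says nothing about Gerrit's or the JVM's own handlers, and `catchable` is a hypothetical helper name.

```python
import signal


def catchable(sig):
    """Return True if a handler can be installed for `sig` and is invoked.

    Must be called from the main thread. The previous handler is restored
    afterwards so the probe leaves the process as it found it.
    """
    seen = []
    try:
        previous = signal.signal(sig, lambda signum, frame: seen.append(signum))
    except (OSError, ValueError):
        # The OS refuses handlers for signals such as SIGKILL.
        return False
    try:
        signal.raise_signal(sig)
    finally:
        signal.signal(sig, previous if previous is not None else signal.SIG_DFL)
    return seen == [sig]


if __name__ == "__main__":
    for sig in (signal.SIGINT, signal.SIGHUP, signal.SIGKILL):
        print(sig.name, catchable(sig))
```

A process never gets a chance to run cleanup code on SIGKILL, which is why seeing no shutdown call at all for `-9` is the expected control result.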
opendevreviewClark Boylan proposed opendev/system-config master: DNM Forced fail on Gerrit to test sigint vs sighup  https://review.opendev.org/c/opendev/system-config/+/89357122:28
clarkbthe first pass tried to build docker images and hit rate limits there. I've updated the change to use the current image and hopefully we'll avoid the rate limits22:29
clarkboh shoot I disabled the system-config run job not the image build job22:29
opendevreviewClark Boylan proposed opendev/system-config master: DNM Forced fail on Gerrit to test sigint vs sighup  https://review.opendev.org/c/opendev/system-config/+/89357122:31
clarkblets see if that works better22:32
clarkbthat did manage to hold a node. I installed the bcc ebpf tools but uflow and javaflow don't end up tracing anything. I'll have to dig in more. Maybe things need to be enabled more properly in the kernel or something23:00
clarkbbut I think thats about it for today. I'll dig in more tomorrow23:00
clarkbif this works it would be a really neat tool to have on the toolbelt23:01
clarkb`startup flag "-XX:+ExtendedDTraceProbes" is required`23:02
clarkbthat explains it23:02
clarkbgot it "working"23:08
clarkbthe -C flag didn't work like I expected it to (I get no content after trying to filter for the class I was interested in). So then I tried doing a collect everything run and after a few seconds it wrote a half gig file23:09
clarkbok I thought I was done until I discovered that piece of documentation and I think I have successfully tested this23:17
clarkbhttps://paste.opendev.org/show/biQgQ57k5gXuv2UdkP5t/23:18
clarkbthe kill and uflow commands are in different shells so they were not run in the order shown. I just wanted to show the pids matching up with the output behavior23:18
clarkbboth -INT and -HUP produce the same calls23:18
clarkb-9 doesn't call it at all. This is all what I expected given what the docs say and I think it is safe to use sigint now23:18
clarkbalso this might be the coolest tool ever. Works with python too if you build it with dtrace support23:19
corvusclarkb: those are the same calls?23:20
corvusShutdownCallbackhodHand != ShutdownCallbackdSelect23:21
clarkboh hrm.23:21
corvusbut also, i don't know what a Callbackhod is -- could it be just terminal formatting or something?23:22
clarkbya I'm wondering I don't recall those strings showing up in the gerrit source but I'll double check23:22
corvusor maybe something in uflow's reflection or whatever is slightly corrupt...23:23
clarkbprivate static class ShutdownCallback extends Thread23:23
corvusmaybe just missing a terminating null23:23
clarkbbut there isn't a ShutdownCallbackdSelect or ShutdownCallbackhodHand23:23
corvusand we should just read that as ShutdownCallback.......23:24
clarkbI think the bcc tool is just a python script that compiles some ebpf. I may be able to dig into that and widen the output or something23:24
clarkbya I wonder if this being a thread is something that causes it to show up oddly depending on when the ebpf catches it?23:25
clarkbI also noticed when I got the half gig of raw output that it emits notes to stderr that it is skipping calls too23:25
clarkbI guess not a clear indicator yet but a promising path23:25
clarkbI'll dig in more tomorrow and try to get a better understanding23:25
corvusyeah.  i think i'm positing that looks a lot like a strncpy of something missing a null terminator23:26
corvusi agree though, the stack looks suspiciously similar and based on the fact that the difference is gibberish, it's probably a good chance they're the same23:26
Clark[m]My laptop just died. I didn't think it was low battery but I guess it was23:27
corvusclarkb: you could repeat the experiment, see if you get different gibberish23:27
corvusthat might help confirm the theory that, whatever the cause, we can ignore those 7 chars.23:27
Clark[m]++ I was going to suggest repeating it a few times and then laptop said I'm done23:27
clarkbok it was low battery and decided to hibernate to disk because everything is still here after plugging in. Neat23:30
corvusyou hit the jackpot23:32
clarkbbut ya I'll do a more scientific approach repeating the experiment for both sigint and sighup a few times to see if those values vary23:33
clarkband really this is a neat tool once you can narrow down what you are looking for23:33
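
The repeat-and-compare idea above could be automated by trimming whatever trails the known class name before comparing runs. A minimal sketch, assuming the trailing characters really are noise (the theory under test, not an established fact); `normalize` and `same_calls` are hypothetical helpers, and the sample strings are the two names quoted earlier in the discussion.

```python
def normalize(frame: str, known: str = "ShutdownCallback") -> str:
    """Cut a traced name off right after the known class name, if present."""
    idx = frame.find(known)
    return frame[: idx + len(known)] if idx != -1 else frame


def same_calls(trace_a, trace_b) -> bool:
    """Compare two traces frame by frame after normalizing each name."""
    return [normalize(f) for f in trace_a] == [normalize(f) for f in trace_b]


if __name__ == "__main__":
    sigint_run = ["ShutdownCallbackhodHand"]
    sighup_run = ["ShutdownCallbackdSelect"]
    print(same_calls(sigint_run, sighup_run))
```

If repeated runs of the same signal also produce different trailing characters, that would support ignoring them; if the suffix is stable per signal, the normalization would be hiding a real difference and should not be trusted.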
