Tuesday, 2024-11-12

timburkefrickler, it depends on the test job. probe tests (which were the ones affected by https://github.com/eventlet/eventlet/issues/989) run unconstrained00:02
opendevreviewKarolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10  https://review.opendev.org/c/openstack/diskimage-builder/+/93404511:34
opendevreviewSuzan Song proposed opendev/git-review master: Support GIT_SSH_COMMAND  https://review.opendev.org/c/opendev/git-review/+/93474511:48
opendevreviewSuzan Song proposed opendev/git-review master: Support GIT_SSH_COMMAND  https://review.opendev.org/c/opendev/git-review/+/93474512:58
dtantsurFYI folks. Not sure if it's a known issue or not, but the opendev.org v6 address does not look reachable here (Deutsche Telekom): https://paste.opendev.org/show/b5WW19Pmlpwk9d8GoPdP/14:23
dtantsurThis may explain why it sometimes takes me many seconds to open any page14:23
fricklerdtantsur: yes, known issue with vexxhost's connectivity14:24
dtantsurSo you know, good14:24
fungiis it still the same as a year or two ago when the prefix they were announcing into bgp was too small and was dropped by some providers' filters?14:29
fungi(prefix too long, technically, network too small)14:30
fricklerI didn't check yet this time. but a /48 isn't too small, the issue was lack of proper database entries/certificates14:32
fungiwell, i meant too small for the range. at one point iana decided that different ranges would have different minimum network sizes for bgp announcements, at least back in the early days of v6 allocation, and a number of backbone providers baked those assumptions into their filters. a /48 was considered valid to announce from some parts of the v6 space but not other parts (similar to how parts of14:36
fungithe v4 space were intended for smaller networks and others not)14:36
fungiby "lack of proper certificates" do you mean providers are expecting rpki now?14:38
fungi(rfc 6480)14:39
*** artom_ is now known as artom14:42
clarkbI'm doing a quick pass of meeting agenda updates. Hoping to send that one shortly. Anything important to add/remove/edit?15:45
opendevreviewClark Boylan proposed opendev/system-config master: Update backup verifier to handle purged repos  https://review.opendev.org/c/opendev/system-config/+/93476816:01
fricklerclarkb: something about promote-openstack-manuals-developer jobs failing?16:08
opendevreviewJoel Capitao proposed opendev/system-config master: Enable mirroring of CentOS Stream 10 contents  https://review.opendev.org/c/opendev/system-config/+/93477016:10
clarkbfrickler: ack working on it16:10
clarkbtrying to pull up an example job log and seems like zuul build search might be slow? Either that or I'm not doing search terms properly16:12
clarkbcorvus: ^ fyi in case there were db updates with the recent upgrade of zuul last weekend16:12
clarkbI found https://review.opendev.org/c/openstack/api-site/+/933901 is that the most recent run? Looks like that is still failing on an undefined variable?16:15
clarkbanyway it's on the draft on the wiki and I'll send that out as able with meetings this morning16:17
fricklerclarkb: build searches with just a job as a filter have been non-working for me for weeks, but I've given up complaining about that16:19
corvusclarkb: no db changes16:20
fricklercombined with a project it works fine, so the above was at least the last one for that repo https://zuul.opendev.org/t/openstack/builds?job_name=promote-openstack-manuals-developer&project=openstack/api-site16:21
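For reference, the same filtered query also works against the Zuul REST API; a rough Python sketch, assuming the endpoint and parameter names mirror the web UI query string above:

    import requests

    # Query the Zuul builds API with the same filters as the web UI link above.
    url = 'https://zuul.opendev.org/api/tenant/openstack/builds'
    params = {'job_name': 'promote-openstack-manuals-developer',
              'project': 'openstack/api-site',
              'limit': 10}
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    for build in resp.json():
        print(build['end_time'], build['result'], build['log_url'])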
opendevreviewKarolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10  https://review.opendev.org/c/openstack/diskimage-builder/+/93404516:31
clarkbthere are new jitsi meet images today. We should automatically upgrade when daily infra-prod jobs run. I don't have any concerns with this as there are no big events and this usually works16:36
opendevreviewKarolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10  https://review.opendev.org/c/openstack/diskimage-builder/+/93404516:41
clarkbsending the meeting agenda has resulted in two service-discuss members having non-zero bounce processing scores16:49
fungisounds like it's working!17:10
JayFSo I know some folks have reported this before, but I saw it *frequently* this weekend (Sunday, mainly): initial connections to review.opendev.org were failing17:13
JayFI can't quite put my finger on what's happening. If I didn't know any better I'd say that sometimes I'm getting a bad DNS result; but it's not consistent enough to really apply meaningful network tools to it.17:14
JayFThis is v4 only, from Centurylink/(old Qwest network) in .wa.us17:14
clarkbwe have dnssec enabled, so bad DNS shouldn't be possible, just good dns or no dns17:16
clarkbthough I guess that depends on client-side settings and whether or not you verify17:16
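A quick way to see whether the client-side resolver is actually validating is to look at the AD (authenticated data) flag in its answers; a minimal sketch with dnspython, assuming a validating resolver is configured locally:

    import dns.flags
    import dns.resolver

    # Ask the configured resolver for review.opendev.org and set the AD flag in
    # the query so the response tells us whether the answer was DNSSEC-validated.
    resolver = dns.resolver.Resolver()
    resolver.flags = dns.flags.RD | dns.flags.AD
    answer = resolver.resolve('review.opendev.org', 'A')
    print('addresses:', [rr.to_text() for rr in answer])
    print('validated by resolver (AD set):', bool(answer.response.flags & dns.flags.AD))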
clarkbJayF: are you seeing this with ssh or https or both?17:18
JayFboth17:18
JayFwell, I'll say, the experience with ssh was weird enough I blamed WSL17:18
JayFbut I had a patch push that took almost 5 minutes before seemingly failing, but it didn't actually fail (the new patchset was in gerrit; just with a delay)17:19
clarkbhow did the failure manifest?17:19
JayFbut I saw this a *bunch* of times with firefox on windows. review.opendev.org (failure to connect) multiple times until I shift-refresh17:20
JayFonce I got any successful connection, it stayed working17:20
clarkbfwiw I don't see any ssh connections from you in the logs from saturday, sunday, or monday. You are in there today. (of course I could be doing something wrong or you're using multiple usernames perhaps)17:24
clarkbit is worth noting that I sometimes get sad gerrit change pages sometime after they have loaded because background refreshes or whatever gerrit does to reload things (I think it polls for status updates like new patchset available and new comments) will occur when my local network is sad (for example during system updates that touch networking and shut it down temporarily)17:29
clarkbthis is distinct from `open a new tab and connect to gerrit there`17:29
clarkbit sounds like you're talking about opening a new tab and seeing the change there, but if it is the former situation instead I think this is expected unfortunately as long as gerrit does polling17:29
fricklerhmm three release job failures in a row, all POST_FAILURE without logs available17:40
fricklerall in the last 10 minutes17:42
clarkblooking at live streamed logs it appears to be some problem with rax_* uploads17:43
JayFclarkb: apparently the slow SSH was yesterday, around 5:10pm, looking at my patch log17:43
JayFclarkb: (PST)17:43
JayFclarkb: https://review.opendev.org/c/openstack/ironic/+/931055 patchset 6 here17:44
clarkbnot seeing any tracebacks in ansible or anything like that, just a result failure when attempting to upload to swift in at least rax dfw and rax iad17:45
clarkbwe can disable rax_* for uploads if it persists17:45
JayFWeird that you don't see that in the logs though17:45
JayFoh, other thread17:45
clarkbJayF: I think those are today's logs because 5:10 pst is after 00:00 UTC17:46
fricklerhttps://zuul.opendev.org/t/openstack/builds?result=POST_FAILURE&skip=100 looks like it might have started around 15 UTC17:46
JayFclarkb: ...does something happen at 5pm PST which would cause slowness? Like a backup that's at 00:00 no splay?17:46
clarkbJayF: I think gerrit does some internal processes at 00:00 UTC17:46
JayFwell the timing lines up 100%, and that push took me *three attempts*17:47
clarkbbut repacking and actual backups don't run around then17:47
JayFso maybe something underscaled or a job that's growing linearly or worse with patchset #? /me just guessing17:47
clarkbJayF: can you clarify how it failed? You said it succeeded so I'm wondering what the failure case looks like or even just how you know it failed17:47
clarkbslow != failed17:48
clarkbfrickler: or earlier? there are also blocks around 0700-0800?17:48
clarkbfrickler: but we run a lot more jobs around 1500 UTC than 0800 UTC17:49
fungihttps://rackspace.service-now.com/system_status?id=service_status&service=af7982f0db6cf200e93ff2e9af96198d17:50
fungiit's rackspace keystone-ish again17:50
JayFclarkb: 3 total attempts: first time: long hang during "Processing changes " spinner, I waited maybe 90 seconds, ^c, retried. second time: error message after a REALLY LONG TIME. Minutes, easily, but it actually worked because... third time: rejected after around 30+ seconds because it was missing a revision (because the second time worked even if it errored)17:50
fungioh, though that incident today says it ended 16 hours ago17:50
JayFclarkb: I'll also note that after the second time, I refreshed, and it wasn't in the web ui yet. So it almost seems like the ssh connection died/err'd before the backend processing of the change completed17:51
fungii wonder if it's picked back up again and their status dashboard hasn't been updated to reflect that yet17:51
corvusJayF: possible that it's actually the first that succeeded.  i had a similar experience but did not do your step #2.17:51
clarkbJayF: in the ssh log I see a killed git-receive-pack. I suspect this corresponds to your ^C.17:51
clarkbit ran for ~67 seconds supposedly17:52
JayFthat fits, I am impatient enough I would believe I overestimated the wait17:52
fricklerfungi: most recent one started 11:27 CST according to that, whatever time that is17:53
clarkbJayF: importantly I'm not seeing any evidence of connectivity problems17:53
clarkbthat doesn't mean http isn't having them, but on the ssh side I suspect that patience is simply needed17:53
JayFyeah, it "felt" much more like software slowness or I/O slowness17:53
JayFin fact I specifically looked for I/O issues because WSL is ... bad at I/O17:54
JayFand network transfers were in the hundreds of kb/s17:54
fungifrickler: oh, they just updated it after i pulled it up17:54
fungi11:27 cst is 17:27 utc17:54
fungiso 28 minutes ago17:55
clarkboh and this occurred at ~0100 UTC today (since DST went away that lines up with ~1700 PST)17:55
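For the arithmetic: with DST over, US Pacific is back to UTC-8, so a ~17:10 PST push lands at ~01:10 UTC the next day. A tiny zoneinfo check, using JayF's reported time as an assumed example:

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    # ~17:10 US Pacific on 2024-11-11, after DST ended on Nov 3 (PST, UTC-8).
    local = datetime(2024, 11, 11, 17, 10, tzinfo=ZoneInfo('America/Los_Angeles'))
    print(local.astimezone(timezone.utc))  # -> 2024-11-12 01:10:00+00:00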
clarkbfungi: ack so probably keystone again then? Do we want to pull those regions out for now?17:55
fungiprobably a good idea, seems like they keep having problems and no idea how long it will take to resolve this time or when the next incident will occur17:56
fungitwo incidents on thursday and now two more (so far) today, just 5 days later17:57
frickler+117:57
fungithough today's looks like it's just ord not other regions17:57
opendevreviewClark Boylan proposed opendev/base-jobs master: Disable rax job log uploads  https://review.opendev.org/c/opendev/base-jobs/+/93481917:58
fungi(according to their incident info)17:58
fungihopefully that change won't be blocked by the same problem it's trying to work around17:59
clarkbit probably will be17:59
fungimight get lucky if it's really only ord that's broken18:00
clarkbfwiw I'm running a gerrit show-caches --show-jvm right now to see if there are any obvious memory issues that may be contributing to JayF's thing.18:00
fricklerthat patch disables ovh, not rax?!?18:00
clarkbfrickler: yes in base-test18:00
fungifrickler: you're looking at base-test18:01
clarkbthat way we can run jobs against base-test and confirm that rax is working again before we revert18:01
fricklerah, right18:01
clarkbin addition to memory and potential gerrit daily tasks another idea is that maybe the jgit update that came with the most recent point release could be contributing. We upgraded what 3 weeks ago?18:02
clarkbalso could be the AI bot army18:02
fungioctober 30 according to ps18:02
fungi(was the last restart)18:02
clarkbfungi: that restart would've been for cache changes not the software update18:03
fungioh, right, that was the config update18:03
clarkbI suppose caches could also be at fault but we made them bigger which you'd expect would make things faster not slower18:03
corvusclarkb: JayF https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-11-07.log.html#t2024-11-07T01:09:14  there's the timing for my report of a similar problem18:03
clarkbcorvus: that's also ~0100 UTC so something about that time may be the thread to pull on18:03
corvus++18:04
JayFif I remember around that time today I might push something chonky to sandbox18:04
clarkbI did check that git repack and borg backups don't run then so it's probably something internal to Gerrit or some external thing also on a timer18:04
clarkbexternal to opendev I mean18:04
clarkblike meta bot going crazy or something18:04
fungiclarkb: 934819 failed due to not adding spaces after your hashmarks18:06
clarkbone sec18:06
fungiwhich, separately, seems like a silly rule18:06
opendevreviewClark Boylan proposed opendev/base-jobs master: Disable rax job log uploads  https://review.opendev.org/c/opendev/base-jobs/+/93481918:07
clarkbJayF: corvus: I'll try to run a gerrit show-queue at 0100 today and every 30 seconds or so18:09
clarkbin theory that would show the tasks that Gerrit might be running internally and their runtimes18:13
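A rough sketch of that periodic sampling (gerrit show-queue over the ssh admin port on 29418 lists in-flight tasks; the 'admin' account name is a placeholder):

    import subprocess
    import time

    # Sample Gerrit's internal task queue every 30 seconds around 01:00 UTC.
    # 'admin' is a placeholder for an account with the appropriate capability.
    CMD = ['ssh', '-p', '29418', 'admin@review.opendev.org', 'gerrit', 'show-queue']

    for _ in range(20):
        stamp = time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime())
        result = subprocess.run(CMD, capture_output=True, text=True, timeout=60)
        print(f'--- {stamp} ---')
        print(result.stdout.strip())
        time.sleep(30)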
clarkbMem: 96.00g total = 57.96g used + 37.65g free + 400.00m buffers18:14
clarkbcurrent memory consumption looks good to me18:15
opendevreviewMerged opendev/base-jobs master: Disable rax job log uploads  https://review.opendev.org/c/opendev/base-jobs/+/93481918:17
opendevreviewJeremy Stanley proposed zuul/zuul-jobs master: Switch logo color in docs pages to dark blue  https://review.opendev.org/c/zuul/zuul-jobs/+/93445319:03
opendevreviewJames E. Blair proposed openstack/project-config master: Fix openstack developer docs promote job  https://review.opendev.org/c/openstack/project-config/+/93483219:58
clarkbfungi: when we create lists we apply either private-default or legacy-default as the style depending on whether or not the list is private or public19:58
clarkbfungi: I suspect that we simply need to update those styles in mm3 to enable bounce processing by default? Not sure if updating a style changes existing lists or only new ones19:59
fungiaha, yeah i was trying to track that down20:00
clarkbupdating the style will not update existing lists20:00
clarkbstyles are initial defaults according to the docs20:01
fungiyep20:01
fungii was just looking for how to adjust it for new lists, figured we'd still need to update the existing ones directly (which can be done in bulk from the cli if we like)20:01
clarkbI'm going to manually do the three opendev lists I manage now20:02
clarkbhuh service-announce and service-incident are already set20:03
clarkbis it possible that the setting is set vhost wide so when I did service-discuss it did all of them? or maybe we just carried over these settings from mm2?20:03
clarkband only some lists had it disabled?20:03
clarkbactually I bet that is what happened20:03
clarkbbut then additionally the legacy-default style must not be enabling it so new lists don't get it?20:04
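For the lists that still have it disabled, flipping the flag can also be scripted against the REST API; a minimal sketch with mailmanclient, assuming the list attribute is exposed as process_bounces and using placeholder credentials:

    from mailmanclient import Client

    # Placeholder REST endpoint and credentials for the mailman core API.
    client = Client('http://localhost:8001/3.1', 'restadmin', 'restpass')

    for fqdn in ('service-discuss@lists.opendev.org',
                 'service-announce@lists.opendev.org',
                 'service-incident@lists.opendev.org'):
        mlist = client.get_list(fqdn)
        settings = mlist.settings
        if not settings['process_bounces']:
            settings['process_bounces'] = True
            settings.save()
            print('enabled bounce processing on', fqdn)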
timburkecould i get a node hold on https://zuul.opendev.org/t/openstack/stream/1358e40883c64c07b05ecdfd9ff8dcba?logfile=console.log ? seems like another hang, but i suspect it's a little different from the last one i saw...20:11
fungitimburke: added20:17
timburkethanks20:18
fungiclarkb: so based on https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/config/docs/config.html#styles it sounds like we can define our own styles https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/styles/docs/styles.html20:22
fungii'm still hunting for an example of where to define them20:22
clarkbfungi: ya the mm3 docs are very sparse on this stuff20:33
fungiyou could have stopped at "the mm3 docs are very sparse"20:34
fungilooks like the default styles are defined in src/mailman/styles/default.py20:36
fungi3.2.0 release notes mention "you can now specify additional styles using the new plugin architecture"20:40
fungiso i think we have to create a mailman plugin with our custom style(s)20:40
fungithe good news is that mailman plugins are written in python20:41
fungihttps://docs.mailman3.org/projects/mailman/en/latest/src/mailman/plugins/docs/intro.html#plugins20:42
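Sketching what such a plugin-provided style could look like, based on the IStyle interface in the mailman docs; the process_bounces attribute, the LegacyDefaultStyle base class, and the wiring into mailman.cfg via a [plugin.*] section are assumptions:

    from mailman.interfaces.styles import IStyle
    from mailman.styles.default import LegacyDefaultStyle
    from public import public
    from zope.interface import implementer


    @public
    @implementer(IStyle)
    class OpenDevDefaultStyle(LegacyDefaultStyle):
        """Legacy defaults, but with bounce processing switched on."""

        name = 'opendev-default'

        def apply(self, mailing_list):
            # Apply the stock legacy-default settings first, then override.
            super().apply(mailing_list)
            mailing_list.process_bounces = True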
clarkbfungi: I think you can define them via the rest api too20:42
clarkbbut it isn't clear how to set all the attributes of that style via the api20:42
clarkbhttps://docs.mailman3.org/projects/mailman/en/latest/src/mailman/styles/docs/styles.html#registering-styles20:43
fungiand doing that idempotently is perhaps another challenge20:43
fungibut maybe still simpler than maintaining a full-blown python project/package20:44
clarkbfor lists and domains we just check if they exist first and create if not20:44
clarkbworks as long as you only have one ansible running at a time20:45
gouthamro/ fungi: frickler mentioned you were aware of the problem with opendevmeetbot joining #openstack-eventlet-removal .. is there a config change i need to make? or something i can seek help for?20:47
gouthamropendevmeet* 20:47
clarkbas a heads up I approved https://review.opendev.org/c/zuul/zuul-jobs/+/934243 to trigger rtd with curl20:52
fungigouthamr: i wasn't aware of a problem20:53
fungii see the channel is listed in the supybot.networks.oftc.channels setting of /var/lib/limnoria/limnoria.config on eavesdrop01, which should suffice20:54
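A quick sanity check that the deployed config really has the channel before bouncing the bot; the file path and key are from the line above, and the space-separated value format is an assumption:

    # Verify the new channel is present in the deployed limnoria config.
    CONFIG = '/var/lib/limnoria/limnoria.config'
    CHANNEL = '#openstack-eventlet-removal'

    with open(CONFIG) as fh:
        for line in fh:
            if line.startswith('supybot.networks.oftc.channels:'):
                channels = line.split(':', 1)[1].split()
                print(CHANNEL, 'found' if CHANNEL in channels else 'MISSING')
                break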
gouthamroh, I tried to start a meeting there and nothing happened20:54
clarkbgouthamr: is the bot in the room?20:54
gouthamrso I thought opendevmeet needed to be in the channel20:54
gouthamrNope20:54
clarkbok I suspect the issue is we don't auto restart that bot because it is disruptive20:55
gouthamri op’ed and invited it too, in case that worked :)20:55
fungiit's not in the channel, but yes it needs to be20:55
clarkbwe probably just need to manually restart the bot20:55
fungioh, right, checking...20:55
clarkbwe don't auto restart because it impacts running meetings so usually we wait for an empty meeting block of time to restart it20:55
fungiyeah, the current process has been running since some time last year20:55
clarkbprobably a good idea to confirm the configuration updated before restarting but ya I suspect that is all that is needed20:56
fungiit's been so long since we added a new channel, i forgot we were doing manual restarts20:57
fungibut anyway, yes, i checked the config and it contains the channel, so the config update did deploy successfully20:57
opendevreviewMerged zuul/zuul-jobs master: Use "curl" to trigger rtd.org webhook with http basic auth  https://review.opendev.org/c/zuul/zuul-jobs/+/93424320:57
fungithe only meeting held at 21z on tuesdays according to https://meetings.opendev.org/ is the scientific sig, and i don't see it going on in #openstack-meeting so it should be safe to restart the bot container now21:11
fungii'll do that unless there are objections in the next few minutes21:11
clarkbwfm21:14
*** dhill is now known as Guest923321:16
opendevstatusfungi: finished logging21:23
fungigouthamr: ^21:25
clarkbI see the bot in the channel now21:27
fungiyeah, i watched it join as well21:31
timburkefungi, how's that node doing? looks like the job finally timed out -- hopefully whatever stuck process is still running21:44
fungitimburke: it's held, what's your ssh key?21:46
timburkehttps://gist.githubusercontent.com/tipabu/d5f2319c19b2672143b9c153f6a67ebd/raw/0909646bc09ffa2b156d63163a5f4cc506899fd9/gistfile1.txt21:46
fungitimburke: ssh root@23.253.166.11221:47
timburkethanks!21:47
fungiany time21:49
timburkefungi, think i've got what i need -- looks like more motivation to get off eventlet, but i think it's because we were holding the tool wrong this time22:03
fungiyeah, make sure you have the right end22:03
fungiinjury will result22:03
timburkemixing tpool with manually-managed forking might be a recipe for a bad time22:04
timburkegotta watch them tines22:04
fungii've freed that node to be recycled back into our quota22:04
opendevreviewTony Breeds proposed opendev/system-config master: Also include tzdata when installing ARA  https://review.opendev.org/c/opendev/system-config/+/92368423:57
opendevreviewTony Breeds proposed opendev/system-config master: Update ansible-devel job to run on a newer bridge  https://review.opendev.org/c/opendev/system-config/+/93053823:57
