timburke | frickler, it depends on the test job. probe tests (which were the ones affected by https://github.com/eventlet/eventlet/issues/989) run unconstrained | 00:02 |
---|---|---|
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 11:34 |
opendevreview | Suzan Song proposed opendev/git-review master: Support GIT_SSH_COMMAND https://review.opendev.org/c/opendev/git-review/+/934745 | 11:48 |
opendevreview | Suzan Song proposed opendev/git-review master: Support GIT_SSH_COMMAND https://review.opendev.org/c/opendev/git-review/+/934745 | 12:58 |
dtantsur | FYI folks. Not sure if it's a known issue or not, but the opendev.org v6 address does not look reachable here (Deutsche Telekom): https://paste.opendev.org/show/b5WW19Pmlpwk9d8GoPdP/ | 14:23 |
dtantsur | This may explain why it sometimes takes me many seconds to open any page | 14:23 |
frickler | dtantsur: yes, known issue with vexxhost's connectivity | 14:24 |
dtantsur | So you know, good | 14:24 |
fungi | is it still the same as a year or two ago when the prefix they were announcing into bgp was too small and was dropped by some providers' filters? | 14:29 |
fungi | (prefix too long, technically, network too small) | 14:30 |
frickler | I didn't check yet this time. but a /48 isn't too small, the issue was lack of proper database entries/certificates | 14:32 |
fungi | well, i meant too small for the range. at one point iana decided that different ranges would have different minimum network sizes for bgp announcements, at least back in the early days of v6 allocation, and a number of backbone providers baked those assumptions into their filters. a /48 was considered valid to announce from some parts of the v6 space but not other parts (similar to how parts of | 14:36 |
fungi | the v4 space were intended for smaller networks and others not) | 14:36 |
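(A toy sketch of the filtering fungi is describing, not any provider's real configuration: announcements whose prefix is "too long" — network too small — for the block they fall in get dropped. The per-block maximums below are invented purely for illustration.)

```python
# Toy illustration of per-block maximum-prefix-length filters; the policy
# table here is made up for the example, not real backbone routing policy.
import ipaddress

MAX_PREFIXLEN = {
    ipaddress.ip_network("2001::/16"): 32,  # hypothetical: only large aggregates accepted
    ipaddress.ip_network("2600::/12"): 48,  # hypothetical: /48 announcements accepted
}

def accepted(announcement: str) -> bool:
    """Return True if this toy filter would accept the announcement."""
    net = ipaddress.ip_network(announcement)
    for block, max_len in MAX_PREFIXLEN.items():
        if net.subnet_of(block):
            return net.prefixlen <= max_len
    return False  # no matching policy: dropped in this toy model

print(accepted("2001:db8::/48"))     # False: /48 is "too long" for this block's limit
print(accepted("2604:e100:3::/48"))  # True under the invented 2600::/12 rule
```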
fungi | by "lack of proper certificates" do you mean providers are expecting rpki now? | 14:38 |
fungi | (rfc 6480) | 14:39 |
*** artom_ is now known as artom | 14:42 | |
clarkb | I'm doing a quick pass of meeting agenda updates. Hoping to send that one shortly. Anything important to add/remove/edit? | 15:45 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update backup verifier to handle purged repos https://review.opendev.org/c/opendev/system-config/+/934768 | 16:01 |
frickler | clarkb: something about promote-openstack-manuals-developer jobs failing? | 16:08 |
opendevreview | Joel Capitao proposed opendev/system-config master: Enable mirroring of CentOS Stream 10 contents https://review.opendev.org/c/opendev/system-config/+/934770 | 16:10 |
clarkb | frickler: ack working on it | 16:10 |
clarkb | trying to pull up an example job log and seems like zuul build search might be slow? Either that or I'm not doing search terms properly | 16:12 |
clarkb | corvus: ^ fyi in case there were db updates with the recent upgrade of zuul last weekend | 16:12 |
clarkb | I found https://review.opendev.org/c/openstack/api-site/+/933901 is that the most recent run? Looks like that is still failing on an undefined variable? | 16:15 |
clarkb | anyway it's in the draft on the wiki and I'll send that out as able with meetings this morning | 16:17 |
frickler | clarkb: build searches with just a job as a filter have been non-working for me for weeks, but I've given up complaining about that | 16:19 |
corvus | clarkb: no db changes | 16:20 |
frickler | combined with a project it works fine, so the above was at least the last one for that repo https://zuul.opendev.org/t/openstack/builds?job_name=promote-openstack-manuals-developer&project=openstack/api-site | 16:21 |
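(For anyone following along, the builds page frickler links maps onto Zuul's REST API; a minimal sketch of the same query is below, with the result field names as assumptions to verify against the actual response.)

```python
# Sketch of querying Zuul's builds API the same way the web UI's /builds page does.
import json
import urllib.parse
import urllib.request

API = "https://zuul.opendev.org/api/tenant/openstack/builds"

def recent_builds(job_name, project, limit=10):
    query = urllib.parse.urlencode(
        {"job_name": job_name, "project": project, "limit": limit})
    with urllib.request.urlopen(f"{API}?{query}", timeout=30) as resp:
        return json.load(resp)

for build in recent_builds("promote-openstack-manuals-developer",
                           "openstack/api-site"):
    print(build.get("end_time"), build.get("result"), build.get("log_url"))
```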
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 16:31 |
clarkb | there are new jitsi meet images today. We should automatically upgrade when daily infra-prod jobs run. I don't have any concerns with this as there are no big events and this usually works | 16:36 |
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 16:41 |
clarkb | sending the meeting agenda has resulted in two service-discuss members having non-zero bounce processing scores | 16:49 |
fungi | sounds like it's working! | 17:10 |
JayF | So I know some folks have reported this some, I saw it *frequently* this weekend (Sunday, mainly), where initial connections to review.opendev.org were failing | 17:13 |
JayF | I can't quite put my finger on what's happening. If I didn't know any better I'd say that sometimes I'm getting a bad DNS result; but it's not consistent enough to really apply meaningful network tools to it. | 17:14 |
JayF | This is v4 only, from Centurylink/(old Qwest network) in .wa.us | 17:14 |
clarkb | we have dnssec enabled, so bad DNS shouldn't be possible; just good dns or no dns | 17:16 |
clarkb | though I guess that depends on client side settings and whether or not you verify | 17:16 |
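(A minimal sketch, assuming dnspython, of the kind of client-side check clarkb means: ask a validating resolver and look for the AD flag; with a signed zone you should see a validated answer or a failure, not forged data. The resolver address is just an example.)

```python
# Sketch: query a validating resolver for review.opendev.org and report whether
# it set the AD (authenticated data) flag. Requires dnspython; whether *your*
# stub resolver actually validates is the real question.
import dns.flags
import dns.message
import dns.query
import dns.rdatatype

RESOLVER = "9.9.9.9"  # any DNSSEC-validating resolver

def check(name="review.opendev.org"):
    query = dns.message.make_query(name, dns.rdatatype.A, want_dnssec=True)
    response = dns.query.udp(query, RESOLVER, timeout=5)
    validated = bool(response.flags & dns.flags.AD)
    addresses = [item.address
                 for rrset in response.answer
                 if rrset.rdtype == dns.rdatatype.A
                 for item in rrset]
    return validated, addresses

print(check())
```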
clarkb | JayF: are you seeing this with ssh or https or both? | 17:18 |
JayF | both | 17:18 |
JayF | well, I'll say, the experience with ssh was weird enough I blamed WSL | 17:18 |
JayF | but I had a patch push that took almost 5 minutes before failing, but it didn't actually fail (the new patchset was in gerrit; just with a delay) | 17:19 |
clarkb | how did the failure manifest? | 17:19 |
JayF | but I saw this a *bunch* of times with firefox on windows. review.opendev.org (failure to connect) multiple times until I shift-refresh | 17:20 |
JayF | once I got any successful connection, it stayed working | 17:20 |
clarkb | fwiw I don't see any ssh connections from you in the logs from saturday, sunday, or monday. You are in there today. (of course I could be doing something wrong or you're using multiple usernames perhaps) | 17:24 |
clarkb | it is worth noting that I sometimes get sad gerrit change pages sometime after they have loaded, because background refreshes or whatever gerrit does to reload things (I think it polls for status updates like new patchset available and new comments) will occur when my local network is sad (for example during system updates that touch networking and shut it down temporarily) | 17:29 |
clarkb | this is distinct from `open a new tab and connect to gerrit there` | 17:29 |
clarkb | it sounds like you're talking about open new tab and see change there, but if it is the former situation instead I think this is expected unfortunately as long as gerrit does polling | 17:29 |
frickler | hmm three release job failures in a row, all POST_FAILURE without logs available | 17:40 |
frickler | all in the last 10 minutes | 17:42 |
clarkb | looking at live streamed logs it appears to be some problem with rax_* uploads | 17:43 |
JayF | clarkb: apparently the slow SSH was yesterday, around 5:10pm, looking at my patch log | 17:43 |
JayF | clarkb: (PST) | 17:43 |
JayF | clarkb: https://review.opendev.org/c/openstack/ironic/+/931055 patchset 6 here | 17:44 |
clarkb | not seeing any tracebacks in ansible or anything like that, just failed results attempting to upload to swift in at least rax dfw and rax iad | 17:45 |
clarkb | we can disable rax_* for uploads if it persists | 17:45 |
JayF | Weird that you don't see that in the logs though | 17:45 |
JayF | oh, other thread | 17:45 |
clarkb | JayF: I think those are today's logs because 5:10 pst is after 00:00 UTC | 17:46 |
frickler | https://zuul.opendev.org/t/openstack/builds?result=POST_FAILURE&skip=100 looks like it might have started around 15 UTC | 17:46 |
JayF | clarkb: ...does something happen at 5pm PST which would cause slowness? Like a backup that's at 00:00 no splay? | 17:46 |
clarkb | JayF: I think gerrit does some internal processes at 00:00 UTC | 17:46 |
JayF | well the timing lines up 100%, and that push took me *three attempts* | 17:47 |
clarkb | but repacking and actual backups don't run around then | 17:47 |
JayF | so maybe something underscaled or a job that's growing linearly or worse with patchset #? /me just guessing | 17:47 |
clarkb | JayF: can you clarify how it failed? You said it succeeded so I'm wondering what the failure case looks like or even just how you know it failed | 17:47 |
clarkb | slow != failed | 17:48 |
clarkb | frickler: or earlier? there are also blocks around 0700-0800? | 17:48 |
clarkb | frickler: but we run a lot more jobs around 1500 UTC than 0800 UTC | 17:49 |
fungi | https://rackspace.service-now.com/system_status?id=service_status&service=af7982f0db6cf200e93ff2e9af96198d | 17:50 |
fungi | it's rackspace keystone-ish again | 17:50 |
JayF | clarkb: 3 total attempts: first time: long hang during "Processing changes " spinner, I waited maybe 90 seconds, ^c, retried. second time: error message after a REALLY LONG TIME. Minutes, easily, but it actually worked because... third time: rejected after around 30+ seconds because it was missing a revision (because the second time worked even if it errored) | 17:50 |
fungi | oh, though that incident today says it ended 16 hours ago | 17:50 |
JayF | clarkb: I'll also note that after the second time, I refreshed, and it wasn't in the web ui yet. So it almost seems like the ssh connection died/err'd before the backend processing of the change completed | 17:51 |
fungi | i wonder if it's picked back up again and their status dashboard hasn't been updated to reflect that yet | 17:51 |
corvus | JayF: possible that it's actually the first that succeeded. i had a similar experience but did not do your step #2. | 17:51 |
clarkb | JayF: in the ssh log I see a killed git-receive-pack. I suspect this corresponds to your ^C. | 17:51 |
clarkb | it ran for ~67seconds supposedly | 17:52 |
JayF | that fits, I am impatient enough I would believe I overestimated the wait | 17:52 |
frickler | fungi: most recent one started 11:27 CST according to that, whatever time that is | 17:53 |
clarkb | JayF: importantly I'm not seeing any evidence of connectivity problems | 17:53 |
clarkb | that doesn't mean http isn't having them but on the ssh side I suspect that patience is simply needed | 17:53 |
JayF | yeah, it "felt" much more like software slowness or I/O slowness | 17:53 |
JayF | in fact I specifically looked for I/O issues because WSL is ... bad at I/O | 17:54 |
JayF | and network transfers were in the hundreds of kb/s | 17:54 |
fungi | frickler: oh, they just updated it after i pulled it up | 17:54 |
fungi | 11:27 cst is 17:27 utc | 17:54 |
fungi | so 28 minutes ago | 17:55 |
clarkb | oh and this occurred at ~0100 UTC today (since DST went away that lines up with ~1700 PST) | 17:55 |
clarkb | fungi: ack so probably keystone again then? Do we want to pull those regions out for now? | 17:55 |
fungi | probably a good idea, seems like they keep having problems and no idea how long it will take to resolve this time or when the next incident will occur | 17:56 |
fungi | two incidents on thursday and now two more (so far) today, just 5 days later | 17:57 |
frickler | +1 | 17:57 |
fungi | though today's looks like it's just ord not other regions | 17:57 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Disable rax job log uploads https://review.opendev.org/c/opendev/base-jobs/+/934819 | 17:58 |
fungi | (according to their incident info) | 17:58 |
fungi | hopefully that change won't be blocked by the same problem it's trying to work around | 17:59 |
clarkb | it probably will be | 17:59 |
fungi | might get lucky if it's really only ord that's broken | 18:00 |
clarkb | fwiw I'm running a gerrit show-caches --show-jvm right now to see if there are any obvious memory issues that may be contributing to JayF's thing. | 18:00 |
frickler | that patch disables ovh, not rax?!? | 18:00 |
clarkb | frickler: yes in base-test | 18:00 |
fungi | frickler: you're looking at base-test | 18:01 |
clarkb | that way we can run jobs against base-test and confirm that rax is working again before we revert | 18:01 |
frickler | ah, right | 18:01 |
clarkb | in addition to memory and potential gerrit daily tasks another idea is that maybe the jgit update that came with the most recent point release could be contributing. We upgraded what 3 weeks ago? | 18:02 |
clarkb | also could be the AI bot army | 18:02 |
fungi | october 30 according to ps | 18:02 |
fungi | (was the last restart) | 18:02 |
clarkb | fungi: that restart would've been for cache changes not the software update | 18:03 |
fungi | oh, right, that was the config update | 18:03 |
clarkb | I suppose caches could also be at fault but we made them bigger which you'd expect would make things faster not slower | 18:03 |
corvus | clarkb: JayF https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-11-07.log.html#t2024-11-07T01:09:14 there's the timing for my report of a similar problem | 18:03 |
clarkb | corvus: thats also ~0100 UTC so something about that time may be the thread to pull on | 18:03 |
corvus | ++ | 18:04 |
JayF | if I remember, around that time today I might push something chonky to sandbox | 18:04 |
clarkb | I did check that git repack and borg backups don't run then, so it's probably something internal to Gerrit or some external thing also on a timer | 18:04 |
clarkb | external to opendev I mean | 18:04 |
clarkb | like meta bot going crazy or something | 18:04 |
fungi | clarkb: 934819 failed due to not adding spaces after your hashmarks | 18:06 |
clarkb | one sec | 18:06 |
fungi | which, separately, seems like a silly rule | 18:06 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Disable rax job log uploads https://review.opendev.org/c/opendev/base-jobs/+/934819 | 18:07 |
clarkb | JayF: corvus: I'll try to run a gerrit show-queue at 0100 today and every 30 seconds or so | 18:09 |
clarkb | in theory that would show the tasks that Gerrit might be running internally and their runtimes | 18:13 |
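(Roughly what that polling could look like; a sketch with a placeholder ssh user/host, writing timestamped show-queue snapshots to a file.)

```python
# Sketch of polling `gerrit show-queue` every 30 seconds for a while around
# 01:00 UTC; the ssh user/host are placeholders for whatever admin access is used.
import subprocess
import time
from datetime import datetime, timezone

CMD = ["ssh", "-p", "29418", "admin@review.opendev.org",
       "gerrit", "show-queue", "--wide", "--by-queue"]

def poll(duration_s=1800, interval_s=30, logfile="show-queue.log"):
    deadline = time.monotonic() + duration_s
    with open(logfile, "a") as log:
        while time.monotonic() < deadline:
            stamp = datetime.now(timezone.utc).isoformat()
            result = subprocess.run(CMD, capture_output=True, text=True)
            log.write(f"=== {stamp} ===\n{result.stdout}\n")
            log.flush()
            time.sleep(interval_s)

if __name__ == "__main__":
    poll()
```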
clarkb | Mem: 96.00g total = 57.96g used + 37.65g free + 400.00m buffers | 18:14 |
clarkb | current memory consumption looks good to me | 18:15 |
opendevreview | Merged opendev/base-jobs master: Disable rax job log uploads https://review.opendev.org/c/opendev/base-jobs/+/934819 | 18:17 |
opendevreview | Jeremy Stanley proposed zuul/zuul-jobs master: Switch logo color in docs pages to dark blue https://review.opendev.org/c/zuul/zuul-jobs/+/934453 | 19:03 |
opendevreview | James E. Blair proposed openstack/project-config master: Fix openstack developer docs promote job https://review.opendev.org/c/openstack/project-config/+/934832 | 19:58 |
clarkb | fungi: when we create lists we apply either private-default or legacy-default as the style depending on whether or not the list is private or public | 19:58 |
clarkb | fungi: I suspect that we simply need to update those styles in mm3 to enable bounce processing by default? Not sure if updating a style changes existing lists or only new ones | 19:59 |
fungi | aha, yeah i was trying to track that down | 20:00 |
clarkb | updating the style will not update existing lists | 20:00 |
clarkb | styles are initial defaults according to the docs | 20:01 |
fungi | yep | 20:01 |
fungi | i was just looking for how to adjust it for new lists, figured we'd still need to update the existing ones directly (which can be done in bulk from the cli if we like) | 20:01 |
clarkb | I'm going to manually do the three opendev lists I manage now | 20:02 |
clarkb | huh service-announce and service-incident are already set | 20:03 |
clarkb | is it possible that the setting is set vhost wide so when I did service-discuss it did all of them? or maybe we just carried over these settings from mm2? | 20:03 |
clarkb | and only some lists had it disabled? | 20:03 |
clarkb | actually I bet that is what happened | 20:03 |
clarkb | but then additionally the legacy-default style must not be enabling it so new lists don't get it? | 20:04 |
timburke | could i get a node hold on https://zuul.opendev.org/t/openstack/stream/1358e40883c64c07b05ecdfd9ff8dcba?logfile=console.log ? seems like another hang, but i suspect it's a little different from the last one i saw... | 20:11 |
fungi | timburke: added | 20:17 |
timburke | thanks | 20:18 |
fungi | clarkb: so based on https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/config/docs/config.html#styles it sounds like we can define our own styles https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/styles/docs/styles.html | 20:22 |
fungi | i'm still hunting for an example of where to define them | 20:22 |
clarkb | fungi: ya the mm3 docs are very sparse on this stuff | 20:33 |
fungi | you could have stopped at "the mm3 docs are very sparse" | 20:34 |
fungi | looks like the default styles are defined in src/mailman/styles/default.py | 20:36 |
fungi | 3.2.0 release notes mention "you can now specify additional styles using the new plugin architecture" | 20:40 |
fungi | so i think we have to create a mailman plugin with our custom style(s) | 20:40 |
fungi | the good news is that mailman plugins are written in python | 20:41 |
fungi | https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/plugins/docs/intro.html#plugins | 20:42 |
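(A minimal sketch of what such a plugin style might look like, not a tested implementation: reuse the stock legacy-default settings, then flip bounce processing on. The LegacyDefaultStyle import and the process_bounces/bounce_score_threshold attribute names are assumptions to verify against the deployed Mailman version.)

```python
# Sketch of a custom list style registered via the plugin architecture; the
# base-class import and the bounce-related attribute names are assumptions.
from public import public
from zope.interface import implementer

from mailman.interfaces.styles import IStyle
from mailman.styles.default import LegacyDefaultStyle


@public
@implementer(IStyle)
class OpenDevDefaultStyle:
    """The stock legacy-default settings plus bounce processing enabled."""

    name = 'opendev-default'

    def apply(self, mailing_list):
        # Start from the stock defaults, then override bounce handling.
        LegacyDefaultStyle().apply(mailing_list)
        mailing_list.process_bounces = True
        mailing_list.bounce_score_threshold = 5
```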
clarkb | fungi: I think you can define them via the rest api too | 20:42 |
clarkb | but it isn't clear how to set all the attributes of that style via the api | 20:42 |
clarkb | https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/styles/docs/styles.html#registering-styles | 20:43 |
fungi | and doing that idempotently is perhaps another challenge | 20:43 |
fungi | but maybe still simpler than maintaining a full-blown python project/package | 20:44 |
clarkb | for lists and domains we just check if they exist first and create if not | 20:44 |
clarkb | works as long as you only have one ansible running at a time | 20:45 |
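(And the bulk update of existing lists fungi mentions could look something like this with mailmanclient; the URL and credentials are placeholders, and the process_bounces setting name is again an assumption.)

```python
# Sketch: enable bounce processing on every existing list via the REST API.
# URL and credentials are placeholders; intended as a one-off manual run.
from mailmanclient import Client

client = Client('http://localhost:8001/3.1', 'restadmin', 'REST_PASSWORD')

for mlist in client.lists:
    settings = mlist.settings
    settings['process_bounces'] = True
    settings.save()
    print(f'enabled bounce processing on {mlist.fqdn_listname}')
```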
gouthamr | o/ fungi: frickler mentioned you were aware of the problem with opendevmeetbot joining #openstack-eventlet-removal .. is there a config change i need to make? or something i can seek help for? | 20:47 |
gouthamr | opendevmeet* | 20:47 |
clarkb | as a heads up I approved https://review.opendev.org/c/zuul/zuul-jobs/+/934243 to trigger rtd with curl | 20:52 |
fungi | gouthamr: i wasn't aware of a problem | 20:53 |
fungi | i see the channel is listed in the supybot.networks.oftc.channels setting of /var/lib/limnoria/limnoria.config on eavesdrop01, which should suffice | 20:54 |
gouthamr | oh, I tried to start a meeting there and nothing happened | 20:54 |
clarkb | gouthamr: is the bot in the room? | 20:54 |
gouthamr | so I thought opendevmeet needed to be in the channel | 20:54 |
gouthamr | Nope | 20:54 |
clarkb | ok I suspect the issue is we don't auto restart that bot because it is disruptive | 20:55 |
gouthamr | i op’ed and invited it too, in case that worked :) | 20:55 |
fungi | it's not in the channel, but yes it needs to be | 20:55 |
clarkb | we probably just need to manually restart the bot | 20:55 |
fungi | oh, right, checking... | 20:55 |
clarkb | we don't auto restart because it impacts running meetings so usually we wait for an empty meeting block of time to restart it | 20:55 |
fungi | yeah, the current process has been running since some time last year | 20:55 |
clarkb | probably a good idea to confirm the configuration updated before restarting but ya I suspect that is all that is needed | 20:56 |
fungi | it's been so long since we added a new channel, i forgot we were doing manual restarts | 20:57 |
fungi | but anyway, yes, i checked the config and it contains the channel, so the config update did deploy successfully | 20:57 |
opendevreview | Merged zuul/zuul-jobs master: Use "curl" to trigger rtd.org webhook with http basic auth https://review.opendev.org/c/zuul/zuul-jobs/+/934243 | 20:57 |
fungi | the only meeting held at 21z on tuesdays according to https://meetings.opendev.org/ is the scientific sig, and i don't see it going on in #openstack-meeting so it should be safe to restart the bot container now | 21:11 |
fungi | i'll do that unless there are objections in the next few minutes | 21:11 |
clarkb | wfm | 21:14 |
*** dhill is now known as Guest9233 | 21:16 | |
opendevstatus | fungi: finished logging | 21:23 |
fungi | gouthamr: ^ | 21:25 |
clarkb | I see the bot in the channel now | 21:27 |
fungi | yeah, i watched it join as well | 21:31 |
timburke | fungi, how's that node doing? looks like the job finally timed out -- hopefully whatever stuck process is still running | 21:44 |
fungi | timburke: it's held, what's your ssh key? | 21:46 |
timburke | https://gist.githubusercontent.com/tipabu/d5f2319c19b2672143b9c153f6a67ebd/raw/0909646bc09ffa2b156d63163a5f4cc506899fd9/gistfile1.txt | 21:46 |
fungi | timburke: ssh root@23.253.166.112 | 21:47 |
timburke | thanks! | 21:47 |
fungi | any time | 21:49 |
timburke | fungi, think i've got what i need -- looks like more motivation to get off eventlet, but i think it's because we were holding the tool wrong this time | 22:03 |
fungi | yeah, make sure you have the right end | 22:03 |
fungi | injury will result | 22:03 |
timburke | mixing tpool with manually-managed forking might be a recipe for a bad time | 22:04 |
timburke | gotta watch them tines | 22:04 |
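(A toy illustration of the tpool-plus-fork hazard timburke describes, not Swift code: tpool's worker threads live in the parent, and a fork()ed child inherits the tpool bookkeeping but not the threads, so the child's first tpool call is where hangs tend to show up.)

```python
# Toy sketch of the hazard: eventlet.tpool starts real OS threads in the parent;
# after os.fork() the child inherits tpool's "already initialized" state but not
# the worker threads, so tpool calls in the child are prone to hanging.
import os
from eventlet import tpool

def demo():
    tpool.execute(sum, range(10))  # parent now has live tpool worker threads
    pid = os.fork()
    if pid == 0:
        # Child: inherited tpool state, no inherited workers -- risky territory.
        tpool.execute(sum, range(10))
        os._exit(0)
    os.waitpid(pid, 0)

if __name__ == "__main__":
    demo()
```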
fungi | i've freed that node to be recycled back into our quota | 22:04 |
opendevreview | Tony Breeds proposed opendev/system-config master: Also include tzdata when installing ARA https://review.opendev.org/c/opendev/system-config/+/923684 | 23:57 |
opendevreview | Tony Breeds proposed opendev/system-config master: Update ansible-devel job to run on a newer bridge https://review.opendev.org/c/opendev/system-config/+/930538 | 23:57 |