opendevreview | gnuoy proposed openstack/project-config master: Create mono repo for sunbeam charms https://review.opendev.org/c/openstack/project-config/+/900167 | 11:29 |
*** elodilles_pto is now known as elodilles | 11:39 | |
opendevreview | gnuoy proposed openstack/project-config master: Create mono repo for sunbeam charms https://review.opendev.org/c/openstack/project-config/+/900167 | 14:04 |
opendevreview | Merged openstack/project-config master: Remove OSA rsyslog noop jobs once repo content is removed https://review.opendev.org/c/openstack/project-config/+/863087 | 14:45 |
opendevreview | gnuoy proposed openstack/project-config master: Create mono repo for sunbeam charms https://review.opendev.org/c/openstack/project-config/+/900167 | 14:45 |
tonyb | fungi, clarkb: IIUC the next step in migrating mirror nodes to jammy is for someone with infra-root to boot a new node in each region. Then I/we can add those to the proper inventory and frobnicate DNS | 15:01 |
fungi | tonyb: sounds right | 15:02 |
tonyb | .... followed by cleanup etc | 15:02 |
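As an illustration of the boot step described above, a minimal sketch using plain openstackclient; the cloud, image, flavor, key, network, and host names are all placeholders, and this is not necessarily the actual launch tooling OpenDev uses:

```shell
# Hypothetical example: boot a replacement jammy mirror node in one region.
# Every name here (cloud, image, flavor, key, network, hostname) is a placeholder.
openstack --os-cloud example-provider server create \
  --image ubuntu-jammy \
  --flavor standard-8g \
  --key-name infra-root-keys \
  --network public \
  mirror02.example-region.opendev.org

# Once the node is reachable, it would be added to the Ansible inventory and
# DNS updated to point at the new address, followed by cleanup of the old host.
```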
tonyb | I don't see any docs in system-config/docs/source about meetpad/jvb. In the bionic-servers etherpad I see "no real local state." I'm trying to determine what if any local state there is. Otherwise it looks really easy to test new meetpad/jvb nodes on jammy | 16:03 |
clarkb | tonyb: because we don't do recordings etc I think the only "state" as such is the CA stuff. However, because it is Java it isn't actually checking for proper trust amongst the CAs; instead it just wants to see a signed cert and then do SSL | 16:06 |
clarkb | it's weird, but ya I think it simplifies things | 16:06 |
*** gouthamr_ is now known as gouthamr | 16:09 | |
clarkb | fungi: for the meeting agenda do we want to keep mm3 on the list to do a recap or should we pull it off (I believe all tasks are completed now) | 16:41 |
clarkb | when I scheduled the gerrit upgrade for 15:30 UTC I failed to remember that we'd be dropping DST this last weekend. That's fine, I'll just get up early | 16:51 |
clarkb | more of a "DST strikes again!" situation | 16:52 |
clarkb | fungi: I guess one outstanding mm3 item is to figure out the error we had post-upgrade here: https://paste.opendev.org/show/bc7jfeZCt97fZm0dCPKw/ Compressing... Error parsing template socialaccount/signup.html: socialaccount/base.html | 16:57 |
fungi | clarkb: oh, right. i'll look into that. i guess keep it on the agenda for now | 17:06 |
tonyb | fungi: Possibly https://lists.mailman3.org/archives/list/mailman-users@mailman3.org/thread/MGY6JA6O7BWGR2KNKD3PQTMW7ZY7NHS3/ ... pip install django_allauth\<0.58 | 17:10 |
fungi | tonyb: supposed to be fixed in latest django-allauth i thought | 17:11 |
tonyb | Okay | 17:11 |
clarkb | looks like the newer version fixed it. But maybe our pip resolution didn't pull it in? | 17:12 |
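For reference, a quick way to check which django-allauth version is installed versus what the resolver can see, plus the pin-below-0.58 workaround from the thread linked above (a sketch for a plain pip environment; versions shown are illustrative):

```shell
# Show what is currently installed, if anything.
pip show django-allauth

# List the versions the resolver can see (requires pip >= 21.2).
pip index versions django-allauth

# Workaround from the mailman-users thread: pin below 0.58.
pip install 'django_allauth<0.58'
```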
corvus | clarkb fungi tonyb hiya; i wrote a change to the nodepool-openstack functional test (the test that runs devstack and then runs nodepool against it). part of that test boots a node (in potentially nested vm) and keyscans it. i noticed that it was taking an unusually long time to boot and start ssh, sometimes causing the test to fail, so my change increases the time we wait for the node to boot. the extra time could be natural evolution (maybe | 17:15 |
corvus | some new thing on the image is just a little slower, and it's a straw-that-broke-the-camel's-back situation). or it could be a string of bad luck in noisy-neighbor scheduling on our test nodes. or it could be something more substantial either in the devstack cloud or the test image used. i bring it up here in order to surface the problem so that more than just zuul maintainers would see it. if you have any thoughts about who else should be | 17:15 |
corvus | made aware of this, feel free to ping them and let them know. https://review.opendev.org/900048 is the change to bump the timeout | 17:15 |
clarkb | corvus: TheJulia has been debugging some unnecessary slowness with cloud-init when using config drive (it will dhcp first, then look at config drive and learn it must statically configure things). But I'm 98% sure that those jobs use glean on the images not cloud-init so I don't think that issue is related | 17:16 |
clarkb | other than that I can't think of anything that might be slowing down the test node booting process | 17:17 |
TheJulia | speaking of, you guys can nuke that held node now | 17:17 |
opendevreview | Tony Breeds proposed opendev/system-config master: Add Tony Breeds to base_users for all hosts https://review.opendev.org/c/opendev/system-config/+/900220 | 17:17 |
TheJulia | since I think the outstanding questions got answered on Friday | 17:17 |
clarkb | TheJulia: will do | 17:17 |
clarkb | done. | 17:18 |
clarkb | fungi: should I also delete the etherpad 1.9.4 hold since we have upgraded? | 17:19 |
corvus | clarkb: yeah, i think they do use glean. they are centos stream 9 images. | 17:23 |
corvus | here is the log output from one of the runs with an embedded console log: https://zuul.opendev.org/t/zuul/build/78382647905044ff8287dd03da8c154f/log/podman/nodepool_nodepool-launcher_1.txt#8019 | 17:24 |
fungi | clarkb: oh, yes don't need the held nodes for etherpad any longer | 17:27 |
clarkb | fungi: all done | 17:28 |
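For reference, a sketch of how held nodes can be listed and released via the autohold interface; the tenant and request id are placeholders, an admin auth token may be required, and this may not be exactly how infra-root performs the cleanup:

```shell
# List outstanding autohold requests for a tenant, then delete the one
# that is no longer needed so the held node gets cleaned up.
zuul-client --zuul-url https://zuul.opendev.org autohold-list --tenant openstack
zuul-client --zuul-url https://zuul.opendev.org autohold-delete --tenant openstack <request-id>
```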
fungi | clarkb: we held the mailman upgrade change until django-allauth got fixed and then i rechecked it to confirm the new images had the fixed version | 17:28 |
clarkb | fungi: huh, maybe it has to do with the upgrade then? | 17:28 |
tonyb | corvus: I haven't looked at a centos boot for a while but almost a minute here seems excessive: https://zuul.opendev.org/t/zuul/build/78382647905044ff8287dd03da8c154f/log/podman/nodepool_nodepool-launcher_1.txt#8597 | 17:30 |
fungi | clarkb: tonyb running `sudo docker-compose -f /etc/mailman-compose/docker-compose.yaml exec mailman-web pip list|grep allauth` on lists01 reports "django-allauth 0.58.1" by the way | 17:31 |
tonyb | fungi: Okay. That verifies that. | 17:32 |
clarkb | tonyb: corvus: these are doing qemu and not nested virt. So slowness is to be expected. I think generous timeouts are a good idea. Also I agree that looks like glean and no cloud-init to me | 17:32 |
clarkb | as for why there is a big time gap where tonyb points, I'm not sure. That looks like systemd working through rc scripts and /etc settings for unit files | 17:33 |
clarkb | maybe it is waiting for a non RO filesystem ? | 17:33 |
tonyb | Could be. | 17:35 |
* tonyb is grabbing a CentOS image to poke at | 17:37 | |
clarkb | tonyb: note the images in these tests are built from scratch using the -minimal image | 17:37 |
clarkb | er -minimal element so they won't have cloud-init in them for example. Maybe not necessary to see why early systemd is slow, but want to call it out | 17:38 |
fungi | if you have the bandwidth, you can just download an already built one from https://nb01.opendev.org/images/ (or nb01) | 17:38 |
fungi | (er, or nb02) | 17:38 |
tonyb | Thanks. I was prepared for differences. Just wanted to poke around locally | 17:38 |
tonyb | ... or indeed I could do that :) | 17:39 |
fungi | https://nb02.opendev.org/images/centos-9-stream-60d70c487274458ab3d42567cce05714.qcow2 is from 2023-11-05 15:46 and 7.6G in size | 17:40 |
fungi | so pretty sure that's the newest one | 17:40 |
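If poking at the downloaded image locally, the usual systemd timing tools can show where that early-boot gap goes (a sketch, assuming the guest boots far enough to give you a shell):

```shell
# Which units took the longest during boot.
systemd-analyze blame

# The dependency chain that gated reaching the default target.
systemd-analyze critical-chain

# Journal entries for the root-filesystem remount, in case the delay really
# is waiting for a writable filesystem as speculated above.
journalctl -b -u systemd-remount-fs.service
```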
corvus | unfortunately, i don't think we get a useful console log unless the boot fails, which makes it hard to compare with earlier successful builds | 17:42 |
fungi | could collect syslog from the successfully booted one? | 17:43 |
fungi | oh, i get it. you're saying we don't have earlier records, right | 17:43 |
corvus | yeah would be a good idea. the job attempts to collect a console log outside of nodepool, but all it gets is grub. see for example this earlier successful build: https://zuul.opendev.org/t/zuul/build/2b10d60c2d804a3a9afdce06f50d7653/log/instances/57a231ba-c4f7-40bc-bed3-b5790d906a22/console.log | 17:44 |
corvus | (yeah, so 2 probs: 1 we don't have the collection in place to record successful console logs (or its broken); 2 since that's true, we'll never have old data to compare it to; we can only fix that going forward) | 17:46 |
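For future comparisons, the serial console of a successfully booted instance can also be fetched from the compute API (a sketch with openstackclient, not the job's actual collection role; the server name is a placeholder):

```shell
# Save the console output of a booted instance so successful runs leave a
# full boot log behind, not just grub.
openstack console log show <server-uuid-or-name> > console.log
```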
tonyb | For a one-off reference we could potentially use an older 9-stream compose and generate an image from there. I don't think that'd be too much work | 18:06 |
tonyb | Looks like we could go back in time by about a month: https://composes.stream.centos.org/production/ | 18:09 |
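A rough sketch of that idea with diskimage-builder, assuming the centos-minimal element can be pointed at an older compose via DIB_DISTRIBUTION_MIRROR; the compose id below is a placeholder and the exact repo layout the element expects would need checking:

```shell
# Hypothetical: build a reference image from an older CentOS Stream 9 compose.
export DIB_RELEASE=9-stream
export DIB_DISTRIBUTION_MIRROR='https://composes.stream.centos.org/production/<older-compose-id>/compose'
disk-image-create -o centos-9-stream-reference centos-minimal vm
```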
clarkb | https://element.io/blog/element-to-adopt-agplv3/ I don't think this will affect us directly as we aren't making changes to matrix services (and they would be publicly hosted even if we did) | 18:14 |
tonyb | "Please see here for The Matrix.org Foundation's position. Contributors to these new repositories will now need to agree to a standard Apache CLA with Element before their PRs can be merged - this replaces the previous DCO sign-off mechanism." Interesting they're switching away from DCO | 18:21 |
clarkb | tonyb: it is because they need a CLA to allow them to sell the code under other license terms | 18:22 |
clarkb | they mention that explicitly later in the post | 18:22 |
clarkb | (which is good that they aren't trying to be sneaky about it) | 18:22 |
tonyb | Okay. | 18:40 |
clarkb | fungi: any idea what the story is with https://review.opendev.org/c/openstack/project-config/+/800442 ? Is that something we just missed? I worry that merging it now would create more problems... | 18:49 |
*** dhill is now known as Guest6059 | 19:42 | |
*** JayF is now known as Guest6061 | 19:55 | |
*** JasonF is now known as JayF | 19:55 | |
*** zigo_ is now known as zigo | 19:57 | |
fungi | i'm not sure. i guess it might be worth checking with the author to see if they still want it created (but also creating new repos in the x/ namespace seems questionable, ideally they would pick a new namespace) | 19:59 |
johnsom | FYI, lots of job failures at the moment. orange post failures | 20:09 |
clarkb | yes it was just brought up in #openstack-infra | 20:13 |
clarkb | appears to be a problem with at least one swift endpoint (rax-iad) | 20:13 |
clarkb | need to spot check others to see if the rest of rax is ok or now | 20:13 |
clarkb | *or not | 20:13 |
clarkb | second one is in rax-dfw so probably a rax problem and not region specific | 20:15 |
clarkb | 2 on rax iad and 1 on rax dfw so far. I think that's enough to push a change to disable rax for now | 20:17 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Temporarily disable log uploads to rax swift https://review.opendev.org/c/opendev/base-jobs/+/900243 | 20:18 |
clarkb | nothing on the rax status page yet. Could be on our end I suppose, but hard to say until we reproduce independently | 20:19 |
clarkb | and I need to eat lunch while it is hot | 20:19 |
johnsom | I am also having lunch, cheers | 20:20 |
clarkb | https://zuul.opendev.org/t/openstack/stream/d3a19b50ca2f405aa9e5020cecf054d9?logfile=console.log is currently uploading to rax iad. I think it is going to fail, but if it doesn't that could be another useful data point | 20:28 |
clarkb | fungi: ^ fyi a change to disable rax log uploads. And a job we can watch to see if it eventually fails | 20:29 |
clarkb | https://zuul.opendev.org/t/openstack/stream/4d91518405774bc9acd1a62a24f7d40d?logfile=console.log is another to watch | 20:30 |
clarkb | fungi: if it were an api/sdk compatibility thing I would expect these jobs to fail quicker | 20:32 |
clarkb | but instead they seem to be doing uploads or attempting them and then failing | 20:32 |
clarkb | ok I just ran openstack container list against IAD and got back a 500 | 20:32 |
clarkb | going to repeat for the other two regions, but I think this is not on our end | 20:33 |
fungi | done with an old known-working cli/sdk install? | 20:33 |
clarkb | it works against ORD but not IAD and DFW | 20:33 |
fungi | aha, yep that sounds like it's on the service end | 20:33 |
clarkb | fungi: just using what we've got installed on bridge, but it works against ORD. But also servers shouldn't report 500 internal errors on api compatibility issues | 20:33 |
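A spot check like the one described can be scripted per region with openstackclient (a sketch; the cloud name "rax" is an assumption and depends on the local clouds.yaml):

```shell
# A 500 from container list in some regions but not others points at the
# service side rather than a client/SDK compatibility problem.
for region in IAD DFW ORD; do
  echo "== $region =="
  openstack --os-cloud rax --os-region-name "$region" container list >/dev/null \
    && echo "ok" || echo "FAILED"
done
```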
fungi | i'm going to bypass zuul to merge that change | 20:33 |
clarkb | fungi: do we want to modify it to keep ord for now? | 20:34 |
clarkb | fungi: that could be good data that rax generally works with our sdk versions | 20:34 |
fungi | clarkb: yeah, push a revision and then i'll bypass testing to merge it | 20:34 |
clarkb | on it | 20:34 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Temporarily disable log uploads to rax dfw and iad swift https://review.opendev.org/c/opendev/base-jobs/+/900243 | 20:35 |
clarkb | done | 20:35 |
opendevreview | Merged opendev/base-jobs master: Temporarily disable log uploads to rax dfw and iad swift https://review.opendev.org/c/opendev/base-jobs/+/900243 | 20:36 |
fungi | #status log Bypassed testing to merge change 900243 as a temporary workaround for an outage in one of our log storage providers | 20:37 |
opendevstatus | fungi: finished logging | 20:37 |
clarkb | fwiw those two builds I linked did eventually fail which gives more evidence towards a problem on the other end | 20:38 |
clarkb | the jobs for 900179 should have all started after the fix landed | 20:49 |
clarkb | s/fix/workaround/ | 20:49 |
clarkb | I can list containers in iad and dfw again | 21:52 |
clarkb | I suspect that we can revert that change. I'll push up a test only change first though to confirm | 21:53 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Force base-test to upload only to rax-iad and rax-dfw https://review.opendev.org/c/opendev/base-jobs/+/900248 | 21:56 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Reset log upload targets https://review.opendev.org/c/opendev/base-jobs/+/900249 | 21:56 |
clarkb | I'm going to self approve that first one so that I can do a round of testing with a test change | 21:57 |
clarkb | I'll recheck https://review.opendev.org/c/zuul/zuul-jobs/+/680178 as my test change once the base-test update lands | 21:59 |
opendevreview | Merged opendev/base-jobs master: Force base-test to upload only to rax-iad and rax-dfw https://review.opendev.org/c/opendev/base-jobs/+/900248 | 22:02 |
clarkb | 680178 jobs lgtm I think we can land 900249 but I won't self approve that one since it affects production jobs | 22:35 |
JayF | fungi: any way to get mailman to send me a clean copy of an email to the list once it's released from moderation? | 22:46 |
JayF | oh, there it goes, it's just very late, weird | 22:47 |
JayF | my response to that in the web ui landed in my inbox before the email I released did | 22:47 |
JayF | spooky | 22:47 |
clarkb | I'm not sure I parsed that. You sent two responses one via email and the other by web ui? | 22:51 |
JayF | I'm a moderator on the MM list. I get the email saying "moderate this, you!" | 22:51 |
JayF | I go moderate it. Wait some time for the email to hit because it'll need a reply. | 22:52 |
JayF | Never hits, so I assume (wrongly at this point) that I don't get my own copy, so I reply in web UI, that reply shows up in my inbox a minute or two later. | 22:52 |
JayF | Literally three minutes later; the original message I approved shows up. | 22:52 |
fungi | JayF: it's possible your mail provider is greylisting messages from the listserver. you should be able to look at the received header chain to determine where the delays were | 22:53 |
clarkb | I think mailman operates internally via cron as well | 22:53 |
JayF | I'm just over here trying to solve the architectural question of how it's possible, except via a delivery failure to my email address, ... yes exactly | 22:53 |
clarkb | may have needed the release email job to execute | 22:53 |
JayF | fungi: the extra-irony is: the email that arrived faster was marked by google as phishing, because it was from:@gr-oss.io and not sent by google | 22:53 |
JayF | that makes sense, actually, from an architectural standpoint why I woulda seen the behavior I expected | 22:54 |
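For completeness, the delivery delay fungi mentions can usually be localized by comparing the timestamps in the Received headers from the bottom (origin) to the top (final delivery); a sketch using formail from the procmail package, with the file name standing in for a saved raw copy of the message:

```shell
# Each Received: header records when that hop accepted the message; a large
# jump between two consecutive hops shows where the delay (e.g. greylisting)
# happened. -c joins folded header lines, -x extracts the named field.
formail -c -x Received: < saved-message.eml
```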
fungi | we can work around that in the latest mailman version by setting specific domains to always get their from addresses rewritten | 22:54 |
JayF | I'm sure it's fine, it got delivered to my inbox so it can't have made google that angry lol | 22:55 |
fungi | google is especially tricky in that regard, because the dmarc policy they publish says to do one thing, but then they disregard it and do something different | 22:55 |
JayF | You know my first tech job was for a spamhou^W email marketing company, right? | 22:55 |
JayF | This stuff was a pain back then, and it's only moved to being more and more hidden and obscure. At least back in the late 2000s they'd still send you feedback when things were blocked or someone marked as spam so you could take action. | 22:56 |
fungi | JayF: yes, my employer at the time was their hosting provider, i think we established | 23:03 |
JayF | HS or P10? | 23:03 |
fungi | so i was the employee tasked with receiving, triaging and forwarding along all the abuse complaints | 23:03 |
fungi | hs | 23:03 |
fungi | looks like whatever the swift blip in dfw/iad was, https://rackspace.service-now.com/system_status never recorded it | 23:12 |
opendevreview | Merged opendev/base-jobs master: Reset log upload targets https://review.opendev.org/c/opendev/base-jobs/+/900249 | 23:36 |