Monday, 2023-11-06

<opendevreview> gnuoy proposed openstack/project-config master: Create mono repo for sunbeam charms
*** elodilles_pto is now known as elodilles  [11:39]
<opendevreview> gnuoy proposed openstack/project-config master: Create mono repo for sunbeam charms
<opendevreview> Merged openstack/project-config master: Remove OSA rsyslog noop jobs once repo content is removed
<opendevreview> gnuoy proposed openstack/project-config master: Create mono repo for sunbeam charms
<tonyb> fungi, clarkb: IIUC the next step in migrating mirror nodes to jammy is for someone with infra-root to boot a new node in each region. Then I/we can add those to the proper inventory and frobnicate DNS  [15:01]
<fungi> tonyb: sounds right  [15:02]
<tonyb> ... followed by cleanup etc  [15:02]
<tonyb> I don't see any docs in system-config/docs/source about meetpad/jvb. In the bionic-servers etherpad I see "no real local state." I'm trying to determine what, if any, local state there is. Otherwise it looks really easy to test new meetpad/jvb nodes on jammy  [16:03]
<clarkb> tonyb: because we don't do recordings etc I think the only "state" as it is is the CA stuff. However, because it is java it isn't actually checking for proper trust amongst the CAs; instead it just wants to see a signed cert and then do ssl  [16:06]
<clarkb> it's weird, but ya I think it simplifies things  [16:06]
*** gouthamr_ is now known as gouthamr  [16:09]
<clarkb> fungi: for the meeting agenda do we want to keep mm3 on the list to do a recap or should we pull it off (I believe all tasks are completed now)  [16:41]
<clarkb> when I scheduled the gerrit upgrade for 15:30 UTC I failed to remember that we'd be dropping DST this last weekend. That's fine, I'll just get up early  [16:51]
<clarkb> more of a "DST strikes again!" situation  [16:52]
<clarkb> fungi: I guess one outstanding mm3 item is to figure out the error we had post upgrade here: Compressing... Error parsing template socialaccount/signup.html: socialaccount/base.html  [16:57]
<fungi> clarkb: oh, right. i'll look into that. i guess keep it on the agenda for now  [17:06]
<tonyb> fungi: Possibly ... pip install django_allauth\<0.58  [17:10]
<fungi> tonyb: supposed to be fixed in latest django-allauth i thought  [17:11]
<clarkb> looks like the newer version fixed it. But maybe our pip resolution didn't pull it in?  [17:12]
<corvus> clarkb fungi tonyb hiya; i wrote a change to the nodepool-openstack functional test (the test that runs devstack and then runs nodepool against it). part of that test boots a node (in a potentially nested vm) and keyscans it. i noticed that it was taking an unusually long time to boot and start ssh, sometimes causing the test to fail, so my change increases the time we wait for the node to boot. the extra time could be natural evolution (maybe  [17:15]
<corvus> some new thing on the image is just a little slower, and it's a straw-that-broke-the-camel's-back situation). or it could be a string of bad luck in noisy-neighbor scheduling on our test nodes. or it could be something more substantial either in the devstack cloud or the test image used. i bring it up here in order to surface the problem so that more than just zuul maintainers would see it. if you have any thoughts about who else should be  [17:15]
<corvus> made aware of this, feel free to ping them and let them know. is the change to bump the timeout  [17:15]
<clarkb> corvus: TheJulia has been debugging some unnecessary slowness with cloud-init when using config drive (it will dhcp first, then look at config drive and learn it must statically configure things). But I'm 98% sure that those jobs use glean on the images, not cloud-init, so I don't think that issue is related  [17:16]
<clarkb> other than that I can't think of anything that might be slowing down the test node booting process  [17:17]
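[editor's note: the timeout bump corvus describes boils down to polling until the booted node's sshd accepts connections. A minimal generic sketch of that wait loop follows — this is not the actual nodepool test code; the function name and timings are illustrative.]

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 300.0,
                  interval: float = 5.0) -> bool:
    """Poll until a TCP port (e.g. sshd on 22) accepts connections.

    Returns True once a connection succeeds, or False if the deadline
    passes first. A generous timeout tolerates slow qemu (non-nested-virt)
    boots at the cost of slower failure detection.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError if the port is not yet open
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```
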
<TheJulia> speaking of, you guys can nuke that held node now  [17:17]
<opendevreview> Tony Breeds proposed opendev/system-config master: Add Tony Breeds to base_users for all hosts
<TheJulia> since I think the outstanding questions got answered on Friday  [17:17]
<clarkb> TheJulia: will do  [17:17]
<clarkb> fungi: should I also delete the etherpad 1.9.4 hold since we have upgraded?  [17:19]
<corvus> clarkb: yeah, i think they do use glean. they are centos stream 9 images.  [17:23]
<corvus> here is the log output from one of the runs with an embedded console log:
<fungi> clarkb: oh, yes, we don't need the held nodes for etherpad any longer  [17:27]
<clarkb> fungi: all done  [17:28]
<fungi> clarkb: we held the mailman upgrade change until django-allauth got fixed and then i rechecked it to confirm the new images had the fixed version  [17:28]
<clarkb> fungi: huh, maybe it has to do with the upgrade then?  [17:28]
<tonyb> corvus: I haven't looked at a centos boot for a while but almost a minute here seems excessive:
<fungi> clarkb: tonyb: running `sudo docker-compose -f /etc/mailman-compose/docker-compose.yaml exec mailman-web pip list|grep allauth` on lists01 reports "django-allauth 0.58.1" by the way  [17:31]
<tonyb> fungi: Okay. That verifies that.  [17:32]
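[editor's note: the pin tonyb floated (`django_allauth<0.58`) versus the fixed release fungi verified (0.58.1) comes down to a version comparison. A tiny sketch of such a check — the 0.58.0 threshold is an assumption drawn from the pin above, and the naive tuple parsing deliberately ignores pre-release suffixes, which a real parser like `packaging.version` handles.]

```python
def version_tuple(v: str) -> tuple:
    # Parse a simple dotted version like "0.58.1" into a comparable tuple.
    # Note: this ignores pre-release suffixes (e.g. "0.58.0b1"); use
    # packaging.version.Version for anything beyond plain dotted versions.
    return tuple(int(part) for part in v.split("."))


# Hypothetical threshold: the template fix is assumed to have landed in 0.58.0.
FIXED_IN = (0, 58, 0)

installed = "0.58.1"  # e.g. what `pip list | grep allauth` reported on lists01
fixed = version_tuple(installed) >= FIXED_IN
```
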
<clarkb> tonyb: corvus: these are doing qemu and not nested virt, so slowness is to be expected. I think generous timeouts are a good idea. Also I agree that looks like glean and no cloud-init to me  [17:32]
<clarkb> as for why there is a big time gap where tonyb points, I'm not sure. That looks like systemd working through rc scripts and /etc settings for unit files  [17:33]
<clarkb> maybe it is waiting for a non-RO filesystem?  [17:33]
<tonyb> Could be.  [17:35]
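[editor's note: one way to spot where a boot stalls, as tonyb did by eyeballing the console log, is to diff consecutive kernel timestamps and report the largest gap. A hedged sketch, assuming dmesg-style `[ seconds.micros]` prefixes on each line.]

```python
import re


def largest_gap(console_log: str):
    """Find the biggest pause between consecutive kernel-timestamped lines.

    Assumes dmesg-style "[  seconds.micros] message" prefixes; returns a
    (gap_seconds, line_before_the_gap) tuple, or (0.0, None) if there are
    fewer than two timestamped lines to compare.
    """
    stamped = []
    for line in console_log.splitlines():
        m = re.match(r"\[\s*(\d+\.\d+)\]\s*(.*)", line)
        if m:
            stamped.append((float(m.group(1)), m.group(2)))
    if len(stamped) < 2:
        return (0.0, None)
    # Pair each line with its successor and keep the largest time delta.
    gaps = [(b[0] - a[0], a[1]) for a, b in zip(stamped, stamped[1:])]
    return max(gaps)
```
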
* tonyb is grabbing a CentOS image to poke at  [17:37]
<clarkb> tonyb: note the images in these tests are built from scratch using the -minimal image  [17:37]
<clarkb> er, -minimal element, so they won't have cloud-init in them for example. Maybe not necessary to see why early systemd is slow, but want to call it out  [17:38]
<fungi> if you have the bandwidth, you can just download an already built one from (or nb01)  [17:38]
<fungi> (er, or nb02)  [17:38]
<tonyb> Thanks. I was prepared for differences. Just wanted to poke around locally  [17:38]
<tonyb> ... or indeed I could do that :)  [17:39]
<fungi> is from 2023-11-05 15:46 and 7.6G in size  [17:40]
<fungi> so pretty sure that's the newest one  [17:40]
<corvus> unfortunately, i don't think we get a useful console log unless the boot fails, which makes it hard to compare with earlier successful builds  [17:42]
<fungi> could collect syslog from the successfully booted one?  [17:43]
<fungi> oh, i get it. you're saying we don't have earlier records, right  [17:43]
<corvus> yeah, would be a good idea. the job attempts to collect a console log outside of nodepool, but all it gets is grub. see for example this earlier successful build:
<corvus> (yeah, so 2 probs: 1. we don't have the collection in place to record successful console logs (or it's broken); 2. since that's true, we'll never have old data to compare it to; we can only fix that going forward)  [17:46]
<tonyb> For a one-off reference we could potentially use an older 9-stream compose and generate an image from there. I don't think that'd be too much work  [18:06]
<tonyb> Looks like we could go back in time by about a month:
<clarkb> I don't think this will affect us directly as we aren't making changes to matrix services (and they would be publicly hosted even if we did)  [18:14]
<tonyb> "Please see here for The Foundation's position. Contributors to these new repositories will now need to agree to a standard Apache CLA with Element before their PRs can be merged - this replaces the previous DCO sign-off mechanism." Interesting they're switching away from DCO  [18:21]
<clarkb> tonyb: it is because they need a CLA to allow them to sell the code under other license terms  [18:22]
<clarkb> they mention that explicitly later in the post  [18:22]
<clarkb> (which is good that they aren't trying to be sneaky about it)  [18:22]
<clarkb> fungi: any idea what the story is with ? Is that something we just missed? I worry that merging it now would create more problems...  [18:49]
*** dhill is now known as Guest6059  [19:42]
*** JayF is now known as Guest6061  [19:55]
*** JasonF is now known as JayF  [19:55]
*** zigo_ is now known as zigo  [19:57]
<fungi> i'm not sure. i guess it might be worth checking with the author to see if they still want it created (but also creating new repos in the x/ namespace seems questionable; ideally they would pick a new namespace)  [19:59]
<johnsom> FYI, lots of job failures at the moment. orange post failures  [20:09]
<clarkb> yes, it was just brought up in #openstack-infra  [20:13]
<clarkb> appears to be a problem with at least one swift endpoint (rax-iad)  [20:13]
<clarkb> need to spot check others to see if the rest of rax is ok or now  [20:13]
<clarkb> *or not  [20:13]
<clarkb> second one is in rax-dfw so probably a rax problem and not region specific  [20:15]
<clarkb> 2 on rax iad and 1 on rax dfw so far. I think that's enough to push a change to disable rax for now  [20:17]
<opendevreview> Clark Boylan proposed opendev/base-jobs master: Temporarily disable log uploads to rax swift
<clarkb> nothing on the rax status page yet. Could be on our end I suppose, but hard to say until we reproduce independently  [20:19]
<clarkb> and I need to eat lunch while it is hot  [20:19]
<johnsom> I am also having lunch, cheers  [20:20]
<clarkb> is currently uploading to rax iad. I think it is going to fail, but if it doesn't that could be another useful data point  [20:28]
<clarkb> fungi: ^ fyi a change to disable rax log uploads. And a job we can watch to see if it eventually fails  [20:29]
<clarkb> is another to watch  [20:30]
<clarkb> fungi: if it were an api/sdk compatibility thing I would expect these jobs to fail quicker  [20:32]
<clarkb> but instead they seem to be doing uploads, or attempting them, and then failing  [20:32]
<clarkb> ok I just ran openstack container list against IAD and got back a 500  [20:32]
<clarkb> going to repeat for the other two regions, but I think this is not on our end  [20:33]
<fungi> done with an old known-working cli/sdk install?  [20:33]
<clarkb> it works against ORD but not IAD and DFW  [20:33]
<fungi> aha, yep, that sounds like it's on the service end  [20:33]
<clarkb> fungi: just using what we've got installed on bridge, but it works against ORD. But also servers shouldn't report Internal errors (500) on api compatibility issues  [20:33]
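[editor's note: the reasoning above — a 500 points at the service, while an SDK/API mismatch would more likely surface as a 4xx — can be captured as a tiny triage helper. Illustrative only; the category strings are made up for this sketch.]

```python
def classify_http_status(status: int) -> str:
    """Rough triage of an HTTP status from a cloud API call.

    A 5xx (like the 500 from `openstack container list`) points at the
    service side; a 4xx is more consistent with a client/SDK compatibility
    problem; 2xx/3xx means the call basically worked.
    """
    if 200 <= status < 400:
        return "ok"
    if 400 <= status < 500:
        return "client-side (check SDK/API compatibility)"
    if 500 <= status < 600:
        return "server-side (likely the provider's problem)"
    return "unexpected"
```
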
<fungi> i'm going to bypass zuul to merge that change  [20:33]
<clarkb> fungi: do we want to modify it to keep ord for now?  [20:34]
<clarkb> fungi: that could be good data that rax generally works with our sdk versions  [20:34]
<fungi> clarkb: yeah, push a revision and then i'll bypass testing to merge it  [20:34]
<clarkb> on it  [20:34]
<opendevreview> Clark Boylan proposed opendev/base-jobs master: Temporarily disable log uploads to rax dfw and iad swift
<opendevreview> Merged opendev/base-jobs master: Temporarily disable log uploads to rax dfw and iad swift
<fungi> #status log Bypassed testing to merge change 900243 as a temporary workaround for an outage in one of our log storage providers  [20:37]
<opendevstatus> fungi: finished logging  [20:37]
<clarkb> fwiw those two builds I linked did eventually fail, which gives more evidence towards a problem on the other end  [20:38]
<clarkb> the jobs for 900179 should have all started after the fix landed  [20:49]
<clarkb> I can list containers in iad and dfw again  [21:52]
<clarkb> I suspect that we can revert that change. I'll push up a test only change first though to confirm  [21:53]
<opendevreview> Clark Boylan proposed opendev/base-jobs master: Force base-test to upload only to rax-iad and rax-dfw
<opendevreview> Clark Boylan proposed opendev/base-jobs master: Reset log upload targets
<clarkb> I'm going to self approve that first one so that I can do a round of testing with a test change  [21:57]
<clarkb> I'll recheck as my test change once the base-test update lands  [21:59]
<opendevreview> Merged opendev/base-jobs master: Force base-test to upload only to rax-iad and rax-dfw
<clarkb> 680178 jobs lgtm. I think we can land 900249 but I won't self approve that one since it affects production jobs  [22:35]
<JayF> fungi: any way to get mailman to send me a clean copy of an email to the list once it's released from moderation?  [22:46]
<JayF> oh, there it goes, it's just very late, weird  [22:47]
<JayF> my response to that in the web ui landed in my inbox before the email I released did  [22:47]
<clarkb> I'm not sure I parsed that. You sent two responses, one via email and the other by web ui?  [22:51]
<JayF> I'm a moderator on the MM list. I get the email saying "moderate this, you!"  [22:51]
<JayF> I go moderate it. Wait some time for the email to hit because it'll need a reply.  [22:52]
<JayF> Never hits, so I assume (wrongly at this point) that I don't get my own copy, so I reply in web UI; that reply shows up in my inbox a minute or two later.  [22:52]
<JayF> Literally three minutes later, the original message I approved shows up.  [22:52]
<fungi> JayF: it's possible your mail provider is greylisting messages from the listserver. you should be able to look at the received header chain to determine where the delays were  [22:53]
<clarkb> I think mailman operates internally via cron as well  [22:53]
<JayF> I'm just over here trying to solve the architectural question of how it's possible, except via a delivery failure to my email address, ... yes exactly  [22:53]
<clarkb> may have needed the release email job to execute  [22:53]
<JayF> fungi: the extra irony is: the email that arrived faster was marked by google as phishing, because it was and not sent by google  [22:53]
<JayF> that makes sense, actually, from an architectural standpoint, why I woulda seen the behavior I expected  [22:54]
<fungi> we can work around that in the latest mailman version by setting specific domains to always get their from addresses rewritten  [22:54]
<JayF> I'm sure it's fine, it got delivered to my inbox so it can't have made google that angry lol  [22:55]
<fungi> google is especially tricky in that regard, because the dmarc policy they publish says to do one thing, but then they disregard it and do something different  [22:55]
<JayF> You know my first tech job was for a spamhou^W email marketing company, right?  [22:55]
<JayF> This stuff was a pain back then, and it's only moved to being more and more hidden and obscure. At least back in the late 2000s they'd still send you feedback when things were blocked or someone marked as spam so you could take action.  [22:56]
<fungi> JayF: yes, my employer at the time was their hosting provider, i think we established  [23:03]
<JayF> HS or P10?  [23:03]
<fungi> so i was the employee tasked with receiving, triaging and forwarding along all the abuse complaints  [23:03]
<fungi> looks like whatever the swift blip in dfw/iad was, never recorded it  [23:12]
<opendevreview> Merged opendev/base-jobs master: Reset log upload targets

Generated by 2.17.3 by Marius Gedminas - find it at!