Tuesday, 2023-12-05

clarkbI've now spent a lot of time drafting docs and remember that I haven't sent an infra meeting agenda yet. I think I'll just have to send that first thing tomorrow01:07
clarkbThe first draft of the zuul document is up though01:10
*** tobias-urdin-pto is now known as tobias-urdin11:15
*** elodilles is now known as elodilles_pto11:21
n0obdoes anyone here have experience with developing the sushy emulator?14:06
fungin0ob: you should ask in the #openstack-ironic channel14:14
fungipretty sure that's the team who maintains sushy14:15
n0obthx fungi!14:15
fungiyep, confirmed: https://governance.openstack.org/tc/reference/projects/ironic.html14:15
fricklerfungi: seems gmail is now rejecting mail via lists.o.o due to missing dkim/spf14:17
frickler550-5.7.26 Gmail requires all senders to authenticate with either SPF or DKIM.14:17
fungii'm going to try not to say something harsh about how people should have already ditched gmail and all of google's other services too, but i suppose adding an spf ~all record for our lists domains only makes me vomit into my mouth a little bit14:19
fungifrickler: want to add that to the meeting agenda? clarkb hasn't sent it out yet anyway14:21
fricklerI'm still in other meetings, just checked this when it was mentioned in the neutron meeting14:22
fungii'll put it on the wiki14:23
fungiactually, i wonder if they'd accept a policy that contained +all (http://www.open-spf.org/SPF_Record_Syntax/ includes that in an example described as "The domain owner thinks that SPF is useless and/or doesn't care.")14:29
fungibut if we actually want it to be an accurate policy, we can i think do "v=spf1 a -all" or "v=spf1 a ~all" or maybe "v=spf1 a ?all" if we're really worried about it14:30
fricklerif you check exim logs they also respond with a link to look at, but it might contain PII so not sharing here14:31
fungiin theory, only our listserv should mail-from those domains anyway14:31
fungifrickler: the good news is that they seem to currently only be rate-limiting our deliveries and deferring14:34
opendevreviewJeremy Stanley proposed opendev/zone-opendev.org master: Add an SPF record for the listserv  https://review.opendev.org/c/opendev/zone-opendev.org/+/90268914:48
opendevreviewJeremy Stanley proposed opendev/zone-zuul-ci.org master: Add an SPF record for the listserv  https://review.opendev.org/c/opendev/zone-zuul-ci.org/+/90269014:48
fungifor discussion ^14:48
fungisimilar additions for the other domains will have to be made manually in rackspace and cloudflare depending on the domain14:49
clarkbI'll have to refresh on spf rules14:58
clarkband ya I'm hoping to get the agenda out as soon as the board meeting concludes14:58
* SvenKieske has flashbacks about being a mail admin15:01
clarkbso basically a means match the A/AAAA records for lists.zuul-ci.org in this case as valid senders and ?all means neutral stance on any other sender?15:03
fungiwe could make it tighter with ~all or even -all if we wanted to tell receiving mtas to reject messages outright when they come from another server15:04
clarkbI think ?all is fine to start15:04
clarkbeasier to get more restrictive as we go than to start that way15:04
clarkbfungi: I did leave one suggestion about using shorter TTLs in case we need to iterate on the rules. FWIW I don't think this change should negatively impact our mailing lists so we should probably go ahead with it?15:05
clarkbif we were doing -all then I'd be more cautious but ?all should be pretty safe and if it makes other mail servers happier...15:05
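The policies compared above differ only in the qualifier attached to the final "all" mechanism. As a rough illustration, here is a minimal Python sketch that splits an SPF record into mechanisms and names the result each qualifier produces. The function name parse_spf is hypothetical, and the sketch ignores modifiers such as redirect=; it is not how any mail server evaluates SPF.

```python
# Illustrative only: maps SPF qualifiers to the result a receiver would record.
QUALIFIERS = {
    "+": "pass",
    "-": "fail",
    "~": "softfail",
    "?": "neutral",
}


def parse_spf(record):
    """Return (mechanism, result) pairs for a simple SPF version 1 record."""
    terms = record.split()
    if not terms or terms[0] != "v=spf1":
        raise ValueError("not an SPF version 1 record")
    parsed = []
    for term in terms[1:]:
        if term[0] in QUALIFIERS:
            qualifier, mechanism = term[0], term[1:]
        else:
            # A mechanism without a qualifier defaults to "+" (pass).
            qualifier, mechanism = "+", term
        parsed.append((mechanism, QUALIFIERS[qualifier]))
    return parsed


for policy in ("v=spf1 a ?all", "v=spf1 a ~all", "v=spf1 a -all"):
    print(policy, "->", parse_spf(policy))
```

In each case the "a" mechanism passes hosts matching the domain's A/AAAA records, and only the treatment of every other sender changes.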
fungiyes, i can set a shorter ttl on these for now15:06
opendevreviewJeremy Stanley proposed opendev/zone-opendev.org master: Add an SPF record for the listserv  https://review.opendev.org/c/opendev/zone-opendev.org/+/90268915:09
opendevreviewJeremy Stanley proposed opendev/zone-zuul-ci.org master: Add an SPF record for the listserv  https://review.opendev.org/c/opendev/zone-zuul-ci.org/+/90269015:09
clarkbI made some small edits to the meeting agenda. I'll send that out around 1600 or whenever the board meeting kicks out the public membership15:21
*** dmellado206 is now known as dmellado15:54
clarkbfungi: frickler: any comments on https://review.opendev.org/c/opendev/system-config/+/902436 ? This will affect all ansible runs on bridge (by default at least since it updates the ansible config)16:02
clarkbalso if you have time https://review.opendev.org/c/opendev/system-config/+/901469 and its parent should be safe to land at this point. Just adds Gerrit 3.9 image builds and tests but we don't push that into production16:03
frickleroh, that could have been the cause for the sporadic LE failures, too? I remember wondering where the -13 came about16:05
jrosseri have had to patch the same thing a long time ago here https://review.opendev.org/c/openstack/openstack-ansible/+/85142616:07
jrosseri made a reproducer in the linked bug report16:07
fricklerha, I was just thinking why we shouldn't directly go to 300 instead of 120 ;)16:08
clarkbfrickler: yup it was tonyb's investigation of that that led us to the ansible issue16:08
clarkbfrickler: I think the main reason to not go to 300 is simply to avoid "leaking" a bunch of ssh processes if we can get away with less16:08
clarkbIt's actually a bit surprising to me that this isn't a drop the world and fix it type of bug for ansible16:09
clarkbit's been happening for a while now and seems to affect many users and is fatal due to default configuration with no other interaction from the user16:09
fricklermy idea is that 120 is still a possible runtime for an LE run, 300 would be much less likely I think16:10
clarkbfrickler: it's specific to the time between two tasks. 120 and 300 seconds are both possible depending on the playbook16:10
clarkbit's just that as the time increases we believe statistically there are fewer incidences of that time delta16:10
clarkbso ya fewer 300s than 120s16:10
clarkbbut every single ansible run will have an ssh process hanging around for 5 times as long at 300 seconds16:11
clarkband then that is multiplied by the number of hosts16:11
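The tradeoff described above is simple arithmetic: one lingering ssh master process per host, kept alive for the ControlPersist timeout. A minimal Python sketch follows. The host count is a made-up number for illustration, and the 60-second baseline is an assumption inferred from the "5 times as long at 300 seconds" remark rather than something stated in the discussion.

```python
BASELINE_S = 60  # assumption: implied by "5 times as long at 300 seconds"
HOSTS = 100      # hypothetical inventory size, for illustration only


def persist_seconds(hosts, timeout_s):
    """Upper bound on idle ssh master-process seconds left behind by one run."""
    return hosts * timeout_s


for timeout_s in (60, 120, 180, 300):
    total = persist_seconds(HOSTS, timeout_s)
    ratio = total / persist_seconds(HOSTS, BASELINE_S)
    print(f"{timeout_s:>3}s timeout: {total} process-seconds ({ratio:g}x baseline)")
```

The cost grows linearly with the timeout, while the chance of hitting a gap between two tasks longer than the timeout shrinks, which is why a middle value is a plausible compromise.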
fricklerbut only while the playbook is running, not afterwards?16:11
clarkbI want to say I've seen them live beyond the playbook process' life. But now that you ask I'm not 100% sure of that16:13
fricklerthen let's say we use 180 as a compromise?16:14
clarkbsure I can update to 180s16:15
opendevreviewClark Boylan proposed opendev/system-config master: Increase bridge's ansible ssh controlpersist timeout to 180s  https://review.opendev.org/c/opendev/system-config/+/90243616:17
clarkbI've got an hvac person doing annual cleaning/inspection/maintenance type stuff in about half an hour. Shouldn't interfere with the meeting16:59
clarkbtonyb: I think your mirror CNAME update changes are in merge conflict due to the serial updating for other changes. Do you think we are ready to proceed with those? Mostly thinking A) we should keep making progress on that (sorry I've been distracted the last few days with gerrit/gitea stuff) and B) we'll need to coordinate those updates with any spf record updates we do17:00
clarkbtonyb: to validate that things are working prior to updating the CNAME records I generally just browse the afs filesystem tree via a web browser hitting the main host dns record, and if that shows records we know afs is working17:01
tonybOkay, I'll rebase them on top of the SPF updates17:01
tonybOkay I'll do that.  I had done similar via the official hostname but I can also do it via the new CNAME17:02
tonybclarkb: you have absolutely nothing to apologise for.17:02
fungii wouldn't worry too much about rebasing other dns changes onto the spf change. whichever doesn't merge first can just be revised for a new serial as needed17:04
fungii set the changes for spf addition to wip until we have a chance to discuss them in the meeting, but that's still a couple of hours out17:04
clarkbwe need a merge hook that updates the serial automatically for us17:09
clarkbthat seems like more pain than it is worth though as you won't get matching commit shas17:09
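For reference, the serial bump that such a hook would automate can be sketched in a few lines of Python. This assumes the common YYYYMMDDNN serial convention, which the discussion does not confirm for these zones, and next_serial is a hypothetical helper, not an existing tool.

```python
import datetime


def next_serial(current, today):
    """Return the next zone serial, assuming a YYYYMMDDNN layout."""
    base = int(today.strftime("%Y%m%d")) * 100
    if current >= base:
        # Already bumped today (or ahead of today): just increment.
        return current + 1
    # First change of the day: start at revision 00.
    return base


print(next_serial(2023120403, datetime.date(2023, 12, 5)))
print(next_serial(2023120500, datetime.date(2023, 12, 5)))
```

Because the new serial depends on whatever merged last, two open changes that each bump it will always conflict, which matches the rebasing described above.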
tonybclarkb, fungi: https://review.opendev.org/c/opendev/zone-opendev.org/+/902100 looks good to me.  I think the "merge conflict" was related to it being set WIP?  As it seems fine to me after I removed WIP17:19
clarkboh yes gerrit treats wip changes as merge conflicts17:20
clarkbbecause merge conflict really means "not mergeable"17:20
tonybOkay that matches my guess based on observed behaviour17:21
clarkb+2 from me. i also verified I can browse the afs hosted stuff17:23
clarkbI think we can land that whenever we get a second reviewer to double check it17:23
clarkb2fa will be permanently required on github accounts starting January 11, 202417:31
clarkbnot an issue for my personal account, but I do wonder if that will impact the accounts used by projects to replicate to github (I'm guessing they may not have 2fa set up already)17:31
clarkbalso maybe the zuul account? Something to watch I guess17:32
fungiclarkb: tonyb: i approved 902100 just now17:41
tonybclarkb: I'd say the github 2fa change will impact those accounts: https://docs.github.com/en/organizations/keeping-your-organization-secure/managing-two-factor-authentication-for-your-organization/managing-bots-and-service-accounts-with-two-factor-authentication17:44
opendevreviewMerged opendev/zone-opendev.org master: Switch CNAME records to new mirrors  https://review.opendev.org/c/opendev/zone-opendev.org/+/90210017:49
clarkbtonyb: but it isn't clear if the account will stop working until that is done or if we just need to update it on next login17:49
tonybAh okay17:50
fungialso we did, long ago, set up 2fa for our shared github admin account17:51
fungino idea if that was done for any of the replication accounts17:51
clarkbya and I think I would recommend a similar setup for any bot account needing similar. Basically use a software totp generator17:52
fungi"we" (opendev sysadmins) don't control most/any of those so i don't think we even have a way to find out17:52
fungithe approach we've taken is that we don't want to be responsible or know about github, the individual org owners create replication accounts and corresponding zuul secrets for their credentials17:53
fungiso it will be up to them to make any necessary updates17:53
clarkbya for opendev I think there are two things we care about. The old admin account which is largely unused now and already using 2fa. And whatever the zuul token is generated with17:54
clarkbthose might be the same underlying account with scoped permissions or something17:54
opendevreviewJeremy Stanley proposed opendev/zone-opendev.org master: Add an SPF record for the listserv  https://review.opendev.org/c/opendev/zone-opendev.org/+/90268918:10
tonybIn looking at cleaning up the 4 older mirror nodes I came across: https://opendev.org/opendev/system-config/src/branch/master/hiera/common.yaml#L45-L49  ... Should I update these to use the un-numbered mirror names ?18:10
fungirebased ^ for the serial update18:10
clarkbtonyb: for simplicity of maintenance I think we should consider that. One reason to make it extra specific is if we need to measure performance between servers. I don't think that is likely for mirror nodes and we can always add an explicit host to achieve that18:11
tonybWill do18:18
opendevreviewTony Breeds proposed opendev/system-config master: Add a helper script for doing the LVM setup on mirror nodes.  https://review.opendev.org/c/opendev/system-config/+/90150418:35
opendevreviewTony Breeds proposed opendev/system-config master: Update cacti_hosts to more generic names.  https://review.opendev.org/c/opendev/system-config/+/90271418:35
opendevreviewTony Breeds proposed opendev/system-config master: Remove Ansible configuration and inventory entries for old mirror servers  https://review.opendev.org/c/opendev/system-config/+/90271518:35
opendevreviewTony Breeds proposed opendev/zone-opendev.org master: Remove old mirror nodes from DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/90271618:38
fungiclarkb: did you happen to see https://review.opendev.org/899219 ?18:43
clarkbI did not18:44
clarkbI don't really rely on the reviewer list stuff much because people add me to random changes18:44
fungiyeah, that's why i brought it up. it's been waiting a while, seems reasonably straightforward. it's not a small change but it comes with tests and should have basically no runtime impact except on a platform that isn't supported without that change18:45
clarkbI can try to review it but it probably won't be until late today at the earliest18:46
fungii'm trusting the ironic developers understand tinycore's expectations18:46
clarkbI still need to do some edits to annual report things and have a zuul change or two to review18:46
fungibut didn't want to single-core approve it18:46
clarkband am hoping we can merge the ansible controlpersist change update18:46
tonybI think the latter is ready.18:47
fungiyep, i think we should go ahead and approve the controlpersist change. three reviewers gave +2, and the one who weighed in asking for an adjustment is presumably satisfied with it too now that's changed18:47
tonybI didn't +W it as I wanted to double check it was safe during "peak hours".18:48
fungii'm going to go ahead and approve 902481 so that we can take the gerrit server back out of the emergency disable list before we forget18:49
tonybfungi: sounds good to me18:50
clarkband ya i think the controlpersist change is fine to merge whenever. If it breaks we manually update the ansible config and revert it18:52
clarkbI'm definitely trying to clean up my own backlog before year end so want to help others do that too and will do my best to get to that glean change18:52
clarkbhttps://zuul.opendev.org/t/openstack/build/c7060e07d77148eab826f3feba913c5d I think this change should more or less self test the ansible config update? But I'm not 100% certain of that18:53
clarkbhttps://zuul.opendev.org/t/openstack/build/c7060e07d77148eab826f3feba913c5d/log/bridge99.opendev.org/etc/ansible/ansible.cfg#31 the config update does show up there and the job runs the base playbook so I think that is a good indication we should be fine18:53
clarkbfungi: maybe let's +W the controlpersist change after you are satisfied with the gerrit update (so that we don't break before we clean up gerrit)18:54
fungifair enough18:54
opendevreviewMerged openstack/project-config master: Address TODO in acl normalization script  https://review.opendev.org/c/openstack/project-config/+/90231419:26
clarkbI need to pop out and make lunch but when I get back I'm going to do my best to review all those changes I said I would review :)19:57
* tonyb needs to go AFK for a bit (about an hour)20:00
opendevreviewMerged opendev/zone-zuul-ci.org master: Add an SPF record for the listserv  https://review.opendev.org/c/opendev/zone-zuul-ci.org/+/90269020:03
opendevreviewMerged opendev/zone-opendev.org master: Add an SPF record for the listserv  https://review.opendev.org/c/opendev/zone-opendev.org/+/90268920:07
opendevreviewMerged opendev/system-config master: Revert "Switch Gerrit replication to a larger RSA key"  https://review.opendev.org/c/opendev/system-config/+/90248120:09
opendevreviewMerged opendev/system-config master: Make bookworm the python Dockerfile parent default image  https://review.opendev.org/c/opendev/system-config/+/89875520:09
opendevreviewMerged opendev/system-config master: Add python3.12 bookworm base images  https://review.opendev.org/c/opendev/system-config/+/89875620:09
fungidns update deploys are queued up waiting for the hourly prod jobs to finish20:18
fungii've also done the other ones that needed manual processing20:19
fungigetting the expected records back from queries for all of lists.airshipit.org, lists.katacontainers.io, lists.opendev.org, lists.openinfra.dev, lists.openstack.org, lists.starlingx.io, lists.zuul-ci.org20:37
fungi#status log Added (neutral) SPF records for all Mailman sites in order to comply with delivery requirements for some mass mail providers20:37
opendevstatusfungi: finished logging20:37
fungiinfra-root: i've removed review02.opendev.org from the emergency disable list now that 902481 has merged20:38
fungideployment for 898755 (Make bookworm the python Dockerfile parent default image) is failing some bookworm python 3.12 promote image builds20:40
Clark[m]Usually that is due to a race in cleaning up old tags20:41
fungi"promote-docker-image: Get manifest" failed. "the output has been hidden due to the fact that 'no_log: true' was specified for this result"20:41
Clark[m]Oh get manifest wouldn't be that though20:41
Clark[m]The child change should promote and effectively replace things though so if the child is fine I think we are good20:41
fungii'll keep an eye on it, should start running momentarily20:42
fungiyeah, the precise builds which failed when run for the parent are what were rerun for the child and are succeeding20:43
fungiprobably because the child merged before the parent was enqueued into deploy?20:44
fungiso we ended up running the new jobs one change early20:44
fungiif we'd held up approving the second change until the first one deployed, i guess they wouldn't have been selected for the first deploy20:45
Clark[m]I'm not sure I parse that20:45
fungithe parent and child merged within seconds of each other. the child adds new jobs. because the child was already merged when the parent entered the deploy pipeline, the jobs added by the child got run for the parent20:47
Clark[m]Oh! Yes I see. Fun21:00
clarkbTheJulia: I left some super minor comments on https://review.opendev.org/c/opendev/glean/+/899219 I'm happy to approve that as is but wanted to make sure you saw those comments before it merged just in case any of those items are a bigger concern than I foresee21:17
clarkbtonyb: I approved the cacti update and the inventory removals of the old servers. I'll let you decide if you want to approve https://review.opendev.org/c/opendev/system-config/+/901504/ or address frickler's comments21:21
clarkband I've also approved the dns update to remove dns records for the old servers just now21:23
clarkbI think all of that should be fine to land in whatever order relative to one another21:23
clarkbthe main thing is not removing the actual servers until the inventory updates, otherwise we get failures21:23
clarkb*we get failures from LE21:23
tonybAhh okay.  I misunderstood what you said the time I asked.21:26
clarkbtonyb: I think the order you have is most strictly correct21:27
clarkband that way we don't accidentally try to update LE certs if the system-config inventory cleanup fails in CI21:27
tonybgot it21:29
clarkboh heh they don't share a queue so need to recheck that once the dependency merges21:32
opendevreviewMerged opendev/system-config master: Update cacti_hosts to more generic names.  https://review.opendev.org/c/opendev/system-config/+/90271421:34
tonybclarkb: okay21:39
clarkbfungi: since review02 is out of the emergency file now should I approve https://review.opendev.org/c/opendev/system-config/+/902436 ?21:46
fungiclarkb: yes21:46
clarkbdouble checking things I compared to jrosser's change and that change doesn't have the s suffix21:48
clarkbso now I'm wondering if this is correct or maybe seconds is the default. I'm double checking before approving21:48
clarkb"If set to a time in seconds, or a time in any of the formats documented in sshd_config(5)"21:49
clarkblooks like seconds is default and s as a qualifier is acceptable21:50
fungiso fine either way21:50
clarkbI've approved it21:51
tonybclarkb: I read over the opendev and zuul annual reports.  opendev looks good to me.  I think there's a small opportunity for clarification at: https://etherpad.opendev.org/p/2023-zuul-annual-report#L6322:30
tonybthe way it's currently worded sounds like there are frequent privilege escalations *in zuul*22:31
opendevreviewMerged opendev/system-config master: Remove Ansible configuration and inventory entries for old mirror servers  https://review.opendev.org/c/opendev/system-config/+/90271522:42
opendevreviewMerged opendev/system-config master: Increase bridge's ansible ssh controlpersist timeout to 180s  https://review.opendev.org/c/opendev/system-config/+/90243622:42
kozhukalovThis is the standard role use-docker-mirror and I often see this error appearing randomly See for example https://37c3b0502f727da81ef7-95e9c5ca80133f8e8d1a12029c605fe6.ssl.cf5.rackcdn.com/902162/2/check/loci-ironic/8c1562c/job-output.txt22:48
clarkbtonyb: thanks I'll see what I can do to make that more clear; it is a linux issue we're just the victim of :)22:49
tonybclarkb: That's what I thought.22:50
clarkbkozhukalov: chances are exactly what it says: the disk is full. Do these errors occur in a particular cloud region? Some of them have ephemeral drives mounted on /opt that provide additional disk space rather than having all disk available at /22:50
tonybkozhukalov: I think the error is clear .... the disk is full.  I guess the real question is what chewed up all the disk22:50
clarkbif this happens in rax nodes chances are you're filling up the 20GB / (which already has about 11GB dedicated to the OS and caches used up)22:51
kozhukalovyes, the error is clear. However it is not quite clear why it happens. 22:51
clarkband ya need to identify what in your job is using all the disk and see if it can be migrated to /opt22:51
kozhukalovThis is in the very beginning of the job22:51
kozhukalovI'll try to figure out if it happens only in a particular region22:52
clarkb2023-12-05 19:05:20.612963 | ubuntu-jammy | Setting up swapspace version 1, size = 20 GiB (21474832384 bytes)22:52
clarkbthat is almost certainly your problem22:52
tonybkozhukalov: What change?22:52
fungizuul should also be collecting info at the start of the job showing what the initial filesystem utilization looks like22:52
fungiin one of the standard zuul_info logs we collect22:53
kozhukalovWill try to set smaller swap allocation. 22:54
kozhukalovThanks @tonyb @clarkb @fungi22:54
clarkbkozhukalov: on rax nodes what should happen is the ephemeral drive should be formatted to include swap22:57
tonybkozhukalov: FWIW: https://zuul.opendev.org/t/openstack/build/8c1562cc0075408cb33bfdcee7a0a758/log/zuul-info/zuul-info.ubuntu-jammy.txt#123 shows the initial diskspace22:57
clarkbI'm not sure why that particular job is using a swapfile. We only use swapfile outside of rax because we don't get ephemeral drives we can repartition and format22:57
clarkbthat said I think we suggest 1GB swap at most22:57
clarkb20GB is massive and I'm not sure I would recommend it in any setup on our nodes22:58
clarkbso I think there are two bugs here: the first is using a swapfile on rax and the second is the size of the swap file/partition22:58
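The numbers quoted above can be checked with a short Python sketch. It assumes a 4 KiB page size and that the byte count in the log line is the requested 20 GiB minus one header page. The 20GB root and 11GB used figures come from the discussion and are treated as round numbers, so this is a rough check rather than an exact accounting.

```python
GIB = 1024 ** 3
PAGE_SIZE = 4096  # assumption: 4 KiB pages on the test node


def usable_swap_bytes(size_gib, page_size=PAGE_SIZE):
    """Bytes left for swap after one header page is set aside."""
    return size_gib * GIB - page_size


def swap_fits(root_total_gb, root_used_gb, swap_gb):
    """Rough check: does a swap file of swap_gb fit in the free root space?"""
    return swap_gb <= root_total_gb - root_used_gb


print(usable_swap_bytes(20))   # 21474832384, matching the log line
print(swap_fits(20, 11, 20))   # False: about 9GB free, 20GB requested
print(swap_fits(20, 11, 1))    # True: the suggested 1GB maximum fits
```

With roughly 9GB free on /, a 20GB swap file cannot fit, which is consistent with the job failing on a full disk at its very beginning.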
JayFI have a strange service suggestion, that I feel like might already exist23:01
JayFsome kind of link shortener.23:01
kozhukalovThis is probably due to the nature of the job. For me it is also unclear why it allocates such a huge swap file. It has been the case since 2017. Will try to rework this.23:02
clarkbJayF: it's come up in the past but there are plenty that exist out there already. It's a very low return big headache type of service for us23:02
JayFIs there one that you'd trust to put out there?23:02
tonybkozhukalov: You could try removing the hunk from openstack/loci:playbooks/setup-gate.yaml23:03
JayFI always get skittish when picking one, you never know when all of a sudden short-dot-io is going to turn into a spamhaus23:03
tonybJayF: I use tiny.cc23:03
clarkbJayF: bitly is a long term resident in the space23:03
clarkbthey don't seem to have imploded and created themselves problems yet23:04
JayFbitly now charges23:04
clarkboh neat23:04
JayFfor new ones23:04
tonybJayF: one nice feature there is that you can edit the links for a given short URL23:04
JayFso I am counting the minutes until our old ones go through an ad redirect23:04
JayFthe encrappification cycle of url shorteners :/ 23:04
clarkbbut also I try to avoid shorteners if I can23:04
kozhukalov@tonyb Yes, I am trying to rework this. This was likely to make it faster. It first allocates huge swap file and then deploys docker to use tmpfs. Kinda funny23:04
clarkbI don't like not knowing what the other end is when I click on something23:05
JayFtiny.cc works, thanks tonyb 23:05
clarkbit drives me nuts that twitter made shorteners common for everything23:05
JayFclarkb: agreed, but this is for a long link that's going into a youtube video for Openinfra live on Thurs23:05
JayFdon't want to make someone type in a 200 character URL in 13 seconds :D23:05
fungiqr code ;)23:05
tonybJayF: tiny.cc also generates QR codes23:05
JayFfungi: where'd you get the idea that I believe in OR 23:06
JayFfungi: AND :D 23:06
* JayF feels about QR codes the way clark feels about URL shorteners23:06
JayFexcept at least with a URL shortener I can curl it23:06
fungior https://pypi.org/project/qrcode has a convenient command line interface23:06
clarkbJayF: ya I'm somewhat anti qr code too fwiw23:08
clarkbI've gone as far as saying the fcc should stop allowing them on tv but I'm weird like that23:09
clarkbbetween superbowl ads showing QR codes and restaurants replacing menus with QR codes we've normalized behavior that can only lead to problems in a few years when everyone thinks nothing of scanning and following them23:09
fungii usually try to accompany them with the actual url for people who want to pause and type23:09
JayFthat is exactly what I am doing :)23:10
tonybkozhukalov: Oh ... that's an interesting setup.  I'm sure it made sense when it was implemented23:10
opendevreviewMerged openstack/project-config master: [Neutron-lib] Update Grafana Dashboard  https://review.opendev.org/c/openstack/project-config/+/90211923:49
