Thursday, 2022-02-10

<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: [DNM] make sure centos-8 nodes fail with 828440
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: [DNM] make sure centos-8 nodes fail with 828440
00:16 <corvus> 2 executors stopped.  ah ah ah.
00:18 <fungi> narration by the count never gets old
00:44 <corvus> one left
00:52 <corvus> hrm, i can't tell what it's waiting on
00:52 <corvus> i don't see any build related subprocesses.  i do see a bunch of stale looking 'git cat-file' jobs
00:53 <corvus> i may send it a sigusr2
00:54 <corvus> ah, it's a paused build
00:56 <corvus> tripleo-ci-centos-8-content-provider head of gate
01:07 <clarkb> ya that has confused me before but Zuul does the correct thing
01:10 <corvus> it's resumed; apparently the ooo quickstart collect logs is not fast
01:14 <corvus> done; on to batch 2 now
01:15 <corvus> the first batch looks like it's running jobs okay.  i'm going to afk now
01:27 <ianw> hrm, reported NODE_FAILURE when i switched the node types to centos-8 anyway.  it feels like that should have run on centos-8-stream nodes.  i wonder what i'm missing...
01:34 <clarkb> ianw: label: centos-8 is not stream
01:34 <clarkb> and that change seems to set label to centos-8
01:35 <clarkb> maybe we can just land that invalid config and it will report NODE_FAILURE? I thought zuul would validate more than that but seems not to
01:36 <ianw> clarkb: yeah, but i thought that centos-8 label now actually selected centos-8-stream nodes
<opendevreview> Steve Baker proposed openstack/diskimage-builder master: Replace kpartx with qemu-nbd in extract-image
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: [DNM] make sure centos-8 nodes fail with 828440
<opendevreview> Ian Wienand proposed openstack/diskimage-builder master: Futher bootloader cleanups
03:15 *** ysandeep|out is now known as ysandeep
03:23 <Clark[m]> ianw: not the label. The centos-8 nodeset
03:24 <ianw> Clark[m]: yeah, i had it wrong, it was not using the nodeset defined in base-jobs
<opendevreview> Ian Wienand proposed zuul/zuul-jobs master: [DNM] make sure centos-8 nodes fail with 828440
<opendevreview> Ian Wienand proposed openstack/diskimage-builder master: Futher bootloader cleanups
05:11 *** ysandeep is now known as ysandeep|afk
05:57 *** ysandeep|afk is now known as ysandeep
07:21 *** amoralej|off is now known as amoralej
<sshnaidm> clarkb, ianw, corvus if you have merge rights, please merge:
08:31 *** ysandeep is now known as ysandeep|lunch
08:38 *** jpena|off is now known as jpena
10:06 *** ysandeep|lunch is now known as ysandeep
11:26 *** bhagyashris__ is now known as bhagyashris
11:30 *** dviroel|out is now known as dviroel|ruck
11:38 *** rlandy|out is now known as rlandy|ruck
12:23 *** ykarel is now known as ykarel|away
12:42 <frickler> kevinz_: Certificate for is at 21 days, could you have a look please?
12:43 <kevinz_> frickler: Sure, I will re-gen it.
13:09 *** amoralej is now known as amoralej|lunch
13:13 *** dviroel is now known as dviroel|ruck
13:14 *** artom__ is now known as artom
13:52 *** ysandeep is now known as ysandeep|afk
13:53 *** pojadhav is now known as pojadhav|afk
13:58 *** amoralej|lunch is now known as amoralej
14:55 *** ysandeep|afk is now known as ysandeep
15:08 <dtantsur> hi folks! I remember I was talking to some of you about having a real partition image for cirros. Has there been any movement around it?
15:08 <dtantsur> I'm asking because we're about to start building our own centos images with DIB in the CI, but I'd rather not
15:15 <fungi> dtantsur: frickler has been working on a cirros fork at so maybe he has some ideas
15:16 <dtantsur> oh, I also wanted to ask about the reason for creating a fork. is upstream development stagnating?
15:17 <fungi> he can speak better to the reasons, but my understanding is that he wanted to set up some zuul jobs, possibly do integration testing with devstack, and discuss with smoser about relocating development to here and/or adopting the project
15:17 <dtantsur> k understood
15:18 <fungi> there have been ml threads in the past about cirros going stale upstream for long periods and the possibility of the openstack community picking up maintenance of it, but i don't know if those prior discussions had any bearing on the present situation
15:19 <frickler> so currently this isn't a fork, but an attempt to get a working CI again for the original project
15:21 <frickler> regarding the "real partition image", I did some testing, and the main issue seems to be getting grub installed into the image, which requires changing library options and in the end makes the result twice as large as the original
15:21 <frickler> so I don't think that this is feasible as a default solution, but possibly only as an optional, different flavor of cirros
15:24 <frickler> I also don't have too much time for this myself, so expect some results in a couple of months, not within days or weeks
15:24 <dtantsur> frickler: is it something I could pick up or is there too much context to transfer?
15:28 <frickler> dtantsur: well, help is always welcome. if you want to look into setting up the build to generate what you need, best join us in #cirros (currently still on libera), so smoser and myself can work together answering your questions
15:30 <frickler> you could also look at and help me find out how to collect and store build results in a useful way without exploding our log storage
16:03 *** ysandeep is now known as ysandeep|out
16:11 *** priteau_ is now known as priteau
16:24 <dtantsur> frickler: you mean, store the actual generated images? I wish I knew, we could use that in Ironic...
16:30 *** tkajinam is now known as Guest210
16:31 <corvus> the zuul executor restarts from yesterday are done; i'm going to restart zuul01 now
16:34 <corvus> 2022-02-10 16:33:05,483 INFO zuul.ComponentRegistry: System minimum data model version 1; this component 3
16:34 <corvus> 2022-02-10 16:33:05,484 INFO zuul.ComponentRegistry: The data model version of this component is newer than the rest of the system; this component will operate in compatability mode until the system is upgraded
16:34 <corvus> that's as expected.  as soon as i shut down the zuul02 components, that should bump up.
16:34 <clarkb> and then 02 will start on the new version when it sees everyone else is at the new rev too
16:38 <corvus> i was just thinking, a zuul CD job to upgrade zuul would be a little tricky... gracefully restarting an executor can take longer than the max job runtime due to paused jobs...
16:39 <clarkb> corvus: crazy talk but I wonder if zuul could fork itself on the new code and just keep running with the old state
16:40 <clarkb> basically replace itself in place and not need a synchronization at all
16:41 <corvus> clarkb: kinda awesome.. but tricky with our container deployment model... :/
16:44 <corvus> zuul01 is up, restarting 02 now
16:44 <clarkb> In theory it would work pretty well to do that if you got the mechanics down since we're already storing the bulk of the state in zk
16:45 <clarkb> the danger would be if we needed to migrate internal data structures, but maybe they could be forced to refetch from zk
16:45 <corvus> 2022-02-10 16:45:19,445 INFO zuul.ComponentRegistry: System minimum data model version 3; this component 3
16:45 <corvus> 2022-02-10 16:45:19,445 INFO zuul.ComponentRegistry: The rest of the system has been upgraded to the data model version of this component
16:47 <clarkb> corvus: if you get a chance this morning can you look at since you pointed out the slurp module which I used there
16:48 *** priteau is now known as priteau_
16:48 *** priteau_ is now known as priteau
16:49 <corvus> clarkb: lgtm
16:49 <clarkb> thanks I went ahead and approved it (and responded to your question)
16:50 <corvus> the big zuul changes necessitating the model upgrade are related to semaphores and changes in gate superseding check; so please keep an eye out for any unexpected behavior there
16:52 <clarkb> I'm going to follow up on that gerrit bug I filed about the cloning weirdness once my brain has fully booted.
16:52 <clarkb> I suspect that we can go ahead and close the bug out
16:52 <fungi> clarkb: migrating file descriptors and socket handles gets tricky when you're replacing processes live, but it's doable
16:53 <fungi> closing and reopening everything is probably simpler
16:53 <fungi> forks do mostly inherit them though
16:53 <gibi> is it just me or is the zuul web ui down?
16:53 <fungi> gibi: it's being restarted
16:53 <gibi> ahh, OK
16:54 <corvus> and it's up
16:54 <fungi> gibi: zuul itself is able to do hitless rolling restarts now, but we only have one zuul-web service at the moment so it goes offline for a bit
16:54 <corvus> #status log rolling restarted all of zuul on ad1351c225c8516a0281d5b7da173a75a60bf10d
16:54 <clarkb> there are some TODOs to get a load balancer in front of multiple webs
16:54 <opendevstatus> corvus: finished logging
16:54 <gibi> fungi: nice improvement
16:56 <corvus> what was the decision about LB -- make a new one or reuse the gitea one?
16:57 <clarkb> corvus: I think my slight preference is to make a new one since small nodes seem to work well for haproxy and this way we can continue to operate zuul and gitea independently
17:01 <fungi> is the gitea one in the same region as the zuul servers anyway?
17:01 <clarkb> fungi: it is not
17:01 <fungi> better to have the lb as topologically close to the servers as possible if it's doing socket forwarding
17:01 <fungi> from a performance and stability standpoint
17:04 <corvus> can haproxy handle websockets?
17:06 <fungi> the short answer is "yes" because haproxy has a variety of different load balancing solutions
17:07 <fungi> and client persistence algorithms
17:07 <fungi> the longer answer depends on how exactly you want websockets "handled"
17:07 <corvus> oh and we use tcp right?
17:07 <fungi> for gitea we do, yes
17:07 <clarkb> corvus: we have historically used tcp
17:08 <clarkb> I think the reason for that is it simplified tls
17:08 <corvus> so that should work fine for this, modulo maybe needing to set max connections or something
17:08 <clarkb> basically instead of needing certs for every point in between you just do it in the service and pass straight through
17:09 *** rlandy|ruck is now known as rlandy|ruck|mtg
17:09 <fungi> yeah, layer-4 proxying with client ip persistence can work for just about anything modulo cgn clients
17:10 <fungi> if you want to do things like layer-7 forwarding with ssl/tls termination on the load balancer, or more granular client persistence to specific backends based on session ids or injecting cookies, that's where it starts to depend a lot on the application itself
17:11 <fungi> is it important that multiple requests from the same client are persisted to the same backend in this case? like for authenticated sessions?
17:12 <corvus> nope :)
17:12 <corvus> there is no server-side session state
17:13 <fungi> in that case we can probably ignore client persistence entirely
17:13 <fungi> which should get us a much more even load distribution
17:15 <fungi> it's a bigger problem for gitea, where a git operation can involve multiple requests over different connections, and there's no guarantee that the state of the repositories between backends is completely consistent (repacks, replication races, et cetera)
17:16 *** dviroel|ruck is now known as dviroel|ruck|afk
17:22 <clarkb> ok I updated with what we learned
17:23 <corvus> we.. have a 5 node job limit?
17:24 <clarkb> ya I seem to recall someone went a bit overboard and we had to set that. But maybe I'm misremembering
17:24 <corvus> i would suggest that we increase that for the opendev tenant, but that wouldn't help us.
17:24 <corvus> since the opendev tenant isn't where we run the opendev service jobs
17:25 <clarkb> corvus: what we can do is use groups rather than hosts and have some things colocated. I've thought about doing that in the past but it seemed non urgent
17:25 <fungi> i have no problem with raising it if we have jobs that need that many, it was simply a useful starting point
17:25 <clarkb> and ya I think we could bump it
17:29 <clarkb> I guess the way we do zuul services doesn't really allow for colocating though (since everything is bind mounted from the same path regardless of service)
<opendevreview> James E. Blair proposed opendev/system-config master: Add Zuul load balancer
17:32 <corvus> presumably zuul will refuse to run that until we figure out how to run 6 nodes
17:32 *** jpena is now known as jpena|off
<sshnaidm> clarkb, hi, can you please merge the perms patch in your time
17:35 <clarkb> fungi: sshnaidm: we clarified that deleting branches is lossy right?
17:35 <clarkb> sshnaidm: if you delete a branch and don't have another permanent ref pointing to that commit our regular garbage collection will delete data
17:36 <sshnaidm> clarkb, yeah, I'm aware of that
17:37 <sshnaidm> it was created by mistake, so I just don't want it to be there to confuse users about the right branches..
17:40 <clarkb> fungi: I think your ethercalc copy may be trying to back up and failing based on emails we are getting. Can you double check that and maybe comment out the backup crons on your copy?
17:42 <fungi> clarkb: i've done one better and deleted the server
17:42 <fungi> just cleaning up the snapshot i built it on now
17:48 <clarkb> corvus: I think we can either bump the limit or combine the load balancer and zk or merger or similar.
<opendevreview> Merged openstack/project-config master: Give perm to release team to delete branches
<opendevreview> Merged opendev/system-config master: Test pushes into gitea over ssh
17:55 <clarkb> I was hoping ^ that stack would cause semaphores to be exercised but the second change is test only so we don't run the prod deploy after it
17:55 <clarkb> oh well
17:55 <corvus> clarkb, fungi: oh apparently max is 10 nodes
17:58 <fungi> oh neat
18:10 <clarkb> I'm trying to write a change to fix ls-members --recursive now
18:21 *** amoralej is now known as amoralej|off
18:36 <clarkb> that may do it
18:37 <fungi> so it was working for the rest api, but they pulled the rug out from under the cli?
18:38 <clarkb> fungi: yup I think at some point they combined the rest api and ssh commands but then in cases like this missed that the recursive flag was private and needed to be explicitly handled
18:38 <clarkb> Next I need to work on a depends on to check this output
18:38 <clarkb> it is easier for me to do that than figure out their test framework :/
18:39 <fungi> their release notes handling is kinda neat, i didn't realize they embed that in commit message footers
18:39 <clarkb> fungi: that is brand new as of last night
18:39 <fungi> i wonder if they've considered how to go about correcting/updating a release note after the commit merges ;)
18:43 * corvus uploaded an image: (36KiB) < >
18:43 <corvus> my guess is they write a new note.
18:44 <fungi> a moose once bit my sister
18:53 <fungi> today i learned about mysql's string replace function. wish i had known about it during the to migration
<opendevreview> Clark Boylan proposed opendev/system-config master: DNM testing upstream fix for gerrit ls-members
18:55 <clarkb> and ^ should hopefully test this for us
18:57 *** rlandy|ruck|mtg is now known as rlandy|ruck
18:58 <clarkb> on the surface it is a silly little bug but wow did it create a lot of confusion for us debugging the performance issues
19:00 <clarkb> the other really neat thing about that depends on setup is if we push the fix to stable-3.4 and not also to stable-3.5 then we'll get a test that checks for failure and success in the same buildset :)
19:01 <clarkb> granted on different releases of gerrit but for this situation that should be fine
19:19 *** dviroel|ruck|afk is now known as dviroel|ruck
<opendevreview> Ade Lee proposed zuul/zuul-jobs master: WIP/DNM - Test new version of mariadb
<opendevreview> Clark Boylan proposed opendev/system-config master: DNM testing upstream fix for gerrit ls-members
19:47 <clarkb> apparently admin is already a member of groups it creates? I guess implicitly as owner maybe? To be extra sure that we're getting recursive listings I went ahead and updated that to make another user that is distinct
20:52 <fungi> makes sense, as gerrit sets its groups to be self-owned by default
20:55 <corvus> clarkb, fungi: is, um, possibly a hole-in-one?
20:55 <clarkb> corvus: nice I'll review it shortly. Just sitting back down after some lunch
<opendevreview> Ian Wienand proposed openstack/project-config master: Remove Fedora 34
20:57 <fungi> corvus: and on a par 4 hole at least
20:58 <clarkb> woot my gerrit test seems to work. I'll update it now to ensure that we aren't always recursive etc but I think the change I wrote is working
20:58 <clarkb> it is really cool that we can do this sort of thing with zuul
<opendevreview> Merged openstack/diskimage-builder master: Futher bootloader cleanups
<opendevreview> Clark Boylan proposed opendev/system-config master: DNM testing upstream fix for gerrit ls-members
21:08 <ianw> corvus: couple of minor comments, pleasing how easily the roles and testing and just generally everything make it look easy
21:10 <clarkb> I left a few too
21:14 <ianw> fungi / clarkb: the centos-8 failure patch I believe is fully tested now ->
21:14 <ianw> if we're ok we can go with that, and i can send another follow-up email about the current status
21:15 <fungi> ianw: sounds good to me, thanks. i also brought it up in the tc meeting today so at least the openstack leadership is aware of the current situation and impending behavior change
21:17 <clarkb> approved I guess keep an eye out for unexpected fallout, but thank you for testing it so that we avoid all that I hope :)
21:21 <ianw> i think we're covered for migration now.  if you're explicitly using centos-8 then you'll get NODE_FAILURE.  if you're using the base-jobs definition then you'll get RETRY_FAILURE with a clear failure message not to use it
21:22 <fungi> it's too bad there's no clear signal to explicitly trigger a failure result directly from pre-run, but at least it'll triple-fail quickly
21:23 <clarkb> looks like OSA has managed to clean things up for the most part. should be the last one and I rechecked it
21:24 <clarkb> jrosser: fyi I think that didn't auto enter the gate with its parent because they must not share a gate queue
21:24 <clarkb> y'all may want to ensure all the osa repos share a queue so that they are cogated and your integration testing makes use of speculative states properly
<opendevreview> Ade Lee proposed zuul/zuul-jobs master: WIP/DNM - Test new version of mariadb
21:26 <corvus> ianw: clarkb how sure are you that we don't need the letsencrypt job to run before the lb job?
<opendevreview> Merged opendev/base-jobs master: base: fail centos-8 if pointing to centos-8-stream image type
21:27 <clarkb> corvus: like 80%. I think the risk is that we'll end up having a proxy up that connects to backends that don't have valid https certs yet. It might be better to have users get no tcp connection at all
21:27 <clarkb> it's also not a major problem to have that dep there if we want to be safe
21:27 <fungi> we could also mitigate that by changing the health check to be more than just a tcp socket probe
<opendevreview> James E. Blair proposed opendev/system-config master: Add Zuul load balancer
<opendevreview> James E. Blair proposed opendev/system-config master: Clean up some gitea-lb zuul config
21:28 <clarkb> fungi: I think if you are doing tcp lb you cannot do the richer health checks
21:28 <clarkb> as haproxy doesn't load the necessary bits into its state tables
21:28 <corvus> okay but if you want to add it back we're totally adding that to my handicap.  strictly speaking, the -focal/-bionic comment is the only error i made :)
21:29 <fungi> mulligan's fair there
21:30 <corvus> i believe i made sure that zuul-web doesn't answer on http until it's fully initialized, so should be compatible with a tcp healthcheck
21:30 <corvus> (as we've seen from the rolling-restart outages)
21:31 <fungi> corvus: we put it behind apache which does answer though?
21:31 <fungi> also looks like you're introducing a config error on 828793
21:31 <corvus> hrm, that could be a problem then.
21:32 <corvus> do we just gloss over that discrepancy with gitea?
21:33 <corvus> "if apache is up, gitea is probably up"
21:33 <clarkb> corvus: ya I think so
21:33 <fungi> it's a good question. and yes i think it's probably something we should try to solve
21:33 <clarkb> when manually doing gitea work I always try to tell the load balancer about it first
21:33 <clarkb> fungi: ++
21:34 <fungi> we put the haproxy config in first. later we added apache in between haproxy and gitea but didn't consider what that might do to our health checks
21:34 <fungi> i think that was a regression we simply haven't noticed
21:34 <clarkb> one solution is to have the health check check the direct port
21:35 <clarkb> rather than the apache ssl terminator
21:35 <clarkb> I think that is possible as it has the bits to do the tcp check in place. It's just a matter of telling it to use the other port?
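[Editor's note: the "check a different port than the one you forward to" idea above maps to haproxy's `check port` server option. The fragment below is an illustrative sketch only; the hostnames, ports, and backend names are assumptions, not OpenDev's actual configuration.]

```
frontend balance-zuul-web
    bind :::443 v4v6
    mode tcp
    default_backend zuul-web

backend zuul-web
    mode tcp
    # plain layer-4 connect check against the proxied port
    server zuul01 zuul01.example.org:443 check
    # hypothetical: forward to apache on 443 but health-check the
    # service's direct port (3000 here) so "apache up, app down"
    # is detected and the server is pulled from rotation
    server zuul02 zuul02.example.org:443 check port 3000
```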
21:39 <clarkb> looking at merger queue graphs and wow can you see the periodic jobs loading in
21:42 <clarkb> tomorrow's periodic run will be interesting since we should have the full complement of mergers for it
21:42 <clarkb> today's set took about the same amount of time as yesterday's but with 60% of the mergers
21:49 *** dviroel|ruck is now known as dviroel|out
<opendevreview> Merged openstack/project-config master: Remove Fedora 34
21:54 <clarkb> infra-root any opinions on the best way to start shutting down subunit2sql workers and openstack health api? The health api hasn't worked in months and I've discussed with the qa team and we're basically going to turn it off. I was thinking I should shutdown apache (running the wsgi service) and the gearman workers for subunit2sql but it looks like puppet will restart apache. Should
21:54 <clarkb> I put the hosts in the emergency file or go ahead and start removing the puppet for them then shutdown the services. Then delete stuff in a bit?
22:01 <ianw> probably makes sense to emergency them and shutdown
22:03 <clarkb> ianw: oh ya maybe that is the easiest thing
22:11 <clarkb> ok I think that shows my upstream fix is working. I left a comment on the upstream change pointing to that
22:12 <clarkb> ianw: when you do that do you `shutdown -h now` or do it via the nova api?
22:25 <fungi> yeah, that makes sense to me. i do `sudo poweroff`
22:25 <clarkb> cool I'll get started on that shortly. Will be the two subunit2sql workers and the server
22:26 <clarkb> All the data is in the trove db though so I'm not too concerned about deleting these other than not being sure we'll be able to rebuild them if somehow necessary
22:26 <clarkb> But I figure if we go slow it gives people a chance to scream :)
22:30 <fungi> we can make images of them before deleting, i've done that with most of the other services i've shut down
22:35 <clarkb> ok health01, subunit-worker01, and are all in the emergency file now
22:35 <clarkb> next up server shutdowns
22:37 <clarkb> and done. They can be booted back up again if necessary but I don't expect much trouble since no one noticed the services were not working for a long time anyway
22:38 <clarkb> gmann: ^ fyi I shut down the servers as a first step in cleaning things up. If nothing comes up in the next ~week I'll snapshot the servers and delete them
22:39 <clarkb> is another related server but it hosts e-r things so I want to make sure that the whole ELK thing is settled before I clean it up
22:39 <clarkb> but once that is done I think status can go away too
22:43 <gmann> clarkb: ack.
22:46 <ianw> i wonder if we should move that to static and make redirect
22:47 <ianw> i did have that in bookmarks for years, just thanks to inertia
22:47 <clarkb> ianw: I think the zuul redirect would be just about the only thing that is still valid there when we are done
22:47 <clarkb> reviewday doesn't look like it has been working (and I'm not sure anyone has used it recently), health is broken and going away. e-r + ELK is moving.
<opendevreview> Merged opendev/system-config master: Add Zuul load balancer
23:37 <ianw> ianw@bridge:/var/log/ansible$ ls -l *.2020-* | wc -l
23:38 <ianw> does anyone have a problem if i remove these?
23:38 <ianw> it seems like in 2020 we had a period where we kept all the logs for a while
23:39 <ianw> this inspired by trying to figure out why this infra-prod-service-nodepool run failed
23:39 <clarkb> ianw: I think I started cleaning those up at one point and then frickler determined something else was hogging all the disk
23:39 <clarkb> But I'm not opposed to removing the 2020 log files
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Clean up some gitea-lb zuul config
23:41 <ianw> hrm, i guess actually if i expand it, we're just keeping everything
23:41 <ianw> i thought we only kept the last few runs, but must be wrong
23:42 <ianw> it would be good to have a more direct way to sync zuul build -> file on disk
23:43 <ianw> ok, for my own reference, "Rename playbook log on bridge" is i guess that
23:43 <ianw> and then it seems to run "find /var/log/ansible -name 'service-nodepool.yaml.log.*' -type f -mtime 30 -delete"
23:44 <clarkb> ya it should be cleaning up. I think what happens is if the filename changes we orphan some things
<opendevreview> Ian Wienand proposed opendev/system-config master: bridge production: fix mtime matching
23:48 <ianw> clarkb: ^ that will probably help
23:49 *** prometheanfire is now known as Guest2
23:49 *** osmanlicilegi is now known as Guest0
23:51 <clarkb> ah we can leak then if we don't run often enough to get the exact match
23:51 <ianw> crazy idea; keep a list of "log reader" gpg keys and encrypt each log file from the bridge runs with multiple --recipient keys.  then have a build artifact like our download-all-logs which is a command you can paste to just cat out the logs from a production run
23:51 <fungi> not a bad idea
23:52 <ianw> presumably infra-root, but if someone were interested in a particular service they could add themselves
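[Editor's note: ianw's multi-recipient idea boils down to a single `gpg --encrypt` with several `--recipient` options. The sketch below is illustrative only and assumes GnuPG >= 2.1; the uids, filename, and throwaway keys in a temporary GNUPGHOME are made up for demonstration, whereas a real deployment would import the curated infra-root public keys.]

```shell
export GNUPGHOME=$(mktemp -d)
chmod 700 "$GNUPGHOME"

# two hypothetical "log reader" keys, generated unattended for the demo
gpg --batch --passphrase '' --quick-gen-key 'root-a@example.org' default default never
gpg --batch --passphrase '' --quick-gen-key 'root-b@example.org' default default never

echo 'ansible playbook output' > service-nodepool.yaml.log

# encrypt once; any single listed recipient can decrypt the result
gpg --batch --yes --trust-model always \
    --recipient root-a@example.org --recipient root-b@example.org \
    --encrypt service-nodepool.yaml.log

# either keyholder would then run: gpg --decrypt service-nodepool.yaml.log.gpg
ls service-nodepool.yaml.log.gpg
```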
23:53 <fungi> clarkb: -mtime +30 would solve that
23:54 <clarkb> fungi: ya that is ianw's fix
23:54 <fungi> oh, missed it
23:54 <fungi> thanks, reviewing
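[Editor's note: the `-mtime 30` vs `-mtime +30` distinction behind ianw's fix is easy to demonstrate with GNU find; the file names below are made up. `-mtime 30` matches only files last modified exactly 30 whole days ago, so anything the cron misses on that one day leaks forever, while `-mtime +30` matches everything strictly older.]

```shell
d=$(mktemp -d)
touch -d '35 days ago' "$d/service-nodepool.yaml.log.old"
touch -d '2 days ago'  "$d/service-nodepool.yaml.log.recent"

# exact match: prints nothing, the 35-day-old file is silently skipped
find "$d" -type f -mtime 30

# range match: prints the 35-day-old file, which would now be deleted
find "$d" -type f -mtime +30

rm -rf "$d"
```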

Generated by 2.17.3 by Marius Gedminas - find it at!