Monday, 2022-09-12

*** ysandeep|out is now known as ysandeep04:36
*** ysandeep is now known as ysandeep|afk05:54
*** ysandeep|afk is now known as ysandeep06:30
fricklerianw: seems the translate01 mysql backup is failing since the recent update: mysqldump: Error: 'Access denied; you need (at least one of) the PROCESS privilege(s) for this operation' when trying to dump tablespaces                                                     06:35
ianwfrickler: thanks, will look into it06:39
fricklerClark[m]: seems "docker-compose ps" also lists exited containers, seems to be a bug afaict, because it does have the -a option that should do that, like for "docker ps"06:43
*** gibi_off is now known as gibi07:02
*** jpena|off is now known as jpena07:10
*** ysandeep is now known as ysandeep|lunch10:34
opendevreviewGhanshyam proposed openstack/project-config master: Make python version template unversioned
*** dviroel_ is now known as dviroel11:45
fricklergmann: I think I understand the idea now, but I'm not sure if it can actually work as planned. hoping others with more knowledge will jump in12:00
fungifrickler: gmann: if you're going to set branch matchers, you have to do it in the jobs which the template includes, not in the template itself12:20
fungi"it is not possible to explicitly set a branch matcher on a Project Template"
fungii get the impression from the commit message that was the expectation12:20
fungiahh, yes i see the latest review comment points that out as well12:22
gmannfungi: frickler I know template cannot have branches variant but we can do this way
gmannsame way we did for stable periodic job template
fungigmann: yes, if you do it in the jobs that will work. it may make sense to go ahead and add the master branch matcher to the jobs in the new template so that it's clearer12:28
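[Editor's note: a hedged sketch of the configuration shape being discussed; job and template names here are illustrative, not the actual openstack ones. Zuul ignores branch matchers on a project-template, so the master-only restriction has to sit on the job variants the template references.]

```yaml
# Illustrative names only. The branch matcher goes on the job:
- job:
    name: openstack-tox-py311
    branches: master

# The template just references the jobs; it cannot carry the matcher:
- project-template:
    name: openstack-python3-jobs
    check:
      jobs:
        - openstack-tox-py311
```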
gmannfungi: sure, we can do that and once new stable branch is cut then add those explicitly on required jobs. I can update that in
gmannfungi: plan is if we can merge this then bot patch will propose the right template name otherwise more deliverables releases with stable branch will have different template
fungiwell, we can also (carefully) backport the template swap, but yes avoiding more unnecessary backports would be good12:32
gmannyeah, let's leave those with versioned-named template and now onwards we can take care of branch and master on generic one12:34
fungigmann: you need 856903 merged first though, right? otherwise the release scripts are going to propose patches which can't merge12:34
fungioh, i see, the template name is being reused, so i guess they'll still run tests even before 85690312:35
gmannfungi: we have same name template 'openstack-python3-jobs' exist, so release script bot patches will be ok. but yes we will merge 856903 soon so that it run the latest set of jobs as per 2023.1 testing runtime12:35
gmannthis is helping me to transition from already merged bot patches to this new generic one12:36
fungiright, makes sense12:36
*** ysandeep|lunch is now known as ysandeep12:44
*** dasm|off is now known as dasm13:33
gmannfungi: can you review it, frickler is +2 now.
opendevreviewMerged openstack/project-config master: Make python version template unversioned
fungigmann: ^14:38
gmannfungi:  thanks14:41
frickleramorin: infra-root: I'm seeing issues with nested-kvm on ovh-gra1, I'm not sure we even intended to run these jobs there. strangely the issue only manifests with a new cirros version using the latest 5.15 kernel. Ubuntu 22.04 using the same kernel seems unaffected, as well as older versions of both ubuntu and cirros14:46
fricklerthe same thing on vexxhost is working fine. vexxhost working, ovh broken14:47
fricklerthe VM crashes shortly after boot with: unchecked MSR access error: WRMSR to 0x48 (tried to write 0x0000000000000004) at rIP: 0xffffffff9f296104 (native_write_msr+0x4/0x30)14:48
fricklerswitching to qemu instead of kvm also solves the issue. seems pretty reproducible, but I've no idea what to do about it14:50
fungifrickler: i see a lot of reports online of benign "unchecked MSR access error: WRMSR to ..." when the kernel is configured to apply microcode updates to the processor at boot or resume from suspend14:51
fungipossible the crash is unrelated to that14:52
fricklerhmm, the kernel should be the same, I can try to check whether the ubuntu image uses some additional cmdline args14:59
fungiwell, the kernel in our job node and the guest created on it are, but you don't know what kernel is used for the underlying provider hypervisor15:02
fungifor these nested virt related crashes, it often involves some interaction between the kernel versions at all three layers15:03
*** ysandeep is now known as ysandeep|out15:16
clarkbnew debian libc6 package is now available that should fix the ansible thing without backporting. I'll work on updating my change to update our python images to incorporate that (as I think upstream may not have updated the base image yet). I'll also look at the zuul restart failure this morning. But first local package updates and reboots and breakfast15:16
*** ysandeep|out is now known as ysandeep15:16
amorinfrickler, ack, first time I heard about such thing15:17
amorinperhaps our CPU behavior is different than the one on vexxhost15:18
amorinwhich flavor are you using?15:18
amorinoh, its flavor from open infra I think15:18
fungiamorin: seems to be called "ssd-osFoundation-3"15:19
amorinack, would you mind trying with a c2-7 or b2-7 for a one shot test?15:20
*** ysandeep is now known as ysandeep|out15:20
fungifrickler: ^ you'll probably need a refined reproducer you can run on a manually booted instance15:21
fungimaybe set a job hold, then shutdown and snapshot the instance for the held node, then boot that snapshot on the other flavors, ssh in and try booting the failed guest in devstack again?15:23
fungi(after restacking, i guess)15:23
opendevreviewClark Boylan proposed opendev/system-config master: Update python builder and base image
clarkbI think if we land ^ we can remove the zuul libc workarounds15:31
clarkbthe docker issue on zm05 has the docker OOM'd return code15:40
clarkband it seems it was the exec to stop the container that OOM'd not the container itself15:40
clarkbhowever dmesg records no OOM either15:40
clarkbin syslog docker records its shim disconnected so it cleaned things up15:43
fricklerI think I can manually boot a node and deploy the test there, but not today15:43
*** marios is now known as marios|out15:47
clarkbok digging more I don't think OOMkiller was invoked. cacti and dmesg and syslog and so on all indicate no oomkiller would have happened. Docker inspect also shows "OOMKilled": false. Instead I think this is a race between the merger exiting when asked to gracefully shutdown and the exec running in that container completing. The container itself reports it exited 0 and it is the15:53
clarkbdocker exec that requests graceful shutdown with a 137 return code15:53
clarkbI think we can safely ignore rc 137 on the graceful shutdown command15:54
clarkbI'll work on a patch15:54
*** dviroel is now known as dviroel|lunch15:55
fungirc 137 == "fatal error: requested action already complete"?15:56
clarkbfungi: its apparently what you get when you get kill -9'd15:56
clarkbin this case the container stopping and being cleaned up as a runtime is causing the equivalent of a kill -9 against the exec'd command I think15:56
clarkbsince there is no longer a runtime for that15:56
fungiahh, okay15:57
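[Editor's note: the 137 follows the standard shell convention for signal deaths — a process killed by a signal exits with 128 plus the signal number, and SIGKILL is 9. A quick demonstration:]

```shell
# A child killed by SIGKILL (signal 9) reports exit status 128 + 9 = 137
sh -c 'kill -9 $$'
echo "exit status: $?"   # prints: exit status: 137
```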
clarkbThe executor theoretically has the same race but is far less likely to hit it simply due to the executor having a lot more stuff to do to shutdown. I won't update the executor as I'd like to observe if that ever happens15:59
opendevreviewClark Boylan proposed opendev/system-config master: Handle zuul merger shutdown race in graceful stop
clarkbsomething like that16:03
*** dhill is now known as Guest11716:04
clarkbI expect to land real soon now and we can coordinate ^ to run after that16:05
clarkboh wait but there is another bug which is that docker-compose ps -q will list exited containers16:07
* clarkb works on a fix for that too16:07
clarkbI'm beginning to wonder if I should just do what corvus suggests and set failed_when false on that task and stop trying to check things ahead of time16:08
clarkbthat would address both issues16:08
*** jpena|off is now known as jpena16:09
fungiclarkb: in theory we could hit that same race with an idle executor, right?16:10
*** jpena is now known as jpena|off16:10
clarkbfungi: yes, but far less likely since even an idle executor has a bit more to shutdown16:11
clarkbfwiw I think frickler is correct that docker-compose ps is buggy. `docker ps -q --filter status=running` does not report exited containers but docker-compose command does16:12
clarkbif I flip it to status=exited then I see the containers indicating the filter is working on the docker side16:12
clarkbThe issue with corvus' suggestion that I need to think about is whether or not we can safely failed_when false on the docker wait command. Since having no running containers would cause that to error16:13
clarkbmaybe we need both things. Skip when no running containers, otherwise don't worry about errors too much16:13
opendevreviewClark Boylan proposed opendev/system-config master: Fixup zuul merger and executor graceful shutdowns
clarkbThat tries to check things rather than just doing a failed_when false16:26
clarkballows us to do a wait when there are containers listed (and waiting on an exited container is a noop it just returns)16:26
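[Editor's note: a hedged sketch of the approach described above; container and task names are hypothetical. The idea is to only wait when a running container is actually listed, and to tolerate rc 137 from the exec racing the container's own shutdown.]

```yaml
# Sketch only: check for running containers first, then don't treat
# the SIGKILL-style 137 from a torn-down exec as a failure.
- name: Check for running merger containers
  command: docker ps -q --filter status=running --filter name=merger
  register: merger_containers

- name: Gracefully stop the zuul merger
  command: docker-compose exec merger zuul-merger stop
  when: merger_containers.stdout != ''
  register: stop_result
  failed_when: stop_result.rc not in [0, 137]
```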
*** dviroel|lunch is now known as dviroel16:52
clarkbfungi: in the mm3 change were you also going to update the prod inventory file for opendev and then comment out the other bits ? THne we can stack changes on top of that when we reach that point to uncomment each site as we want to deploy it17:08
fungioh, sure! i got tunnel vision on the test config17:15
fungithanks for catching that17:15
clarkbthe python image update change passes now and shows the new libc6 installation in the logs
clarkbI think if we land that we can drop the backport workaround in the zuul images18:16
clarkband if I can get reviews for I can manually rerun that playbook when we're ready to update the zuul nodesets support18:25
clarkbfungi: I guess at this point I can delete the older of the two mm3 autoholds.18:27
clarkbfungi: and we still need to test the pipermail redirect? Have you had a chance to look at that yet?18:27
fungioh, no i haven't checked out the redirect yet18:29
clarkbok we can do that on the newer of the two holds. Any objection to deleting the older of the two now?18:29
fungino objection18:31
fungiwe can set another autohold once i revise the change to add the missing bits added to the production config18:32
clarkbok deleting now18:32
clarkband ya thats probably a good idea to retest and ensure the migration is happier with those lists pre existing18:33
fungiyeah, now that i have it more scripted, it's easier to test that in bulk18:33
clarkbI've just updated the meeting agenda wiki page. I've tried to add changes that are in flight and similar bits to the various topics. Please feel free to add more topics or info to existing topics18:38
clarkbhrm the system-config-run-zuul job has timed out twice in a row19:39
clarkbThey both ran on ovh gra119:40
clarkband it seems like we may be seeing the high cost per ansible task here at about 3 seconds19:41
fungiokay, maybe not so odd then if performance was similar for both19:41
clarkbthats a good chunk of time just adding known hosts19:44
clarkbabout 13 minutes19:44
clarkbanother minute here
clarkboh my first example is for two ansible loops not one. Each individually takes about 6.5 minutes19:48
clarkbwow ok so we do this twice; the second pass makes the first one redundant. According to the change log we do it redundantly because some test checks the first one19:50
clarkbThis is an easy improvement by removing the first and fixing the test if I can track it down19:51
opendevreviewClark Boylan proposed zuul/zuul-jobs master: Remove redundant ssh known hosts prep
clarkbThat would save 6.5 ish minutes19:58
clarkbanother thing I'm noticing from those logs is that it almost looks like things run sequentially when they can run in parallel20:03
clarkb is output by
clarkbit doesn't look like we set serial: 1 on that playbook20:06
clarkboh! could it be that we run with -f 5? so the first 5 happen quickly and then we have a straggler?20:06
clarkbyes I think that may be what is happening20:07
clarkbjobs with more nodes than ~5 will incur wall clock penalties for things that would otherwise happen in parallel20:08
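[Editor's note: the 5-fork ceiling mentioned above is the stock Ansible default; raising it is a one-line config change (sketch, value assumed):]

```ini
# ansible.cfg -- Ansible runs at most this many hosts in parallel per
# task; with the default of 5, a sixth node becomes a straggler.
[defaults]
forks = 50
```

The same thing can be set per-invocation with `ansible-playbook -f 50`.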
clarkb writing out our zuul job vars files is also not quick. I think the templating is particularly slow? I may work on a change that moves non template files to copies instead of templates to see if that is any better and keep templating only when needed20:15
corvusclarkb: is the zuul rolling restart stuck again?20:16
corvusi see dev10 and dev18 components20:16
clarkbcorvus: it crashed, there is discussion about it in scrollback but the tldr is in
corvusclarkb: is fix to merge that and re-run?20:17
clarkband now I'm sorting out why the job in that change is consistently timing out which led to
clarkbcorvus: yes20:17
corvusclarkb: i thought we ran our service playbooks with more than 5 forks, so if we're doing 5 in the test job, seems like we should increase that20:23
clarkbcorvus: is that controllable from zuul?20:25
corvusi thought we were talking about the nested ansible20:26
clarkbI think it is affecting both20:26
clarkb that is run by zuul aiui.20:27
clarkbAh and later in that playbook it sets -f 50 on a few playbooks. The others should only run against a single host20:27
*** dviroel is now known as dviroel|afk20:28
corvusso it should only be the in-zuul part, like the vars templating you were looking at.  i don't think that's controllable in zuul.20:30
corvuss/should only be/should only be limited by/20:31
clarkbI think halving the runtime cost of setting up ssh known hosts via 857228 will make a big impact quickly. And ya working on the shift of templating to copies if templating isn't needed to see if that helps too20:32
corvusclarkb: i'd be really surprised if the templating takes longer than copying20:34
corvusi think we're just looking at the iterating cost20:34
clarkbcorvus: I thought that initially too, but the normal iteration cost seems to be about 2.5-3 seconds but the templates all take about 6 seconds20:34
clarkbhowever, that could just be chance. I figure putting the change together and checking isn't too bad20:35
clarkbif it is about 3 seconds quicker we'll save noticeable time with the number of files being written20:36
corvusclarkb: i can see no discernable difference in a local test20:37
clarkbcorvus: shows the 6 second ish cost per template and if you look at the tasks above that those that don't template are closer to 320:39
clarkbbut that is only one data point20:39
opendevreviewJeremy Stanley proposed opendev/system-config master: Add a mailman3 list server
corvussure but there are other differences at play here?  restarting the host loop... different tasks....20:40
corvusi guess i'm saying two things: 1) it seems unlikely and probably not the lowest hanging fruit; 2) having those be templated is *really* valuable and if we're going to change that, i think we should be really sure it's making a difference.  i don't want to start propagating a "copy is faster than template" meme without really compelling evidence.20:41
corvusi mean, i just proved locally that template is faster than copy... so.. :)20:42
clarkbyes, and the way to collect that evidence is to write a change and have zuul run it in the CI system?20:42
clarkbI'm not saying we should merge any such change right now. Just that I think it is worth checking20:43
corvusclarkb: testing it that way is going to be super tricky -- you're introducing the load of the executors and the remote cloud chosen to run the job and the noisy neighbors in that cloud all as variables20:43
corvusi think a reliable test of that needs to be two runs in the exact same conditions20:44
opendevreviewClark Boylan proposed opendev/system-config master: WIP Check if file copies are quicker than templating
corvusso right now, i'm giving my local test of a playbook far more weight than i would give a change that just s/template/copy/ in our job20:44
clarkbcorvus: your local run excludes any networking and probably has the advantage of nvme storage and so on though. I can imagine a situation where your local run isn't representative of what happens in the CI system.20:45
corvusi ran it over the network20:45
corvusand of course it's not representative20:46
clarkbthat said I agree it is a long shot, but I mean ansible tasks should never take 3 seconds anyway. So I won't be surprised at all if the templating system has some weirdness that makes it take longer too20:46
clarkbthe change to remove unneeded tasks from the multi node known hosts role is definitely the one we should focus on right now I think20:46
corvusthe point is to determine if template is faster than copy, and i have a high degree of confidence it's not20:46
corvuser, strike that, reverse it.20:46
corvusclarkb: yeah, that one is already approved20:47
clarkboh cool20:47
corvusi was really just trying to save you the trouble of writing 85723220:47
corvuswhich is why i went to the trouble of doing the local tests20:47
corvusbut :(20:47
clarkbI don't see it as trouble right now. Ansible is slow and exhibiting some interesting behaviors and I think examining them is useful and interesting20:48
corvussure, but that test isn't going to tell us anything20:48
corvusi agree improvements and understanding would be great20:49
corvusclarkb: ^ that's the playbook i ran with the resulting times20:50
corvusquite likely they're within the margin of error20:50
clarkbin particular I think if we can understand why an inifile task at takes half the time as a template task at we may be able to improve our use of ansible in an impactful manner for many20:50
clarkbBoth are writing to the same filesystem on the same host (the fake bridge). It is possible we got lucky and noisy neighbors just happened to show up in between those two tasks. The inifile module may have a particularly efficient implementation. etc20:52
opendevreviewMerged zuul/zuul-jobs master: Remove redundant ssh known hosts prep
clarkbThe ansible async option is something we might want to look at on these loops too. But I seem to recall people having difficulty using that option20:55
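[Editor's note: a hedged sketch of the async pattern being referred to; variable names (`key_id`, `logfiles`) are hypothetical. Each loop item is started immediately instead of serially, and a second task gathers the results.]

```yaml
# Fire-and-gather with async: start every item without waiting...
- name: Start encryptions in the background
  command: "gpg --batch --encrypt --recipient {{ key_id }} {{ item }}"
  loop: "{{ logfiles }}"
  async: 600    # allow up to 10 minutes per job
  poll: 0       # do not block; just record the job ids
  register: started

# ...then poll until they have all finished.
- name: Wait for all encryptions to complete
  async_status:
    jid: "{{ item.ansible_job_id }}"
  loop: "{{ started.results }}"
  register: result
  until: result.finished
  retries: 60
  delay: 10
```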
clarkbcorvus: the run against 857232 does seem to show inifile is still half the runtime of template and copy is about the same as template. The consistent speedup between inifile and the rest make think we aren't getting lucky with noisy neighbors. Maybe we need to put everything in an inifile21:00
corvusclarkb: haha, i'm having a "cant tell if serious" moment. :)21:01
clarkbnot really serious, but it does show that ansible can be faster at similar tasks than it is using the typical approach. I'm wondering if inifile avoids copying any files over the wire and does all of its changes on the remote node which might explain it21:02
corvusyeah, it did occur to me it might have some radically different approach like that; i haven't opened up the modules to see21:03
corvus(my first thought was "haha ini files for everything!" then "wait could that work?" then "well, yes, but it would be silly, right?"  :)21:03
corvusi do wish there were a way to do this without loops21:04
clarkb++ I was thinking that with the known hosts thing too. Like giving the known hosts module a list of keys to set21:06
clarkbA lot of the ansible modules seem to be written to accept a singleton input with the assumption you'll use loops to do multiples. Problem is loops are super expensive21:07
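[Editor's note: a concrete illustration of the two shapes being compared; paths and variables are hypothetical. The `known_hosts` module takes one entry per task, so a loop pays the full per-task overhead for every key, while a single `copy` with templated content writes the whole file in one round trip.]

```yaml
# Per-item overhead: one full task round-trip per host key.
- name: Add each host key (slow)
  known_hosts:
    name: "{{ item.name }}"
    key: "{{ item.key }}"
  loop: "{{ host_keys }}"

# One-task alternative: render the whole known_hosts file at once.
- name: Write known_hosts in a single task
  copy:
    dest: "{{ ansible_env.HOME }}/.ssh/known_hosts"
    content: |
      {% for h in host_keys %}
      {{ h.key }}
      {% endfor %}
```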
corvusi wonder if there's something about our callbacks or otherwise related to our env that slows it down.  since my local test is like 0.5s for a task21:09
clarkb I do think it is operating on the remote exclusively without doing file copies21:09
corvusthese are really small files though, so while it is a difference, i wouldn't expect transferring that data to take an extra 3 seconds... how does it transfer it anyway?  in the json blob, or does it open an ssh channel?21:10
clarkbparsing the copy module is a lot more involved :) trying to sort out where it gets the data21:14
clarkbit looks like the copy module receives source paths and not content unless you use teh content parameter. But I'm not immediately seeing how it converts a src path on the ansible control side to the remote side file copy21:18
clarkbah I think the action module portion of copy does the transfer and the regular ansible module code can ignore it21:23
clarkbyup the action module creates a new tmp_src value that it feeds into the ansible module when it calls it21:25
clarkbcorvus: there is what appears to be an undocumented "raw" argument to the copy module that seems to write the file straight to its destination rather than doing a copy to the remote node in a tmp location and then a copy from that tmp location to the final location21:29
opendevreviewClark Boylan proposed opendev/system-config master: WIP Check if file copies are quicker than templating
clarkbI really hope using raw ^ isn't faster21:32
clarkbcorvus: looking at it from the callback angle the command stuff doesn't apply and the console stream runs in another process. That would mean zuul_json would have to be at fault?21:40
clarkb(assuming it is something to do with callbacks21:40
clarkbthe task start callback handler is pretty straightforward. It accesses a few task attributes and gets a time stamp21:43
clarkbI have a hard time seeing how that would create an impact unless processing the callback at all was problematic21:43
corvusyeah, nothing is immediately occuring to me either.  also, these are small files.21:44
clarkbmy attempt at using raw failed with permissions errors. I suspect not raw figures those out automatically21:45
clarkbbut I think grepping the error message shows that it is using ssh to transfer the file21:46
corvusso maybe we're looking at the overhead for establishing a new ssh channel?21:50
clarkbya, it seems to be hooking up ssh to dd (and the dd is what ultimately failed)21:51
ianwfrickler: that error sounds exactly like ->
ianw was the workaround for that21:54
ianwi am guessing that booting cirros with nested virt hits the same problem in the cirros kernel?21:55
opendevreviewClark Boylan proposed opendev/system-config master: WIP Check if file copies are quicker than templating
clarkbcorvus: ^ that does appear to save about a minute of wall time compared to the original in the currently running test for it22:07
clarkbcorvus: the impact on the slower ovh gra1 nodes is bigger too. Basically switching to synchronize means each synchronize task is about the same as a single template or copy22:08
clarkbin the worst case I think we're saving about 3x the case above. So about 3 minutes total saved? I'm not sure that is worthwhile but if we can chip away a minute here and a minute there by writing more efficient ansible the aggregate should be pretty decent22:14
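[Editor's note: a hedged sketch of the synchronize approach described above; paths are hypothetical. One rsync of the whole directory replaces N individual copy/template tasks; any files that genuinely need Jinja substitution would still be templated separately.]

```yaml
# One task instead of one-per-file: rsync the whole vars tree over.
- name: Sync all test host/group vars in one task
  synchronize:
    src: "{{ playbook_dir }}/test-vars/"
    dest: /etc/ansible/
```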
clarkbianw: ^ related I think we should look into opportunities to change how the log files are encrypted22:15
corvusclarkb: tbh, the extra minute is worth it to me in order to make all of that more comprehensible.  i hate that it's slow, but i don't like the idea that we would be adding one more level to the flow chart of "how did this variable get into the final job".  i like that your change is conservative in that it syncs all the files over meaning that it should be impossible to end up with a file missing.  but it does mean that if we want to template,22:15
corvuswe have to remember to add it to the template list (otherwise, we'll end up with {string} values.  all said, i think i can deal with it, so if i'm in the minority, i won't object.22:15
clarkbianw: thats another place where we loop and maybe we don't need to. Perhaps we can have a shell task that loops over them all and does the encryption instead?22:15
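[Editor's note: a hypothetical sketch of that suggestion — the log path and `key_id` variable are assumptions. Moving the loop into a single shell task pays the per-task overhead once instead of once per file.]

```yaml
# One ansible task whose shell loop handles every file itself.
- name: Encrypt all build logs in one task
  shell: |
    for f in /var/log/zuul-build/*.log; do
      gpg --batch --encrypt --recipient "{{ key_id }}" "$f"
    done
```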
corvusclarkb: (fwiw, i think that is the best possible change given the position ansible has put us in, so i mean, on that basis: nicely done :)22:17
clarkbya I'll work on cleaning the change up once testing shows its generally happy. Then we can decide if we want to merge it or not22:18
corvusclarkb: (i'm mostly just coming from "this is the most complicated and round-about part of the whole nested ansible process and i barely understand the original and i wrote it)22:18
corvusi'm always forgetting to add the fake template var files, so i know i'm going to forget to add them to the template list22:19
clarkbya I forget often myself. Related the mm3 change adds host_vars just for the list server vars because I couldn't figure out how to do bits that needed to be raw as raw when we needed nested raw tags
corvusbut hey, maybe they will fail in a way that suggests the obvious fix is to add them to the template list22:21
clarkbin particular it seems the parser ansible uses doesn't do proper push down and it's doing a naive matching so the first end matches the outer rather than inner raw handling21:21
clarkbI gave up and just decided to copy the file as is22:21
clarkblines 50-54 are the problematic ones there22:22
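[Editor's note: a minimal illustration of the nesting problem described above, assuming Jinja2-style raw tags. Jinja does not nest raw blocks: everything after `{% raw %}` is literal text until the first `{% endraw %}`, so the first closer ends the outer block and the second one is then an unknown tag.]

```jinja
{% raw %}
  {{ stays-literal }}
  {% raw %}        {# just literal text, not a nested block #}
{% endraw %}       {# closes the OUTER raw block #}
{% endraw %}       {# syntax error: no raw block left to close #}
```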
opendevreviewMerged opendev/system-config master: Fixup zuul merger and executor graceful shutdowns
clarkbI'll work on running ^ in screen on bridge next22:24
clarkband it is running now22:30
clarkbI guess I should've checked if all the changes on the zuul side we wanted in are in first. But its late enough in the day that it should be done sometime tomorrow and we can run it again :)22:30
corvusclarkb: like what changes?22:32
corvusyou mean the speedup, or any changes to get in for the restart?22:33
corvus(if the former, i think your zuul-jobs change landed; if the latter -- i think we had zuul where we wanted it for the restart for 6.4.0 on friday, so getting it unstuck as close to that as possible would be great)22:34
opendevreviewClark Boylan proposed opendev/system-config master: Use synchronize to copy test host/group vars
clarkbcorvus: the latter. Ok I wasn't sure if the nodeset changes had landed yet22:36
clarkbbut those are also less important for 6.4.0 I think22:36
clarkbcorvus: 857232 is cleaned up an in a reviewable state if you want to add your thoughts to it.22:36
corvusyeah, my preference would be to defer those (and anything else) until we complete the weekend restart so that we can release current master as
clarkbworks for me. That restart is in progress now22:37
clarkbit is logging to the normal file on bridge if you want to follow along in more detail than the components page22:37
* clarkb adds ansible optimization thoughts to the meeting agenda22:38
corvusclarkb: done, thanks!22:41
*** dasm is now known as dasm|off22:59
ianwclarkb: yeah, it's quite loop-y.  i guess it's nice to have the tasks split up in console/ara views, but equally a shell script gets the same thing done23:08
clarkbianw: ya the problem is 6 seconds for each task adds up quickly when you do more than 10 things in a loop23:13
clarkbI agree ansible should be better here, but dealing with that we've got seems pragmatic if we continue to hit job timeouts23:13
opendevreviewClark Boylan proposed opendev/system-config master: WIP Try ansible pipelining in our system-config-run jobs
clarkbafter pushing that I've realized we already set that in the ansible config so that's a noop23:23
ianwit looks like the translate mysql dump backup is failing due to which is a breaking security change.  one option is to add --no-tablespaces to the dump, or i guess adjust the privileges of the dump user23:30
ianwthe db's don't have a root user by default, you have to add one and that then warns the db will become unsupported.  so i can't see a way to GRANT PROCESS to the zanata user.  i'll just work around it in puppet23:35
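[Editor's note: the two workarounds mentioned, sketched here with assumed user/host names. Alternatively, passing `--no-tablespaces` to mysqldump sidesteps the privilege requirement entirely, e.g. `mysqldump --no-tablespaces --all-databases`.]

```sql
-- If an admin user were available, granting the newly-required
-- privilege to the backup user would also fix the dump:
GRANT PROCESS ON *.* TO 'zanata'@'localhost';
FLUSH PRIVILEGES;
```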
opendevreviewIan Wienand proposed opendev/system-config master: translate: fix dump with MySQL 5.7
clarkbthe zuul reboot playbook has moved on to ze02. Thats a good indication I haven't broken anything horribly with that change23:41

Generated by 2.17.3 by Marius Gedminas - find it at!