*** ysandeep|out is now known as ysandeep | 04:36 | |
*** ysandeep is now known as ysandeep|afk | 05:54 | |
*** ysandeep|afk is now known as ysandeep | 06:30 | |
frickler | ianw: seems the translate01 mysql backup is failing since the recent update: mysqldump: Error: 'Access denied; you need (at least one of) the PROCESS privilege(s) for this operation' when trying to dump tablespaces | 06:35 |
ianw | frickler: thanks, will look into it | 06:39 |
frickler | Clark[m]: seems "docker-compose ps" also lists exited containers, seems to be a bug afaict, because it does have the -a option that should do that, like for "docker ps" | 06:43 |
*** gibi_off is now known as gibi | 07:02 | |
*** jpena|off is now known as jpena | 07:10 | |
*** ysandeep is now known as ysandeep|lunch | 10:34 | |
opendevreview | Ghanshyam proposed openstack/project-config master: Make python version template unversioned https://review.opendev.org/c/openstack/project-config/+/856904 | 11:04 |
*** dviroel_ is now known as dviroel | 11:45 | |
frickler | gmann: I think I understand the idea now, but I'm not sure if it can actually word as planned. hoping others with more knowledge will jump in | 12:00 |
frickler | s/word/work/ | 12:00 |
fungi | frickler: gmann: if you're going to set branch matchers, you have to do it in the jobs which the template includes, not in the template itself | 12:20 |
fungi | "it is not possible to explicity set a branch matcher on a Project Template" https://zuul-ci.org/docs/zuul/latest/config/project.html#project-template | 12:20 |
fungi | i get the impression from the commit message that was the expectation | 12:20 |
fungi | ahh, yes i see the latest review comment points that out as well | 12:22 |
gmann | fungi: frickler I know template cannot have branches variant but we can do this way https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/856903/1/zuul.d/project-templates.yaml#1466 | 12:24 |
gmann | same way we did for stable periodic job template https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/856903/1/zuul.d/project-templates.yaml#2398 | 12:26 |
fungi | gmann: yes, if you do it in the jobs that will work. it may make sense to go ahead and add the master branch matcher to the jobs in the new template so that it's clearer | 12:28 |
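For reference, the pattern being discussed looks roughly like the snippet below in Zuul's project-template syntax; the template and job names here are illustrative, not the actual contents of 856903. The branch matcher has to sit on the job entries, since the template itself cannot carry one.

```yaml
- project-template:
    name: openstack-python3-jobs
    check:
      jobs:
        # the matcher goes on each job entry; a project-template
        # cannot set "branches" at the template level
        - openstack-tox-py310:
            branches: master
    gate:
      jobs:
        - openstack-tox-py310:
            branches: master
```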
gmann | fungi: sure, we can do that and once new stable branch is cut then add those explicitly on required jobs. I can update that in https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/856903 | 12:30 |
gmann | fungi: plan is if we can merge this then bot patch will propose the right template name otherwise more deliverables releases with stable branch will have different template https://review.opendev.org/c/openstack/project-config/+/856904 | 12:31 |
fungi | well, we can also (carefully) backport the template swap, but yes avoiding more unnecessary backports would be good | 12:32 |
gmann | yeah, let's leave those with versioned-named template and now onwards we can take care of branch and master on generic one | 12:34 |
fungi | gmann: you need 856903 merged first though, right? otherwise the release scripts are going to propose patches which can't merge | 12:34 |
fungi | oh, i see, the template name is being reused, so i guess they'll still run tests even before 856903 | 12:35 |
gmann | fungi: we have same name template 'openstack-python3-jobs' exist, so release script bot patches will be ok. but yes we will merge 856903 soon so that it run the latest set of jobs as per 2023.1 testing runtime | 12:35 |
gmann | https://github.com/openstack/openstack-zuul-jobs/blob/b61f7acddba26e5ac7c4ea7dbe1fe8cdc29fec7e/zuul.d/project-templates.yaml#L1597 | 12:36 |
gmann | this is helping me to transition from already merged bot patches to this new generic one | 12:36 |
fungi | right, makes sense | 12:36 |
*** ysandeep|lunch is now known as ysandeep | 12:44 | |
*** dasm|off is now known as dasm | 13:33 | |
gmann | fungi: can you review it, frickler is +2 now. https://review.opendev.org/c/openstack/project-config/+/856904 | 14:20 |
opendevreview | Merged openstack/project-config master: Make python version template unversioned https://review.opendev.org/c/openstack/project-config/+/856904 | 14:29 |
fungi | gmann: ^ | 14:38 |
gmann | fungi: thanks | 14:41 |
frickler | amorin: infra-root: I'm seeing issues with nested-kvm on ovh-gra1, I'm not sure we even intended to run these jobs there. strangely the issue only manifests with a new cirros version using the latest 5.15 kernel. Ubuntu 22.04 using the same kernel seems unaffected, as well as older versions of both ubuntu and cirros | 14:46 |
frickler | the same thing on vexxhost is working fine. https://zuul.opendev.org/t/openstack/build/b53d8543592f42da8b704a93ccab2e50/artifacts vexxhost working, https://zuul.opendev.org/t/openstack/build/9e75cbb7c1904c2cb39c4c001caed751 ovh broken | 14:47 |
frickler | the VM crashes shortly after boot with: unchecked MSR access error: WRMSR to 0x48 (tried to write 0x0000000000000004) at rIP: 0xffffffff9f296104 (native_write_msr+0x4/0x30) | 14:48 |
frickler | switching to qemu instead of kvm also solves the issue. seems pretty reproducible, but I've no idea what to do about it | 14:50 |
fungi | frickler: i see a lot of reports online of benign "unchecked MSR access error: WRMSR to ..." when the kernel is configured to apply microcode updates to the processor at boot or resume from suspend | 14:51 |
fungi | possible the crash is unrelated to that | 14:52 |
frickler | hmm, the kernel should be the same, I can try to check whether the ubuntu image uses some additional cmdline args | 14:59 |
fungi | well, the kernel in our job node and the guest created on it are, but you don't know what kernel is used for the underlying provider hypervisor | 15:02 |
fungi | for these nested virt related crashes, it often involves some interaction between the kernel versions at all three layers | 15:03 |
*** ysandeep is now known as ysandeep|out | 15:16 | |
clarkb | new debian libc6 package is now available that should fix the ansible thing without backporting. I'll work on updating my change to update our python images to incorporate that (as I think upstream may not have updated the base image yet). I'll also look at the zuul restart failure this morning. But first local package updates and reboots and breakfast | 15:16 |
*** ysandeep|out is now known as ysandeep | 15:16 | |
amorin | frickler, ack, first time I heard about such thing | 15:17 |
amorin | perhaps our CPU behavior is different from the one on vexxhost | 15:18
amorin | which flavor are you using? | 15:18 |
amorin | oh, its flavor from open infra I think | 15:18 |
fungi | amorin: seems to be called "ssd-osFoundation-3" | 15:19 |
amorin | ack, would you mind trying with a c2-7 or b2-7 for a one shot test? | 15:20 |
*** ysandeep is now known as ysandeep|out | 15:20 | |
fungi | frickler: ^ you'll probably need a refined reproducer you can run on a manually booted instance | 15:21 |
fungi | maybe set a job hold, then shutdown and snapshot the instance for the held node, then boot that snapshot on the other flavors, ssh in and try booting the failed guest in devstack again? | 15:23 |
fungi | (after restacking, i guess) | 15:23 |
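A sketch of the reproducer workflow fungi outlines, assuming zuul-client and the OpenStack CLI are available; the job name, flavor, network, and image name are placeholders:

```shell
# hold the next failing build of the affected job
zuul-client autohold --tenant openstack --project openstack/devstack \
    --job devstack-platform-cirros --reason "nested-kvm MSR debug"
# snapshot the held node in the provider, then boot the snapshot on
# one of the flavors amorin suggested and re-run the failing guest
openstack server image create --name nested-kvm-debug <held-server-uuid>
openstack server create --flavor c2-7 --image nested-kvm-debug \
    --network <network> nested-kvm-repro
```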
opendevreview | Clark Boylan proposed opendev/system-config master: Update python builder and base image https://review.opendev.org/c/opendev/system-config/+/856537 | 15:31 |
clarkb | I think if we land ^ we can remove the zuul libc workarounds | 15:31 |
clarkb | the docker issue on zm05 has the docker OOM'd return code | 15:40 |
clarkb | and it seems it was the exec to stop the container that OOM'd not the container itself | 15:40 |
fungi | ouch | 15:40 |
clarkb | however dmesg records no OOM either | 15:40 |
clarkb | in syslog docker records its shim disconnected so it cleaned things up | 15:43 |
frickler | I think I can manually boot a node and deploy the test there, but not today | 15:43
*** marios is now known as marios|out | 15:47 | |
clarkb | ok digging more I don't think OOMkiller was invoked. cacti and dmesg and syslog and so on all indicate no oomkiller would have happened. Docker inspect also shows "OOMKilled": false. Instead I think this is a race between the merger exiting when asked to gracefully shut down and the exec running in that container completing. The container itself reports it exited 0 and it is the | 15:53
clarkb | docker exec that requests graceful shutdown with a 137 return code | 15:53 |
clarkb | I think we can safely ignore rc 137 on the graceful shutdown command | 15:54 |
clarkb | I'll work on a patch | 15:54 |
*** dviroel is now known as dviroel|lunch | 15:55 | |
fungi | rc 137 == "fatal error: requested action already complete"? | 15:56 |
clarkb | fungi: it's apparently what you get when you get kill -9'd | 15:56
clarkb | in this case the container stopping and being cleaned up as a runtime is causing the equivalent of a kill -9 against the exec'd command I think | 15:56 |
clarkb | since there is no longer a runtime for that | 15:56 |
fungi | ahh, okay | 15:57 |
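For reference, 137 follows the usual 128+signal convention for SIGKILL, which is why an exec whose container runtime disappears underneath it reports that code; easy to confirm locally:

```shell
# 137 = 128 + 9 (SIGKILL)
sh -c 'kill -9 $$'; echo $?   # prints 137
```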
clarkb | The executor theoretically has the same race but is far less likely to hit it simply due to the executor having a lot more stuff to do to shut down. I won't update the executor as I'd like to observe if that ever happens | 15:59
opendevreview | Clark Boylan proposed opendev/system-config master: Handle zuul merger shutdown race in graceful stop https://review.opendev.org/c/opendev/system-config/+/857209 | 16:03 |
clarkb | something like that | 16:03 |
*** dhill is now known as Guest117 | 16:04 | |
clarkb | I expect https://review.opendev.org/c/zuul/zuul/+/855801/ to land real soon now and we can coordinate ^ to run after that | 16:05 |
clarkb | oh wait but there is another bug which is that docker-compose ps -q will list exited containers | 16:07 |
* clarkb works on a fix for that too | 16:07 | |
clarkb | I'm beginning to wonder if I should just do what corvus suggests and set failed_when false on that task and stop trying to check things ahead of time | 16:08 |
clarkb | that would address both issues | 16:08 |
*** jpena|off is now known as jpena | 16:09 | |
fungi | clarkb: in theory we could hit that same race with an idle executor, right? | 16:10 |
*** jpena is now known as jpena|off | 16:10 | |
clarkb | fungi: yes, but far less likely since even an idle executor has a bit more to shutdown | 16:11 |
clarkb | fwiw I think frickler is correct that docker-compose ps is buggy. `docker ps -q --filter status=running` does not report exited containers but the docker-compose command does | 16:12
clarkb | if I flip it to status=exited then I see the containers indicating the filter is working on the docker side | 16:12 |
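The difference clarkb describes is easy to reproduce on any host that has a stopped container; the compose project here is whatever happens to be defined locally:

```shell
docker ps -q --filter status=running   # running containers only
docker ps -q --filter status=exited    # exited ones, as expected
# docker-compose ps -q (at least in the version in use here) also
# prints exited containers, so it cannot be used as a "running" check
docker-compose ps -q
```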
clarkb | The issue with corvus' suggestion that I need to think about is whether or not we can safely failed_when false on the docker wait command. Since having no running containers would cause that to error | 16:13 |
clarkb | maybe we need both things. Skip when no running containers, otherwise don't worry about errors too much | 16:13 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fixup zuul merger and executor graceful shutdowns https://review.opendev.org/c/opendev/system-config/+/857209 | 16:26 |
clarkb | That tries to check things rather than just doing a failed_when false | 16:26 |
clarkb | allows us to do a wait when there are containers listed (and waiting on an exited container is a noop it just returns) | 16:26 |
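A minimal sketch of that shape as an Ansible task list; the service name, compose directory, and stop command are assumptions, not the actual system-config role:

```yaml
- name: List running merger containers
  shell: docker ps -q --filter status=running --filter name=merger
  register: running_mergers

- name: Gracefully stop the merger
  shell:
    cmd: docker-compose exec -T merger zuul-merger stop
    chdir: /etc/zuul-merger
  when: running_mergers.stdout != ''
  register: graceful_stop
  # rc 137 here just means the container exited before the exec finished
  failed_when: graceful_stop.rc not in [0, 137]
```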
*** dviroel|lunch is now known as dviroel | 16:52 | |
clarkb | fungi: in the mm3 change were you also going to update the prod inventory file for opendev and then comment out the other bits https://review.opendev.org/c/opendev/system-config/+/851248/74/inventory/service/host_vars/lists01.opendev.org.yaml ? Then we can stack changes on top of that when we reach that point to uncomment each site as we want to deploy it | 17:08
fungi | oh, sure! i got tunnel vision on the test config | 17:15 |
fungi | thanks for catching that | 17:15 |
clarkb | the python image update change passes now and shows the new libc6 installation in the logs https://zuul.opendev.org/t/openstack/build/5a4589af96b04dc181a20f8a884de568/log/job-output.txt#1194 | 18:16 |
clarkb | I think if we land that we can drop the backport workaround in the zuul images | 18:16 |
clarkb | and if I can get reviews for https://review.opendev.org/c/opendev/system-config/+/857209 I can manually rerun that playbook when we're ready to update the zuul nodesets support | 18:25 |
clarkb | fungi: I guess at this point I can delete the older of the two mm3 autoholds. | 18:27 |
clarkb | fungi: and we still need to test the pipermail redirect? Have you had a chance to look at that yet? | 18:27 |
fungi | oh, no i haven't checked out the redirect yet | 18:29 |
clarkb | ok we can do that on the newer of the two holds. Any objection to deleting the older of the two now? | 18:29 |
fungi | no objection | 18:31 |
fungi | we can set another autohold once i revise the change to add the missing bits added to the production config | 18:32 |
clarkb | ok deleting now | 18:32 |
clarkb | and ya that's probably a good idea to retest and ensure the migration is happier with those lists pre-existing | 18:33
fungi | yeah, now that i have it more scripted, it's easier to test that in bbulk | 18:33 |
fungi | bulk | 18:33 |
clarkb | I've just updated the meeting agenda wiki page. I've tried to add changes that are in flight and similar bits to the various topics. Please feel free to add more topics or info to existing topics | 18:38 |
clarkb | hrm the system-config-run-zuul job has timed out twice in a row | 19:39 |
fungi | odd | 19:39 |
clarkb | They both ran on ovh gra1 | 19:40 |
clarkb | and it seems like we may be seeing the high cost per ansible task here at about 3 seconds | 19:41 |
fungi | okay, maybe not so odd then if performance was similar for both | 19:41 |
clarkb | https://zuul.opendev.org/t/openstack/build/5dc476ff4e614e2fadefc7a9ff2255de/log/job-output.txt#1841-6175 | 19:43 |
clarkb | thats a good chunk of time just adding known hosts | 19:44 |
clarkb | about 13 minutes | 19:44 |
clarkb | another minute here https://zuul.opendev.org/t/openstack/build/5dc476ff4e614e2fadefc7a9ff2255de/log/job-output.txt#6295-6373 | 19:45 |
clarkb | oh my first example is for two ansible loops not one. Each individually takes about 6.5 minutes | 19:48
clarkb | wow ok so we do this twice; the second pass makes the first one redundant. According to the change log we do it redundantly because some test checks the first one | 19:50
clarkb | This is an easy improvement by removing the first and fixing the test if I can track it down | 19:51 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Remove redundant ssh known hosts prep https://review.opendev.org/c/zuul/zuul-jobs/+/857228 | 19:58 |
clarkb | That would save 6.5 ish minutes | 19:58 |
clarkb | another thing I'm noticing from those logs is that it almost looks like things run sequentially when they can run in parallel | 20:03 |
clarkb | https://zuul.opendev.org/t/openstack/build/5dc476ff4e614e2fadefc7a9ff2255de/log/job-output.txt#7038-7044 is output by https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-base.yaml#L46-L52 | 20:04 |
clarkb | it doesn't look like we set serial: 1 on that playbook | 20:06 |
clarkb | oh! could it be that we run with -f 5? so the first 5 happen quickly and then we have a straggler? | 20:06 |
clarkb | yes I think that may be what is happening | 20:07 |
clarkb | jobs with more nodes than ~5 will incur wall clock penalties for things that would otherwise happen in parallel | 20:08 |
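For context, the knob in question is Ansible's forks setting (5 is the upstream default); shown here as it would appear in a plain ansible.cfg, though as noted below it is not something the nested job config can simply override on the Zuul side:

```ini
[defaults]
# maximum number of hosts Ansible acts on in parallel per task;
# with more nodes than this, the extras wait for a free slot
forks = 5
```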
clarkb | https://zuul.opendev.org/t/openstack/build/5dc476ff4e614e2fadefc7a9ff2255de/log/job-output.txt#7122-7162 writing out our zuul job vars files is also not quick. I think the templating is particularly slow? I may work on a change that moves non template files to copies instead of templates to see if that is any better and keep templating only when needed | 20:15 |
corvus | clarkb: is the zuul rolling restart stuck again? | 20:16 |
corvus | i see dev10 and dev18 components | 20:16 |
clarkb | corvus: it crashed, there is discussion about it in scrollback but the tldr is in https://review.opendev.org/c/opendev/system-config/+/857209 | 20:17 |
corvus | clarkb: is fix to merge that and re-run? | 20:17 |
clarkb | and now I'm sorting out why the job in that change is consistently timing out which led to https://review.opendev.org/c/zuul/zuul-jobs/+/857228 | 20:17 |
clarkb | corvus: yes | 20:17 |
corvus | clarkb: i thought we ran our service playbooks with more than 5 forks, so if we're doing 5 in the test job, seems like we should increase that | 20:23 |
clarkb | corvus: is that controllable from zuul? | 20:25 |
corvus | i thought we were talking about the nested ansible | 20:26 |
clarkb | I think it is affecting both | 20:26 |
clarkb | https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-base.yaml#L46-L52 that is run by zuul aiui. | 20:27 |
clarkb | Ah and later in that playbook it sets -f 50 on a few playbooks. The others should only run against a single host | 20:27 |
*** dviroel is now known as dviroel|afk | 20:28 | |
corvus | so it should only be the in-zuul part, like the vars templating you were looking at. i don't think that's controllable in zuul. | 20:30 |
corvus | s/should only be/should only be limited by/ | 20:31 |
clarkb | I think halving the runtime cost of setting up ssh known hosts via 857228 will make a quick big impact. And ya working on the shift of templating to copies if templating isn't needed to see if that helps too | 20:32 |
corvus | clarkb: i'd be really surprised if the templating takes longer than copying | 20:34 |
corvus | i think we're just looking at the iterating cost | 20:34 |
clarkb | corvus: I thought that initially too, but the normal iteration cost seems to be about 2.5-3 seconds but the templates all take about 6 seconds | 20:34 |
clarkb | however, that could just be chance. I figure putting the change together and checking isn't too bad | 20:35 |
clarkb | if it is about 3 seconds quicker we'll save noticeable time with the number of files being written | 20:36 |
corvus | clarkb: i can see no discernable difference in a local test | 20:37 |
clarkb | corvus: https://zuul.opendev.org/t/openstack/build/5dc476ff4e614e2fadefc7a9ff2255de/log/job-output.txt#7122-7162 shows the 6 second ish cost per template and if you look at the tasks above that those that don't template are closer to 3 | 20:39 |
clarkb | but that is only one data point | 20:39 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add a mailman3 list server https://review.opendev.org/c/opendev/system-config/+/851248 | 20:40 |
corvus | sure but there are other differences at play here? restarting the host loop... different tasks.... | 20:40 |
corvus | i guess i'm saying two things: 1) it seems unlikely and probably not the lowest hanging fruit; 2) having those be templated is *really* valuable and if we're going to change that, i think we should be really sure it's making a difference. i don't want to start propagating a "copy is faster than template" meme without really compelling evidence. | 20:41 |
corvus | i mean, i just proved locally that template is faster than copy... so.. :) | 20:42 |
clarkb | yes, and the way to collect that evidence is to write a change and have zuul run it in the CI system? | 20:42 |
clarkb | I'm not saying we should merge any such chagne right now. Just that I think it is worth checking | 20:43 |
corvus | clarkb: testing it that way is going to be super tricky -- you're introducing the load of the executors and the remote cloud chosen to run the job and the noisy neighbors in that cloud all as variables | 20:43 |
corvus | i think a reliable test of that needs to be two runs in the exact same conditions | 20:44 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Check if file copies are quicker than templating https://review.opendev.org/c/opendev/system-config/+/857232 | 20:44 |
corvus | so right now, i'm giving my local test of a playbook far more weight than i would give a change that just s/template/copy/ in our job | 20:44 |
clarkb | corvus: your local run excludes any networking and probably has the advantage of nvme storage and so on though. I can imagine a situation where your local run isn't representative of what happens in the CI system. | 20:45 |
corvus | i ran it over the network | 20:45 |
corvus | and of course it's not representative | 20:46 |
clarkb | that said I agree it is a long shot, but I mean ansible tasks should never take 3 seconds anyway. So I won't be surprised at all if the templating system has some weirdness that makes it take longer too | 20:46 |
clarkb | the change to remove unneeded tasks from the multi node known hosts role is definitely the one we should focus on right now I think | 20:46 |
corvus | the point is to determine if template is faster than copy, and i have a high degree of confidence it's not | 20:46 |
corvus | er, strike that, reverse it. | 20:46 |
corvus | clarkb: yeah, that one is already approved | 20:47 |
clarkb | oh cool | 20:47 |
corvus | i was really just trying to save you the trouble of writing 857232 | 20:47 |
corvus | which is why i went to the trouble of doing the local tests | 20:47 |
corvus | but :( | 20:47 |
clarkb | I don't see it as trouble right now. Ansible is slow and exhibiting some interesting behaviors and I think examining them is useful and interesting | 20:48 |
corvus | sure, but that test isn't going to tell us anything | 20:48 |
corvus | i agree improvements and understanding would be great | 20:49 |
corvus | https://etherpad.opendev.org/p/r0rl8o1XZ2SI1p1bugF0 | 20:49 |
corvus | clarkb: ^ that's the playbook i ran with the resulting times | 20:50 |
corvus | quite likely they're within the margin of error | 20:50 |
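The etherpad contents aren't reproduced here, but a comparison of the kind corvus ran would look something like this (file names and iteration count are made up):

```yaml
- hosts: testhost
  tasks:
    - name: Write files via template
      template:
        src: sample.j2        # placeholder local file
        dest: "/tmp/template-{{ item }}"
      loop: "{{ range(0, 20) | list }}"

    - name: Write the same content via copy
      copy:
        src: sample.txt       # placeholder local file
        dest: "/tmp/copy-{{ item }}"
      loop: "{{ range(0, 20) | list }}"
```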
clarkb | in particular I think if we can understand why an inifile task at https://zuul.opendev.org/t/openstack/build/5dc476ff4e614e2fadefc7a9ff2255de/log/job-output.txt#7113 takes half the time as a template task at https://zuul.opendev.org/t/openstack/build/5dc476ff4e614e2fadefc7a9ff2255de/log/job-output.txt#7122 we may be able to improve our use of ansible in an impactful manner for many | 20:50 |
clarkb | jobs | 20:50 |
clarkb | Both are writing to the same filesystem on the same host (the fake bridge). It is possible we got lucky and noisy neighbors just happened to show up in between those two tasks. The inifile module may have a particularly efficient implementation. etc | 20:52 |
opendevreview | Merged zuul/zuul-jobs master: Remove redundant ssh known hosts prep https://review.opendev.org/c/zuul/zuul-jobs/+/857228 | 20:55 |
clarkb | The ansible async option is something we might want to look at on these loops too. But I seem to recall people having difficulty using that option | 20:55
clarkb | corvus: the run against 857232 does seem to show inifile is still half the runtime of template and copy is about the same as template. The consistent speedup between inifile and the rest make think we aren't getting lucky with noisy neighbors. Maybe we need to put everything in an inifile | 21:00 |
corvus | clarkb: haha, i'm having a "cant tell if serious" moment. :) | 21:01 |
clarkb | not really serious, but it does show that ansible can be faster at similar tasks than it is using the typical approach. I'm wondering if inifile avoids copying any files over the wire and does all of its changes on the remote node which might explain it | 21:02 |
corvus | yeah, it did occur to me it might have some radically different approach like that; i haven't opened up the modules to see | 21:03 |
corvus | (my first thought was "haha ini files for everything!" then "wait could that work?" then "well, yes, but it would be silly, right?" :) | 21:03 |
corvus | i do wish there were a way to do this without loops | 21:04
clarkb | ++ I was thinking that with the known hosts thing too. Like giving the known hosts module a list of keys to set | 21:06 |
clarkb | A lot of the ansible modules seem to be written to accept a singleton input with the assumption you'll use loops to do multiples. Problem is loops are super expensive | 21:07 |
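One loop-free shape for the known-hosts case, as a sketch: render the whole file in a single write instead of one known_hosts task per host. It assumes gathered facts and ed25519 host keys; a real role would need to handle the other key types too.

```yaml
- name: Write known_hosts for all play hosts in one task
  copy:
    dest: "{{ ansible_user_dir }}/.ssh/known_hosts"
    mode: "0644"
    content: |
      {% for h in ansible_play_hosts %}
      {{ hostvars[h].ansible_host }} ssh-ed25519 {{ hostvars[h].ansible_ssh_host_key_ed25519_public }}
      {% endfor %}
```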
corvus | i wonder if there's something about our callbacks or otherwise related to our env that slows it down. since my local test is like 0.5s for a task | 21:09
clarkb | https://github.com/ansible-collections/community.general/blob/main/plugins/modules/files/ini_file.py I do think it is operating on the remote exclusively without doing file copies | 21:09 |
corvus | these are really small files though, so while it is a difference, i wouldn't expect transferring that data to take an extra 3 seconds... how does it transfer it anyway? in the json blob, or does it open an ssh channel? | 21:10 |
clarkb | parsing the copy module is a lot more involved :) trying to sort out where it gets the data | 21:14 |
clarkb | it looks like the copy module receives source paths and not content unless you use the content parameter. But I'm not immediately seeing how it converts a src path on the ansible control side to the remote side file copy | 21:18
clarkb | ah I think the action module portion of copy does the transfer and the regular ansible module code can ignore it | 21:23 |
clarkb | yup the action module creates a new tmp_src value that it feeds into the ansible module when it calls it | 21:25 |
clarkb | corvus: there is what appears to be an undocumented "raw" argument to the copy module that seems to write the file straight to its destination rather than doing a copy to the remote node in a tmp location and then a copy from that tmp location to the final location | 21:29 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Check if file copies are quicker than templating https://review.opendev.org/c/opendev/system-config/+/857232 | 21:32 |
clarkb | I really hope using raw ^ isn't faster | 21:32 |
clarkb | corvus: looking at it from the callback angle the command stuff doesn't apply and the console stream runs in another process. That would mean zuul_json would have to be at fault? | 21:40 |
clarkb | (assuming it is something to do with callbacks | 21:40 |
clarkb | the task start callback handler is pretty straightforward. It accesses a few task attributes and gets a time stamp | 21:43 |
clarkb | I have a hard time seeing how that would create an impact unless processing the callback at all was problematic | 21:43 |
corvus | yeah, nothing is immediately occurring to me either. also, these are small files. | 21:44
clarkb | my attempt at using raw failed with permissions errors. I suspect not raw figures those out automatically | 21:45 |
clarkb | but I think grepping the error message shows that it is using ssh to transfer the file | 21:46 |
corvus | so maybe we're looking at the overhead for establishing a new ssh channel? | 21:50 |
clarkb | ya, it seems to be hooking up ssh to dd (and the dd is what ultimately failed) | 21:51 |
ianw | frickler: that error sounds exactly like -> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1973839 | 21:52 |
ianw | https://review.opendev.org/c/openstack/project-config/+/842654 was the workaround for that | 21:54 |
ianw | i am guessing that booting cirros with nested virt hits the same problem in the cirros kernel? | 21:55 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Check if file copies are quicker than templating https://review.opendev.org/c/opendev/system-config/+/857232 | 21:57 |
clarkb | corvus: ^ that does appear to save about a minute of wall time compared to the original in the currently running test for it | 22:07 |
clarkb | corvus: the impact on the slower ovh gra1 nodes is bigger too. Basically switching to synchronize means each synchronize task is about the same as a single template or copy | 22:08
clarkb | in the worst case I think we're saving about 3x the case above. So about 3 minutes total saved? I'm not sure that is worthwhile but if we can chip away a minute here and a minute there by writing more efficient ansible the aggregate should be pretty decent | 22:14 |
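The general shape of 857232, with illustrative paths rather than the actual layout: one synchronize task moves the whole vars tree, and template is kept only for the files that genuinely need substitution.

```yaml
- name: Copy all test host/group vars in one pass
  synchronize:
    src: /path/on/executor/test-vars/   # placeholder source tree
    dest: /etc/ansible/hosts/           # placeholder destination
# files that actually contain substitutions still go through template,
# but that list is short compared to looping template over everything
```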
clarkb | ianw: ^ related I think we should look into opportunities to change how the log files are encrypted | 22:15 |
corvus | clarkb: tbh, the extra minute is worth it to me in order to make all of that more comprehensible. i hate that it's slow, but i don't like the idea that we would be adding one more level to the flow chart of "how did this variable get into the final job". i like that your change is conservative in that it syncs all the files over meaning that it should be impossible to end up with a file missing. but it does mean that if we want to template, | 22:15 |
corvus | we have to remember to add it to the template list (otherwise, we'll end up with {string} values). all said, i think i can deal with it, so if i'm in the minority, i won't object. | 22:15
clarkb | ianw: that's another place where we loop and maybe we don't need to. Perhaps we can have a shell task that loops over them all and does the encryption instead? | 22:15
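A sketch of what that could look like, with the glob and recipient as placeholders rather than the current log-encryption role: the shell loop runs inside one task, so Ansible's per-iteration overhead is paid once.

```yaml
- name: Encrypt all collected logs in a single task
  shell: |
    set -e
    for f in /var/log/collected/*.log; do
      gpg --batch --yes --recipient "$RECIPIENT" --encrypt "$f"
    done
  environment:
    RECIPIENT: infra-root@example.org   # placeholder key id
```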
corvus | clarkb: (fwiw, i think that is the best possible change given the position ansible has put us in, so i mean, on that basis: nicely done :) | 22:17 |
clarkb | ya I'll work on cleaning the change up once testing shows its generally happy. Then we can decide if we want to merge it or not | 22:18 |
corvus | clarkb: (i'm mostly just coming from "this is the most complicated and round-about part of the whole nested ansible process and i barely understand the original and i wrote it") | 22:18
corvus | i'm always forgetting to add the fake template var files, so i know i'm going to forget to add them to the template list | 22:19 |
clarkb | ya I forget often myself. Related the mm3 change adds host_vars just for the list server vars because I couldn't figure out how to do bits that needed to be raw as raw when we needed nested raw tags https://review.opendev.org/c/opendev/system-config/+/851248/75/inventory/service/host_vars/lists01.opendev.org.yaml | 22:21
corvus | but hey, maybe they will fail in a way that suggests the obvious fix is to add them to the template list | 22:21 |
clarkb | in particular it seems the parser ansible uses doesn't do proper push-down and it's doing naive matching, so the first end matches the outer rather than the inner raw handling | 22:21
clarkb | I gave up and just decided to copy the file as is | 22:21 |
clarkb | lines 50-54 are the problematic ones there | 22:22
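The behaviour clarkb describes, in miniature (not the actual lists01 host_vars content): Jinja closes a raw block at the first endraw it sees, so raw blocks cannot nest; the usual escape is to emit the markers as string expressions instead.

```jinja
{# broken: the inner endraw closes the outer raw, and the trailing
   endraw then fails to parse #}
{% raw %}{% raw %}literal{% endraw %}{% endraw %}

{# workaround: emit the markers as literals #}
{{ '{% raw %}' }}literal{{ '{% endraw %}' }}
```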
opendevreview | Merged opendev/system-config master: Fixup zuul merger and executor graceful shutdowns https://review.opendev.org/c/opendev/system-config/+/857209 | 22:24 |
clarkb | I'll work on running ^ in screen on bridge next | 22:24 |
clarkb | and it is running now | 22:30 |
clarkb | I guess I should've checked if all the changes on the zuul side we wanted in are in first. But it's late enough in the day that it should be done sometime tomorrow and we can run it again :) | 22:30
corvus | clarkb: like what changes? | 22:32 |
corvus | you mean the speedup, or any changes to get in for the restart? | 22:33 |
corvus | (if the former, i think your zuul-jobs change landed; if the latter -- i think we had zuul where we wanted it for the restart for 6.4.0 on friday, so getting it unstuck as close to that as possible would be great) | 22:34 |
opendevreview | Clark Boylan proposed opendev/system-config master: Use synchronize to copy test host/group vars https://review.opendev.org/c/opendev/system-config/+/857232 | 22:35 |
clarkb | corvus: the latter. Ok I wasn't sure if the nodeset changes had landed yet | 22:36 |
clarkb | but those are also less important for 6.4.0 I think | 22:36 |
clarkb | corvus: 857232 is cleaned up and in a reviewable state if you want to add your thoughts to it. | 22:36
corvus | yeah, my preference would be to defer those (and anything else) until we complete the weekend restart so that we can release current master as 6.4.0. | 22:37 |
clarkb | works for me. That restart is in progress now | 22:37 |
clarkb | it is logging to the normal file on bridge if you want to follow along in more detail than the components page | 22:37
* clarkb adds ansible optimization thoughts to the meeting agenda | 22:38 | |
corvus | clarkb: done, thanks! | 22:41 |
*** dasm is now known as dasm|off | 22:59 | |
ianw | clarkb: yeah, it's quite loop-y. i guess it's nice to have the tasks split up in console/ara views, but equally a shell script gets the same thing done | 23:08 |
clarkb | ianw: ya the problem is 6 seconds for each task adds up quickly when you do more than 10 things in a loop | 23:13
clarkb | I agree ansible should be better here, but dealing with what we've got seems pragmatic if we continue to hit job timeouts | 23:13
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Try ansible pipelining in our system-config-run jobs https://review.opendev.org/c/opendev/system-config/+/857239 | 23:21 |
clarkb | after pushing that I've realized we already set that in the ansible config so that's a noop | 23:23
ianw | it looks like the translate mysql dump backup is failing due to https://bugs.mysql.com/bug.php?id=100219 which is a breaking security change. one option is to add --no-tablespaces to the dump, or i guess adjust the privileges of the dump user | 23:30 |
ianw | the db's don't have a root user by default, you have to add one and that then warns the db will become unsupported. so i can see a way to GRANT PROCESS to the zanata user. i'll just work around it in puppet | 23:35
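Both options mentioned, for reference; the user and database names are the generic zanata ones and may not match what the puppet change ends up doing:

```shell
# option 1: skip the tablespace metadata that now requires PROCESS
mysqldump --no-tablespaces -u zanata -p zanata > zanata-backup.sql

# option 2: grant the (global) PROCESS privilege to the backup user
mysql -e "GRANT PROCESS ON *.* TO 'zanata'@'localhost';"
```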
opendevreview | Ian Wienand proposed opendev/system-config master: translate: fix dump with MySQL 5.7 https://review.opendev.org/c/opendev/system-config/+/857241 | 23:39 |
clarkb | the zuul reboot playbook has moved on to ze02. That's a good indication I haven't broken anything horribly with that change | 23:41