*** ministry is now known as __ministry | 03:31 | |
*** yadnesh|away is now known as yadnesh | 04:03 | |
*** akekane is now known as abhishekk | 05:16 | |
*** jpena|off is now known as jpena | 08:29 | |
*** yadnesh is now known as yadnesh|afk | 08:47 | |
opendevreview | Merged openstack/project-config master: Add Allow-Post-Review flag to OpenStackSDK project https://review.opendev.org/c/openstack/project-config/+/859976 | 09:04 |
---|---|---|
opendevreview | Merged openstack/project-config master: Add post-review pipeline https://review.opendev.org/c/openstack/project-config/+/859977 | 09:07 |
*** yadnesh|afk is now known as yadnesh | 09:22 | |
Tengu | hello there! We're seeing this task taking a "nice" amount of time in tripleo jobs: https://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/roles/configure-swap/tasks/ephemeral.yaml#L108 | 09:43 |
Tengu | I'm wondering about the reasons of its presence: what's already in /opt at this point, and isn't there any way to get the volume mounted there before writing anything? | 09:44 |
opendevreview | Cedric Jeanneret proposed openstack/openstack-zuul-jobs master: Add some output to the `find' command https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/865383 | 10:03 |
Tengu | -^ this should help understand the actual thing. | 10:04 |
*** rlandy|out is now known as rlandy|rover | 11:10 | |
*** dviroel|afk is now known as dviroel | 11:15 | |
*** soniya29 is now known as soniya29|afk | 11:23 | |
fungi | Tengu: it's because some of our donor clouds give us too small of a rootfs (20gb), but provide an unformatted "ephemeral disk" we can format and mount. in those providers we use part of the ephemeral disk for swap (so as not to take up more space on the rootfs with a swapfile) and mount the rest at /opt but need to shuffle the git cache out of the way and then put it back since it's in | 12:34 |
fungi | /opt on our images | 12:34 |
Tengu | fungi: ah, so the /opt is indeed pre-provisioned... | 12:34 |
fungi | i agree we ought to look at possible ways to speed those tasks up | 12:35 |
Tengu | fungi: are the git repositories there really that useful? I mean, does it represent a real gain of time compared to having to move them? | 12:35 |
fungi | it can take several minutes to clone nova over the network, for example, and puts a significant strain on our git servers if every build does it (we've seen it result in an instant self-imposed denial of service against opendev.org in the past when we've accidentally uploaded images without a cache there) | 12:38 |
Tengu | hmm ok. but it also takes a long time to then move the content :/ | 12:38 |
Tengu | fungi: /opt is replaced in order to free space for /home/zuul, I guess? | 12:39 |
fungi | yes, and also jobs like devstack/tempest do most of their work in /opt in order to be assured of enough available space | 12:40 |
fungi | the git refs in /home/zuul are populated from the cache in /opt too | 12:40 |
fungi | so that the executor only needs to push refs which aren't already in the node's cache | 12:41 |
Tengu | hmm ok. wouldn't it be possible to connect as root first, check for ephemeral, switch the /home directory (and, why not, create symlinks in /opt for tempest/others), ensure zuul's home is present, and then only run as zuul ? | 12:41 |
fungi | well, there's still a need to mount the extra space on /opt unless the job is really going to be able to operate within a 20gb rootfs on some providers and with no swap | 12:42 |
Tengu | hmm. can't we "just" remove the /opt/git (and, if anything is left in /opt, then only move the content to the new partition)? | 12:44 |
fungi | it's possible we can shuffle some of this around, though we've set expectations for projects that 1. if you need extra space do your work in /opt and 2. there's a cache of all git repos in /opt | 12:44 |
Tengu | I think "rm" would be faster than "mv" in such case. | 12:44 |
Tengu | i.e. once the cache is used, of course. | 12:44 |
fungi | i'm not sure there's necessarily a "once the cache is used" unless that's "once the job is over" | 12:45 |
fungi | at least devstack and grenade used to rely on the cache in /opt, but it's possible they no longer do | 12:46 |
Tengu | oh, so there are other things than https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/prepare-workspace-git/tasks/main.yaml hitting on that cache? | 12:46 |
fungi | there was at one time, i don't know if there still is so that's what we'd need to find out | 12:46 |
Tengu | hmm ok. *maybe* we can do a middle-step, by passing a parameter that switches the "mv" to "rm/cleaning"? That way, we may be able to test on some jobs? | 12:47 |
fungi | one example that springs to mind is the tag-releases job, which is triggered by changes merging to the openstack/releases repo but could create and push tags for any project in openstack so can't really add every single project as a required-projects entry for the job | 12:48 |
Tengu | fair enough. So maybe a parameter would help. | 12:48 |
Tengu | I can propose something | 12:48 |
Tengu | for instance, within tripleo, we have a tripleo-ci repository, and that one is calling the configure-swap role. We could pass some params at this point, allowing to just empty /opt/git instead of moving everything? | 12:49 |
fungi | obviously we'll want input from some other folks as well. i won't pretend to have the best grasp of what might break (which may not become apparent within the job if it's side effects like overloading our git servers at peak activity), and also i just woke up so still a little groggy | 12:50 |
Tengu | :)) | 12:50 |
* Tengu pours some fresh coffee | 12:50 | |
fungi | i have a bad habit of rolling over in bed and instantly opening my computer | 12:51 |
Tengu | I have a dog in order to avoid that: first take the dog out, then coffee, and then only laptop :) | 12:51 |
fungi | the cats got fed right away, since i didn't want them realizing i'm also made of meat | 12:54 |
Tengu | :D | 12:54 |
Tengu | though they don't have any thumb - meaning they may get some issues if the Might Thumbs Owner isn't there anymore | 12:55 |
Tengu | *Mighty | 12:55 |
Tengu | (yeah - also have a cat, I know how it is with them ;)) | 12:55 |
Tengu | anyway - have to jump on a meeting, but that's an interesting discussion/topic: optimizing things | 12:56 |
Tengu | fungi: and, once you get some coffee and all, I'll also have a question about unbound, networkmanager and, well, dnsmasq :). | 12:56 |
fungi | ask your dns/network question at your convenience | 13:00 |
Tengu | fungi: (still in a call, but...) so, I see unbound gets configured, but actually not really used - that is, NetworkManager doesn't know about it, and usually this means we'll get the network nameservers (provided by dhclient) instead of the 127.0.0.1 in the /etc/resolv.conf. that's a first thing. | 13:14 |
Tengu | now, after some searches, it seems NetworkManager support for unbound is ending. Would it make sense to install dnsmasq instead, and properly configure it so that NetworkManager actually uses it for dns caching? | 13:14 |
Tengu | and, if not, why isn't NetworkManager properly configured to not override the /etc/resolv.conf? | 13:15 |
fungi | Tengu: i'm not sure what direct support nm needs for unbound, it just needs to know to query 127.0.0.1 as its dns server. i guess the question is how do we correctly configure nm to not overwrite the resolver hint when you're restarting it. ianw did a lot of work on nm integration/configuration for recent fedora and centos so probably has a better handle on what's going on there | 13:16 |
Tengu | well, I actually know what to do in nm in order to make it use dnsmasq/unbound. | 13:17 |
fungi | unbound is really used at the beginning of your jobs, but at some point over the course of the tripleo jobs something triggers nm to overwrite the configuration and after that point you're no longer resolving through unbound's cache | 13:17 |
Tengu | basically, you can configure main.dns value in its config file and it will do the magic - for instance, if we set the value to "dnsmasq", it will start dnsmasq with the nameservers provided by the DNS, plus additional config we may push in /etc/NetworkManager/dnsmasq.d/ | 13:18 |
Tengu | unbound also used to have the same support, but it was removed lately. | 13:18 |
Tengu | (sorry, have to focus on the call - back in a few) | 13:18 |
fungi | and yeah, we really don't want the nameservers provided by dhcp to ever be used from test nodes, even as forwarded resolvers from the local caching resolver | 13:19 |
Tengu | even as forwarder? why so? | 13:19 |
fungi | so nm "integration" to provide that would be a problem | 13:19 |
fungi | it's a long story, but basically the resolvers our cloud donors provide are often broken | 13:19 |
Tengu | erf... | 13:20 |
Tengu | that's why it's using google/cloudflare by default then. OK. | 13:20 |
Tengu | rlandy|rover: -^^ guess we'll just go with Sandeep work on ensuring it's using unbound as resolver, without the networkmanager integration, and using the public forwarders we currently have. | 13:20 |
Tengu | fungi: may I then propose a patch against the configure-unbound in order to configure NetworkManager to NOT override the /etc/resolv.conf? | 13:21 |
Tengu | so that unbound will be used all the way.. | 13:22 |
fungi | a good example is rackspace. they've had problems with abusers/compromised machines flooding their resolvers in attempts to use them in ddos attacks, so they have some sort of security device which adds rules blocking client ip addresses which appear to be a problem. but that doesn't take into account that the blocked addresses might get recycled to another tenant's vm and so we | 13:22 |
fungi | constantly ended up with test nodes which couldn't resolve anything because their addresses had been previously blocked from reaching the resolvers | 13:22 |
Tengu | ah, yeah, security at its finest :) | 13:22 |
rlandy|rover | Tengu: ack- that is the current plan | 13:23 |
fungi | anyway, yes a patch to configure unbound not to accidentally undo our configuration would be useful | 13:23 |
Tengu | fungi: on it - I'll push it today | 13:23 |
Tengu | I can make it optional. | 13:23 |
Tengu | (default true, but if we don't want to touch networkmanager config, we can set it to false) | 13:24 |
fungi | my best guess is that https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/nodepool-base/finalise.d/89-boot-settings is where we'd want to do it | 13:26 |
Tengu | fungi: not here? https://opendev.org/opendev/base-jobs/src/branch/master/roles/configure-unbound/tasks | 13:28 |
Tengu | (using ansible would be easier, since it's an ini file, and there's a module for that) | 13:30 |
fungi | i guess there's the question of how early would be best to apply the additional configuration | 13:31 |
Tengu | apparently the configure-unbound is called anyway during the job | 13:32 |
Tengu | but yeah, seeing the /etc/resolv.conf is overridden at boot, it's probably better to edit that 89-boot-settings. | 13:33 |
fungi | diskimage-builder sets up some of this when creating the node images, then glean does boot-time configuration based on the detected network info from configdrive/dhcp, then ansible configures things at job runtime | 13:33 |
Tengu | yeah, DIB modification is probably safer. | 13:33 |
Tengu | fine. I'll propose a patch shortly. | 13:33 |
Tengu | (after my current call) | 13:34 |
fungi | there's no rush, things are slow this week and a lot of us aren't around the computer much (and the rest of us are juggling a lot of stuff as always) | 13:34 |
Tengu | :) | 13:35 |
Tengu | fungi: I'll also propose something against the ansible role in order to be consistent. And this would allow tripleo to depends-on the change for some more testing. | 13:36 |
fungi | Tengu: depends-on to opendev/base-jobs won't work since it's a trusted repo where speculative execution isn't performed | 13:40 |
fungi | to properly exercise that we merge changes to a copy of the role and modify the base-test job to use it instead of the real one, then you can have a proposed change in an untrusted repo which is parented to base-test explicitly (instead of implicitly to base) | 13:41 |
Tengu | ah, right, we could test with another repo though. There are so many of them involved :/ | 13:42 |
Tengu | openstack-zuul-jobs isn't trusted - I'm mixing things. | 13:42 |
Tengu | (that was for the /opt thingy) | 13:42 |
fungi | correct, ozj is an untrusted repo, so you can depends-on to changes in review for it no problem | 13:43 |
Tengu | yeah - I wanted to get some insight https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/865383 | 13:43 |
Tengu | maybe it's still worth to get in - up to you folks. | 13:43 |
fungi | zuul's security model prohibits speculatively executing proposed changes for config repos though, because that could be leveraged to expose or exfiltrate secrets and perform other sensitive actions | 13:44 |
Tengu | of course | 13:44 |
Tengu | fungi: do we have crudini in DIB env? | 13:50 |
Tengu | humpf. apparently not. | 13:52 |
fungi | crudini the magnificent, prestidigitation extraordinaire | 13:54 |
Tengu | it's a good tool when it comes to ini_file things :) | 13:54 |
Tengu | would it be OK to get it in the nodepool-base? | 13:54 |
Tengu | dependency-wise, it's apparently relying on python3-iniparse | 13:54 |
Tengu | which isn't heavy at this point.. | 13:55 |
fungi | we could install it outside the chroot. the problem with installing things inside the image which have python library dependencies is that they're very hard to uninstall when they may end up conflicting with libs from pip later. putting crudini into a venv might be another solution since that would isolate it from the system python library path and avoid causing problems for jobs | 13:56 |
Tengu | hmmmm so here I don't know how to do either of those two better solutions :/ | 13:57 |
fungi | it's one of the reasons we use glean instead of cloud-init, because the latter drags in python modules | 13:57 |
Tengu | I could of course override the whole file - but while I'm sure nothing will be lost on a standard fc-35 and cs-9, I don't know about other distros such as opensuse | 13:58 |
Tengu | and doing multi-line regexp is a bad idea imho. | 13:59 |
fungi | basically we just want to avoid polluting the resulting server image with unnecessary libs that end up conflicting with what jobs may want to install on those nodes later | 13:59 |
*** dasm|off is now known as dasm | 13:59 | |
Tengu | I understand this - and am all for "keep it clean". | 13:59 |
fungi | take a look at nodepool/elements/infra-package-needs/install.d/40-install-bindep for an example of the venv approach | 13:59 |
Tengu | ah | 14:00 |
fungi | then anything needing to run bindep can just call /usr/bindep-env/bin/bindep directly or symlink to it from somewhere in the path like /usr/local/bin | 14:01 |
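The venv approach fungi describes for bindep could be sketched roughly as follows. This is a minimal Python illustration, not the actual dib element (which is a shell script); the function names `create_tool_venv` and `install_tool` are hypothetical, as is the `/usr/crudini-env` path in the usage note.

```python
import subprocess
import sys
import venv
from pathlib import Path


def create_tool_venv(env_dir, with_pip=True):
    """Create an isolated virtualenv and return the path to its bin directory."""
    env_path = Path(env_dir)
    # The venv keeps its own site-packages, so libraries installed here do not
    # land in the system Python library path that jobs may later pip into.
    venv.create(env_path, with_pip=with_pip)
    return env_path / ("Scripts" if sys.platform == "win32" else "bin")


def install_tool(bin_dir, package):
    """Install a package into the venv using the venv's own interpreter."""
    subprocess.run(
        [str(Path(bin_dir) / "python"), "-m", "pip", "install", package],
        check=True,
    )
```

Mirroring the `/usr/bindep-env` layout, a call such as `install_tool(create_tool_venv("/usr/crudini-env"), "crudini")` would leave the tool callable from the venv's bin directory, assuming network access and pip are available at build time.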
Tengu | sooooo.... am I to add such a file in the nodepool-base/install.d so that I can install crudini, and then call crudini from within the venv in the 89-boot-settings ? | 14:01 |
fungi | that would be one option, yes. the other option would be to install crudini into the nodepool-builder container image and then use whatever the dib mechanism is to run that from outside the chroot instead of from inside | 14:02 |
Tengu | ah, there's already a 91-venv-os-testr in nodepool-base | 14:02 |
Tengu | I'm not fluent enough with dib for that. Since there's already a venv available in the nodepool-base, I can "just" install crudini in it.. | 14:03 |
fungi | but again, all this is on the very edge of my experience so when others are around they may have superior suggestions | 14:03 |
Tengu | sure | 14:03 |
Tengu | I'll propose something, so that it's not "just" theory | 14:04 |
Tengu | guess we'll get more ppl next week, after thanksgiving. | 14:04 |
fungi | likely, or perhaps later today when other parts of the world wake up | 14:04 |
fungi | ianw is probably our foremost expert on this particular topic, but he's on apac time | 14:05 |
Tengu | ah, so too late | 14:05 |
fungi | or too early, depending on how you look at a globe | 14:06 |
Tengu | :) I'll try to catch him tomorrow during my morning (EMEA) | 14:08 |
Tengu | should be fine. | 14:08 |
opendevreview | Cedric Jeanneret proposed openstack/project-config master: Ensure NetworkManager doesn't override /etc/resolv.conf https://review.opendev.org/c/openstack/project-config/+/865433 | 14:12 |
vishalmanchanda | clarkb: fungi : hello, a query regarding bug https://bugs.launchpad.net/horizon/+bug/1996638 | 14:15 |
vishalmanchanda | clarkb: fungi : here is link of logs what we discussed in the past https://etherpad.opendev.org/p/migrate-to-jammy#L33 | 14:15 |
vishalmanchanda | right now after migration to ubuntu-jammy, more horizon jobs like horizon-selenium-headless, horizon-integration-tests started failing. | 14:17 |
fungi | vishalmanchanda: this is the issue with using snap-installed browsers in a headless xvfb right? | 14:18 |
vishalmanchanda | In the latest P.S. you can find error logs for both of these job https://review.opendev.org/c/openstack/horizon/+/861140 | 14:18 |
vishalmanchanda | fungi: yes. | 14:18 |
vishalmanchanda | I am getting the same error in my local env. deployed on ubuntu-focal. | 14:18 |
fungi | were you able to see if coreycb or tinwood had ideas? this may get deep into ubuntu dbus/login design choices | 14:19 |
vishalmanchanda | fungi: not yet. | 14:19 |
vishalmanchanda | fungi: but I found one fix which works in my local env. | 14:19 |
vishalmanchanda | fungi: now I want to know how we can apply the same in our openstack CI job. | 14:20 |
fungi | an alternative might be to run the jobs on debian, where chromium and firefox are available as normal deb packages | 14:20 |
fungi | what's the workaround? | 14:20 |
vishalmanchanda | fungi: For e.g. if I install firefox following all the steps mentioned here https://www.omgubuntu.co.uk/2022/04/how-to-install-firefox-deb-apt-ubuntu-22-04#:~:text=Installing%20Firefox%20via%20Apt%20(Not%20Snap)&text=You%20fist%20add%20the%20Mozilla,reinstalled%20at%20a%20later%20date. , then the 2 jobs work fine | 14:21 |
fungi | ahh, okay, so similar to running the test on debian instead of ubuntu | 14:24 |
vishalmanchanda | Install Firefox as a .Deb on Ubuntu 22.04 (Not a Snap) | 14:25 |
fungi | but in this case getting a deb of the browser from outside ubuntu's archive rather than switching to a distro which provides something along those lines | 14:25 |
fungi | yeah, that's fairly straightforward. i should be able to find you an example, i think your jobs already do something similar to install newer nodejs packages | 14:25 |
vishalmanchanda | ok | 14:27 |
fungi | vishalmanchanda: i think this is the basic template for it... https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/ensure-nodejs/tasks/main.yaml | 14:28 |
fungi | oh, though maybe easier since this is in a ppa on lp and ubuntu already provides some slightly nicer integration for those | 14:30 |
vishalmanchanda | fungi: ok, thanks for the reference, I will try to do the same thing to install Firefox as a .deb package | 14:30 |
fungi | vishalmanchanda: here's an example of consuming packages from a ppa in a role: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/ensure-skopeo/tasks/Ubuntu.yaml | 14:31 |
vishalmanchanda | fungi: So I should update this task https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/nodejs-test-dependencies/tasks/main.yaml#L6 ? | 14:33 |
vishalmanchanda | because this task is responsible for installing firefox and use snap by default to install firefox | 14:34 |
fungi | vishalmanchanda: changing it in the zuul-jobs repo is probably a bigger question, since you're wanting to apply an ubuntu-specific workaround and that role is generic for a variety of different distros, so we'd need an ubuntu-specific variant of it similar to how the ensure-skopeo role does it (you'll see there's an Ubuntu-22.04.yaml in it which does just a normal package install of | 14:39 |
fungi | skopeo vs the Ubuntu.yaml which installs skopeo from a ppa) | 14:39 |
fungi | you'll see https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/ensure-skopeo/tasks/main.yaml does include_tasks: "{{ zj_distro_os }}" in order to allow us to switch behavior based on the platform where it's run | 14:40 |
vishalmanchanda | fungi: Could you help me in writing tasks and roles to fix this issue if you have time, because I have very basic knowledge on the ansible side. | 14:49 |
fungi | vishalmanchanda: this is a very bad week for me to commit to more work, i'm already drowning and at the moment i seem to be the only one around to answer questions | 14:50 |
vishalmanchanda | fungi: ok np. | 14:50 |
vishalmanchanda | fungi: let me just hit and try then. | 14:50 |
vishalmanchanda | fungi: anyway thanks for providing all these references, these are very helpful:) | 14:51 |
fungi | vishalmanchanda: happy to help, and i might be able to help more i just don't want to commit to anything at least until i get caught up on the pile of pending things in my inbox | 14:56 |
*** dviroel is now known as dviroel|lunch | 15:07 | |
*** soniya29|afk is now known as soniya29 | 15:48 | |
*** akekane is now known as abhishekk | 16:09 | |
*** dviroel_ is now known as dviroel | 16:15 | |
*** yadnesh is now known as yadnesh|away | 16:41 | |
clarkb | Tengu: fungi: while the /opt setup in that cloud does take some time I really feel like optimizing that is poor application of optimization. We've got jobs that spend more than half an hour curating logs for example. Some of that is ansible per task slowness, some of it is poor log layout for swift object storage, and some of it is just slow runtime. Tempest for example is | 16:59 |
clarkb | installed three times in tempest jobs one of which deletes a previous install and reinstalls. Basically all that to say we have a lot of low hanging performance fruit and almost all of it is safer than changing base node expectations. | 16:59 |
Tengu | clarkb: I'd be happy to help chasing those fruits and making some nice salad with them :) | 17:00 |
clarkb | Tengu: fungi: and yes to be extra extra clear unbound is used by our nodes when we hand them over to you. removing unbound would break things | 17:00 |
fungi | however, configuring nm to know not to blow away the resolver settings we've supplied would probably be good | 17:01 |
clarkb | fungi: yup I agree however I'm confused as to why this is an issue if things boot just fine | 17:02 |
clarkb | Tengu: fungi: ^ I looked into this the other day quite a bit and everything I could see is that NM and unbound were working as expected and the tripleo jobs do something to break it. Can we just stop doing the something instead? | 17:02 |
clarkb | or maybe the something should be smarter about what it does. But as far as I can tell there wasn't anything wrong with NM and unbound as configured at boot | 17:03 |
Tengu | clarkb: I'd rather be smart with NetworkManager and make it understand "don't touch that file" | 17:03 |
Tengu | since it's something we can do.... | 17:03 |
clarkb | Tengu: right but why is it touching the file. I think that is my confusion | 17:03 |
fungi | there may be a startup race, since we're setting the resolver in rc.local which probably runs after nm has settled | 17:03 |
Tengu | basically, nm will touch it whenever there's a reload (nmcli reload, systemctl restart networkmanager), lease refresh and the like | 17:04 |
clarkb | huh my local NM doesn't do that (in fact it's a problem on my laptop because I can't get it to reset things after tethering with my phone) | 17:04 |
fungi | it might also be specific to centos stream 9 | 17:05 |
clarkb | specifically if I usb tether and then stop NM refuses to update networking settings when switching back to wifi or similar. It's really annoying | 17:05 |
Tengu | clarkb: check the NetworkManager config to see what's going on in there. Also, the default behavior of that service changes depending on the release, availability of some tools and so on | 17:05 |
Tengu | in any case - being specific and precise in the config may (and will) avoid headaches. | 17:06 |
clarkb | ack so this may be specific to the installation and config. In that case having NM ignore /etc/resolv.conf does make sense. Can we be sure to document why we're making that change pretty clearly in the commit/comments | 17:06 |
clarkb | I'm mostly just surprised because NM refuses to act that way on my laptop where it would actually be helpful :) | 17:06 |
clarkb | Tengu: re crudini you don't need it in the resulting image so I'd install it and then remove it | 17:07 |
clarkb | but another option (and maybe preferable option) is to just use a small python script | 17:07 |
clarkb | use config parser, set the flag, and done | 17:07 |
Tengu | clarkb: I can check for that tomorrow (getting late EMEA here). Care to comment the thing about crudini? Maybe installing it on the builder itself and calling it from outside the chroot is a thing? not really sure what's best... | 17:09 |
Tengu | or I can, indeed, drop the venv. | 17:09 |
clarkb | Tengu: for low hanging performance fruit in tripleo specifically I would look at optimizing the log uploads. Tripleo log uploads are extremely slow and I think a couple of things can be done to improve them. The first is to ensure that we're avoiding ansible task loops with large numbers of inputs (each ansible task can take a second or more to bootstrap, process 600 log | 17:09 |
clarkb | files and that is 10 minutes). Second is to remove files that never change (lots of /etc type content etc) and flatten dir structures as each dir entry is a swift object that needs to be created with an index file | 17:09 |
Tengu | (also, sorry, I'm working on another issue in //...) | 17:09 |
clarkb | Tengu: I would not install it in the builder image. Personally I would just make a tiny python script that uses config parser to set whatever ini values you need. | 17:09 |
Tengu | clarkb: like, calling "python 'import config_parser ....'" ? | 17:10 |
clarkb | but to be clear deleting things from /opt is the very last optimization I would look at once everything else that is easy has been done | 17:10 |
Tengu | or a plain file | 17:10 |
clarkb | Tengu: either way. You can have a script file or inline it. Probably depends on how complicated and large the python needs to be | 17:11 |
Tengu | clarkb: "re: flatten dir structure" maybe replacing / by _ ? but that would make the whole thing terrible to browse. | 17:11 |
clarkb | Tengu: in tripleos case the logs often have like 10 levels or more with no file contents then only the leaf has any contents | 17:11 |
Tengu | clarkb: basically, it's only one option to add in the [main] section. | 17:11 |
clarkb | In cases where you are deeply nested like that it seems rare that the full path provides any value | 17:12 |
* Tengu takes note for the log thing | 17:12 | |
clarkb | just store the log file somewhere without the deep nesting | 17:12 |
Tengu | logs are collected from within a dedicated ansible project, I can have a look at it. | 17:12 |
Tengu | indeed, if we can make the log collection faster, that may already be a thing | 17:13 |
clarkb | But also (imo) tripleo collects a lot of useless log files | 17:13 |
clarkb | files that never ever change and can be collected only when specifically needed | 17:13 |
Tengu | I can more than probably make a review with the CI folks (ping rlandy|rover ) | 17:14 |
clarkb | https://zuul.opendev.org/t/openstack/build/91f108985637462c9ea5d868c77f9378/log/logs/undercloud/etc/rsyncd.conf is a good example | 17:14 |
clarkb | but there are many like that | 17:14 |
Tengu | thing is, at some point, that one was actually used :). | 17:15 |
Tengu | but yeah, that's for an old, deprecated/dead release iirc. | 17:15 |
clarkb | sure, but it isn't useful today and log upload for tripleo jobs is slow | 17:15 |
Tengu | sooo. yeah. cleaning might help as well. | 17:15 |
clarkb | I'm just saying deleting useless stuff first is what I would do before touching /opt which many jobs rely on and itself is an optimization tool | 17:15 |
clarkb | you can very easily make jobs run longer or start failing by editing /opt incorrectly | 17:15 |
clarkb | not collecting a log file that can be retrieved from the package associated with its installation hurts nothing | 17:16 |
Tengu | sure | 17:16 |
rlandy|rover | Tengu: pls add whatever needs to review to our review list and we'll take care of it | 17:17 |
Tengu | rlandy|rover: I'll create a jira card for that and come back with some more meat. | 17:17 |
clarkb | on the ansible task cost side of things it would be great if we could convince ansible that per task costs are actually important, but unfortunately we've not been able to make headway there | 17:17 |
rlandy|rover | sure | 17:17 |
Tengu | clarkb: same here... and it will get even worse with the AEE... | 17:18 |
clarkb | anytime you have a loop with more than a handful of items you now need to consider writing a python module... | 17:18 |
Tengu | a former colleague counted over 10s to just bootstrap the env, before anything was actually started with the playbook and all. | 17:18 |
Tengu | and with tripleo being composed of many playbooks, you can see how bad it can be. | 17:19 |
clarkb | I've written at least one module to address some problematic loops in base jobs. There are more all over our jobs though because as someone writing a job it isn't apparent that a loop is dangerous | 17:19 |
Tengu | yeah, we also did some of that work here. Maybe the log collection may benefit of some python love as well. | 17:20 |
clarkb | (also I think the task bootstrap time has steadily gotten worse over time so when some of these were written it was fine but now under ansible 5/6 it's a huge problem) | 17:20 |
clarkb | also ansible 5 broke pipelining, but I haven't seen ansible 6 be particularly quick with pipelining fixed | 17:21 |
fungi | also ansible-lint contributes to this problem by insisting that lots of things which could be done efficiently by shelling out to an external command should instead use existing ansible modules which may need to be executed in loops where the shell commands would have operated efficiently on multiple files in one execution | 17:22 |
clarkb | fungi: exactly. It's a core part of the ansible language and one that is strongly encouraged by the ecosystem. There is no warning that says "you have more than 5 elements here consider a different tool" | 17:23 |
clarkb | vishalmanchanda: fungi: I wonder if the easiest thing is to switch the job definition to debian and see if it just works? | 17:24 |
clarkb | vishalmanchanda: fungi: considering the jobs install firefox and chromium from packages and then run tox I half expect that to just work | 17:25 |
clarkb | but then you don't need to sort out PPAs and special packages. | 17:25 |
vishalmanchanda | clarkb: yeah we can try that, how can I do that? | 17:26 |
clarkb | vishalmanchanda: in the jobs you want to run with selenium change nodeset: ubuntu-jammy to nodeset: debian-bullseye | 17:27 |
vishalmanchanda | clarkb: ok let me quickly try that. | 17:28 |
*** jpena is now known as jpena|off | 17:30 | |
clarkb | Tengu: just talking out loud here for other optimization ideas: One along the lines of /opt optimizing is looking for redundant actions (like installing tempest multiple times or editing /etc/ssh/known_hosts and ~zuul/.ssh/known_hosts with the same content). I know a number of projects continue to install python2 related stuff even though they no longer support python2 | 17:34 |
clarkb | anymore (this happens via bindep entries but could happen in other ways). In Zuul's testsuite the performance difference between python 3.8 and 3.10 was dramatic, using newer python interpreters can have a pretty big impact. In the past it's been common to trace job slowness and timeouts back to heavy use of swap. Double checking jobs are avoiding swap might be helpful. As | 17:35 |
clarkb | would improving known memory hogs (cinder-backup, privsep, etc). Apparently running python in containers can slow things down due to seccomp :( it may not be appropriate to avoid that though. There are so many things that we could do better if we had a concerted effort around performance, but it seems the desire just often isn't there and I end up poking at things that are | 17:35 |
clarkb | non controversial and relatively easy to try and help. | 17:35 |
clarkb | oh! I can't forget replacing osc with dedicated scripts that can reuse tokens (though apparently we can configure osc to do this too; anyone know if progress has been made on that?) and avoid python startup time finding entrypoints (which is costly) | 17:41 |
Tengu | clarkb: I'll do some listing of the potential things we can do within tripleo. but first, I want that unbound vs networkmanager situation solved for good | 17:46 |
clarkb | ++ I don't think any of this is urgent, but it is something that I periodically look into and so have a bunch of random ideas for things that can help. | 17:48 |
Tengu | := | 17:48 |
Tengu | :) | 17:48 |
Tengu | for now - EOD is hitting hard, wide will be angry if I extend more.. | 17:49 |
Tengu | clarkb, fungi thanks for the pointers and help! | 17:49 |
vishalmanchanda | clarkb: here is the patch https://review.opendev.org/c/openstack/horizon/+/865453/2 it changes nodeset from ubuntu-focal->debian-bullseye | 17:53 |
clarkb | vishalmanchanda: cool let's see if that is any happier. It's possible there are debian vs ubuntu differences we need to handle, but in this case because the testing should be very self contained I expect it may just work | 17:54 |
vishalmanchanda | clarkb: it's failing:( | 17:54 |
vishalmanchanda | https://zuul.openstack.org/status#865453 | 17:55 |
clarkb | looks like the debian package for chromium is just chromium and not chromium-browser | 17:56 |
vishalmanchanda | clarkb: hmm ok some are not available... | 17:56 |
clarkb | and firefox is firefox-esr? I didn't expect those differences. But I guess if they have diverged on how to package them, different names aren't surprising | 17:57 |
clarkb | vishalmanchanda: I've pushed https://review.opendev.org/c/zuul/zuul-jobs/+/865459 and once testing for that change looks good we can update your change to depends on it | 18:15 |
vishalmanchanda | clarkb: ok thanks | 18:15 |
fungi | clarkb: there is a "firefox" package in debian too but not in stable because... stability | 18:30 |
fungi | so firefox-esr tracks the firefox extended stable release versions instead of latest | 18:30 |
fungi | and i guess the maintainers thought it best to provide those as different package names | 18:31 |
fungi | (on unstable you can choose between esr or latest that way) | 18:31 |
clarkb | gotcha | 18:32 |
fungi | might have made more sense to make "firefox" be a virtual package which depends: firefox-latest|firefox-esr and then on stable you'd get whichever was available, but yeah | 18:32 |
clarkb | ++ | 18:32 |
clarkb | it looks like my zuul-jobs edit and vishalmanchanda's change to switch the jobs to bullseye are working for at least some of the jobs | 18:32 |
clarkb | the selenium-headless job is rerunning for some reason but nodejs16-run-test ran | 18:33 |
clarkb | it's a bindep issue in horizon, one sec | 18:33 |
fungi | neat | 18:34 |
clarkb | interestingly I thought bindep ran in the job that succeeded so not sure why it didn't break too. | 18:34 |
fungi | different profile maybe? | 18:39 |
clarkb | ya there is a selenium profile in use | 18:40 |
clarkb | change updated to distinguish between the two firefox packages across ubuntu and debian | 18:40 |
clarkb | I think there may be another issue because horizon-integration-tests is restarting. But the good news is the nodejs16-run-test job seems to show it talking to firefox successfully | 18:52 |
clarkb | looks like the integration-tests job runs devstack? Which implies that devstack isn't working on bullseye? I thought I fixed that recently.. | 18:56 |
clarkb | oh I see it's a nodeset issue | 18:57 |
clarkb | vishalmanchanda: it looks like there are errors finding the firefox binary now. hopefully those are addressable though. It does look like the nodejs suite has figured it out at least but not python | 19:15 |
clarkb | vishalmanchanda: I've just pushed a new update that bumps the geckodriver version. One thing I notice is they have done work to improve snap support with newer geckodrivers. It is possible this may be part of the puzzle for ubuntu too? | 19:33 |
clarkb | vishalmanchanda: see https://github.com/mozilla/geckodriver/releases/ | 19:33 |
vishalmanchanda | clarkb: yeah I did that too in my local env to fix the selenium-headless job | 19:36 |
vishalmanchanda | clarkb: but I just tried with the older version v0.27.0 and it also works if I install firefox as a deb package on ubuntu-jammy. | 19:43 |
vishalmanchanda | clarkb: I also updated geckodriver version here https://review.opendev.org/c/openstack/horizon/+/861140 but selenium-headless job still fails. | 19:44 |
fungi | well, yeah because that entirely sidesteps the snap situation | 19:44 |
fungi | i mean, the reason it works when installing an actual deb of the browser on ubuntu | 19:45 |
vishalmanchanda | one question I have in mind. | 19:47 |
vishalmanchanda | Are there any cons of running these jobs on debian-bullseye instead of ubuntu-jammy? | 19:48 |
fungi | vishalmanchanda: other than potential inconsistency with other jobs, not really. debian is one of the target distros for this cycle: https://governance.openstack.org/tc/reference/runtimes/2023.1.html | 19:49 |
vishalmanchanda | fungi: ok | 19:50 |
clarkb | right I strongly suspect that part of the solution here is a newer geckodriver. That may not be sufficient though | 20:01 |
clarkb | vishalmanchanda: the selenium headless job passes now after updating geckodriver | 20:02 |
clarkb | so ya that was a piece of the puzzle | 20:02 |
vishalmanchanda | clarkb: cool, thanks for the help. | 20:03 |
*** dviroel is now known as dviroel|afk | 21:25 | |
*** rlandy|rover is now known as rlandy|out | 22:25 | |
*** dasm is now known as dasm|off | 23:49 |
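
Editor's sketch: the package-name divergence discussed above (chromium-browser vs chromium, firefox vs firefox-esr) can be pictured as a small per-distro lookup. This is only an illustration in Python under stated assumptions, not the actual zuul-jobs or horizon bindep change; the names `BROWSER_PACKAGES` and `browser_packages` are hypothetical, and only the four package names come from the conversation.

```python
# Package names are the ones mentioned in the log above. The mapping
# structure and the function name are hypothetical, for illustration only.
BROWSER_PACKAGES = {
    "ubuntu": {"chromium": "chromium-browser", "firefox": "firefox"},
    "debian": {"chromium": "chromium", "firefox": "firefox-esr"},
}


def browser_packages(distro):
    """Return the sorted browser package names to install on a distro."""
    try:
        names = BROWSER_PACKAGES[distro]
    except KeyError:
        # Fail loudly rather than guessing names for an unknown distro.
        raise ValueError(f"no package mapping for distro {distro!r}") from None
    return sorted(names.values())


if __name__ == "__main__":
    for distro in ("ubuntu", "debian"):
        print(distro, browser_packages(distro))
```

Keeping the names in one table makes a nodeset switch like ubuntu-jammy to debian-bullseye a data change rather than a change to the install logic, which is roughly what distinguishing the packages per platform achieves.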