Wednesday, 2022-11-23

*** ministry is now known as __ministry		03:31
*** yadnesh\|away is now known as yadnesh		04:03
*** akekane is now known as abhishekk		05:16
*** jpena\|off is now known as jpena		08:29
*** yadnesh is now known as yadnesh\|afk		08:47
opendevreview	Merged openstack/project-config master: Add Allow-Post-Review flag to OpenStackSDK project https://review.opendev.org/c/openstack/project-config/+/859976	09:04
opendevreview	Merged openstack/project-config master: Add post-review pipeline https://review.opendev.org/c/openstack/project-config/+/859977	09:07
*** yadnesh\|afk is now known as yadnesh		09:22
Tengu	hello there! We're seeing this task taking a "nice" amount of time in tripleo jobs: https://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/roles/configure-swap/tasks/ephemeral.yaml#L108	09:43
Tengu	I'm wondering about the reasons of its presence: what's already in /opt at this point, and isn't there any way to get the volume mounted there before writing anything?	09:44
opendevreview	Cedric Jeanneret proposed openstack/openstack-zuul-jobs master: Add some output to the `find' command https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/865383	10:03
Tengu	-^ this should help understand the actual thing.	10:04
*** rlandy\|out is now known as rlandy\|rover		11:10
*** dviroel\|afk is now known as dviroel		11:15
*** soniya29 is now known as soniya29\|afk		11:23
fungi	Tengu: it's because some of our donor clouds give us too small of a rootfs (20gb), but provide an unformatted "ephemeral disk" we can format and mount. in those providers we use part of the ephemeral disk for swap (so as not to take up more space on the rootfs with a swapfile) and mount the rest at /opt but need to shuffle the git cache out of the way and then put it back since it's in	12:34
fungi	/opt on our images	12:34
Tengu	fungi: ah, so the /opt is indeed pre-provisioned...	12:34
fungi	i agree we ought to look at possible ways to speed those tasks up	12:35
Tengu	fungi: are the git repositories there really that useful? I mean, does it represent a real gain of time compared to having to move them?	12:35
fungi	it can take several minutes to clone nova over the network, for example, and puts a significant strain on our git servers if every build does it (we've seen it result in an instant self-imposed denial of service against opendev.org in the past when we've accidentally uploaded images without a cache there)	12:38
Tengu	hmm ok. but it also takes a long time to then move the content :/	12:38
Tengu	fungi: /opt is replaced in order to free space for /home/zuul, I guess?	12:39
fungi	yes, and also jobs like devstack/tempest do most of their work in /opt in order to be assured of enough available space	12:40
fungi	the git refs in /home/zuul are populated from the cache in /opt too	12:40
fungi	so that the executor only needs to push refs which aren't already in the node's cache	12:41
Tengu	hmm ok. wouldn't it be possible to connect as root first, check for ephemeral, switch the /home directory (and, why not, create symlinks in /opt for tempest/others), ensure zuul's home is present, and then only run as zuul ?	12:41
fungi	well, there's still a need to mount the extra space on /opt unless the job is really going to be able to operate within a 20gb rootfs on some providers and with no swap	12:42
Tengu	hmm. can't we "just" remove the /opt/git (and, if anything is left in /opt, then only move the content to the new partition)?	12:44
fungi	it's possible we can shuffle some of this around, though we've set expectations for projects that 1. if you need extra space do your work in /opt and 2. there's a cache of all git repos in /opt	12:44
Tengu	I think "rm" would be faster than "mv" in such case.	12:44
Tengu	i.e. once the cache is used, of course.	12:44
fungi	i'm not sure there's necessarily a "once the cache is used" unless that's "once the job is over"	12:45
fungi	at least devstack and grenade used to rely on the cache in /opt, but it's possible they no longer do	12:46
Tengu	oh, so there are other things than https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/prepare-workspace-git/tasks/main.yaml hitting on that cache?	12:46
fungi	there was at one time, i don't know if there still is so that's what we'd need to find out	12:46
Tengu	hmm ok. maybe we can do a middle-step, by passing a parameter that switches the "mv" to "rm/cleaning"? That way, we may be able to test on some jobs?	12:47
fungi	one example that springs to mind is the tag-releases job, which is triggered by changes merging to the openstack/releases repo but could create and push tags for any project in openstack so can't really add every single project as a required-projects entry for the job	12:48
Tengu	fair enough. So maybe a parameter would help.	12:48
Tengu	I can propose something	12:48
Tengu	for instance, within tripleo, we have a tripleo-ci repository, and that one is calling the configure-swap role. We could pass some params at this point, allowing to just empty /opt/git instead of moving everything?	12:49
fungi	obviously we'll want input from some other folks as well. i won't pretend to have the best grasp of what might break (which may not become apparent within the job if it's side effects like overloading our git servers at peak activity), and also i just woke up so still a little groggy	12:50
Tengu	:))	12:50
* Tengu pours some fresh coffee		12:50
fungi	i have a bad habit of rolling over in bed and instantly opening my computer	12:51
Tengu	I have a dog in order to avoid that: first take the dog out, then coffee, and then only laptop :)	12:51
fungi	the cats got fed right away, since i didn't want them realizing i'm also made of meat	12:54
Tengu	:D	12:54
Tengu	though they don't have any thumb - meaning they may get some issues if the Might Thumbs Owner isn't there anymore	12:55
Tengu	*Mighty	12:55
Tengu	(yeah - also have a cat, I know how it is with them ;))	12:55
Tengu	anyway - have to jump on a meeting, but that's an interesting discussion/topic: optimizing things	12:56
Tengu	fungi: and, once you get some coffee and all, I'll also have a question about unbound, networkmanager and, well, dnsmasq :).	12:56
fungi	ask your dns/network question at your convenience	13:00
Tengu	fungi: (still in a call, but...) so, I see unbound gets configured, but actually not really used - that is, NetworkManager doesn't know about it, and usually this means we'll get the network nameservers (provided by dhclient) instead of the 127.0.0.1 in the /etc/resolv.conf. that's a first thing.	13:14
Tengu	now, after some searches, it seems NetworkManager support for unbound is ending. Would it make sense to install dnsmasq instead, and properly configure it so that NetworkManager actually uses it for dns caching?	13:14
Tengu	and, if not, why isn't NetworkManager properly configured to not override the /etc/resolv.conf?	13:15
fungi	Tengu: i'm not sure what direct support nm needs for unbound, it just needs to know to query 127.0.0.1 as its dns server. i guess the question is how do we correctly configure nm to not overwrite the resolver hint when you're restarting it. ianw did a lot of work on nm integration/configuration for recent fedora and centos so probably has a better handle on what's going on there	13:16
Tengu	well, I actually know what to do in nm in order to make it use dnsmasq/unbound.	13:17
fungi	unbound is really used at the beginning of your jobs, but at some point over the course of the tripleo jobs something triggers nm to overwrite the configuration and after that point you're no longer resolving through unbound's cache	13:17
Tengu	basically, you can configure main.dns value in its config file and it will do the magic - for instance, if we set the value to "dnsmasq", it will start dnsmasq with the nameservers provided by DHCP, plus additional config we may push in /etc/NetworkManager/dnsmasq.d/	13:18
Tengu	unbound also used to have the same support, but it was removed lately.	13:18
Tengu	(sorry, has to focus on the call - back in a few)	13:18
fungi	and yeah, we really don't want the nameservers provided by dhcp to ever be used from test nodes, even as forwarded resolvers from the local caching resolver	13:19
Tengu	even as forwarder? why so?	13:19
fungi	so nm "integration" to provide that would be a problem	13:19
fungi	it's a long story, but basically the resolvers our cloud donors provide are often broken	13:19
Tengu	erf...	13:20
Tengu	that's why it's using google/cloudflare by default then. OK.	13:20
Tengu	rlandy\|rover: -^^ guess we'll just go with Sandeep work on ensuring it's using unbound as resolver, without the networkmanager integration, and using the public forwarders we currently have.	13:20
Tengu	fungi: may I then propose a patch against the configure-unbound in order to configure NetworkManager to NOT override the /etc/resolv.conf?	13:21
Tengu	so that unbound will be used all the way..	13:22
fungi	a good example is rackspace. they've had problems with abusers/compromised machines flooding their resolvers in attempts to use them in ddos attacks, so they have some sort of security device which adds rules blocking client ip addresses which appear to be a problem. but that doesn't take into account that the blocked addresses might get recycled to another tenant's vm and so we	13:22
fungi	constantly ended up with test nodes which couldn't resolve anything because their addresses had been previously blocked from reaching the resolvers	13:22
Tengu	ah, yeah, security at its finest :)	13:22
rlandy\|rover	Tengu: ack- that is the current plan	13:23
fungi	anyway, yes a patch to configure unbound not to accidentally undo our configuration would be useful	13:23
Tengu	fungi: on it - I'll push it today	13:23
Tengu	I can make it optional.	13:23
Tengu	(default true, but if we don't want to touch networkmanager config, we can set it to false)	13:24
fungi	my best guess is that https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements/nodepool-base/finalise.d/89-boot-settings is where we'd want to do it	13:26
Tengu	fungi: not here? https://opendev.org/opendev/base-jobs/src/branch/master/roles/configure-unbound/tasks	13:28
Tengu	(using ansible would be easier, since it's an ini file, and there's a module for that)	13:30
fungi	i guess there's the question of how early would be best to apply the additional configuration	13:31
Tengu	apparently the configure-unbound is called anyway during the job	13:32
Tengu	but yeah, seeing the /etc/resolv.conf is overridden at boot, it's probably better to edit that 89-boot-settings.	13:33
fungi	diskimage-builder sets up some of this when creating the node images, then glean does boot-time configuration based on the detected network info from configdrive/dhcp, then ansible configures things at job runtime	13:33
Tengu	yeah, DIB modification is probably safer.	13:33
Tengu	fine. I'll propose a patch shortly.	13:33
Tengu	(after my current call)	13:34
fungi	there's no rush, things are slow this week and a lot of us aren't around the computer much (and the rest of us are juggling a lot of stuff as always)	13:34
Tengu	:)	13:35
Tengu	fungi: I'll also propose something against the ansible role in order to be consistent. And this would allow tripleo to depends-on the change for some more testing.	13:36
fungi	Tengu: depends-on to opendev/base-jobs won't work since it's a trusted repo where speculative execution isn't performed	13:40
fungi	to properly exercise that we merge changes to a copy of the role and modify the base-test job to use it instead of the real one, then you can have a proposed change in an untrusted repo which is parented to base-test explicitly (instead of implicitly to base)	13:41
Tengu	ah, right, we could test with another repo though. There are so many of them involved :/	13:42
Tengu	openstack-zuul-jobs isn't trusted - I'm mixing things.	13:42
Tengu	(that was for the /opt thingy)	13:42
fungi	correct, ozj is an untrusted repo, so you can depends-on to changes in review for it no problem	13:43
Tengu	yeah - I wanted to get some insight https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/865383	13:43
Tengu	maybe it's still worth to get in - up to you folks.	13:43
fungi	zuul's security model prohibits speculatively executing proposed changes for config repos though, because that could be leveraged to expose or exfiltrate secrets and perform other sensitive actions	13:44
Tengu	of course	13:44
Tengu	fungi: do we have crudini in DIB env?	13:50
Tengu	humpf. apparently not.	13:52
fungi	crudini the magnificent, prestidigitation extraordinaire	13:54
Tengu	it's a good tool when it comes to ini_file things :)	13:54
Tengu	would it be OK to get it in the nodepool-base?	13:54
Tengu	dependency-wise, it's apparently relying on python3-iniparse	13:54
Tengu	which isn't heavy at this point..	13:55
fungi	we could install it outside the chroot. the problem with installing things inside the image which have python library dependencies is that they're very hard to uninstall when they may end up conflicting with libs from pip later. putting crudini into a venv might be another solution since that would isolate it from the system python library path and avoid causing problems for jobs	13:56
Tengu	hmmmm so here I don't know how to do either of those two better solutions :/	13:57
fungi	it's one of the reasons we use glean instead of cloud-init, because the latter drags in python modules	13:57
Tengu	I could of course override the whole file - but while I'm sure nothing will be lost on a standard fc-35 and cs-9, I don't know about other distros such as opensuse	13:58
Tengu	and doing multi-line regexp is a bad idea imho.	13:59
fungi	basically we just want to avoid polluting the resulting server image with unnecessary libs that end up conflicting with what jobs may want to install on those nodes later	13:59
*** dasm\|off is now known as dasm		13:59
Tengu	I understand this - and am all for "keep it clean".	13:59
fungi	take a look at nodepool/elements/infra-package-needs/install.d/40-install-bindep for an example of the venv approach	13:59
Tengu	ah	14:00
fungi	then anything needing to run bindep can just call /usr/bindep-env/bin/bindep directly or symlink to it from somewhere in the path like /usr/local/bin	14:01
Tengu	sooooo.... am I to add such a file in the nodepool-base/install.d so that I can install crudini, and then call crudini from within the venv in the 89-boot-settings ?	14:01
fungi	that would be one option, yes. the other option would be to install crudini into the nodepool-builder container image and then use whatever the dib mechanism is to run that from outside the chroot instead of from inside	14:02
Tengu	ah, there's already a 91-venv-os-testr in nodepool-base	14:02
Tengu	I'm not fluent enough with dib for that. Since there's already a venv available in the nodepool-base, I can "just" install crudini in it..	14:03
fungi	but again, all this is on the very edge of my experience so when others are around they may have superior suggestions	14:03
Tengu	sure	14:03
Tengu	I'll propose something, so that it's not "just" theory	14:04
Tengu	guess we'll get more ppl next week, after thanksgiving.	14:04
fungi	likely, or perhaps later today when other parts of the world wake up	14:04
fungi	ianw is probably our foremost expert on this particular topic, but he's on apac time	14:05
Tengu	ah, so too late	14:05
fungi	or too early, depending on how you look at a globe	14:06
Tengu	:) I'll try to catch him tomorrow during my morning (EMEA)	14:08
Tengu	should be fine.	14:08
opendevreview	Cedric Jeanneret proposed openstack/project-config master: Ensure NetworkManager doesn't override /etc/resolv.conf https://review.opendev.org/c/openstack/project-config/+/865433	14:12
vishalmanchanda	clarkb: fungi : hello, a query regarding bug https://bugs.launchpad.net/horizon/+bug/1996638	14:15
vishalmanchanda	clarkb: fungi : here is link of logs what we discussed in the past https://etherpad.opendev.org/p/migrate-to-jammy#L33	14:15
vishalmanchanda	right now after migration to ubuntu-jammy, more horizon jobs like horizon-selenium-headless, horizon-integration-tests start failing.	14:17
fungi	vishalmanchanda: this is the issue with using snap-installed browsers in a headless xvfb right?	14:18
vishalmanchanda	In the latest P.S. you can find error logs for both of these job https://review.opendev.org/c/openstack/horizon/+/861140	14:18
vishalmanchanda	fungi: yes.	14:18
vishalmanchanda	I am getting the same error in my local env. deployed on ubuntu-focal.	14:18
fungi	were you able to see if coreycb or tinwood had ideas? this may get deep into ubuntu dbus/login design choices	14:19
vishalmanchanda	fungi: not yet.	14:19
vishalmanchanda	fungi: but I found one fix which works in my local env.	14:19
vishalmanchanda	fungi: now I want to know how we can apply the same in our openstack CI job.	14:20
fungi	an alternative might be to run the jobs on debian, where chromium and firefox are available as normal deb packages	14:20
fungi	what's the workaround?	14:20
vishalmanchanda	fungi: e.g. if I install firefox following all the steps mentioned here https://www.omgubuntu.co.uk/2022/04/how-to-install-firefox-deb-apt-ubuntu-22-04#:~:text=Installing%20Firefox%20via%20Apt%20(Not%20Snap)&text=You%20fist%20add%20the%20Mozilla,reinstalled%20at%20a%20later%20date. , then the 2 jobs work fine	14:21
fungi	ahh, okay, so similar to running the test on debian instead of ubuntu	14:24
vishalmanchanda	Install Firefox as a .Deb on Ubuntu 22.04 (Not a Snap)	14:25
fungi	but in this case getting a deb of the browser from outside ubuntu's archive rather than switching to a distro which provides something along those lines	14:25
fungi	yeah, that's fairly straightforward. i should be able to find you an example, i think your jobs already do something similar to install newer nodejs packages	14:25
vishalmanchanda	ok	14:27
fungi	vishalmanchanda: i think this is the basic template for it... https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/ensure-nodejs/tasks/main.yaml	14:28
fungi	oh, though maybe easier since this is in a ppa on lp and ubuntu already provides some slightly nicer integration for those	14:30
vishalmanchanda	fungi: ok, thanks for the reference, I will try to do the same thing to install Firefox as a .deb package	14:30
fungi	vishalmanchanda: here's an example of consuming packages from a ppa in a role: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/ensure-skopeo/tasks/Ubuntu.yaml	14:31
vishalmanchanda	fungi: So I should update this task https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/nodejs-test-dependencies/tasks/main.yaml#L6 ?	14:33
vishalmanchanda	because this task is responsible for installing firefox and uses snap by default to install firefox	14:34
fungi	vishalmanchanda: changing it in the zuul-jobs repo is probably a bigger question, since you're wanting to apply an ubuntu-specific workaround and that role is generic for a variety of different distros, so we'd need an ubuntu-specific variant of it similar to how the ensure-skopeo role does it (you'll see there's an Ubuntu-22.04.yaml in it which does just a normal package install of	14:39
fungi	skopeo vs the Ubuntu.yaml which installs skopeo from a ppa)	14:39
fungi	you'll see https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/ensure-skopeo/tasks/main.yaml does include_tasks: "{{ zj_distro_os }}" in order to allow us to switch behavior based on the platform where it's run	14:40
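The per-platform switch fungi describes (a more specific tasks file such as Ubuntu-22.04.yaml taking precedence over Ubuntu.yaml) can be pictured with a small Python sketch. This is only an illustration of the selection idea, not how the zuul-jobs role is implemented: the function name `pick_tasks_file`, the candidate order, and the `default.yaml` fallback are all assumptions.

```python
def pick_tasks_file(distro, version, available):
    """Return the most specific tasks file present in `available`, or None.

    Hypothetical helper: the candidate order and the "default.yaml"
    fallback are assumptions made for illustration only.
    """
    candidates = [
        f"{distro}-{version}.yaml",  # e.g. Ubuntu-22.04.yaml
        f"{distro}.yaml",            # e.g. Ubuntu.yaml
        "default.yaml",              # assumed generic fallback
    ]
    for name in candidates:
        if name in available:
            return name
    return None
```

With files named Ubuntu-22.04.yaml and Ubuntu.yaml available, a 22.04 node would pick the first and any other Ubuntu release the second, which matches the skopeo example of a plain package install on one release and a ppa install on the others.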
vishalmanchanda	fungi: Could you help me in writing tasks and roles to fix this issue if you have time, because I have very basic knowledge on the ansible side.	14:49
fungi	vishalmanchanda: this is a very bad week for me to commit to more work, i'm already drowning and at the moment i seem to be the only one around to answer questions	14:50
vishalmanchanda	fungi: ok np.	14:50
vishalmanchanda	fungi: let me just hit and try then.	14:50
vishalmanchanda	fungi: anyway thanks for providing all these references, these are very helpful:)	14:51
fungi	vishalmanchanda: happy to help, and i might be able to help more i just don't want to commit to anything at least until i get caught up on the pile of pending things in my inbox	14:56
*** dviroel is now known as dviroel\|lunch		15:07
*** soniya29\|afk is now known as soniya29		15:48
*** akekane is now known as abhishekk		16:09
*** dviroel_ is now known as dviroel		16:15
*** yadnesh is now known as yadnesh\|away		16:41
clarkb	Tengu: fungi: while the /opt setup in that cloud does take some time I really feel like optimizing that is poor application of optimization. We've got jobs that spend more than half an hour curating logs for example. Some of that is ansible per task slowness, some of it is poor log layout for swift object storage, and some of it is just slow runtime. Tempest for example is	16:59
clarkb	installed three times in tempest jobs one of which deletes a previous install and reinstalls. Basically all that to say we have a lot of low hanging performance fruit and almost all of it is safer than changing base node expectations.	16:59
Tengu	clarkb: I'd be happy to help chasing those fruits and making some nice salad with them :)	17:00
clarkb	Tengu: fungi: and yes to be extra extra clear unbound is used by our nodes when we hand them over to you. removing unbound would break things	17:00
fungi	however, configuring nm to know not to blow away the resolver settings we've supplied would probably be good	17:01
clarkb	fungi: yup I agree however I'm confused as to why this is an issue if things boot just fine	17:02
clarkb	Tengu: fungi: ^ I looked into this the other day quite a bit and everything I could see is that NM and unbound were working as expected and the tripleo jobs do something to break it. Can we just stop doing the something instead?	17:02
clarkb	or maybe the something should be smarter about what it does. But as far as I can tell there wasn't anything wrong with NM and unbound as configured at boot	17:03
Tengu	clarkb: I'd rather be smart with NetworkManager and make it understand "don't touch that file"	17:03
Tengu	since it's something we can do....	17:03
clarkb	Tengu: right but why is it touching the file. I think that is my confusion	17:03
fungi	there may be a startup race, since we're setting the resolver in rc.local which probably runs after nm has settled	17:03
Tengu	basically, nm will touch it whenever there's a reload (nmcli reload, systemctl restart networkmanager), lease refresh and the like	17:04
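One way to confirm whether a node is still resolving through the local unbound cache, or whether NetworkManager has rewritten the file, is to inspect the nameserver lines in /etc/resolv.conf. The Python sketch below is a hypothetical check, not something from the jobs themselves; the function names are made up, and treating ::1 as a local resolver alongside 127.0.0.1 is an assumption.

```python
# Addresses treated as "the local caching resolver". 127.0.0.1 comes from
# the discussion; including ::1 is an assumption.
LOCAL_RESOLVERS = {"127.0.0.1", "::1"}


def nameservers(resolv_conf_text):
    """Return the addresses from 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Comment lines start with '#' or ';' and so never match here.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def uses_only_local_resolver(resolv_conf_text):
    """True if at least one nameserver is listed and all of them are local."""
    servers = nameservers(resolv_conf_text)
    return bool(servers) and all(s in LOCAL_RESOLVERS for s in servers)
```

Running `uses_only_local_resolver(open("/etc/resolv.conf").read())` before and after an `nmcli reload` would show whether the reload replaced the resolver entry.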
clarkb	huh my local NM doesn't do that (in fact it's a problem on my laptop because I can't get it to reset things after tethering with my phone)	17:04
fungi	it might also be specific to centos stream 9	17:05
clarkb	specifically if I usb tether and then stop NM refuses to update networking settings when switching back to wifi or similar. Its really annoying	17:05
Tengu	clarkb: check the NetworkManager config to see what's going on in there. Also, the default behavior of that service changes depending on the release, availability of some tools and so on	17:05
Tengu	in any cases - being specific and precise in the config may (and will) avoid headaches.	17:06
clarkb	ack so this may be specific to the installation and config. In that case having NM ignore /etc/resolv.conf does make sense. Can we be sure to document why we're making that change pretty clearly in the commit/comments	17:06
clarkb	I'm mostly just surprised because NM refuses to act that way on my laptop where it would actually be helpful :)	17:06
clarkb	Tengu: re crudini you don't need it in the resulting image so I'd install it and then remove it	17:07
clarkb	but another option (and maybe preferable option) is to just use a small python script	17:07
clarkb	use config parser, set the flag, and done	17:07
Tengu	clarkb: I can check for that tomorrow (getting late EMEA here). Care to comment the thing about crudini? Maybe installing it on the builder itself and calling it from outside the chroot is a thing? not really sure what's best...	17:09
Tengu	or I can, indeed, drop the venv.	17:09
clarkb	Tengu: for low hanging performance fruit in tripleo specifically I would look at optimizing the log uploads. Tripleo log uploads are extremely slow and I think a couple of things can be done to improve them. The first is to ensure that we're avoiding ansible task loops with large numbers of inputs (each ansible task can take a second or more to bootstrap, process 600 log	17:09
clarkb	files and that is 10 minutes). Second is to remove files that never change (lots of /etc type content etc) and flatten dir structures as each dir entry is a swift object that needs to be created with an index file	17:09
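The arithmetic behind clarkb's estimate is simple enough to write down. The helper below is a hypothetical back-of-the-envelope calculation in Python; the one-second default is the lower bound quoted in the chat ("a second or more"), not a measured value.

```python
def loop_overhead_minutes(item_count, seconds_per_task=1.0):
    """Estimate the time spent purely on per-task bootstrap in a loop.

    Hypothetical helper: seconds_per_task defaults to the "a second or
    more" figure from the discussion, so results are a lower bound.
    """
    return item_count * seconds_per_task / 60.0
```

`loop_overhead_minutes(600)` evaluates to 10.0, the ten minutes mentioned above, while a single task that handles all 600 files at once would pay the bootstrap cost only once.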
Tengu	(also, sorry, I'm working on another issue in //...)	17:09
clarkb	Tengu: I would not install it in the builder image. Personally I would just make a tiny python script that uses config parser to set whatever ini values you need.	17:09
Tengu	clarkb: like, calling "python 'import config_parser ....'" ?	17:10
clarkb	but to be clear deleting things from /opt is the very last optimization I would look at once everything else that is easy has been done	17:10
Tengu	or a plain file	17:10
clarkb	Tengu: either way. YOu can have a script file or inline it. Probably depends on how complicated and large the python needs to be	17:11
Tengu	clarkb: "re: flatten dir structure" maybe replacing / by _ ? but that would make the whole thing terrible to browse.	17:11
clarkb	Tengu: in tripleos case the logs often have like 10 levels or more with no file contents then only the leaf has any contents	17:11
Tengu	clarkb: basically, it's only one option to add in the [main] section.	17:11
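clarkb's suggestion (a tiny Python script that uses configparser to set the one option in the [main] section) could look roughly like the sketch below. The function name `set_ini_option` is hypothetical, and the log never names the option to set; the `dns=none` value used in the note after the code is an assumption based on NetworkManager's documented setting for leaving /etc/resolv.conf alone. One caveat: configparser discards comments when it rewrites a file.

```python
import configparser


def set_ini_option(path, section, option, value):
    """Set one option in an INI file, creating the section if needed.

    Hypothetical helper; which option to set (for example dns=none) is
    an assumption and is not stated in the discussion.
    """
    # interpolation=None keeps literal '%' characters intact; strict=False
    # tolerates duplicate sections or options instead of raising.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    # Preserve the case of option names (configparser lowercases by default).
    parser.optionxform = str
    # read() silently skips a missing file, so a new file is created below.
    parser.read(path)
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, option, value)
    with open(path, "w") as handle:
        # Write "key=value" without spaces around the delimiter.
        parser.write(handle, space_around_delimiters=False)
```

A call such as `set_ini_option("/etc/NetworkManager/NetworkManager.conf", "main", "dns", "none")` would be the whole script, assuming that path and option are the right ones for the image being built.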
clarkb	In cases where you are deeply nested like that it seems rare that the full path provides any value	17:12
* Tengu takes note for the log thing		17:12
clarkb	just store the log file somewhere without the deep nesting	17:12
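To see how much of an upload is directory overhead rather than log content, one could count files and directories in the collected log tree, since each directory becomes its own swift object with an index file. The Python sketch below is a hypothetical measuring aid; the function name is made up and the log does not describe any such tool.

```python
import os


def count_upload_objects(root):
    """Return (file_count, dir_count) for everything below `root`.

    Hypothetical helper: the root directory itself is not counted, and
    symlinked directories are counted but not descended into.
    """
    file_count = 0
    dir_count = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        dir_count += len(dirnames)
        file_count += len(filenames)
    return file_count, dir_count
```

A tree where one log file sits ten directories deep yields one file and ten directories, so flattening it would avoid ten of the eleven objects.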
Tengu	logs are collected from within a dedicated ansible project, I can have a look at it.	17:12
Tengu	indeed, if we can make the log collection faster, that may already be a thing	17:13
clarkb	But also (imo) tripleo collects a lot of useless log files	17:13
clarkb	files that never ever change and can be collected only when specifically needed	17:13
Tengu	I can more than probably make a review with the CI folks (ping rlandy\|rover )	17:14
clarkb	https://zuul.opendev.org/t/openstack/build/91f108985637462c9ea5d868c77f9378/log/logs/undercloud/etc/rsyncd.conf is a good example	17:14
clarkb	but there are many like that	17:14
Tengu	thing is, at some point, that one was actually used :).	17:15
Tengu	but yeah, that's for an old, deprecated/dead release iirc.	17:15
clarkb	sure, but it isn't useful today and log upload for tripleo jobs is slow	17:15
Tengu	sooo. yeah. cleaning might help as well.	17:15
clarkb	I'm just saying deleting useless stuff first is what I would do before touching /opt which many jobs rely on and itself is an optimization tool	17:15
clarkb	you can very easily make jobs run longer or start failing by editing /opt incorrectly	17:15
clarkb	not collecting a log file that can be retrieved from the package associated with its installation hurts nothing	17:16
Tengu	sure	17:16
rlandy\|rover	Tengu: pls add whatever needs to review to our review list and we'll take care of it	17:17
Tengu	rlandy\|rover: I'll create a jira card for that and come back with some more meat.	17:17
clarkb	on the ansible task cost side of things it would be great if we could convince ansible that per task costs are actually important, but unfortunately we've not been able to make headway there	17:17
rlandy\|rover	sure	17:17
Tengu	clarkb: same here... and it will get even worse with the AEE...	17:18
clarkb	anytime you have a loop with more than a handful of items you now need to consider writing a python module...	17:18
Tengu	a former colleague counted over 10s to just bootstrap the env, before anything was actually started with the playbook and all.	17:18
Tengu	and with tripleo being composed of many playbooks, you can see how bad it can be.	17:19
clarkb	I've written at least one module to address some problematic loops in base jobs. There are more all over our jobs though because as someone writing a job it isn't apparent that a loop is dangerous	17:19
Tengu	yeah, we also did some of that work here. Maybe the log collection may benefit of some python love as well.	17:20
clarkb	(also I think the task bootstrap time has steadily gotten worse over time so when some of these were written it was fine but now under ansible 5/6 its a huge problem)	17:20
clarkb	also ansible 5 broke pipelining, but I haven't seen ansible 6 be particularly quick with pipelining fixed	17:21
fungi	also ansible-lint contributes to this problem by insisting that lots of things which could be done efficiently by shelling out to an external command should instead use existing ansible modules which may need to be executed in loops where the shell commands would have operated efficiently on multiple files in one execution	17:22
clarkb	fungi: exactly. Its a core part of the ansible language and one that is strongly encouraged by the ecosystem. There is no warning that says "you have more than 5 elements here consider a different tool"	17:23
clarkb	vishalmanchanda: fungi: I wonder if the easiest thing is to switch the job definition to debian and see if it just works?	17:24
clarkb	vishalmanchanda: fungi: considering the jobs install firefox and chromium from packages and then run tox I half expect that to just work	17:25
clarkb	but then you don't need to sort out PPAs and special packages.	17:25
vishalmanchanda	clarkb: yeah we can try that, how can I do that?	17:26
clarkb	vishalmanchanda: in the jobs you want to run with selenium change nodeset: ubuntu-jammy to nodeset: debian-bullseye	17:27
vishalmanchanda	clarkb: ok let me quickly try that.	17:28
*** jpena is now known as jpena\|off		17:30
clarkb	Tengu: just talking out loud here for other optimization ideas: One along the lines of /opt optimizing is looking for redundant actions (like installing tempest multiple times or editing /etc/ssh/known_hosts and ~zuul/.ssh/known_hosts with the same content). I know a number of projects continue to install python2 related stuff even though they no longer support python2	17:34
clarkb	anymore (this happens via bindep entries but could happen in other ways). In Zuul's testsuite the performance difference between python 3.8 and 3.10 was dramatic, using newer python interpreters can have a pretty big impact. In the past its been common to trace job slowness and timeouts back to heavy use of swap. Double checking jobs are avoiding swap might be helpful. As	17:35
clarkb	would improving known memory hogs (cinder-backup, privsep, etc). Apparently running python in containers can slow things down due to seccomp :( it may not be appropriate to avoid that though. There are so many things that we could do better if we had a concerted effort around performance, but it seems the desire just often isn't there and I end up poking at things that are	17:35
clarkb	non controversial and relatively easy to try and help.	17:35
clarkb	oh! I can't forget replacing osc with dedicated scripts that can reuse tokens (though apparently we can configure osc to do this too anyone know if progress has been made on that?) and avoid python startup time finding entrypoints (which is costly)	17:41
Tengu	clarkb: I'll do some listing of the potential things we can do within tripleo. but first, I want that unbound vs networkmanager situation solved for good	17:46
clarkb	++ I don't think any of this is urgent, but it is something that I periodically look into and so have a bunch of random ideas for things that can help.	17:48
Tengu	:=	17:48
Tengu	:)	17:48
Tengu	for now - EOD is hitting hard, wife will be angry if I extend more..	17:49
Tengu	clarkb, fungi thanks for the pointers and help!	17:49
vishalmanchanda	clarkb: here is patch https://review.opendev.org/c/openstack/horizon/+/865453/2 changes nodeset from ubuntu-focal->debian-bullseye	17:53
clarkb	vishalmanchanda: cool lets see if that is any happier. Its possible there are debian vs ubuntu differences we need to handle, but in this case because the testing should be very self contained I expect it may just work	17:54
vishalmanchanda	clarkb: it's failing:(	17:54
vishalmanchanda	https://zuul.openstack.org/status#865453	17:55
clarkb	looks like the debian package for chromium is just chromium and not chromium-browser	17:56
vishalmanchanda	clarkb: hmm ok some are not available...	17:56
clarkb	and firefox is firefox-esr? I didn't expect those differences. But I guess if they have diverged on how to package them, different names aren't surprising	17:57
clarkb	vishalmanchanda: I've pushed https://review.opendev.org/c/zuul/zuul-jobs/+/865459 and once testing for that change looks good we can update your change to depends on it	18:15
vishalmanchanda	clarkb: ok thanks	18:15
fungi	clarkb: there is a "firefox" package in debian too but not in stable because... stability	18:30
fungi	so firefox-esr tracks the firefox extended stable release versions instead of latest	18:30
fungi	and i guess the maintainers thought it best to provide those as different package names	18:31
fungi	(on unstable you can choose between esr or latest that way)	18:31
clarkb	gotcha	18:32
fungi	might have made more sense to make "firefox" be a virtual package which depends: firefox-latest\|firefox-esr and then on stable you'd get whichever was available, but yeah	18:32
clarkb	++	18:32
clarkb	it looks like my zuul-jobs edit and vishalmanchanda's change to switch the jobs to bullseye are working for at least some of the jobs	18:32
clarkb	the selenium-headless job is rerunning for some reason but nodejs16-run-test ran	18:33
clarkb	it's a bindep issue in horizon, one sec	18:33
fungi	neat	18:34
clarkb	interestingly I thought bindep ran in the job that succeeded so not sure why it didn't break too.	18:34
fungi	different profile maybe?	18:39
clarkb	ya there is a selenium profile in use	18:40
clarkb	change updated to distinguish between the two firefox packages across ubuntu and debian	18:40
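
A minimal sketch of the package-name split discussed above, assuming only the names mentioned in this log (chromium-browser and firefox on Ubuntu, chromium and firefox-esr on Debian stable). The `BROWSER_PACKAGES` table and `browser_package` helper are hypothetical illustrations, not the actual zuul-jobs or bindep change:

```python
# Hypothetical lookup table; the package names come from the discussion
# above, not from inspecting the distro archives.
BROWSER_PACKAGES = {
    "ubuntu": {"firefox": "firefox", "chromium": "chromium-browser"},
    "debian": {"firefox": "firefox-esr", "chromium": "chromium"},
}


def browser_package(distro, browser):
    """Return the distro-specific package name for a browser.

    Raises ValueError for a distro or browser not listed in the table.
    """
    try:
        return BROWSER_PACKAGES[distro.lower()][browser.lower()]
    except KeyError:
        raise ValueError(
            "no known package for %r on %r" % (browser, distro)
        ) from None


if __name__ == "__main__":
    for distro in ("ubuntu", "debian"):
        for browser in ("firefox", "chromium"):
            print(distro, browser, browser_package(distro, browser))
```

In bindep itself the same split would be expressed with platform profiles on each package line rather than in Python.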
clarkb	I think there may be another issue because horizon-integration-tests is restarting. But the good news is the nodejs16-run-test job seems to show it talking to firefox successfully	18:52
clarkb	looks like the integration-tests job runs devstack? Which implies that devstack isn't working on bullseye? I thought I fixed that recently..	18:56
clarkb	oh I see its a nodeset issue	18:57
clarkb	vishalmanchanda: it looks like there are errors finding the firefox binary now. hopefully those are addressable though. It does look like the nodejs suite has figured it out at least but not python	19:15
clarkb	vishalmanchanda: I've just pushed a new update that bumps the geckodriver version. One thing I notice is they have done work to improve snap support with newer geckodrivers. It is possible this may be part of the puzzle for ubuntu too?	19:33
clarkb	vishalmanchanda: see https://github.com/mozilla/geckodriver/releases/	19:33
vishalmanchanda	clarkb: yeah, I did that too in my local env to fix the selenium-headless job	19:36
vishalmanchanda	clarkb: but I just tried with the older version v0.27.0 and it also works if I install firefox as a deb package on ubuntu-jammy.	19:43
vishalmanchanda	clarkb: I also updated the geckodriver version here https://review.opendev.org/c/openstack/horizon/+/861140 but the selenium-headless job still fails.	19:44
fungi	well, yeah because that entirely sidesteps the snap situation	19:44
fungi	i mean, the reason it works when installing an actual deb of the browser on ubuntu	19:45
vishalmanchanda	one question I have in mind.	19:47
vishalmanchanda	Are there any cons to running these jobs on debian-bullseye instead of ubuntu-jammy?	19:48
fungi	vishalmanchanda: other than potential inconsistency with other jobs, not really. debian is one of the target distros for this cycle: https://governance.openstack.org/tc/reference/runtimes/2023.1.html	19:49
vishalmanchanda	fungi: ok	19:50
clarkb	right I strongly suspect that part of the solution here is a newer geckodriver. That may not be sufficient though	20:01
clarkb	vishalmanchanda: the selenium headless job passes now after updating geckodriver	20:02
clarkb	so ya that was a piece of the puzzle	20:02
vishalmanchanda	clarkb: cool, thanks for the help.	20:03
*** dviroel is now known as dviroel\|afk		21:25
*** rlandy\|rover is now known as rlandy\|out		22:25
*** dasm is now known as dasm\|off		23:49
