*** rlandy is now known as rlandy|out | 00:29 | |
*** clarkb is now known as Guest298 | 01:19 | |
*** Guest298 is now known as clarkb | 01:20 | |
*** atmark is now known as Guest305 | 02:10 | |
*** yadnesh|away is now known as yadnesh | 04:14 | |
Tengu | clarkb: need to read some doc about what's done for pypi in the proxy thing, but I think I get it, more or less. basically I'll have to get the S3 URI, and call the "substitute" in order to rewrite it to some "ansible-galaxy-files" location, to match a new "endpoint" in the proxy config. | 07:57 |
Tengu | I'll work on that. | 07:57 |
Tengu | hey. wait. | 07:59 |
Tengu | actually.... there's ALREADY an endpoint! | 07:59 |
* Tengu dumb for not checking beforehand | 07:59 | |
Tengu | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror/templates/mirror.vhost.j2#L127-L133 | 08:00 |
Tengu | fungi: :) you actually created -^ via change-id Ib5664e5588f7237a19a2cdb6eec3109452e8a107 | 08:01 |
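For anyone following along, the /galaxy/ location above can be smoke-tested directly. A minimal sketch, assuming a regional mirror (the FQDN below is illustrative, not a real host):

```bash
# Hypothetical smoke test of the /galaxy/ proxy location defined in
# mirror.vhost.j2; substitute the actual per-region mirror FQDN.
MIRROR_FQDN="mirror.example.opendev.org"
curl -sSI "https://${MIRROR_FQDN}/galaxy/" | head -n1
```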
*** yadnesh is now known as yadnesh|afk | 08:11 | |
*** jpena|off is now known as jpena | 08:23 | |
*** yadnesh|afk is now known as yadnesh | 08:27 | |
*** rlandy|out is now known as rlandy | 11:05 | |
*** dviroel|afk is now known as dviroel | 11:12 | |
*** frenzy_friday is now known as frenzy_friday|rover | 12:15 | |
fungi | Tengu: somehow this doesn't surprise me | 12:42 |
Tengu | fungi: same :) | 12:43 |
Tengu | fungi: go get your coffee first :] | 12:43 |
fungi | aha, yes i guess the tripleo team asked to have it added roughly a year ago | 12:43 |
Tengu | sounds like something matching the votes :) | 12:43 |
Tengu | and they never put it to use. | 12:43 |
fungi | so this means they didn't end up using it? i wonder why | 12:43 |
opendevreview | Merged openstack/project-config master: Use kolla.config for kolla-ansible in gerrit https://review.opendev.org/c/openstack/project-config/+/865686 | 12:50 |
fungi | Tengu: i guess test it and make sure it's working, so we can adjust it | 12:51 |
Tengu | fungi: yeah, I'll talk with them today during the community call :) | 12:51 |
Tengu | fungi: I've pushed this https://review.opendev.org/c/opendev/base-jobs/+/865970 to make the ansible proxy more "visible" | 12:52 |
fungi | Tengu: would https work better? i have no idea if the ansible-galaxy tool cares either way | 12:57 |
Tengu | fungi: I didn't hit such an issue during testing, but maybe switching to TLS would be better. | 12:57 |
Tengu | especially since the certificate is valid | 12:58 |
fungi | we added let's encrypt to our mirrors more recently than we set those existing envvars in the base job, but if the tool is happy either way it probably doesn't matter | 12:58 |
Tengu | bah, let's switch to TLS | 12:58 |
Tengu | it's always better imho. | 12:58 |
Tengu | and future-proof | 12:59 |
Tengu | TLS is in the 4443, isn't it? | 12:59 |
fungi | no, just the regular 443 | 12:59 |
Tengu | really? | 12:59 |
Tengu | fungi: the comment in the mirror config seems to state otherwise... ? | 13:01 |
Tengu | # Dedicated port for proxy caching, as not to affect afs mirrors. and 8080, 4443 | 13:01 |
Tengu | (among other things) | 13:01 |
Tengu | fun... | 13:02 |
Tengu | oh. ok. /galaxy/ is defined in the BaseMirror | 13:02 |
fungi | the test_galaxy_mirror test added in the change you referred to just connects to "https://%s/galaxy/" % addr where addr is just a raw ip address | 13:02 |
Tengu | yep | 13:02 |
Tengu | I wanted to double-check with the apache config itself. | 13:03 |
Tengu | now I get it: BaseMirror macro defines the galaxy, and is called for 80 and 443 | 13:03 |
fungi | the reason we use that BaseMirror macro is so that we can serve the same things through http on 80 and https on 443 without duplicating the configuration | 13:03 |
Tengu | same goes for the ProxyMirror macro, but for other ports. | 13:03 |
Tengu | it's a nice feature from httpd | 13:04 |
fungi | all the higher numbered ports are for "special" things which can't have subpaths relative to the root path | 13:04 |
fungi | we try not to add those when we can help it | 13:04 |
fungi | but some tools are a bit braindead in their assumptions | 13:04 |
Tengu | heh - no wonder. | 13:05 |
Tengu | I updated my patch to reference https:// and removed the port. | 13:05 |
Tengu | good catch, because the :8080 would have failed anyway. | 13:05 |
*** dasm|off is now known as dasm | 13:05 | |
fungi | ahh, yeah, i didn't even spot the :8080! | 13:06 |
* fungi takes another gulp of coffee | 13:06 | |
Tengu | ;) | 13:08 |
Tengu | and I guess we can merge my NetworkManager thingy? 3x +2 is good | 13:08 |
Tengu | ah, thanks fungi :). Also thanks for the "-print" vote. | 13:18 |
Tengu | I forgot about that one actually :] | 13:18 |
fungi | yeah, i still think the df that one adds won't tell you much new since we already log a df (and df -i) at the start of every job | 13:19 |
Tengu | I can remove it | 13:19 |
opendevreview | Merged openstack/project-config master: Ensure NetworkManager doesn't override /etc/resolv.conf https://review.opendev.org/c/openstack/project-config/+/865433 | 13:19 |
fungi | running a df after the mv might give you more insight | 13:19 |
Tengu | let's do that! | 13:20 |
fungi | since then you can compare against the one from job start | 13:20 |
Tengu | lemme correct/amend. | 13:20 |
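The suggestion above amounts to snapshotting disk usage right after the move so it can be diffed against the job-start numbers. A minimal sketch (not the actual patch content):

```bash
# Disk usage and inode usage after moving data off /, directly
# comparable with the df / df -i the base job logs at job start.
df -h /
df -i /
```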
opendevreview | Cedric Jeanneret proposed openstack/openstack-zuul-jobs master: Add some output to the `find' command https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/865383 | 13:22 |
Tengu | better. | 13:22 |
Tengu | fungi: also updated the commit message to mention the zuul-info | 13:22 |
*** frenzy_friday|rover is now known as frenzy_friday|rover|food | 13:43 | |
Tengu | fungi: what's the ETA to get the first nodepool images built with the NetworkManager config running in the CI? | 13:45 |
fungi | Tengu: images are rebuilt ~daily, and you can see the list of built images at http://nl01.opendev.org/dib-image-list while the list of uploaded images in each provider is at http://nl01.opendev.org/image-list | 13:49 |
Tengu | ah, cool! thanks | 13:49 |
fungi | Tengu: if you want to see the build logs for a particular image, identify the builder it was built on from the dib-image-list and then go to it in a browser, like https://nb01.opendev.org/ | 13:50 |
Tengu | wow. that's neat! | 13:51 |
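Those status endpoints can also be fetched from a shell rather than a browser. A sketch using the URLs fungi mentioned above:

```bash
# Peek at the built-image and per-provider upload lists served by
# the nodepool launcher's status endpoints.
curl -s http://nl01.opendev.org/dib-image-list | head -n 20
curl -s http://nl01.opendev.org/image-list | head -n 20
```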
fungi | i think the zuul info we log from each build may also embed image ids for the nodes, looking now... | 13:51 |
Tengu | I think I've seen it in the zuul-info/ | 13:51 |
Tengu | fungi: the "age" is Day:Hours:Minutes:Seconds I guess? | 13:52 |
Tengu | yep, looks like so | 13:52 |
fungi | correct | 13:53 |
Tengu | seems there are some stalled in "deleting" state :/ | 13:53 |
fungi | and no, i can't seem to find the image id in the logged zuul-info, but if i'm not overlooking it then maybe that's something worth adding | 13:54 |
fungi | Tengu: a fun fact about image deletion. if you use boot from volume for a server instance, you can't delete the image while the server is still running. if a node is held in such a provider, or stuck deleting, then the image it was booted from can't be deleted | 13:55 |
fungi | we go through and try to clean them up manually from time to time | 13:55 |
Tengu | fungi: erf.. | 13:55 |
Tengu | fungi: so we have the "image-hostname" alongside dib-builddate: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_be0/863872/14/check/tripleo-ci-centos-9-standalone/be07f2c/zuul-info/zuul-info.primary.txt | 13:55 |
Tengu | that's the closest I seem to be able to find. | 13:56 |
fungi | right, the dib-builddate could be used to get us close enough to identifying the image used | 13:57 |
fungi | though actually logging the image id would be even better | 13:57 |
Tengu | i.e. generate the image-id before the actual build, inject it, and use that id while uploading? | 13:57 |
fungi | more likely plumb it back through the node request to the zuul scheduler and add it to the inventory | 13:59 |
Tengu | 'k. well - I don't know how things are piped in there ;) | 14:09 |
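Until an image id is plumbed through, the hints already present in the logged zuul-info can be extracted mechanically. A sketch against a downloaded copy of the file linked above:

```bash
# Pull the image identification hints (image-hostname, dib-builddate)
# out of a job's logged zuul-info file.
grep -Ei 'image-hostname|dib-builddate' zuul-info.primary.txt
```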
*** dviroel is now known as dviroel|lunch | 16:11 | |
*** frenzy_friday|rover|food is now known as frenzy_friday|rover | 16:21 | |
clarkb | vishalmanchanda: ok updated zuul-jobs patch pushed. We can recheck your change once that comes back green | 16:29 |
vishalmanchanda | clarkb: sure, thanks. | 16:29 |
Tengu | clarkb: heya! just saw your comment about the env var for ansible-galaxy proxy - there are some ansible variables already available somewhere? | 16:40 |
clarkb | Tengu: not for galaxy as far as I know. But other things like distro packages mirrors and pypi mirror and so on have roles that configure them | 16:41 |
clarkb | Tengu: there is the base mirror fqdn and then the roles tack on the service-specific bits and configure them | 16:41 |
clarkb | let me find an example of that | 16:42 |
Tengu | hmm. care to show me? if it's just a matter of adding a role somewhere and calling it, I'd be more than happy | 16:42 |
Tengu | note that tripleo is also using RDO, so maybe that's why our jobs are relying on that "old" file exposing env vars? | 16:42 |
clarkb | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/defaults/main.yaml#L2-L3 | 16:43 |
Tengu | oh, and then it's used in https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/tasks/mirror.yaml | 16:44 |
Tengu | oook. | 16:44 |
Tengu | and, provided the configure-mirrors role is called from within the job, we'll get the proper config directly...? | 16:44 |
clarkb | for the things that role configures | 16:45 |
clarkb | I don't think galaxy should be configured by that role | 16:45 |
Tengu | i.e. I can ini_file /etc/ansible/ansible.cfg, and add the galaxy.server key and be off with that? | 16:45 |
clarkb | but I wanted to show you an example how you can use the base mirror fqdn to construct a mirror location in an ansible role | 16:45 |
Tengu | 'k | 16:45 |
clarkb | (really I wish pypi wasn't configured by that role and it only did distro mirrors, but that is a historical artifact that is difficult to change now) | 16:46 |
Tengu | zuul_site_mirror_fqdn is something that exists and is available then? | 16:46 |
clarkb | yes, we set it in opendev. That role is expected to be generic enough to run when it isn't set though hence the omit check | 16:46 |
Tengu | ok. I'll consider it then | 16:46 |
Tengu | just need to make something that's compatible with RDO infra as well | 16:46 |
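Putting the pieces of this thread together, the configuration could look roughly like this. A minimal sketch: the [galaxy] server key is standard ansible.cfg, but the environment variable and FQDN are assumptions modeled on configure-mirrors' zuul_site_mirror_fqdn, not an existing interface:

```bash
# Hypothetical: derive the galaxy mirror URL from the base mirror
# FQDN (the same pattern configure-mirrors uses for pypi and distro
# mirrors) and point ansible-galaxy at the in-region proxy.
MIRROR_FQDN="${ZUUL_SITE_MIRROR_FQDN:-mirror.example.opendev.org}"
sudo tee -a /etc/ansible/ansible.cfg >/dev/null <<EOF

[galaxy]
server = https://${MIRROR_FQDN}/galaxy/
EOF
```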
clarkb | vishalmanchanda: the zuul-jobs update is green | 16:52 |
vishalmanchanda | clarkb: ack. | 16:53 |
*** dviroel|lunch is now known as dviroel | 17:12 | |
*** jpena is now known as jpena|off | 17:17 | |
clarkb | Tengu: fungi: I've been looking at the /opt move and supposedly rsync might be quicker? It doesn't delete on the source, though, which we also want | 17:55 |
clarkb | I wonder if the speed ends up equivalent once you add in the delete step after copying | 17:55 |
clarkb | we can test this | 17:57 |
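The trade-off being tested, roughly, as a sketch (illustrative paths, not the actual role):

```bash
# Current style: move top-level entries one at a time; across
# filesystems each mv is itself a copy followed by a delete.
find /opt -mindepth 1 -maxdepth 1 -exec mv {} /mnt/opt/ \;

# rsync alternative: bulk copy, then a separate cleanup pass, since
# --remove-source-files deletes files but leaves directories behind.
rsync -a --remove-source-files /opt/ /mnt/opt/
find /opt -mindepth 1 -depth -type d -empty -delete
```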
fungi | yeah, that change was more just to get some indication of the current performance before experimenting with alternatives | 17:58 |
clarkb | oh is there an existing change? | 17:58 |
opendevreview | Clark Boylan proposed openstack/openstack-zuul-jobs master: Test /opt move using rsync https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/866054 | 18:08 |
clarkb | fungi: Tengu ^ more debugging | 18:08 |
clarkb | is there a change depending on the parent that I can update? | 18:09 |
clarkb | https://review.opendev.org/c/openstack/devstack/+/858996 is a devstack change I already had for similar purposes I've updated it | 18:14 |
frickler | just note that in general performance of our nodes seems to vary by +/- 50%, so comparing performance needs a large sample size | 18:22 |
clarkb | yup | 18:24 |
clarkb | mtreinish had good data on this once upon a time too. And the variance is crazy | 18:24 |
clarkb | even when you only look at nodes in a single provider | 18:24 |
fungi | clarkb: Tengu's change is https://review.opendev.org/865383 Add some output to the `find' command | 18:24 |
frickler | the other question is do we really need to free the space on /? otherwise we could consider moving /opt/git to /srv/git or whatever and just symlink to that? | 18:28 |
clarkb | frickler: jobs hit the 20gb limit on rax all the time | 18:29 |
clarkb | even with cleaning out the 10gb of /opt | 18:29 |
clarkb | the problem is that /var is used by journald and docker and so on | 18:30 |
clarkb | makes it really easy to fill a few gigabytes on / | 18:30 |
frickler | hmm, from the flavor I see we should have 40G as root disk, where do you see 20G? | 18:40 |
clarkb | hrm, I thought it was 20GB; maybe that is what we end up with free and not total size | 18:41 |
clarkb | another thing we can/should look at is trimming the contents of /opt | 18:41 |
clarkb | the bulk of the data there is git repos and maybe we've got some git repos we can prune out | 18:41 |
clarkb | also maybe the cirros images and friends can be reduced (they are very small already) | 18:41 |
frickler | on a random node I see 29G of 37G free after the move to /opt has happened. /opt has 13G used. if 16G on / aren't enough (without rming), then IMO jobs need to be fixed | 18:47 |
frickler | or we need to declare rax unusable for that kind of jobs | 18:47 |
clarkb | "/bin/sh: 5: time: not found" fyi | 18:48 |
clarkb | frickler: its 16GB after rming though right? | 18:48 |
opendevreview | Clark Boylan proposed openstack/openstack-zuul-jobs master: Test /opt move using rsync https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/866054 | 18:49 |
frickler | no, after rming we have 29G free on /. not sure about the original usage, but it can have been at most the 13G now used on /opt | 18:50 |
clarkb | fwiw the /opt move is limited to openstack jobs. It's not something we do globally in base jobs | 18:59 |
clarkb | I guess with a bit of testing for explosions we might be able to remove it for openstack as well. But the potential blast radius is quite large | 19:00 |
mtreinish | clarkb: I think I have a subunit2sql db archived somewhere if people want hard numbers from like 4-5 yrs ago :) | 19:33 |
mtreinish | looking through old presentations on the topic I had this image in a slide: https://blog.kortar.org/wp-content/uploads/2022/11/runtime_variance.png | 19:47 |
mtreinish | but I don't remember the context of exactly what it was graphing (and the details aren't in the slide besides just saying "Runtime variance") | 19:47 |
mtreinish | I assume it's just of a random tempest test across all gate runs based on the y axis | 19:48 |
Tengu | clarkb: ah, i was thinking about rsync as well. though I think find might have been used for potential hidden directories? | 20:09 |
Tengu | we can of course discuss tomorrow if you want, I'm on a private device with no access but irc | 20:10 |
Tengu | clarkb: "time" not found?! errrr.. is it embeded in bash? will check that out tomorrow. | 20:21 |
fungi | Tengu: or we're not installing the package needed to make it available | 20:22 |
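For the record: `time` is a reserved word in bash, but /bin/sh on Debian-family images is dash, which has no such builtin and falls back to an external binary. A sketch of the failure and two fixes (assuming GNU time is the missing package):

```bash
# Fails when /bin/sh is dash and /usr/bin/time is not installed:
sh -c 'time ls'        # -> sh: 1: time: not found

# Works via bash's reserved word:
bash -c 'time ls'

# Or via the external GNU time binary (the 'time' package):
/usr/bin/time -v ls
```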
*** swalladge is now known as Guest401 | 21:24 | |
*** dasm is now known as dasm|off | 21:39 | |
*** dviroel is now known as dviroel|out | 22:00 | |
clarkb | fwiw I think my test change has failed to land on rax but I need to double check that before rechecking | 22:18 |
clarkb | caught one https://zuul.opendev.org/t/openstack/stream/37dffb86cfad4ae3b3717f86ed294efc?logfile=console.log | 22:36 |
clarkb | it's not looking any quicker | 22:39 |
clarkb | (granted sample size of one) | 22:39 |
clarkb | I'm not super surprised by that. The bottleneck is almost certainly disk io | 22:39 |
*** rlandy is now known as rlandy|out | 23:51 |