Thursday, 2023-02-16

clarkb	The promised launch node is running. Once it is done I'll get changes up to system-config and dns	00:10
clarkb	one thing I didn't think to check until just now is quotas.	00:14
clarkb	I think we're ok on that front	00:15
opendevreview	Clark Boylan proposed opendev/system-config master: Add gitea09 to our inventory https://review.opendev.org/c/opendev/system-config/+/874043	00:29
clarkb	infra-root ^ I think I'd like to be around when that lands just to make sure the end results are happy. But also there is something weird that unattended upgrades is complaining about. Seems a few things might be unexpectedly held back. Can someone who groks apt better than me take a look before we commit to this host?	00:30
clarkb	now to get the DNS change up	00:30
clarkb	curious we don't have AAAA records for the old giteas	00:32
fungi	because the lb is dual-stack, the backends didn't need to be	00:32
clarkb	oh also the dns change needs to land first so that LE can happy problem	00:34
clarkb	heh problem. *properly	00:34
opendevreview	Clark Boylan proposed opendev/zone-opendev.org master: Add gitea09 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/874044	00:36
clarkb	I went ahead and added the sshfp records too since we seem to have not cleaned those up and they can't hurt (they don't help much either though)	00:37
clarkb	oh and the automated collection of host keys didn't work. I just hopped on the server and catted the pubkeys	00:38
clarkb	I've crosschecked those values against what ssh-keyscan emits and they all seem to line up so I think I got it correct.	00:39
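The crosscheck described above can also be done without the server: an SSHFP record is a hash of the decoded public key blob. Below is a minimal sketch, assuming the pubkey lines are in the usual OpenSSH `type base64 comment` form. The function name `sshfp_records` and the record layout are illustrative and are not taken from the zone-opendev.org change.

```python
import base64
import hashlib

# SSHFP algorithm numbers as registered by IANA (RFC 4255, 6594, 7479).
SSHFP_ALGORITHMS = {
    "ssh-rsa": 1,
    "ssh-dss": 2,
    "ecdsa-sha2-nistp256": 3,
    "ecdsa-sha2-nistp384": 3,
    "ecdsa-sha2-nistp521": 3,
    "ssh-ed25519": 4,
}


def sshfp_records(hostname, pubkey_line):
    """Return SHA-1 and SHA-256 SSHFP records for one public key line.

    hostname is whatever owner name the zone file should use (hypothetical
    here); pubkey_line is the content of an /etc/ssh/ssh_host_*_key.pub file.
    """
    key_type, b64_blob = pubkey_line.split()[:2]
    blob = base64.b64decode(b64_blob)
    algorithm = SSHFP_ALGORITHMS[key_type]
    return [
        # Fingerprint type 1 is SHA-1, type 2 is SHA-256.
        f"{hostname} IN SSHFP {algorithm} 1 {hashlib.sha1(blob).hexdigest()}",
        f"{hostname} IN SSHFP {algorithm} 2 {hashlib.sha256(blob).hexdigest()}",
    ]
```

Running it over each host key file and comparing the result with the zone entries gives the same check as reading the fingerprints off ssh-keyscan by hand.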
johnsom	Just an observation, I have seen a few jobs today that took a long time on the "Copy old /opt" step.	00:58
johnsom	2023-02-15 23:53:07.219972 | TASK [configure-swap : Copy old /opt]	00:58
johnsom	2023-02-16 00:19:06.969689 | controller | ok: Runtime: 0:25:58.751848	00:59
johnsom	This one is currently running: https://review.opendev.org/726334	00:59
johnsom	Here is a different job: https://2319feb14e831f87c231-4854ed941a7816d1225c43ae9b456d0e.ssl.cf1.rackcdn.com/726334/52/gate/designate-bind9-scoped-tokens/913b892/job-output.txt	01:01
johnsom	Jobs that are not timing out seem to take ~11 minutes	01:02
fungi	that's typically an indicator of i/o contention	01:03
fungi	since it's copying tons of small files from one block device to another	01:03
Clark[m]	That also only happens on tax because we mount the ephemeral drive there	01:11
fungi	rax	01:11
fungi	but yeah	01:11
Clark[m]	On other clouds that cost happens when you write zeros for the swap file	01:12
Clark[m]	Ya rax. Auto correct on the phone getting it wrong	01:12
fungi	but yes, comparing it to non-rax nodes is pointless because they just no-op that step	01:12
johnsom	I have seen it on Designate jobs and tempest repo jobs today.	01:13
Clark[m]	I think it's only for devstack based jobs. The unittest jobs don't bother	01:13
Clark[m]	But yes it's normal and expected	01:14
johnsom	Yeah, it's just a bummer it's causing TIMED_OUT jobs	01:14
Clark[m]	Well it's not the only thing causing jobs to be slow :)	01:14
Clark[m]	Devstack could run in like half the time if we didn't use osc for example	01:15
Clark[m]	There are also a number of steps that run large Ansible loops and those are slow	01:15
johnsom	Yeah, these aren't even getting to tempest before they timeout	01:15
Clark[m]	Jobs swap regularly which may create our own noisy neighbor io contention problem. I think dansmith is looking at tuning MySQL back to hopefully free up some memory though. Anyway we suffer from a thousand cuts in the devstack jobs and that is one of them. One that is more difficult to undo due to the need for disk space	01:16
Clark[m]	I suppose someone could sort out running jobs with the ephemeral drive mounted on another location.	01:16
Clark[m]	But that's likely to be more painful than running devstack with virtualenvs	01:17
Clark[m]	Fwiw if I end up with some time I try to pick a long running job and identify ways to speed it up. Haven't had time recently but last year a number of speedups went into writing better ansible. Unfortunately, that sort of Ansible is not the type of Ansible ansible-lint suggests you write so there is slow Ansible all over	01:22
johnsom	Well, if you are serious that there might be something we can do to eliminate that "copy old /opt" step, I would be interested to learn more and maybe poke at it. Even at 11 minutes, it is one of the longer steps I have seen.	01:25
fungi	ansible-lint favors maintainability over performance, this just isn't the sort of stuff it was designed for	01:25
fungi	johnsom: definitely worth brainstorming. basically the git repository cache is baked into /opt in our node images. on rackspace we don't get a large enough rootfs to run devstack and tempest tests but they have a large ephemeral disk, so we format it, copy the git cache into it, and run devstack there	01:27
fungi	i guess an alternative would be to tell devstack to clone repos from another drive, which would make devstack slower but would skip the copy step and maybe avoid copying things devstack doesn't need	01:28
fungi	the challenge is handing off the git cache to devstack efficiently, while providing enough disk space for the tests to run	01:29
johnsom	Hmm, that helps me understand better what is going on.	01:30
fungi	as Clark[m] points out, the up side to rackspace giving us an extra block device is that we can mkswap a partition for swap space, whereas on other providers we have to preallocate a swap file by writing out zeroes	01:31
* johnsom wonders how many jobs need swap space		01:32
fungi	an unfortunate number	01:32
Clark[m]	Most tempest jobs I imagine. Maybe all	01:32
fungi	having a little bit of swap does help provide more main memory for things like filesystem cache, which can improve job efficiency	01:32
fungi	but the 1gb of swap we were recommending for that was quickly upped to something like 8gb because "my jobs need more memory"	01:33
Clark[m]	It's back to 1gb by default iirc because 8gbs of zeros is slow	01:34
fungi	more swap was seen as a way to "fix" oom-related job failures (and then increased job timeouts, because the swap thrash made them run 2x as long)	01:35
johnsom	SwapTotal: 4193276 kB SwapFree: 4192508 kB from the memory-tracker.txt on one of my tempest based jobs.	01:38
johnsom	Ok, thanks for the details. I need to sign off, but may noodle more on this tomorrow.	01:38
Clark[m]	I think you need to look at dstat because memory tracker is a single point in time?	01:39
fungi	that's presumably a snapshot of swap utilization at a specific point in time	01:39
fungi	a high water mark would be more useful	01:39
johnsom	It's snapshots over time	01:39
johnsom	https://2319feb14e831f87c231-4854ed941a7816d1225c43ae9b456d0e.ssl.cf1.rackcdn.com/726334/52/gate/designate-bind9-scoped-tokens/913b892/controller/logs/screen-memory_tracker.txt	01:39
Clark[m]	Ah ok that's not the file I thought it was.	01:41
Clark[m]	I must be thinking world dump or something	01:41
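Since the tracker log holds snapshots over time, the high water mark fungi asks for can be derived from it. Below is a minimal sketch, assuming each snapshot contains a `SwapTotal: N kB` entry followed by a `SwapFree: N kB` entry as in the line johnsom pasted. The exact layout of screen-memory_tracker.txt is an assumption, and `peak_swap_used_kb` is a hypothetical helper name.

```python
import re

# Match one SwapTotal/SwapFree pair; the two fields may share a line or sit
# on consecutive lines, so allow arbitrary text (including newlines) between.
SWAP_PAIR = re.compile(
    r"SwapTotal:\s+(\d+)\s+kB.*?SwapFree:\s+(\d+)\s+kB", re.DOTALL
)


def peak_swap_used_kb(log_text):
    """Return the largest (SwapTotal - SwapFree) seen across all snapshots."""
    used = [int(total) - int(free) for total, free in SWAP_PAIR.findall(log_text)]
    return max(used, default=0)
```

Applied to the single snapshot quoted above it would report 768 kB in use; only the full log can show whether a job ever went much higher than that.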
*** dmitriis is now known as Guest4995		03:31
opendevreview	Ian Wienand proposed opendev/system-config master: make-tarball: role to archive directories https://review.opendev.org/c/opendev/system-config/+/865784	05:57
opendevreview	Ian Wienand proposed opendev/system-config master: tools/make-backup-key.sh https://review.opendev.org/c/opendev/system-config/+/866430	05:57
mnasiadka	good morning	06:54
mnasiadka	I have a question around mails to openstack-stable-maint - Kolla uses periodic and periodic-weekly pipelines for publish jobs - and their failures are not getting mailed - should we move them to periodic-stable?	06:58
fungi	mnasiadka: if they're not stable branch jobs we probably want to come up with a different solution. also you can check the builds tab on the zuul dashboard, it supports url parameters so easy to bookmark specific queries	07:11
mnasiadka	yeah, just did bookmark it	07:11
mnasiadka	that works for now ;)	07:12
*** jpena|off is now known as jpena		08:28
mnasiadka	fungi: is there a way to recover an encrypted secret? our docker hub account (that nobody apart from zuul knows the password for) stopped accepting Kolla image pushes 2 months ago (permission denied)	09:35
kopecmartin	corvus: hi, i wanted to check with you whether you're ok with a cleanup of additional maintainers in PYPI - https://etherpad.opendev.org/p/openstack-pypi-maintainers-cleanup , I see you're a maintainer of bashate package currently, thanks	09:49
frickler	mnasiadka: technically this would be possible, but our policy explicitly prohibits it, see https://docs.opendev.org/opendev/infra-manual/latest/testing.html#handling-zuul-secrets	10:08
frickler	maybe there is a way to do password recovery for the docker hub site?	10:08
mnasiadka	email for password recovery surely isn't mine (probably some PTL years back)	10:12
mnasiadka	I've sent mail to Docker Support	10:12
mnasiadka	If they won't help me - I'll come back...	10:12
fungi	to be clear, the primary reasons we have that policy are: 1. to encourage users to coordinate succession and secrets management on their own rather than relying on zuul admins as their fallback plan, and 2. because it's very hard to be absolutely certain we're disclosing that secret to someone who actually should have access to it	13:04
fungi	we don't ourselves rely on our ability to decrypt our own zuul secrets either, we have a shared secure locker we keep information like that in	13:05
*** dasm|off is now known as dasm		14:00
corvus	kopecmartin: i can remove myself from that package if that is desired. i don't see that on the etherpad. who's in charge of that now?	14:38
kopecmartin	corvus: yeah, that'd be great if you can do that, i understand it's easier if you do that ... maybe gmann knows more	14:45
fungi	if the question is who's responsible for bashate these days, it's the qa team: https://governance.openstack.org/tc/reference/projects/quality-assurance.html	14:52
corvus	well that would seem to be kopecmartin then! this seems pretty legit. :)	14:54
kopecmartin	corvus: in a sense, yes :) .. openstackci should stay as the only maintainer in the pypi if the question was whom to put there	14:55
corvus	kopecmartin: great. i sent you an email as an out-of-band confirmation of our chat in irc. if you could reply to that, i'll drop myself. thanks :)	14:56
kopecmartin	corvus: absolutely, thanks	14:56
mnasiadka	fungi: let's see, seems Docker Hub is cooperating, once I recover the account - I'll set up an organisation and a team	15:04
fungi	that sounds more effective long-term	15:05
mnasiadka	that's what we have for quay.io, but we never knew the password for docker hub ;-)	15:35
clarkb	slow start for me today. The dentist gets a visit from me today	16:12
fungi	leave nothing behind. that's my motto for dentist visits	16:14
clarkb	it's actually a last minute appointment but not due to any known issues. My hygienist quit recently so my old appointment got cancelled and they called yesterday about a slot that opened up today	16:15
frickler	https://docs.opendev.org/opendev/infra-manual/latest/drivers.html#using-secrets links to https://zuul-ci.org/docs/zuul/user/encryption.html which is a 404, if someone has time to find the new URL, that'd be nice	16:38
*** dasm is now known as Guest5046		16:46
fungi	oh, right, i think i had a line about that rotting in my to do list, now buried so deep i'll never actually see it	17:16
fungi	seems that moved from user to lib	17:17
fungi	er, no	17:18
fungi	i bet it got folded into another page, i'll try to find a substring from it in the git history and track it down that way	17:18
fungi	it moved from user to discussions to discussion, and then got folded into project-config.rst	17:22
fungi	frickler: https://zuul-ci.org/docs/zuul/latest/project-config.html#encryption	17:22
fungi	that's where it eventually wound up	17:23
*** jpena is now known as jpena|off		17:40
frickler	wow, I haven't seen such a busy gate in quite a while	18:02
fungi	per discussion in #openstack-nova there's been a scramble to fix a cadre of different problems causing random test failures (seems to come down to bugs in cirros and ovn), on top of feature freeze tomorrow	18:04
clarkb	fungi: any chance you have time to look at that unattended upgrades issue on gitea09?	18:07
clarkb	I was hoping to understand that better before landing changes to begin a deployment on it	18:08
frickler	plus requirements gate has been broken for a while and they are merging a big bunch of queued up u-c updates now	18:08
clarkb	also there was the pkg_resources issue that hit some xstatic repos	18:09
*** dasm is now known as Guest5052		18:10
fungi	clarkb: i missed that issue, will take a look now	18:10
fungi	2023-02-16 06:01:07,225 INFO Package sosreport is kept back because a related package is kept back or due to local apt_preferences(5).	18:12
fungi	clarkb: that ^ error?	18:12
clarkb	fungi: ya there are a few other packages being held too a couple libs iirc	18:12
clarkb	but as far as I know we don't have any local settings holding anything back?	18:13
clarkb	also I'm catching up on the mailman site creation thread now. I would say the docs are incredibly confusing if they don't intend for you to use a db migration	18:13
clarkb	nowhere does it say "use the django admin shell and run these commands"	18:14
fungi	we have 20auto-upgrades and 50unattended-upgrades installed. is it normal to have both of those?	18:14
clarkb	fungi: I think one comes from the package install of unattended-upgrades and the other we write out	18:14
clarkb	I seem to recall at one time we had some changes to work on converging that?	18:15
clarkb	maybe that didn't reach a conclusion	18:15
fungi	looks like one is what causes upgrades to run and the other is configuration for it	18:15
fungi	`apt show sosreport -a` gives information on the versions of that package known to the system	18:16
clarkb	fwiw the image we booted on is one I grabbed from ubuntu and uploaded to vexxhost for the gitea loadbalancer too. So we may need to check that host? Our launch node system does a full unattended upgrade and then reboots so only these handful of packages are the ones that are sad	18:18
fungi	sosreport 4.3-1ubuntu2.1 is currently installed from jammy-security/main but there's a newer 4.4-1ubuntu1.22.04.1 in jammy-updates/main	18:19
clarkb	but why would that not be installable? Do we not allow updates from updates in unattended-upgrades?	18:21
fungi	trying to upgrade from the command line, it looks like the newer version of sosreport now requires python3-magic so unattended-upgrades is probably refusing to install new packages by default	18:21
clarkb	aha that's the bit I think I was missing	18:22
fungi	yeah, compare the Depends: lines	18:22
fungi	`apt show sosreport -a | grep -e '^Version:' -e '^Depends:'`	18:22
clarkb	I guess knowing that, proceeding seems fine?	18:23
clarkb	both with updating the package by hand and landing the dns change then the system-config change?	18:23
fungi	it added two new dependencies, one of which was already installed but the other is not yet	18:23
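The Depends comparison fungi describes can be automated. Below is a minimal sketch, assuming the text comes from `apt show <package> -a`, with one blank-line-separated stanza per known version. The helper names are hypothetical; alternatives (`a | b`) are reduced to their first option and version constraints are ignored, which is a simplification.

```python
def parse_stanzas(text):
    """Split `apt show -a` style output into one field dict per version."""
    stanzas = []
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            # Continuation lines (e.g. long descriptions) start with a space.
            if ":" in line and not line.startswith(" "):
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "Version" in fields:
            stanzas.append(fields)
    return stanzas


def dependency_names(depends):
    """Reduce a Depends: value to bare package names."""
    names = set()
    for clause in depends.split(","):
        first_alternative = clause.split("|")[0].strip()
        if first_alternative:
            # Drop "(>= x)" constraints and ":any" style arch qualifiers.
            names.add(first_alternative.split()[0].split(":")[0])
    return names


def added_dependencies(text, old_version, new_version):
    """Return packages the new version depends on that the old one did not."""
    by_version = {s["Version"]: s for s in parse_stanzas(text)}
    old = dependency_names(by_version[old_version].get("Depends", ""))
    new = dependency_names(by_version[new_version].get("Depends", ""))
    return sorted(new - old)
```

Any name it returns that is not already installed is a likely reason for unattended-upgrades to keep the package back, which matches what fungi found for python3-magic.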
clarkb	I just want to make sure it wasn't in some super bad state where starting over would be better	18:23
fungi	normally package updates in debian stable releases would not be allowed to add dependencies like that, i agree it's surprising	18:23
clarkb	and i guess a manual apt-get dist-upgrade -y or whatever would unstick it?	18:24
clarkb	and apt-get upgrade would not?	18:27
clarkb	anyway if you think that is safe and the preferred next step I can do that	18:27
clarkb	and then https://review.opendev.org/c/opendev/zone-opendev.org/+/874044 needs to land before https://review.opendev.org/c/opendev/system-config/+/874043 if you have time to review those. I can do approves in the correct order	18:30
fungi	i simply did `sudo apt upgrade` and it prompted to download the additional package. i didn't follow through with it, just wanted to see what it would propose	18:36
clarkb	I think getting it up to date is a good idea. Should I go ahead and run that then?	18:37
clarkb	(I don't want to start the brand new server with some debt)	18:37
fungi	yeah, it's fine to do that	18:39
clarkb	its going to upgrade: "grub-efi-amd64-bin grub-efi-amd64-signed python-apt-common python3-apt python3-distupgrade shim-signed sosreport ubuntu-release-upgrader-core" and also install python3-magic running `sudo apt upgrade`	18:40
fungi	the reason we're not seeing this on the other servers is probably that they don't have sosreport preinstalled (it's an ubuntu phone-home thing)	18:40
clarkb	ya maybe we want to add it to the remove list	18:40
clarkb	given the grub packages are also going to upgrade I'll do a reboot after this to double check it is happy	18:41
clarkb	heh it didn't auto restart unattended-upgrades.service. Which is just a cronjob I thought. But also I'm rebooting now so it will get restarted then	18:42
clarkb	grub-install: warning: EFI variables cannot be set on this system.	18:42
clarkb	I think because we aren't efi booting?	18:42
clarkb	anyway reboot now	18:43
clarkb	and that went fine. So ya I guess review the two changes and if all looks well and you think we're good to proceed generally I can approve the two changes in the correct order	18:44
fungi	yep, taking a look now. thanks for the added diligence there!	18:47
opendevreview	Clark Boylan proposed opendev/system-config master: Remove sosreport from our servers https://review.opendev.org/c/opendev/system-config/+/874125	18:49
clarkb	There's the package removal. We already remove whoopsie which I'm guessing sosreport has replaced?	18:50
opendevreview	Merged opendev/zone-opendev.org master: Add gitea09 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/874044	18:53
fungi	looks like they're slightly different but related tools	18:54
fungi	whoopsie: Ubuntu error tracker submission	18:54
fungi	sosreport: Set of tools to gather troubleshooting data from a system	18:54
clarkb	for the system-config change: my expectation is that ansible will deploy the containers and then because I modified host_vars/gitea09.opendev.org.yaml it will also run manage projects. Manage projects should create empty projects on gitea09	19:02
clarkb	then we can figure out the db surgery	19:02
clarkb	I believe that should all be safe as the server is not added to gerrit replication or the load balancer pool	19:03
clarkb	worst case we edit the load balancer to remove the server but it shouldn't have any way of getting in there so I think we're good	19:03
clarkb	I'll sneak out for early lunch today as that should eat up the gating and deploy time	19:06
opendevreview	Merged opendev/system-config master: Add gitea09 to our inventory https://review.opendev.org/c/opendev/system-config/+/874043	19:51
clarkb	zuul is working through the long list of jobs for ^	20:24
clarkb	anyone know if we manage the known hosts for gitea servers on gerrit? I'm trying to sort that out and prep the followup changes for gitea09 while I wait for ansible to do its thing. But I'm not finding it if it exists	21:29
clarkb	never mind, it's in private vars I think because we don't want the test env to have any trust to prod	21:30
fungi	that sounds right	21:37
clarkb	the gitea job is running now. Assuming that goes ok I guess I need to copy the db from 01 to 09 and restore on 09. I'll dump 09's db prior to that so that we can compare if we find that is necessary later	21:38
clarkb	I forgot we process the giteas serially to ensure service uptime. I was wondering why nothing had happened on 09 yet	21:41
clarkb	I've also just realized I should've updated the gitea system-config-run job to deploy a jammy node. Oh well if this fails we'll figure it out	21:43
clarkb	oh I should actually wait until the manage-projects job runs later just to avoid manipulating the db when something else could be modifying settings	21:45
clarkb	the steps are assuming gitea deployment goes well (it looks happy to me right now), then wait for manage projects, then transplant databases	21:46
clarkb	and that won't be done for an hour or two?	21:47
fungi	makes sense	21:49
clarkb	also just to remind people this isn't a bfv node. I was planning to use regular disk for these new nodes. I think mnaser warned that data you care about should be in volumes but since these are mirrors and we have several of them I think this is ok?	21:50
clarkb	it simplifies management of things	21:50
clarkb	I guess we can change that later if we decide that is a good idea too	21:50
fungi	yeah, i think it's fine	22:01
clarkb	ok manage projects is done. Let's see about that db surgery	23:45
clarkb	I've got a leaf blower singing its song outside :(	23:46
*** dasm_ is now known as dasm|off		23:55
clarkb	ok after transplanting the db I see the redirects table has content and the users table has old orgs like osf in it. I'm going to try starting gitea now	23:58