Thursday, 2023-02-16

clarkbThe promised launch node is running. Once it is done I'll get changes up to system-config and dns00:10
clarkbone thing I didn't think to check until just now is quotas.00:14
clarkbI think we're ok on that front00:15
opendevreviewClark Boylan proposed opendev/system-config master: Add gitea09 to our inventory
clarkbinfra-root ^ I think I'd like to be around when that lands just to make sure the end results are happy. But also there is something weird that unattended upgrades is complaining about. Seems a few things might be unexpectedly held back. Can someone who groks apt better than me take a look before we commit to this host?00:30
clarkbnow to get the DNS change up00:30
clarkbcurious we don't have AAAA records for the old giteas00:32
fungibecause the lb is dual-stack, the backends didn't need to be00:32
clarkboh also the dns change needs to land first so that LE can happy problem00:34
clarkbheh problem. *properly00:34
opendevreviewClark Boylan proposed opendev/ master: Add gitea09 to DNS
clarkbI went ahead and added the sshfp records too since we seem to have not cleaned those up and they can't hurt (they don't help much either though)00:37
clarkboh and the automated collection of host keys didn't work. I just hopped on the server and catted the pubkeys00:38
clarkbI've crosschecked those values against what ssh-keyscan emits and they all seem to line up so I think I got it correct.00:39
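For reference, the cross-check described above can be reproduced offline. What follows is a minimal Python sketch, assuming input lines in the `host keytype base64-key` form that ssh-keyscan prints. The function names `sshfp_records` and `format_records`, the host name, and the sample key blob are made up for illustration; this is not the tooling that was run on the server.

```python
import base64
import hashlib

# SSHFP algorithm numbers as assigned in RFC 4255, RFC 6594 and RFC 7479.
ALGORITHMS = {
    "ssh-rsa": 1,
    "ssh-dss": 2,
    "ecdsa-sha2-nistp256": 3,
    "ecdsa-sha2-nistp384": 3,
    "ecdsa-sha2-nistp521": 3,
    "ssh-ed25519": 4,
}


def sshfp_records(keyscan_line):
    """Return (algorithm, fingerprint_type, hex_digest) tuples for one key line."""
    _host, key_type, b64_blob = keyscan_line.split()[:3]
    blob = base64.b64decode(b64_blob)
    algorithm = ALGORITHMS[key_type]
    return [
        (algorithm, 1, hashlib.sha1(blob).hexdigest()),    # type 1 = SHA-1
        (algorithm, 2, hashlib.sha256(blob).hexdigest()),  # type 2 = SHA-256
    ]


def format_records(host, records):
    """Render the tuples as zone-file style SSHFP lines."""
    return [f"{host} IN SSHFP {a} {t} {d}" for a, t, d in records]


# Hypothetical input: the blob below is NOT a real host key.
FAKE_BLOB = base64.b64encode(b"not-a-real-host-key").decode()
SAMPLE_LINE = f"gitea09.example.org ssh-ed25519 {FAKE_BLOB}"

if __name__ == "__main__":
    for line in format_records("gitea09", sshfp_records(SAMPLE_LINE)):
        print(line)
```

Comparing the digests produced this way for the keys read off the server and for the keys returned by ssh-keyscan gives the same kind of confirmation described in the chat.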
johnsomJust an observation, I have seen a few jobs today that took a long time on the "Copy old /opt" step.00:58
johnsom2023-02-15 23:53:07.219972 | TASK [configure-swap : Copy old /opt]00:58
johnsom2023-02-16 00:19:06.969689 | controller | ok: Runtime: 0:25:58.75184800:59
johnsomThis one is currently running:
johnsomHere is a different job:
johnsomJobs that are not timing out seem to take ~11 minutes01:02
fungithat's typically an indicator of i/o contention01:03
fungisince it's copying tons of small files from one block device to another01:03
Clark[m]That also only happens on tax because we mount the ephemeral drive there01:11
fungibut yeah01:11
Clark[m]On other clouds that cost happens when you write zeros for the swap file01:12
Clark[m]Ya rax. Auto correct on the phone getting it wrong01:12
fungibut yes, comparing it to non-rax nodes is pointless because they just no-op that step01:12
johnsomI have seen it on Designate jobs and tempest repo jobs today.01:13
Clark[m]I think it's only for devstack based jobs. The unittest jobs don't bother01:13
Clark[m]But yes it's normal and expected01:14
johnsomYeah, it's just a bummer it's causing TIMED_OUT jobs01:14
Clark[m]Well it's not the only thing causing jobs to be slow :)01:14
Clark[m]Devstack could run in like half the time if we didn't use osc for example01:15
Clark[m]There are also a number of steps that run large Ansible loops and those are slow01:15
johnsomYeah, these aren't even getting to tempest before they timeout01:15
Clark[m]Jobs swap regularly which may create our own noisy neighbor io contention problem. I think dansmith is looking at tuning MySQL back to hopefully free up some memory though. Anyway we suffer from a thousand cuts in the devstack jobs and that is one of them. One that is more difficult to undo due to the need for disk space 01:16
Clark[m]I suppose someone could sort out running jobs with the ephemeral drive mounted on another location.01:16
Clark[m]But that's likely to be more painful than running devstack with virtualenvs01:17
Clark[m]Fwiw if I end up with some time I try to pick a long running job and identify ways to speed it up. Haven't had time recently but last year a number of speedups went into writing better ansible. Unfortunately, that sort of Ansible is not the type of Ansible ansible-lint suggests you write so there is slow Ansible all over01:22
johnsomWell, if you are serious that there might be something we can do to eliminate that "copy old /opt" step, I would be interested to learn more and maybe poke at it. Even at 11 minutes, it is one of the longer steps I have seen.01:25
fungiansible-lint favors maintainability over performance, this just isn't the sort of stuff it was designed for01:25
fungijohnsom: definitely worth brainstorming. basically the git repository cache is baked into /opt in our node images. on rackspace we don't get a large enough rootfs to run devstack and tempest tests but they have a large ephemeral disk, so we format it, copy the git cache into it, and run devstack there01:27
fungii guess an alternative would be to tell devstack to clone repos from another drive, which would make devstack slower but would skip the copy step and maybe avoid copying things devstack doesn't need01:28
fungithe challenge is handing off the git cache to devstack efficiently, while providing enough disk space for the tests to run01:29
johnsomHmm, that helps me understand better what is going on.01:30
fungias Clark[m] points out, the up side to rackspace giving us an extra block device is that we can mkswap a partition for swap space, whereas on other providers we have to preallocate a swap file by writing out zeroes01:31
* johnsom wonders how many jobs need swap space01:32
fungian unfortunate number01:32
Clark[m]Most tempest jobs I imagine. Maybe all01:32
fungihaving a little bit of swap does help provide more main memory for things like filesystem cache, which can improve job efficiency01:32
fungibut the 1gb of swap we were recommending for that was quickly upped to something like 8gb because "my jobs need more memory"01:33
Clark[m]It's back to 1gb by default iirc because 8gbs of zeros is slow01:34
fungimore swap was seen as a way to "fix" oom-related job failures (and then increased job timeouts, because the swap thrash made them run 2x as long)01:35
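To illustrate why the size of a preallocated swap file matters, here is a small Python sketch of the "write out zeroes" step. It is an assumption-laden illustration, not the role the jobs actually use: the function name `preallocate_zero_file` and the block size are invented, and turning the file into swap (mkswap and swapon, as root) is deliberately left out.

```python
import os
import tempfile


def preallocate_zero_file(path, size_bytes, block_size=1024 * 1024):
    """Fill `path` with `size_bytes` of zeros, one block at a time.

    Every byte is really written, so the cost grows linearly with the size.
    """
    zeros = bytes(block_size)
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = min(block_size, remaining)
            f.write(zeros[:chunk])
            remaining -= chunk
        f.flush()
        os.fsync(f.fileno())  # make sure the data reached the disk
    return os.path.getsize(path)


if __name__ == "__main__":
    # Small demo size; a real swap file would be far larger.
    with tempfile.TemporaryDirectory() as tmp:
        print(preallocate_zero_file(os.path.join(tmp, "swapfile"), 8 * 1024 * 1024))
```

Because each block has to hit the disk, an 8gb file costs roughly eight times the I/O of a 1gb one, which matches the reasoning given above for going back to the smaller default.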
johnsomSwapTotal:       4193276 kB   SwapFree:        4192508 kB from the memory-tracker.txt on one of my tempest based jobs.01:38
johnsomOk, thanks for the details. I need to sign off, but may noodle more on this tomorrow.01:38
Clark[m]I think you need to look at dstat because memory tracker is a single point in time?01:39
fungithat's presumably a snapshot of swap utilization at a specific point in time01:39
fungia high water mark would be more useful01:39
johnsomIt's snapshots over time01:39
Clark[m]Ah ok that's not the file I thought it was.01:41
Clark[m]I must be thinking world dump or something 01:41
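Given a file of snapshots over time, the high water mark mentioned above can be computed directly. The sketch below is in Python and assumes each snapshot carries `SwapTotal:` and `SwapFree:` values in /proc/meminfo style, as in the line quoted from memory-tracker.txt; the exact layout of that file is not shown in this log, so the regular expression and the function name `swap_high_water_kb` are assumptions. Only the first sample row comes from the chat; the other two are invented.

```python
import re

# Matches a SwapTotal value followed by a SwapFree value, whether they
# share a line or sit on adjacent lines.
PAIR = re.compile(r"SwapTotal:\s+(\d+)\s+kB\s+SwapFree:\s+(\d+)\s+kB")


def swap_high_water_kb(text):
    """Return the largest (SwapTotal - SwapFree) seen across all snapshots."""
    used = [int(total) - int(free) for total, free in PAIR.findall(text)]
    return max(used, default=0)


# First row is from the chat; the second and third are made-up snapshots.
SAMPLE = """
SwapTotal:       4193276 kB   SwapFree:        4192508 kB
SwapTotal:       4193276 kB   SwapFree:        4000000 kB
SwapTotal:       4193276 kB   SwapFree:        4100000 kB
"""

if __name__ == "__main__":
    print(swap_high_water_kb(SAMPLE), "kB")
```

The snapshot quoted in the chat works out to 768 kB of swap in use, so on its own it says little; the peak across the whole run is the number that shows whether a job was actually swapping.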
opendevreviewIan Wienand proposed opendev/system-config master: make-tarball: role to archive directories
opendevreviewIan Wienand proposed opendev/system-config master: tools/
mnasiadkagood morning06:54
mnasiadkaI have a question around mails to openstack-stable-maint - Kolla uses periodic and periodic-weekly pipelines for publish jobs - and their failures are not getting mailed - should we move them to periodic-stable?06:58
fungimnasiadka: if they're not stable branch jobs we probably want to come up with a different solution. also you can check the builds tab on the zuul dashboard, it supports url parameters so easy to bookmark specific queries07:11
mnasiadkayeah, just did bookmark it07:11
mnasiadkathat works for now ;)07:12
mnasiadkafungi: is there a way to recover an encrypted secret? our docker hub account (that nobody apart from zuul knows the password for) has stopped accepting Kolla image pushes 2 months ago (permission denied)09:35
kopecmartincorvus: hi, i wanted to check with you whether you're ok with a cleanup of additional maintainers in PYPI - , I see you're a maintainer of bashate package currently, thanks09:49
fricklermnasiadka: technically this would be possible, but our policy explicitly prohibits it, see
fricklermaybe there is a way to do password recovery for the docker hub site?10:08
mnasiadkaemail for password recovery surely isn't mine (probably some PTL years back)10:12
mnasiadkaI've sent mail to Docker Support10:12
mnasiadkaIf they won't help me - I'll come back...10:12
fungito be clear, the primary reasons we have that policy are: 1. to encourage users to coordinate succession and secrets management on their own rather than relying on zuul admins as their fallback plan, and 2. because it's very hard to be absolutely certain we're disclosing that secret to someone who actually should have access to it13:04
fungiwe don't ourselves rely on our ability to decrypt our own zuul secrets either, we have a shared secure locker we keep information like that in13:05
*** dasm|off is now known as dasm14:00
corvuskopecmartin: i can remove myself from that package if that is desired.  i don't see that on the etherpad.  who's in charge of that now?14:38
kopecmartincorvus: yeah, that'd be great if you can do that, i understand it's easier if you do that ... maybe gmann knows more14:45
fungiif the question is who's responsible for bashate these days, it's the qa team:
corvuswell that would seem to be kopecmartin then!  this seems pretty legit.  :)14:54
kopecmartincorvus: in a sense, yes :) .. openstackci should stay as the only maintainer in the pypi if the question was whom to put there 14:55
corvuskopecmartin: great.  i sent you an email as an out-of-band confirmation of our chat in irc.  if you could reply to that, i'll drop myself.  thanks :)14:56
kopecmartincorvus: absolutely, thanks 14:56
mnasiadkafungi: let's see, seems Docker Hub is cooperating, once I recover the account - I'll set up an organisation and a team15:04
fungithat sounds more effective long-term15:05
mnasiadkathat's what we have for, but we never knew the password for docker hub ;-)15:35
clarkbslow start for me today. The dentist gets a visit from me today16:12
fungileave nothing behind. that's my motto for dentist visits16:14
clarkbit's actually a last minute appointment but not due to any known issues. My hygienist quit recently so my old appointment got cancelled and they called yesterday about a slot that opened up today16:15
frickler links to which is a 404, if someone has time to find the new URL, that'd be nice16:38
fungioh, right, i think i had a line about that rotting in my to do list, now buried so deep i'll never actually see it17:16
fungiseems that moved from user to lib17:17
fungier, no17:18
fungii bet it got folded into another page, i'll try to find a substring from it in the git history and track it down that way17:18
fungiit moved from user to discussions to discussion, and then got folded into project-config.rst17:22
fungithat's where it eventually wound up17:23
fricklerwow, I haven't seen such a busy gate in quite a while18:02
fungiper discussion in #openstack-nova there's been a scramble to fix a cadre of different problems causing random test failures (seems to come down to bugs in cirros and ovn), on top of feature freeze tomorrow18:04
clarkbfungi: any chance you have time to look at that unattended upgrades issue on gitea09?18:07
clarkbI was hoping to understand that better before landing changes to begin a deployment on it18:08
fricklerplus requirements gate has been broken for a while and they are merging a big bunch of queued up u-c updates now18:08
clarkbalso there was the pkg_resources issue that hit some xstatic repos18:09
fungiclarkb: i missed that issue, will take a look now18:10
fungi2023-02-16 06:01:07,225 INFO Package sosreport is kept back because a related package is kept back or due to local apt_preferences(5).18:12
fungiclarkb: that ^ error?18:12
clarkbfungi: ya there are a few other packages being held too a couple libs iirc18:12
clarkbbut as far as I know we don't have any local settings holding anything back?18:13
clarkbalso I'm catching up on the mailman site creation thread now. I would say the docs are incredibly confusing if they don't intend for you to use a db migration18:13
clarkbnowhere does it say "use the django admin shell and run these commands"18:14
fungiwe have 20auto-upgrades and 50unattended-upgrades installed. is it normal to have both of those?18:14
clarkbfungi: I think one comes from the package install of unattended-upgrades and the other we write out18:14
clarkbI seem to recall at one time we had some changes to work on converging that?18:15
clarkbmaybe that didn't reach a conclusion18:15
fungilooks like one is what causes upgrades to run and the other is configuration for it18:15
fungi`apt show sosreport -a` gives information on the versions of that package known to the system18:16
clarkbfwiw the image we booted on is one I grabbed from ubuntu and uploaded to vexxhost for the gitea loadbalancer too. So may need to check that host? Our launch node system does a full unattended upgrade and then reboots so only these handful of packages are the ones that are sad18:18
fungisosreport 4.3-1ubuntu2.1 is currently installed from jammy-security/main but there's a newer 4.4-1ubuntu1.22.04.1 in jammy-updates/main18:19
clarkbbut why would that not be installable? Do we not allow updates from updates in unattended-upgrades?18:21
fungitrying to upgrade from the command, it looks like the newer version of sosreport now requires python3-magic so unattended-upgrades is probably refusing to install new packages by default18:21
clarkbaha that's the bit I think I was missing18:22
fungiyeah, compare the Depends: lines18:22
fungi`apt show sosreport -a|grep -e '^Version:' -e '^Depends:'`18:22
clarkbI guess knowing that proceeding seems fine?18:23
clarkbboth with updating the package by hand and landing the dns change then the system-config change?18:23
fungiit added two new dependencies, one of which was already installed but the other is not yet18:23
clarkbI just want to make sure it wasn't in some super bad state where starting over would be better18:23
funginormally package updates in debian stable releases would not be allowed to add dependencies like that, i agree it's surprising18:23
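The Depends: comparison suggested above can also be scripted. This is a rough Python sketch that takes text shaped like the output of `apt show sosreport -a` and reports which dependencies the newer version adds. The two version strings and python3-magic come from the chat; the rest of the sample stanzas (including the placeholder name `already-installed-lib`) are invented for illustration and are not the real Depends: lines. The function names are likewise hypothetical.

```python
import re


def parse_versions(apt_show_output):
    """Map each Version: value to the set of package names on its Depends: line."""
    result = {}
    version = None
    for line in apt_show_output.splitlines():
        if line.startswith("Version:"):
            version = line.split(":", 1)[1].strip()
            result[version] = set()
        elif line.startswith("Depends:") and version is not None:
            for clause in line.split(":", 1)[1].split(","):
                for alternative in clause.split("|"):
                    # Drop version constraints such as "(>= 1.0)".
                    name = re.sub(r"\(.*?\)", "", alternative).strip()
                    if name:
                        result[version].add(name)
    return result


def added_dependencies(apt_show_output, old, new):
    """Return the sorted package names that `new` depends on but `old` does not."""
    deps = parse_versions(apt_show_output)
    return sorted(deps[new] - deps[old])


# Invented stanzas: only the version strings and python3-magic are from the chat.
SAMPLE = """\
Package: sosreport
Version: 4.4-1ubuntu1.22.04.1
Depends: python3, python3-magic, already-installed-lib (>= 1.0)

Package: sosreport
Version: 4.3-1ubuntu2.1
Depends: python3
"""

if __name__ == "__main__":
    print(added_dependencies(SAMPLE, "4.3-1ubuntu2.1", "4.4-1ubuntu1.22.04.1"))
```

Any name in the result that is not already installed is a plausible reason for the newer version being kept back, which is the explanation proposed in the chat.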
clarkband i guess a manual apt-get dist-upgrade -y or whatever would unstick it?18:24
clarkband apt-get upgrade would not?18:27
clarkbanyway if you think that is safe and the preferred next step I can do that18:27
clarkband then needs to land before if you have time to review those. I can do approves in the correct order18:30
fungii simply did `sudo apt upgrade` and it prompted to download the additional package. i didn't follow through with it, just wanted to see what it would propose18:36
clarkbI think getting it up to date is a good idea. Should I go ahead and run that then?18:37
clarkb(I don't want to start the brand new server with some debt)18:37
fungiyeah, it's fine to do that18:39
clarkbits going to upgrade: "grub-efi-amd64-bin grub-efi-amd64-signed python-apt-common python3-apt python3-distupgrade shim-signed sosreport ubuntu-release-upgrader-core" and also install python3-magic running `sudo apt upgrade`18:40
fungithe reason we're not seeing this on the other servers is probably that they don't have sosreport preinstalled (it's an ubuntu phone-home thing)18:40
clarkbya maybe we want to add it to the remove list18:40
clarkbgiven the grub packages are also going to upgrade I'll do a reboot after this to double check it is happy18:41
clarkbheh it didn't auto restart unattended-upgrades.service. Which is just a cronjob I thought. But also I'm rebooting now so it will get restarted then18:42
clarkbgrub-install: warning: EFI variables cannot be set on this system.18:42
clarkbI think because we aren't efi booting?18:42
clarkbanyway reboot now18:43
clarkband that went fine. So ya I guess review the two changes and if all looks well and you think we're good to proceed generally I can approve the two changes in the correct order18:44
fungiyep, taking a look now. thanks for the added diligence there!18:47
opendevreviewClark Boylan proposed opendev/system-config master: Remove sosreport from our servers
clarkbThere's the package removal. We already remove whoopsie which I'm guessing sosreport has replaced?18:50
opendevreviewMerged opendev/ master: Add gitea09 to DNS
fungilooks like they're slightly different but related tools18:54
fungiwhoopsie: Ubuntu error tracker submission18:54
fungisosreport: Set of tools to gather troubleshooting data from a system18:54
clarkbfor the system-config change: my expectation is that ansible will deploy the containers and then because I modified host_vars/ it will also run manage projects. Manage projects should create empty projects on gitea0919:02
clarkbthen we can figure out the db surgery19:02
clarkbI believe that should all be safe as the server is not added to gerrit replication or the load balancer pool19:03
clarkbworst case we edit the load balancer to remove the server but it shouldn't have any way of getting in there so I think we're good19:03
clarkbI'll sneak out for early lunch today as that should eat up the gating and deploy time19:06
opendevreviewMerged opendev/system-config master: Add gitea09 to our inventory
clarkbzuul is working through the long list of jobs for ^20:24
clarkbanyone know if we manage the known hosts for gitea servers on gerrit? I'm trying to sort that out and prep the followup changes for gitea09 while I wait for ansible to do its thing. But I'm not finding it if it exists21:29
clarkbnevermind it's in private vars I think because we don't want the test env to have any trust to prod21:30
fungithat sounds right21:37
clarkbthe gitea job is running now. Assuming that goes ok I guess I need to copy the db from 01 to 09 and restore on 09. I'll dump 09's db prior to that so that we can compare if we find that is necessary later21:38
clarkbI forgot we process the giteas serially to ensure service uptime. I was wondering why nothing had happened on 09 yet21:41
clarkbI've also just realized I should've updated the gitea system-config-run job to deploy a jammy node. Oh well if this fails we'll figure it out21:43
clarkboh I should actually wait until the manage-projects job runs later just to avoid manipulating the db when something else could be modifying settings21:45
clarkbthe steps are assuming gitea deployment goes well (it looks happy to me right now), then wait for manage projects, then transplant databases21:46
clarkband that won't be done for an hour or two?21:47
clarkbalso just to remind people this isn't a bfv node. I was planning to use regular disk for these new nodes. I think mnaser warned that data you care about should be in volumes but since these are mirrors and we have several of them I think this is ok?21:50
clarkbit simplifies management of things21:50
clarkbI guess we can change that later if we decide that is a good idea too21:50
fungiyeah, i think it's fine22:01
clarkbok manage projects is done. Let's see about that db surgery23:45
clarkbI've got a leaf blower singing its song outside :(23:46
clarkbok after transplanting the db I see the redirects table has content and the users table has old orgs like osf in it. I'm going to try starting gitea now23:58

