clarkb | The promised launch node is running. Once it is done I'll get changes up to system-config and dns | 00:10 |
---|---|---|
clarkb | one thing I didn't think to check until just now is quotas. | 00:14 |
clarkb | I think we're ok on that front | 00:15 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add gitea09 to our inventory https://review.opendev.org/c/opendev/system-config/+/874043 | 00:29 |
clarkb | infra-root ^ I think I'd like to be around when that lands just to make sure the end results are happy. But also there is something weird that unattended upgrades is complaining about. Seems a few things might be unexpectedly held back. Can someone who groks apt better than me take a look before we commit to this host? | 00:30 |
clarkb | now to get the DNS change up | 00:30 |
clarkb | curious we don't have AAAA records for the old giteas | 00:32 |
fungi | because the lb is dual-stack, the backends didn't need to be | 00:32 |
clarkb | oh also the dns change needs to land first so that LE can happen properly | 00:34 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add gitea09 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/874044 | 00:36 |
clarkb | I went ahead and added the sshfp records too since we seem to have not cleaned those up and they can't hurt (they don't help much either though) | 00:37 |
clarkb | oh and the automated collection of host keys didn't work. I just hopped on the server and catted the pubkeys | 00:38 |
clarkb | I've crosschecked those values against what ssh-keyscan emits and they all seem to line up so I think I got it correct. | 00:39 |
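As a reference for the cross-check described above, stock OpenSSH can produce SSHFP records on both ends; a minimal sketch, assuming the standard host key locations on the server:

```shell
# Print the host keys as SSHFP resource records, as seen over the network:
ssh-keyscan -D gitea09.opendev.org

# On the server itself, derive SSHFP records directly from the public keys
# (standard paths assumed) and compare with the output above:
for key in /etc/ssh/ssh_host_*_key.pub; do
    ssh-keygen -r gitea09.opendev.org -f "$key"
done
```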
johnsom | Just an observation, I have seen a few jobs today that took a long time on the "Copy old /opt" step. | 00:58 |
johnsom | 2023-02-15 23:53:07.219972 | TASK [configure-swap : Copy old /opt] | 00:58 |
johnsom | 2023-02-16 00:19:06.969689 | controller | ok: Runtime: 0:25:58.751848 | 00:59 |
johnsom | This one is currently running: https://review.opendev.org/726334 | 00:59 |
johnsom | Here is a different job: https://2319feb14e831f87c231-4854ed941a7816d1225c43ae9b456d0e.ssl.cf1.rackcdn.com/726334/52/gate/designate-bind9-scoped-tokens/913b892/job-output.txt | 01:01 |
johnsom | Jobs that are not timing out seem to take ~11 minutes | 01:02 |
fungi | that's typically an indicator of i/o contention | 01:03 |
fungi | since it's copying tons of small files from one block device to another | 01:03 |
Clark[m] | That also only happens on tax because we mount the ephemeral drive there | 01:11 |
fungi | rax | 01:11 |
fungi | but yeah | 01:11 |
Clark[m] | On other clouds that cost happens when you write zeros for the swap file | 01:12 |
Clark[m] | Ya rax. Auto correct on the phone getting it wrong | 01:12 |
fungi | but yes, comparing it to non-rax nodes is pointless because they just no-op that step | 01:12 |
johnsom | I have seen it on Designate jobs and tempest repo jobs today. | 01:13 |
Clark[m] | I think it's only for devstack based jobs. The unittest jobs don't bother | 01:13 |
Clark[m] | But yes it's normal and expected | 01:14 |
johnsom | Yeah, it's just a bummer it's causing TIMED_OUT jobs | 01:14 |
Clark[m] | Well it's not the only thing causing jobs to be slow :) | 01:14 |
Clark[m] | Devstack could run in like half the time if we didn't use osc for example | 01:15 |
Clark[m] | There are also a number of steps that run large Ansible loops and those are slow | 01:15 |
johnsom | Yeah, these aren't even getting to tempest before they timeout | 01:15 |
Clark[m] | Jobs swap regularly which may create our own noisy neighbor io contention problem. I think dansmith is looking at tuning MySQL back to hopefully free up some memory though. Anyway we suffer from a thousand cuts in the devstack jobs and that is one of them. One that is more difficult to undo due to the need for disk space | 01:16 |
Clark[m] | I suppose someone could sort out running jobs with the ephemeral drive mounted on another location. | 01:16 |
Clark[m] | But that's likely to be more painful than running devstack with virtualenvs | 01:17 |
Clark[m] | Fwiw if I end up with some time I'll try to pick a long running job and identify ways to speed it up. Haven't had time recently, but last year a number of speedups came from writing better Ansible. Unfortunately, that sort of Ansible is not what ansible-lint suggests you write, so there is slow Ansible all over | 01:22 |
johnsom | Well, if you are serious that there might be something we can do to eliminate that "copy old /opt" step, I would be interested to learn more and maybe poke at it. Even at 11 minutes, it is one of the longer steps I have seen. | 01:25 |
fungi | ansible-lint favors maintainability over performance, this just isn't the sort of stuff it was designed for | 01:25 |
fungi | johnsom: definitely worth brainstorming. basically the git repository cache is baked into /opt in our node images. on rackspace we don't get a large enough rootfs to run devstack and tempest tests but they have a large ephemeral disk, so we format it, copy the git cache into it, and run devstack there | 01:27 |
fungi | i guess an alternative would be to tell devstack to clone repos from another drive, which would make devstack slower but would skip the copy step and maybe avoid copying things devstack doesn't need | 01:28 |
fungi | the challenge is handing off the git cache to devstack efficiently, while providing enough disk space for the tests to run | 01:29 |
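A rough sketch of the rackspace-only handling fungi describes, not the literal configure-swap role; the device name is an assumption and the swap partitioning is omitted:

```shell
EPHEMERAL=/dev/xvde                 # assumed name of the rackspace ephemeral disk
mkfs.ext4 "$EPHEMERAL"              # format the extra block device
mv /opt /opt.old                    # set the baked-in git repo cache aside
mkdir /opt
mount "$EPHEMERAL" /opt             # mount the large disk where devstack needs space
cp -a /opt.old/. /opt/              # the slow "Copy old /opt" step: tons of small files
```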
johnsom | Hmm, that helps me understand better what is going on. | 01:30 |
fungi | as Clark[m] points out, the up side to rackspace giving us an extra block device is that we can mkswap a partition for swap space, whereas on other providers we have to preallocate a swap file by writing out zeroes | 01:31 |
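The two approaches fungi contrasts, sketched side by side; the device, path, and size here are illustrative:

```shell
# With a spare block device (rackspace ephemeral disk), swap is cheap to set up:
mkswap /dev/xvde1
swapon /dev/xvde1

# Without one, the swap file must be preallocated by writing zeroes,
# which is where the time goes on other providers:
dd if=/dev/zero of=/swapfile bs=1M count=1024   # 1GB of zeroes
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
```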
* johnsom wonders how many jobs need swap space | 01:32 |
fungi | an unfortunate number | 01:32 |
Clark[m] | Most tempest jobs I imagine. Maybe all | 01:32 |
fungi | having a little bit of swap does help provide more main memory for things like filesystem cache, which can improve job efficiency | 01:32 |
fungi | but the 1gb of swap we were recommending for that was quickly upped to something like 8gb because "my jobs need more memory" | 01:33 |
Clark[m] | It's back to 1gb by default iirc because 8gbs of zeros is slow | 01:34 |
fungi | more swap was seen as a way to "fix" oom-related job failures (and then increased job timeouts, because the swap thrash made them run 2x as long) | 01:35 |
johnsom | SwapTotal: 4193276 kB SwapFree: 4192508 kB from the memory-tracker.txt on one of my tempest based jobs. | 01:38 |
johnsom | Ok, thanks for the details. I need to sign off, but may noodle more on this tomorrow. | 01:38 |
Clark[m] | I think you need to look at dstat because memory tracker is a single point in time? | 01:39 |
fungi | that's presumably a snapshot of swap utilization at a specific point in time | 01:39 |
fungi | a high water mark would be more useful | 01:39 |
johnsom | It's snapshots over time | 01:39 |
johnsom | https://2319feb14e831f87c231-4854ed941a7816d1225c43ae9b456d0e.ssl.cf1.rackcdn.com/726334/52/gate/designate-bind9-scoped-tokens/913b892/controller/logs/screen-memory_tracker.txt | 01:39 |
Clark[m] | Ah ok that's not the file I thought it was. | 01:41 |
Clark[m] | I must be thinking of the world dump or something | 01:41 |
*** dmitriis is now known as Guest4995 | 03:31 | |
opendevreview | Ian Wienand proposed opendev/system-config master: make-tarball: role to archive directories https://review.opendev.org/c/opendev/system-config/+/865784 | 05:57 |
opendevreview | Ian Wienand proposed opendev/system-config master: tools/make-backup-key.sh https://review.opendev.org/c/opendev/system-config/+/866430 | 05:57 |
mnasiadka | good morning | 06:54 |
mnasiadka | I have a question around mails to openstack-stable-maint - Kolla uses periodic and periodic-weekly pipelines for publish jobs - and their failures are not getting mailed - should we move them to periodic-stable? | 06:58 |
fungi | mnasiadka: if they're not stable branch jobs we probably want to come up with a different solution. also you can check the builds tab on the zuul dashboard; it supports url parameters so it's easy to bookmark specific queries | 07:11 |
mnasiadka | yeah, just did bookmark it | 07:11 |
mnasiadka | that works for now ;) | 07:12 |
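The bookmarkable query fungi mentions looks roughly like this; the project and pipeline filters are illustrative, taken from the discussion rather than from an actual bookmark:

```
https://zuul.opendev.org/t/openstack/builds?project=openstack/kolla&pipeline=periodic&result=FAILURE
```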
*** jpena|off is now known as jpena | 08:28 | |
mnasiadka | fungi: is there a way to recover an encrypted secret? our docker hub account (that nobody apart from zuul knows the password for) stopped accepting Kolla image pushes 2 months ago (permission denied) | 09:35 |
kopecmartin | corvus: hi, i wanted to check with you whether you're ok with a cleanup of additional maintainers in PyPI - https://etherpad.opendev.org/p/openstack-pypi-maintainers-cleanup , I see you're a maintainer of the bashate package currently, thanks | 09:49 |
frickler | mnasiadka: technically this would be possible, but our policy explicitly prohibits it, see https://docs.opendev.org/opendev/infra-manual/latest/testing.html#handling-zuul-secrets | 10:08 |
frickler | maybe there is a way to do password recovery for the docker hub site? | 10:08 |
mnasiadka | email for password recovery surely isn't mine (probably some PTL years back) | 10:12 |
mnasiadka | I've sent mail to Docker Support | 10:12 |
mnasiadka | If they won't help me - I'll come back... | 10:12 |
fungi | to be clear, the primary reasons we have that policy are: 1. to encourage users to coordinate succession and secrets management on their own rather than relying on zuul admins as their fallback plan, and 2. because it's very hard to be absolutely certain we're disclosing that secret to someone who actually should have access to it | 13:04 |
fungi | we don't ourselves rely on our ability to decrypt our own zuul secrets either, we have a shared secure locker we keep information like that in | 13:05 |
*** dasm|off is now known as dasm | 14:00 | |
corvus | kopecmartin: i can remove myself from that package if that is desired. i don't see that on the etherpad. who's in charge of that now? | 14:38 |
kopecmartin | corvus: yeah, that'd be great if you can do that, i understand it's easier if you do that ... maybe gmann knows more | 14:45 |
fungi | if the question is who's responsible for bashate these days, it's the qa team: https://governance.openstack.org/tc/reference/projects/quality-assurance.html | 14:52 |
corvus | well that would seem to be kopecmartin then! this seems pretty legit. :) | 14:54 |
kopecmartin | corvus: in a sense, yes :) .. openstackci should stay as the only maintainer in the pypi if the question was whom to put there | 14:55 |
corvus | kopecmartin: great. i sent you an email as an out-of-band confirmation of our chat in irc. if you could reply to that, i'll drop myself. thanks :) | 14:56 |
kopecmartin | corvus: absolutely, thanks | 14:56 |
mnasiadka | fungi: let's see, seems Docker Hub is cooperating, once I recover the account - I'll set up an organisation and a team | 15:04 |
fungi | that sounds more effective long-term | 15:05 |
mnasiadka | that's what we have for quay.io, but we never knew the password for docker hub ;-) | 15:35 |
clarkb | slow start for me today. The dentist gets a visit from me today | 16:12 |
fungi | leave nothing behind. that's my motto for dentist visits | 16:14 |
clarkb | it's actually a last-minute appointment but not due to any known issues. My hygienist quit recently so my old appointment got cancelled and they called yesterday about a slot that opened up today | 16:15 |
frickler | https://docs.opendev.org/opendev/infra-manual/latest/drivers.html#using-secrets links to https://zuul-ci.org/docs/zuul/user/encryption.html which is a 404, if someone has time to find the new URL, that'd be nice | 16:38 |
*** dasm is now known as Guest5046 | 16:46 | |
fungi | oh, right, i think i had a line about that rotting in my to do list, now buried so deep i'll never actually see it | 17:16 |
fungi | seems that moved from user to lib | 17:17 |
fungi | er, no | 17:18 |
fungi | i bet it got folded into another page, i'll try to find a substring from it in the git history and track it down that way | 17:18 |
fungi | it moved from user to discussions to discussion, and then got folded into project-config.rst | 17:22 |
fungi | frickler: https://zuul-ci.org/docs/zuul/latest/project-config.html#encryption | 17:22 |
fungi | that's where it eventually wound up | 17:23 |
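For the curious, the history search fungi describes can be done with git's pickaxe; a sketch run from a zuul checkout, where the search string is a placeholder rather than real text from the old page:

```shell
# Find commits that added or removed the text, across all branches of the docs:
git log --all --oneline -S 'phrase from the old encryption page' -- doc/
# Then inspect a matching commit to see where the section moved:
git show --stat <commit>
```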
*** jpena is now known as jpena|off | 17:40 | |
frickler | wow, I haven't seen such a busy gate in quite a while | 18:02 |
fungi | per discussion in #openstack-nova there's been a scramble to fix a cadre of different problems causing random test failures (seems to come down to bugs in cirros and ovn), on top of feature freeze tomorrow | 18:04 |
clarkb | fungi: any chance you have time to look at that unattended upgrades issue on gitea09? | 18:07 |
clarkb | I was hoping to understand that better before landing changes to begin a deployment on it | 18:08 |
frickler | plus requirements gate has been broken for a while and they are merging a big bunch of queued up u-c updates now | 18:08 |
clarkb | also there was the pkg_resources issue that hit some xstatic repos | 18:09 |
*** dasm is now known as Guest5052 | 18:10 | |
fungi | clarkb: i missed that issue, will take a look now | 18:10 |
fungi | 2023-02-16 06:01:07,225 INFO Package sosreport is kept back because a related package is kept back or due to local apt_preferences(5). | 18:12 |
fungi | clarkb: that ^ error? | 18:12 |
clarkb | fungi: ya there are a few other packages being held too, a couple libs iirc | 18:12 |
clarkb | but as far as I know we don't have any local settings holding anything back? | 18:13 |
clarkb | also I'm catching up on the mailman site creation thread now. I would say the docs are incredibly confusing if they don't intend for you to use a db migration | 18:13 |
clarkb | nowhere does it say "use the django admin shell and run these commands" | 18:14 |
fungi | we have 20auto-upgrades and 50unattended-upgrades installed. is it normal to have both of those? | 18:14 |
clarkb | fungi: I think one comes from the package install of unattended-upgradse and the other we write out | 18:14 |
clarkb | I seem to recall at one time we had some changes to work on converging that? | 18:15 |
clarkb | maybe that didn't reach a conclusion | 18:15 |
fungi | looks like one is what causes upgrades to run and the other is configuration for it | 18:15 |
fungi | `apt show sosreport -a` gives information on the versions of that package known to the system | 18:16 |
clarkb | fwiw the image we booted on is one I grabbed from ubuntu and uploaded to vexxhost for the gitea loadbalancer too. So we may need to check that host? Our launch node system does a full unattended upgrade and then reboots so only this handful of packages are the ones that are sad | 18:18 |
fungi | sosreport 4.3-1ubuntu2.1 is currently installed from jammy-security/main but there's a newer 4.4-1ubuntu1.22.04.1 in jammy-updates/main | 18:19 |
clarkb | but why would that not be installable? Do we not allow updates from jammy-updates in unattended-upgrades? | 18:21 |
fungi | trying to upgrade from the command, it looks like the newer version of sosreport now requires python3-magic so unattended-upgrades is probably refusing to install new packages by default | 18:21 |
clarkb | aha thats the bit I think I was missing | 18:22 |
fungi | yeah, compare the Depends: lines | 18:22 |
fungi | `apt show sosreport -a|grep -e '^Version:' -e '^Depends:'` | 18:22 |
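To confirm that diagnosis, two read-only checks; a sketch, nothing here changes the system:

```shell
# Ask unattended-upgrades what it would do and why packages get skipped:
sudo unattended-upgrade --dry-run --debug
# Simulate the upgrade to see the held-back package and the new dependency
# (python3-magic) that would be pulled in:
apt-get -s upgrade          # shows sosreport kept back
apt-get -s dist-upgrade     # shows python3-magic being newly installed
```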
clarkb | I guess knowing that, proceeding seems fine? | 18:23 |
clarkb | both with updating the package by hand and landing the dns change then the system-config change? | 18:23 |
fungi | it added two new dependencies, one of which was already installed but the other is not yet | 18:23 |
clarkb | I just want to make sure it wasn't in some super bad state where starting over would be better | 18:23 |
fungi | normally package updates in debian stable releases would not be allowed to add dependencies like that, i agree it's surprising | 18:23 |
clarkb | and i guess a manual apt-get dist-upgrade -y or whatever would unstick it? | 18:24 |
clarkb | and apt-get upgrade would not? | 18:27 |
clarkb | anyway if you think that's safe and the preferred next step I can do that | 18:27 |
clarkb | and then https://review.opendev.org/c/opendev/zone-opendev.org/+/874044 needs to land before https://review.opendev.org/c/opendev/system-config/+/874043 if you have time to review those. I can do the approvals in the correct order | 18:30 |
fungi | i simply did `sudo apt upgrade` and it prompted to download the additional package. i didn't follow through with it, just wanted to see what it would propose | 18:36 |
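For the earlier question about which command unsticks this, the difference comes down to whether new packages may be installed; a summary from general apt behavior rather than anything tested on this host:

```shell
apt-get upgrade        # never installs new packages, so sosreport stays kept back
apt upgrade            # installs new dependencies (e.g. python3-magic) but removes nothing
apt-get dist-upgrade   # may install and remove packages as needed
```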
clarkb | I think getting it up to date is a good idea. Should I go ahead and run that then? | 18:37 |
clarkb | (I don't want to start the brand new server with some debt) | 18:37 |
fungi | yeah, it's fine to do that | 18:39 |
clarkb | it's going to upgrade: "grub-efi-amd64-bin grub-efi-amd64-signed python-apt-common python3-apt python3-distupgrade shim-signed sosreport ubuntu-release-upgrader-core" and also install python3-magic when running `sudo apt upgrade` | 18:40 |
fungi | the reason we're not seeing this on the other servers is probably that they don't have sosreport preinstalled (it's an ubuntu phone-home thing) | 18:40 |
clarkb | ya maybe we want to add it to the remove list | 18:40 |
clarkb | given the grub packages are also going to upgrade I'll do a reboot after this to double check it is happy | 18:41 |
clarkb | heh it didn't auto restart unattended-upgrades.service, which I thought was just a cronjob. But also I'm rebooting now so it will get restarted then | 18:42 |
clarkb | grub-install: warning: EFI variables cannot be set on this system. | 18:42 |
clarkb | I think because we aren't efi booting? | 18:42 |
clarkb | anyway reboot now | 18:43 |
clarkb | and that went fine. So ya I guess review the two changes and if all looks well and you think we're good to proceed generally I can approve the two changes in the correct order | 18:44 |
fungi | yep, taking a look now. thanks for the added diligence there! | 18:47 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove sosreport from our servers https://review.opendev.org/c/opendev/system-config/+/874125 | 18:49 |
clarkb | There's the package removal. We already remove whoopsie which I'm guessing sosreport has replaced? | 18:50 |
opendevreview | Merged opendev/zone-opendev.org master: Add gitea09 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/874044 | 18:53 |
fungi | looks like they're slightly different but related tools | 18:54 |
fungi | whoopsie: Ubuntu error tracker submission | 18:54 |
fungi | sosreport: Set of tools to gather troubleshooting data from a system | 18:54 |
clarkb | for the system-config change: my expectation is that ansible will deploy the containers and then because I modified host_vars/gitea09.opendev.org.yaml it will also run manage-projects. Manage projects should create empty projects on gitea09 | 19:02 |
clarkb | then we can figure out the db surgery | 19:02 |
clarkb | I believe that should all be safe as the server is not added to gerrit replication or the load balancer pool | 19:03 |
clarkb | worst case we edit the load balancer to remove the server but it shouldn't have any way of getting in there so I think we're good | 19:03 |
clarkb | I'll sneak out for early lunch today as that should eat up the gating and deploy time | 19:06 |
opendevreview | Merged opendev/system-config master: Add gitea09 to our inventory https://review.opendev.org/c/opendev/system-config/+/874043 | 19:51 |
clarkb | zuul is working through the long list of jobs for ^ | 20:24 |
clarkb | anyone know if we manage the known hosts for gitea servers on gerrit? I'm trying to sort that out and prep the followup changes for gitea09 while I wait for ansible to do its thing. But I'm not finding it if it exists | 21:29 |
clarkb | nevermind it's in private vars I think because we don't want the test env to have any trust to prod | 21:30 |
fungi | that sounds right | 21:37 |
clarkb | the gitea job is running now. Assuming that goes ok I guess I need to copy the db from 01 to 09 and restore on 09. I'll dump 09's db prior to that so that we can compare if we find that is necessary later | 21:38 |
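A very rough sketch of the dump/restore clarkb is planning; the container name, database name, and credential handling here are all assumptions, not the real deployment details:

```shell
# On gitea09, keep a copy of the freshly created db for later comparison:
docker exec gitea-db mysqldump --single-transaction gitea > gitea09-pristine.sql
# On gitea01, dump the production db and copy the file over:
docker exec gitea-db mysqldump --single-transaction gitea > gitea01.sql
# Back on gitea09, restore the production dump:
docker exec -i gitea-db mysql gitea < gitea01.sql
```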
clarkb | I forgot we process the giteas serially to ensure service uptime. I was wondering why nothing had happened on 09 yet | 21:41 |
clarkb | I've also just realized I should've updated the gitea system-config-run job to deploy a jammy node. Oh well, if this fails we'll figure it out | 21:43 |
clarkb | oh I should actually wait until the manage-projects job runs later just to avoid manipulating the db when something else could be modifying settings | 21:45 |
clarkb | the steps are assuming gitea deployment goes well (it looks happy to me right now), then wait for manage projects, then transplant databases | 21:46 |
clarkb | and that won't be done for an hour or two? | 21:47 |
fungi | makes sense | 21:49 |
clarkb | also just to remind people this isn't a bfv node. I was planning to use regular disk for these new nodes. I think mnaser warned that data you care about should be in volumes but since these are mirrors and we have several of them I think this is ok? | 21:50 |
clarkb | it simplifies management of things | 21:50 |
clarkb | I guess we can change that later if we decide that is a good idea too | 21:50 |
fungi | yeah, i think it's fine | 22:01 |
clarkb | ok manage projects is done. Let's see about that db surgery | 23:45 |
clarkb | I've got a leaf blower singing its song outside :( | 23:46 |
*** dasm_ is now known as dasm|off | 23:55 | |
clarkb | ok after transplanting the db I see the redirects table has content and the users table has old orgs like osf in it. I'm going to try starting gitea now | 23:58 |