*** diablo_rojo has quit IRC | 00:01 | |
*** mlavalle has quit IRC | 00:23 | |
*** diablo_rojo has joined #opendev | 00:31 | |
ianw | johnsom: instance as in that nic should be brought up by glean? | 00:58 |
mordred | johnsom: I have not seen that issue | 01:17 |
mordred | ianw: oh - haha: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_276/723528/9/check/system-config-run-zuul/276cc43/bridge.openstack.org/ara-report/result/4a8210fa-682e-4af6-b689-fdc263b00b49/ | 01:18 |
mordred | ianw: this is where I'm at atm | 01:18 |
johnsom | ianw: I am just using the ubuntu-minimal element, which used cloud-init. I am pretty sure it is the ifupdown we add (historical reasons) that is the trouble maker. | 01:20 |
johnsom | I will dig in tomorrow | 01:20 |
johnsom | mordred: thank you. | 01:20 |
mordred | johnsom: we use simple-init instead of cloud-init | 01:21 |
mordred | johnsom: you might try that - ubuntu-minimal element itself doesn't install cloud-init I don't think - the ubuntu one does | 01:21 |
mordred | but you could try ubuntu-minimal simple-init and see if it works for you - it's working well for us | 01:22 |
mordred | ianw: 1.8.4~pre1-1ubuntu2 <-- focal has that version of afs - which is later than what's in our ppa | 01:22 |
mordred | ianw: maybe we should skip the ppa on foca? | 01:22 |
mordred | focal? | 01:22 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Test zuul-executor on focal https://review.opendev.org/723528 | 01:25 |
mordred | ianw: giving that a try | 01:25 |
ianw | mordred: i'm very suspicious of anything with ~pre in it wrt afs | 01:31 |
ianw | mordred: i think we should update to 1.8.5, https://launchpad.net/~openafs/+archive/ubuntu/stable has packages but not for arm64 | 01:38 |
ianw | mordred: i've got 1.8.5 building for focal @ https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs/+packages now ... let's see how it goes | 01:55 |
ianw | i don't think it's worth updating the ppa with 1.8.5 for older distros. if it ain't broke ... | 02:02 |
ianw | amd64 and arm64 built. will run integration tests when they publish | 02:22 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add focal integration tests https://review.opendev.org/724214 | 02:30 |
ianw | ^ passed ... so i think that's the way to go | 02:59 |
clarkb | I'll try to figure out why https://review.opendev.org/#/c/722394/ is failing first thing tomorrow to reduce the time it's sitting out there conflicting | 03:12 |
clarkb | I think the docs issue is a zuul sphinx bug | 03:14 |
*** diablo_rojo has quit IRC | 04:51 | |
*** ykarel|away is now known as ykarel | 04:51 | |
*** ysandeep|away is now known as ysandeep | 05:45 | |
*** ysandeep is now known as ysandeep|brb | 05:55 | |
*** rpittau|afk is now known as rpittau | 06:32 | |
*** ysandeep|brb is now known as ysandeep | 06:51 | |
*** DSpider has joined #opendev | 06:57 | |
*** dpawlik has joined #opendev | 06:59 | |
*** tosky has joined #opendev | 07:29 | |
*** ykarel is now known as ykarel|afk | 07:29 | |
*** ysandeep is now known as ysandeep|lunch | 07:42 | |
*** ralonsoh has joined #opendev | 07:48 | |
kevinz | Arm64 bionic image fails to set up Devstack today with a newly built image: https://zuul.opendev.org/t/openstack/build/2b2cbea8882844bfa0bf5cc62c705242 | 08:04 |
*** ykarel|afk is now known as ykarel | 08:21 | |
*** ysandeep|lunch is now known as ysandeep | 08:26 | |
*** ykarel is now known as ykarel|lunch | 08:35 | |
*** ykarel|lunch is now known as ykarel | 09:44 | |
*** mrunge has quit IRC | 09:57 | |
*** mrunge has joined #opendev | 09:57 | |
*** rpittau is now known as rpittau|bbl | 10:55 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add loop var policy to ansible-lint https://review.opendev.org/724281 | 11:02 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add loop var policy to ansible-lint https://review.opendev.org/724281 | 11:22 |
*** ysandeep is now known as ysandeep|coffee | 11:29 | |
*** rpittau|bbl is now known as rpittau | 12:31 | |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Add Github mirroring job to all official repos https://review.opendev.org/724310 | 12:33 |
openstackgerrit | Thierry Carrez proposed opendev/system-config master: Disable global Github replication https://review.opendev.org/718478 | 12:37 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Add Github mirroring job to all official repos https://review.opendev.org/724310 | 12:50 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Add Github mirroring job to all official repos https://review.opendev.org/724310 | 12:54 |
*** ykarel is now known as ykarel|afk | 12:58 | |
mordred | ianw: awesome | 12:58 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Test zuul-executor on focal https://review.opendev.org/723528 | 12:59 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Add Github mirroring job to all official repos https://review.opendev.org/724310 | 13:00 |
* ttx sighs | 13:02 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Test zuul-executor on focal https://review.opendev.org/723528 | 13:07 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run test playbooks with more forks https://review.opendev.org/724317 | 13:07 |
mordred | ttx: yes? | 13:08 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Add Github mirroring job to all official repos https://review.opendev.org/724310 | 13:10 |
ttx | mordred: things my local tests fail to catch | 13:11 |
mordred | ttx: oh look - there are some repos in governance that have been retired | 13:11 |
ttx | fun | 13:11 |
AJaeger | ttx, what about using a project-template? | 13:11 |
ttx | AJaeger: would that really be less wordy? | 13:12 |
AJaeger | ttx, one line instead of three - and if you ever want to change, it's easier. | 13:12 |
mordred | ttx: I look forward to a release-bot job that runs on governance and project-config changes and proposes update patches :) | 13:12 |
AJaeger | Name it "official-openstack-repo-jobs" ;) | 13:12 |
ttx | hmmm | 13:13 |
AJaeger | or something like that... | 13:13 |
ttx | ok. I just need to change the code that actually generates that change :) | 13:13 |
mordred | ttx: also - fwiw, if you grab project-config and look in gerrit/projects.yaml - if acl-config for a project is "/home/gerrit2/acls/openstack/retired.config" - it's retired | 13:13 |
mordred | I don't know if that makes anything easier - but just mentioning in case it does | 13:14 |
ttx | I already had a lot of fun with your weird alpha comparison in there | 13:14 |
mordred | we have weird alpha comparison? | 13:14 |
ttx | you normalize - and _ | 13:14 |
mordred | oh - good for us | 13:14 |
ttx | I know, right | 13:14 |
mordred | we like to make things easier for you | 13:15 |
mordred | it's what we're here for | 13:15 |
mordred | especially when jeepyb is involved | 13:15 |
ttx | You think I'm getting too lazy on my python coding | 13:15 |
ttx | so you throw weird curve balls at me | 13:15 |
mordred | I do - I wouldn't want your muscles to atrophy | 13:15 |
ttx | ok, stay put, adding template | 13:15 |
AJaeger | ttx, we don't normalize - sort does | 13:16 |
ttx | hmm | 13:16 |
AJaeger | LC_ALL=C sort --ignore-case ... | 13:16 |
ttx | I think you do https://opendev.org/openstack/project-config/src/branch/master/tools/zuul-projects-checks.py#L41 | 13:17 |
AJaeger | ttx, yeah that surprised me once as well | 13:17 |
ttx | return s.lower().replace("_", "-") | 13:17 |
AJaeger | ttx, you're right. I looked at the wrong place. | 13:17 |
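
That normalization lives in project-config's tools/zuul-projects-checks.py (linked above). A minimal sketch of why it surprises people; the example project names here are made up for illustration:

```python
# Sketch of the sort-key normalization quoted above: names are compared
# case-insensitively and with "_" treated as "-".
def normalize(s):
    return s.lower().replace("_", "-")

names = ["openstack/foo_bar", "openstack/foo-baz"]

# Plain byte ordering (what `LC_ALL=C sort` does): "-" sorts before "_",
# so foo-baz comes first.
print(sorted(names))
# ['openstack/foo-baz', 'openstack/foo_bar']

# project-config's ordering: after normalization both names start with
# "foo-ba", so foo_bar (-> "foo-bar") sorts first.
print(sorted(names, key=normalize))
# ['openstack/foo_bar', 'openstack/foo-baz']
```
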
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: tox: allow tox to be upgraded https://review.opendev.org/690057 | 13:24 |
fungi | ttx: mordred: in the opendev metrics script i'm using, i check whether project.state is ACTIVE (as opposed to READ_ONLY) as a proxy for identifying what isn't retired | 13:33 |
fungi | though i ultimately only use that as a filter right now for building a list of namespaces with at least one non-retired project | 13:34 |
fungi | can just hit the gerrit rest api method for /project/ | 13:35 |
fungi | anonymously even | 13:35 |
fungi | no need to separately parse acl config options in project-config | 13:35 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add loop var policy to ansible-lint https://review.opendev.org/724281 | 13:36 |
fungi | ttx: actually if you call for a list of projects by regex like ^openstack/.* then gerrit will give you just that subset. filter it by ['state'] == 'ACTIVE' and then if you want, use a set.intersection() against the set of deliverable repos | 13:39 |
fungi | from governance | 13:40 |
fungi | assuming you really just want the subset of writeable openstack namespace repos and deliverable repos in the governance projects.yaml, that is | 13:41 |
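
A rough sketch of the approach fungi describes, using Gerrit's anonymous REST API. The regex and the empty governance set are illustrative, and treating a missing state field as ACTIVE is an assumption about the list output:

```python
import json
import urllib.parse
import urllib.request

# Ask Gerrit (anonymously) for all projects matching a regex, then keep
# only the writable ones (state ACTIVE rather than READ_ONLY).
url = ("https://review.opendev.org/projects/?r=" +
       urllib.parse.quote("^openstack/.*"))
raw = urllib.request.urlopen(url, timeout=30).read().decode("utf-8")

# Gerrit prefixes JSON responses with ")]}'" on its own line; drop it.
projects = json.loads(raw.split("\n", 1)[1])

active = {name for name, info in projects.items()
          if info.get("state", "ACTIVE") == "ACTIVE"}

# Intersect with the deliverable repos from governance's projects.yaml
# (loading that file is left out of this sketch).
governance_deliverables = set()
writable_deliverables = active & governance_deliverables
```
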
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add loop var policy to ansible-lint https://review.opendev.org/724281 | 13:41 |
*** ykarel|afk is now known as ykarel | 13:46 | |
ttx | AJaeger: where are templates defined these days? | 13:47 |
AJaeger | openstack-zuul-jobs | 13:48 |
ttx | Also to make it one line instead of 3 I have to break how 'templates:' is used in that file | 13:48 |
AJaeger | I would be fine to put the template into a new file in project-config/zuul.d/project-templates.yaml | 13:50 |
AJaeger | ttx, let's see what other reviewers think... | 13:50 |
ttx | openstack-zuul-jobs looks ok to me | 13:51 |
hrw | morning | 14:01 |
hrw | can someone look at arm64 mirrors in opendev infra? jobs fail while connecting | 14:02 |
*** rkukura has quit IRC | 14:03 | |
AJaeger | ttx, I updated your description a bit. | 14:04 |
*** rkukura has joined #opendev | 14:04 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: Add testing of fetch-sphinx-tarball role https://review.opendev.org/721584 | 14:05 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Add Github mirroring job to all official repos https://review.opendev.org/724310 | 14:06 |
ttx | arh | 14:07 |
AJaeger | ttx, now double the size? Is that what arh means? | 14:08 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Add Github mirroring job to all official repos https://review.opendev.org/724310 | 14:08 |
ttx | sorry, I fumbled | 14:08 |
ttx | ok, this one /should/ be ok | 14:09 |
ttx | (famous last words) | 14:09 |
ttx | good thing being, my script is not idempotent :) | 14:10 |
ttx | s/not/now | 14:10 |
ttx | I need to sleep | 14:10 |
fungi | hrw: have a link to a failed build report? it'll be easier to start from there | 14:11 |
fungi | hrw: since we have arm-specific regions, i'm going to guess something has happened to the local mirror server in one of them | 14:12 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Test zuul-executor on focal https://review.opendev.org/723528 | 14:12 |
AJaeger | ttx, LGTM | 14:12 |
*** hashar has joined #opendev | 14:12 | |
mordred | corvus, fungi: apparently https://podman.io/getting-started/installation.html says that we should be getting podman and friends from opensuse kubic now rather than the ppa | 14:13 |
ttx | also submitted https://review.opendev.org/724334 so that the repos removed from Zuul are no longer in governance | 14:13 |
AJaeger | infra-root, please review https://review.opendev.org/718478 https://review.opendev.org/724329 and https://review.opendev.org/#/c/724310 together - that's github mirroring change | 14:13 |
ttx | They should be linked by depends on | 14:14 |
AJaeger | ttx, https://review.opendev.org/#/c/721723/ exists for i18n-specs, let's get it merged ;) | 14:14 |
*** avass has joined #opendev | 14:15 | |
ttx | wait, I left the post job defined in release-test | 14:15 |
AJaeger | ttx, they are correctly linked - still I wanted to have people review them together. | 14:15 |
ttx | one laaaast update | 14:15 |
zbr | do we still have problems with zuul? i just got a job failed with "EXCEPTION" at https://review.opendev.org/#/c/721844/ | 14:15 |
hrw | fungi: https://zuul.opendev.org/t/openstack/build/50687729da274e74a70ad8fd9e9fb26d https://zuul.opendev.org/t/openstack/build/c1a316962a044ac79917b2a4a7130777 https://zuul.opendev.org/t/openstack/build/fcdc90d63fde4038aa6dc4fd1689ac7f | 14:16 |
hrw | sorry, had a call | 14:16 |
hrw | fungi: ubuntu debian centos8 | 14:16 |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: Add Github mirroring job to all official repos https://review.opendev.org/724310 | 14:16 |
AJaeger | infra-root, Zuul is dead again - https://zuul.opendev.org/tenants gives error 500 | 14:17 |
AJaeger | infra-root, seems we again ran out of swap: http://cacti.openstack.org/cacti/graph.php?action=zoom&local_graph_id=64794&rra_id=0&view_type=tree&graph_start=1588083483&graph_end=1588169883 | 14:18 |
corvus | i'll restart it | 14:20 |
AJaeger | thx, corvus | 14:20 |
fungi | oom already killed a scheduler process (likely the geard fork again) at 14:01:32 | 14:20 |
corvus | something happened to free up memory at 22:40 yesterday | 14:21 |
corvus | it looks like zuul-web is the big user here | 14:21 |
fungi | the fact that our last good queue backup (openstack_status_1588168861.json) was at 14:01 i think confirms it was geard which got killed | 14:21 |
mordred | corvus: clarkb updated the apache to do better caching | 14:22 |
fungi | er, no, just that's also when the web process started returning 500 | 14:22 |
corvus | mordred: did that get restarted? | 14:22 |
mordred | corvus: I think I remember him saying he was going to restart apache to pick up the new settings | 14:23 |
corvus | cool; i'll restart it again just in case | 14:23 |
fungi | 2020-04-28 12:12:33,399 INFO zuul.Scheduler: Starting scheduler | 14:23 |
fungi | that's the last scheduler start i find in the logs | 14:23 |
mordred | corvus: kk | 14:23 |
fungi | which was after the oom a couple days ago i think | 14:23 |
fungi | (immediately after i mean) | 14:23 |
corvus | zuul-web isn't shutting down after docker-compose stop | 14:24 |
*** priteau has joined #opendev | 14:24 | |
corvus | nor fingergw | 14:25 |
corvus | i ran 'docker stop' + their container ids | 14:25 |
corvus | that seems to have stopped them | 14:25 |
mordred | corvus: wow | 14:25 |
avass | corvus: interesting | 14:25 |
corvus | starting again | 14:25 |
corvus | oh | 14:26 |
corvus | i forgot they are in different docker-compose directories | 14:26 |
corvus | so operator error on my part there, sorry | 14:27 |
*** tkajinam has joined #opendev | 14:27 | |
corvus | -rw-r--r-- 1 root root 1340022 Apr 29 14:01 openstack_status_1588168861.json | 14:27 |
corvus | i'll restore from that | 14:27 |
AJaeger | should we send an all green status afterwards? | 14:27 |
fungi | thanks, that matches the last good queue backup i found as well | 14:28 |
*** sean-k-mooney has joined #opendev | 14:28 | |
*** stephenfin has joined #opendev | 14:28 | |
*** dpawlik has quit IRC | 14:28 | |
fungi | doesn't look like the --fail option addition patch for that curl cronjob has gotten applied yet | 14:28 |
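
For context on that --fail flag: without it, a 500 from zuul-web gets written over the last good queue snapshot. A hedged Python sketch of the same guard the flag provides for the real curl cron job (the URL and filename pattern are illustrative):

```python
import time
import urllib.error
import urllib.request

# Only persist a new status snapshot when the fetch succeeds; an error
# response must never overwrite the last good backup (the effect that
# `curl --fail` gives the real cron job).
url = "https://zuul.opendev.org/api/tenant/openstack/status"

try:
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = resp.read()
except (urllib.error.URLError, OSError):
    raise SystemExit(1)  # keep the previous openstack_status_*.json

with open("openstack_status_%d.json" % int(time.time()), "wb") as f:
    f.write(body)
```
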
corvus | it's up and re-enqueing | 14:29 |
corvus | yeah, a status notice indicating 14:04 - 14:30 should be rechecked would probably be good | 14:30 |
fungi | 14:01 looks like? | 14:30 |
fungi | though also who knows how many events might have also been lost in its events queue | 14:31 |
AJaeger | status notice Zuul had to be restarted, all changes submitted or approved between 14:04 UTC to 14:30 need to be rechecked, we rechecked already those running at 14:04 | 14:31 |
AJaeger | is that good? Anybody better wording? | 14:31 |
corvus | AJaeger: yes, except i typod: should be 14:01. maybe just say 14:00 | 14:31 |
fungi | i would probably round that start time to 14:00 just to be on the safe side, though you could say we've queued earlier changes | 14:32 |
fungi | (we didn't technically recheck them) | 14:32 |
AJaeger | So: #status notice Zuul had to be restarted, all changes submitted or approved between 14:00 UTC to 14:30 need to be rechecked, we queued already those running at 14:00 | 14:32 |
fungi | lgtm | 14:33 |
AJaeger | #status notice Zuul had to be restarted, all changes submitted or approved between 14:00 UTC to 14:30 need to be rechecked, we queued already those running at 14:00 | 14:33 |
openstackstatus | AJaeger: sending notice | 14:33 |
-openstackstatus- NOTICE: Zuul had to be restarted, all changes submitted or approved between 14:00 UTC to 14:30 need to be rechecked, we queued already those running at 14:00 | 14:33 | |
openstackstatus | AJaeger: finished sending notice | 14:36 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Test zuul-executor on focal https://review.opendev.org/723528 | 14:38 |
corvus | done enqueing | 14:41 |
corvus | we should check in on this in a few hours and see what's going on | 14:41 |
*** mlavalle has joined #opendev | 14:41 | |
fungi | so looking at our scheduler restart timelines, we didn't restart for weeks up until 2020-04-20 | 14:42 |
fungi | there was a memory jump on 2020-04-16 but corvus you said that coincided with when you were doing stuff with repl, yeah? | 14:42 |
corvus | was it that one? or was it before that? i can't recall | 14:43 |
corvus | this is the current usage of the processes: http://paste.openstack.org/show/792884/ | 14:43 |
fungi | if we discount that, a leak could have been brought in with the 2020-04-20 13:39:31 restart (though we didn't see evidence of a leak) or the 2020-04-24 22:22:05 restart | 14:45 |
fungi | we saw our first oom event between the 2020-04-25 15:56:59 and 2020-04-28 12:12:33 restarts | 14:45 |
corvus | AJaeger: we cleared out more swap on this restart -- maybe not everything was restarted yesterday (especially zuul-web)? | 14:47 |
fungi | hrw: looking at the errors you linked, the first is about "libssl-dev : Depends: libssl1.1 (= 1.1.0g-2ubuntu4) but 1.1.1-1ubuntu2.1~18.04.5 is to be installed" which suggests there's a mismatch between an older libssl-dev you're trying to install and a libssl1.1 package | 14:56 |
mordred | corvus: should we combine the compose file and just have the scheduler start run docker-compose up scheduler -d - and the web start do up web fingergw -d ? | 14:56 |
fungi | hrw: the second looks like it's that we're not hosting a /debian/dists/buster-backports/main/source/Sources on our mirror network | 14:57 |
mordred | corvus: it seems like the experience with split files so far is not super positive | 14:57 |
corvus | mordred: i don't know; cause we may well want to split these to different hosts later, so keeping the roles separate may be good. | 14:57 |
corvus | i'm inclined to stick with the status quo a bit longer | 14:57 |
fungi | hrw: and the third looks like missing (maybe stale) content under /centos/8/AppStream/aarch64/os/repodata/ | 14:58 |
mordred | corvus: ok | 14:58 |
fungi | hrw: we should probably look into each of these as separate problems, i don't see that they're likely to be related in any way | 14:58 |
hrw | fungi: backports were hosted - we have used them for months in kolla. | 14:58 |
hrw | fungi: ok. it just happened at the same time | 14:59 |
fungi | hrw: but also, none of these look like connection failures as you originally suggested | 15:00 |
hrw | sorry | 15:00 |
hrw | will do some checking later during evening. | 15:00 |
fungi | hrw: so for the first one, it may be a problem with the mirror server in that region as i can retrieve that directory on our other servers in our mirror network, e.g. mirror.iad.rax.openstack.org/debian/dists/buster-backports/main/source/Sources | 15:02 |
fungi | i mean, that file | 15:04 |
fungi | and i guess that was actually the second failure example, not the first | 15:05 |
*** ykarel is now known as ykarel|away | 15:06 | |
fungi | okay, so second and third failures look like they may be related, as the files they complained about being unable to retrieve are available from endpoints in our other locations | 15:07 |
fungi | i think i see the problem too | 15:08 |
fungi | /dev/vda1 78G 78G 0 100% / | 15:08 |
fungi | rootfs has filled up on mirror.regionone.linaro-us.opendev.org | 15:08 |
*** larainema has joined #opendev | 15:08 | |
fungi | it's possible the first example was also a cascade failure related to being unable to fetch some file from the mirror server | 15:09 |
fungi | seeing what i can do to clean up the disk some and then i'll reboot the server to make sure it's operating correctly again, but ultimately this was built with a too-small rootfs and seems to have no separate volume to put caches on | 15:11 |
fungi | it was likely deemed "good enough" when there was very little arm64 testing going on | 15:11 |
fungi | but it's clearly insufficient now | 15:11 |
hrw | kevinz: ^^ we need to enlarge rootfs on mirror.regionone.linaro-us.opendev.org | 15:11 |
hrw | fungi: suggested size? | 15:12 |
fungi | either get cinder volumes working there or rebuild it with a flavor which has at least a 200gb rootfs | 15:12 |
hrw | fungi: thanks | 15:12 |
fungi | right now it looks like it probably used the same flavor as our test nodes, which just doesn't provide nearly enough disk space for the afs and apache proxy caches | 15:13 |
hrw | sure | 15:13 |
clarkb | mordred: corvus we need https://review.opendev.org/#/c/724115/ to land to apply zuul-web updates | 15:13 |
clarkb | until that gets fixed my changes to apache config are never applied because we fail in the zuul-scheduler portion of our zuul service playbook and never get to zuul-web | 15:14 |
fungi | hrw: kevinz: https://docs.opendev.org/opendev/system-config/latest/contribute-cloud.html actually suggests a 500gb disk, but we can get by with 200gb right now | 15:14 |
mordred | clarkb: any reason to not land that now? | 15:15 |
clarkb | mordred: I dont think so | 15:15 |
mordred | clarkb: great. +A | 15:15 |
clarkb | fungi: no lists OOMs overnight | 15:16 |
hrw | fungi: mailed kevinz to be sure it is not lost | 15:18 |
fungi | clarkb: yeah, still a bit of a memory spike between 09:00-10:00 but not like previous days | 15:18 |
clarkb | fungi: I think that was bup | 15:19 |
fungi | hrw: we've got 2.6gb free on the rootfs now, after deleting the systemd journal and rebooting | 15:19 |
hrw | fungi: thank you | 15:20 |
clarkb | fungi: hrw was it booted on a larger flavor? | 15:20 |
fungi | i'll also make sure apache tries to purge as much of its proxy cache as it thinks is stale, though that can take a while and may not buy much additional room | 15:20 |
fungi | clarkb: it was not, no, it's got an 80gb rootfs and no cinder volume | 15:20 |
hrw | clarkb: kevinz is an admin there but it is night in China now so it has to wait for tomorrow | 15:21 |
fungi | looks like it already started a fresh htcacheclean run at boot, so i won't run a separate one | 15:21 |
fungi | looks like this server was created 2020-01-22 | 15:22 |
fungi | which i guess is when the new region was being brought online | 15:22 |
fungi | i'll check the flavor list just to make sure there's not already a suitable one available to us for this | 15:23 |
fungi | maybe the mirror server was just incorrectly created | 15:23 |
corvus | clarkb: do we need a zuul-web restart, or just apache? | 15:23 |
corvus | clarkb: (once 115 lands) | 15:23 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use only timesyncd on focal https://review.opendev.org/724354 | 15:25 |
mordred | corvus: I believe just an apache | 15:25 |
mordred | corvus: and that should reduce the pressure on zuul-web aiui | 15:25 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Add loop var policy to ansible-lint https://review.opendev.org/724281 | 15:26 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Set up robots.txt on lists servers https://review.opendev.org/724356 | 15:26 |
clarkb | corvus: just apache | 15:26 |
fungi | clarkb: hrw: kevinz: i've confirmed with openstack flavor list, the largest flavor available is 80gb, but we do seem to have some cinder quota there (the arm64 nodepool builder is using 400gb of it), so i'll try to add a 200gb cinder volume for the mirror | 15:27 |
clarkb | infra-root ^ change above puppetizes the robots.txt change I made on lists.o.o which seems to have addressed the OOMing | 15:27 |
hrw | fungi: thanks | 15:27 |
AJaeger | infra-root, please review https://review.opendev.org/718478 https://review.opendev.org/724329 and https://review.opendev.org/#/c/724310 together - that's github mirroring change. Let's get those quickly in to reduce merge conflicts further... | 15:27 |
clarkb | AJaeger: I've approved the bottom of the stack but might be good to have corvus ack https://review.opendev.org/#/c/724310 since he was invovled in getting the jobs set up right | 15:29 |
fungi | hrw: kevinz: clarkb: okay, so we either need a bigger rootfs flavor or more cinder quota... VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds allowed gigabytes quota. Requested 200G, quota is 500G and 400G has been consumed. | 15:30 |
clarkb | fungi: did we have a volume that disappeared itself? | 15:30 |
clarkb | oh no its the builder | 15:30 |
fungi | i could make a 100gb volume in there, but that won't be enough | 15:30 |
fungi | clarkb: yeah, we have 500gb quota but like i said the arm64 nodepool builder is using 400 | 15:31 |
clarkb | fungi: 100GB might be enough if we put apache cache on cinder and afs cache on root disk | 15:31 |
fungi | yeah, i can probably make that work if i mount creatively | 15:31 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Test zuul-executor on focal https://review.opendev.org/723528 | 15:37 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use only timesyncd on focal https://review.opendev.org/724354 | 15:37 |
clarkb | infra-root actually https://review.opendev.org/724356 is likely to fail on the same job my zuul config reorg is failing on. We need to set at least one private hiera var in testing. I'll sort that out shortly | 15:39 |
clarkb | the other major issue is that zuul-sphinx doesn't seem to like it when you have configs in subdirs under zuul.d/ which the zuul docs say is valid | 15:39 |
clarkb | I'll work on sorting that out too | 15:39 |
fungi | hrw: a few builds in that region may get connection refused errors while i've got apache stopped to relocate its cache onto the separate cinder volume | 15:42 |
fungi | shouldn't be long | 15:42 |
hrw | fungi: ok | 15:44 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Set up robots.txt on lists servers https://review.opendev.org/724356 | 15:48 |
fungi | #status log moved /var/cache/apache2 for mirror01.regionone.linaro-us.opendev.org onto separate 100gb cinder volume to free some of its rootfs | 15:49 |
openstackstatus | fungi: finished logging | 15:49 |
clarkb | fungi: thanks! | 15:49 |
fungi | hrw: kevinz: clarkb: ^ we could still stand an additional 100gb of cinder quota so we can move the afsclient cache off the rootfs as well, but that should get us through for now | 15:50 |
fungi | ianw: ^ heads up, i know you've been dealing with that region more than most of us, so just be aware | 15:51 |
clarkb | mordred: https://zuul.opendev.org/t/openstack/build/cd423a364032445bbd9cb4f200c0c871/log/job-output.txt#51660 that's a puppet test failure for nodepool puppeting because it needs a private sshkey in 'hiera'. I expect this is going to be handled by the ansibleification. Do we want to hold off on the zuul.d reorg for that as that should fix the job? I've got to fix the lists job and zuul-sphinx (with a release) | 15:51 |
clarkb | anyway so waiting may not be terrible | 15:51 |
clarkb | mordred: note https://review.opendev.org/724356 bundles the lists fix so that the robots.txt change can land | 15:51 |
clarkb | I'm going to work on zuul-sphinx now | 15:51 |
mordred | clarkb: ++ | 15:53 |
hrw | fungi: thanks | 15:56 |
fungi | welcome! | 15:57 |
openstackgerrit | Merged openstack/project-config master: Add Github mirroring job to all official repos https://review.opendev.org/724310 | 15:58 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run zookeeper cluster in nodepool jobs https://review.opendev.org/720709 | 16:07 |
mordred | clarkb, corvus : rereview of that ^^ when you get a sec, it was missing a private hiera var | 16:08 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 16:10 |
*** ysandeep|coffee is now known as ysandeep|away | 16:12 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use only timesyncd on focal https://review.opendev.org/724354 | 16:13 |
clarkb | mordred: I think thats the fix for the nodepool job failure I had too fwiw | 16:17 |
*** Dmitrii-Sh0 has joined #opendev | 16:20 | |
*** rkukura has quit IRC | 16:21 | |
*** rkukura has joined #opendev | 16:21 | |
mordred | clarkb: yeah - I think so | 16:22 |
*** hashar has quit IRC | 16:22 | |
*** rkukura has left #opendev | 16:23 | |
clarkb | infra-root I've noticed that I never committed our host var / group var changes from friday to fix ssh keys for zuul things | 16:23 |
ttx | fungi, mordred, corvus: looking at https://zuul.openstack.org/builds?job_name=openstack-upload-github-mirror it seems to have picked up mirroring | 16:23 |
fungi | woo! | 16:23 |
ttx | The duration of those jobs is still a bit disturbing | 16:24 |
*** Dmitrii-Sh has quit IRC | 16:24 | |
*** Dmitrii-Sh0 is now known as Dmitrii-Sh | 16:24 | |
clarkb | I'm committing those changes now | 16:24 |
ttx | the mirroring itself takes about 5 seconds but there is lots of boilerplate | 16:24 |
ttx | about 50s to set up the job and 30s to close it | 16:25 |
ttx | I wonder if that blocks the executors for too long | 16:27 |
AJaeger | clarkb, fungi, can you now +A https://review.opendev.org/#/c/718478/ to remove the gerrit mirroring? | 16:27 |
fungi | ttx: the executors run ansible for multiple builds in parallel, and really only throttle taking on new builds if they start to get excessive system load or memory utilization | 16:28 |
ttx | ok, maybe keep an eye on it and see if it horribly slows down things or not | 16:29 |
ttx | at least the refs/changes cleanup was useful | 16:29 |
fungi | ttx: so running an additional 1-2 minute job for each commit which merges is probably not going to make a dent given the average number of test-hours each of those changes consumed already | 16:30 |
ttx | FYI it took about 4 days of continuous work to clean them all up | 16:30 |
fungi | that's a lot of deletions, indeed | 16:32 |
ttx | also github fails horribly when you try to delete more than 100 at a time | 16:32 |
fungi | you could have just stopped at "github fails horribly" ;) | 16:33 |
ttx | fungi: https://review.opendev.org/#/c/718478/ ready to go | 16:34 |
ttx | (removing the gerrit-level mirroring) | 16:34 |
mordred | ttx: this is very exciting - thanks for doing that work! | 16:35 |
ttx | now I can start to aggressively move abandoned things out of the openstack org | 16:35 |
ttx | thanks all for the help! | 16:36 |
AJaeger | mordred: want to +A the change after your +2 and corvus', please? It is ready now to merge | 16:37 |
mordred | AJaeger: done | 16:37 |
*** rpittau is now known as rpittau|afk | 16:37 | |
AJaeger | thx | 16:37 |
fungi | ttx: we'll still need a gerrit restart after 718478 merges | 16:38 |
fungi | since the replication plugin config is only read on gerrit service startup | 16:39 |
ttx | noted | 16:39 |
AJaeger | fungi: do we need a long downtime for it? | 16:39 |
clarkb | AJaeger: no just a restart | 16:39 |
clarkb | shouldn't be more than a couple minutes | 16:40 |
fungi | AJaeger: nope, just a quick restart | 16:40 |
AJaeger | so, something we can sneak in today? | 16:40 |
fungi | maybe, though this is a busy couple of weeks for openstack | 16:40 |
fungi | final rcs next week | 16:40 |
*** Dmitrii-Sh has quit IRC | 16:41 | |
AJaeger | if we don't restart, is there a problem that gerrit pushes and the job pushes as well? | 16:41 |
fungi | though there's not much of a rush on the gate at this point, mostly just the deployment projects trying to catch up i think | 16:42 |
*** Dmitrii-Sh has joined #opendev | 16:42 | |
openstackgerrit | Merged zuul/zuul-jobs master: Add loop var policy to ansible-lint https://review.opendev.org/724281 | 16:42 |
AJaeger | fungi: yes, RC1 is cut for most (all?) repos | 16:42 |
fungi | AJaeger: not really, it may just generate errors when it tries to replicate to nonexistent github repos | 16:42 |
fungi | (errors nobody will see unless they look at the gerrit error log) | 16:43 |
AJaeger | ;) | 16:43 |
fungi | though it might also trigger some sort of throttling behavior from github if we're bombarding them with replication attempts to nonexistent repos | 16:43 |
fungi | so probably still best avoided | 16:43 |
AJaeger | so, let's ask ttx to wait with repo removal until the restart | 16:44 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run zookeeper cluster in nodepool jobs https://review.opendev.org/720709 | 16:44 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 16:44 |
ttx | sure, np | 16:45 |
mordred | clarkb, corvus : sorry - one more time - I forgot that puppet only recognizes a single group for a host - so putting the vars in nodepool-launcher was bogus - they have to go in nodepool | 16:45 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Test zuul-executor on focal https://review.opendev.org/723528 | 16:56 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use only timesyncd on focal https://review.opendev.org/724354 | 16:56 |
*** tkajinam has quit IRC | 16:57 | |
clarkb | infra-root https://review.opendev.org/#/c/723756/ increases the system-config-run-zuul job's timeout because we get failures like https://zuul.opendev.org/t/openstack/build/7c4c4eaaaf0646c08ac0355212c6f60b | 17:01 |
clarkb | I think openafs dkms builds are a good chunk of that | 17:02 |
mordred | clarkb: I've actually got an increase in zuul-executor patch too :) | 17:02 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Increase timeout on system-config-run-zuul https://review.opendev.org/723756 | 17:03 |
AJaeger | system-config reviewer, https://review.opendev.org/723251 removes git*openstack.org from cacti, can we merge that, please? | 17:03 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run test playbooks with more forks https://review.opendev.org/724317 | 17:04 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Test zuul-executor on focal https://review.opendev.org/723528 | 17:04 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use only timesyncd on focal https://review.opendev.org/724354 | 17:04 |
mordred | clarkb: ^^ rebased yours then rebased the focal stack on top of it so I could de-dupe the timeout increase | 17:05 |
clarkb | mordred: cool | 17:05 |
clarkb | and actually my system-config reorg doesn't have that in it | 17:05 |
clarkb | so once we think all three of the failing jobs on ^ are fixed I should refresh it with up to date content again | 17:05 |
clarkb | then try and land it fast :) | 17:06 |
mordred | clarkb: :) | 17:08 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 17:19 |
mordred | clarkb: https://review.opendev.org/#/c/720709/ is green now | 17:29 |
AJaeger | mordred: could I trouble you with reviewing 723251, please? Will make scrolling for zuul in cacti a tiny bit easier ;) | 17:29 |
mordred | corvus: https://review.opendev.org/#/c/723889/ could use a re-review | 17:29 |
clarkb | mordred: +A | 17:30 |
*** priteau has quit IRC | 17:32 | |
AJaeger | mordred: thx | 17:32 |
*** diablo_rojo has joined #opendev | 17:33 | |
*** priteau has joined #opendev | 17:34 | |
*** priteau has quit IRC | 17:40 | |
*** ralonsoh has quit IRC | 17:41 | |
openstackgerrit | Clark Boylan proposed opendev/puppet-mailman master: Create /srv/mailman https://review.opendev.org/724389 | 17:47 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Set up robots.txt on lists servers https://review.opendev.org/724356 | 17:48 |
clarkb | mordred: ^ eventually we should get a working test there | 17:48 |
openstackgerrit | Clark Boylan proposed opendev/puppet-mailman master: Create /srv/mailman https://review.opendev.org/724389 | 18:03 |
mordred | clarkb, corvus: woot - https://review.opendev.org/#/c/720527/ is green again | 18:03 |
*** priteau has joined #opendev | 18:10 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Test zuul-executor on focal https://review.opendev.org/723528 | 18:11 |
clarkb | mordred: I need to pop out for a bit but will try and properly review that one after | 18:12 |
*** priteau has quit IRC | 18:14 | |
mordred | clarkb: maybe by the time you get back the executor-on-focal patch will be grreen too | 18:26 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Support multi-arch image builds with docker buildx https://review.opendev.org/722339 | 18:37 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: DNM Run builder tests on expanded node https://review.opendev.org/724079 | 18:37 |
*** iurygregory has quit IRC | 18:38 | |
*** Dmitrii-Sh1 has joined #opendev | 18:38 | |
*** Dmitrii-Sh has quit IRC | 18:42 | |
*** Dmitrii-Sh1 is now known as Dmitrii-Sh | 18:42 | |
clarkb | mordred: I don't think the system-config-run jobs that use puppet are doing depends-on properly | 19:02 |
clarkb | mordred: https://review.opendev.org/#/c/724356/3 that is still failing even though its parent creates the dir it complains about | 19:03 |
corvus | where do modules like puppet-mailman get installed? | 19:40 |
clarkb | corvus: I think /etc/puppet/modules; then they are synchronized by ansible onto remote hosts into the puppet install | 19:42 |
clarkb | thats /etc/puppet/modules on bridge | 19:43 |
*** hashar has joined #opendev | 19:43 | |
clarkb | looking at production bridge that seems to be the case | 19:43 |
corvus | ansible-role-puppet does the copying from bridge to remote node | 19:44 |
corvus | so what, in the system-config-run job, puts the repo in /etc/puppet/modules? | 19:44 |
*** priteau has joined #opendev | 19:44 | |
clarkb | corvus: looks like playbooks/roles/install-ansible/tasks/main.yaml calling install_modules.sh | 19:45 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Support multi-arch image builds with docker buildx https://review.opendev.org/722339 | 19:45 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: DNM Run builder tests on expanded node https://review.opendev.org/724079 | 19:45 |
mordred | yes - install-ansible | 19:46 |
mordred | clarkb, corvus and yes - getting install-modules to use zuul prepared repos is one of the next things on my list | 19:46 |
clarkb | mordred: ok in this case should we just land https://review.opendev.org/724389 and recheck https://review.opendev.org/724356 ? | 19:47 |
mordred | we can actually simplify it quite a bit | 19:47 |
mordred | yes - I think that would be better than waiting on reworking install-modules | 19:47 |
clarkb | mordred: https://review.opendev.org/#/c/720709/ failed on timed out nodepool job fwiw | 19:47 |
mordred | I also think we can improve it even past just installing from zuul - to not syncing all of the puppet to every host | 19:48 |
mordred | but I think that's two steps | 19:48 |
mordred | clarkb: sigh. afs modules | 19:48 |
mordred | clarkb: we should maybe bump the timeout there in the same way | 19:49 |
clarkb | if afs is involved then very likely | 19:49 |
mordred | clarkb: I've also got a patch up to increase the forks setting | 19:49 |
openstackgerrit | Merged opendev/system-config master: Update to tip of master in periodic jobs https://review.opendev.org/723889 | 19:49 |
mordred | so that we do more in parallel on the jobs with multiple hosts | 19:49 |
mordred | clarkb: https://review.opendev.org/#/c/724317/ | 19:50 |
corvus | mordred: i don't understand how afs relates to 709 | 19:50 |
mordred | corvus: it runs the nodepool job, which is installing the afs package - which compiles the kernel module | 19:51 |
corvus | why does the nodepool job install the afs package? | 19:51 |
mordred | oh - wait. it doesn't | 19:51 |
mordred | nevermind - I'm dump | 19:51 |
mordred | dumb | 19:51 |
*** priteau has quit IRC | 19:51 | |
* mordred is confusing launchers and executors again | 19:51 | |
mordred | lemme go see what went on | 19:51 |
openstackgerrit | Merged opendev/system-config master: Disable global Github replication https://review.opendev.org/718478 | 19:53 |
openstackgerrit | Merged opendev/system-config master: Remove git*.openstack.org https://review.opendev.org/723251 | 19:53 |
mordred | corvus: if I'm not reading the log wrong, I think we're running the run playbooks serially - like, there doesn't seem to be much in the way of parallelism | 20:03 |
corvus | mordred: i'm having a lot of trouble reading those logs, they seem to mostly be rsync file lists; can you summarise what you're seeing? | 20:04 |
corvus | mordred: like what playbook? i thought we're only running one -- the service-nodepool playbook? | 20:05 |
corvus | er what playbooks | 20:05 |
mordred | yes - that's what I mean - I think our run of service-nodepool is taking longer because the change added more hosts and we don't have good parallelism | 20:07 |
mordred | (trying to find some good links - this is still just in hypothesis range) | 20:08 |
clarkb | mordred: few things on https://review.opendev.org/#/c/720527 | 20:08 |
corvus | mordred: right, but there's only one playbook, so you're saying the playbook itself has poor paralellism? | 20:08 |
mordred | yes - I'm saying I think this: https://review.opendev.org/#/c/724317/2/playbooks/zuul/run-base.yaml might help with that | 20:08 |
corvus | mordred: well default is 5 and we don't have more than 5 hosts | 20:09 |
corvus | so that should help in prod but not test? | 20:09 |
mordred | oh - right - the default is 5 isn't it | 20:09 |
clarkb | ansible will do each task across all hosts before moving to the next task right? | 20:10 |
mordred | then I agree - that won't be super helpful for this one - I think it helped the zuul change because there are 6 hosts there | 20:10 |
clarkb | might be quicker to have them run free if that isn't going to cause problems | 20:10 |
*** avass has quit IRC | 20:10 | |
corvus | the plays are free | 20:11 |
clarkb | k | 20:11 |
corvus | but the playbook is still a series of plays | 20:11 |
corvus | it's nb01,nl01 followed by nb04, followed by nb01 | 20:12 |
corvus | and i'm guessing that happens after the zk stuff | 20:12 |
mordred | yeah - actually - I think we could streamline a bit by rearranging that and de-duplicating a bit | 20:13 |
corvus | mordred: i think we could rework the service-nodepool playbook to be more parallel but with a little more yaml repetition | 20:13 |
corvus | heh, i would describe it as re-duplicating :) | 20:13 |
mordred | yeah | 20:13 |
corvus | i'm not quite sure what that looks like though; this seems a little hard to describe in ansible | 20:14 |
mordred | corvus: https://etherpad.opendev.org/p/aDp6AHnb84UKlAD7LUvH | 20:14 |
mordred | corvus: there's one change that I think would add more parallism for install-docker, nodepool-base and configure-openstacksdk | 20:15 |
mordred | we could further make configure-openstacksdk happen on all of them at the same time | 20:15 |
corvus | structurally, isn't the way to get more parallelism to have fewer plays? | 20:15 |
mordred | forks only helps within a play | 20:17 |
mordred | so in the cases where we have a given role spread into two plays, we can't do that role in parallel - we're doing it twice on different sets of hosts | 20:18 |
mordred | corvus: so - in that etherpad, we're always doing each role on all of the hosts - and only listing it once | 20:18 |
mordred | it's possible it's a horrible strategy | 20:18 |
corvus | i think that's more serial | 20:19 |
mordred | I think it's both more serial and more parallel :) | 20:19 |
corvus | mordred: in https://github.com/ansible/proposals/issues/31 bcoca says "have your cd system run multiple ansible-playbook processes" | 20:19 |
mordred | yeah - really here the issue is that we have one service-nodepool playbook doing 3 completely different things | 20:20 |
corvus | mordred: we only have one of each host in the test, so if a play only applies to one class of host, then it's as serial as it can possibly get in the test, right? 1 play, one host. | 20:20 |
mordred | corvus: yeah - that's why I was combining things like install-docker into a play with more hosts | 20:20 |
mordred | so that we wouldn't run it twice across the series - but instead would only ever run it once | 20:20 |
mordred | but I think the real answer is to decompose this into more playbooks | 20:21 |
mordred | and trigger them differently even | 20:21 |
corvus | there's one other alternative but its ugly | 20:21 |
mordred | yeah? | 20:21 |
corvus | a playbook with a task list that does conditional role inclusion | 20:21 |
mordred | eww | 20:21 |
mordred | yeah | 20:21 |
corvus | it's the best way to get high parallelism with the playbook construction we have now | 20:21 |
mordred | this is true | 20:22 |
corvus | if we want to split the playbooks into "nodepool-builder" "nodepool-launcher" "new-style-nodepool-builder" then we still also need a way to run those in parallel | 20:22 |
mordred | want me to try that? also - this is an extra bad case currently because we've got multiple different deploy strategies going on at the same time | 20:22 |
mordred | corvus: well - for tests we don't really need to run all of them in the same test | 20:23 |
corvus | mordred: yeah, but there's a huge test setup cost | 20:23 |
mordred | yeah | 20:23 |
corvus | mordred: we could also bump the timeout here and kick this down the road until we're not straddling puppet | 20:23 |
mordred | corvus: yeah. I think we do need to make this better - but this is a particularly bad version of this | 20:23 |
corvus | it still may be worth improving parallelism then, but that might just be a single role | 20:24 |
mordred | yeah - a bunch more of them get to be shared - like nodepool-base | 20:24 |
corvus | so our calculus about whether to split the playbook vs do conditional role inclusion may be a lot different | 20:24 |
mordred | ++ | 20:24 |
mordred | that said - maybe we should have the rsync of puppet modules be quieter? | 20:25 |
mordred | because there's just a crapton of rsyncing output | 20:25 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run zookeeper cluster in nodepool jobs https://review.opendev.org/720709 | 20:27 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 20:27 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Increase timeout for run-service-nodepool https://review.opendev.org/724415 | 20:27 |
mordred | corvus: while we're on the topic - wanna review https://review.opendev.org/#/c/724317/ and https://review.opendev.org/#/c/723756/ ? | 20:28 |
*** mrunge has quit IRC | 20:31 | |
corvus | mordred: yeah, we should dial down the rsync :) | 20:31 |
openstackgerrit | Monty Taylor proposed opendev/ansible-role-puppet master: Add flag to control logging the rsyncs https://review.opendev.org/724418 | 20:34 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Stop logging the rsync of puppet https://review.opendev.org/724419 | 20:35 |
mordred | corvus: ^^ done | 20:35 |
clarkb | mordred: did you see my note on https://review.opendev.org/#/c/720527/ ? | 20:43 |
mordred | clarkb: I did not - but do now - and agree - thank you | 20:44 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 20:51 |
mordred | clarkb: responded and fixed | 20:51 |
clarkb | mordred: the problem with that first testinfra condition is the not | 20:52 |
clarkb | mordred: I think its still wrong? | 20:52 |
mordred | clarkb: oh - duh | 20:59 |
mordred | it wants to be "if not nl, skip | 20:59 |
clarkb | ya | 20:59 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 21:00 |
mordred | yah. with you now | 21:00 |
clarkb | cool I'm going to pop out for a bike ride now. If anyone is able to review thsoe mailman changes, particularly https://review.opendev.org/#/c/724389/ so https://review.opendev.org/#/c/724356/3 can be rechecked that would be great | 21:04 |
clarkb | that coupled with zuul-sphinx release and mordred's nodepool work should make it possible for us to split the zuul configs in system-config | 21:04 |
openstackgerrit | Merged opendev/system-config master: Increase timeout on system-config-run-zuul https://review.opendev.org/723756 | 21:08 |
openstackgerrit | Merged opendev/system-config master: Run test playbooks with more forks https://review.opendev.org/724317 | 21:08 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run zookeeper cluster in nodepool jobs https://review.opendev.org/720709 | 21:18 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 21:18 |
mordred | clarkb, corvus : sigh - I had to squash the timeout change with the zk change because of a different fix that's also in the zk change. so I need to ask for another re-review | 21:19 |
corvus | mordred: i'm going to +3 for clarkb on that one | 21:30 |
mordred | corvus: cool | 21:32 |
*** hashar has quit IRC | 21:35 | |
mordred | corvus: oh wow - going for it with the launchers! | 21:36 |
mordred | corvus: I believe we're going to need to chown things on them like we did on the zuul nodes - but obviously things won't be down in a user-facing way while we do | 21:37 |
corvus | mordred: er, should i not have +3d that? i thought they had all been approved already | 21:38 |
corvus | mordred: i'll change that to a +2 | 21:39 |
*** DSpider has quit IRC | 21:39 | |
mordred | corvus: I mean - honestly it's probably fine as long as we're ready to go chown things | 21:40 |
mordred | but - yeah - maybe we should land it when we're all watching just in case? | 21:40 |
corvus | i've got enough other stuff going on | 21:41 |
corvus | it looks like memory use on zuul01 has increased greatly | 21:41 |
corvus | i'll start looking into that | 21:41 |
mordred | corvus: ++ | 21:42 |
corvus | it does appear that it's actually zuul-web that's the lion's share | 21:42 |
mordred | corvus: gross (although I suppose it's better that than a scheduler memory leak) | 21:42 |
mordred | corvus: have we restarted apache with clarkb's changes yet? | 21:43 |
mordred | corvus: nope | 21:43 |
corvus | here's the current snapshot: http://paste.openstack.org/show/792904/ | 21:43 |
mordred | corvus: https://review.opendev.org/#/c/724115/ is the patch that fixes ansible so that the zuul-web changes can get applied | 21:44 |
corvus | mordred: we may have one of those too | 21:44 |
mordred | and it hit a timeout - which we have now fixed in the job | 21:44 |
mordred | so I'm going to recheck that | 21:44 |
corvus | yeah i think maybe we don't need to be landing any prod changes that aren't fixing things at this point :) | 21:44 |
mordred | ++ | 21:44 |
mordred | corvus: how about I enqueue that one into the gate | 21:44 |
corvus | so is the apache fix on disk? | 21:44 |
mordred | no | 21:45 |
fungi | took a peek at zuul.o.o memory utilization just now, still growing well beyond our normal levels and zuul-web seems to be consuming the lion's share | 21:45 |
fungi | now i see you just switched gears to talking about that | 21:45 |
corvus | mordred: then yes please | 21:45 |
mordred | it's not been applied because ansible keeps bombing out | 21:45 |
mordred | corvus: ok. enqueued | 21:45 |
corvus | i'll restart zuul-web to buy us more time | 21:45 |
mordred | that should fix the ansible run which should then restart apache with the new caching settings in place | 21:46 |
mordred | (that was one of those fun rabbit holes) | 21:46 |
fungi | i agree, this seems to be growing fast enough it almost had to be something in the restart on friday/saturday | 21:47 |
mordred | I think the friday restart was _definitely_ involved with the zuul-web issue | 21:47 |
mordred | given the followup apache fix | 21:48 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use only timesyncd on focal https://review.opendev.org/724354 | 21:49 |
fungi | presumably something which merged between 2020-04-20 14:03:36 and 2020-04-24 22:42:48 | 21:49 |
corvus | there was a cherrypy release on april 17 | 21:49 |
corvus | last one before that was nov 27; we should keep that in mind | 21:49 |
fungi | otherwise i think we would have been seeing higher memory utilization late last week, which cacti doesn't indicate | 21:50 |
corvus | it's probably good to reduce the exposure of zuul-web, however, i'm not sure it should be using linearly growing memory under any circumstances | 21:54 |
corvus | so i'm not sure we should expect or consider the apache changes to be a fix for the memory use | 21:54 |
mordred | corvus: no - if anything I expect them at best to just be a damper on the growth - if the issue is that cherrypy is getting hit for status.json more frequently and for some reason it's leaking that (that being likely to be the largest object it interacts with frequently) - then the apache config issue could simply exacerbate it | 21:57 |
mordred | but I completely agree - it's not reasonable for zuul-web to use memory like that | 21:57 |
corvus | i wonder if we should try a pin to cherrypy release-1 | 21:57 |
mordred | corvus: worth a try - the memory growth is quick enough we should be able to get some data by having done so | 21:58 |
mordred | corvus: (we should collect a baseline growth with apache config in place first) | 21:59 |
mordred | so that we're not comparing apples to oranges | 21:59 |
corvus | mordred: or even temporarily revert the apache change if it "helps" too much | 21:59 |
mordred | yah | 22:00 |
mordred | corvus: speaking of: https://review.opendev.org/#/c/723855/ | 22:00 |
mordred | corvus: so there is apparently another reason to pin back anyway | 22:01 |
mordred | corvus: I need to afk for a bit - you in ok shape for me to do that? | 22:01 |
corvus | mordred: yeah, but i don't understand that commit message | 22:02 |
fungi | we should be able to do that without a full scheduler restart, just restarting zuul-web, right? | 22:02 |
corvus | yeah | 22:03 |
mordred | corvus: I think the issue is in prepping for the new depsolver from pip | 22:04 |
mordred | corvus: since we pin cheroot and that is different from what cherrypy declares, current pip will install it, but the pip check which will apply the depsolver will fail - and future pip+depsolver will fail | 22:04 |
mordred | since future pip will refuse to install conflicting sets of dependencies | 22:05 |
corvus | mordred: thanks | 22:05 |
*** mlavalle has quit IRC | 22:09 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run nodepool launchers with ansible and containers https://review.opendev.org/720527 | 22:11 |
*** mlavalle has joined #opendev | 22:12 | |
*** mlavalle has quit IRC | 22:20 | |
openstackgerrit | Merged opendev/system-config master: Don't restart the zuul scheduler in prod https://review.opendev.org/724115 | 22:22 |
openstackgerrit | Merged opendev/system-config master: Run zookeeper cluster in nodepool jobs https://review.opendev.org/720709 | 22:22 |
*** mlavalle has joined #opendev | 22:26 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: install-docker: remove arch match https://review.opendev.org/724435 | 22:39 |
clarkb | mordred: corvus anything I can do to help with that ? | 22:40 |
clarkb | I'm back from biking | 22:40 |
clarkb | that == zuul web stuff | 22:41 |
corvus | clarkb: i think the next step is to restart once we have an image with https://review.opendev.org/723855 | 22:46 |
clarkb | rgr | 22:46 |
corvus | that should confirm or eliminate cherrypy as a cause, but we need to keep your apache fixes in mind -- we might want to temporarily revert them on disk to reduce variables | 22:46 |
clarkb | k | 22:48 |
clarkb | though I don't think my changes have taken effect yet | 22:49 |
clarkb | (so all of the leaking was without them) | 22:49 |
corvus | clarkb: right, but they're about to (or just have) since 724115 merged | 22:50 |
clarkb | ya | 22:50 |
*** tkajinam has joined #opendev | 22:51 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add system-config-run-base-arm64 https://review.opendev.org/724439 | 23:04 |
openstackgerrit | Merged opendev/system-config master: Add --fail flag to zuul status backup curl https://review.opendev.org/723896 | 23:05 |
*** tosky has quit IRC | 23:11 | |
ianw | what is Open-E JovianDSS CI? | 23:30 |
clarkb | ianw: what is the context? | 23:36 |
ianw | clarkb: sorry i probably should have said "have we had any communication from ... " | 23:38 |
ianw | it's been leaving config error comments on devstack changes (at least) for a while | 23:39 |
clarkb | not that I am aware of | 23:39 |
fungi | ianw: i'm guessing some third-party ci system for a cinder driver | 23:39 |
clarkb | usually we can disable those if they are too chatty | 23:39 |
clarkb | disable the account in gerrit I mean | 23:39 |
fungi | web searches for "Open-E JovianDSS" turn up network-attached storage devices | 23:40 |
fungi | sounds like it may be misconfigured to report on the wrong project(s) | 23:41 |
clarkb | my changes to the zuul vhost config have applied | 23:41 |
clarkb | they appear to be functional | 23:41 |
clarkb | my browser says I'm getting back gzipped js and css files now which is good and transfer time for those seems a bit slower | 23:41 |
clarkb | the caching changes should only apply to status.json though | 23:41 |
ianw | i'll reach out ... https://wiki.openstack.org/wiki/ThirdPartySystems/Open-E_JovianDSS_CI | 23:42 |
clarkb | we could extend the caching to apply to static resources too? | 23:42 |
clarkb | anyway hopefully caching the status json helps with memory use | 23:42 |
clarkb | (we are using the disk not memory cache iirc) | 23:42 |
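
A quick way to spot-check the same thing without a browser; the endpoint is the tenant status URL and the header interpretation is a rough heuristic (mod_cache normally adds an Age header to responses it serves from cache):

```python
import urllib.request

# Inspect what the Apache proxy returns for status.json: Content-Encoding
# shows whether gzip negotiation works, and an Age header suggests the
# response came from mod_cache rather than straight from zuul-web.
req = urllib.request.Request(
    "https://zuul.opendev.org/api/tenant/openstack/status",
    headers={"Accept-Encoding": "gzip"})
with urllib.request.urlopen(req, timeout=30) as resp:
    for name in ("Content-Encoding", "Cache-Control", "Age", "Content-Length"):
        print(name, ":", resp.headers.get(name))
```
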
ianw | for the record i've sent a mail to the contacts there asking to take a look, i'll log it in qa too | 23:43 |
clarkb | corvus: looks like we failed to promote that image for zuul-web | 23:44 |
clarkb | corvus: I think we may have suppressed the logging of why it failed too | 23:44 |
clarkb | https://zuul.opendev.org/t/zuul/build/dd5783de71054298904f7890adf45529/console#1/0/3/localhost | 23:44 |