ianw | so weird, because pip verbosity doesn't even seem to affect the verbosity of the uwsgi build bits | 00:05 |
---|---|---|
clarkb | no, but I think it does affect buffering due to python stuff | 00:07 |
ianw | could try reducing CPUCOUNT= to see if it's some sort of dependency thing ... but no error at all ... | 00:07 |
clarkb | that's the other weird thing: it says it failed but doesn't say how or why | 00:09 |
fungi | feels like maybe it's saying why on stderr and pip is bitbucketing that | 00:10 |
ianw | you could also try python uwsgiconfig.py --build directly maybe? | 00:11 |
clarkb | hrm ya we could try that. Would have to clone the repo rather than relying on pypi but that seems possible | 00:11 |
clarkb | let me try that | 00:11 |
opendevreview | Clark Boylan proposed opendev/system-config master: Try building uWSGI directly https://review.opendev.org/c/opendev/system-config/+/821631 | 00:23 |
clarkb | that isn't mergeable because `python uwsgiconfig.py --build` doesn't produce a wheel. But maybe it will give us insight if we can make it fail | 00:23 |
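A rough sketch of the direct-build experiment being proposed here, assuming a checkout of the upstream repo (the uwsgiconfig.py entry point and the CPUCOUNT knob are the ones discussed above; the clone URL and working directory are illustrative):

```shell
# Build uWSGI via its own build script instead of through pip, so any error
# output is not swallowed by pip's build isolation.
git clone https://github.com/unbit/uwsgi.git
cd uwsgi
CPUCOUNT=1 python3 uwsgiconfig.py --build
```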
ianw | ++ | 00:30 |
clarkb | updated gerritbot is running now. Anyone have a change to update? | 00:33 |
clarkb | All three of the uwsgi bullseye builds when built directly seem fine: https://zuul.opendev.org/t/openstack/build/9c401ac728ed44ab87ce77d368245c6d/log/job-output.txt#1767 https://zuul.opendev.org/t/openstack/build/1ca52f43ac834b5e95d047d91918b1da/log/job-output.txt#1764 https://zuul.opendev.org/t/openstack/build/74c2c9ff1b8948c18d7f5841430a7554/log/job-output.txt#1804 | 00:35 |
ianw | sigh | 00:36 |
ianw | the only other thing i can think is run it under strace with a really big -s value | 00:36 |
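A hedged sketch of what running the build under strace with a large -s value might look like (the output path and the exact pip invocation are assumptions):

```shell
# Follow forked children (-f), raise the string cutoff so long stderr writes
# are captured in full (-s), and write the trace somewhere inspectable.
strace -f -s 8192 -o /tmp/uwsgi-build.trace pip3 wheel uwsgi
# Afterwards, look at what was written to stderr (fd 2) near the failure:
grep 'write(2,' /tmp/uwsgi-build.trace | tail -n 50
```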
clarkb | I think we pushed an event for magnum. trying to verify with logs now (as I'm not in that channel) | 00:39 |
clarkb | hrm no these are all comment added events which we don't notify for | 00:40 |
clarkb | aha it logged we sent something to #tacker | 00:42 |
clarkb | yup its there. I'll link to it as soon as our htmlification runs | 00:42 |
clarkb | But I think gerritbot is good | 00:42 |
clarkb | https://meetings.opendev.org/irclogs/%23tacker/%23tacker.2021-12-14.log.html#t2021-12-14T00:39:41 this was from the new bot | 00:46 |
clarkb | s/new/updated | 00:46 |
clarkb | ianw: ya or maybe hold a node like fungi suggests and see if it is consistent on specific nodes (then we can try all manner of debugging) | 00:47 |
clarkb | But I'm running out of time today. I'll see if I can pick this up tomorrow | 00:47 |
clarkb | Need to figure out dinner now | 00:47 |
*** rlandy|ruck is now known as rlandy|out | 00:54 | |
ianw | ok, i'll keep thinking | 01:01 |
opendevreview | Ian Wienand proposed opendev/infra-specs master: zuul-credentials : new spec https://review.opendev.org/c/opendev/infra-specs/+/821645 | 03:58 |
opendevreview | Merged openstack/project-config master: Add openEuler 20.03 LTS SP2 node https://review.opendev.org/c/openstack/project-config/+/818723 | 04:56 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: Update Fedora latest nodeset to 35 https://review.opendev.org/c/opendev/base-jobs/+/821649 | 05:00 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: Add 8-stream-arm64 and 9-stream nodesets https://review.opendev.org/c/opendev/base-jobs/+/821650 | 05:00 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: Switch 9-stream testing to use opendev mirrors https://review.opendev.org/c/openstack/diskimage-builder/+/821651 | 05:05 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: Add debian-bullseye-arm64 build test https://review.opendev.org/c/openstack/diskimage-builder/+/821652 | 05:16 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: Add debian-bullseye-arm64 build test https://review.opendev.org/c/openstack/diskimage-builder/+/821652 | 05:24 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: Add 9-stream ARM64 testing https://review.opendev.org/c/openstack/diskimage-builder/+/821653 | 05:24 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: debian-minimal: remove old testing targets https://review.opendev.org/c/openstack/diskimage-builder/+/821654 | 05:24 |
*** ysandeep|out is now known as ysandeep | 05:27 | |
opendevreview | chandan kumar proposed openstack/diskimage-builder master: Revert "Fix BLS based bootloader installation" https://review.opendev.org/c/openstack/diskimage-builder/+/821526 | 06:14 |
*** sshnaidm|afk is now known as sshnaidm | 06:57 | |
*** ysandeep is now known as ysandeep|lunch | 07:26 | |
opendevreview | Merged openstack/diskimage-builder master: Use OpenDev mirrors for 8-stream CI builds https://review.opendev.org/c/openstack/diskimage-builder/+/820978 | 07:38 |
*** ysandeep|lunch is now known as ysandeep | 08:41 | |
*** ysandeep is now known as ysandeep|afk | 09:35 | |
opendevreview | Lajos Katona proposed opendev/elastic-recheck master: Add query for bug 1954663 https://review.opendev.org/c/opendev/elastic-recheck/+/821684 | 09:56 |
opendevreview | Lajos Katona proposed opendev/elastic-recheck master: Add query for bug 1799790 https://review.opendev.org/c/opendev/elastic-recheck/+/821684 | 10:22 |
*** ysandeep|afk is now known as ysandeep | 10:32 | |
*** rlandy is now known as rlandy|ruck | 11:13 | |
*** jpena|off is now known as jpena | 11:42 | |
dtantsur | hey folks! any issues with pypi mirrors? we see a ton of random errors today. | 12:15 |
dtantsur | see https://review.opendev.org/c/openstack/ironic/+/821010 for example | 12:16 |
ykarel | seeing a lot of those in neutron too, too many red in https://zuul.opendev.org/t/openstack/status, seems only some providers are impacted | 12:19 |
fungi | dtantsur: the first one i looked at seems to be complaining about a dependency conflict between openstackdocstheme and constraints over dulwich, are they all like that? | 12:19 |
dtantsur | different packages | 12:19 |
fungi | ykarel: providers in/around montreal canada again? | 12:19 |
ykarel | fungi, yes at least i noticed in those | 12:20 |
ykarel | iweb-mtl01 | 12:20 |
fungi | iweb mtl01, vexxhost ya-cmq-1, and ovh bhs1 are all in that area | 12:20 |
*** ysandeep is now known as ysandeep|brb | 12:21 | |
fungi | er, vexxhost ca-ymq-1 | 12:21 |
fungi | i'm still pre-coffee | 12:21 |
ykarel | also seen in rax-iad | 12:21 |
fungi | definitely not that region, that's virginia/washington dc | 12:22 |
fungi | so whatever's going on with pypi is probably more global | 12:22 |
*** outbrito_ is now known as outbrito | 12:22 | |
fungi | is it only pypi-related errors, or problems with other content too? | 12:23 |
ykarel | i noticed only pypi till now | 12:23 |
dtantsur | same | 12:23 |
fungi | pypi's just a caching proxy in our case, so it looks like pypi is probably serving us stale or incomplete indices again | 12:24 |
fungi | if we can figure out which specific package(s) is/are impacted, we can issue requests to their cdn to refetch from pypi's backend | 12:25 |
fungi | dulwich===0.20.26 seems to probably be one | 12:26 |
dtantsur | keystone 20.1.0.dev19 depends on PyJWT>=1.6.1 The user requested (constraint) pyjwt===2.3.0 | 12:27 |
fungi | yeah, it's usually whatever constraint it's complaining about that it couldn't find in those cases | 12:28 |
dtantsur | ironic 19.0.1.dev7 depends on pecan!=1.0.2, !=1.0.3, !=1.0.4, !=1.2 and >=1.0.0 The user requested (constraint) pecan===1.4.1 | 12:28 |
dtantsur | fungi: looks like dulwich, pyjwt and pecan in our case | 12:29 |
ykarel | http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22The%20user%20requested%20(constraint)%5C%22 | 12:29 |
ykarel | yeap pecan/dulwich/pyjwt | 12:29 |
ykarel | and providers: rax-iad, iweb-mtl01, rax-dfw, | 12:31 |
fungi | thanks, those are the only packages i've seen in the errors so far as well. i'll dig up my notes on how to ask fastly to refresh those indices | 12:31 |
ykarel | and ovh-bhs1 | 12:32 |
frickler | fungi: curl -XPURGE https://pypi.org/simple , just did that | 12:34 |
fungi | i've done like `curl -XPURGE https://pypi.org/simple/dulwich` with and without a trailing / for each of the identified package names | 12:35 |
fungi | from each of the mirrors, in case it matters which endpoint cluster they're sending it to | 12:36 |
ykarel | also seen few failures for python-ironic-inspector-client===4.7.0 in provider airship-kna1 | 12:37 |
jrosser | we see the same keystone/pyjwt problem in some OSA jobs | 12:39 |
fungi | i've now done it for python-ironic-inspector-client as well | 12:39 |
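The purge requests described above, expressed as a small loop (a sketch; the package list is the one identified in the failing jobs, and both the with- and without-trailing-slash forms are sent since it is unclear which the CDN keys on):

```shell
for pkg in dulwich pyjwt pecan python-ironic-inspector-client; do
    # Ask Fastly to drop its cached copy of the simple index for this package
    # so the next request refetches from the PyPI backend.
    curl -XPURGE "https://pypi.org/simple/${pkg}"
    curl -XPURGE "https://pypi.org/simple/${pkg}/"
done
```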
ykarel | ack Thanks fungi | 12:45 |
fungi | here's hoping it helps. if some fastly endpoints simply refresh from the same stale backend again, then we're not any better off | 12:46 |
ykarel | ack lets see how it goes | 12:47 |
*** ysandeep|brb is now known as ysandeep | 12:48 | |
opendevreview | yatin proposed openstack/project-config master: Update Neutron's Grafana as per recent changes https://review.opendev.org/c/openstack/project-config/+/821706 | 13:30 |
jrosser | clarkb: fungi i may have reproduced the uwsgi build failure https://paste.opendev.org/show/811652/ | 13:43 |
jrosser | hacking the code a bit to import builtins and switching __builtins__.compile for builtins.compile makes it work | 13:44 |
jrosser | but that is now the limit of my python understanding | 13:44 |
fungi | oh weird! | 13:54 |
fungi | if it's that, i wonder why pip is eating the error details | 13:55 |
opendevreview | Merged openstack/project-config master: Update Neutron's Grafana as per recent changes https://review.opendev.org/c/openstack/project-config/+/821706 | 13:56 |
fungi | jrosser: and also i wonder why it only fails for us sometimes | 13:57 |
jrosser | fungi: i'm not sure what is going on tbh - if your build is run through a script or something and stderr gets lost? | 13:58 |
fungi | it's being built by pip which is downloading the sdist and installing it | 13:58 |
jrosser | so locally, when i build with the makefile it's completely fine | 13:58 |
jrosser | but if i `pip3 wheel .` in the same directory it looks like it fails exactly at the point you saw yesterday | 13:59 |
fungi | yeah, that seems like more than mere coincidence, i agree | 13:59 |
jrosser | and for $reasons, messing with how it finds __builtin__.compile fixes it | 14:00 |
jrosser | reason i had a dig was that we build uwsgi on every OSA bullseye job and never see anything like this | 14:00 |
jrosser | fungi: with CPUCOUNT=1 the output is not confused with threading, so you can see exactly where it fails https://paste.opendev.org/show/811661/ | 14:03 |
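jrosser's reproduction, roughly (assumed to be run from a checkout of the uwsgi source on bullseye; CPUCOUNT=1 serializes the build so the failure point is readable):

```shell
# From a uwsgi source checkout: the makefile path builds fine, while the pip
# path fails at the point shown in the paste above.
make                      # fine
CPUCOUNT=1 pip3 wheel .   # fails while building plugins/python
```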
fungi | it came up for us when switching from debian buster to bullseye based python container images | 14:03 |
jrosser | yeah, and i think it's when it enters plugins/python that it errors | 14:04 |
jrosser | which may point to python version | 14:04 |
fungi | interestingly, we used python3.7 built on both buster and bullseye in this case | 14:05 |
fungi | switching from the buster 3.7 to bullseye 3.7 images is when we started to run into it | 14:06 |
fungi | but yeah, i have a feeling it's something like a race related to concurrency because whether or not we hit it seems to be influenced by simple things like increasing verbosity | 14:07 |
fungi | classic heisenbug | 14:07 |
fungi | up the logging so you can observe, and you influence the outcome so it stops breaking | 14:07 |
fungi | i'm not finding any examples like yours via a web search, so probably not common | 14:10 |
fungi | their issue tracker is littered with people reporting linker errors on macos | 14:12 |
jrosser | no, i also had a search and didnt find anything | 14:13 |
*** ysandeep is now known as ysandeep|out | 14:13 | |
jrosser | there must be a detail difference between import builtins and __builtins__ in the context of the pip build | 14:13 |
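A minimal illustration of the difference being puzzled over here (this is not the uwsgi code, just documented CPython behaviour: __builtins__ is the builtins module when a file runs as __main__, but a plain dict when the same file is imported, so __builtins__.compile only works in the former case while builtins.compile works in both):

```shell
cat > /tmp/demo.py <<'EOF'
import builtins
print("type(__builtins__):", type(__builtins__))
print("builtins.compile available:", callable(builtins.compile))
EOF
python3 /tmp/demo.py                                                # run as __main__: a module
python3 -c 'import sys; sys.path.insert(0, "/tmp"); import demo'    # imported: a dict
```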
fungi | jrosser: maybe https://github.com/unbit/uwsgi/pull/2373 is a clue? | 14:14 |
jrosser | i've applied that here and there is no difference | 14:16 |
jrosser | i was really surprised they've built their own parallel build system out of python though | 14:16 |
fungi | yeah | 14:17 |
fungi | the comment in https://github.com/agdsn/pycroft/pull/508 does also mention bullseye | 14:18 |
fungi | jrosser: https://github.com/unbit/uwsgi/pull/2362 | 14:19 |
jrosser | oh! | 14:19 |
fungi | though that's with 3.10 | 14:20 |
fungi | web search engines do a poor job of indexing github comments, or so it seems | 14:21 |
jrosser | that has the same effect as switching to builtins.compile | 14:22 |
jrosser | i.e. it's no longer throwing an error | 14:23 |
fungi | yep | 14:23 |
fungi | more just pointing out that it seems to mention the same exception you got | 14:23 |
fungi | and that someone was seeing it at least as far back as 2021-11-02 | 14:24 |
jrosser | could adjust this patch to do the direct build with `pip3 wheel` instead of calling the build script directly | 14:24 |
jrosser | https://review.opendev.org/c/opendev/system-config/+/821631/1/docker/uwsgi-base/Dockerfile | 14:25 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Try building uWSGI directly https://review.opendev.org/c/opendev/system-config/+/821631 | 14:34 |
fungi | jrosser: clarkb: ^ like that? | 14:34 |
jrosser | yes - hopefully that will behave similarly to what i see | 14:35 |
noonedeadpunk | can I ask infra-root to abandon patches for retired repos? like https://review.opendev.org/c/openstack/openstack-ansible-pip_install/+/720133 and https://review.opendev.org/c/openstack/openstack-ansible-os_almanach/+/658585 ? | 15:30 |
fungi | noonedeadpunk: tc members should be able to abandon patches on retired repos | 15:31 |
noonedeadpunk | ok, gotcha | 15:31 |
fungi | the openstack retirement acl grants them rights to make changes to the repos for such purposes | 15:31 |
fungi | i would, but i'm in the middle of several things already | 15:32 |
fungi | and this is one of the reasons the tc has special acl access over retired repos in the openstack/ namespace | 15:32 |
clarkb | jrosser: fungi: thank you for the help debugging that. I've just returned from a number of early morning errands and it decided to snow just to make things more difficult :) | 16:22 |
clarkb | catching up now | 16:22 |
jrosser | o/ hello | 16:23 |
clarkb | fungi: jrosser: so one thing that makes this extra weird is we are trying to rely on our "assemble" script to do bindep and make wheels for us | 16:25 |
clarkb | running pip3 wheel doesn't quite work because you also need to install all the deps and their wheels | 16:26 |
clarkb | considering that upping the verbosity works and we've got a hint as to what is happening, maybe we keep the verbosity and link to https://github.com/unbit/uwsgi/pull/2362 ? As for why older python exhibits this, I bet python backported whatever caused that and since we get up to date python we see it | 16:27 |
clarkb | let me know what y'all think is reasonable and I'll try to update changes to accommodate | 16:28 |
clarkb | I've approved https://review.opendev.org/c/opendev/gerritbot/+/818494 and will monitor that as it goes in | 16:29 |
jrosser | instinct says that you are seeing a failure due to https://github.com/unbit/uwsgi/pull/2362 even though the stderr has gone missing | 16:29 |
jrosser | as it stops in exactly the same point as mine did | 16:30 |
clarkb | jrosser: ya I wouldn't be surprised | 16:30 |
clarkb | and strongly suspect python backported whatever change did that in 3.10 on our images | 16:30 |
jrosser | fwiw i had 3.9.2-3 on a bullseye vm | 16:32 |
clarkb | My thought is to link to that pull request and stick with the verbose flag for now. Or just stick the pull request in there as a note for why we don't have bullseye yet. Except we thought we were already on bullseye with those images so I think hacking it to work is probably best | 16:33 |
fungi | yeah, i agree it's quite likely something happening with more recent point releases of python interpreters of varying minor revs | 16:37 |
opendevreview | Merged opendev/gerritbot master: Update the docker image to run as uid 11000 https://review.opendev.org/c/opendev/gerritbot/+/818494 | 16:37 |
clarkb | fungi: do you think that is a reasonable compromise to just stick with the verbosity for now and land the update? | 16:39 |
clarkb | lodgeit in particular thought it was already on bullseye but since our uwsgi image is publishing bullseye with buster contents that isn't true. And this change will fix that | 16:39 |
fungi | clarkb: yeah, that seems fine to me. if the problem begins to crop up for us again we have more to go on and hopefully more detail captured in the build log | 16:40 |
clarkb | cool I'll make that update as soon as I've eaten something | 16:40 |
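For reference, the workaround being agreed on amounts to keeping pip verbose when it builds the wheel, something along these lines (the exact invocation in the image build differs; this is just the shape of it):

```shell
# -v makes pip pass through the uwsgiconfig.py build output, which both gives
# us detail if it fails again and, oddly, seems to avoid the failure itself.
pip3 wheel -v --wheel-dir /output uwsgi
```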
*** marios is now known as marios|out | 16:46 | |
jrosser | fungi: clarkb this does seem to be a bit self-inflicted by uwsgi, they've re-used a builtin function name `compile` and then had to reference the actual builtin version explicitly | 16:54 |
jrosser | and renaming the function away from the builtin also seems to resolve this trouble https://paste.opendev.org/show/811669/ | 16:55 |
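A toy demonstration of the shadowing problem jrosser describes (not the uwsgi source; it just shows why defining your own compile() forces the code to reach for the real builtin explicitly):

```shell
python3 - <<'EOF'
import builtins

def compile(target):
    # This local compile() shadows the builtin within this module.
    print("pretend-building", target)

# The real builtin is still reachable, but only via the builtins module
# (or, fragilely, via __builtins__ depending on how the file is executed).
code = builtins.compile("1 + 1", "<demo>", "eval")
print(eval(code))
compile("plugins/python")
EOF
```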
fungi | yeah, rolling their own parallel build system, as you observed, is a special kind of nih as well | 17:03 |
jrosser | maybe i make a PR for this as it's really odd what they've done | 17:04 |
opendevreview | Clark Boylan proposed opendev/system-config master: Properly build bullseye uwsgi-base docker images https://review.opendev.org/c/opendev/system-config/+/821339 | 17:06 |
clarkb | jrosser: ++ | 17:06 |
clarkb | also ^ there is the verbosity hack with appropriate details | 17:06 |
fungi | clarkb: your commit message includes a reminder to check with vexxhost, should we get noonedeadpunk to confirm it's fine? | 17:08 |
* noonedeadpunk not working for vexxhost for quite a while now | 17:09 | |
clarkb | fungi: not a bad idea. I'm not sure if they are using this image beyond lodgeit though. If it is just lodgeit then we should be able to confirm it works | 17:09 |
clarkb | https://review.opendev.org/c/opendev/lodgeit/+/821340 via recheck on that running some testing | 17:09 |
fungi | noonedeadpunk: what i meant was i wondered if it was really a reminder to check with you | 17:09 |
fungi | no idea if it was actually vexxhost using those lodgeit images | 17:10 |
clarkb | fungi: well it's for whoever at vexxhost is still using that image if at all | 17:10 |
clarkb | mnaser: are you using opendevorg/uwsgi-base docker images for anything? I think you proposed the image initially. We discovered that our bullseye images are actually buster images and https://review.opendev.org/c/opendev/system-config/+/821339 corrects this | 17:10 |
clarkb | Wanted to warn you if you are using them as this shift could be surprising depending on how you use it | 17:10 |
noonedeadpunk | fungi: yeah they used images for lodgeit one day. no idea if they are now. | 17:13 |
clarkb | fungi: is '*.foobar CNAME foobar' and 'foobar CNAME foobar01' a valid DNS configuration? | 17:19 |
clarkb | I guess we have CI for that so I can just push up the change I'm thinking of | 17:19 |
fungi | yeah, that should be fine | 17:21 |
fungi | it was traditionally considered poor form to point a cname to another cname (or an mx to a cname) simply because it results in more recursion to get to the intended address(es), but these days that's usually not the case because modern nameservers are smart enough to return related records when queried so that you don't have to ask again | 17:22 |
fungi | so when you ask for baz.foobar the response from the resolver is going to have not only the cname to foobar but also the cname from foobar to foobar01 and the address records for foobar01 if it has them | 17:23 |
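Illustrating the chain fungi describes, with the same placeholder names (addresses are from the documentation range; a modern resolver returns all of this in one answer):

```shell
dig +noall +answer baz.foobar.opendev.org
# baz.foobar.opendev.org.  300  IN  CNAME  foobar.opendev.org.
# foobar.opendev.org.      300  IN  CNAME  foobar01.opendev.org.
# foobar01.opendev.org.    300  IN  A      203.0.113.10
```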
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Try to make zuul-preview records more clear https://review.opendev.org/c/opendev/zone-opendev.org/+/821743 | 17:23 |
clarkb | fungi: ^ the context is zuul-preview and me getting all confused trying to figure out what the actual host to ssh into was | 17:24 |
clarkb | this was when I was auditing buster to bullseye image update needs | 17:24 |
clarkb | I had our inventory and was checking things in inventory but zp01 in our inventory wasn't in dns :/ | 17:25 |
clarkb | fungi: I responded to your question at https://review.opendev.org/c/opendev/zone-opendev.org/+/821743 | 17:36 |
*** jpena is now known as jpena|off | 17:37 | |
fungi | clarkb: maybe i wasn't clear with my question... what uses the zuul-preview.opendev.org name? anything? i know what we're using the *.zuul-preview.opendev.org names for | 17:40 |
clarkb | oh I have no idea. But that is all that was in DNS so I assume something | 17:41 |
fungi | i suspect the original server was named zuul-preview and we didn't reevaluate the need for that record when we replaced it with zp01 | 17:41 |
fungi | doesn't hurt to keep the old name around, i guess, i was just pointing out that it's probably cruft | 17:42 |
clarkb | hrm ya maybe check with corvus and mordred and we can shift the *.zuul-preview CNAME to zp01 instead of zuul-preview | 17:43 |
clarkb | mordred: corvus ^ does anything use the zuul-preview.opendev.org name? or should it just be zp01.opendev.org? | 17:43 |
clarkb | wow pytest loads configs out of tox.ini for massive confusion | 17:44 |
fungi | yes, i found that amazing | 17:45 |
fungi | granted, flake8 does as well | 17:45 |
clarkb | flake8 only does it from its flake8 section though right? | 17:45 |
clarkb | at least it is somewhat explicit in that case | 17:45 |
fungi | right | 17:47 |
fungi | the pytest solution is all extra sorts of nuts | 17:48 |
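What they are grumbling about, in concrete terms (contents are illustrative): both tools will read their configuration out of tox.ini alongside tox's own sections, pytest from a [pytest] section and flake8 from a [flake8] section.

```shell
cat tox.ini
# [tox]
# envlist = pep8,py39
#
# [flake8]              # read by flake8
# max-line-length = 99
#
# [pytest]              # read by pytest
# addopts = -v
```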
corvus | clarkb: i think the magic proxy is designed to use zuul-preview, but i'm not 100% sure | 18:02 |
clarkb | corvus: ya I think fungi's question is if we only need the *.zuul-preview.opendev.org for the proxy | 18:03 |
clarkb | but we can be safe and leave both records in place | 18:03 |
corvus | ooh... erm... yeah i'd guess we can remove it | 18:05 |
fungi | right, trying to determine if the bare name (not the subdomain records under it) is cruft | 18:05 |
corvus | still not 100% on that, but i agree, i can't think of a reason we need it | 18:05 |
fungi | but as mentioned, it's fine to keep it | 18:05 |
corvus | i think it was probably just to keep similarity with other hosts, even though nothing should reference it | 18:05 |
fungi | i did some digging in the git histories, and it doesn't seem like there was actually any server before zp01, so my theory that zuul-preview was an older server name is probably wrong | 18:06 |
clarkb | ok if you have a preference to keep or remove let me know and I can update the change | 18:07 |
fungi | i have no preference really, just making sure i understood whether the record was actually used by anything | 18:08 |
fungi | also the extra cname indirection is sort of pointless | 18:08 |
fungi | in fact, even the *.zuul-preview rr doesn't need to be a cname, it could be a/aaaa rrs instead | 18:09 |
fungi | but the cname makes it a little more convenient when we replace the server as it's fewer records to update in the zone | 18:10 |
frickler | corvus: regarding zuul processing multiple branch deletions serially: would it make sense to activate tracing while this is happening? maybe too late now but before the next deletions? | 18:11 |
frickler | (we were discussing it in #openstack-infra before) | 18:11 |
frickler | elodilles is currently doing some cleanups | 18:11 |
corvus | frickler: i found and reproduced the bug, so i shouldn't need any more info | 18:14 |
fungi | elodilles has a bunch of outstanding deletes still to apply, so there's an opportunity yet | 18:14 |
fungi | but doesn't sound necessary | 18:14 |
frickler | so how far are we from deploying the fix? does it make sense to delay outstanding deletions to verify it? | 18:15 |
corvus | no fix yet; many hours or maybe tomorrow | 18:19 |
elodilles | actually i can break the script and run again the deletions tomorrow if that makes sense | 18:23 |
elodilles | i mean, continue the branch deletions | 18:24 |
frickler | elodilles: thx, I was just going to ask: how much would it matter to you to delay the deletions? | 18:24 |
elodilles | it shouldn't be a problem | 18:24 |
elodilles | the branches are eol'd already, and the branch deletions are not run instantly anyway, so one extra day shouldn't cause any problem | 18:26 |
fungi | yeah, so you can either continue to trickle them in, or wait until our next rolling scheduler restart once a fix lands, or both | 18:26 |
elodilles | fungi: will 'rolling scheduler restart' happen tomorrow as well, after the fix has landed, or is it something that is scheduled, like, weekly, or so? | 18:38 |
fungi | elodilles: the fix doesn't exist yet, so hard to predict exactly | 18:41 |
fungi | but yes we should in theory be able to restart things once the fix merges | 18:41 |
fungi | now that we have highly-available schedulers, most restarts should be ~zero impact to zuul's operation (except in non-backward-compatible situations with changes to the state data generally) | 18:42 |
elodilles | ok, i understand that. it really shouldn't be a problem to wait a couple of days so that we can test the zuul fix as well. i just wondered if the restart would happen much later, like a couple of weeks for example, then it might not be worth waiting with the branch deletions | 18:46 |
frickler | relatedly we should also discuss at the meeting about whether and when to do some freeze period over the holidays | 18:46 |
frickler | but I think that wouldn't happen this week, so waiting until tomorrow and then deciding based upon fix progress would be my proposal | 18:47 |
elodilles | frickler: if you say it regarding the branch deletions, then it sounds good to me :) | 18:49 |
frickler | elodilles: yes, pause them until tomorrow and then re-evaluate the status, that's what I meant to say | 18:50 |
elodilles | frickler: ack, thanks, i will do like that :) | 18:52 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] boot test with centos 9-stream https://review.opendev.org/c/openstack/diskimage-builder/+/821772 | 19:07 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Restart mailman services when testing https://review.opendev.org/c/opendev/system-config/+/821144 | 19:34 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Use newlist's automate option https://review.opendev.org/c/opendev/system-config/+/820397 | 19:34 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Restart mailman services when testing https://review.opendev.org/c/opendev/system-config/+/821144 | 19:37 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Use newlist's automate option https://review.opendev.org/c/opendev/system-config/+/820397 | 19:37 |
clarkb | looks like gerritbot restarted about an hour ago on the uid image update. And ^ happened more recently so we should be good on that | 20:09 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Restart mailman services when testing https://review.opendev.org/c/opendev/system-config/+/821144 | 20:15 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Use newlist's automate option https://review.opendev.org/c/opendev/system-config/+/820397 | 20:15 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] add vm element to 9-stream image test to test bootloader https://review.opendev.org/c/openstack/diskimage-builder/+/821772 | 20:25 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] add vm element to 9-stream image test to test bootloader https://review.opendev.org/c/openstack/diskimage-builder/+/821772 | 20:44 |
opendevreview | Merged opendev/system-config master: Block outbound SMTP connections from test jobs https://review.opendev.org/c/opendev/system-config/+/820900 | 20:46 |
fungi | interesting, looks like mailman is failing to start in our deploy tests for lists.k.i (but working on lists.o.o): https://zuul.opendev.org/t/openstack/build/5657946352694851926161489bfec28f/log/lists.katacontainers.io/syslog.txt#1521-1525 | 20:59 |
fungi | i think it may be due to the lack of a "mailman" meta-list in the config | 21:00 |
fungi | the production server has one | 21:01 |
fungi | so if i add it to the inventory, it'll be a no-op in prod | 21:01 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] add vm element to 9-stream image test to test bootloader https://review.opendev.org/c/openstack/diskimage-builder/+/821772 | 21:04 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Restart mailman services when testing https://review.opendev.org/c/opendev/system-config/+/821144 | 21:04 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Use newlist's automate option https://review.opendev.org/c/opendev/system-config/+/820397 | 21:04 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add "mailman" meta-list to lists.katacontainers.io https://review.opendev.org/c/opendev/system-config/+/821775 | 21:04 |
ianw | is it just me or are there a lot more "second attempts" in zuul atm? | 21:18 |
fungi | i did see a post_failure on a zuul change moments ago where nodejs ran out of heap memory during yarn build | 21:24 |
fungi | no idea if that's typical | 21:24 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] add vm element to 9-stream image test to test bootloader https://review.opendev.org/c/openstack/diskimage-builder/+/821772 | 21:24 |
fungi | clarkb: ianw: should i move the new playbook for 821144 into playbooks/zuul/ instead? i noticed we have other playbooks/test-* files and so am unsure if there's a reason to keep them in one vs the other | 21:32 |
fungi | i guess it's a question of whether the playbook is run by the nested ansible as opposed to zuul's ansible? | 21:32 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Restart mailman services when testing https://review.opendev.org/c/opendev/system-config/+/821144 | 21:35 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Use newlist's automate option https://review.opendev.org/c/opendev/system-config/+/820397 | 21:35 |
fungi | 820397 seems to have fixed the failures on the subsequent changes, at least | 21:36 |
ianw | fungi: i think most of them are in playbooks/test-blah.yaml | 21:59 |
fungi | yes, i concur | 21:59 |
fungi | i found only two playbooks/zuul/test_blah.yaml counterexamples | 21:59 |
corvus | the zuul fix merged; i think this is an excellent candidate for a rolling restart based on our analysis. i'm going to begin that shortly. | 22:18 |
fungi | thanks! i concur | 22:21 |
corvus | ianw: incidentally -- what's the latest on the load balancer prep -- did that change to generalize load balancer configs merge? so are we ready to make a zuul lb based on that? | 22:21 |
fungi | i'll be around for a while yet too | 22:21 |
clarkb | I'm back and around if I can help | 22:21 |
clarkb | corvus: they did merge | 22:21 |
clarkb | corvus: they were in the period of time where system-config wasn't running so I remember them going in | 22:21 |
corvus | cool, so next step is to make "zuul-lb.opendev.org" in the style of gitea-lb? | 22:21 |
clarkb | fungi: ianw: might be a good idea to move them under zuul/ to avoid confusion but I'm not sure if that affects role lookups and similar | 22:21 |
clarkb | corvus: ya I think so | 22:22 |
ianw | ++ afaik we're good to make new lb nodes | 22:22 |
corvus | running the pull playbook now | 22:23 |
corvus | done; that was not a noop | 22:26 |
corvus | i'd like to tempt fate again and hard-stop the schedulers instead of graceful... thoughts? | 22:27 |
corvus | (last time i did that, we found a bug) | 22:28 |
clarkb | oh I think that was the only way I did it before. I guess it should've been graceful? | 22:28 |
clarkb | the tripleo gate queue isn't short, might be better to try the least impactful thing if we can | 22:28 |
corvus | well, i'm being loose with terminology; by graceful i mean "run 'zuul-scheduler stop' and wait for it to idle before running 'docker-compose down'" | 22:28 |
corvus | by hard i mean "run docker-compose down" | 22:29 |
clarkb | got it | 22:29 |
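The two restart styles corvus is contrasting, roughly (a sketch assuming the usual docker-compose layout on the scheduler hosts; the exact exec invocation may differ):

```shell
# "graceful": ask the scheduler to finish what it is doing, wait for it to
# idle, then take the containers down.
docker-compose exec scheduler zuul-scheduler stop
docker-compose down

# "hard": just take the containers down immediately.
docker-compose down
```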
clarkb | I think last time I ran the stop playbook which probably does the down. Oh but we did a full shutdown then and deleted all data so we wouldn't have been caught by any issues | 22:30 |
clarkb | ya I guess I don't know how to judge so am indifferent :) | 22:30 |
fungi | i'm fine with either experiment | 22:30 |
fungi | whatever is likely to yield the most useful data/find the most bugs | 22:31 |
corvus | i know i can be around for > 2 hours, which is the longest i would expect a latent issue from a hard-restart to show up, so i like the idea of accepting a little more risk now to try to reduce it later | 22:32 |
clarkb | wfm | 22:32 |
corvus | okay. i'll make sure to save a copy of the queues in case something goes wrong | 22:33 |
corvus | zuul02 is stopped | 22:35 |
corvus | zuul01 still seems happy; i think i stopped zuul02 right as it was about to start processing openstack/check | 22:36 |
corvus | i'll restart zuul02 now | 22:37 |
corvus | and start peeling a mandarin | 22:39 |
clarkb | I might actually have some, but my fingers will get all oily and I don't want that on the keyboard :) | 22:39 |
corvus | i'll just throw it in the dishwasher if it's a problem | 22:40 |
corvus | zuul02 is back | 22:46 |
corvus | watching the logs, it's a bit like a car accelerating onto the highway... it handles more and more pipelines until it's fully synced... | 22:47 |
corvus | i'm going to kill zuul01 now | 22:47 |
corvus | starting zuul01 | 22:48 |
corvus | in retrospect, i don't think either of those stops were very disruptive. maybe next time i want to chaos monkey i should sigterm | 22:49 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Try to make zuul-preview records more clear https://review.opendev.org/c/opendev/zone-opendev.org/+/821743 | 22:49 |
clarkb | fungi: ^ I went ahead and updated the zone change to remove the unneeded record. This way we don't go through the same q&a in a year :) | 22:49 |
fungi | fair enough | 22:50 |
corvus | i saw a traceback scroll by... but it was just a 5xx from gerrit@google | 22:50 |
clarkb | ianw: if you have time for https://review.opendev.org/q/hashtag:%2522bullseye-image-update%2522+status:open today that would be great. In particular I'm thinking doing limnoria tomorrow unless you want to watch it would be good so that you can help debug should it have a sad (you did the previous fixup fork so have a good grasp of it I think) | 22:51 |
clarkb | Once the zuul updating is done I'll go ahead and approve the accessbot change since that should be super low impact if it breaks | 22:52 |
ianw | clarkb: will do. just trying to think through some bootloader issues with 9-stream but will look in a bit | 22:52 |
clarkb | ianw: ya no rush. I won't approve any you haven't already +2'd until tomorrow relative to me | 22:52 |
corvus | zuul01 is up | 22:54 |
corvus | zuul-web is next | 22:55 |
clarkb | fungi: fwiw your iptables update seems to have hit a lot of servers and it all seems to be working as expected | 22:55 |
corvus | my heart rate increased at the start of that sentence and decreased at the end | 22:56 |
clarkb | that's interesting though, it looks like the zookeeper and zuul jobs are running concurrently | 22:56 |
clarkb | corvus: sorry :) | 22:56 |
clarkb | I wonder if starting the jobs concurrently is an artifact of the zuul rolling restart | 22:56 |
fungi | clarkb: thanks, i was spot-checking too and don't see any unexpected new rules | 22:57 |
corvus | clarkb: what are the job names? | 22:57 |
corvus | (i have no status page) | 22:58 |
clarkb | corvus: infra-prod-service-zookeeper and infra-prod-service-zuul in deploy for change 820900,9 | 22:58 |
corvus | 2021-12-14 22:54:24,263 ERROR zuul.zk.SemaphoreHandler: Releasing leaked semaphore /zuul/semaphores/openstack/infra-prod-playbook held by a8b0a7c92aa1449b9eade0dbdf7f781e-infra-prod-service-zookeeper | 22:59 |
corvus | that could indicate a problem | 22:59 |
clarkb | in this case it is ok for those to run concurrently so we should be fine this instance | 23:00 |
clarkb | but ya might need to look into that for future rolling restarts if that was the cause | 23:00 |
corvus | i think it was the cause and is a bug | 23:01 |
corvus | we run the semaphore cleanup handler right after startup, and i think we can do that before restoring the pipeline state | 23:02 |
corvus | web is back up; that concludes the rolling restart | 23:03 |
fungi | thanks! | 23:04 |
clarkb | corvus: other than the concurrent builds due to the semaphore release any concerns? or are we looking happy? | 23:04 |
fungi | elodilles: ^ we're all set for more branch deletions the next time you want to try a batch | 23:04 |
corvus | clarkb: so far so good. and that should be a one-time issue; there shouldn't be continuing fallout from the semaphore cleanup. | 23:05 |
corvus | elodilles, fungi: i'd suggest doing at least 3-4 branches all around the same time if you want to confirm the behavior is fixed (it's possible the first 2 may not merge if it starts processing the first event quickly enough, so i'd make sure to submit a minimum of 3) | 23:07 |
corvus | and of course, if it is fixed as we suspect, the more done at once the better | 23:08 |
corvus | since things are looking food now, i'm going to take a short break and will check back in a bit | 23:08 |
fungi | freudian slip! | 23:10 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add firewall behavior assertions to test_bridge https://review.opendev.org/c/opendev/system-config/+/821780 | 23:29 |
corvus | guess it's time to wash my keyboard :) | 23:31 |
clarkb | ha | 23:31 |
opendevreview | Merged opendev/system-config master: Update the accessbot image to bullseye https://review.opendev.org/c/opendev/system-config/+/821328 | 23:40 |
clarkb | hrm the testinfra get_host doesn't seem to check the inventory as much as just give you what you want even if it isn't already there | 23:51 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add firewall behavior assertions to test_bridge https://review.opendev.org/c/opendev/system-config/+/821780 | 23:57 |