Wednesday, 2022-06-01

*** rlandy|bbl is now known as rlandy00:03
opendevreviewIan Wienand proposed opendev/glean master: redhat-ish platforms: refactor simplification of interface writing
ianwclarkb: ^ thanks, that was the refactor you suggested.  i have to walk away from this or I'll end up rewriting the whole thing, which is probably not time well spent.00:36
fungiyou'd distil it down to merely a glimmer00:37
ianwthe next thing it needs to do is switch to "keyfile" ini-style NetworkManager files, instead of ifcfg-* format.  clearly no development is happening on the NM plugin that reads ifcfg-* files, but I also don't see why any of it would need to be broken00:37
*** rlandy is now known as rlandy|out01:19
*** rlandy is now known as rlandy|out01:24
ianwthanks for the glean reviews.  i tagged and pushed 1.22.0 which sets the foundation for rh ipv6 but is intended to have no functional changes.  so i'll keep an eye on things, and we can push the actual changes in a day or two01:36
fungithanks for all your hard work on it so far!01:36
fungilooks like we're on to ze03 now01:38
opendevreviewMerged opendev/system-config master: Update Gerrit images to 3.4.5 and 3.5.2
ianw^ i can do a quick gerrit pull and restart for that in an hour or two when it's super quiet02:55
ianwdocker inspect 23534ba51fc3 | grep opendevorg/gerrit@sha03:59
ianw            "opendevorg/gerrit@sha256:e114ec73aa90e04f0611609f34f585b269bea766d42e1b57d150987b5d450864"03:59
ianwlines up with
ianwi'll restart it now04:01
ianw#status log Restarted gerrit with 3.4.5 (
opendevstatusianw: finished logging04:04
*** marios is now known as marios|ruck05:05
*** ysandeep|out is now known as ysandeep06:14
opendevreviewRodolfo Alonso proposed openstack/project-config master: Remove lower-constraints and tox-py36 from Neutron Grafana
*** ysandeep is now known as ysandeep|lunch08:09
*** pojadhav is now known as pojadhav|lunch08:16
*** pojadhav|lunch is now known as pojadhav08:45
*** marios|ruck is now known as marios|ruck|afk08:55
*** ysandeep|lunch is now known as ysandeep09:26
*** jpena|off is now known as jpena09:45
*** marios|ruck|afk is now known as marios|ruck09:46
*** rlandy|out is now known as rlandy10:18
*** rlandy_ is now known as rlandy__10:24
*** pojadhav is now known as pojadhav|afk11:12
mgariepygood morning11:45
mgariepyis the centos9-stream hold available for 844037 change ?11:46
fungimgariepy: just a sec and i'll take a peek11:50
fungimgariepy: ssh root@
mgariepygreat thanks a lot :D11:53
*** dviroel|afk is now known as dviroel12:21
mgariepythanks fungi i did find the issue.12:42
frickler"find not found" sounds ... nice ;)12:54
mgariepyyep indeed 12:56
mgariepythanks again for your help on this one.12:57
fungimgariepy: so you're done with the held node, or still experimenting?12:59
mgariepyi'm done with it.12:59
fungialso don't forget, if it's just a matter of not being sure you have a representative test environment, we do publish our vm images you can download12:59
mgariepywhere are the images?13:00
fungimgariepy: and depending on which builder built a particular image (they grab the build requests at random)13:02
fungialso for the aarch64 (arm64) images13:02
fungiyou'd have to do something to get your ssh key into them (configdrive metadata or editing the images)13:03
mgariepyha cool i didn't know that they were published.13:03
fricklerworth noting that these require to be booted with a config-drive for setup, they don't do cloud-init (I think)13:03
mgariepywhere is the config for the building of the image?13:04
fungithey can get away without configdrive if you have dhcp/slaac for dynamic network configuration13:04
fungimgariepy: nodepool builds the images by calling diskimage-builder, but the basic parameters for that are configured here:
fungimost of the elements listed are part of dib's stdlib, but a few (like infra-package-needs) are custom here:
mgariepycool i'll take a look to try them on my servers so i can stop asking for hold : ) haha13:08
fungiyeah, if you have an openstack you can just upload those to glance and set the appropriate metadata for your ssh key, then boot them with configdrive enabled13:09
Clark[m]It is also worth noting that you should look at the logs jobs produce and determine if they are sufficient or need to be improved. In this case the error was logged13:10
fungiright, and obviously if there's information you're missing which would have helped to find that, gathering those additional logs in the job would be a great idea13:10
Clark[m]Looking at zuul restart progress as soon as ze12 is paused I think we can test the rax swift uploads as ze12 won't schedule new jobs13:11
fungiClark[m]: agreed. what's the simplest way to exercise base-test?13:11
fungiwe just need a change in an untrusted repo which would normally run something directly parented to base, i guess, and reparent it13:11
fungibut wondering if you happen to know a good one off the top of your head13:12
Clark[m]fungi: I typically use a DNM change against zuul-jobs swapping out base for base-test on its unittests iirc13:12
fungimaybe something in zuul/zuul-jobs13:12
fungiyeah that would work13:12
fungii'll get that pushed now13:12
*** pojadhav|afk is now known as pojadhav13:12
opendevreviewJeremy Stanley proposed zuul/zuul-jobs master: DNM: exercise base-test job
*** dviroel_ is now known as dviroel13:15
Clark[m] if that uploads to rax it is running on ze0413:24
Clark[m]It has reported and looks like it uploaded to rax and the logs are viewable13:29
Clark[m]Someone not on a phone should double check all that :) but we may need to bring this up with openstacksdk next if that all checks out13:30
*** marios|ruck is now known as marios|ruck|call13:32
fungii'll recheck it again just so we have a bit more data13:47
opendevreviewMerged openstack/project-config master: Add a repository for the Large Scale SIG
corvusze12 is still running, so there's still a small chance it could run a job14:13
*** rlandy__ is now known as rlandy14:16
fungiyeah, i'll check that examples didn't come from there14:22
fungiwe're still waiting for ze11 to stop completely14:23
Clark[m]ze12 is paused now. All new jobs should run with old openstacksdk14:58
*** ysandeep is now known as ysandeep|out14:58
*** marios|ruck|call is now known as marios|ruck15:06
*** dviroel is now known as dviroel|lunch15:08
clarkbtox-py27 and tox-py38 on fungi's recheck both uploaded to rax and they have logs15:31
clarkbthe other tox jobs uploaded to ovh and also have logs15:31
clarkbI'm pretty much convinced now that openstacksdk is deleting our metadata somehow15:31
clarkber openstacksdk 0.99.015:31
clarkbgtema: fyi ^ would it help to send email to the discuss list about that or is filing an issue better or?15:32
clarkbfungi: and I think we can go ahead and revert the rax removal from base jobs?15:33
fungiclarkb: i think so, still seems to be working15:33
fungichecking to see if i proposed the revert as wip15:34
opendevreviewJeremy Stanley proposed opendev/base-jobs master: Revert "Temporarily stop uploading logs to Rackspace"
fungiclarkb: ^15:36
clarkbI've gone ahead and approved that as ze12 won't run any new jobs now15:41
opendevreviewMerged opendev/base-jobs master: Revert "Temporarily stop uploading logs to Rackspace"
*** marios|ruck is now known as marios|out15:53
*** dviroel|lunch is now known as dviroel16:20
clarkbcorvus: I think we might be stuck stopping mergers still16:36
clarkbcorvus: it looks like the merger restarted instead of stopping?16:37
clarkbansible is still waiting on it to stop so I don't know what happened there16:37
clarkbmaybe docker restarted it too quickly for our wait to notice?16:38
fungioh, hrm16:42
clarkbok ya the container id never changed the process just restarted16:42
clarkbthe way we wait for the container to stop is we expect the container to go away?16:43
clarkbso I think there is a mismatch in how the graceful stop works now and how docker is reporting the container presence16:43
fungithough docker-compose logs for it doesn't mention any restart16:43
fungiyeah, i concur, i don't think docker restarted it16:43
clarkbdocker ps -a says it is up 16 minutes16:43
clarkband ps proper seems to correlate with that16:43
fungioh, docker restarted it but didn't percolate into the docker-compose logs?16:44
clarkb`docker-compose ps -q | xargs docker wait` is what we are waiting on so its not the logs that matter but the listing16:45
clarkbrestart: always <- is set on the service16:46
clarkbI think that is the issue16:46
clarkbya executors set restart: on-failure16:47
clarkblet me push a fix then once that lands we can manually trigger a merger stop again which the ansible playbook should catch allowing it to continue16:47
clarkbhrm schedulers are also restart always. How did they work before?16:48
clarkbah because we down the scheduler rather than doing a graceful thing16:48
clarkbso ya one sec change incoming16:48
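
A minimal sketch of the mismatch described above, assuming a graceful merger stop makes the main process exit on its own with status 0 (the exit status is an assumption, not something stated in the log). The function name is hypothetical; it only models Docker's documented restart policies for a container whose process exits by itself, which is why `docker wait` never saw the `restart: always` merger go away while the `restart: on-failure` executors were fine.

```python
def restarts_after_exit(policy: str, exit_code: int) -> bool:
    """Return True if Docker would start the container again after its
    main process exits on its own (no manual `docker stop`)."""
    if policy in ("always", "unless-stopped"):
        # Restarted regardless of how the process exited.
        return True
    if policy == "on-failure":
        # Restarted only when the process reported an error.
        return exit_code != 0
    if policy == "no":
        return False
    raise ValueError(f"unknown restart policy: {policy!r}")


# Assumed: a graceful stop exits cleanly with status 0.
for policy in ("always", "on-failure"):
    print(policy, restarts_after_exit(policy, 0))
```
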
opendevreviewClark Boylan proposed opendev/system-config master: Fix zuul merger graceful stops
corvusclarkb: ah yep, lgtm16:54
*** jpena is now known as jpena|off17:10
clarkbinfra-root thats an openstack-discuss email about openstacksdk 0.99.0 and the swift uploads. Does that draft look good?17:32
clarkbalso looking at the script I wonder if delete after was not set on those objects either17:34
clarkbwhich means we'll have leaked the job logs from those days potentially17:34
fricklergtema: ^^17:42
gtemaI would be looking at that, was waiting to get bit more details17:43
gtemaFor sure 0.99 changes things and also on this front, but I really wonder that it now fails for particular type of cloud only17:44
gtemaWould be good to think about testing possibility 17:44
fricklerwell it seems to fail only for rax and not for ovh17:44
clarkbwell it looks like we have to specifically set a header on each object in rax to get the cors headers (linked to in that etherpad)17:45
clarkbwhich is why I'm wondering if we just aren't setting those headers anymore17:46
fricklerso possibly would need some historic version of swift to reproduce17:46
clarkbwhich is why I also wonder about the delete after values17:46
fungirackspace's swift wasn't ever actually swift, from what i gather17:46
fricklerclarkb: but setting the headers on ovh continued to work? or don't we need them there?17:46
clarkbfrickler: we don't appear to need them there (they must either default to the right cors value or maybe they consult the index.html value for the container top level)17:47
corvusthe headers are required for allowing access through rackspace's cdn, which is the only way to have anonymous public access in rax.  that's the main difference.17:47
clarkb is the other place we set these headers but that happens once per container and all our containers would've had that happen long ago17:47
fungii suppose we could record the api interactions at debug level with openstacksdk 0.61.0 and 0.99.0 and compare them17:48
corvushowever, if the mechanism for setting those is broken in general, then it's possible that the x-delete-after header was not being set on *any* cloud uploads, so as clarkb suggested, we may have objects in *all* of our clouds which will not expire automatically.17:48
clarkbfungi: ya that might be a good way to test it too. Can check that all expected headers make it outbound17:48
clarkbcorvus: exactly17:48
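
The comparison suggested above could be sketched roughly as follows. The header sets are made-up placeholders standing in for whatever the two debug logs would contain; they are not results from either SDK version, and `dropped_headers` is a hypothetical helper. Names are compared case-insensitively so that a mere change in header case is not reported as a loss.

```python
def dropped_headers(expected, sent):
    """Names present in `expected` but absent from `sent`, compared
    case-insensitively because HTTP header names ignore case."""
    sent_names = {name.lower() for name in sent}
    return sorted(name for name in expected if name.lower() not in sent_names)


# Hypothetical header sets, as they might be copied out of two debug logs.
headers_from_old_run = {
    "X-Delete-After": "2592000",
    "Access-Control-Allow-Origin": "*",
    "Content-Type": "text/plain",
}
headers_from_new_run = {
    "x-delete-after": "2592000",
    "content-type": "text/plain",
}

print(dropped_headers(headers_from_old_run, headers_from_new_run))
```
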
fungiclarkb: small edits to the pad, also those urls are not permalinks so may change before people read them17:51
clarkbfungi: good point I can fix the links17:51
clarkbI'll also add a note about x-delete-after too17:51
clarkbalright sending that out17:53
clarkbwe can track progress on the problem there. To be clear I'm not sure how much I'll be able to debug for the next while as other things are distracting too :) I think we're likely stable on 0.61.0 though17:56
timburke_from a swift perspective, i would've expected CORS to be controlled via X-Container-Meta-Access-Control-Allow-Origin set at the container level, not Access-Control-Allow-Origin set on individual objects18:04
gtemaI am pretty confident x-delete was working earlier, since i rely on that heavily in my cloud. Will anyway have a deeper analysis tomorrow (will try to bisect what particularly changed in 0.99)18:04
timburke_rax might be doing something different, though -- while they certainly used to run swift, they might be running hummingbird for cloudfiles these days. even with vanilla swift, though, you could probably update the allowed_headers configured on the object-server to have that CORS header stick -- i just didn't know of anyone that did that18:04
clarkbtimburke_: according to the comments its specific to their CDN18:05
gtemaThe only thing that immediately came to my mind is eventually changed case for header names18:06
clarkbtimburke_: basically we're setting an object metadata/header value in swift to affect another service18:06
clarkbtimburke_: and ya we set it at the container level too
clarkbhowever we use something like 4096 containers to shard the logs across and all of those would've been created a long time ago so hard to say if that is also affected here18:07
timburke_👍 yeah, the CDN stuff can definitely cause headaches, too18:07
gtemaClarkb: do you have a chance to look at any of the "broken" containers/objects in rax to see if they have any headers set?18:10
clarkblet me see18:11
clarkbhave to find an object from last week18:11
gtemaNo hurry, thanks. I am anyway already off for today18:12
clarkbya enjoy your evening. we've managed to work around it for now18:12
clarkbhrm is an example but I'm not sure how to map that to a container or cloud18:12
*** rlandy is now known as rlandy|mtg18:13
gtemaOk, that should work. Regular curl should help here18:13
clarkbgtema: I'm not sure if that will pass through all of the swift metadata though. I'm trying to map it to a swift container so I can check that directly but we'll see how successful I am18:15
gtemawell, I clearly see that X-Delete-At is set on container and any random object18:16
gtemaand Access-Control-Allow-Origin: "*" is set on root18:17
clarkbfor the objects at the link I just provided? I don't see either18:18
gtemaclarkb: are you sure this is bad example?18:18
clarkbbut maybe you don't get them by default with a basic GET18:18
clarkbgtema: yes fails to load and according to the console it is due to CORS and if you click the view log link top right you get the url above18:19
clarkband there definitely isn't cors headers set18:19
gtemaok, I see18:20
clarkbya ok curl shows the X-Delete-At but firefox didn't18:20
clarkbgtema: is a working example uploaded by 0.61.018:21
fungidid something else from that buildset upload to ovh so we can compare? are we also missing the headers there?18:21
clarkbAccess-Control-Allow-Origin: * is present there18:21
gtemayes, I see18:21
clarkb is an ovh build result from the same buildset as the failing rax 0.99.0 case18:22
gtema - there is even explicit "hack" for rax in the role18:23
clarkbgtema: yes I called that out in my email18:23
gtemaah, right18:23
clarkbinteresting the ovh case also has CORS errors in the console but it loads the logs in the dashboard anyway18:25
clarkbah but it is only for a specific file which apparently zuul doesn't need and it isn't fatal?18:26
opendevreviewMerged opendev/system-config master: Fix zuul merger graceful stops
gtemaugh, I think I see the issue: access-control-allow-origin is not matching expected prefix for supported object headers(
fungionce that ^ deploys, we'll need to manually down the zuul-merger container on zm01 but after that we should be good?18:27
gtemaI will test this carefully tomorrow18:27
*** artom_ is now known as artom18:27
fungigtema: is there a reason for having an allowed list of object headers? i thought you could include any arbitrary header18:28
gtemawell, pretty much API description of Swift that tells that object metadata starts with X-Object-Meta18:29
gtemathere is set of additional "system" headers but cors headers are not in there18:29
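
To illustrate the kind of client-side filtering being described, here is a hedged sketch. It is not the SDK's actual code: the function name and the short "system" header list are hypothetical and chosen only to show why a header that neither starts with the X-Object-Meta prefix nor appears in an allowed list would be dropped before the request is sent.

```python
OBJECT_META_PREFIX = "x-object-meta-"

# Illustrative subset only; an assumption, not the SDK's real list.
SYSTEM_HEADERS = {"content-type", "content-encoding", "x-delete-after", "x-delete-at"}


def filter_object_headers(headers):
    """Keep only headers a prefix/allow-list filter would recognise."""
    kept = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered.startswith(OBJECT_META_PREFIX) or lowered in SYSTEM_HEADERS:
            kept[name] = value
    return kept


requested = {
    "x-delete-after": "2592000",
    "content-type": "text/plain",
    "access-control-allow-origin": "*",
}
print(filter_object_headers(requested))
```

Under these assumptions the CORS header is silently discarded while the expiry and content headers pass through.
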
gtemaI will create a test case tomorrow for that in sdk18:29
gtemahopefully however we can unblock ourselves with devstack-networking often OOMing 18:30
clarkbinterestingly when I make requests against ovh with curl I don't get the cors headers but when my browser makes them it does get them. The one that fails is for a 404 which is why it breaks and why its fine18:32
timburke_fwiw, there's also a configurable list of additional headers that may be stored:
gtemaso at the moment code works correctly according to official docs. And since cors is an additional middleware it is not present in API docs and thus not properly considered even as concept18:33
clarkbwe're fetching job-output.json.gz but the path is actually job-output.json. I think its a compat thing to try and find multiple possible versions but the 404 version fails CORS headers because nothing has said a 404 is a valid cross site request18:33
timburke_clarkb, if you include something like `-H 'Origin:'` you'll get the CORS header18:33
gtemaclarkb: I think you need also to send refer header or something like that (browsers should be doing that)18:33
clarkbtimburke_: ah thanks18:33
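
The same probe can be sketched without curl. Both URLs below are hypothetical placeholders (the real ones are not in the log), and `cors_probe` is an illustrative helper; it only builds a request that carries an Origin header, the way a browser would, and does not send it.

```python
import urllib.request


def cors_probe(url, origin):
    """Build (but do not send) a GET request carrying an Origin header."""
    return urllib.request.Request(url, headers={"Origin": origin}, method="GET")


# Placeholder URLs, assumed for illustration only.
req = cors_probe(
    "https://logs.example.com/job-output.json",
    "https://zuul.example.com",
)
print(req.get_method(), req.full_url, req.get_header("Origin"))
```

Passing `req` to `urllib.request.urlopen` and reading `Access-Control-Allow-Origin` from the response headers would then show whether the server answers the cross-origin request.
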
clarkbgtema: re working correctly according to official docs I don't think the docs say other headers are invalid just that if properly formatted they are treated special?18:34
fungiis the client-side header filtering idea that it can save the user time and bandwidth over a server-side api rejection?18:34
clarkbgtema: I did look at that fwiw and didn't find anything saying arbitrary headers are invalid/disallowed just that properly formatted ones can be managed by swift18:34
gtemaclarkb: it depends on how you read the doc. It of course does not mention that not listed headers are forbidden, but it lists headers it recognizes18:35
clarkbgtema: right18:35
fungibecause otherwise, a client second-guessing server-side limits is at best a redundancy and at worst likely to diverge over time18:35
clarkbI would argue clients/sdks/tools shouldn't be overly aggressive then18:35
clarkbclient tools should always be forgiving18:35
fungipostel's law18:36
clarkbthen let the remote end be angry if necessary18:36
fungisays the opposite ;)18:36
gtemaapproach of SDK was always to try to fail client as early as possible before even reaching server18:36
fungibut yeah, i don't think it's applicable in this case18:36
clarkbgtema: the problem with that is openstack has never been consistent enough to make that a reasonable thing to do18:36
gtemathat is so sadly true, this makes me cry18:37
* gtema is wiping tears18:37
gtemaokay, as said - I will try to fix that tomorrow18:38
clarkbfungi: ya I mean swift seems to ignore it entirely in the ovh case for example18:38
clarkbthe zuul merger docker compose config fix is deploying now18:39
clarkbonce it deploys I'll manually gracefully stop zm01 again and see if we get further18:39
clarkbI wonder if I have to down up the container to pick up the new config though :/18:39
fungifrom an sdk standpoint, i would interpret postel's law as saying that the user should be conservative in what data they supply as inputs but the sdk should be forgiving of what the user supplies it. then the sdk should be as conservative as it can in what it sends to the server-side api (while still trying to honor the caller's wishes), and the server should be as accepting as possible about18:39
fungiwhat it receives from the sdk18:39
corvusfwiw, the sdk docs don't say they will filter the header list:
corvus```headers – These will be passed through to the object creation API as HTTP Headers.```18:40
corvus(which, to be clear, is what i think is the expected and desired behavior)18:41
gtemabtw, mentions that you need to set X-Container-Meta-Access-Control-Allow-Origin18:42
clarkbya we do that too
gtemacorrect. That is why it works for OVH and not for RAX18:44
clarkbthat and the containers were all created years ago18:44
clarkbbut ya if we ran this against ovh today and it created new containers it would probably work18:44
gtemaif they have something not standard (that is, not matching API docs of swift) we have issues18:44
gtemaok, done for tonight. Will add exception to sdk18:45
clarkbdeployment of the zm fix is done. Manually running the merger stop on zm01 now18:48
fungialso remember that the current api docs for swift are not necessarily going to be relevant to the 10-year-old fork some major service providers are still running18:48
clarkbthe playbook is proceeding18:48
corvuswe also set content-encoding and content-type using that mechanism -- do we know if we expect those to make it through 0.99?18:49
fungibut users may have an application which needs to talk to diablo-era and yoga-era swift in different providers at the same time18:49
clarkbcorvus: fungi  might be a good idea to followup on the thread with that info so it isn't lost in irc scrollback? but ya I agree those are good questions and considerations :)18:50
fungiwhich was the use case for the code in nodepool which was later extracted to become shade and then merged into openstacksdk18:50
clarkbI think zm02 is hitting the same problem because it started on the wrong config?18:50
clarkbwe may have to manually stop each merger. I'll do that if so18:51
fungioh, so we'll need to stop them all manually this time18:51
fungiyeah, that makes sense. i guess docker-compose interprets that when "upping" and doesn't re-read it for other actions18:51
corvusbut this is the last time, really for real this time18:51
clarkbyes I think so. I'm running the same command the playbook runs to stop them which means in theory they will all work next time18:51
timburke_looks like content-encoding and content-type should be fine:
clarkbERROR: 137 trying to stop on zm03 but it seems to have stopped18:54
clarkb04 didn't do that but 05 did. I wonder if its a timing thing stopping the merger too close to startup18:58
clarkbI'll give 06 plenty of time18:59
fungii guess 137 is a docker-specific exit code? i don't see zuul special-casing it anyway19:00
clarkbya I think so19:00
clarkbapparently error 137 is a "I don't have enough memory" error19:02
clarkbmaybe our mergers are a bit too small?19:02
fungimaybe it tried to start a new merger process while the old one's allocations hadn't been cleaned up?19:03
clarkbfree reports plenty of available memory19:03
fungibut yeah, the mergers only have 2gb ram19:04
fungithey do have swap too though19:04
clarkbya 07 was fine19:04
clarkbsomething to keep an eye on but probably not urgent?19:04
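
For reference when reading that exit status: by the usual convention, a status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9, SIGKILL. The kernel OOM killer is one possible sender of that signal; `docker stop` killing a container after its grace period is another, so the status alone does not settle which happened here. A small sketch (the helper name is hypothetical):

```python
import signal


def describe_exit(code):
    """Translate a shell-style exit status into a short description."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # Not a known signal number on this platform.
            return f"exited with status {code}"
    return f"exited with status {code}"


print(describe_exit(137))
```
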
clarkb8 is proceeding now. It should get to zuul01 processes fairly quickly19:06
clarkbyup zuul01 is stopping now19:07
clarkbcorvus: I notice that the fingergw does not remove itself from the components registry when stopped19:08
clarkboh wait there it goes. Maybe just a delay on the zk ephemeral node cleanup?19:08
clarkbit is waiting for the scheduler on 01 to start now. i'm going to eat lunch while that happens19:09
fungiyeah, web and scheduler will take a while19:09
*** rlandy|mtg is now known as rlandy19:12
clarkblooks like it is doing 02 now19:38
clarkbthe playbook is done. Seems to have had no errors. I'll close the screen session now since everything was logged for it20:06
clarkbpart of me wants to run it again today just to make sure the mergers are happy but I don't think that is super important20:07
fungii'm happy to run it again. we technically also didn't really exercise the image update this last time since that was done prior to restarting the mergers manually20:09
corvusif you wanted a new version of zuul for the next update....merging would do, and it's operationally interesting for opendev...  ;)20:12
corvus(meanwhile, any objection to my restarting the launchers?20:13
fungino objection from me20:13
clarkbya no objection here, though that will likely pull in openstacksdk 0.99.0 on the launchers20:15
clarkbI think now that the previous issue is better understood we wouldn't expect that to affect nodepool, but calling it out as a change20:15
corvus#status log restarted nodepool launchers on 6416b1483821912ac7a0d954aeb6e864eafdb819, likely with sdk 0.9920:15
opendevstatuscorvus: finished logging20:15
clarkbI just think we should status log the restart of zuul too20:16
corvusclarkb: agreed20:16
clarkb#status log Restarted all of zuul on 6.0.1.dev54 69199c6fa20:16
corvus(agreed re sdk)20:16
opendevstatusclarkb: finished logging20:16
corvusopenstack.exceptions.BadRequestException: BadRequestException: 400: Client Error for url: [...] Bad networks format20:17
corvusi'm looking into whether that's new or not20:17
corvusnope that's new20:18
corvusi'm going to assume occam's razor and that's an sdk 0.99 bug (i have confirmed 0.99 is in the container)20:19
clarkbwouldn't surprise me20:19
corvusnext step? roll back our launchers to nodepool 6.0.0 and then merge a pin?20:19
clarkbseems reasonable to me20:20
clarkbI don't think opendev is relying on any new unreleased nodepool features/functionality20:20
fungioh that's a fun error20:20
fungii'll get the pin pushed20:21
corvusansible -f 20 nodepool-launcher -m shell -a 'docker pull zuul/nodepool-launcher:6.0.0; docker tag zuul/nodepool-launcher:6.0.0 zuul/nodepool-launcher:latest'20:22
corvus#status log restarted nodepool launchers on 6.0.0 after encountering suspected sdk 0.99 bug20:22
opendevstatuscorvus: finished logging20:22
clarkbI'll review the pin but then I'm going for a bike ride. My opportunities to do that are becoming fewer as we get closer to the summit20:23
fungicorvus: clarkb: Temporarily pin OpenStackSDK before 0.9920:27
clarkbheh we even had the 1.0.0 cap20:28
clarkband now bike ride time. Back in a bit20:30
*** timburke_ is now known as timburke20:59
*** dviroel is now known as dviroel|out21:34
clarkbfungi: just to catch up were you going to rerun the reboot playbook? I can help keep an eye on it if so22:26
fungilemme check if that zuul change merged and published22:30
clarkbI think it did22:31
clarkbassuming the change that merged is the right one22:31
fungiyeah, promote finished22:31
fungii have a root screen session with the new run teed up22:32
clarkbcool I'm not joined yet but can keep an eye on it via /components and grafana and dig in further if necessary22:32
fungiready to hit enter if no immediate objections22:33
clarkbnone from me.22:33
fungifire in the hole!22:33
fungiseems to have pulled the new images22:33
clarkbya that should be the first thing it does22:33
fungize01 should be in the process of stopping22:33
*** rlandy is now known as rlandy|out22:44
clarkbreminder I plan to delete the ethercalc server and its dns records tomorrow23:52
clarkbI haven't heard any noise since we shutdown the server. Please let me know if you saw something I missed23:52
