*** rlandy|bbl is now known as rlandy | 00:03 | |
opendevreview | Ian Wienand proposed opendev/glean master: redhat-ish platforms: refactor simplification of interface writing https://review.opendev.org/c/opendev/glean/+/844148 | 00:35 |
ianw | clarkb: ^ thanks, that was the refactor you suggested. i have to walk away from this or I'll end up rewriting the whole thing, which is probably not time well spent. | 00:36 |
fungi | you'd distil it down to merely a glimmer | 00:37 |
ianw | the next thing it needs to do is switch to "keyfile" ini-style NetworkManager files, instead of ifcfg-* format. clearly no development is happening on the NM plugin that reads ifcfg-* files, but I also don't see why any of it would need to be broken | 00:37 |
*** rlandy is now known as rlandy|out | 01:19 | |
ianw | thanks for the glean reviews. i tagged and pushed 1.22.0 which sets the foundation for rh ipv6 but is intended to have no functional changes. so i'll keep an eye on things, and we can push the actual changes in a day or two | 01:36 |
fungi | thanks for all your hard work on it so far! | 01:36 |
fungi | looks like we're on to ze03 now | 01:38 |
opendevreview | Merged opendev/system-config master: Update Gerrit images to 3.4.5 and 3.5.2 https://review.opendev.org/c/opendev/system-config/+/843298 | 02:04 |
ianw | ^ i can do a quick gerrit pull and restart for that in an hour or two when it's super quiet | 02:55 |
ianw | docker inspect 23534ba51fc3 | grep opendevorg/gerrit@sha | 03:59 |
ianw | "opendevorg/gerrit@sha256:e114ec73aa90e04f0611609f34f585b269bea766d42e1b57d150987b5d450864" | 03:59 |
ianw | lines up with https://hub.docker.com/layers/opendevorg/gerrit/3.4/images/sha256-e114ec73aa90e04f0611609f34f585b269bea766d42e1b57d150987b5d450864?context=explore | 03:59 |
ianw | i'll restart it now | 04:01 |
ianw | #status log Restarted gerrit with 3.4.5 (https://review.opendev.org/c/opendev/system-config/+/843298) | 04:04 |
opendevstatus | ianw: finished logging | 04:04 |
*** marios is now known as marios|ruck | 05:05 | |
*** ysandeep|out is now known as ysandeep | 06:14 | |
opendevreview | Rodolfo Alonso proposed openstack/project-config master: Remove lower-constraints and tox-py36 from Neutron Grafana https://review.opendev.org/c/openstack/project-config/+/844254 | 07:43 |
Tengu | 1 | 08:02 |
*** ysandeep is now known as ysandeep|lunch | 08:09 | |
*** pojadhav is now known as pojadhav|lunch | 08:16 | |
*** pojadhav|lunch is now known as pojadhav | 08:45 | |
*** marios|ruck is now known as marios|ruck|afk | 08:55 | |
*** ysandeep|lunch is now known as ysandeep | 09:26 | |
*** jpena|off is now known as jpena | 09:45 | |
*** marios|ruck|afk is now known as marios|ruck | 09:46 | |
*** rlandy|out is now known as rlandy | 10:18 | |
*** rlandy_ is now known as rlandy__ | 10:24 | |
*** pojadhav is now known as pojadhav|afk | 11:12 | |
mgariepy | good morning | 11:45 |
mgariepy | is the centos9-stream hold available for 844037 change ? | 11:46 |
fungi | mgariepy: just a sec and i'll take a peek | 11:50 |
fungi | mgariepy: ssh root@149.202.172.204 | 11:52 |
mgariepy | great thanks a lot :D | 11:53 |
fungi | yw | 11:53 |
*** dviroel|afk is now known as dviroel | 12:21 | |
mgariepy | thanks fungi i did find the issue. | 12:42 |
mgariepy | https://zuul.opendev.org/t/openstack/build/ffb71d3850284a028fe4329c2c4abb20/log/logs/openstack/aio1_galera_container-170eefd4/mariadb.service.journal-21-15-29.log.txt#150 | 12:42 |
frickler | "find not found" sounds ... nice ;) | 12:54 |
mgariepy | lol | 12:56 |
mgariepy | yep indeed | 12:56 |
mgariepy | thanks again for your help on this one. | 12:57 |
fungi | mgariepy: so you're done with the held node, or still experimenting? | 12:59 |
mgariepy | i'm done with it. | 12:59 |
fungi | also don't forget, if it's just a matter of not being sure you have a representative test environment, we do publish our vm images you can download | 12:59 |
mgariepy | where are the images? | 13:00 |
fungi | mgariepy: https://nb01.opendev.org/images/ and https://nb02.opendev.org/images/ depending on which builder built a particular image (they grab the build requests at random) | 13:02 |
fungi | also https://nb03.opendev.org/images/ for the aarch64 (arm64) images | 13:02 |
fungi | you'd have to do something to get your ssh key into them (configdrive metadata or editing the images) | 13:03 |
mgariepy | ha cool i didn't know that they were published. | 13:03 |
frickler | worth noting that these require to be booted with a config-drive for setup, they don't do cloud-init (I think) | 13:03 |
mgariepy | where is the config for building the image? | 13:04 |
fungi | they can get away without configdrive if you have dhcp/slaac for dynamic network configuration | 13:04 |
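A minimal sketch of the configdrive metadata that injects an ssh key into one of these images — the layout is the standard OpenStack `meta_data.json` that glean/simple-init reads; the uuid and key values here are illustrative, not from the log:

```json
{
  "uuid": "00000000-0000-0000-0000-000000000000",
  "name": "held-node-test",
  "public_keys": {
    "mykey": "ssh-ed25519 AAAA... user@example"
  }
}
```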
fungi | mgariepy: nodepool builds the images by calling diskimage-builder, but the basic parameters for that are configured here: https://opendev.org/openstack/project-config/src/branch/master/nodepool/nodepool.yaml#L103 | 13:05 |
fungi | most of the elements listed are part of dib's stdlib, but a few (like infra-package-needs) are custom here: https://opendev.org/openstack/project-config/src/branch/master/nodepool/elements | 13:07 |
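Roughly what one of those diskimage entries in nodepool.yaml looks like — a hedged sketch, since only `infra-package-needs` is named above and the other element names here are illustrative stand-ins for dib's stdlib elements:

```yaml
diskimages:
  - name: centos-9-stream
    elements:
      - centos-minimal
      - vm
      - simple-init          # glean-based network config from configdrive
      - infra-package-needs  # custom element from project-config
    env-vars:
      DIB_RELEASE: '9-stream'
```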
mgariepy | cool i'll take a look to try them on my servers so i can stop asking for hold : ) haha | 13:08 |
fungi | yeah, if you have an openstack you can just upload those to glance and set the appropriate metadata for your ssh key, then boot them with configdrive enabled | 13:09 |
Clark[m] | It is also worth noting that you should look at the logs jobs produce and determine if they are sufficient or need to be improved. In this case the error was logged | 13:10 |
fungi | right, and obviously if there's information you're missing which would have helped to find that, gathering those additional logs in the job would be a great idea | 13:10 |
Clark[m] | Looking at zuul restart progress as soon as ze12 is paused I think we can test the rax swift uploads as ze12 won't schedule new jobs | 13:11 |
fungi | Clark[m]: agreed. what's the simplest way to exercise base-test? | 13:11 |
fungi | we just need a change in an untrusted repo which would normally run something directly parented to base, i guess, and reparent it | 13:11 |
fungi | but wondering if you happen to know a good one off the top of your head | 13:12 |
Clark[m] | fungi: I typically use a DNM change against zuul-jobs swapping out base for base-test on its unit tests iirc | 13:12 |
fungi | maybe something in zuul/zuul-jobs | 13:12 |
fungi | yeah that would work | 13:12 |
fungi | i'll get that pushed now | 13:12 |
*** pojadhav|afk is now known as pojadhav | 13:12 | |
opendevreview | Jeremy Stanley proposed zuul/zuul-jobs master: DNM: exercise base-test job https://review.opendev.org/c/zuul/zuul-jobs/+/844291 | 13:14 |
*** dviroel_ is now known as dviroel | 13:15 | |
Clark[m] | https://zuul.opendev.org/t/zuul/stream/1906f09b18644882971112de85f54ef4?logfile=console.log if that uploads to rax it is running on ze04 | 13:24 |
Clark[m] | It has reported and looks like it uploaded to rax and the logs are viewable | 13:29 |
Clark[m] | Someone not on a phone should double check all that :) but we may need to bring this up with openstacksdk next if that all checks out | 13:30 |
*** marios|ruck is now known as marios|ruck|call | 13:32 | |
fungi | i'll recheck it again just so we have a bit more data | 13:47 |
opendevreview | Merged openstack/project-config master: Add a repository for the Large Scale SIG https://review.opendev.org/c/openstack/project-config/+/843534 | 14:01 |
corvus | ze12 is still running, so there's still a small chance it could run a job | 14:13 |
*** rlandy__ is now known as rlandy | 14:16 | |
fungi | yeah, i'll check that examples didn't come from there | 14:22 |
fungi | we're still waiting for ze11 to stop completely | 14:23 |
Clark[m] | ze12 is paused now. All new jobs should run with old openstacksdk | 14:58 |
*** ysandeep is now known as ysandeep|out | 14:58 | |
*** marios|ruck|call is now known as marios|ruck | 15:06 | |
*** dviroel is now known as dviroel|lunch | 15:08 | |
clarkb | tox-py27 and tox-py38 on fungi's recheck both uploaded to rax and they have logs | 15:31 |
clarkb | the other tox jobs uploaded to ovh and also have logs | 15:31 |
clarkb | I'm pretty much convinced now that openstacksdk is deleting our metadata somehow | 15:31 |
clarkb | er openstacksdk 0.99.0 | 15:31 |
clarkb | gtema: fyi ^ would it help to send email to the discuss list about that or is filing an issue better or? | 15:32 |
clarkb | fungi: and I think we can go ahead and revert the rax removal from base jobs? | 15:33 |
fungi | clarkb: i think so, still seems to be working | 15:33 |
fungi | checking to see if i proposed the revert as wip | 15:34 |
opendevreview | Jeremy Stanley proposed opendev/base-jobs master: Revert "Temporarily stop uploading logs to Rackspace" https://review.opendev.org/c/opendev/base-jobs/+/844316 | 15:36 |
fungi | clarkb: ^ | 15:36 |
clarkb | I've gone ahead and approved that as ze12 won't run any new jobs now | 15:41 |
opendevreview | Merged opendev/base-jobs master: Revert "Temporarily stop uploading logs to Rackspace" https://review.opendev.org/c/opendev/base-jobs/+/844316 | 15:47 |
*** marios|ruck is now known as marios|out | 15:53 | |
*** dviroel|lunch is now known as dviroel | 16:20 | |
clarkb | corvus: I think we might be stuck stopping mergers still | 16:36 |
clarkb | corvus: it looks like the merger restarted instead of stopping? | 16:37 |
clarkb | ansible is still waiting on it to stop so I don't know what happened there | 16:37 |
clarkb | maybe docker restarted it too quickly for our wait to notice? | 16:38 |
fungi | oh, hrm | 16:42 |
clarkb | ok ya the container id never changed the process just restarted | 16:42 |
clarkb | the way we wait for the container to stop is we expect the container to go away? | 16:43 |
clarkb | so I think there is a mismatch in how the graceful stop works now and how docker is reporting the container presence | 16:43 |
fungi | though docker-compose logs for it doesn't mention any restart | 16:43 |
fungi | yeah, i concur, i don't think docker restarted it | 16:43 |
clarkb | docker ps -a says it is up 16 minutes | 16:43 |
clarkb | and ps proper seems to correlate with that | 16:43 |
fungi | oh, docker restarted it but didn't percolate into the docker-compose logs? | 16:44 |
fungi | s/oh/or/ | 16:44 |
clarkb | https://paste.opendev.org/show/b972p4Kp9z2t6psFkHmZ/ | 16:45 |
clarkb | `docker-compose ps -q | xargs docker wait` is what we are waiting on so its not the logs that matter but the listing | 16:45 |
clarkb | restart: always <- is set on the service | 16:46 |
clarkb | I think that is the issue | 16:46 |
clarkb | ya executors set restart: on-failure | 16:47 |
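The mismatch described above comes down to one line of compose config: with `restart: always`, Docker relaunches a gracefully-stopped merger under the same container id, so `docker-compose ps -q | xargs docker wait` never returns. A sketch of the fix (service name assumed):

```yaml
services:
  merger:
    # was "restart: always", which relaunched the process after a
    # graceful stop; on-failure lets a clean exit actually stop the
    # container so "docker wait" can return
    restart: on-failure
```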
clarkb | let me push a fix then once that lands we can manually trigger a merger stop again which the ansible playbook should catch allowing it to continue | 16:47 |
clarkb | hrm schedulers are also restart always. How did they work before? | 16:48 |
clarkb | ah because we down the scheduler rather than doing a graceful thing | 16:48 |
clarkb | so ya one sec change incoming | 16:48 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix zuul merger graceful stops https://review.opendev.org/c/opendev/system-config/+/844320 | 16:51 |
corvus | clarkb: ah yep, lgtm | 16:54 |
*** jpena is now known as jpena|off | 17:10 | |
clarkb | infra-root https://etherpad.opendev.org/p/3Rmqg-Tbb8qOqi1nFQp1 that's an openstack-discuss email about openstacksdk 0.99.0 and the swift uploads. Does that draft look good? | 17:32 |
clarkb | also looking at the script I wonder if delete after was not set on those objects either | 17:34 |
clarkb | which means we'll have leaked the job logs from those days potentially | 17:34 |
frickler | gtema: ^^ | 17:42 |
gtema | I will be looking at that, was waiting to get a bit more details | 17:43 |
gtema | For sure 0.99 changes things on this front too, but I really wonder why it now fails for a particular type of cloud only | 17:44 |
gtema | Would be good to think about testing possibility | 17:44 |
frickler | well it seems to fail only for rax and not for ovh | 17:44 |
clarkb | well it looks like we have to specifically set a header on each object in rax to get the cors headers (linked to in that etherpad) | 17:45 |
clarkb | which is why I'm wondering if we just aren't setting those headers anymore | 17:46 |
frickler | so possibly would need some historic version of swift to reproduce | 17:46 |
clarkb | which is why I also wonder about the delete after values | 17:46 |
fungi | rackspace's swift wasn't ever actually swift, from what i gather | 17:46 |
frickler | clarkb: but setting the headers on ovh continued to work? or don't we need them there? | 17:46 |
clarkb | frickler: we don't appear to need them there (they must either default to the right cors value or maybe they consult the index.html value for the container top level) | 17:47 |
corvus | the headers are required for allowing access through rackspace's cdn, which is the only way to have anonymous public access in rax. that's the main difference. | 17:47 |
clarkb | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/zuul_swift_upload.py#L108-L129 is the other place we set these headers but that happens once per container and all our containers would've had that happen long ago | 17:47 |
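The two layers of headers being discussed can be sketched like this — names mirror what the linked upload role and the rax CDN discussion above describe, but the exact values are illustrative:

```python
# Container-level metadata uses the X-Container-Meta- prefix that swift
# recognizes; it is set once when a log container is first created.
def container_headers():
    return {
        "X-Container-Meta-Web-Index": "index.html",
        "X-Container-Meta-Access-Control-Allow-Origin": "*",
    }

# Per-object headers: X-Delete-After handles log expiry, and the bare
# CORS header is the rax-CDN-specific requirement -- it is not a
# prefixed swift metadata header, which is why a strict SDK filter
# could silently drop it.
def object_headers(delete_after=2592000):
    return {
        "X-Delete-After": str(delete_after),
        "Access-Control-Allow-Origin": "*",
    }
```

If the SDK stops passing the bare headers through, both the CORS behavior on rax and the expiry on every cloud would break at once, which matches the symptoms above.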
fungi | i suppose we could record the api interactions at debug level with openstacksdk 0.61.0 and 0.99.0 and compare them | 17:48 |
corvus | however, if the mechanism for setting those is broken in general, then it's possible that the x-delete-after header was not being set on *any* cloud uploads, so as clarkb suggested, we may have objects in *all* of our clouds which will not expire automatically. | 17:48 |
clarkb | fungi: ya that might be a good way to test it too. Can check that all expected headers make it outbound | 17:48 |
clarkb | corvus: exactly | 17:48 |
fungi | clarkb: small edits to the pad, also those urls are not permalinks so may change before people read them | 17:51 |
clarkb | fungi: good point I can fix the links | 17:51 |
clarkb | I'll also add a note about x-delete-after too | 17:51 |
clarkb | alright sending that out | 17:53 |
fungi | thanks! | 17:56 |
clarkb | we can track progress on the problem there. To be clear I'm not sure how much I'll be able to debug for the next while as other things are distracting too :) I think we're likely stable on 0.61.0 though | 17:56 |
timburke_ | from a swift perspective, i would've expected CORS to be controlled via X-Container-Meta-Access-Control-Allow-Origin set at the container level, not Access-Control-Allow-Origin set on individual objects | 18:04 |
gtema | I am pretty confident x-delete was working earlier, since i rely on that heavily in my cloud. Will anyway have a deeper analysis tomorrow (will try to bisect what particularly changed in 0.99) | 18:04 |
timburke_ | rax might be doing something different, though -- while they certainly used to run swift, they might be running hummingbird for cloudfiles these days. even with vanilla swift, though, you could probably update the allowed_headers configured on the object-server to have that CORS header stick -- i just didn't know of anyone that did that | 18:04 |
clarkb | timburke_: according to the comments its specific to their CDN | 18:05 |
gtema | The only thing that immediately came to my mind is eventually changed case for header names | 18:06 |
clarkb | timburke_: basically we're setting an object metadata/header value in swift to affect another service | 18:06 |
clarkb | timburke_: and ya we set it at the container level too https://opendev.org/zuul/zuul-jobs/src/commit/e69d879caecb454c529a7d757b80ae49c3caa105/roles/upload-logs-base/library/zuul_swift_upload.py#L112-L117 | 18:07 |
clarkb | however we use something like 4096 containers to shard the logs across and all of those would've been created a long time ago so hard to say if that is also affected here | 18:07 |
timburke_ | 👍 yeah, the CDN stuff can definitely cause headaches, too | 18:07 |
gtema | Clarkb: do you have a chance to look at any of the "broken" containers/objects in rax to see if they have any headers set? | 18:10 |
clarkb | let me see | 18:11 |
clarkb | have to find an object from last week | 18:11 |
gtema | No hurry, thanks. I am anyway already off for today | 18:12 |
clarkb | ya enjoy your evening. we've managed to work around it for now | 18:12 |
clarkb | hrm https://d321133537aef6ff2c0f-8ffa80ef1885272f8fa2b55d06420ca4.ssl.cf2.rackcdn.com/837180/7/check/designate-bind9-stable-xena/554a978/ is an example but I'm not sure how to map that to a container or cloud | 18:12 |
*** rlandy is now known as rlandy|mtg | 18:13 | |
gtema | Ok, that should work. Regular curl should help here | 18:13 |
clarkb | gtema: I'm not sure if that will pass through all of the swift metadata though. I'm trying to map it to a swift container so I can check that directly but we'll see how successful I am | 18:15 |
gtema | well, I clearly see that X-Delete-At is set on container and any random object | 18:16 |
gtema | and Access-Control-Allow-Origin: "*" is set on root | 18:17 |
clarkb | for the objects at the link I just provided? I don't see either | 18:18 |
gtema | clarkb: are you sure this is a bad example? | 18:18 |
clarkb | but maybe you don't get them by default with a basic GET | 18:18 |
clarkb | gtema: yes https://zuul.opendev.org/t/openstack/build/554a978fa1f346ddb89aea349cd4d76b fails to load and according to the console it is due to CORS and if you clikc the view log link top right you get the url above | 18:19 |
clarkb | and there definitely isn't cors headers set | 18:19 |
gtema | ok, I see | 18:20 |
clarkb | ya ok curl shows the X-Delete-At but Firefox didn't | 18:20 |
clarkb | gtema: https://e6bafb017d035c5dec65-4d6af7aae3d7a953b788d97e53f5e54e.ssl.cf1.rackcdn.com/844291/1/check/tox-py27/5422451/ is a working example uploaded by 0.61.0 | 18:21 |
fungi | did something else from that buildset upload to ovh so we can compare? are we also missing the headers there? | 18:21 |
clarkb | Access-Control-Allow-Origin: * is present there | 18:21 |
gtema | yes, I see | 18:21 |
clarkb | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_780/837180/7/check/designate-bind9-stable-yoga/780677a/ is an ovh build result from the same buildset as the failing rax 0.99.0 case | 18:22 |
gtema | https://opendev.org/zuul/zuul-jobs/src/commit/e69d879caecb454c529a7d757b80ae49c3caa105/roles/upload-logs-base/library/zuul_swift_upload.py#L227 - there is even explicit "hack" for rax in the role | 18:23 |
clarkb | gtema: yes I called that out in my email | 18:23 |
gtema | ah, right | 18:23 |
clarkb | interesting the ovh case also has CORS errors in the console but it loads the logs in the dashboard anyway | 18:25 |
clarkb | ah but it is only for a specific file which apparently zuul doesn't need and it isn't fatal? | 18:26 |
opendevreview | Merged opendev/system-config master: Fix zuul merger graceful stops https://review.opendev.org/c/opendev/system-config/+/844320 | 18:26 |
gtema | ugh, I think I see the issue: access-control-allow-origin is not matching expected prefix for supported object headers(https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/object_store/v1/_base.py#L37) | 18:27 |
fungi | once that ^ deploys, we'll need to manually down the zuul-merger container on zm01 but after that we should be good? | 18:27 |
gtema | I will test this carefully tomorrow | 18:27 |
*** artom_ is now known as artom | 18:27 | |
fungi | gtema: is there a reason for having an allowed list of object headers? i thought you could include any arbitrary header | 18:28 |
gtema | well, pretty much API description of Swift that tells that object metadata starts with X-Object-Meta | 18:29 |
gtema | there is set of additional "system" headers https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/object_store/v1/obj.py#L23 but cors headers are not in there | 18:29 |
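The filtering gtema is describing can be modeled like this — a hedged sketch, with the prefix and "system" header list paraphrased from the linked `_base.py`/`obj.py` rather than copied from the actual SDK code:

```python
# Illustrative model of a strict object-metadata filter: anything not
# prefixed with X-Object-Meta- and not in a known system-header list
# is dropped before the PUT reaches swift.
SYSTEM_HEADERS = {"content-type", "content-encoding",
                  "x-delete-after", "x-delete-at"}
META_PREFIX = "x-object-meta-"

def filter_headers(headers):
    kept = {}
    for name, value in headers.items():
        lname = name.lower()
        if lname.startswith(META_PREFIX) or lname in SYSTEM_HEADERS:
            kept[name] = value
    return kept

sent = filter_headers({
    "X-Delete-After": "2592000",            # recognized, passed through
    "Access-Control-Allow-Origin": "*",     # unrecognized, silently dropped
})
```

Under this model the expiry header survives but the rax CDN CORS header never leaves the client, which would explain logs that upload fine yet fail to render in the dashboard.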
gtema | I will create a test case tomorrow for that in sdk | 18:29 |
gtema | hopefully however we can unblock ourselves with devstack-networking often OOMing | 18:30 |
clarkb | interestingly when I make requests against ovh with curl I don't get the cors headers but when my browser makes them it does get them. The one that fails is for a 404 which is why it breaks and why it's fine | 18:32 |
timburke_ | fwiw, there's also a configurable list of additional headers that may be stored: https://github.com/openstack/swift/blob/2.29.1/etc/object-server.conf-sample#L146 | 18:33 |
gtema | so at the moment code works correctly according to official docs. And since cors is an additional middleware it is not present in API docs and thus not properly considered even as concept | 18:33 |
clarkb | we're fetching job-output.json.gz but the path is actually job-output.json. I think it's a compat thing to try and find multiple possible versions but the 404 version fails CORS headers because nothing has said a 404 is a valid cross site request | 18:33 |
timburke_ | clarkb, if you include something like `-H 'Origin: example.com'` you'll get the CORS header | 18:33 |
gtema | clarkb: I think you also need to send a Referer header or something like that (browsers should be doing that) | 18:33 |
clarkb | timburke_: ah thanks | 18:33 |
clarkb | gtema: re working correctly according to official docs I don't think the docs say other headers are invalid just that if properly formatted they are treated special? | 18:34 |
fungi | is the client-side header filtering idea that it can save the user time and bandwidth over a server-side api rejection? | 18:34 |
clarkb | gtema: I did look at that fwiw and didn't find anything saying arbitrary headers are invalid/disallowed just that properly formatted ones can be managed by swift | 18:34 |
gtema | clarkb: it depends on how you read the doc. It of course does not mention that not listed headers are forbidden, but it lists headers it recognizes | 18:35 |
clarkb | gtema: right | 18:35 |
fungi | because otherwise, a client second-guessing server-side limits is at best a redundancy and at worst likely to diverge over time | 18:35 |
clarkb | I would argue clients/sdks/tools shouldn't be overly aggressive then | 18:35 |
clarkb | client tools should always be forgiving | 18:35 |
fungi | postel's law | 18:36 |
clarkb | then let the remote end be angry if necessary | 18:36 |
fungi | says the opposite ;) | 18:36 |
gtema | approach of SDK was always to try to fail client as early as possible before even reaching server | 18:36 |
fungi | but yeah, i don't think it's applicable in this case | 18:36 |
clarkb | gtema: the problem with that is openstack has never been consistent enough to make that a reasonable thing to do | 18:36 |
gtema | that is so sadly true, this makes me cry | 18:37 |
* gtema is wiping tears | 18:37 | |
gtema | okay, as said - I will try to fix that tomorrow | 18:38 |
clarkb | fungi: ya I mean swift seems to ignore it entirely in the ovh case for example | 18:38 |
clarkb | the zuul merger docker compose config fix is deploying now | 18:39 |
clarkb | once it deploys I'll manually gracefully stop zm01 again and see if we get further | 18:39 |
clarkb | I wonder if I have to down/up the container to pick up the new config though :/ | 18:39 |
fungi | from an sdk standpoint, i would interpret postel's law as saying that the user should be conservative in what data they supply as inputs but the sdk should be forgiving if what the user supplies it. then the sdk should be as conservative as it can in what it sends to the server-side api (while still trying to honor the caller's wishes), and the server should be as accepting as possible about | 18:39 |
fungi | what it receives from the sdk | 18:39 |
corvus | fwiw, the sdk docs don't say they will filter the header list: https://docs.openstack.org/openstacksdk/latest/user/connection.html#openstack.connection.Connection.create_object | 18:40 |
corvus | ```headers – These will be passed through to the object creation API as HTTP Headers.``` | 18:40 |
corvus | (which, to be clear, is what i think is the expected and desired behavior) | 18:41 |
gtema | btw, https://docs.openstack.org/swift/latest/cors.html mentions that you need to set X-Container-Meta-Access-Control-Allow-Origin | 18:42 |
clarkb | ya we do that too https://opendev.org/zuul/zuul-jobs/src/commit/e69d879caecb454c529a7d757b80ae49c3caa105/roles/upload-logs-base/library/zuul_swift_upload.py#L113 | 18:43 |
gtema | correct. That is why it works for OVH and not for RAX | 18:44 |
clarkb | that and the containers were all created years ago | 18:44 |
clarkb | but ya if we ran this against ovh today and it created new containers it would probably work | 18:44 |
gtema | if they have something not standard (what if not matching API docs of swift) we have issues | 18:44 |
gtema | ok, done for tonight. Will add exception to sdk | 18:45 |
clarkb | deployment of the zm fix is done. Manually running the merger stop on zm01 now | 18:48 |
fungi | also remember that the current api docs for swift are not necessarily going to be relevant to the 10-year-old fork some major service providers are still running | 18:48 |
clarkb | the playbook is proceeding | 18:48 |
corvus | we also set content-encoding and content-type using that mechanism -- do we know if we expect those to make it through 0.99? | 18:49 |
fungi | but users may have an application which needs to talk to diablo-era and yoga-era swift in different providers at the same time | 18:49 |
clarkb | corvus: fungi might be a good idea to followup on the thread with that info so it isn't lost in irc scrollback? but ya I agree those are good questions and considerations :) | 18:50 |
fungi | which was the use case for the code in nodepool which was later extracted to become shade and then merged into openstacksdk | 18:50 |
clarkb | I think zm02 is hitting the same problem because it started on the wrong config? | 18:50 |
clarkb | we may have to manually stop each merger. I'll do that if so | 18:51 |
fungi | oh, so we'll need to stop them all manually this time | 18:51 |
fungi | yeah, that makes sense. i guess docker-compose interprets that when "upping" and doesn't re-read it for other actions | 18:51 |
corvus | but this is the last time, really for real this time | 18:51 |
clarkb | yes I think so. I'm running the same command the playbook runs to stop them which means in theory they will all work next time | 18:51 |
timburke_ | looks like content-encoding and content-type should be fine: https://opendev.org/openstack/openstacksdk/src/branch/master/openstack/object_store/v1/obj.py#L26-L27 | 18:52 |
clarkb | ERROR: 137 trying to stop on zm03 but it seems to have stopped | 18:54 |
clarkb | 04 didn't do that but 05 did. I wonder if it's a timing thing stopping the merger too close to startup | 18:58 |
clarkb | I'll give 06 plenty of time | 18:59 |
fungi | i guess 137 is a docker-specific exit code? i don't see zuul special-casing it anyway | 19:00 |
clarkb | ya I think so | 19:00 |
clarkb | apparently error 137 is a "I don't have enough memory" error | 19:02 |
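Exit code 137 is the shell convention of 128 plus the signal number, i.e. SIGKILL (9) — which is what both the kernel OOM killer and docker's stop-timeout escalation deliver, so the code alone is ambiguous between "out of memory" and "killed after the graceful-stop grace period". A quick check:

```python
import signal

# 128 + signal number is the conventional exit code for a process
# terminated by that signal; SIGKILL is 9, so a SIGKILL'd container
# reports 137 regardless of whether the OOM killer or docker's
# stop-timeout sent it.
exit_code = 128 + int(signal.SIGKILL)
print(exit_code)  # 137
```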
clarkb | maybe our mergers are a bit too small? | 19:02 |
fungi | maybe it tried to start a new merger process while the old one's allocations hadn't been cleaned up? | 19:03 |
clarkb | free reports plenty of available memory | 19:03 |
fungi | but yeah, the mergers only have 2gb ram | 19:04 |
fungi | they do have swap too though | 19:04 |
clarkb | ya 07 was fine | 19:04 |
clarkb | something to keep an eye on but probably not urgent? | 19:04 |
clarkb | 8 is proceeding now. It should get to zuul01 processes fairly quickly | 19:06 |
clarkb | yup zuul01 is stopping now | 19:07 |
clarkb | corvus: I notice that the fingergw does not remove itself from the components registry when stopped | 19:08 |
clarkb | oh wait there it goes. Maybe just a delay on the zk ephemeral node cleanup? | 19:08 |
clarkb | it is waiting for the scheduler on 01 to start now. i'm going to eat lunch while that happens | 19:09 |
fungi | yeah, web and scheduler will take a while | 19:09 |
*** rlandy|mtg is now known as rlandy | 19:12 | |
clarkb | looks like it is doing 02 now | 19:38 |
fungi | yep | 19:40 |
clarkb | the playbook is done. Seems to have had no errors. I'll close the screen session now since everything was logged for it | 20:06 |
clarkb | part of me wants to run it again today just to make sure the mergers are happy but I don't think that is super important | 20:07 |
fungi | i'm happy to run it again. we technically also didn't really exercise the image update this last time since that was done prior to restarting the mergers manually | 20:09 |
corvus | if you wanted a new version of zuul for the next update....merging https://review.opendev.org/843737 would do, and it's operationally interesting for opendev... ;) | 20:12 |
corvus | (meanwhile, any objection to my restarting the launchers?) | 20:13 |
fungi | no objection from me | 20:13 |
clarkb | ya no objection here, though that will likely pull in openstacksdk 0.99.0 on the launchers | 20:15 |
clarkb | I think now that the previous issue is better understood we wouldn't expect that to affect nodepool, but calling it out as a change | 20:15 |
corvus | #status log restarted nodepool launchers on 6416b1483821912ac7a0d954aeb6e864eafdb819, likely with sdk 0.99 | 20:15 |
opendevstatus | corvus: finished logging | 20:15 |
clarkb | I just think we should status log the restart of zuul too | 20:16 |
corvus | clarkb: agreed | 20:16 |
clarkb | #status log Restarted all of zuul on 6.0.1.dev54 69199c6fa | 20:16 |
corvus | (agreed re sdk) | 20:16 |
opendevstatus | clarkb: finished logging | 20:16 |
corvus | openstack.exceptions.BadRequestException: BadRequestException: 400: Client Error for url: [...] Bad networks format | 20:17 |
corvus | i'm looking into whether that's new or not | 20:17 |
corvus | nope that's new | 20:18 |
corvus | i'm going to assume occam's razor and that's an sdk 0.99 bug (i have confirmed 0.99 is in the container) | 20:19 |
clarkb | wouldn't surprise me | 20:19 |
corvus | next step? roll back our launchers to nodepool 6.0.0 and then merge a pin? | 20:19 |
clarkb | seems reasonable to me | 20:20 |
clarkb | I don't think opendev is relying on any new unreleased nodepool features/functionality | 20:20 |
fungi | oh that's a fun error | 20:20 |
fungi | i'll get the pin pushed | 20:21 |
corvus | ansible -f 20 nodepool-launcher -m shell -a 'docker pull zuul/nodepool-launcher:6.0.0; docker tag zuul/nodepool-launcher:6.0.0 zuul/nodepool-launcher:latest' | 20:22 |
corvus | #status log restarted nodepool launchers on 6.0.0 after encountering suspected sdk 0.99 bug | 20:22 |
opendevstatus | corvus: finished logging | 20:22 |
clarkb | I'll review the pin but then I'm going for a bike ride. My opportunity to do that are becoming fewer as we get closer to the summit | 20:23 |
fungi | corvus: clarkb: https://review.opendev.org/c/zuul/nodepool/+/844334 Temporarily pin OpenStackSDK before 0.99 | 20:27 |
clarkb | heh we even had the 1.0.0 cap | 20:28 |
clarkb | +2 | 20:28 |
fungi | yeah... | 20:29 |
clarkb | and now bike ride time. Back in a bit | 20:30 |
*** timburke_ is now known as timburke | 20:59 | |
*** dviroel is now known as dviroel|out | 21:34 | |
clarkb | fungi: just to catch up were you going to rerun the reboot playbook? I can help keep an eye on it if so | 22:26 |
fungi | lemme check if that zuul change merged and published | 22:30 |
clarkb | I think it did | 22:31 |
clarkb | assuming the change that merged is the right one | 22:31 |
fungi | yeah, promote finished | 22:31 |
fungi | i have a root screen session with the new run teed up | 22:32 |
clarkb | cool I'm not joined yet but can keep an eye on it via /components and grafana and dig in futher if necessary | 22:32 |
fungi | ready to hit enter if no immediate objections | 22:33 |
clarkb | none from me. | 22:33 |
fungi | fire in the hole! | 22:33 |
fungi | seems to have pulled the new images | 22:33 |
clarkb | ya that should be the first thing it does | 22:33 |
fungi | ze01 should be in the process of stopping | 22:33 |
*** rlandy is now known as rlandy|out | 22:44 | |
clarkb | reminder I plan to delete the ethercalc server and its dns records tomorrow | 23:52 |
clarkb | I haven't heard any noise since we shutdown the server. Please let me know if you saw something I missed | 23:52 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!