opendevreview | Rajesh Tailor proposed openstack/nova master: Fix KeyError on assisted snapshot call https://review.opendev.org/c/openstack/nova/+/900783 | 07:01 |
---|---|---|
tkajinam | https://zuul.opendev.org/t/openstack/builds?job_name=grenade-skip-level-always&project=openstack/nova | 07:38 |
tkajinam | I'm afraid grenade-skip-level-always might be broken. I've not yet looked into it though. | 07:39 |
frickler | ERROR cinder etcd3gw.exceptions.Etcd3Exception: Not Found | 07:46 |
tkajinam | frickler, ah, thanks for that info. I saw the same error in a few run so there may be something wrong with cinder | 07:59 |
tkajinam | it seems cinder is deployed with etcd3gw tooz driver. | 07:59 |
tkajinam | I vaguely guess the issue was caused by the new tooz release and https://review.opendev.org/c/openstack/tooz/+/891355 might be the cause | 08:03 |
tkajinam | hmm. do we really need that job in master ? I see the job test upgrade from zed to 2023.2 | 08:06 |
tkajinam | and does not look appropriate for master | 08:06 |
tkajinam | ignore above. I noticed I was looking at a wrong branch of grenade... | 08:08 |
bauzas | tkajinam: looking | 09:30 |
bauzas | yeah this is failing when upgrading cinder | 09:31 |
bauzas | we're on Caracal, so it's upgrading from Antelope | 09:32 |
bauzas | https://github.com/openstack/grenade/commit/159054a9274f984d8e53e5e5bd9e7ab0fb29708b | 09:33 |
bauzas | (C is a SLURP release) | 09:33 |
bauzas | heh, fun, we got a GMR report in https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_73a/900783/2/check/grenade-skip-level-always/73acbef/controller/logs/screen-c-vol.txt | 09:36 |
bauzas | 2023-11-17 06:30:28.118 | ++ /opt/stack/new/grenade/projects/70_cinder/resources.sh:verify:220 : openstack volume show cinder_grenade_vol2 -f shell -c status 2023-11-17 06:30:29.692 | Internal Server Error (HTTP 500) | 09:39 |
bauzas | yeah, something got wrong on the cinder side | 09:39 |
bauzas | and yup yup, good catch tkajinam | 09:40 |
bauzas | https://paste.opendev.org/show/br9ep5e0MjNReG9A9ta7/ | 09:40 |
frickler | bauzas: cf. https://review.opendev.org/c/openstack/devstack/+/901221 | 10:00 |
bauzas | ack thanks | 10:00 |
bauzas | I don't exactly get the patch that broke tooz yet | 10:01 |
bauzas | but it was merged on Oct 16 | 10:02 |
bauzas | IMHO we should rather revert it | 10:02 |
bauzas | https://review.opendev.org/c/openstack/tooz/+/891355 was merged on Sep 28 so it's not the culprit | 10:03 |
frickler | bauzas: tooz was released yesterday with that patch afaict, added into u-c with https://review.opendev.org/c/openstack/requirements/+/901131 | 10:09 |
frickler | maybe that default setting change should have triggered a major version bump. also maybe grenade should be patched to handle the etcd upgrade | 10:11 |
bauzas | frickler: oh right, tooz is a library | 10:12 |
bauzas | don't judge me, it's Friday morning and I'm already tired :) | 10:13 |
bauzas | and yeah, looks to me it's a behavioural change | 10:13 |
frickler | I don't judge anyone on Friday's ;) | 10:13 |
bauzas | "* Note that this probably means a new minor version of tooz, while the behaviour can be worked around in config, this is enough to break existing systems on upgrade." | 10:15 |
bauzas | indeed it breaks | 10:15 |
bauzas | frickler: I'm tempted to require a u-c reverty | 10:17 |
bauzas | elodilles: ^ | 10:17 |
bauzas | elodilles: fwiw, our grenade skip level job is broken due to the new tooz releasee https://review.opendev.org/c/openstack/releases/+/899966 | 10:18 |
elodilles | reading | 10:20 |
bauzas | I just sent the usual 'hold your rechecks' email to the wolrd | 10:24 |
bauzas | frickler: I feel something is wrong with the tooz patch | 10:26 |
bauzas | if we want to have SLURP upgrades, we somehow need to keep a backwards compatible behaviour for a second | 10:26 |
bauzas | or we need to say 'this is a major breaking change' and only once we no longer support older releases that we want to pull it | 10:27 |
elodilles | bauzas: thanks for the mail o/ (maybe [oslo] team should have been in the subject, too) | 10:31 |
bauzas | yeah | 10:31 |
bauzas | well, oslo is free to deliver major versions of their libraries that break compatibility | 10:32 |
bauzas | but then they need to signal it correctly for the release team appropriately | 10:32 |
bauzas | elodilles: I tagged with [tooz] I get that filter is maybe too much restricted ? | 10:33 |
bauzas | (fwiw, I don't do tag filtering on my mailbox) | 10:33 |
bauzas | (and I hope oslo contributors to not do it as well) | 10:33 |
elodilles | yes, they probably see the mail with [tooz] tag as well | 10:36 |
bauzas | elodilles: https://review.opendev.org/c/openstack/requirements/+/901079 | 10:40 |
elodilles | +1 | 10:42 |
elodilles | (from release team perspective i'm still happy we are not finding this one week before the final release o:)) | 10:43 |
bauzas | all hail to the grenade-skip-level lord | 10:44 |
frickler | and to the "release at rc-1" policy | 10:44 |
frickler | eh milestone-1 | 10:45 |
elodilles | yepp \o/ | 10:45 |
* bauzas goes outside for sweating needs | 10:49 | |
opendevreview | Rajesh Tailor proposed openstack/nova master: trivial doc fix https://review.opendev.org/c/openstack/nova/+/901237 | 11:33 |
pas-ha[m] | re tooz/etcd - may I bring your attention to https://review.opendev.org/c/openstack/etcd3gw/+/887662 - this adds version autodiscovery to etcd3gw, once there, we could switch default version in tooz to 'auto', and everyone is happy, new tooz will be able to work with the same 'backend_url' with old and new etcd3 | 11:34 |
pas-ha[m] | this allows to upgrade the etcd and tooz-using components in whatever order w/o changing the configuration | 11:39 |
opendevreview | Merged openstack/nova master: Bump jsonschema minimum to 4.0.0 https://review.opendev.org/c/openstack/nova/+/850021 | 12:57 |
jangutter | pas-ha[m]: regarding version autodetect: https://lists.openstack.org/pipermail/openstack-discuss/2023-August/034704.html | 13:33 |
jangutter | pas-ha[m]: probing for a version also means you don't error on the case where you forgot to upgrade your etcd. | 13:35 |
-opendevstatus- NOTICE: Gerrit will be unavailable for a short time starting at 15:30 UTC as it is upgraded to the 3.8 release. https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/XT26HFG2FOZL3UHZVLXCCANDZ3TJZM7Q/ | 14:07 | |
bauzas | lemme get it straight | 14:13 |
bauzas | etcd changed its endpoint by a release | 14:14 |
bauzas | tooz accordingly changed to use the newer endpoint by a specific tooz release | 14:14 |
bauzas | given now, we upgrade tooz but not etcd, tooz talks the new way while it should talk to the other endpoint | 14:15 |
bauzas | amirite, jangutter pas-ha[m], frickler ? | 14:15 |
jangutter | bauzas: correct, it's also possible to override the endpoint away from the default, in cinder config. Both old and new tooz can work with old and new etcd if the endpoint is given explicitly. | 14:17 |
tkajinam | the problem is that tooz used to use v3beta which is hardcoded and users had to override it in case needed | 14:18 |
bauzas | jangutter: was cinder using the default, or was it specifically asking an endpoint? | 14:18 |
tkajinam | the problem here is that in stable/2023.1 we had old tooz using v1beta by default along with old etcd3 with v1beta. now we upgrade only tooz and the new version uses v1 by default. | 14:19 |
tkajinam | and because etcd3 is still old after upgrade v3 api is not available | 14:19 |
jangutter | https://opendev.org/openstack/devstack/commit/3a7a3cd8c5a5ac3f1655d6ff17974f8623fb3330 <--- before this, cinder was using default, after this cinder was asking explicitly for v3 | 14:20 |
bauzas | we shouldn't expect operators to upgrade their dependencies when upgrading | 14:20 |
tkajinam | we added api_version=/v3 in coordination url in devstack stable/2023.2, so if you first deploy cinder with stable/2023.1 coordination url does not contain api version | 14:20 |
opendevreview | John Garbutt proposed openstack/nova-specs master: Expose PCI device NUMA using PXB https://review.opendev.org/c/openstack/nova-specs/+/869416 | 14:20 |
tkajinam | and the same config value is carried over so tooz continue using the default which has been changed from stable/2023.1 to master | 14:21 |
jangutter | tkajinam: and it's worse, tooz was using v3alpha, not v3beta https://review.opendev.org/c/openstack/tooz/+/891355/6/tooz/drivers/etcd3gw.py | 14:21 |
bauzas | okay, but that's then a devstack config problem | 14:21 |
tkajinam | ah, yeah. I was wrong/ v1alpha | 14:21 |
bauzas | jangutter: why shouldn't we changing devstack now to use the default? | 14:22 |
tkajinam | bauzas, no we can't use default | 14:22 |
bauzas | that's still a breaking change and people would need to change their config when upgrading to Caracal | 14:22 |
jangutter | the only way config stays the same is if etcd and tooz is upgraded in lockstep. | 14:23 |
tkajinam | bauzas, the default in tooz stable/2023.1 is v3alpha, the default in tooz master is v3. Because we keep using old etcd without v3 during upgrade process we have to override api version to v3alpha to keep using v3alpha after upgrade | 14:23 |
jangutter | if etcd and tooz mismatches, the endpoint has to be specified. | 14:23 |
bauzas | shouldn't tooz be somehow a bit smarter and do a failback call if etcd isn't new enough ? | 14:24 |
tkajinam | that's what was proposed above by pas-ha[m] | 14:24 |
opendevreview | John Garbutt proposed openstack/nova-specs master: Add spec for PCI Groups https://review.opendev.org/c/openstack/nova-specs/+/899719 | 14:25 |
bauzas | what I'm afraid is that operators doing upgrades would face that problem too, right? | 14:25 |
bauzas | if we start tweaking devstack, there is surely an impact for regular ops that don't deploy with it :) | 14:26 |
jangutter | that's technically down to etcd3gw backend... and etcd doesn't have a nice version discovery, so, it's the try-different-versions method. | 14:26 |
bauzas | this means they have to change some config on the fly during upgrades | 14:26 |
tkajinam | well, we did document the upgrade impact to suggest setting api version in case old api version should be used. | 14:26 |
bauzas | jangutter: I'm personnally fine with a try/catch clause | 14:27 |
tkajinam | but what we are learning here is that just documentation may kill some people. | 14:27 |
tkajinam | including us | 14:27 |
bauzas | oh yeah | 14:27 |
tkajinam | https://review.opendev.org/c/openstack/tooz/+/891355/6/releasenotes/notes/etcd-3.4-eee8300c942a1263.yaml | 14:27 |
bauzas | if you really do breaking upgrades, my personal recommandation is to provide tools that people can consume for checking their states and whether they'll break | 14:28 |
jangutter | yep, the release notes are the last resort :-/ | 14:28 |
bauzas | tkajinam: we, in nova, create the nova-status upgrade command for that exact reason | 14:28 |
bauzas | created* | 14:28 |
bauzas | because even if we spoil the breaking changes in our relnotes, people may be impacted and miss the point | 14:29 |
bauzas | anyway, I'm not saying tooz shall do a tooz-status upgrade commande | 14:29 |
bauzas | command* | 14:29 |
tkajinam | the problem may be that tooz is not the right place to provide that config validation, because tooz itself does not register config options but components using it does and pass the url value | 14:29 |
jangutter | The problem is that etcd is sometimes regarded as an _externally_ managed service, and sometimes an _internally_ managed service. | 14:29 |
tkajinam | we can provide the base implementation but every project may need to implement the command using that | 14:30 |
bauzas | I'm just saying that we need some kind of lockstep mechanism that can handle both etcd/tooz upgrades at once | 14:30 |
jangutter | yeah: for kolla-ansible that happens because in one release all the elements (config, tooz, etcd) is in lockstep. But that's not necessarily the same case for source-level projects. | 14:31 |
bauzas | problem is, again, you can't really ask operators to do some OS upgrade while you're upgrading your services | 14:32 |
bauzas | that's why grenade is not bumping the reqs | 14:32 |
jangutter | yeah, _however_ etcd in devstack isn't the os-bundled one. | 14:33 |
bauzas | just checked, we pull tooz-3.2.0 | 14:35 |
bauzas | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_73a/900783/2/check/grenade-skip-level-always/73acbef/controller/logs/old/devstacklog.txt | 14:35 |
bauzas | and we have etcd3gw==2.1.0 | 14:36 |
bauzas | oh tonyb +Wd my revert patch https://review.opendev.org/c/openstack/requirements/+/901079 | 14:36 |
tonyb | it's clearly broken | 14:38 |
*** blarnath is now known as d34dh0r53 | 14:38 | |
bauzas | oh heh, https://github.com/openstack/grenade#theory-of-upgrade | 14:38 |
bauzas | tkajinam: jangutter: ^ | 14:38 |
jangutter | yeah I saw that. | 14:38 |
tonyb | we should block it in g-r and figure out how to handle things for the longer term | 14:38 |
bauzas | "Any other required changes on upgrade are an exception and must be called out in the release notes." and "Grenade supports per release specific upgrade scripts (from-juno, from-kilo)." | 14:39 |
bauzas | tonyb: I don't disagree | 14:39 |
bauzas | I still think I'd appreciate if tooz may do some fallback mechanism to an older endpoint if the newer endpoint sounds invalid | 14:40 |
tkajinam | https://review.opendev.org/q/topic:etcd-api-discovery | 14:40 |
bauzas | just for the sake of easing people's lifes | 14:40 |
tkajinam | We probably need a bug and link these to the bug, but we can make tooz smart and discover the available api version unless user explicitly requires a specific version | 14:41 |
bauzas | tkajinam: wow, kudos | 14:41 |
tkajinam | which is being implemented in these | 14:41 |
tkajinam | (tooz patch need further work because we first need a new release of etcd3gw... | 14:41 |
bauzas | sounds a good path to me | 14:41 |
bauzas | we may ban the current tooz release and just await for a newer release once everything is fine | 14:42 |
jangutter | tkajinam: one question - if etcd gets updated (v3alpha disappears and v3 appears) - would this case be handled? | 14:42 |
tkajinam | jangutter, yes. with that code etcd checks the etcd server version using version api and use the latest api version available | 14:44 |
tkajinam | jangutter, though you may need to restart servers because the detected api is cached after the first detection. | 14:44 |
tkajinam | s/servers/services/ I mean | 14:44 |
tkajinam | s/etcd checks/tooz checks/ | 14:45 |
jangutter | takjinam: so it will appear if cinder "remembers" the version of etcd it first connected to and never try a newer version? | 14:45 |
tkajinam | there is tradeoff. we can let tooz check api version every time but it always requires additional api call | 14:46 |
tkajinam | we may probably implement expire but I'd leave it for now | 14:46 |
tkajinam | jangutter, I probably had to say this first. yes, you are correct | 14:46 |
jangutter | tkajinam: yeah, gracefully redetecting the endpoint on error is also another option, that adds time to report errors. | 14:49 |
jangutter | but this covers 95% of the pain. | 14:50 |
tkajinam | https://bugs.launchpad.net/python-etcd3gw/+bug/2043810 | 14:58 |
tkajinam | I've created a bug here... in case you want something you can refer to | 14:59 |
tkajinam | jangutter, we can discuss how we can realize the automated switch without restart, but I'd prefer leaving it for follow-up. | 14:59 |
jangutter | tkajinam: right, I'm fine with just documenting the behavior (i.e. we expect dependent services to be restarted on etcd upgrade) | 15:00 |
jangutter | (that's a much more reasonable expectation than requiring operators to update etcd + tooz in lockstep. | 15:04 |
tonyb | bauzas: You probably need to remove your -W on 901079 | 15:15 |
bauzas | ack I will | 15:15 |
bauzas | dansmith: before I shutdown for the weekend and before your PTOs, just a quick question about https://review.opendev.org/c/openstack/nova/+/899625 | 15:49 |
bauzas | for non-SRIOV GPUs, shall we cap the total of the inventory by the max_instances or should we rather provide total=0 if the GPU has more than the max ? | 15:50 |
* dansmith waits | 15:50 | |
bauzas | for example, a GPU type supporting 8 vGPUs | 15:51 |
bauzas | and the operator saying max_instances=2 | 15:51 |
bauzas | shall we then accept one GPU and providing a total=2 or should we rather not accept it? | 15:51 |
dansmith | we can't *actually* have total=0 right? | 15:51 |
bauzas | we can now | 15:51 |
bauzas | with https://review.opendev.org/c/openstack/nova/+/899625/4/nova/virt/libvirt/driver.py | 15:52 |
bauzas | I'm just writing the unittests | 15:52 |
dansmith | right but that's just internally, placement won't allow it right? | 15:52 |
bauzas | dansmith: if we would say total=0 then the driver would ask to remove the GPU | 15:52 |
dansmith | right, so just internal notation for us to know what to do when we update placement | 15:53 |
dansmith | I guess I'm not sure what you're asking then | 15:53 |
bauzas | okay, then nevermind | 15:53 |
bauzas | I'll create a test and we'll discuss with gerrit | 15:53 |
dansmith | I was also going to ask on that patch why you continue to have providers above the limit, but with total=0, but I thought maybe that was related to cleanup for people who already have more providers than they should | 15:53 |
dansmith | okay | 15:54 |
bauzas | tbc, sriov gpus only have a total=1 for each of the VFs | 15:54 |
bauzas | so it's simple | 15:54 |
bauzas | either we accept a VF or we don't | 15:54 |
bauzas | but for non-sriov gpus, we can have a capacity for say 8 or 16 | 15:54 |
bauzas | so we create an inventory with a total=16 | 15:55 |
bauzas | now, the question is, what would happen if the operator would add a max_instances value for the type ? | 15:55 |
dansmith | just total=min(actual, max_instances) right? | 15:55 |
bauzas | that's my question | 15:56 |
dansmith | why not? | 15:56 |
bauzas | because for the moment, we have https://review.opendev.org/c/openstack/nova/+/899625/4/nova/virt/libvirt/driver.py#8144 | 15:56 |
bauzas | we calculate per GPU (or VF) | 15:56 |
bauzas | and if the number of possible vGPU for it is above the max_instances value, then we eventually say "heh, no, total=0" | 15:57 |
bauzas | so we don't really cap the GPU, we rather don't accept it | 15:57 |
bauzas | which is fine too | 15:57 |
dansmith | oh I see, you're saying your current patch doesn't add the inventory _at_all_ if they specify max_instances<total instead of just capping it? | 15:57 |
dansmith | I didn't pick up on that when I looked, but.. why would we do that? | 15:58 |
dansmith | isn't the point of this to be able to say "I know it says 16 available, but we can really only allocate 2 of this mdev_type because it takes so much of the card's resource" ? | 15:58 |
dansmith | if you don't allow them to set max_instances below total, we lose the ability to do that, no? | 15:58 |
bauzas | you know what ? I can upload my change | 15:59 |
bauzas | I'm done with the unitteests | 15:59 |
bauzas | I can create a functional test (and I want to) but for the sake of discussing, let me just upload it now | 15:59 |
opendevreview | Sylvain Bauza proposed openstack/nova master: WIP: libvirt: Cap with max_instances GPU types https://review.opendev.org/c/openstack/nova/+/899625 | 16:00 |
dansmith | okay, but .. the above question is what I need answered to really say anything... | 16:01 |
dansmith | but I can put that comment in gerrit if you prefer | 16:02 |
opendevreview | Sylvain Bauza proposed openstack/nova master: WIP: libvirt: Cap with max_instances GPU types https://review.opendev.org/c/openstack/nova/+/899625 | 16:04 |
bauzas | dansmith: I added some comments ^ | 16:04 |
dansmith | okay I'll just drop this comment/question in gerrit and we can iterate from there.. I have a call starting in a bit anyway | 16:04 |
bauzas | dansmith: fwiw, haven't seen your comments from PS4 but we run the periodic just after starting the service | 16:05 |
bauzas | so this works | 16:05 |
bauzas | given operators would modify the option, they would restart the nova-compute service so it will then delete the non-needed inventories then | 16:06 |
dansmith | yeah, but we may have boot requests pending for us when we start, and it means we're technically doing the cleanup on each run right? | 16:07 |
bauzas | dansmith: that's a tradeoff but I see your point | 16:17 |
bauzas | we could force a update_rp() right after init_host in the compute manager before we run RPC | 16:18 |
bauzas | but I think the problem is the same for all things | 16:18 |
bauzas | oh wait | 16:19 |
bauzas | https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L1778 | 16:20 |
bauzas | we're fine then | 16:20 |
bauzas | I just replied | 16:22 |
bauzas | anyway, I'll need to go errand | 16:22 |
bauzas | my patch is up, with a proposition | 16:22 |
bauzas | an alternative could be to cap the GPU inventory instead of refusing it as a whole, but meh, people can review | 16:23 |
bauzas | I'll add functests on Monday when I'm back | 16:23 |
* bauzas jumps off the keyboard | 16:24 | |
dansmith | ack | 16:26 |
-opendevstatus- NOTICE: The Gerrit upgrade is complete, however we have Zuul offline in parallel for a schema migration, so any events occurring during this time will be lost (requiring a recheck or similar to trigger jobs once it returns to service); we'll update again once this is complete. | 16:34 | |
-opendevstatus- NOTICE: Zuul is fully back in service now, but any events occurring prior to 17:05 UTC may need a recheck to trigger jobs. | 17:13 | |
tonyb | bauzas: your revert has merged | 23:36 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!