Friday, 2023-11-17

opendevreview	Rajesh Tailor proposed openstack/nova master: Fix KeyError on assisted snapshot call https://review.opendev.org/c/openstack/nova/+/900783	07:01
tkajinam	https://zuul.opendev.org/t/openstack/builds?job_name=grenade-skip-level-always&project=openstack/nova	07:38
tkajinam	I'm afraid grenade-skip-level-always might be broken. I've not yet looked into it though.	07:39
frickler	ERROR cinder etcd3gw.exceptions.Etcd3Exception: Not Found	07:46
tkajinam	frickler, ah, thanks for that info. I saw the same error in a few run so there may be something wrong with cinder	07:59
tkajinam	it seems cinder is deployed with etcd3gw tooz driver.	07:59
tkajinam	I vaguely guess the issue was caused by the new tooz release and https://review.opendev.org/c/openstack/tooz/+/891355 might be the cause	08:03
tkajinam	hmm. do we really need that job in master ? I see the job test upgrade from zed to 2023.2	08:06
tkajinam	and does not look appropriate for master	08:06
tkajinam	ignore above. I noticed I was looking at a wrong branch of grenade...	08:08
bauzas	tkajinam: looking	09:30
bauzas	yeah this is failing when upgrading cinder	09:31
bauzas	we're on Caracal, so it's upgrading from Antelope	09:32
bauzas	https://github.com/openstack/grenade/commit/159054a9274f984d8e53e5e5bd9e7ab0fb29708b	09:33
bauzas	(C is a SLURP release)	09:33
bauzas	heh, fun, we got a GMR report in https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_73a/900783/2/check/grenade-skip-level-always/73acbef/controller/logs/screen-c-vol.txt	09:36
bauzas	2023-11-17 06:30:28.118 \| ++ /opt/stack/new/grenade/projects/70_cinder/resources.sh:verify:220 : openstack volume show cinder_grenade_vol2 -f shell -c status 2023-11-17 06:30:29.692 \| Internal Server Error (HTTP 500)	09:39
bauzas	yeah, something got wrong on the cinder side	09:39
bauzas	and yup yup, good catch tkajinam	09:40
bauzas	https://paste.opendev.org/show/br9ep5e0MjNReG9A9ta7/	09:40
frickler	bauzas: cf. https://review.opendev.org/c/openstack/devstack/+/901221	10:00
bauzas	ack thanks	10:00
bauzas	I don't exactly get the patch that broke tooz yet	10:01
bauzas	but it was merged on Oct 16	10:02
bauzas	IMHO we should rather revert it	10:02
bauzas	https://review.opendev.org/c/openstack/tooz/+/891355 was merged on Sep 28 so it's not the culprit	10:03
frickler	bauzas: tooz was released yesterday with that patch afaict, added into u-c with https://review.opendev.org/c/openstack/requirements/+/901131	10:09
frickler	maybe that default setting change should have triggered a major version bump. also maybe grenade should be patched to handle the etcd upgrade	10:11
bauzas	frickler: oh right, tooz is a library	10:12
bauzas	don't judge me, it's Friday morning and I'm already tired :)	10:13
bauzas	and yeah, looks to me it's a behavioural change	10:13
frickler	I don't judge anyone on Friday's ;)	10:13
bauzas	"* Note that this probably means a new minor version of tooz, while the behaviour can be worked around in config, this is enough to break existing systems on upgrade."	10:15
bauzas	indeed it breaks	10:15
bauzas	frickler: I'm tempted to require a u-c reverty	10:17
bauzas	elodilles: ^	10:17
bauzas	elodilles: fwiw, our grenade skip level job is broken due to the new tooz releasee https://review.opendev.org/c/openstack/releases/+/899966	10:18
elodilles	reading	10:20
bauzas	I just sent the usual 'hold your rechecks' email to the wolrd	10:24
bauzas	frickler: I feel something is wrong with the tooz patch	10:26
bauzas	if we want to have SLURP upgrades, we somehow need to keep a backwards compatible behaviour for a second	10:26
bauzas	or we need to say 'this is a major breaking change' and only once we no longer support older releases that we want to pull it	10:27
elodilles	bauzas: thanks for the mail o/ (maybe [oslo] team should have been in the subject, too)	10:31
bauzas	yeah	10:31
bauzas	well, oslo is free to deliver major versions of their libraries that break compatibility	10:32
bauzas	but then they need to signal it correctly for the release team appropriately	10:32
bauzas	elodilles: I tagged with [tooz] I get that filter is maybe too much restricted ?	10:33
bauzas	(fwiw, I don't do tag filtering on my mailbox)	10:33
bauzas	(and I hope oslo contributors to not do it as well)	10:33
elodilles	yes, they probably see the mail with [tooz] tag as well	10:36
bauzas	elodilles: https://review.opendev.org/c/openstack/requirements/+/901079	10:40
elodilles	+1	10:42
elodilles	(from release team perspective i'm still happy we are not finding this one week before the final release o:))	10:43
bauzas	all hail to the grenade-skip-level lord	10:44
frickler	and to the "release at rc-1" policy	10:44
frickler	eh milestone-1	10:45
elodilles	yepp \o/	10:45
* bauzas goes outside for sweating needs		10:49
opendevreview	Rajesh Tailor proposed openstack/nova master: trivial doc fix https://review.opendev.org/c/openstack/nova/+/901237	11:33
pas-ha[m]	re tooz/etcd - may I bring your attention to https://review.opendev.org/c/openstack/etcd3gw/+/887662 - this adds version autodiscovery to etcd3gw, once there, we could switch default version in tooz to 'auto', and everyone is happy, new tooz will be able to work with the same 'backend_url' with old and new etcd3	11:34
pas-ha[m]	this allows to upgrade the etcd and tooz-using components in whatever order w/o changing the configuration	11:39
opendevreview	Merged openstack/nova master: Bump jsonschema minimum to 4.0.0 https://review.opendev.org/c/openstack/nova/+/850021	12:57
jangutter	pas-ha[m]: regarding version autodetect: https://lists.openstack.org/pipermail/openstack-discuss/2023-August/034704.html	13:33
jangutter	pas-ha[m]: probing for a version also means you don't error on the case where you forgot to upgrade your etcd.	13:35
-opendevstatus- NOTICE: Gerrit will be unavailable for a short time starting at 15:30 UTC as it is upgraded to the 3.8 release. https://lists.opendev.org/archives/list/service-announce@lists.opendev.org/thread/XT26HFG2FOZL3UHZVLXCCANDZ3TJZM7Q/		14:07
bauzas	lemme get it straight	14:13
bauzas	etcd changed its endpoint by a release	14:14
bauzas	tooz accordingly changed to use the newer endpoint by a specific tooz release	14:14
bauzas	given now, we upgrade tooz but not etcd, tooz talks the new way while it should talk to the other endpoint	14:15
bauzas	amirite, jangutter pas-ha[m], frickler ?	14:15
jangutter	bauzas: correct, it's also possible to override the endpoint away from the default, in cinder config. Both old and new tooz can work with old and new etcd if the endpoint is given explicitly.	14:17
tkajinam	the problem is that tooz used to use v3beta which is hardcoded and users had to override it in case needed	14:18
bauzas	jangutter: was cinder using the default, or was it specifically asking an endpoint?	14:18
tkajinam	the problem here is that in stable/2023.1 we had old tooz using v1beta by default along with old etcd3 with v1beta. now we upgrade only tooz and the new version uses v1 by default.	14:19
tkajinam	and because etcd3 is still old after upgrade v3 api is not available	14:19
jangutter	https://opendev.org/openstack/devstack/commit/3a7a3cd8c5a5ac3f1655d6ff17974f8623fb3330 <--- before this, cinder was using default, after this cinder was asking explicitly for v3	14:20
bauzas	we shouldn't expect operators to upgrade their dependencies when upgrading	14:20
tkajinam	we added api_version=/v3 in coordination url in devstack stable/2023.2, so if you first deploy cinder with stable/2023.1 coordination url does not contain api version	14:20
opendevreview	John Garbutt proposed openstack/nova-specs master: Expose PCI device NUMA using PXB https://review.opendev.org/c/openstack/nova-specs/+/869416	14:20
tkajinam	and the same config value is carried over so tooz continue using the default which has been changed from stable/2023.1 to master	14:21
jangutter	tkajinam: and it's worse, tooz was using v3alpha, not v3beta https://review.opendev.org/c/openstack/tooz/+/891355/6/tooz/drivers/etcd3gw.py	14:21
bauzas	okay, but that's then a devstack config problem	14:21
tkajinam	ah, yeah. I was wrong/ v1alpha	14:21
bauzas	jangutter: why shouldn't we changing devstack now to use the default?	14:22
tkajinam	bauzas, no we can't use default	14:22
bauzas	that's still a breaking change and people would need to change their config when upgrading to Caracal	14:22
jangutter	the only way config stays the same is if etcd and tooz is upgraded in lockstep.	14:23
tkajinam	bauzas, the default in tooz stable/2023.1 is v3alpha, the default in tooz master is v3. Because we keep using old etcd without v3 during upgrade process we have to override api version to v3alpha to keep using v3alpha after upgrade	14:23
jangutter	if etcd and tooz mismatches, the endpoint has to be specified.	14:23
bauzas	shouldn't tooz be somehow a bit smarter and do a failback call if etcd isn't new enough ?	14:24
tkajinam	that's what was proposed above by pas-ha[m]	14:24
opendevreview	John Garbutt proposed openstack/nova-specs master: Add spec for PCI Groups https://review.opendev.org/c/openstack/nova-specs/+/899719	14:25
bauzas	what I'm afraid is that operators doing upgrades would face that problem too, right?	14:25
bauzas	if we start tweaking devstack, there is surely an impact for regular ops that don't deploy with it :)	14:26
jangutter	that's technically down to etcd3gw backend... and etcd doesn't have a nice version discovery, so, it's the try-different-versions method.	14:26
bauzas	this means they have to change some config on the fly during upgrades	14:26
tkajinam	well, we did document the upgrade impact to suggest setting api version in case old api version should be used.	14:26
bauzas	jangutter: I'm personnally fine with a try/catch clause	14:27
tkajinam	but what we are learning here is that just documentation may kill some people.	14:27
tkajinam	including us	14:27
bauzas	oh yeah	14:27
tkajinam	https://review.opendev.org/c/openstack/tooz/+/891355/6/releasenotes/notes/etcd-3.4-eee8300c942a1263.yaml	14:27
bauzas	if you really do breaking upgrades, my personal recommandation is to provide tools that people can consume for checking their states and whether they'll break	14:28
jangutter	yep, the release notes are the last resort :-/	14:28
bauzas	tkajinam: we, in nova, create the nova-status upgrade command for that exact reason	14:28
bauzas	created*	14:28
bauzas	because even if we spoil the breaking changes in our relnotes, people may be impacted and miss the point	14:29
bauzas	anyway, I'm not saying tooz shall do a tooz-status upgrade commande	14:29
bauzas	command*	14:29
tkajinam	the problem may be that tooz is not the right place to provide that config validation, because tooz itself does not register config options but components using it does and pass the url value	14:29
jangutter	The problem is that etcd is sometimes regarded as an _externally_ managed service, and sometimes an _internally_ managed service.	14:29
tkajinam	we can provide the base implementation but every project may need to implement the command using that	14:30
bauzas	I'm just saying that we need some kind of lockstep mechanism that can handle both etcd/tooz upgrades at once	14:30
jangutter	yeah: for kolla-ansible that happens because in one release all the elements (config, tooz, etcd) is in lockstep. But that's not necessarily the same case for source-level projects.	14:31
bauzas	problem is, again, you can't really ask operators to do some OS upgrade while you're upgrading your services	14:32
bauzas	that's why grenade is not bumping the reqs	14:32
jangutter	yeah, _however_ etcd in devstack isn't the os-bundled one.	14:33
bauzas	just checked, we pull tooz-3.2.0	14:35
bauzas	https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_73a/900783/2/check/grenade-skip-level-always/73acbef/controller/logs/old/devstacklog.txt	14:35
bauzas	and we have etcd3gw==2.1.0	14:36
bauzas	oh tonyb +Wd my revert patch https://review.opendev.org/c/openstack/requirements/+/901079	14:36
tonyb	it's clearly broken	14:38
*** blarnath is now known as d34dh0r53		14:38
bauzas	oh heh, https://github.com/openstack/grenade#theory-of-upgrade	14:38
bauzas	tkajinam: jangutter: ^	14:38
jangutter	yeah I saw that.	14:38
tonyb	we should block it in g-r and figure out how to handle things for the longer term	14:38
bauzas	"Any other required changes on upgrade are an exception and must be called out in the release notes." and "Grenade supports per release specific upgrade scripts (from-juno, from-kilo)."	14:39
bauzas	tonyb: I don't disagree	14:39
bauzas	I still think I'd appreciate if tooz may do some fallback mechanism to an older endpoint if the newer endpoint sounds invalid	14:40
tkajinam	https://review.opendev.org/q/topic:etcd-api-discovery	14:40
bauzas	just for the sake of easing people's lifes	14:40
tkajinam	We probably need a bug and link these to the bug, but we can make tooz smart and discover the available api version unless user explicitly requires a specific version	14:41
bauzas	tkajinam: wow, kudos	14:41
tkajinam	which is being implemented in these	14:41
tkajinam	(tooz patch need further work because we first need a new release of etcd3gw...	14:41
bauzas	sounds a good path to me	14:41
bauzas	we may ban the current tooz release and just await for a newer release once everything is fine	14:42
jangutter	tkajinam: one question - if etcd gets updated (v3alpha disappears and v3 appears) - would this case be handled?	14:42
tkajinam	jangutter, yes. with that code etcd checks the etcd server version using version api and use the latest api version available	14:44
tkajinam	jangutter, though you may need to restart servers because the detected api is cached after the first detection.	14:44
tkajinam	s/servers/services/ I mean	14:44
tkajinam	s/etcd checks/tooz checks/	14:45
jangutter	takjinam: so it will appear if cinder "remembers" the version of etcd it first connected to and never try a newer version?	14:45
tkajinam	there is tradeoff. we can let tooz check api version every time but it always requires additional api call	14:46
tkajinam	we may probably implement expire but I'd leave it for now	14:46
tkajinam	jangutter, I probably had to say this first. yes, you are correct	14:46
jangutter	tkajinam: yeah, gracefully redetecting the endpoint on error is also another option, that adds time to report errors.	14:49
jangutter	but this covers 95% of the pain.	14:50
tkajinam	https://bugs.launchpad.net/python-etcd3gw/+bug/2043810	14:58
tkajinam	I've created a bug here... in case you want something you can refer to	14:59
tkajinam	jangutter, we can discuss how we can realize the automated switch without restart, but I'd prefer leaving it for follow-up.	14:59
jangutter	tkajinam: right, I'm fine with just documenting the behavior (i.e. we expect dependent services to be restarted on etcd upgrade)	15:00
jangutter	(that's a much more reasonable expectation than requiring operators to update etcd + tooz in lockstep.	15:04
tonyb	bauzas: You probably need to remove your -W on 901079	15:15
bauzas	ack I will	15:15
bauzas	dansmith: before I shutdown for the weekend and before your PTOs, just a quick question about https://review.opendev.org/c/openstack/nova/+/899625	15:49
bauzas	for non-SRIOV GPUs, shall we cap the total of the inventory by the max_instances or should we rather provide total=0 if the GPU has more than the max ?	15:50
* dansmith waits		15:50
bauzas	for example, a GPU type supporting 8 vGPUs	15:51
bauzas	and the operator saying max_instances=2	15:51
bauzas	shall we then accept one GPU and providing a total=2 or should we rather not accept it?	15:51
dansmith	we can't actually have total=0 right?	15:51
bauzas	we can now	15:51
bauzas	with https://review.opendev.org/c/openstack/nova/+/899625/4/nova/virt/libvirt/driver.py	15:52
bauzas	I'm just writing the unittests	15:52
dansmith	right but that's just internally, placement won't allow it right?	15:52
bauzas	dansmith: if we would say total=0 then the driver would ask to remove the GPU	15:52
dansmith	right, so just internal notation for us to know what to do when we update placement	15:53
dansmith	I guess I'm not sure what you're asking then	15:53
bauzas	okay, then nevermind	15:53
bauzas	I'll create a test and we'll discuss with gerrit	15:53
dansmith	I was also going to ask on that patch why you continue to have providers above the limit, but with total=0, but I thought maybe that was related to cleanup for people who already have more providers than they should	15:53
dansmith	okay	15:54
bauzas	tbc, sriov gpus only have a total=1 for each of the VFs	15:54
bauzas	so it's simple	15:54
bauzas	either we accept a VF or we don't	15:54
bauzas	but for non-sriov gpus, we can have a capacity for say 8 or 16	15:54
bauzas	so we create an inventory with a total=16	15:55
bauzas	now, the question is, what would happen if the operator would add a max_instances value for the type ?	15:55
dansmith	just total=min(actual, max_instances) right?	15:55
bauzas	that's my question	15:56
dansmith	why not?	15:56
bauzas	because for the moment, we have https://review.opendev.org/c/openstack/nova/+/899625/4/nova/virt/libvirt/driver.py#8144	15:56
bauzas	we calculate per GPU (or VF)	15:56
bauzas	and if the number of possible vGPU for it is above the max_instances value, then we eventually say "heh, no, total=0"	15:57
bauzas	so we don't really cap the GPU, we rather don't accept it	15:57
bauzas	which is fine too	15:57
dansmith	oh I see, you're saying your current patch doesn't add the inventory _at_all_ if they specify max_instances<total instead of just capping it?	15:57
dansmith	I didn't pick up on that when I looked, but.. why would we do that?	15:58
dansmith	isn't the point of this to be able to say "I know it says 16 available, but we can really only allocate 2 of this mdev_type because it takes so much of the card's resource" ?	15:58
dansmith	if you don't allow them to set max_instances below total, we lose the ability to do that, no?	15:58
bauzas	you know what ? I can upload my change	15:59
bauzas	I'm done with the unitteests	15:59
bauzas	I can create a functional test (and I want to) but for the sake of discussing, let me just upload it now	15:59
opendevreview	Sylvain Bauza proposed openstack/nova master: WIP: libvirt: Cap with max_instances GPU types https://review.opendev.org/c/openstack/nova/+/899625	16:00
dansmith	okay, but .. the above question is what I need answered to really say anything...	16:01
dansmith	but I can put that comment in gerrit if you prefer	16:02
opendevreview	Sylvain Bauza proposed openstack/nova master: WIP: libvirt: Cap with max_instances GPU types https://review.opendev.org/c/openstack/nova/+/899625	16:04
bauzas	dansmith: I added some comments ^	16:04
dansmith	okay I'll just drop this comment/question in gerrit and we can iterate from there.. I have a call starting in a bit anyway	16:04
bauzas	dansmith: fwiw, haven't seen your comments from PS4 but we run the periodic just after starting the service	16:05
bauzas	so this works	16:05
bauzas	given operators would modify the option, they would restart the nova-compute service so it will then delete the non-needed inventories then	16:06
dansmith	yeah, but we may have boot requests pending for us when we start, and it means we're technically doing the cleanup on each run right?	16:07
bauzas	dansmith: that's a tradeoff but I see your point	16:17
bauzas	we could force a update_rp() right after init_host in the compute manager before we run RPC	16:18
bauzas	but I think the problem is the same for all things	16:18
bauzas	oh wait	16:19
bauzas	https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L1778	16:20
bauzas	we're fine then	16:20
bauzas	I just replied	16:22
bauzas	anyway, I'll need to go errand	16:22
bauzas	my patch is up, with a proposition	16:22
bauzas	an alternative could be to cap the GPU inventory instead of refusing it as a whole, but meh, people can review	16:23
bauzas	I'll add functests on Monday when I'm back	16:23
* bauzas jumps off the keyboard		16:24
dansmith	ack	16:26
-opendevstatus- NOTICE: The Gerrit upgrade is complete, however we have Zuul offline in parallel for a schema migration, so any events occurring during this time will be lost (requiring a recheck or similar to trigger jobs once it returns to service); we'll update again once this is complete.		16:34
-opendevstatus- NOTICE: Zuul is fully back in service now, but any events occurring prior to 17:05 UTC may need a recheck to trigger jobs.		17:13
tonyb	bauzas: your revert has merged	23:36

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!