Monday, 2024-06-24

opendevreviewMichael Still proposed openstack/nova-specs master: Propose a new API microversion for a spice-direct console.  https://review.opendev.org/c/openstack/nova-specs/+/91519001:55
opendevreviewTakashi Kajinami proposed openstack/nova master: libvirt: Launch instances with stateless firmware  https://review.opendev.org/c/openstack/nova/+/90889004:02
opendevreviewTakashi Kajinami proposed openstack/nova master: libvirt: Launch instances with stateless firmware  https://review.opendev.org/c/openstack/nova/+/90889004:06
*** bauzas_ is now known as bauzas04:39
opendevreviewTakashi Kajinami proposed openstack/nova master: libvirt: Launch instances with stateless firmware  https://review.opendev.org/c/openstack/nova/+/90889005:03
sahido/ morning all07:30
sahidI think we are good with this one, CI is green now :-) 07:31
sahidhttps://review.opendev.org/c/openstack/nova/+/92166507:31
sahidsean-k-mooney, gibi if you have a moment07:31
*** bauzas_ is now known as bauzas07:42
*** LarsErik1 is now known as LarsErikP08:38
*** LarsErik1 is now known as LarsErikP09:09
mikalHey sean-k-mooney, sorry for the horrendous delay with that SPICE spec. I blame CentOS 7. I've replied to your various comments and done some updates.09:14
mikalSo whenever you're up for it I'd appreciate your thoughts.09:14
opendevreviewTakashi Kajinami proposed openstack/nova master: Migrate MEM_ENCRYPTION_CONTEXT from root provider  https://review.opendev.org/c/openstack/nova/+/92181409:33
opendevreviewTakashi Kajinami proposed openstack/nova master: Migrate MEM_ENCRYPTION_CONTEXT from root provider  https://review.opendev.org/c/openstack/nova/+/92181409:41
ralonsohsean-k-mooney, hello! maybe you'd like to check https://review.opendev.org/c/openstack/oslo-specs/+/92259710:22
sean-k-mooneyralonsoh: nice find, i'll take a look10:23
ralonsohsean-k-mooney, I talked to Daniel this morning, so I was warned10:23
opendevreviewStephen Finucane proposed openstack/nova master: api: Add 'removed' decorator  https://review.opendev.org/c/openstack/nova/+/91573611:01
opendevreviewStephen Finucane proposed openstack/nova master: api: Don't do version check if nothing required  https://review.opendev.org/c/openstack/nova/+/91573711:01
opendevreviewStephen Finucane proposed openstack/nova master: api: Add remaining missing query parameter schema  https://review.opendev.org/c/openstack/nova/+/91573811:01
opendevreviewStephen Finucane proposed openstack/nova master: tests: Ensure all APIs have a request query schema  https://review.opendev.org/c/openstack/nova/+/91573911:01
opendevreviewStephen Finucane proposed openstack/nova master: conf: Add '[api] response_validation' option  https://review.opendev.org/c/openstack/nova/+/91574011:01
opendevreviewStephen Finucane proposed openstack/nova master: api: Add response body validation helper  https://review.opendev.org/c/openstack/nova/+/91574111:01
opendevreviewStephen Finucane proposed openstack/nova master: api: Add response body schemas for server action APIs  https://review.opendev.org/c/openstack/nova/+/91574211:01
opendevreviewStephen Finucane proposed openstack/nova master: api: Add response body schemas for remaining server action APIs  https://review.opendev.org/c/openstack/nova/+/91574311:01
opendevreviewTakashi Kajinami proposed openstack/nova master: libvirt: Launch instances with stateless firmware  https://review.opendev.org/c/openstack/nova/+/90889012:15
sean-k-mooneymikal: i did a quick review and responded to some of the comments12:18
sean-k-mooneymikal: this is my main feedback https://review.opendev.org/c/openstack/nova-specs/+/915190/comment/2ba5ce88_5e260fbf/12:20
LarsErikPHi. Anyone running 2023.1, and a server with SapphireRapids-noTSX ? I got this weird behavior: https://paste.opendev.org/show/b8BM3HtnTKlQEqtv6i4I/12:20
sean-k-mooneyi would love if we could make it so we can just return '"url":"spice://${KERBSIDE_HOST}/?token=f9906a48-b71e-4f18-baca-c987da3ebdb3"'12:20
LarsErikPnot really sure what happens here. It did not act like that when it was running zed...12:21
sean-k-mooneyLarsErikP: it's likely a combination of 2 things: 1) the qemu version you're using and 2) the way nova is using the older libvirt api to detect compatibility12:21
sean-k-mooneywe have started to use the newer api recommended by the libvirt folks around 2023.1 ish 12:22
LarsErikPright.. so maybe I need a newer qemu version?12:23
sean-k-mooneybut sapphire rapids in particular is causing issues with the places we still use the old api12:23
LarsErikPqemu-system-x86 1:6.2+dfsg-2ubuntu6.2112:23
sean-k-mooneywe have 2 workaround options that can be used to sidestep the issue at the expense of delegating all cpu compatibility checking to libvirt12:24
sean-k-mooneyhttps://docs.openstack.org/nova/latest/configuration/config.html#workarounds.skip_cpu_compare_on_dest and https://docs.openstack.org/nova/latest/configuration/config.html#workarounds.skip_cpu_compare_at_startup12:24
LarsErikPah. ok. I'll try those12:25
sean-k-mooneysetting skip_cpu_compare_on_dest=true means that if the dest cpu is not compatible, instead of failing in pre-live-migrate we will fail when we call migrate, which is much later in the live-migrate workflow12:26
sean-k-mooneyskip_cpu_compare_at_startup=true means that if you update your config to add/remove flags etc. that make the vms incompatible with a host, or provision new compute nodes with a common config12:26
sean-k-mooneyinstead of getting an error when the compute starts12:26
sean-k-mooneyyou won't get an error until you try to boot a vm12:27
sean-k-mooneyneither is ideal but if libvirt/qemu is capable of running the vm on the host, in either case the operation will succeed12:27
sean-k-mooneyat the expense of disabling the safety net nova is trying to provide12:28
LarsErikPI think I can live with this. Thank you so much12:28
sean-k-mooneymedium term we want to fix it properly in nova12:28
LarsErikPexactly12:28
LarsErikPso in a future release, I won't need these anymore12:28
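For reference, the two workaround options discussed above live in the `[workarounds]` section of nova.conf on the compute nodes. The snippet below is only a minimal sketch that renders such a fragment with Python's standard `configparser`; the helper name `render_workarounds` is made up for illustration, and whether both options are appropriate for a given deployment is a judgment call, since they disable nova's own CPU compatibility checks.

```python
import configparser
import io


def render_workarounds(on_dest: bool = True, at_startup: bool = True) -> str:
    """Render a nova.conf [workarounds] fragment (hypothetical helper)."""
    cfg = configparser.ConfigParser()
    cfg["workarounds"] = {
        # Defer the destination CPU check to libvirt during live migration.
        "skip_cpu_compare_on_dest": str(on_dest).lower(),
        # Skip the CPU comparison nova-compute performs when it starts.
        "skip_cpu_compare_at_startup": str(at_startup).lower(),
    }
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    # Prints a fragment that could be merged into nova.conf by hand.
    print(render_workarounds())
```

The nova-compute service would need a restart to pick up the change.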
sean-k-mooneythe delta between the apis is that the old api compared the host cpus without taking account of the fact that qemu does not pass all flags to the guest and can emulate some12:29
sean-k-mooneythe new api in libvirt is meant to take both of those factors into account12:29
LarsErikPright12:29
sean-k-mooneyhistorically this has not been a problem but with the microcode changes intel has been making for the vulnerabilities it has started to be12:29
sean-k-mooneywe have a bug and attempted fix to track this but it's semi stalled, mainly because of dev/review time12:30
sean-k-mooneyhttps://bugs.launchpad.net/nova/+bug/203980312:31
sean-k-mooneyand i think https://review.opendev.org/c/openstack/nova/+/899185 is the initial attempted fix but it's not complete/correct12:31
sean-k-mooneyLarsErikP: our downstream products are based on train and wallaby and we have had similar issues with sapphire rapids in both cases.12:33
sean-k-mooneyin our case it's compounded by the fact our train based product uses a kernel/qemu that predates the release of sapphire rapids silicon12:33
sean-k-mooneyso there is no SapphireRapids cpu model available in our train based release12:34
sean-k-mooneyour wallaby based release is new enough to actually know what SapphireRapids is but there we see the issue you hit12:34
LarsErikPUnderstand. For me, this behaviour did not occur before I upgraded from Zed to 2023.1..12:35
sean-k-mooneyya we had a partial fix in 2023.1 for a related issue12:35
sean-k-mooneyi'll see if i can find that quickly. it's why we added the workarounds12:36
sean-k-mooneyhttps://review.opendev.org/c/openstack/nova/+/870794 and https://review.opendev.org/c/openstack/nova/+/86995012:36
opendevreviewTakashi Kajinami proposed openstack/nova master: libvirt: Launch instances with stateless firmware  https://review.opendev.org/c/openstack/nova/+/90889012:38
sean-k-mooneyLarsErikP: it actually looks like you may not have https://review.opendev.org/c/openstack/nova/+/86995012:40
sean-k-mooneybased on your stack trace12:40
sean-k-mooneyhttps://review.opendev.org/c/openstack/nova/+/869950/9/nova/virt/libvirt/driver.py12:40
sean-k-mooneyits calling File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 10015, in _compare_cpu12:41
sean-k-mooneyperhaps not12:43
LarsErik1sean-k-mooney: lol, my irc-vm kernel panicked :P12:44
LarsErik1might have lost a few messages there :P12:44
sean-k-mooneyno worries, i was speculating that perhaps you did not have the 2 patches we merged in 2023.112:44
LarsErik1if that's the case, then UCA does not include the patches...12:45
*** LarsErik1 is now known as LarsErikP12:45
LarsErikPI think they should be included. UCA ships 27.1.012:46
sean-k-mooneywe have had 27.2.0 and 27.3.0 since then12:48
sean-k-mooneyit does have https://review.opendev.org/c/openstack/nova/+/87079412:49
sean-k-mooneyso you have the workaround option to disable the startup check12:49
sean-k-mooneyit also has https://review.opendev.org/c/openstack/nova/+/86995012:49
sean-k-mooneyso it is using the newer api that is not meant to have this problem, at least for the startup check12:50
sean-k-mooneyLarsErikP: if you don't mind could you add the output of `virsh capabilities` and `virsh domcapabilities --arch x86_64 --machine q35` to https://bugs.launchpad.net/nova/+bug/203980312:52
sean-k-mooneythat's the data we are parsing so it would help to have data for a system where it actually fails12:53
sean-k-mooneyLarsErikP: looking at my laptop which is running debian testing i do not see a noTSX variant of SapphireRapids by the way12:54
sean-k-mooneyim using libvirt 10.0.012:55
sean-k-mooneyso SapphireRapids-noTSX might be a downstream only cpu model added by canonical12:56
sean-k-mooneyif so then it may be a real error and the relevant host might not have that patched version of libvirt?12:56
LarsErikPI can add that info to the bug. no worries =)13:01
LarsErikPdone13:04
LarsErikPI have to leave work now... Thanks for your help!13:04
*** haleyb is now known as haleyb|out13:16
opendevreviewTakashi Kajinami proposed openstack/nova master: Migrate MEM_ENCRYPTION_CONTEXT from root provider  https://review.opendev.org/c/openstack/nova/+/92181413:48
*** bauzas_ is now known as bauzas14:22
stephenfingmann: Thanks for all the reviews on the OpenAPI series last week :) Much appreciated. I think I have addressed your comments but lmk if I missed anything. I'm hoping sean-k-mooney will get a chance to take a look again this week or next so we can start getting more of that merged14:44
sean-k-mooneyLarsErikP: your SapphireRapids-noTSX virsh capabilities shows <feature name='tsx-ctrl'/>14:57
sean-k-mooneyand looking at virsh domain stats SapphireRapids-noTSX is not a supported model14:59
sean-k-mooneyLarsErikP: this feels like a bug in the SapphireRapids-noTSX definition in your libvirt package in ubuntu14:59
sean-k-mooneybased on domstats it thinks the closest model is Icelake-Server15:00
bauzasdoh, spent a while for debugging *why* my mtty change was failing16:01
bauzassean-k-mooney: fwiw, I need to enable the nova plugins https://review.opendev.org/c/openstack/nova/+/897708/13/.zuul.yaml#490 :)16:01
bauzascall me stupid16:02
sean-k-mooneyhehe yep that would help16:02
sean-k-mooneywe enable it in nova-next i think by default16:03
sean-k-mooneythat is where we wanted to eventually test the mtty stuff16:03
bauzasnope, was added by dan in its mtty patch :)16:03
bauzasbut yeah, given I created another job, I need also to use it :)16:03
bauzasthe mtty modules were then not created :)16:03
sean-k-mooneythat at least explains why no rp inventories were created16:03
bauzascorrect16:04
sean-k-mooneydid you push an updated review with it enabled16:08
bauzasnot yet, will do tomorrow morning16:09
bauzasjust testing it on my noble instance16:10
bauzas(restacking by now)16:10
jangutterHi folks, a newbie question here: we've got a high-concurrency environment with a lot of instances booting at the same time. We have custom placement resources (but we also see this with things like VCPU), and we see hypervisors sometimes overcommitted.16:32
jangutterI'm assuming that this is because the hypervisor doesn't hard-enforce the placement resources, and that this is because of a collision at scheduling time?16:33
sean-k-mooneyjangutter: so placement is enforcing atomic allocations16:58
sean-k-mooneyso the api will not allow an over commit to happen16:58
sean-k-mooneyso that implies that the allocation ratio or other config options like reserved count are being modified with instances on the hosts16:58
sean-k-mooneythe compute agent and scheduler do not need to hard enforce the placement allocation claims because placement does16:59
sean-k-mooneyi.e. if two scheduler processes both have an allocation candidate for n vcpu and only n are available, only one of the two scheduler processes will be able to take the allocation candidate and turn it into an allocation17:00
sean-k-mooneythe other will get a generation conflict and/or an allocation error17:00
sean-k-mooneythe conversion of an allocation candidate to an allocation happens in the conductor17:01
sean-k-mooneybefore we call the compute.17:01
sean-k-mooneyso the conductor will just try the next alternate host in the allocation candidate set until it can atomically claim17:01
sean-k-mooneyby default the scheduler will return 3 alternate hosts17:02
sean-k-mooneythe primary host and 2 additional hosts17:02
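The claim-and-retry flow described above can be sketched with a toy in-memory model. This is not placement's or nova's actual code: the `Provider` class, `ClaimFailed`, and `claim_with_alternates` are hypothetical names, the generation check is simplified to one counter per provider, and capacity is assumed to be `(total - reserved) * allocation_ratio`.

```python
from dataclasses import dataclass


class ClaimFailed(Exception):
    """Raised when a claim loses a race or exceeds capacity (toy model)."""


@dataclass
class Provider:
    name: str
    total: int
    reserved: int = 0
    allocation_ratio: float = 1.0
    used: int = 0
    generation: int = 0

    @property
    def capacity(self) -> int:
        return int((self.total - self.reserved) * self.allocation_ratio)

    def claim(self, amount: int, seen_generation: int) -> None:
        # Reject claims based on a view that is older than the current state.
        if seen_generation != self.generation:
            raise ClaimFailed(f"{self.name}: generation conflict")
        # Reject claims that would push usage past capacity.
        if self.used + amount > self.capacity:
            raise ClaimFailed(f"{self.name}: not enough capacity")
        self.used += amount
        self.generation += 1


def claim_with_alternates(candidates, amount):
    """Try each (provider, seen_generation) pair in order; return the winner."""
    for provider, seen_generation in candidates:
        try:
            provider.claim(amount, seen_generation)
            return provider.name
        except ClaimFailed:
            continue  # fall through to the next alternate host
    return None


if __name__ == "__main__":
    h1, h2 = Provider("host1", total=4), Provider("host2", total=4)
    # Two requests take their snapshot before either one claims.
    view_a = [(h1, h1.generation), (h2, h2.generation)]
    view_b = [(h1, h1.generation), (h2, h2.generation)]
    print(claim_with_alternates(view_a, 4))
    print(claim_with_alternates(view_b, 4))
```

In this model the second request loses the race for host1 and lands on its alternate, so neither host ends up overcommitted. The guarantee only holds if every claim is checked against one consistent copy of the state.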
jangutterOho, so something's hinky on our conductor!17:03
sean-k-mooneyplacement allows the allocation ratio or reserved amount to be updated such that the inventory is oversubscribed but does not allow a new allocation to be made in that state17:03
sean-k-mooneyjangutter: that or you're running placement with an active active galera and the db is not consistent17:03
jangutterBingo on the second one!17:04
sean-k-mooney... we recently found out that is not safe17:04
sean-k-mooneyjangutter: oh you're using A/A galera?17:04
jangutterindeed!17:04
sean-k-mooneyok there is a workaround you can try17:04
sean-k-mooneyone sec17:04
sean-k-mooneyhttps://github.com/openstack-k8s-operators/placement-operator/commit/a5b0bf43b21f495db52eac58c44e52261744605c17:05
sean-k-mooneymysql_wsrep_sync_wait = 117:06
sean-k-mooneyjangutter: so we were thinking of deploying with a/a for our new installer17:06
sean-k-mooneythen we found that most services don't work properly if you do17:06
sean-k-mooneymysql_wsrep_sync_wait = 117:06
jangutterthanks, that makes absolute sense now.17:06
sean-k-mooneymight work as that forces galera to ensure that it sync on every read17:07
sean-k-mooneyor rather read from the most up to date write set, whatever it's called internally17:07
jangutter(we'll have to check our performance stats on that of course, but you can't argue against CAP)17:07
sean-k-mooneylooking at nova code while we were debugging it i don't think we have the code to properly handle A/A galera with delayed consistency in nova17:08
sean-k-mooneyya so even if it's a few seconds of replication delay it makes the db no longer atomic17:08
sean-k-mooneyand placement, which is designed to rely on that for the allocations, really can't tolerate that17:09
sean-k-mooneyjangutter: still at workday?17:10
jangutterwhat gave it away :-p 17:10
sean-k-mooneynothing we just have not chatted in a while17:10
sean-k-mooneyi was wondering if you were still in dublin and working with them17:10
janguttercorrect on both counts, also somehow the amount of board games available has grown....17:11
sean-k-mooneydoes it ever decrease? i thought they just started taking up space in fried houses instead :)17:12
sean-k-mooney*friends17:12
sean-k-mooneyjangutter: in case it's of use what we ended up doing in our new installer is this https://github.com/openstack-k8s-operators/mariadb-operator/pull/229/files#diff-deca2c41d0839d5052bb1da5ac7ad924f126ef4adcf10fea54d00a1a737f668b17:13
sean-k-mooneyjangutter: we reverted to active passive with a script that is invoked by galera when the leader changes17:13
sean-k-mooneythat updates the endpoint in the kubernetes service to point only to the new leader17:13
jangutteroh man, I have PTSD about "leader election" and etcd.17:15
jangutterIt makes sense, 99% of the time you have a leader, but you want a smooth handover during updates.17:16
sean-k-mooneyjangutter: yep and its using a feature of galera that allows it to execute a user script when the leader changes17:16
sean-k-mooneyso we are now just providing a script for it to execute and leveraging the fact that k8s adds service user tokens/config into the pods to allow it to auth and do the service update17:17
sean-k-mooneyit kind of sucks that galera does not just proxy the connection to the current leader internally but if the hack works...17:18
jangutterwe used an analogous way to backup the db by selecting a non-active node and transferring the state from it. the scary thing is that a chunk of this is load-bearing bash scripts.17:19
sean-k-mooneyhttps://imgs.xkcd.com/comics/dependency.png17:19
jangutter100% true.17:19
sean-k-mooneyi wish there was an image macro that would just replace the right arrow17:20
gmannstephenfin: ack, will do 2nd round of reviews.17:31
*** bauzas_ is now known as bauzas18:07
*** bauzas_ is now known as bauzas19:41
