opendevreview | Michael Still proposed openstack/nova-specs master: Propose a new API microversion for a spice-direct console. https://review.opendev.org/c/openstack/nova-specs/+/915190 | 01:55 |
---|---|---|
opendevreview | Takashi Kajinami proposed openstack/nova master: libvirt: Launch instances with stateless firmware https://review.opendev.org/c/openstack/nova/+/908890 | 04:02 |
opendevreview | Takashi Kajinami proposed openstack/nova master: libvirt: Launch instances with stateless firmware https://review.opendev.org/c/openstack/nova/+/908890 | 04:06 |
*** bauzas_ is now known as bauzas | 04:39 | |
opendevreview | Takashi Kajinami proposed openstack/nova master: libvirt: Launch instances with stateless firmware https://review.opendev.org/c/openstack/nova/+/908890 | 05:03 |
sahid | o/ morning all | 07:30 |
sahid | I think we are good with this one, CI is green now :-) | 07:31 |
sahid | https://review.opendev.org/c/openstack/nova/+/921665 | 07:31 |
sahid | sean-k-mooney, gibi if you have a moment | 07:31 |
*** bauzas_ is now known as bauzas | 07:42 | |
*** LarsErik1 is now known as LarsErikP | 08:38 | |
*** LarsErik1 is now known as LarsErikP | 09:09 | |
mikal | Hey sean-k-mooney, sorry for the horrendous delay with that SPICE spec. I blame CentOS 7. I've replied to your various comments and done some updates. | 09:14 |
mikal | So whenever you're up for it I'd appreciate your thoughts. | 09:14 |
opendevreview | Takashi Kajinami proposed openstack/nova master: Migrate MEM_ENCRYPTION_CONTEXT from root provider https://review.opendev.org/c/openstack/nova/+/921814 | 09:33 |
opendevreview | Takashi Kajinami proposed openstack/nova master: Migrate MEM_ENCRYPTION_CONTEXT from root provider https://review.opendev.org/c/openstack/nova/+/921814 | 09:41 |
ralonsoh | sean-k-mooney, hello! maybe you'll like to check https://review.opendev.org/c/openstack/oslo-specs/+/922597 | 10:22 |
sean-k-mooney | ralonsoh: nice find ill take a look | 10:23 |
ralonsoh | sean-k-mooney, I talked to Daniel this morning, so I was warned | 10:23 |
opendevreview | Stephen Finucane proposed openstack/nova master: api: Add 'removed' decorator https://review.opendev.org/c/openstack/nova/+/915736 | 11:01 |
opendevreview | Stephen Finucane proposed openstack/nova master: api: Don't do version check if nothing required https://review.opendev.org/c/openstack/nova/+/915737 | 11:01 |
opendevreview | Stephen Finucane proposed openstack/nova master: api: Add remaining missing query parameter schema https://review.opendev.org/c/openstack/nova/+/915738 | 11:01 |
opendevreview | Stephen Finucane proposed openstack/nova master: tests: Ensure all APIs have a request query schema https://review.opendev.org/c/openstack/nova/+/915739 | 11:01 |
opendevreview | Stephen Finucane proposed openstack/nova master: conf: Add '[api] response_validation' option https://review.opendev.org/c/openstack/nova/+/915740 | 11:01 |
opendevreview | Stephen Finucane proposed openstack/nova master: api: Add response body validation helper https://review.opendev.org/c/openstack/nova/+/915741 | 11:01 |
opendevreview | Stephen Finucane proposed openstack/nova master: api: Add response body schemas for server action APIs https://review.opendev.org/c/openstack/nova/+/915742 | 11:01 |
opendevreview | Stephen Finucane proposed openstack/nova master: api: Add response body schemas for remaining server action APIs https://review.opendev.org/c/openstack/nova/+/915743 | 11:01 |
opendevreview | Takashi Kajinami proposed openstack/nova master: libvirt: Launch instances with stateless firmware https://review.opendev.org/c/openstack/nova/+/908890 | 12:15 |
sean-k-mooney | mikal: i did a quick review and responded to some of the comments | 12:18 |
sean-k-mooney | mikal: this is my main feedback https://review.opendev.org/c/openstack/nova-specs/+/915190/comment/2ba5ce88_5e260fbf/ | 12:20 |
LarsErikP | Hi. Anyone running 2023.1, and a server with SapphireRapids-noTSX ? I got this weird behavior: https://paste.opendev.org/show/b8BM3HtnTKlQEqtv6i4I/ | 12:20 |
sean-k-mooney | i would love it if we could make it so we can just return '"url":"spice://${KERBSIDE_HOST}/?token=f9906a48-b71e-4f18-baca-c987da3ebdb3"' | 12:20 |
LarsErikP | not really sure what happens here. It did not act like that when it was running zed... | 12:21 |
sean-k-mooney | LarsErikP: it's likely a combination of 2 things: 1) the qemu version you're using and 2) the way nova is using the older libvirt api to detect compatibility | 12:21 |
sean-k-mooney | we started to use the newer api recommended by the libvirt folks around 2023.1-ish | 12:22 |
LarsErikP | right.. so maybe I need a newer qemu version? | 12:23 |
sean-k-mooney | but sapphire rapids in particular is causing issues with the places we still use the old api | 12:23 |
LarsErikP | qemu-system-x86 1:6.2+dfsg-2ubuntu6.21 | 12:23 |
sean-k-mooney | we have 2 workaround options that can be used to sidestep the issue at the expense of delegating all cpu compatibility checking to libvirt | 12:24 |
sean-k-mooney | https://docs.openstack.org/nova/latest/configuration/config.html#workarounds.skip_cpu_compare_on_dest and https://docs.openstack.org/nova/latest/configuration/config.html#workarounds.skip_cpu_compare_at_startup | 12:24 |
LarsErikP | ah. ok. I'll try those | 12:25 |
sean-k-mooney | setting skip_cpu_compare_on_dest=true means that if the dest cpu is not compatible, instead of failing in pre-live-migrate we will fail when we call migrate, which is much later in the live-migrate workflow | 12:26 |
sean-k-mooney | skip_cpu_compare_at_startup=true means that if you update your config to add/remove flags etc. that make the vms incompatible with a host, or provision new compute nodes with a common config | 12:26 |
sean-k-mooney | instead of getting an error when the compute starts | 12:26 |
sean-k-mooney | you won't get an error until you try to boot a vm | 12:27 |
sean-k-mooney | neither is ideal, but if libvirt/qemu is capable of running the vm on the host, in either case the operation will succeed | 12:27 |
sean-k-mooney | at the expense of disabling the safety net nova is trying to provide | 12:28 |
LarsErikP | I think I can live with this. Thank you so much | 12:28 |
sean-k-mooney | medium term we want to fix it properly in nova | 12:28 |
LarsErikP | exactly | 12:28 |
LarsErikP | so in a future release, I won't need these anymore | 12:28 |
sean-k-mooney | the delta between the apis is that the old api compared the host cpus without taking account of the fact that qemu does not pass all flags to the guest and can emulate some | 12:29 |
sean-k-mooney | the new api in libvirt is meant to take both of those factors into account | 12:29 |
LarsErikP | right | 12:29 |
sean-k-mooney | historically this has not been a problem, but with the microcode changes intel has been making for the vulnerabilities it has started to be | 12:29 |
sean-k-mooney | we have a bug and an attempted fix to track this but it's semi-stalled, mainly because of dev/review time | 12:30 |
sean-k-mooney | https://bugs.launchpad.net/nova/+bug/2039803 | 12:31 |
sean-k-mooney | and i think https://review.opendev.org/c/openstack/nova/+/899185 is the initial attempted fix but it's not complete/correct | 12:31 |
sean-k-mooney | LarsErikP: our downstream products are based on train and wallaby and we have had similar issues with sapphire rapids in both cases. | 12:33 |
sean-k-mooney | in our case it's compounded by the fact our train based product uses a kernel/qemu that predates the release of sapphire rapids silicon | 12:33 |
sean-k-mooney | so there is no SapphireRapids cpu model available in our train based release | 12:34 |
sean-k-mooney | our wallaby based release is new enough to actually know what SapphireRapids is but there we see the issue you hit | 12:34 |
LarsErikP | Understand. For me, this behaviour did not occur before I upgraded from Zed to 2023.1.. | 12:35 |
sean-k-mooney | ya we had a partial fix in 2023.1 for a related issue | 12:35 |
sean-k-mooney | i'll see if i can find that quickly. it's why we added the workarounds | 12:36 |
sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/870794 and https://review.opendev.org/c/openstack/nova/+/869950 | 12:36 |
opendevreview | Takashi Kajinami proposed openstack/nova master: libvirt: Launch instances with stateless firmware https://review.opendev.org/c/openstack/nova/+/908890 | 12:38 |
sean-k-mooney | LarsErikP: it actually looks like you may not have https://review.opendev.org/c/openstack/nova/+/869950 | 12:40 |
sean-k-mooney | based on your stack trace | 12:40 |
sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/869950/9/nova/virt/libvirt/driver.py | 12:40 |
sean-k-mooney | its calling File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 10015, in _compare_cpu | 12:41 |
sean-k-mooney | perhaps not | 12:43 |
LarsErik1 | sean-k-mooney: lol, my irc-vm kernel panicked :P | 12:44 |
LarsErik1 | might have lost a few messages there :P | 12:44 |
sean-k-mooney | no worries, i was speculating that perhaps you did not have the 2 patches we merged in 2023.1 | 12:44 |
LarsErik1 | if that's the case, then UCA does not include the patches... | 12:45 |
*** LarsErik1 is now known as LarsErikP | 12:45 | |
LarsErikP | I think they should be included. UCA ships 27.1.0 | 12:46 |
sean-k-mooney | we have had 27.2.0 and 27.3.0 since then | 12:48 |
sean-k-mooney | it does have https://review.opendev.org/c/openstack/nova/+/870794 | 12:49 |
sean-k-mooney | so you have the workaround option to disable the startup check | 12:49 |
sean-k-mooney | it also has https://review.opendev.org/c/openstack/nova/+/869950 | 12:49 |
sean-k-mooney | so it is using the newer api that is not meant to have this problem, at least for the startup check | 12:50 |
sean-k-mooney | LarsErikP: if you don't mind could you add the output of `virsh capabilities` and `virsh domcapabilities --arch x86_64 --machine q35` to https://bugs.launchpad.net/nova/+bug/2039803 | 12:52 |
sean-k-mooney | that's the data we are parsing so it would help to have data for a system where it actually fails | 12:53 |
sean-k-mooney | LarsErikP: looking at my laptop which is running debian testing i do not see a noTSX variant of SapphireRapids by the way | 12:54 |
sean-k-mooney | im using libvirt 10.0.0 | 12:55 |
sean-k-mooney | so SapphireRapids-noTSX might be a downstream-only cpu model added by Canonical | 12:56 |
sean-k-mooney | if so then it may be a real error and the relevant host might not have that patched version of libvirt? | 12:56 |
LarsErikP | I can add that info to the bug. no worries =) | 13:01 |
LarsErikP | done | 13:04 |
LarsErikP | I have to leave work now... Thanks for your help! | 13:04 |
*** haleyb is now known as haleyb|out | 13:16 | |
opendevreview | Takashi Kajinami proposed openstack/nova master: Migrate MEM_ENCRYPTION_CONTEXT from root provider https://review.opendev.org/c/openstack/nova/+/921814 | 13:48 |
*** bauzas_ is now known as bauzas | 14:22 | |
stephenfin | gmann: Thanks for all the reviews on the OpenAPI series last week :) Much appreciated. I think I have addressed your comments but lmk if I missed anything. I'm hoping sean-k-mooney will get a chance to take a look again this week or next so we can start getting more of that merged | 14:44 |
sean-k-mooney | LarsErikP: your SapphireRapids-noTSX virsh capabilities shows <feature name='tsx-ctrl'/> | 14:57 |
sean-k-mooney | and looking at virsh domain stats SapphireRapids-noTSX is not a supported model | 14:59 |
sean-k-mooney | LarsErikP: this feels like a bug in the SapphireRapids-noTSX definition in your libvirt package in ubuntu | 14:59 |
sean-k-mooney | based on domstats it thinks the closest model is Icelake-Server | 15:00 |
bauzas | doh, spent a while debugging *why* my mtty change was failing | 16:01 |
bauzas | sean-k-mooney: fwiw, I need to enable the nova plugins https://review.opendev.org/c/openstack/nova/+/897708/13/.zuul.yaml#490 :) | 16:01 |
bauzas | call me stupid | 16:02 |
sean-k-mooney | hehe yep that would help | 16:02 |
sean-k-mooney | we enable it in nova-next i think by default | 16:03 |
sean-k-mooney | that is where we wanted to eventually test the mtty stuff | 16:03 |
bauzas | nope, was added by dan in his mtty patch :) | 16:03 |
bauzas | but yeah, given I created another job, I need also to use it :) | 16:03 |
bauzas | the mtty modules were then not created :) | 16:03 |
sean-k-mooney | that at least explains why no rp inventories were created | 16:03 |
bauzas | correct | 16:04 |
sean-k-mooney | did you push an updated review with it enabled | 16:08 |
bauzas | not yet, will do tomorrow morning | 16:09 |
bauzas | just testing it on my noble instance | 16:10 |
bauzas | (restacking by now) | 16:10 |
jangutter | Hi folks, a newbie question here: we've got a high-concurrency environment with a lot of instances booting at the same time. We have custom placement resources (but we also see this with things like VCPU), and we see hypervisors sometimes overcommitted. | 16:32 |
jangutter | I'm assuming that this is because the hypervisor doesn't hard-enforce the placement resources, and that this is because of a collision at scheduling time? | 16:33 |
sean-k-mooney | jangutter: so placement is enforcing atomic allocations | 16:58 |
sean-k-mooney | so the api will not allow an over commit to happen | 16:58 |
sean-k-mooney | so that implies that the allocation ratio or other config options like the reserved count are being modified with instances on the hosts | 16:58 |
sean-k-mooney | the compute agent and scheduler do not need to hard-enforce the placement allocation claims because placement does | 16:59 |
sean-k-mooney | i.e. if two scheduler processes both have an allocation candidate for n vcpu and only n are available, only one of the two scheduler processes will be able to take the allocation candidate and turn it into an allocation | 17:00 |
sean-k-mooney | the other will get a generation conflict and/or an allocation error | 17:00 |
sean-k-mooney | the conversion of an allocation candidate to an allocation happens in the conductor | 17:01 |
sean-k-mooney | before we call the compute. | 17:01 |
sean-k-mooney | so the conductor will just try the next alternate host in the allocation candidate set until it can atomically claim | 17:01 |
sean-k-mooney | by default the scheduler will return 3 alternate hosts | 17:02 |
sean-k-mooney | the primary host and 2 additional hosts | 17:02 |
jangutter | Oho, so something's hinky on our conductor! | 17:03 |
sean-k-mooney | placement allows the allocation ratio or reserved amount to be updated such that the inventory is oversubscribed but does not allow a new allocation to be made in that state | 17:03 |
sean-k-mooney | jangutter: that or you're running placement with an active-active galera and the db is not consistent | 17:03 |
jangutter | Bingo on the second one! | 17:04 |
sean-k-mooney | ... we recently found out that is not safe | 17:04 |
sean-k-mooney | jangutter: oh you're using A/A galera? | 17:04 |
jangutter | indeed! | 17:04 |
sean-k-mooney | ok there is a workaround you can try | 17:04 |
sean-k-mooney | one sec | 17:04 |
sean-k-mooney | https://github.com/openstack-k8s-operators/placement-operator/commit/a5b0bf43b21f495db52eac58c44e52261744605c | 17:05 |
sean-k-mooney | mysql_wsrep_sync_wait = 1 | 17:06 |
sean-k-mooney | jangutter: so we were thinking of deploying with a/a for our new installer | 17:06 |
sean-k-mooney | then we found that most services don't work properly if you do | 17:06 |
sean-k-mooney | mysql_wsrep_sync_wait = 1 | 17:06 |
jangutter | thanks, that makes absolute sense now. | 17:06 |
sean-k-mooney | might work as that forces galera to ensure that it syncs on every read | 17:07 |
sean-k-mooney | or rather reads from the most up to date write set, whatever it's called internally | 17:07 |
jangutter | (we'll have to check our performance stats on that of course, but you can't argue against CAP) | 17:07 |
sean-k-mooney | looking at nova code while we were debugging it i don't think we have the code to properly handle A/A galera with delayed consistency in nova | 17:08 |
sean-k-mooney | ya so even if it's a few seconds of replication delay it makes the db no longer atomic | 17:08 |
sean-k-mooney | and placement, which is designed to rely on that for the allocations, really can't tolerate that | 17:09 |
sean-k-mooney | jangutter: still at workday? | 17:10 |
jangutter | what gave it away :-p | 17:10 |
sean-k-mooney | nothing we just have not chatted in a while | 17:10 |
sean-k-mooney | i was wondering if you were still in dublin and working with them | 17:10 |
jangutter | correct on both counts, also somehow the amount of board games available has grown.... | 17:11 |
sean-k-mooney | does it ever decrease? i thought they just started taking up space in fried houses instead :) | 17:12 |
sean-k-mooney | *friends | 17:12 |
sean-k-mooney | jangutter: in case it's of use, what we ended up doing in our new installer is this https://github.com/openstack-k8s-operators/mariadb-operator/pull/229/files#diff-deca2c41d0839d5052bb1da5ac7ad924f126ef4adcf10fea54d00a1a737f668b | 17:13 |
sean-k-mooney | jangutter: we reverted to active-passive with a script that is invoked by galera when the leader changes | 17:13 |
sean-k-mooney | that updates the endpoint in the kubernetes service to point only to the new leader | 17:13 |
jangutter | oh man, I have PTSD about "leader election" and etcd. | 17:15 |
jangutter | It makes sense, 99% of the time you have a leader, but you want a smooth handover during updates. | 17:16 |
sean-k-mooney | jangutter: yep and its using a feature of galera that allows it to execute a user script when the leader changes | 17:16 |
sean-k-mooney | so we are now just providing a script for it to execute and leveraging the fact that k8s adds service user tokens/config into the pods to allow it to auth and do the service update | 17:17 |
sean-k-mooney | it kind of sucks that galera does not just proxy the connection to the current leader internally but if the hack works... | 17:18 |
jangutter | we used an analogous way to backup the db by selecting a non-active node and transferring the state from it. the scary thing is that a chunk of this is load-bearing bash scripts. | 17:19 |
sean-k-mooney | https://imgs.xkcd.com/comics/dependency.png | 17:19 |
jangutter | 100% true. | 17:19 |
sean-k-mooney | i wish there was an image macro that would just replace the right arrow | 17:20 |
gmann | stephenfin: ack, will do 2nd round of reviews. | 17:31 |
*** bauzas_ is now known as bauzas | 18:07 | |
*** bauzas_ is now known as bauzas | 19:41 |
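
Editor's sketch of the claim behaviour sean-k-mooney describes around 16:58-17:02: placement accepts an allocation only if the provider has not changed since the candidate was computed and the claim still fits, and the conductor walks the alternate hosts until one claim succeeds. This is a toy in-memory model, not nova or placement code. `Provider`, `GenerationConflict`, `NoCapacity` and `claim_first_available` are hypothetical names, and the capacity formula `(total - reserved) * allocation_ratio` is the editor's assumption about how the inventory limit is computed.

```python
from dataclasses import dataclass


class GenerationConflict(Exception):
    """The provider changed after the caller computed its candidate."""


class NoCapacity(Exception):
    """The claim would push usage past the provider's capacity."""


@dataclass
class Provider:
    # Hypothetical stand-in for one resource class on one resource provider.
    name: str
    total: int
    used: int = 0
    reserved: int = 0
    allocation_ratio: float = 1.0
    generation: int = 0

    @property
    def capacity(self) -> int:
        # Assumed capacity formula; see the lead-in.
        return int((self.total - self.reserved) * self.allocation_ratio)

    def claim(self, amount: int, seen_generation: int) -> None:
        # Compare-and-swap on the generation: a stale view is rejected
        # before any usage is recorded, so two claimers cannot both win.
        if seen_generation != self.generation:
            raise GenerationConflict(self.name)
        if self.used + amount > self.capacity:
            raise NoCapacity(self.name)
        self.used += amount
        self.generation += 1


def claim_first_available(hosts, candidates, amount):
    """Try each (host_name, generation_seen) candidate in order.

    Returns the name of the host that accepted the claim, or None if
    every candidate was rejected.
    """
    for name, seen in candidates:
        try:
            hosts[name].claim(amount, seen)
            return name
        except (GenerationConflict, NoCapacity):
            continue  # fall through to the next alternate host
    return None


if __name__ == "__main__":
    hosts = {"a": Provider("a", total=4), "b": Provider("b", total=4)}
    # Two schedulers computed the same candidates from the same snapshot.
    candidates = [("a", 0), ("b", 0)]
    print(claim_first_available(hosts, candidates, 4))
    print(claim_first_available(hosts, candidates, 4))
```

The model only holds when every claimer reads and writes the same consistent state, which is the property the log says is lost when active/active galera nodes lag behind each other.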
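
Editor's sketch of the settings mentioned in the log, rendered as INI text with Python's standard `configparser`. The option names `skip_cpu_compare_on_dest`, `skip_cpu_compare_at_startup` and `mysql_wsrep_sync_wait = 1` come from the conversation, and the `[workarounds]` group comes from the linked nova docs anchors. The `placement_database` section name is an assumption (the log does not say which section holds the galera option), and `render_ini` is a hypothetical helper. The nova and placement settings would live in separate files; they are rendered separately here.

```python
import configparser
import io


def render_ini(sections):
    """Render a {section: {option: value}} mapping as INI text."""
    parser = configparser.ConfigParser()
    parser.read_dict(sections)
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


# nova.conf fragment: both options trade nova's early CPU checks for a
# later failure from libvirt/qemu, as described in the log.
NOVA_WORKAROUNDS = {
    "workarounds": {
        "skip_cpu_compare_on_dest": "true",
        "skip_cpu_compare_at_startup": "true",
    },
}

# placement.conf fragment: the section name is assumed, not stated in the log.
PLACEMENT_GALERA = {
    "placement_database": {
        "mysql_wsrep_sync_wait": "1",
    },
}

if __name__ == "__main__":
    print(render_ini(NOVA_WORKAROUNDS))
    print(render_ini(PLACEMENT_GALERA))
```

As the log notes, forcing a sync on reads has a performance cost, so it is worth measuring before rolling it out.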