*** dasm|off is now known as Guest1645 | 04:00 | |
opendevreview | Rajat Dhasmana proposed openstack/python-novaclient master: Add support to rebuild boot volume 2.94 https://review.opendev.org/c/openstack/python-novaclient/+/827163 | 06:24 |
bauzas | good morning Nova | 07:05 |
Uggla | good morning | 07:22 |
bauzas | mmm, I forgot to ask about your preferences for that *yet again* virtual PTG | 07:44 |
bauzas | I mean, how many slots we should book | 07:44 |
bauzas | I'm about to book 4 x 4 slots (Tues-Fri) but I'll cancel some if we agree | 07:44 |
bauzas | just done, see the scheduled tracks https://ptg.opendev.org/ptg.html | 07:49 |
bauzas | I can unbook the slots | 07:49 |
gibi | morning | 07:50 |
gibi | bauzas: 4x4 works for me | 07:50 |
gibi | sean-k-mooney: thanks for testing the user data out. | 07:50 |
bauzas | gibi: we'll see how much we need once we have an agenda | 07:51 |
gibi | bauzas: sure | 07:51 |
gibi | sean-k-mooney, dansmith: in light of the actual regeneration issue I agree to punt user_data and land rebuild BFV first | 07:51 |
sean-k-mooney[m] | I'm locally trying to fix the patch, by the way, as a follow-up patch | 07:52 |
sean-k-mooney[m] | ok, what I was trying won't work, but I know what to do instead, so trying that now | 08:03 |
hkominos | Hi guys. Quick question. When overriding policies for nova, should one redefine the whole policy.json file or can I just define the single policy that I want to override? My current policy file is empty {} | 08:34 |
bauzas | hkominos: no, you should just update what you need* | 08:37 |
gibi | hkominos: you only need to add the policies that you want to override to the file | 08:37 |
gibi | bauzas: you won :) | 08:37 |
bauzas | \o/ | 08:37 |
hkominos | thank you guys! | 08:40 |
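For context on what that looks like in practice: an override file only needs the specific rules being changed, and every rule not listed falls back to nova's in-code default. A minimal sketch (the rule name and value below are illustrative, not a recommendation):

```json
{
    "os_compute_api:os-hypervisors:list": "rule:admin_or_owner"
}
```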
bauzas | gmann: good spot for the WIP status on my election patch | 08:40 |
bauzas | gmann: that's because I flagged the WIP status directly in Gerrit when I created the PS1 in Aug | 08:41 |
bauzas | now the patch is in active state | 08:41 |
sean-k-mooney[m] | gibi: this is such a can of worms | 09:00 |
sean-k-mooney[m] | I think to make this work we need to delete the old config drive first, but that means we need to deal with all the image backends separately since they don't have an rm() function | 09:01 |
bauzas | sean-k-mooney: a tl;dr about the problem? | 09:02 |
sean-k-mooney[m] | I can add one, but I was hoping this would be simple, and then I remembered this should also work for lvm and rbd, so I can't just delete the file | 09:02 |
sean-k-mooney[m] | bauzas: the user data update feature is trying to rebuild the config drive on hard reboot if you updated the user data | 09:03 |
sean-k-mooney[m] | but that fails with a permission denied | 09:03 |
sean-k-mooney[m] | I believe that is because the disk is owned by qemu | 09:03 |
sean-k-mooney[m] | so nova can't update it, and this needs to work for rbd and lvm anyway, so I can't really just assume it's a file on disk and chown it | 09:04 |
sean-k-mooney[m] | the user_data feature is blocking the BFV rebuild feature because of the microversion order | 09:05 |
sean-k-mooney[m] | so I was trying to unblock that, but we are just going to have to change the order | 09:05 |
sean-k-mooney[m] | make BFV rebuild 2.93 and likely punt mutable user_data to AA | 09:06 |
sean-k-mooney[m] | there are other issues with rpc/testing in the user_data patch that likely won't get addressed without an FFE | 09:06 |
bauzas | agreed on flipping the microversions | 09:07 |
bauzas | at least for not blocking bfv rebuild | 09:07 |
gibi | sean-k-mooney[m]: do we store the config drive in lvm / rbd? I thought it is always a file on disk | 09:07 |
gibi | but I agree, if this is complicated then punt it | 09:08 |
bauzas | for userdata, we can't just say "sorry you can't update your data because config drive" as this is a config-driven API behaviour | 09:08 |
bauzas | could we have a separate flag that would say "I allow you to update your userdata" and leave the responsibility to the ops to enable it ? | 09:08 |
kashyap | gibi: I think it's a file on the disk, too, the config-drive | 09:08 |
bauzas | with a caveat saying there could be perm issues with configdrive | 09:08 |
sean-k-mooney[m] | gibi: good question. I believe we do, looking at the code, but I'll check. I know swap gets created on rbd | 09:09 |
sean-k-mooney[m] | and this appears to be using the generic imagebackend code | 09:09 |
sean-k-mooney[m] | but I'll confirm again | 09:09 |
sean-k-mooney[m] | bauzas: I don't think we should hack around this | 09:10 |
sean-k-mooney[m] | the feature is broken currently if you use config drive and I think it's important that it works | 09:10 |
sean-k-mooney[m] | especially since we added a standard trait for this | 09:11 |
bauzas | that's my point | 09:11 |
sean-k-mooney[m] | we could simply not report that for the libvirt driver I guess | 09:11 |
bauzas | in theory, we could only allow updating userdata if config drive isn't in use | 09:11 |
sean-k-mooney[m] | so in this cycle no driver would report it, so you could only use this if you did not have config drive | 09:11 |
sean-k-mooney[m] | bauzas: that works today, I think | 09:12 |
bauzas | but I don't wanna leak a config-driven behaviour | 09:12 |
sean-k-mooney[m] | it would not be config driven behavior | 09:12 |
bauzas | sean-k-mooney: then this is documentation | 09:12 |
sean-k-mooney[m] | we just set the compute capability trait to false for libvirt | 09:12 |
* bauzas needs to look at the patches | 09:12 | |
sean-k-mooney[m] | and then when config drive works we set it to true | 09:12 |
sean-k-mooney[m] | bauzas: either way I'm not sure it's reasonable of us to ask whoami-rajat to wait any longer given the state of the user data patch | 09:13 |
bauzas | agreed again | 09:13 |
bauzas | we should ask to flip the microversions | 09:13 |
sean-k-mooney[m] | so flip the microversion and split the user data patch in two: set the capability trait to false in all drivers in the first one, and then implement the config drive rebuild and RPC change in the second patch | 09:15 |
sean-k-mooney[m] | although that still feels a bit wrong | 09:16 |
bauzas | rpc change becaaaaause ? | 09:16 |
sean-k-mooney[m] | because currently it uses a dirty flag in the instance system metadata to say the config drive needs to be rebuilt | 09:17 |
sean-k-mooney[m] | and dansmith made a very valid point that that's a shadow RPC interface and we should have added it as a parameter to the reboot RPC call | 09:17 |
sean-k-mooney[m] | so that was already requested as a followup | 09:17 |
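For reference, the mechanism being described: nova virt drivers advertise support flags in a `capabilities` dict, and those flags are surfaced as `COMPUTE_*` traits on the compute node resource provider. A minimal sketch, assuming a hypothetical `supports_mutable_user_data` capability name (the name is illustrative, the dict-to-trait mechanism is nova's):

```python
from nova.virt import driver


class LibvirtDriver(driver.ComputeDriver):
    # Merge the base capabilities with the hypothetical flag; keeping it
    # False means no libvirt compute node reports the corresponding
    # COMPUTE_* trait until config drive regeneration actually works.
    capabilities = dict(
        driver.ComputeDriver.capabilities,
        supports_mutable_user_data=False,
    )
```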
gibi | I'm OK to only allow user data update for non config drive instances in Zed. That is a good enough compromise. Also config drive is just half config driven, as it can also be requested via the API | 09:17 |
sean-k-mooney[m] | yes it can, that's how I was testing | 09:18 |
sean-k-mooney[m] | it can be forced via nova.conf but it's user-requestable as you said | 09:19 |
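For concreteness, the operator-forced path is nova's real `force_config_drive` option:

```ini
# nova.conf on the compute node: force a config drive for every instance
[DEFAULT]
force_config_drive = True
```

The user-requested path is `openstack server create --config-drive true ...`, which is why "only when a config drive is present" is not a purely operator-controlled condition.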
gibi | as a side note I found a bug in the allocation candidate filtering in the PCI series. | 09:19 |
gibi | so I think what is realistic there is to land the healing part | 09:20 |
gibi | up until https://review.opendev.org/c/openstack/nova/+/850468/20 | 09:20 |
gibi | stephenfin: Sean is +2 til ^^ so if you have time :) | 09:20 |
sean-k-mooney[m] | yes, that was the partition I thought made sense if we did not land it all | 09:21 |
stephenfin | okay | 09:21 |
sean-k-mooney[m] | that's everything except the scheduler part | 09:21 |
gibi | also in the light of the recent shadow RPC discussion I feel that storing data in InstancePCIRequest.extra_info is also shadowy in my series | 09:21 |
gibi | sean-k-mooney[m]: yepp, that already gives visibility of the PCI resource inventories in placement which is good to have | 09:22 |
sean-k-mooney[m] | it would have been nice to have the split pools by PF change | 09:22 |
gibi | we could land that, that is not affected by the bug | 09:22 |
sean-k-mooney[m] | that would be nice since it gives preference to PFs without VFs when you ask for a PF | 09:23 |
gibi | it does not do any externally visible change | 09:23 |
gibi | I'm not sure it is a real preference or just an ordering change | 09:23 |
gibi | I have to look at whether we sort the pools | 09:23 |
sean-k-mooney[m] | I always thought the way we allocated was more or less deterministic | 09:23 |
sean-k-mooney[m] | at least it has been stable enough for us to use in the functional tests without intermittent failures | 09:24 |
gibi | it is stable | 09:24 |
gibi | but it might be stable not due to sorting | 09:24 |
gibi | but due to simple iteration order of devices / pools | 09:24 |
sean-k-mooney[m] | ack | 09:25 |
bauzas | sean-k-mooney: so I've read https://review.opendev.org/c/openstack/nova/+/816157/15 | 09:25 |
sean-k-mooney[m] | so I guess we can look at that again and confirm if it is of benefit | 09:25 |
bauzas | sean-k-mooney: can you request the patch split and the other microversion? | 09:25 |
gibi | sean-k-mooney[m]: yes, I will look at the ordering | 09:25 |
bauzas | I can leave some comments but I'm way behind | 09:25 |
gibi | I have to drop for an hour now for an early lunch but I will be back after | 09:25 |
sean-k-mooney[m] | bauzas: sure | 09:26 |
bauzas | thanks | 09:26 |
bauzas | I'll quickly review the bfv rebuild series | 09:26 |
bauzas | both you and gibi gave +2s but I'll take a quick glance | 09:26 |
bauzas | and I'll tell whoami-rajat to respin the API patch with 2.93 | 09:27 |
gibi | bauzas: you could look at the viommu one too | 09:27 |
bauzas | gibi: yup, on my list | 09:27 |
sean-k-mooney[m] | https://review.opendev.org/c/openstack/nova/+/816157/15#message-f9bcfb9f6d0115f20223c4cd1e91d5c5428798d7 | 09:32 |
sean-k-mooney[m] | done ^ | 09:32 |
sean-k-mooney[m] | whoami-rajat: dansmith do you have time to update the BFV series to use 2.93 and rebase to master? | 09:32 |
bauzas | thanks | 09:32 |
sean-k-mooney[m] | it's too early for them but they should see it when they get online | 09:33 |
sean-k-mooney[m] | gibi: I think you are right, by the way. there is no preference, it's just the fact that we create the pool with just the PF before we create the pool with the VFs in those tests, so it's just down to PCI address/pool ordering I think. | 09:36 |
sean-k-mooney[m] | we could add a preference but that is out of scope of the spec so don't worry about it | 09:37 |
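A toy illustration of the distinction being made here (illustrative only, not nova's actual pool code): with a first-fit scan, the result is deterministic because pools are visited in creation order, but nothing actively prefers a VF-free PF.

```python
# Pools as they happen to be created in the tests: PF pool first.
pools = [
    {"address": "0000:81:00.0", "dev_type": "type-PF", "count": 1},
    {"address": "0000:81:01.0", "dev_type": "type-PF", "count": 1},
]


def consume_first_fit(pools, requested_type):
    # No sort() here: the outcome is stable purely because iteration
    # order over the list is stable, not because of any preference.
    for pool in pools:
        if pool["dev_type"] == requested_type and pool["count"] > 0:
            pool["count"] -= 1
            return pool
    return None


print(consume_first_fit(pools, "type-PF")["address"])  # 0000:81:00.0
```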
sean-k-mooney[m] | ... it's annoying, but I think I see how to fix configdrive | 09:45 |
sean-k-mooney[m] | for RBD we do in fact import the local config drive into ceph and it will correctly delete and recreate it if required | 09:45 |
sean-k-mooney[m] | if I use a temporary path to generate the new config drive and implement import_file for the other backends then that should fix it | 09:46 |
sean-k-mooney[m] | for flat/qcow that's just an atomic move to the normal location via privsep to work around the permission issue | 09:47 |
sean-k-mooney[m] | for lvm it's not much harder to do but it's a little more work | 09:47 |
sean-k-mooney[m] | so I'm still going to try and do that so that they have a reference patch for how to fix it | 09:48 |
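A rough sketch of that plan, with heavy assumptions: `import_file` mirrors what the RBD backend already does when importing a locally generated config drive, and the helper below is illustrative, not nova's actual imagebackend API.

```python
import os
import tempfile


def rebuild_config_drive(backend, build_iso):
    """Regenerate a config drive without writing the attached disk in place.

    build_iso(path) is assumed to write a fresh config drive ISO to `path`.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        tmp_path = os.path.join(tmpdir, "disk.config")
        build_iso(tmp_path)
        # For flat/qcow this can be an atomic rename into the instance
        # directory (via privsep, to get past the qemu-owned file); for
        # rbd/lvm the backend deletes the old image and imports the new
        # file instead of assuming a file on local disk.
        backend.import_file(tmp_path)
```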
bauzas | sean-k-mooney: gibi: fwiw https://review.opendev.org/c/openstack/nova/+/820368/comment/bd35c4c9_0ad9dd3d/ I'm asking to update https://docs.openstack.org/nova/latest/user/support-matrix.html#operation_rebuild in the API patch | 10:03 |
sean-k-mooney[m] | bauzas: gibi ok i have a working version of the config drive part | 10:11 |
sean-k-mooney[m] | i can push it for review and then work on tests to see what people think | 10:11 |
bauzas | ok | 10:14 |
bauzas | I need to bail out for lunch but I'll be back later | 10:14 |
opendevreview | sean mooney proposed openstack/nova master: support configdrive rebuilding https://review.opendev.org/c/openstack/nova/+/855351 | 10:26 |
whoami-rajat | sean-k-mooney[m], ack, on it | 10:44 |
opendevreview | Rajat Dhasmana proposed openstack/nova master: Add support for volume backed server rebuild https://review.opendev.org/c/openstack/nova/+/820368 | 11:35 |
opendevreview | Rajat Dhasmana proposed openstack/nova master: Add conductor RPC interface for rebuild https://review.opendev.org/c/openstack/nova/+/831219 | 11:35 |
opendevreview | Rajat Dhasmana proposed openstack/nova master: Add API support for rebuilding BFV instances https://review.opendev.org/c/openstack/nova/+/830883 | 11:35 |
whoami-rajat | sean-k-mooney[m], bauzas gibi dansmith ^^ rebased to 2.93 | 11:37 |
bauzas | ++ | 11:37 |
whoami-rajat | once this merges, I will update the novaclient, openstackclient, and tempest changes as well | 11:38 |
whoami-rajat | bauzas, I wasn't sure what your ask was regarding the support matrix change, can you check if this is the change you intended? https://review.opendev.org/c/openstack/nova/+/830883/29..30/doc/source/user/support-matrix.ini | 11:39 |
whoami-rajat | https://review.opendev.org/c/openstack/nova/+/830883/30/doc/source/user/support-matrix.ini | 11:39 |
bauzas | whoami-rajat: well, the status should be 'complete' for the libvirt drivers, right? | 11:45 |
whoami-rajat | bauzas, yeah, there were too many, so wasn't sure, will update for all libvirt drivers | 11:48 |
bauzas | whoami-rajat: say 'unknown' for the ones you don't know | 11:49 |
bauzas | like ppc64, s390x and lxc | 11:49 |
whoami-rajat | to be honest, I'm not sure about different architectures | 11:49 |
whoami-rajat | ack will do | 11:50 |
gibi | I guess on those architectures where the libvirt virt driver supports boot from volume, the rebuild will work too. at least I don't remember any arch-dependent code in the rebuild patches | 11:52 |
sean-k-mooney | the arch should not matter | 11:53 |
sean-k-mooney | the virt types might | 11:53 |
sean-k-mooney | as I expect this to work for qemu and kvm | 11:53 |
sean-k-mooney | we could list them as unknown but I strongly suspect it will work on aarch64 at the very least, or as gibi suggested, anywhere where boot from volume is supported | 11:55 |
sean-k-mooney | whoami-rajat: we support attach and detach volume on all the kvm and qemu combinations | 11:57 |
sean-k-mooney | do you have any reason to believe that it would not work or would depend on the architecture? | 11:58 |
gibi | we don't have a row in the matrix about boot from volume support | 11:59 |
sean-k-mooney | ya, I noticed that | 11:59 |
sean-k-mooney | this could be addressed in a follow-up patch, yes | 11:59 |
gibi | yes | 11:59 |
gibi | the whole matrix thing is just doc | 12:00 |
gibi | that can be done even after FF | 12:00 |
gibi | before RC1 | 12:00 |
gibi | or as a doc bugfix later | 12:00 |
whoami-rajat | sean-k-mooney, I don't think so, if BFV works then this should work too, as we're doing some attach/detach operations and on the cinder side we're doing dd to copy image data to the volume | 12:00 |
whoami-rajat | more or less the same as what we do in BFV | 12:01 |
sean-k-mooney | ack, that's what I thought would be the case | 12:01 |
whoami-rajat | so should I just mark all kvm/qemu drivers as supported? | 12:01 |
sean-k-mooney | I think we could mark them complete unless we get a bug report saying it does not work | 12:01 |
sean-k-mooney | that's what I would do but bauzas might prefer unknown | 12:02 |
whoami-rajat | ack, will update that | 12:02 |
bauzas | I'm not opinionated | 12:02 |
bauzas | and I thought, given whoami-rajat was about to update his API change, it was OK to ask for this in the API change | 12:03 |
bauzas | and not by a FUP | 12:03 |
sean-k-mooney | either is fine | 12:03 |
gibi | I'm OK with the patch as is if we go with a FUP, or I can reapply my +2 if it is respun | 12:03 |
sean-k-mooney | I was waiting for zuul to report before reviewing after the microversion change | 12:03 |
opendevreview | Rajat Dhasmana proposed openstack/nova master: Add API support for rebuilding BFV instances https://review.opendev.org/c/openstack/nova/+/830883 | 12:05 |
whoami-rajat | does this look good? ^ | 12:05 |
sean-k-mooney | yes, I think so | 12:06 |
bauzas | +2d | 12:06 |
sean-k-mooney | libvirt-lxc won't work since attach/detach volume is missing but that's a nit | 12:07 |
sean-k-mooney | unknown is fine | 12:07 |
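For reference, entries in doc/source/user/support-matrix.ini follow this shape; the operation key, notes, and exact statuses below are an illustrative sketch of what was agreed, not the merged text:

```ini
[operation.rebuild-volume-backed]
title=Rebuild volume backed instance
status=optional
notes=Rebuild the root Cinder volume of a volume-backed server.
driver.libvirt-kvm-x86=complete
driver.libvirt-qemu-x86=complete
driver.libvirt-lxc=missing
driver.libvirt-kvm-ppc64=unknown
driver.libvirt-kvm-s390x=unknown
```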
gibi | we need a +2+A on the first patch then the series will land | 12:09 |
sean-k-mooney | doing that now | 12:09 |
sean-k-mooney | was just reviewing it | 12:09 |
sean-k-mooney | whoami-rajat: thanks for fixing the typos and renaming the detach function | 12:10 |
whoami-rajat | sean-k-mooney, np, I had to rebase so was just avoiding follow-ups | 12:10 |
sean-k-mooney | whoami-rajat: and sorry for the microversion hassle. stacking the changes was meant to prevent fighting for the same microversion but in this case it did not help | 12:11 |
opendevreview | Takashi Natsume proposed openstack/nova master: Add missing descriptions in HACKING.rst https://review.opendev.org/c/openstack/nova/+/853054 | 12:12 |
opendevreview | Takashi Natsume proposed openstack/nova master: Add a hacking rule for the setDaemon method https://review.opendev.org/c/openstack/nova/+/854653 | 12:12 |
* sean-k-mooney should really be on a downstream call | 12:12 | |
whoami-rajat | sean-k-mooney, I can understand there were some last-minute decisions to be made, everything is fine as long as the changes get in, so no problem :) | 12:12 |
* sean-k-mooney that is why I have been wearing this headset for 15 mins in total silence | 12:12 | |
opendevreview | Rajat Dhasmana proposed openstack/nova master: Add API support for rebuilding BFV instances https://review.opendev.org/c/openstack/nova/+/830883 | 12:36 |
whoami-rajat | ^ there was a functional test failure due to the method name change in the compute manager, everything else is running fine | 12:37 |
sean-k-mooney | ack, I'll re-review after the downstream call | 12:38 |
sean-k-mooney | if gibi or bauzas don't do it before | 12:39 |
gibi | I'm on it | 12:39 |
whoami-rajat | thanks | 12:40 |
dansmith | gibi: wanna hear something funny? | 13:42 |
*** Guest1645 is now known as dasm | 13:50 | |
gibi | dansmith: sure | 13:51 |
dansmith | gibi: I would have bet my lunch on user_data being in the set of things we don't query out by default and only load if required | 13:52 |
dansmith | I remember extensive discussions about it around icehouse when all this was being done | 13:52 |
dansmith | because of its size | 13:52 |
dansmith | but I think we're actually *always* pulling that out and *always* sending it over the bus right now, which is insane because it's almost never used | 13:52 |
dansmith | *and* in any of the joined queries where we duplicate the instance fields (like why we stopped joining the metadata queries), we'd be duplicating up to 64k of user data in the DB response... | 13:53 |
sean-k-mooney | dansmith: we had a cve related to that in the past | 13:53 |
sean-k-mooney | that should be fixed now | 13:54 |
dansmith | sean-k-mooney: related to what? | 13:54 |
gibi | I don't see any special casing for user_data so probably you are right we are always pulling it out | 13:54 |
sean-k-mooney | pulling GBs of results into a join when you have large userdata | 13:54 |
dansmith | okay, then fixed how? | 13:54 |
dansmith | because AFAICT, we're always pulling it | 13:55 |
sean-k-mooney | https://review.opendev.org/c/openstack/nova/+/758928/ | 13:55 |
dansmith | sean-k-mooney: that's metadata | 13:55 |
dansmith | but you're saying that because we were also loading user_data that was much larger I guess? | 13:56 |
sean-k-mooney | yep, so I think we fixed the one case where we had a really large join as a result | 13:56 |
dansmith | yeah so that patch will maybe avoid some duplication of it, but we don't need to be pulling it out of the DB except when we need it | 13:56 |
sean-k-mooney | yes, that's also fair, we don't | 13:57 |
sean-k-mooney | I guess it would be nice to do as a performance enhancement in general | 13:58 |
gibi | does it also mean that we can save the updated user data on the compute side now? | 13:58 |
sean-k-mooney | but also possibly reducing the chance of another CVE | 13:58 |
dansmith | well, it could seriously cut down on our rabbit load | 13:58 |
sean-k-mooney | where the user data is large yes | 13:58 |
dansmith | gibi: yeah I assume that if that patch works, it works because somewhere in reboot the instance gets saved, probably due to state | 13:59 |
dansmith | gibi: no test coverage of that though that I can see :) | 13:59 |
sean-k-mooney | unfortunately people that use heat tend to also abuse the userdata | 13:59 |
sean-k-mooney | we have had requests in the past to increase the limit even more | 13:59 |
gibi | dansmith: ahh yes, as we don't call save on the API side | 13:59 |
sean-k-mooney | we said no the last few times it came up | 13:59 |
dansmith | sean-k-mooney: yeah, I bet people are bzipping stuff to keep it under the limit :D | 13:59 |
sean-k-mooney | oh, by the way, the limit of MEDIUMTEXT is 16MB not 64KB, but I think we limit to 64KB; now I want to go check | 14:01 |
sean-k-mooney | we use mediumtext in the db schema | 14:01 |
sean-k-mooney | I think we put the limit in the API | 14:01 |
dansmith | we are limiting in the api yeah | 14:01 |
dansmith | at least in this patch | 14:02 |
sean-k-mooney | our API ref says "Configuration information or scripts to use upon launch. Must be Base64 encoded. Restricted to 65535 bytes." | 14:02 |
sean-k-mooney | which is what I expected | 14:03 |
dansmith | btw, I've seen this fail several times in the last day: https://de836787b7e59a5adc13-298f4365cc798f0001a632f171eb41d6.ssl.cf2.rackcdn.com/831219/22/check/nova-multi-cell/9d8aa66/testr_results.html | 14:03 |
dansmith | it complains of a missing host, which the test is unshelving to by name, so I assume that's a test bug or something | 14:04 |
sean-k-mooney | Compute host ubuntu-focal-rax-dfw-0030919238 could not be found. | 14:04 |
sean-k-mooney | ya | 14:04 |
sean-k-mooney | so tempest config error maybe | 14:04 |
dansmith | so anyway, on the user_data thing, I think it would be a good idea for us to en-lazy that by default and only load it in the api and metadata api, and I bet we'll see some rabbit load relaxed | 14:05 |
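A minimal sketch of the lazy-load idea, using illustrative names rather than nova's actual Instance/versioned-object internals: the heavy column stays out of the default query and is fetched only on first access.

```python
class Instance:
    """Illustrative stand-in for an object with one lazily loaded field."""

    _lazy_fields = frozenset(["user_data"])

    def __init__(self, context, row):
        self._context = context
        # `row` comes from a query that deliberately did NOT select
        # user_data, so the blob never rides along on list operations or
        # gets serialized into every RPC message.
        self._fields = dict(row)

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        if name in self._lazy_fields:
            # One extra, targeted round trip on first access, instead of
            # dragging up to 64KiB through every instance fetch.
            value = db_fetch_column(self._context, self._fields["uuid"], name)
            self._fields[name] = value
            return value
        try:
            return self._fields[name]
        except KeyError:
            raise AttributeError(name)


def db_fetch_column(context, uuid, column):
    """Hypothetical helper: SELECT one column for one instance."""
    raise NotImplementedError
```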
gmann | dansmith: sean-k-mooney gibi Uggla I am also seeing nova-multi-cell failing consistently for the new test added for test_unshelve_to_specific_host | 14:06 |
gmann | https://zuul.opendev.org/t/openstack/builds?job_name=nova-multi-cell&skip=0 | 14:06 |
dansmith | ack, I've probably rechecked 8 of that failure in the last 24 hours | 14:06 |
sean-k-mooney | ubuntu-focal-rax-dfw-0030919243 is the compute, ubuntu-focal-rax-dfw-0030919238 is the controller | 14:06 |
gmann | this test was recently merged yesterday and was passing that time | 14:07 |
gibi | it is interesting as the test looks up the other compute by the service list | 14:07 |
dansmith | might be flaky or might not work on some cloud providers if there's a name weirdness? | 14:07 |
sean-k-mooney | dansmith: so the name is correct, at least it's trying to unshelve to the controller | 14:07 |
sean-k-mooney | I wonder if there is FQDN stuff going on | 14:08 |
gibi | ubuntu-focal-rax-dfw-0030919238 doesn't feel weird | 14:08 |
gmann | yeah | 14:08 |
dansmith | gibi: I meant FQDN type things | 14:08 |
dansmith | sean-k-mooney and I are scarred for life on FQDN problems :) | 14:08 |
sean-k-mooney | ubuntu-focal-rax-dfw-0030919238 is what we see for the host value in the compute agent startup | 14:09 |
sean-k-mooney | so maybe hypervisor hostname | 14:09 |
sean-k-mooney | DEBUG nova.compute.resource_tracker [None req-49d84742-3abf-4111-8a83-e2c5e0e67661 None None] Hypervisor/Node resource view: name=ubuntu-focal-rax-dfw-0030919238 | 14:10 |
sean-k-mooney | nope they all seem to line up | 14:10 |
sean-k-mooney | odd | 14:10 |
dansmith | do we do a disabled or service liveness check before we let you unshelve there? maybe the compute is stuck and not updating its counter? | 14:11 |
gibi | when we query the service we check for that it is up and enabled | 14:12 |
gmann | dansmith: you mean just before unshelve or while selecting the host? while selecting the host we do check the service is up and enabled | 14:12 |
gmann | but after that the shelve happens and then the unshelve, and at that time there is no check before unshelve | 14:12 |
dansmith | gmann: yeah I mean whatever check we're doing that gives that error | 14:13 |
dansmith | is there a tempest test that is disabling a compute by chance for testing? | 14:13 |
sean-k-mooney | it's probably worth checking the scheduler logs | 14:13 |
dansmith | oh is this just the scheduler "is this host okay" check? | 14:13 |
gmann | I do not think we have any such test of touching the compute services enable/disable | 14:15 |
gibi | it is not about service state | 14:15 |
gibi | the unshelve fails on compute_node_get_all_by_host | 14:15 |
gibi | so the compute is not in the DB | 14:15 |
gibi | https://github.com/openstack/nova/blob/master/nova/compute/api.py#L4577 | 14:17 |
gibi | this where the unshelve fails | 14:17 |
sean-k-mooney | so it is actually trying to shelve and unshelve multiple times | 14:18 |
sean-k-mooney | it shelve-offloaded and unshelved to the compute | 14:19 |
sean-k-mooney | then it shelved again | 14:19 |
sean-k-mooney | and failed to unshelve to the controller with the not-found issue | 14:19 |
gibi | yes | 14:19 |
gibi | first it unselves back to where it was, then it unshelves to the other compute | 14:20 |
sean-k-mooney | and req-b3e7705d-a20b-42b3-8eb5-594181a096c8 is the request that failed | 14:20 |
sean-k-mooney | which failed in the api, right | 14:20 |
sean-k-mooney | it's not in the scheduler | 14:20 |
gibi | yes | 14:21 |
gmann | yes, it is failing on the 2nd unshelve, to another host | 14:21 |
gibi | it does not fail in the scheduler | 14:21 |
gibi | it fails in the compute api | 14:21 |
gibi | https://de836787b7e59a5adc13-298f4365cc798f0001a632f171eb41d6.ssl.cf2.rackcdn.com/831219/22/check/nova-multi-cell/9d8aa66/controller/logs/screen-n-api.txt | 14:21 |
gibi | https://github.com/openstack/nova/blob/master/nova/compute/api.py#L4577 | 14:21 |
gibi | unshelve checks if the requested host exists | 14:21 |
gibi | before it goes to the conductor | 14:21 |
sean-k-mooney | by calling compute_node_get_all_by_host | 14:21 |
gibi | it calls get_first_node_by_host_for_old_compat | 14:22 |
sean-k-mooney | that's on the main db, i.e. the cell db, right? | 14:22 |
dansmith | so that is looking up by the service, and getting the first compute node on the service | 14:22 |
dansmith | not by_host_and_node | 14:22 |
* dansmith tickles his "robustify compute node hostnames" spec | 14:23 | |
sean-k-mooney | for libvirt there should only be one but ya | 14:23 |
sean-k-mooney | if shelve was a thing for vmware or ironic that would not be ideal | 14:23 |
dansmith | sean-k-mooney: right but this is a very loose association which doesn't have FK constraints | 14:23 |
sean-k-mooney | why is it hitting the cell db? it could look at the host mapping table in the api db, right | 14:24 |
sean-k-mooney | or cell mappings table | 14:24 |
dansmith | eh? | 14:24 |
dansmith | those just tell us which db the host is in | 14:24 |
sean-k-mooney | right, but if all you care about is whether it exists, is that not enough | 14:24 |
dansmith | nothing about whether it's really there or up or whatever | 14:24 |
gibi | sean-k-mooney: you have a point, so the compute-api does a cell DB query without targeting the cell first | 14:24 |
sean-k-mooney | dansmith: I guess it could be down but I'm not sure this is checking that | 14:25 |
dansmith | sean-k-mooney: or deleted | 14:25 |
sean-k-mooney | I would expect that to be checked by the scheduler, to be honest | 14:25 |
dansmith | gibi: these hosts are in the same cell though so if that wasn't working, we wouldn't have found the first one right? | 14:25 |
gibi | this is the multicell job so I assume we have hosts in different cells | 14:26 |
sean-k-mooney | yes there are only two nodes in the job | 14:26 |
sean-k-mooney | so each is in a different cell | 14:26 |
dansmith | multicell meaning superconductor not actually one host per cell right? | 14:26 |
sean-k-mooney | well we can check, but I thought this was 2 cells | 14:27 |
dansmith | or is this really unshelving to different cells? | 14:27 |
dansmith | can we unshelve across cells? | 14:27 |
dansmith | I thought not | 14:27 |
dansmith | so maybe we're targeted at the cell where the instance is and not the cell where the host is that we're unshelving to? | 14:28 |
dansmith | not sure how that would have ever passed | 14:28 |
sean-k-mooney | the compute is not running a second conductor | 14:28 |
gibi | I'm looking at the compute.api.API.unshelve and I think the context is not targeted | 14:28 |
sean-k-mooney | oh, the controller is running both the cell1 and cell2 conductors | 14:29 |
sean-k-mooney | and the super conductor | 14:29 |
sean-k-mooney | https://de836787b7e59a5adc13-298f4365cc798f0001a632f171eb41d6.ssl.cf2.rackcdn.com/831219/22/check/nova-multi-cell/9d8aa66/compute1/logs/etc/nova/nova_cell1_conf.txt | 14:30 |
sean-k-mooney | transport_url = rabbit://stackrabbit:secretrabbit@10.208.192.130:5672/nova_cell2 | 14:30 |
sean-k-mooney | so ya compute1 is nova_cell2 | 14:31 |
sean-k-mooney | and the controller is transport_url = rabbit://stackrabbit:secretrabbit@10.208.192.130:5672/nova_cell1 | 14:31 |
gibi | I don't know why this ever passed the test | 14:31 |
dansmith | I also don't know how unshelve could be using an untargeted context | 14:31 |
dansmith | it should be pointing at cell0 all the time which would never work | 14:31 |
sean-k-mooney | dansmith: did you say this just started failing recently? | 14:32 |
sean-k-mooney | because we merged unshelve to host a while ago | 14:32 |
dansmith | sean-k-mooney: it just recently merged I think gmann said | 14:32 |
sean-k-mooney | the tempest test? | 14:32 |
sean-k-mooney | if so, I guess we don't run the multi-cell job on tempest | 14:33 |
gmann | dansmith: sean-k-mooney gibi that passed in tempest-multinode-full-py3 job and I do not think we checked multi-cell job run for this | 14:33 |
dansmith | the base nova conf doesn't have a db connection pointing at cell0, which I assume is what the apis are using | 14:33 |
gmann | yeah we do not run multi-cell job there and only checked tempest-multinode-full-py3 job passing it | 14:34 |
sean-k-mooney | gmann: ack that makes sense | 14:34 |
gibi | yeah nova-multicell did not run on the tempest patch | 14:34 |
sean-k-mooney | ok, so we have two things: one, we should skip this temporarily on the multi-cell job | 14:34 |
sean-k-mooney | and then fix the bug | 14:34 |
gibi | yes | 14:34 |
dansmith | a normal run should have the api pointing at cell0: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_8cf/831219/22/check/tempest-integrated-compute/8cfb267/controller/logs/etc/nova/nova_conf.txt | 14:34 |
dansmith | but that multicell run has nothing defined there | 14:35 |
sean-k-mooney | we at least should check we are not unshelving across cells for now, but that actually should work | 14:35 |
sean-k-mooney | since we use shelve for | 14:35 |
sean-k-mooney | cross cell resize | 14:35 |
dansmith | sean-k-mooney: it shouldn't work | 14:35 |
* gibi goes and files a bug | 14:35 |
dansmith | because cross-cell migration does other stuff on top of the shelve | 14:35 |
sean-k-mooney | oh, you're right | 14:35 |
gmann | and cross-cell is not in that feature scope, right? which was added in this cycle only. | 14:36 |
gmann | sean-k-mooney: agree to skip it in multi-cell job | 14:36 |
sean-k-mooney | it was not explicitly, no | 14:36 |
sean-k-mooney | so we can just document the limitation I guess | 14:36 |
gmann | but it will be good to add in the documentation/releasenotes | 14:37 |
gmann | yes | 14:37 |
sean-k-mooney | dansmith: other than using multiple port bindings to test if the destination cell can bind the port, how much more does cross-cell migration do over shelve? there is a bunch of stuff in the super conductor to copy the instance object between the cell DBs, right | 14:38 |
dansmith | yeah it moves the thing between DBs, that's the big thing | 14:38 |
sean-k-mooney | well, for now I guess it's fine; the error could be improved but the API won't let you corrupt the DB or anything like that | 14:39 |
dansmith | so this multicell job has two cells and two computes, one compute per cell? it must be disabling live migration and lots of other stuff that would normally require multinode, right? | 14:39 |
sean-k-mooney | ya, it's two nodes (controller and compute) and the nodes are in different cells | 14:40 |
dansmith | so it must disable a bunch of things | 14:40 |
sean-k-mooney | yep | 14:40 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/.zuul.yaml#L514-L571 | 14:41 |
gmann | yeah, live migration is disabled https://github.com/openstack/nova/blob/master/.zuul.yaml#L554 | 14:41 |
sean-k-mooney | that's the definition | 14:41 |
dansmith | okay | 14:42 |
gibi | filed the bug https://bugs.launchpad.net/nova/+bug/1988316 | 14:42 |
gibi | as it blocks the gate I will set it to Critical | 14:43 |
dansmith | also, the api config I was looking at was on the compute, the controller is pointed at cell0 by default | 14:43 |
dansmith | gibi: ++ | 14:43 |
sean-k-mooney | gmann: does the tempest test have a compute_feature_flag or will I just add it to the exclude regex? | 14:43 |
gibi | and I will push the skip | 14:43 |
gibi | to unblock the gate | 14:43 |
sean-k-mooney | oh ok | 14:43 |
sean-k-mooney | ill leave it to you so | 14:43 |
gmann | sean-k-mooney: exclude the test | 14:44 |
gmann | gibi: ok, I was doing it but please go ahead | 14:44 |
gibi | ack, I'm on it | 14:45 |
dansmith | sean-k-mooney and I can get our +2 hammers polished up | 14:45 |
gmann | gibi: we can add the test class name in case more tests are added for this https://github.com/openstack/nova/blob/master/devstack/nova-multi-cell-exclude-list.txt | 14:45 |
sean-k-mooney | ya UnshelveToHostMultiNodesTest | 14:45 |
sean-k-mooney | or tempest.api.compute.admin.test_servers_on_multinodes.UnshelveToHostMultiNodesTest | 14:46 |
sean-k-mooney | if we want to fully qualify it | 14:46 |
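For reference, the skip is just a line in devstack/nova-multi-cell-exclude-list.txt; with the fully qualified name the entry would look like this (comment wording illustrative):

```
# Skipped: unshelve to a specific host does not work across cells (bug 1988316)
tempest.api.compute.admin.test_servers_on_multinodes.UnshelveToHostMultiNodesTest
```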
gibi | dansmith: would the current code fail to find the compute even if it is in the same cell as the instance, as it does not target the cell? | 14:48 |
dansmith | gibi: I think it has to be targeting the cell | 14:48 |
sean-k-mooney | it seemed to work for the host it started on | 14:48 |
gibi | I don't see where it targets it | 14:48 |
dansmith | I know it's not clear but it wouldn't be getting as far as it is if not | 14:48 |
sean-k-mooney | it's presumably targeting the source cell and not finding the dest host | 14:49 |
dansmith | I confirmed it's using cell0 by default, which means it must be getting targeted | 14:49 |
sean-k-mooney | oh ok | 14:50 |
opendevreview | ribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (db) https://review.opendev.org/c/openstack/nova/+/831193 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (objects) https://review.opendev.org/c/openstack/nova/+/839401 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (manila abstraction) https://review.opendev.org/c/openstack/nova/+/831194 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (drivers and compute manager part) https://review.opendev.org/c/openstack/nova/+/833090 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Attach Manila shares via virtiofs (api) https://review.opendev.org/c/openstack/nova/+/836830 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Bump compute version and check shares support https://review.opendev.org/c/openstack/nova/+/850499 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add metadata for shares https://review.opendev.org/c/openstack/nova/+/850500 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add instance.share_attach notification https://review.opendev.org/c/openstack/nova/+/850501 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add instance.share_detach notification https://review.opendev.org/c/openstack/nova/+/851028 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add shares to InstancePayload https://review.opendev.org/c/openstack/nova/+/851029 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add instance.power_on_error notification https://review.opendev.org/c/openstack/nova/+/852084 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add instance.power_off_error notification https://review.opendev.org/c/openstack/nova/+/852278 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add helper methods to attach/detach shares https://review.opendev.org/c/openstack/nova/+/852085 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add libvirt test to ensure metadata are working. https://review.opendev.org/c/openstack/nova/+/852086 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add virt/libvirt error test cases https://review.opendev.org/c/openstack/nova/+/852087 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Add share_info parameter to reboot method for each driver (driver part) https://review.opendev.org/c/openstack/nova/+/854823 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Support rebooting an instance with shares (compute and API part) https://review.opendev.org/c/openstack/nova/+/854824 | 14:52 |
opendevreview | ribaudr proposed openstack/nova master: Change microversion to 2.93 https://review.opendev.org/c/openstack/nova/+/852088 | 14:52 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Skip UnshelveToHostMultiNodesTest in nova-multi-cell https://review.opendev.org/c/openstack/nova/+/855378 | 14:52 |
gibi | gmann, dansmith, sean-k-mooney: here is the skip ^^ | 14:52 |
sean-k-mooney | do we still use that file? | 14:54 |
gibi | dansmith: in the REST API code we load the instance properly from its cell; could that codepath alter the context too? | 14:54 |
sean-k-mooney | oh we do | 14:54 |
dansmith | no, because we're doing scatter | 14:54 |
gmann | yes | 14:54 |
dansmith | so actually | 14:55 |
dansmith | I think gibi might be right and that the feature itself is broken | 14:55 |
dansmith | but I can't explain why it's working on single-cell, since we're targeted to cell0 by default | 14:55 |
dansmith | it looks like we don't target until we hit superconductor | 14:55 |
dansmith | https://github.com/openstack/nova/blob/cbc9b516fb5e6b079b57cc40380e1d730751cd6c/nova/conductor/manager.py#L974-L975 | 14:55 |
dansmith | so I can't explain why this ever works, because the api should have only checked for the host being present in cell0, which should never have any hosts | 14:56 |
gibi | I think I see where we target the context | 14:56 |
gibi | _get_instance_from_cell | 14:56 |
gibi | https://github.com/openstack/nova/blob/3862cfc649d0099971c91d17e6f8800d10712a26/nova/compute/api.py#L2867-L2870 | 14:56 |
gibi | this actually has a side effect to alter the context | 14:56 |
dansmith | ah, yep, okay I missed that | 14:57 |
dansmith | phew :) | 14:57 |
gibi | so unshelve actually works within a cell | 14:57 |
gibi | we need to document that it does not work across cells and it needs to emit a better error | 14:57 |
dansmith | yeah | 14:57 |
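To make the mechanism concrete: `nova.context.target_cell` and the object lookups below are real nova APIs, while the wrapper function itself is an illustrative sketch of the code path just described.

```python
from nova import context as nova_context
from nova import objects


def find_host_in_instance_cell(ctxt, instance_uuid, host):
    # The instance mapping tells us which cell DB to talk to;
    # compute.api's _get_instance_from_cell targets the context this way
    # as a side effect of loading the instance.
    im = objects.InstanceMapping.get_by_instance_uuid(ctxt, instance_uuid)
    with nova_context.target_cell(ctxt, im.cell_mapping) as cctxt:
        # This query only hits the instance's cell DB, so a destination
        # host living in a different cell raises ComputeHostNotFound.
        return objects.ComputeNode.get_first_node_by_host_for_old_compat(
            cctxt, host)
```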
sean-k-mooney | ya, although that's probably a latent bug | 14:58 |
sean-k-mooney | i.e. if we use the old way of targeting an AZ | 14:58 |
sean-k-mooney | and that AZ was in a different cell | 14:58 |
sean-k-mooney | it likely would have failed in the same place | 14:58 |
dansmith | latent bug in terms of raising a useful error you mean? | 14:58 |
sean-k-mooney | yes | 14:58 |
dansmith | yeah | 14:58 |
sean-k-mooney | it's just that now that we have a tempest test that hits it, we are seeing it | 14:58 |
sean-k-mooney | but we probably were not doing cross-AZ unshelve in the multi-cell job | 14:59 |
gibi | I've updated the bug | 15:00 |
dansmith | sean-k-mooney: can you +W the skip so it's headed in after tests? | 15:00 |
dansmith | or gmann | 15:01 |
gibi | Uggla: probably the above discussion about unshelve is interesting to you too ;) | 15:01 |
gibi | Uggla: also https://bugs.launchpad.net/nova/+bug/1988316 :) | 15:01 |
gmann | dansmith: gibi done | 15:03 |
gibi | gmann: thanks | 15:03 |
dansmith | sweet | 15:03 |
gmann | I think it's worth adding the multi-cell job on the tempest side too and checking the new tests for cross-cell things | 15:04 |
gmann | I will do | 15:04 |
gibi | nah then we had the necessary gate block at FF week done | 15:04 |
gibi | could have been worse | 15:06 |
gmann | ? I did not get that | 15:06 |
gmann | you mean we should not add in tempest gate ? | 15:06 |
gibi | nah I'm just joking | 15:06 |
dansmith | I believe gibi is being funny :) | 15:07 |
gmann | :) | 15:07 |
gibi | so we tend to break the gate during FF as we are rushing | 15:07 |
dansmith | but in all seriousness, gmann adding multicell to tempest gate might be a bit heavy and nova-centric | 15:07 |
dansmith | not sure it's really necessary | 15:07 |
dansmith | maybe in experimental or something | 15:07 |
gmann | yeah experimental is good idea | 15:07 |
bauzas | add this in periodic | 15:09 |
bauzas | periodic-weekly | 15:09 |
bauzas | and then I'll check its state every week before the meeting | 15:09 |
sean-k-mooney | dansmith: sorry, was getting a drink; yes I can | 15:09 |
bauzas | better than experimental where not all of us would check it | 15:09 |
sean-k-mooney | oh gmann already did | 15:10 |
gmann | bauzas: with experimental we can check it during new test addition. it runs in the nova gate as voting, so a periodic on the tempest side would not be of much benefit. if anything is broken then the nova gate will catch it and inform tempest | 15:11 |
sean-k-mooney | I'm not sure periodic-weekly helps | 15:11 |
sean-k-mooney | we will notice it on the nova side before it runs | 15:11 |
gmann | yeah | 15:11 |
bauzas | okok | 15:11 |
bauzas | maybe I misunderstood | 15:12 |
dansmith | right, periodic won't really help | 15:12 |
gmann | bauzas: we are talking about adding the multi-cell job to the tempest experimental pipeline so that we can check the new tests (the migration/unshelve ones) in a cross-cell env | 15:12 |
bauzas | aaaaah | 15:13 |
bauzas | OK, then yes | 15:13 |
bauzas | fine with me | 15:13 |
bauzas | I was thinking it was a nova job | 15:13 |
bauzas | lgtm then to be experimental | 15:13 |
gmann | bauzas: with the new change in RBAC (dropped system scope from policy), we do not need this blueprint change anymore. I have added the notes there https://blueprints.launchpad.net/nova/+spec/allow-project-admin-list-hypervisors | 16:29 |
gmann | bauzas: please close/mark it not needed | 16:30 |
gmann | added the same notes in etherpad too for your easy tracking https://etherpad.opendev.org/p/nova-zed-blueprint-status | 16:32 |
sean-k-mooney | gmann i just set it to obsolete | 16:33 |
gmann | sean-k-mooney: thanks | 16:34 |
bauzas | ricolin: just said -1 for https://review.opendev.org/c/openstack/nova/+/830646 but if you don't have time today or tomorrow morning, lemme know | 16:58 |
bauzas | ricolin: we could just need a follow-up patch for my nits | 16:58 |
bauzas | and I'll +2 | 16:58 |
ricolin | bauzas: thanks, I will try to update it ASAP | 17:15 |
whoami-rajat | dansmith, do you have any idea about the nova-multi-cell job breaking constantly, or is it just flaky? | 17:16 |
dansmith | whoami-rajat: there's a fix in the queue | 17:16 |
dansmith | whoami-rajat: no point in rechecking until that lands | 17:16 |
whoami-rajat | ah ok | 17:16 |
whoami-rajat | wasn't aware | 17:16 |
dansmith | it has been in check for hours, not even in the gate yet | 17:16 |
whoami-rajat | doesn't sound good | 17:17 |
dansmith | whoami-rajat: it's in the gate now | 17:30 |
whoami-rajat | great | 17:30 |
whoami-rajat | can i have the link? | 17:30 |
dansmith | it's 855378 | 17:31 |
whoami-rajat | ack | 17:31 |
sean-k-mooney | whoami-rajat: since your change is approved, even if it's not merged by COB tomorrow at M3 it can be rechecked to land it | 17:56 |
sean-k-mooney | so we will keep an eye on it until it's landed | 17:56 |
sean-k-mooney | whoami-rajat: if you have a Depends-On from the novaclient and osc patches against the top patch we can review those in parallel | 17:57 |
whoami-rajat | sean-k-mooney, thanks, I just wanted to update the other project patches after this merged; I guess it's safe to update the clients and tempest in parallel so I will get to that | 17:57 |
whoami-rajat | sean-k-mooney, yep, I will just update them quickly and let you know | 17:58 |
sean-k-mooney | yep, Depends-On will prevent them from merging until the API change lands and then they can just be rechecked at that point | 17:58 |
whoami-rajat | ack, sounds good | 18:00 |
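For anyone following along: a Gerrit Depends-On is just a commit-message footer pointing at the change that must land first, for example (using the nova API change from this thread):

```
Depends-On: https://review.opendev.org/c/openstack/nova/+/830883
```

Zuul then tests the client patches with that change applied and refuses to merge them until it lands, which is why a simple recheck is all that is needed afterwards.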
opendevreview | Rico Lin proposed openstack/nova master: libvirt: Add vIOMMU device to guest https://review.opendev.org/c/openstack/nova/+/830646 | 18:14 |
opendevreview | Rico Lin proposed openstack/nova master: Add traits for viommu model https://review.opendev.org/c/openstack/nova/+/844507 | 18:14 |
sean-k-mooney | ricolin: you still have not addressed my feedback on https://review.opendev.org/c/openstack/nova/+/844507/15 | 18:15 |
sean-k-mooney | the way you are doing the traits reporting is wrong | 18:15 |
opendevreview | Rajat Dhasmana proposed openstack/python-novaclient master: Add support to rebuild boot volume 2.93 https://review.opendev.org/c/openstack/python-novaclient/+/827163 | 18:16 |
whoami-rajat | novaclient: https://review.opendev.org/c/openstack/python-novaclient/+/827163 | 18:18 |
whoami-rajat | osc: https://review.opendev.org/c/openstack/python-openstackclient/+/831014 | 18:18 |
whoami-rajat | sean-k-mooney, ^ | 18:18 |
sean-k-mooney | ack ill go review those now | 18:19 |
whoami-rajat | thanks! | 18:20 |
ricolin | sean-k-mooney: okay, checking the comments again | 18:28 |
dansmith | that skip is about to pop | 19:31 |
dansmith | next ten minutes or so | 19:32 |
sean-k-mooney | whoami-rajat: I looked over the changes; novaclient looks good. minor issue with osc, and does the tempest change really need its own job? I would just add that test to tempest-full if cinder is deployed and nova is new enough | 19:43 |
sean-k-mooney | none of that is blocking the nova change, but that's just my feedback | 19:43 |
sean-k-mooney | I'm going to go have dinner/breakfast | 19:43 |
opendevreview | Merged openstack/nova master: Skip UnshelveToHostMultiNodesTest in nova-multi-cell https://review.opendev.org/c/openstack/nova/+/855378 | 19:50 |
opendevreview | Rico Lin proposed openstack/nova master: libvirt: Add vIOMMU device to guest https://review.opendev.org/c/openstack/nova/+/830646 | 20:18 |
opendevreview | Rico Lin proposed openstack/nova master: Add traits for viommu model https://review.opendev.org/c/openstack/nova/+/844507 | 20:18 |
ricolin | sean-k-mooney: can you check if I made it right this time, thanks ^^^ | 20:18 |
opendevreview | Rico Lin proposed openstack/nova master: libvirt: Add vIOMMU device to guest https://review.opendev.org/c/openstack/nova/+/830646 | 20:50 |
opendevreview | Rico Lin proposed openstack/nova master: Add traits for viommu model https://review.opendev.org/c/openstack/nova/+/844507 | 20:50 |
*** dasm is now known as dasm|off | 21:40 | |
opendevreview | Merged openstack/nova master: Add documentation and releasenotes for RBAC change https://review.opendev.org/c/openstack/nova/+/854882 | 23:27 |
*** efried1 is now known as efried | 23:49 |