brinzhang | bauzas: I agree with gibi and sean-k-mooney, I have no difficulty discussing it, thanks | 00:29 |
---|---|---|
*** ganso_ is now known as ganso | 02:29 | |
*** viks___ is now known as viks__ | 02:31 | |
gibi | morning | 07:25 |
bauzas | good morning | 08:00 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Add a WA flag waiting for vif-plugged event during reboot https://review.opendev.org/c/openstack/nova/+/813419 | 08:09 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Add a WA flag waiting for vif-plugged event during reboot https://review.opendev.org/c/openstack/nova/+/813419 | 08:13 |
opendevreview | Balazs Gibizer proposed openstack/nova stable/pike: Add a WA flag waiting for vif-plugged event during reboot https://review.opendev.org/c/openstack/nova/+/813437 | 08:27 |
-opendevstatus- NOTICE: zuul needed to be restarted, queues were lost, you may need to recheck your changes | 08:47 | |
stephenfin | bauzas: Morning o/ Care to look at https://review.opendev.org/c/openstack/nova/+/814547 | 08:56 |
bauzas | stephenfin: ok thanks for finding it ! | 08:56 |
bauzas | stephenfin: just a thought, you're not explaining in https://review.opendev.org/c/openstack/nova/+/814547/1//COMMIT_MSG why we now have a regression | 08:58 |
stephenfin | oh, sorry, the regression is because I removed I6ce930fa86c82da1008089791942b1fff7d04c18 | 08:59 |
stephenfin | I mention that at the end of the commit message. It's kind of implicit though, admittedly | 08:59 |
stephenfin | I thought I'd fixed the issue that made I6ce930fa86c82da1008089791942b1fff7d04c18 necessary. Evidently not :( | 09:00 |
opendevreview | Rajat Dhasmana proposed openstack/nova-specs master: Add spec for volume backed server rebuild https://review.opendev.org/c/openstack/nova-specs/+/809621 | 09:00 |
bauzas | stephenfin: ok, then I'll leave a comment telling it and then I'll approve | 09:10 |
bauzas | done. | 09:12 |
stephenfin | ty | 09:36 |
opendevreview | Merged openstack/nova master: db: Increase timeout for migration tests https://review.opendev.org/c/openstack/nova/+/814547 | 09:40 |
opendevreview | Wenping Song proposed openstack/nova master: Support concurrently add hosts to aggregates https://review.opendev.org/c/openstack/nova/+/815105 | 10:04 |
gibi | sean-k-mooney[m]: do I understand correctly that neutron's sriov-nic-agent only sends plugtime plug/unplug events for vnic_type=direct ports but not for vnic_type=direct-physical ports | 10:16 |
gibi | ? | 10:16 |
gibi | sean-k-mooney[m]: https://github.com/openstack/neutron/blob/6d8e830859cd4ac9708701b8e344fdc68cbcaebb/neutron/plugins/ml2/drivers/mech_sriov/mech_driver/mech_driver.py#L135-L137 | 10:18 |
sean-k-mooney[m] | hmm, that is a good question. i guess that would be the case, yes, since for PFs the agent does not configure anything; anything it did would be undone when we detach the device from the host kernel and attach it to the guest | 10:20 |
sean-k-mooney[m] | i have never actually checked its behavior in that regard | 10:21 |
gibi | in my local env I see plug/unplug events during nova hard reboot for VF ports but not for PF ports so probably this is the case | 10:22 |
sean-k-mooney[m] | yes, so you might need to make an exception in your workaround patch | 10:23 |
sean-k-mooney[m] | perhaps change it from a boolean to a list of vnic_types | 10:23 |
sean-k-mooney[m] | odl only supports vnic_type normal and vhost_user | 10:24 |
sean-k-mooney[m] | well vhost-user | 10:24 |
gibi | I think the doc in the patch is still correct when we say to set the flag only for ml2/ovs or networking-odl | 10:24 |
gibi | I might extend that with mech_sriov + vnic_type direct | 10:25 |
sean-k-mooney[m] | right but if you filter by vnic type you can use it when you have odl and sriov on the same host | 10:25 |
gibi | yeah, I can ignore direct-physical ports when waiting for plug | 10:26 |
sean-k-mooney[m] | ya i guess that also works | 10:26 |
gibi | sean-k-mooney[m]: what would be your way to filter? | 10:26 |
sean-k-mooney[m] | if we make the config option a list of vnic_types to wait for on hard reboot we just do | 10:28 |
sean-k-mooney[m] | if vif.vnic_type in CONF.wait_on_reboot: … | 10:28 |
gibi | hm yeah that is also a way | 10:29 |
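The filter sean-k-mooney sketches above could look roughly like the following. This is a hedged illustration only: the option name `wait_on_reboot` is the one used in the chat, while the helper name and the dict-shaped vif objects are assumptions, not nova's actual data model.

```python
# Hypothetical sketch of the opt-in allow-list filter discussed above.
# "wait_on_reboot" is the config name from the chat; everything else
# (function name, vif representation) is illustrative.

DEFAULT_WAIT_VNIC_TYPES = []  # empty by default, so the workaround stays off


def vifs_to_wait_for(vifs, wait_vnic_types=None):
    """Return the VIFs whose plug-time events should be awaited on reboot."""
    allowed = set(
        wait_vnic_types if wait_vnic_types is not None
        else DEFAULT_WAIT_VNIC_TYPES
    )
    return [vif for vif in vifs if vif["vnic_type"] in allowed]
```

With an operator setting of, say, `["normal", "direct"]`, a PF (`direct-physical`) port is simply skipped, which is what makes ODL and SR-IOV workable on the same host as suggested above.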
sean-k-mooney[m] | im not sure if we need to have different behavior for other vnic types like the cyborg ones | 10:29 |
sean-k-mooney[m] | or baremetal, though that is used only by ironic | 10:30 |
sean-k-mooney[m] | i would have to look at the spec again, but when we are using cyborg provided smart nics neutron still sends the events right? | 10:31 |
gibi | hm, neutron seems to support accelerator-direct with mech_sriov, and direct means a VF, so I assume there are plug time events | 10:35 |
sean-k-mooney[m] | i would assume so too but it's not actually mentioned in https://specs.openstack.org/openstack/nova-specs/specs/xena/implemented/sriov-smartnic-support.html | 10:35 |
sean-k-mooney[m] | gibi: so you could either limit this to the cases we know work (normal, direct, vhost-user) or you could filter out the cases we know won't work (direct-physical, accelerator-direct-physical) | 10:38 |
gibi | yeah | 10:38 |
gibi | as the config today needs to be opt-in | 10:38 |
gibi | I guess opting in to supported vnic types is better | 10:38 |
gibi | when you say vhost-user why is not enough to simply filter for vnic_type direct? | 10:39 |
sean-k-mooney[m] | in that case the list would be normal, direct, macvtap, accelerator-direct, vhost-user | 10:40 |
sean-k-mooney[m] | well, direct should work, right | 10:40 |
sean-k-mooney[m] | hardware offloaded ovs with ml2/ovs supports direct and will send plug time events, and the sriov nic agent should also | 10:40 |
sean-k-mooney[m] | and vhost-user should also work with ml2/ovs and ml2/odl | 10:41 |
sean-k-mooney[m] | the sriov nic agent should support macvtap plug time events | 10:41 |
sean-k-mooney[m] | ml2/ovs should also send them for vdpa | 10:42 |
gibi | for the deployer probably it is easier to just list vnic_types and not go into details like vhost-user | 10:42 |
sean-k-mooney[m] | vhost-user is a vnic type | 10:42 |
gibi | sean-k-mooney[m]: is it? | 10:42 |
sean-k-mooney[m] | yes | 10:42 |
sean-k-mooney[m] | its not a vif_type | 10:42 |
gibi | blob/6d8e830859cd4ac9708701b8e344fdc68cbcaebb/neutron/plugins/ml2/drivers/mech_sriov/mech_driver/mech_driver.py#L164 | 10:43 |
gibi | sorry | 10:43 |
gibi | wrong buffer | 10:43 |
gibi | https://github.com/openstack/neutron-lib/blob/f01b2e9025d33aeff3bf22ea2568bda036878819/neutron_lib/api/definitions/portbindings.py#L131 | 10:43 |
sean-k-mooney[m] | so i think if you want to hard code it just filter out direct-physical, baremetal and accelerator-direct-physical | 10:43 |
gibi | so here are the vnic_types | 10:43 |
gibi | I don't see vhost-user as vnic type in that list | 10:44 |
sean-k-mooney[m] | oh sorry, maybe you're right, i have not looked at dpdk in 2 years or more | 10:45 |
sean-k-mooney[m] | i might be misremembering, let me check the ml2 driver, but i guess it's vnic_normal | 10:45 |
gibi | I think it is mapped to normal, yes | 10:46 |
gibi | anyhow I think we are in agreement to have this filtering based on vnic_type | 10:46 |
gibi | I think I will amend the current patch with that | 10:46 |
sean-k-mooney[m] | ya you are right it is | 10:46 |
sean-k-mooney[m] | so we should just skip waiting then for *-physical and baremetal | 10:47 |
sean-k-mooney[m] | looking at the other vnic_types i dont think vnic_type smartnic is used with ovs or odl | 10:48 |
gibi | OK, I will discuss this with the downstream folks too and see if they prefer a configurable vnic_type or they are OK with a hardcode | 10:48 |
sean-k-mooney[m] | i think that is used by ironic | 10:48 |
sean-k-mooney[m] | ack | 10:49 |
gibi | yes, smartnic is ironic afaik | 10:49 |
gibi | so we can filter out that too | 10:49 |
sean-k-mooney[m] | ya most likely | 10:49 |
sean-k-mooney[m] | you could make the config an exclude list and default to the set we know wont work | 10:50 |
sean-k-mooney[m] | actully no | 10:50 |
sean-k-mooney[m] | that would enable it by default which we do not want | 10:50 |
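The hard-coded alternative discussed above (skip the types known not to get plug-time events, wait for everything else) could be sketched as below. The set contents follow the vnic_types named in this conversation (PF, cyborg PF, baremetal, smartnic); the names of the set and function are hypothetical, not nova code.

```python
# Hypothetical sketch: skip waiting for vnic_types that, per the discussion
# above, are not expected to get plug-time events from neutron.
NO_PLUG_EVENT_VNIC_TYPES = frozenset({
    "direct-physical",              # PFs: sriov-nic-agent does not manage them
    "accelerator-direct-physical",  # cyborg-managed PFs
    "baremetal",                    # used only by ironic
    "smartnic",                     # also ironic, per the chat
})


def should_wait_for_plug(vnic_type):
    """True if a plug-time event is expected for this vnic_type."""
    return vnic_type not in NO_PLUG_EVENT_VNIC_TYPES
```

As noted above, making this an exclude-list config would flip the feature on by default, which is why the hardcode (or an opt-in allow-list) is preferred.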
sean-k-mooney[m] | ok ill be afk for 20 mins or so chat to you later | 10:51 |
gibi | ack, thanks! | 10:52 |
frickler | kashyap: couple of more findings: a) no change with the Nehalem cpu settings patch from clarkb | 11:13 |
frickler | b) same issue with qemu-6.1 compiled from source | 11:13 |
kashyap | frickler: Hi | 11:13 |
frickler | c) the delta doesn't really increase with large flavors, i.e. with 512M or 1G, the cirros process still stays at 600M | 11:14 |
kashyap | frickler: The Nehalem thing here is not relevant (unless you're using CentOS9). | 11:15 |
kashyap | frickler: Good to know that you've actually tested it compiled from source | 11:15 |
frickler | the latter is likely why this issue isn't more widely seen. it just affects CIs that try to start a larger number of small instances | 11:15 |
frickler | ... in a limited memory environment | 11:15 |
kashyap | frickler: Right. So, this is TCG - this is not super amazingly well-tested upstream. Because a lot of folks use hardware accel. That said: | 11:17 |
kashyap | frickler: Can you please file an upstream QEMU bug here (do you have a GitLab account?) - https://gitlab.com/qemu-project/qemu/-/issues | 11:18 |
kashyap | frickler: That'll help me investigate the issue with a TCG dev. | 11:19 |
kashyap | Also, please include the bits you posted yesterday. (https://paste.opendev.org/raw/810150/) | 11:21 |
kashyap | frickler: I wonder if we can replicate this outside of OpenStack CI: like artificially triggering a script that'll start a ton of CirrOS instances? | 11:22 |
frickler | kashyap: I'll do the bug report, though likely not today, I'll let you know then | 11:23 |
kashyap | Thanks! Do mention the buggy version where you saw it first. And also the 6.1 compiled-from-source test. | 11:23 |
kashyap | It'll help with bisecting. | 11:23 |
frickler | kashyap: well I replicated with a local devstack deployment. I can also try to just create an instance with virt-manager | 11:24 |
kashyap | Yes, that'll be more preferable, if possible. | 11:24 |
frickler | kashyap: ok, thx for your feedback so far | 11:25 |
kashyap | No problem. These TCG bugs (if it is indeed a bug) are hard to suss out. | 11:25 |
sean-k-mooney1 | frickler: you are seeing this just with a normal boot right | 11:37 |
sean-k-mooney1 | you don't need to boot many vms to trigger it | 11:37 |
sean-k-mooney1 | you are seeing the large memory usage with just a single instance | 11:37 |
sean-k-mooney1 | so this should not be hard to replicate right | 11:37 |
kashyap | sean-k-mooney1: I don't think it's just one boot. He said "CIs that try to start a larger no. of small instances" | 11:37 |
sean-k-mooney1 | frickler: out of interest, do you have swap available on these hosts | 11:37 |
kashyap | sean-k-mooney1: Good question ;-) The "ghost of swap"... | 11:38 |
sean-k-mooney1 | kashyap: its only an issue for CIs because those small instances that used to fit no longer do | 11:38 |
sean-k-mooney1 | kashyap: when i first spoke to frickler about this i think they mentioned it happens for any small vm created | 11:38 |
*** sean-k-mooney1 is now known as sean-k-mooney | 11:39 | |
sean-k-mooney | i.e. in ci a vm that used to take say 128MB of RAM is now using 600MB, so we can't run 4 of them in parallel anymore | 11:39 |
* kashyap nods (on both points) | 11:39 | |
sean-k-mooney | so what im wondering is, in environments without swap, are we seeing more resident memory usage | 11:40 |
frickler | sean-k-mooney: yes, I see this with a single instance. the CI example is just where we noticed it first, with jobs OOMing with tempest running parallel tests | 11:40 |
sean-k-mooney | frickler: we had a very weird customer issue where we saw python processes have very large resident memory usage when no swap was present, but when it was allocated they did not have high memory usage and also did not use any swap | 11:41 |
sean-k-mooney | it was like having swap available stopped the memory allocator preallocating the memory | 11:42 |
frickler | hmm, indeed I have no swap on my test host. but we do have swap enabled on CI instances | 11:42 |
sean-k-mooney | ah ok, i was going to say: could you add a 1G swap file temporarily to your devstack and see if it changes the behavior | 11:43 |
sean-k-mooney | if it's in the CI then no point, it's not related | 11:43 |
*** tosky_ is now known as tosky | 12:07 | |
bauzas | (late) reminder: final PTG day for nova sessions starting in 12 mins at https://www.openstack.org/ptg/rooms/newton | 12:49 |
sean-k-mooney | dansmith: for the health check you want me to support http over tcp rather than just a tcp socket, right. assuming yes, im wondering whether it makes sense to make this a real wsgi application or to just use https://eventlet.net/doc/modules/wsgi.html#eventlet.wsgi.server to call a binary-specific health check function | 12:57 |
dansmith | sean-k-mooney: yes http, and I'd keep it uber simple (so the latter) | 12:58 |
sean-k-mooney | ok | 12:58 |
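The "uber simple" option, a bare WSGI callable served by the eventlet.wsgi.server linked above, could be sketched as follows. The `check_health` hook, the port, and the response bodies are all assumptions for illustration, not the eventual nova implementation.

```python
# Hypothetical sketch: a minimal per-binary health check served over HTTP,
# as discussed above. check_health() is a placeholder for whatever a real
# service would verify (RPC/DB connectivity, etc.).

def check_health():
    return True  # placeholder: always healthy


def health_app(environ, start_response):
    """Bare WSGI callable: 200 when healthy, 503 otherwise."""
    if check_health():
        status, body = "200 OK", b"OK"
    else:
        status, body = "503 Service Unavailable", b"UNHEALTHY"
    start_response(status, [("Content-Type", "text/plain"),
                            ("Content-Length", str(len(body)))])
    return [body]


if __name__ == "__main__":
    # Serve on a local TCP port so haproxy or systemd can poll it over HTTP.
    import eventlet
    from eventlet import wsgi
    wsgi.server(eventlet.listen(("127.0.0.1", 8999)), health_app)
```

Because the app is a plain WSGI callable, it could later be promoted to a "real" wsgi application without changing the check logic, which keeps both options in the chat open.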
sean-k-mooney | brb just going to make a coffee and ill join | 13:00 |
bauzas | nova session started | 13:02 |
dansmith | sean-k-mooney: out of curiosity, does haproxy support some sort of bare tcp socket health check/ | 13:03 |
dansmith | I would kinda expect not | 13:03 |
dansmith | I thought even systemd wanted http, but can use a script too | 13:03 |
sean-k-mooney | dansmith: i think it can, but haproxy was not part of my original use cases | 13:10 |
dansmith | okay | 13:10 |
sean-k-mooney | i was originally thinking of this as a command/control interface with command objects exchanged, more like the rpc bus | 13:11 |
sean-k-mooney | with nc or nova-manage as the cli | 13:11 |
gibi | sean-k-mooney: fyi, the sriov agent only unreliably sends vif plug events for VFs, as it polls the hypervisor. If the unplug/plug is fast enough then the agent might miss the state when the device was down | 13:12 |
sean-k-mooney | gibi: more fun | 13:12 |
sean-k-mooney | ok | 13:12 |
gibi | it is fun all the way down :D | 13:12 |
sean-k-mooney | so we might want to only wait for vnic_type=normal then | 13:13 |
gibi | it seems soo | 13:13 |
sean-k-mooney | long term we really do need to fix this interface and enforce a stricter contract | 13:13 |
sean-k-mooney | rather than guessing, because there are so many factors to consider | 13:14 |
gibi | yes, we need neutron to either enforce that events are always sent, or declare in the port what events can be expected. There is no way nova can maintain a sane mapping alone | 13:14 |
bauzas | dmitriis: saw the chat ? | 13:25 |
dmitriis | bauzas: looking, 1 sec | 13:26 |
bauzas | dmitriis: I'm about to propose to postpone your topic after 3pm UTC | 13:26 |
dmitriis | bauzas: got a conflict at 3PM but I can make it work | 13:27 |
bauzas | dmitriis: maybe later then ? | 13:27 |
bauzas | the idea is just to avoid discussing your stuff *before* :) | 13:27 |
dmitriis | bauzas: let's do it at 3PM, later is more complicated :^) | 13:28 |
bauzas | ack, moving your topic then :) | 13:28 |
dmitriis | bauzas: ack, ty for pinging | 13:28 |
stephenfin | Is it just me or is tbarron's sound clipping real bad? I can understand him though (i.e. can be fixed later) | 13:33 |
tbarron | stephenfin: It may be from my end, sorry. Yesterday zoom was having trouble with my rural location. | 13:38 |
stephenfin | tbarron: nw, I was just concerned something was broken on my end :) I could understand everything just fine | 13:39 |
tbarron | cool | 13:39 |
opendevreview | Stephen Finucane proposed openstack/nova master: db: Remove models that were moved to the API database https://review.opendev.org/c/openstack/nova/+/812149 | 13:43 |
opendevreview | Stephen Finucane proposed openstack/nova master: db: Remove models for removed services, features https://review.opendev.org/c/openstack/nova/+/812150 | 13:43 |
opendevreview | Stephen Finucane proposed openstack/nova master: db: Remove nova-network models https://review.opendev.org/c/openstack/nova/+/812151 | 13:43 |
opendevreview | Balazs Gibizer proposed openstack/nova stable/pike: Add a WA flag waiting for vif-plugged event during reboot https://review.opendev.org/c/openstack/nova/+/813437 | 13:45 |
kashyap | clarkb: I'm inundated with a few things; I will reply to your email on the list on Monday. Hope that's okay | 13:56 |
clarkb | kashyap: yup no worries | 15:15 |
clarkb | have a good weekend | 15:16 |
bauzas | dansmith: dunno if you fancy joining us, but we're discussing a topic where your knowledge could be helpful: instance v3.0 object bump | 17:04 |
dansmith | sorry, I'm tied up | 17:04 |
bauzas | or you're stuck in the TC meeting | 17:05 |
bauzas | heh, no worries | 17:05 |
dansmith | I definitely have opinions | 17:05 |
bauzas | dansmith: notes on the etherpad could be appreciated tho :) | 17:05 |
bauzas | https://etherpad.opendev.org/p/nova-yoga-ptg L646 | 17:05 |
sean-k-mooney | https://etherpad.opendev.org/p/nova-yoga-ptg-backup | 17:26 |
sean-k-mooney | no colors but there is a snapshot ^ | 17:26 |
clarkb | I just restored it to the version that bauzas identified as good (the original etherpad url I mean) | 17:28 |
bauzas | yeah \o/ | 17:28 |
sean-k-mooney | yep that looks ok | 17:28 |
bauzas | I just wrote our last conclusions | 17:28 |
bauzas | we can call it a wrap | 17:28 |
sean-k-mooney | clarkb++ thanks | 17:28 |
bauzas | sean-k-mooney: just copy it again, so we have a backup | 17:28 |
* bauzas doesn't wanna implore the infra team for the 3rd time (on the same etherpad) :D | 17:29 | |
sean-k-mooney | i just did a plain text export from the timeline and imported it again in a different page | 17:29 |
bauzas | yeah that works | 17:29 |
sean-k-mooney | but ill try that in html form and see if it can keep colours | 17:29 |
bauzas | we have the highlights with the history | 17:29 |
bauzas | so nothing is technically lost | 17:29 |
sean-k-mooney | in etherpad format it goes to the latest point in history not the one you have selected in the timeline | 17:30 |
sean-k-mooney | ya so html format does not keep colors either | 17:31 |
sean-k-mooney | but we have the backup in any case | 17:31 |
sean-k-mooney | and the original is restored so we are good | 17:31 |
bauzas | yup | 17:32 |
bauzas | on that note, /me calls it a week | 17:32 |
bauzas | \o | 17:32 |
gibi | me too | 17:32 |
gibi | o/ | 17:32 |
bauzas | I'm just sad to hear that the next PTG will still be virtual, but that's life | 17:33 |
bauzas | I'm just exhausted and I miss our whiteboards and hallway discussions | 17:33 |
bauzas | but I'll open a beer to enjoy the last day of the PTG as if it was physical :) | 17:34 |
mnaser_ | hm | 19:22 |
mnaser_ | anyone would have an idea as to why metadata service seems to be stalling / very slow | 19:22 |
mnaser_ | 2021-10-22 19:23:13.422 11 INFO nova.metadata.wsgi.server [req-f65469d1-4082-4b9e-ab3b-3c710c1f0366 - - - - -] 10.30.107.187,10.101.2.141 "GET /latest/meta-data/block-device-mapping/root HTTP/1.1" status: 200 len: 148 time: 25.8355188 | 19:23 |
mnaser_ | it almost feels like all 'requests' stall for a bit and then they all burst out at once | 19:23 |
mnaser_ | memcache is fine, mysql is fine | 19:23 |
mnaser_ | plenty of conductors | 19:24 |
mnaser_ | 2021-10-22 19:25:15.880 13 INFO nova.metadata.wsgi.server [-] 192.168.1.5,10.101.2.142 "GET /openstack HTTP/1.1" status: 200 len: 235 time: 5.0295317 | 19:25 |
mnaser_ | especially this, it seems very 5s-y | 19:25 |
clarkb | mnaser_: I've noticed that on a couple of devstack jobs recently too but haven't had a chance to dig into it beyond noting that was the error returned by tempest | 19:46 |
clarkb | specifically some tests says metadata service fails to respond in time and the test times out | 19:47 |
prometheanfire | nova fails with the new oslo-concurrency 4.5.0 https://zuul.opendev.org/t/openstack/build/9a385f3324fb46a3abf6257a09020d38 | 20:44 |
prometheanfire | looks like an extra value was used (blocking)? | 20:45 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!