opendevreview | Yongli He proposed openstack/nova master: Accelerator smartnic SRIOV support https://review.opendev.org/c/openstack/nova/+/804320 | 02:26 |
---|---|---|
gibi | my broadband has been acting up for the last 24 hours so I might not immediately see pings | 07:24 |
*** rpittau|afk is now known as rpittau | 07:54 | |
aarents | Hi gibi, thanks for the comment on https://review.opendev.org/c/openstack/nova/+/764435 I replied | 07:57 |
lyarwood | gibi: \o hey how was PTO? | 09:38 |
lyarwood | https://review.opendev.org/c/openstack/nova/+/804275 (and the various patches either side) could use reviews this week if anyone has time | 09:50 |
lyarwood | oops, that should be https://review.opendev.org/c/openstack/nova/+/804230/ but that API change is also ready | 09:51 |
lyarwood | I was just going to add novaclient support for the microversion alongside | 09:51 |
lyarwood | another bugfix https://review.opendev.org/c/openstack/nova/+/802317 is also ready for review FWIW | 09:52 |
* lyarwood sorts out the novaclient change before doing some master reviews of his own | 09:52 | |
gibi___ | aarents: ack, I will get back to that | 10:22 |
gibi___ | lyarwood: o/ PTO was good, thanks. now I have broadband issues :/ but I will get to both of your reviews | 10:23 |
gibi___ | stephenfin: I respun the pps series yesterday and fixed your nits along the way. So if you have time for a quick re-review then that would be appreciated | 10:26 |
stephenfin | gibi___: will do | 10:29 |
gibi___ | stephenfin: thanks | 10:29 |
Gowthami__ | Hi All, hope you are doing fine. The commit https://review.opendev.org/c/openstack/nova/+/764482/ was made for the bug https://launchpad.net/bugs/1581977. The tempest test introduced along with the commit has been failing on "IBM PowerKVM CI": it is unable to ping the floating IP. The "guest-instance-1.domain.com" vm created in the https://review.opendev.org/c/openstack/tempest/+/795699 ( ServersTestFqdnHostnames.test_create_server_with_fq | 10:30 |
Gowthami__ | Hi All, hope you are doing fine. The commit https://review.opendev.org/c/openstack/nova/+/764482/ was made for the bug https://launchpad.net/bugs/1581977. The tempest test introduced along with the commit has been failing on "IBM PowerKVM CI": it is unable to ping the floating IP. The "guest-instance-1.domain.com" vm created in the ( ServersTestFqdnHostnames.test_create_server_with_fqdn_name) is active but couldn't ping the vm from its na | 10:32 |
Gowthami__ | The tempest test is being executed on a devstack VM, please find the error link: https://oplab9.parqtec.unicamp.br/pub/ppc64el/openstack/nova/82/764482/2/check/tempest-dsvm-full-focal-py3/de4be5a/job-output.txt openstack console log show doesn't show any error: "Trying to load: from: /pci@800000020000000/scsi@3 ... Successfully loaded". May I ask if you could suggest a way forward for this? | 10:32 |
lyarwood | Gowthami__: that's from within the guestOS itself | 10:37 |
lyarwood | Gowthami__: so whatever guest image you're using doesn't seem to boot fully in this example | 10:37 |
lyarwood | Gowthami__: I would highly doubt that is due to the test and/or fix you have referenced above | 10:38 |
lyarwood | Gowthami__: and it's likely more of an issue with your test env (lack of resources given to each test instance?) or guest image (a recent update maybe?) | 10:38 |
lyarwood | 2021-08-17 07:31:10.919117 | devstack-focal-newcloud | ++ stackrc:source:691 : DEFAULT_IMAGE_NAME=cirros-0.4.0-ppc64le-disk | 10:40 |
lyarwood | 2021-08-17 07:31:10.921287 | devstack-focal-newcloud | ++ stackrc:source:692 : DEFAULT_IMAGE_FILE_NAME=cirros-0.4.0-ppc64le-disk.img | 10:40 |
lyarwood | 2021-08-17 07:31:10.923262 | devstack-focal-newcloud | ++ stackrc:source:693 : IMAGE_URLS+=http://download.cirros-cloud.net/0.4.0/cirros-0.4.0-ppc64le-disk.img | 10:40 |
lyarwood | I'd recommend trying to use 0.5.2 tbh | 10:41 |
lyarwood | 2021-08-17 07:31:10.883720 | devstack-focal-newcloud | ++ stackrc:source:672 : CIRROS_VERSION=0.4.0 | 10:42 |
lyarwood | missed that this job is hardcoded to 0.4.0, | 10:43 |
gibi___ | stephenfin: I have a request in https://review.opendev.org/c/openstack/nova/+/799684/5/nova/tests/unit/db/api/test_migrations.py | 10:55 |
stephenfin | gibi / gibi___: replied | 11:00 |
stephenfin | (done in a follow-up https://review.opendev.org/c/openstack/nova/+/800484/4/tox.ini) | 11:00 |
songwenping__ | stephenfin: hi, have you already fixed oslo.db 8.5.0 for the changed MySQL conflict error message? | 11:04 |
stephenfin | songwenping__: that's probably better asked on #openstack-oslo, but the answer is the patches have been merged but they have not been released yet | 11:06 |
songwenping__ | got it, thanks. | 11:07 |
stephenfin | songwenping__: https://review.opendev.org/c/openstack/releases/+/804844 | 11:08 |
opendevreview | Merged openstack/nova stable/wallaby: virt: Add destroy_secrets kwarg to destroy and cleanup https://review.opendev.org/c/openstack/nova/+/796257 | 11:10 |
gibi___ | stephenfin: thanks | 11:13 |
songwenping__ | stephenfin: thanks, I'll wait for the release patch to merge. | 11:14 |
gibi___ | stephenfin: I have comments in https://review.opendev.org/c/openstack/nova/+/800078 | 11:29 |
gibi___ | stephenfin, lyarwood : I'm done with the alembic series, I'm mostly +2. | 11:33 |
gibi___ | stephenfin: thanks for working on it | 11:33 |
lyarwood | ACK I think I still had some left to review in that series, I'll try to finish it today | 11:37 |
gibi___ | lyarwood: yepp, that is why I pinged you, as I saw you were doing active review previously on that series | 11:38 |
lyarwood | ah I see, thanks | 11:38 |
* lyarwood is still catching up after being offline sick | 11:38 | |
Gowthami__ | <lyarwood> Thank you. Will try with 0.5.2 and increase the resources too. | 11:50 |
lyarwood | Gowthami__: yeah FWIW if you do try 0.5.2 you need to raise the resources anyway https://github.com/cirros-dev/cirros/issues/53 | 11:51 |
gibi___ | aarents: I still have concerns about https://review.opendev.org/c/openstack/nova/+/764435/5/nova/virt/libvirt/driver.py#9970 | 11:57 |
sean-k-mooney | gibi___: my understanding is we are never meant to attempt to roll back a live migration once we have actually started it in qemu | 12:00 |
sean-k-mooney | we can roll back if we call migrate on libvirt and it immediately returns with an error | 12:00 |
sean-k-mooney | but once it starts we don't roll back unless it times out | 12:00 |
sean-k-mooney | besides timeout, do we have other cases where we roll back after the migration has started today? | 12:01 |
sean-k-mooney | im not sure that aarents patch will solve the issue they are trying to solve in this case either | 12:03 |
sean-k-mooney | but for different reasons, the migration may continue as they said and the vm can end up on the destination | 12:03 |
sean-k-mooney | so reverting the db state may or may not be the correct thing to do | 12:04 |
sean-k-mooney | for example im concerned about what happens with post copy | 12:04 |
sean-k-mooney | the instance would still be migrating but running on the dest and we would have already executed part or all of post_live_migration, assuming we received the post-copy resume event before the monitor connection died | 12:06 |
sean-k-mooney | which should mean the host is already updated. | 12:06 |
aarents | gibi___: Hmm, but I think I call live_migration_abort() of the libvirt driver, not the one from the manager, and it only calls the libvirt API, but now I have doubts | 12:07 |
sean-k-mooney | in the libvirt driver it just does https://review.opendev.org/plugins/gitiles/openstack/nova/+/refs/changes/35/764435/5/nova/virt/libvirt/driver.py#9434 | 12:07 |
aarents | sean-k-mooney: yes thanks for the link | 12:08 |
sean-k-mooney | so i think that is fine, it will just call libvirt | 12:12 |
sean-k-mooney | although if the monitor connection is down it may not be able to manage the vm | 12:12 |
* gibi_ hates vodafone | 12:13 | |
aarents | sean-k-mooney: yes in that case it will not work | 12:13 |
sean-k-mooney | which is the case you are trying to fix, right? in the event the monitor connection drops you want to abort the migration job | 12:14 |
sean-k-mooney | you can tell libvirt to abort the migration | 12:14 |
sean-k-mooney | but it may or may not be able to comply | 12:14 |
sean-k-mooney | i assume the except Exception: is to catch the libvirt error that is raised when that happens | 12:15 |
gibi_ | aarents: oops sorry I jumped to the wrong driver.live_migration_abort call. | 12:16 |
aarents | this will work only with network, RPC or DB issues, not for libvirt issues | 12:16 |
sean-k-mooney | aarents: so for those cases im not sure we want to abort the migration | 12:17 |
sean-k-mooney | aarents: unless you want to abort all other operation when that happens | 12:17 |
aarents | sean-k-mooney: or it may work if there is only one flap from libvirt | 12:17 |
sean-k-mooney | spawns, deletes etc. | 12:17 |
sean-k-mooney | it's a larger change but to me what feels like a more robust change would be to suspend the green thread if the connection is closed and resume it when we reconnect and only try to send the rpc call then | 12:20 |
sean-k-mooney | realistically if the rpc bus is down there is nothing on the compute we can do to update the db state | 12:20 |
aarents | sean-k-mooney: honestly, the change just ensures the job is killed regardless of whether the state can or cannot be updated in the DB; we lost some instances due to that, as explained in the bug | 12:24 |
gibi_ | hm, so assuming we have the RPC down. the patch aborts the libvirt job. then raises the exception as today. That exception is expected to update the instance and migration states which will not happen while the RPC is down. the nova compute RPC call to update the DB will time out eventually I guess. | 12:24 |
aarents | gibi_: yes | 12:24 |
gibi_ | so the nova DB will still see the migration as running | 12:24 |
gibi_ | but the compute already aborted it | 12:25 |
gibi_ | does the conductor time out the migration eventually too? | 12:25 |
sean-k-mooney | aarents: have you tested this with post-copy enabled | 12:25 |
sean-k-mooney | gibi_: i think the timeout happens at the compute level | 12:25 |
sean-k-mooney | not the conductor | 12:25 |
gibi_ | sean-k-mooney: so there is no cleanup triggered be the conductor for this aborted migration | 12:26 |
gibi_ | s/be/by/ | 12:26 |
sean-k-mooney | im not sure | 12:26 |
aarents | gibi_: yes there will be an inconsistency that needs operator intervention, but the vm will be safe because it is still referenced in the source host & running on the source host | 12:26 |
gibi_ | aarents: I see. that was the missing piece | 12:26 |
sean-k-mooney | aarents: again i dont know if that is always correct | 12:27 |
gibi_ | aarents: so this change does not try to fix a DB inconsistency but tries to save the VM | 12:27 |
sean-k-mooney | you have ignored my post-copy question | 12:27 |
aarents | sean-k-mooney: good question | 12:28 |
aarents | gibi_: exactly, I was not clear | 12:28 |
gibi_ | sean-k-mooney: so you suggest that the save move would be to abort the non post-copy migrations and let the post-copy migrations run forward | 12:28 |
gibi_ | /save/safe/ | 12:28 |
sean-k-mooney | gibi_: yes | 12:28 |
sean-k-mooney | but only after we are in the post copy phase | 12:28 |
gibi_ | sean-k-mooney: I guess we document that post-copy means no way back | 12:28 |
sean-k-mooney | well we can abort until we enter the post copy phase | 12:29 |
aarents | sean-k-mooney: I don't have much experience with post copy in operation | 12:29 |
sean-k-mooney | but when we hit post copy suspend we call post_live_migration | 12:29 |
sean-k-mooney | and update the host and port bindings | 12:29 |
sean-k-mooney | aarents: we don't have access to the last known state of the instance at this point, do we | 12:31 |
sean-k-mooney | from a libvirt perspective | 12:31 |
sean-k-mooney | assuming not then i would make the abort conditional on "not postcopy_enabled" | 12:31 |
sean-k-mooney | to be on the safe side | 12:31 |
sean-k-mooney | gibi_: looking at https://github.com/openstack/nova/blob/master/nova/conductor/tasks/live_migrate.py i dont see anything that looks like cleanup logic once a migration has started | 12:33 |
aarents | sean-k-mooney: no we don't have access to the instance state | 12:33 |
sean-k-mooney | we handle messaging timeouts etc. from check_can_live_migrate_destination and other cases but there seems to be no overall timeout enforced by the conductor | 12:34 |
sean-k-mooney | which kind of makes sense since the live migration timeout options are virt driver specific | 12:34 |
sean-k-mooney | as is whether we abort or force complete when the timeout expires | 12:35 |
gibi_ | sean-k-mooney: ack, I got it now that aarents' goal is to save the VM running state instead of avoiding the DB inconsistency in case of RPC/DB issue | 12:35 |
gibi_ | sean-k-mooney, aarents: I'm fine with the patch with addition of the "not postcopy_enabled" condition as sean-k-mooney suggests | 12:36 |
sean-k-mooney | gibi_: i think i would be ok with it with that added also | 12:37 |
gibi_ | cool | 12:37 |
aarents | gibi_: sean-k-mooney yep this "not postcopy_enabled" condition makes sense | 12:37 |
aarents | I will add that, thanks | 12:38 |
lyarwood | https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainAbortJob FWIW | 12:39 |
lyarwood | In case the job is a migration in a post-copy mode, virDomainAbortJob will report an error (see virDomainMigrateStartPostCopy for more details). | 12:39 |
lyarwood | https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainMigrateStartPostCopy has some more context | 12:40 |
lyarwood | On the other hand once the guest is running on the destination host, the migration can no longer be rolled back because none of the hosts has complete state. If this happens, libvirt will leave the domain paused on both hosts with VIR_DOMAIN_PAUSED_POSTCOPY_FAILED reason. It's up to the upper layer to decide what to do in such case. Because of this, libvirt will refuse to cancel post-copy migration via virDomainAbortJob. | 12:40 |
lyarwood | so tbh I don't think we need to check anything | 12:41 |
lyarwood | oh wait I missed that live_migration_abort is raising the error back, sigh | 12:42 |
sean-k-mooney | ya although we are catching and ignoring that with a log in aarents patch | 12:45 |
sean-k-mooney | i guess we could rely on that behavior but i would prefer to have a comment to that effect honestly | 12:45 |
sean-k-mooney | just to not forget that | 12:45 |
sean-k-mooney | as the next time i see the abort i'll get suspicious about post copy again. | 12:46 |
gibi_ | aarents: are you seeing this ^^ :) | 12:49 |
aarents | gibi_: Yes so I will add a comment that say that abort may not work in case of post copy ? | 12:50 |
gibi_ | aarents: I guess you need to catch the error returned from abort and ignore it | 12:50 |
aarents | So I drop the warning | 12:52 |
aarents | ? | 12:52 |
gibi_ | aarents: sorry, so you are already catching the error from abort, that is OK | 12:53 |
gibi_ | keep the warning too | 12:53 |
gibi_ | just add a note as sean-k-mooney requested | 12:53 |
aarents | And is there a consensus about sean-k-mooney's suggestion to change except Exception to except libvirt.libvirtError: ? | 12:53 |
aarents | gibi_: ok | 12:53 |
gibi_ | yepp go with libvirtError | 12:53 |
aarents | ok cool | 12:55 |
opendevreview | Stephen Finucane proposed openstack/nova master: docs: Add documentation on database migrations https://review.opendev.org/c/openstack/nova/+/800078 | 13:05 |
opendevreview | Stephen Finucane proposed openstack/nova master: db: Final cleanups https://review.opendev.org/c/openstack/nova/+/800484 | 13:05 |
opendevreview | Stephen Finucane proposed openstack/nova master: tests: Enable SADeprecationWarning warnings https://review.opendev.org/c/openstack/nova/+/804708 | 13:05 |
opendevreview | Stephen Finucane proposed openstack/nova master: WIP tests: Enable SQLAlchemy 2.0 deprecation warnings https://review.opendev.org/c/openstack/nova/+/804709 | 13:05 |
stephenfin | gibi_: ^ | 13:05 |
opendevreview | Lee Yarwood proposed openstack/nova master: api: Introduce microversion 2.89 adding attachment_id to responses https://review.opendev.org/c/openstack/nova/+/804275 | 13:07 |
gibi | stephenfin: ack | 13:09 |
gibi | stephenfin: thanks, now I'm +2 all the way | 13:10 |
*** thelounge555 is now known as thelounge55 | 13:35 | |
opendevreview | Stephen Finucane proposed openstack/nova master: tests: Enable SQLAlchemy 2.0 deprecation warnings https://review.opendev.org/c/openstack/nova/+/804709 | 13:53 |
opendevreview | Stephen Finucane proposed openstack/nova master: Replace use of Engine.scalar(), Engine.execute() https://review.opendev.org/c/openstack/nova/+/804878 | 13:53 |
opendevreview | Elod Illes proposed openstack/nova stable/rocky: [stable-only] Fix lower-constraints job https://review.opendev.org/c/openstack/nova/+/769910 | 13:55 |
opendevreview | Alexandre arents proposed openstack/nova master: libvirt: Abort live-migration job when monitoring fails https://review.opendev.org/c/openstack/nova/+/764435 | 14:00 |
opendevreview | Elod Illes proposed openstack/nova stable/rocky: [stable-only] Fix lower-constraints job https://review.opendev.org/c/openstack/nova/+/769910 | 14:28 |
gibi | melwitt: hi! I left feedback in https://review.opendev.org/c/openstack/nova/+/713301 the most concerning for me is the dependency on an oslo.limit patch as non client libraries are going to feature freeze this week | 14:50 |
ganso | melwitt, gibi hi! if you have a few minutes could you please take a look at this 1-liner fix https://review.opendev.org/c/openstack/nova/+/804303 ? Thanks in advance | 15:49 |
gibi | ganso: looking | 15:50 |
spatel | sean-k-mooney hey! i am upgrading minor version of victoria and during upgrade at this step i hit this issue - https://paste.opendev.org/show/808150/ | 15:52 |
gibi | FYI, nova meeting starts in 5 minutes here in the channel | 15:54 |
gibi | ganso: does the 1 vcpu + multiqueue case work for other than vif_type=tap? | 15:56 |
ganso | gibi that code path is only reached when vif_type=tap. If using openvswitch, it wasn't impacted by the previous patch (that introduced the regression), nor by this one. I am not sure if the same problem happens with ovs, but since the original problem didn't, I believe this one also doesn't | 15:57 |
gibi | ganso: ok, let me try with ovs | 15:59 |
sean-k-mooney | vif_type=tap is not currently used with ovs | 15:59 |
sean-k-mooney | it was added temporarily and then removed | 16:00 |
sean-k-mooney | i believe the only thing that uses vif_type=tap today is calico | 16:00 |
gibi | sean-k-mooney: yepp the bug mentions calico | 16:00 |
gibi | sean-k-mooney: https://bugs.launchpad.net/nova/+bug/1939604 | 16:00 |
melwitt | gibi: ah, right. it's not a hard dep, so we can untie it. it's nice-to-have since it will cache limit for a repeated try N, N-1, N-2, etc for a multi create | 16:00 |
sean-k-mooney | gibi: ah ok ill look at it after the meeting | 16:01 |
gibi | melwitt: OK, it is easier then. lets see if the oslo patch lands before the deadline and if not then just remove the depends-on | 16:01 |
gibi | but now lets have a meeting | 16:01 |
gibi | #startmeeting nova | 16:01 |
opendevmeet | Meeting started Tue Aug 17 16:01:38 2021 UTC and is due to finish in 60 minutes. The chair is gibi. Information about MeetBot at http://wiki.debian.org/MeetBot. | 16:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 16:01 |
opendevmeet | The meeting name has been set to 'nova' | 16:01 |
melwitt | gibi: ++ thanks | 16:01 |
gibi | sean-k-mooney: thanks | 16:01 |
gibi | #topic Bugs (stuck/critical) | 16:02 |
gibi | no critical bug open | 16:02 |
gibi | #link 15 new untriaged bugs (+4 since the last meeting): #link https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New | 16:02 |
gibi | is there any specific bug to discuss today? | 16:03 |
gibi | I see ganso has one in the open discussion, lets bring that up here | 16:04 |
gibi | (ganso): bug "Compute node deletes itself if rebooted without DNS": https://bugs.launchpad.net/nova/+bug/1939920 | 16:04 |
gibi | was this a design choice? acceptable solutions discussion | 16:04 |
gibi | EOM | 16:04 |
ganso | gibi thanks | 16:04 |
ganso | so, IMO this is a critical bug, and after reading the code and the way it works it kinda feels like a design choice | 16:05 |
ganso | because seems like it was intentionally implemented for it to scan for "orphan compute nodes" and delete them, clear the allocations and RP, etc | 16:05 |
gibi | yes that was intentional | 16:05 |
ganso | but it is producing this effect which is very undesirable | 16:05 |
gibi | so in your infra the compute host can change hostname and that causing the issue | 16:06 |
ganso | as I suggested in the bug, a possible solution I see is to compare the host field in the nova.compute_nodes table | 16:06 |
ganso | if it is the same, then we would skip this | 16:06 |
ganso | gibi: it is not that it "can" change the hostname. But it happens due to external reasons | 16:06 |
ganso | like, lack of connectivity when it boots, a DNS outage, etc | 16:07 |
melwitt | it will recover in that it will create a new compute node etc. the main thing that is "unique" is the hostname, that's what's stored in the instance.host and a whole lot of other places. so changing the name you break all the associations and in reality you essentially have a new/different service and compute node | 16:07 |
melwitt | if the associations were done using UUID it would be a different story. but unfortunately it is what it is and would take a large work to change it IMHO | 16:08 |
ganso | melwitt: right, so the instance.host captures the entire FQDN, and the FQDN is what is changing, therefore when that changes, running instances are no longer identifiable as running in that node | 16:08 |
melwitt | right | 16:08 |
ganso | melwitt: so that is another side-effect of that FQDN changing problem, but I am not proposing changing that. I am just proposing to skip this "deletion" step if the compute_nodes.host field does not change | 16:09 |
ganso | it will avoid part of the issues | 16:09 |
ganso | melwitt it will not avoid the issue you described, but 1 issue is better than 2 I think | 16:10 |
dansmith | I'm missing the distinction I think | 16:10 |
gibi | but compute_nodes.host comes from the DB, doesn't it? so it won't ever change | 16:11 |
sean-k-mooney | ganso: right so nova does not support compute hosts changing hostname today | 16:11 |
ganso | gibi: doesn't it derive from the FQDN it reads from the system? | 16:11 |
sean-k-mooney | so if it is changing for external reasons that is not expected to work out of the box | 16:11 |
dansmith | sean-k-mooney: ++ | 16:12 |
ganso | sean-k-mooney: right, but I'm not proposing that it supports it, just that it stops doing what it is doing today. That thing about orphan compute nodes isn't supposed to address changing hostnames either | 16:12 |
gibi | ganso: when the ComputeNode is created then yes, it is coming from the hostname reported by libvirt, but never changes after | 16:13 |
sean-k-mooney | ganso: even if we did not clean up the orphan compute nodes, the instance.host is used to make rpc calls to the host that the instance is on | 16:13 |
sean-k-mooney | so unless you hardcode the host parameter in the nova.conf so it does not change | 16:14 |
ganso | dansmith: when FQDN changes from "host.domain" to "host.domain1" or just "host" it causes the compute node to delete itself from the DB, clear allocations, RP, etc, and the new name will not match the instances.host field as melwitt mentioned. Out of all those consequences, I'd suggest skipping the compute node deletion, because this is an error state, to avoid deleting up all allocations and RPs, so the node can more easily go | 16:14 |
ganso | back to normal once the FQDN is fixed and the service is restarted | 16:14 |
sean-k-mooney | that will still break | 16:14 |
sean-k-mooney | ganso: the compute service will not do that by default | 16:14 |
sean-k-mooney | the compute service will auto register | 16:14 |
ganso | sean-k-mooney: yes, that will still be broken, as it is today, no need to fix that right now | 16:14 |
sean-k-mooney | but it won't auto delete | 16:14 |
ganso | sean-k-mooney: well it does, it thinks there was an orphan and deletes it | 16:15 |
sean-k-mooney | ganso: what deletes it | 16:15 |
sean-k-mooney | i think i missed that | 16:15 |
ganso | sean-k-mooney: https://github.com/openstack/nova/blob/b0099aa8a28a79f46cfc79708dcd95f07c1e685f/nova/compute/manager.py#L9997 | 16:15 |
sean-k-mooney | is this a clustered hypervisor | 16:16 |
ganso | sean-k-mooney: "host.domain" changes to "host", so it deletes "host.domain" from the compute nodes table and creates a new one, as if the node was brand new | 16:16 |
sean-k-mooney | e.g. ironic or hyperv or something like vmware | 16:16 |
dansmith | ganso: because that's a hostname change | 16:16 |
ganso | sean-k-mooney: no, it is just a regular compute node with a libvirt compute service | 16:16 |
dansmith | ganso: arrange for that not to happen, that's the solution, IMHO | 16:17 |
ganso | dansmith: unfortunately it is beyond control | 16:17 |
sean-k-mooney | well for the libvirt driver that is entirely unsupported | 16:17 |
sean-k-mooney | the other way to fix this is to make sure your canonical hostname is not the fqdn | 16:17 |
ganso | my proposal is to leave it in an error state to prevent it from deleting allocations and RP | 16:17 |
sean-k-mooney | e.g. in /etc/hosts set <ip> <short hostname> <fqdn> | 16:18 |
dansmith | sean-k-mooney: or /etc/domainname, but yeah, totally fixable, IMHO | 16:19 |
sean-k-mooney | im still confused how nodenames is a list in this case | 16:19 |
ganso | sean-k-mooney: hmm I see, that would override the one currently being provided by the domain provider | 16:19 |
sean-k-mooney | or rather how, when the fqdn changes, we are actually getting anything back from the db | 16:20 |
sean-k-mooney | i was expecting it to not match anything | 16:20 |
dansmith | can't you set the hostname nova uses in the config anyway? to hard-code it per host so it doesn't change, I thought we had that | 16:20 |
sean-k-mooney | unless you have something like host1.<domain1> host1.<domain2> | 16:20 |
ganso | dansmith: looking in the code now | 16:20 |
sean-k-mooney | dansmith: you can set the hostname used by the compute service | 16:21 |
dansmith | we might not want to hold up the meeting to discuss this to completion | 16:21 |
sean-k-mooney | not the hypervisor_hostname | 16:21 |
sean-k-mooney | which in this case comes from libvirt | 16:21 |
dansmith | sean-k-mooney: ah, so that's fixed but the nodename is always from hostname, right okay | 16:21 |
ganso | oh yea, console_host | 16:21 |
ganso | default=socket.gethostname() | 16:21 |
sean-k-mooney | ganso: not console host but ill get the link and send it to you | 16:21 |
sean-k-mooney | i think we can move on and come back to this after the meeting | 16:21 |
ganso | sean-k-mooney: thanks, yes. Thanks for the suggestions! | 16:22 |
gibi | lets come back to this | 16:22 |
gibi | thanks sean-k-mooney dansmith melwitt | 16:22 |
sean-k-mooney | ganso: https://github.com/openstack/nova/blob/master/nova/conf/netconf.py#L52-L70 | 16:22 |
gibi | any other bug that needs attention? | 16:22 |
gibi | #topic Gate status | 16:23 |
gibi | Nova gate bugs #link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure | 16:23 |
gibi | I dont see new gate bugs in that list | 16:23 |
gibi | and also I pushed plenty of patches yesterday without many failures from Zuul | 16:24 |
gibi | so I think master CI looks good | 16:24 |
sean-k-mooney | :) | 16:24 |
gibi | any recent failures? | 16:24 |
melwitt | yeah I think the troubling one is the libvirt/qemu one.. that we're doing non voting on live migration job over | 16:24 |
gibi | yeah, skipping that helped a lot | 16:25 |
gibi | placement periodic jobs are green too #link https://zuul.openstack.org/builds?project=openstack%2Fplacement&pipeline=periodic-weekly | 16:25 |
gibi | anything else about the gate? | 16:25 |
gibi | #topic Release Planning | 16:26 |
gibi | Milestone 3 and therefore Feature Freeze is at 3rd of September which is in 2 weeks. | 16:26 |
gibi | lets land things :) | 16:26 |
gibi | Non client library freeze is this week. | 16:26 |
gibi | os-vif: https://review.opendev.org/q/project:openstack/os-vif+status:open+branch:master nothing important seems to be pending | 16:26 |
gibi | os-resource-classes: https://review.opendev.org/q/project:openstack/os-resource-classes+status:open ditto nothing is pending | 16:27 |
sean-k-mooney | yes i might try to address https://bugs.launchpad.net/os-vif/+bug/1939542 | 16:27 |
gibi | os-traits: https://review.opendev.org/q/project:openstack/os-traits+status:open there seem to be pending reviews for traits needed by ongoing features e.g.: COMPUTE_GRAPHICS_MODEL_BOCHS and HW_FIRMWARE_UEFI | 16:27 |
sean-k-mooney | but im fine with backporting it too | 16:27 |
gibi | sean-k-mooney: sure, bugs are easy, as the fix is backportable | 16:27 |
gibi | but os-traits has some new trait proposals that, if they don't land, will block the features depending on them in Xena | 16:28 |
gibi | so let's close those this week | 16:28 |
sean-k-mooney | im not sure about HW_FIRMWARE_UEFI | 16:28 |
sean-k-mooney | but ill review it | 16:29 |
sean-k-mooney | technically that is stating that the host has uefi boot capability | 16:29 |
gibi | the BOCHS trait probably needs kashyap's answer as stephenfin had some feedback on https://review.opendev.org/c/openstack/os-traits/+/794807 | 16:29 |
sean-k-mooney | as in the host can boot in uefi mode not that it can virtualise it | 16:29 |
sean-k-mooney | so i think HW_FIRMWARE_UEFI should be COMPUTE_FIRMWARE_UEFI | 16:30 |
gibi | ohh, that is a good point | 16:30 |
gibi | stephenfin: ^^ :) | 16:30 |
sean-k-mooney | the bochs trait looks correct but i'll read stephenfin's comments | 16:30 |
gibi | sean-k-mooney: thanks | 16:31 |
gibi | anything else about the coming lib feature freeze? | 16:31 |
*** rpittau is now known as rpittau|afk | 16:31 | |
gibi | #topic PTG Planning | 16:32 |
gibi | every info is in the PTG etherpad #link https://etherpad.opendev.org/p/nova-yoga-ptg | 16:32 |
gibi | If you see a need for a specific cross project section then please let me know | 16:32 |
gibi | s/section/session/ | 16:33 |
gibi | any question about the PTG? | 16:34 |
gibi | #topic Stable Branches | 16:35 |
gibi | stable/queens is blocked (tempest-full-py3 @ "Starting Horizon", probably due to queens-eol of horizon) | 16:35 |
gibi | all the other branches' gate look OK | 16:35 |
gibi | EOM from elodilles | 16:35 |
elodilles | i've proposed a quick fix for queens gate: https://review.opendev.org/c/openstack/devstack/+/804889 | 16:35 |
gibi | elodilles: thanks | 16:36 |
gibi | any other news from stable-land? | 16:36 |
elodilles | nothing from me | 16:36 |
gibi | OK moving on | 16:37 |
gibi | I'm skipping libvirt subteam as bauzas_away is on PTO | 16:37 |
gibi | #topic Open discussion | 16:37 |
gibi | (melwitt): unified limits series is ready for review (https://blueprints.launchpad.net/nova/+spec/unified-limits-nova) https://review.opendev.org/q/topic:bp/unified-limits-nova | 16:37 |
gibi | I started on that ^^ and will continue tomorrow | 16:37 |
opendevreview | Merged openstack/nova stable/wallaby: libvirt: Do not destroy volume secrets during _hard_reboot https://review.opendev.org/c/openstack/nova/+/796258 | 16:38 |
gibi | but one more core is needed | 16:38 |
gibi | who feels the power? | 16:38 |
melwitt | yeah just wanted to give a quick heads up that this is up-to-date, as some know it was stalled for awhile. it's a "tech preview" status where the legacy quota APIs are read-only and there are no quota migration tools, it is DIY for operators to try out | 16:39 |
sean-k-mooney | dansmith: lyarwood do ye have time to review the unified limits series | 16:39 |
melwitt | I have added some tempest test coverage that Depends-On it that can be looked at to see it working | 16:39 |
dansmith | do I? no. should I? yes. Will I? I'll try :) | 16:39 |
sean-k-mooney | :) | 16:39 |
melwitt | hehe ++ | 16:39 |
gibi | :) | 16:40 |
melwitt | thanks all for listening, we can move on I think | 16:40 |
gibi | ok | 16:40 |
gibi | there is one more topic on the wiki | 16:41 |
gibi | (gibi): PTL nomination is open. As I noted in my Xena nomination, I will not run for the 4th time as Nova PTL. | 16:41 |
gibi | if you have questions about the role as you consider running for it then feel free to ask me | 16:41 |
gibi | nothing else on the agenda | 16:43 |
gibi | is there anything else to discuss today? | 16:43 |
gibi | if not then thanks for joining | 16:44 |
ganso | gibi: I will do more testing with the hostname config and /etc/hosts later today and I will mark that bug as invalid if successful (probably will be) =) | 16:45 |
gibi | ganso: cool, thanks | 16:45 |
gibi | #endmeeting | 16:45 |
opendevmeet | Meeting ended Tue Aug 17 16:45:20 2021 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 16:45 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/nova/2021/nova.2021-08-17-16.01.html | 16:45 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/nova/2021/nova.2021-08-17-16.01.txt | 16:45 |
opendevmeet | Log: https://meetings.opendev.org/meetings/nova/2021/nova.2021-08-17-16.01.log.html | 16:45 |
sean-k-mooney | gibi: ganso so i looked at https://review.opendev.org/c/openstack/nova/+/804303 in parallel | 16:46 |
sean-k-mooney | it looks ok to me modulo some nits | 16:46 |
ganso | sean-k-mooney: thank you very much! I will address them this afternoon! | 16:47 |
melwitt | dansmith: tangentially related, I updated the oslo.limit caching patch a couple of weeks ago to address your comments if you wanted to take another look https://review.opendev.org/c/openstack/oslo.limit/+/802814 | 16:52 |
gibi | ganso, sean-k-mooney: I also checked and it seems in the non-tap case vcpu=1 and multiqueue works today | 16:52 |
dansmith | melwitt: ack | 16:53 |
sean-k-mooney | gibi: ya, it's weird; I'd expect it to work and just configure 1 queue | 17:04 |
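The expectation above (with multiqueue requested and vcpus=1, the guest NIC should simply get a single queue rather than fail) can be sketched as a tiny helper. This is a minimal, hypothetical Python sketch, not nova's libvirt VIF code: it assumes the queue count follows the vCPU count, optionally capped by a host maximum. `vhost_queue_count` and `max_queues` are illustrative names.

```python
from typing import Optional


def vhost_queue_count(vcpus: int, multiqueue_requested: bool,
                      max_queues: Optional[int] = None) -> Optional[int]:
    """Hypothetical helper: number of virtio-net queues for a guest NIC.

    Assumes one queue per vCPU, capped by an optional host maximum.
    Returns None when multiqueue was not requested at all.
    """
    if not multiqueue_requested:
        return None
    if vcpus < 1:
        raise ValueError("vcpus must be >= 1")
    if max_queues is not None:
        # Never ask for more queues than the host can provide.
        return min(vcpus, max_queues)
    # A 1-vCPU guest degrades gracefully to a single queue.
    return vcpus
```

With this shape, `vhost_queue_count(1, True)` yields 1, which is the "just configure 1 queue" behaviour described for the non-tap case.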
ganso | gibi: thanks! so the patch looks good? | 17:17 |
gibi | ganso: yapp | 17:18 |
melwitt | johnthetubaguy[m]: not sure if you would be able to take a quick look, but are you opposed to the idea of putting global limits in keystone as well, instead of setting them in config? https://review.opendev.org/c/openstack/nova/+/712142/14#message-76a84195c59afe78a2a26cbfd8d710bb2ad10165 | 17:53 |
opendevreview | Rodrigo Barbieri proposed openstack/nova master: Fix 1vcpu error with multiqueue and vif_type=tap https://review.opendev.org/c/openstack/nova/+/804303 | 18:03 |
ganso | sean-k-mooney: I had tested nova.instances.vcpus while doing resizes and saw that the value in that variable has the same content as the new flavor. I also think nova.instances.vcpus is more performant since it does not need to join tables to get that value | 18:17 |
sean-k-mooney | melwitt: e.g. having global_vcpu_limit or something in keystone and then using that | 18:17 |
sean-k-mooney | ganso: it's a copy of the flavor value | 18:18 |
sean-k-mooney | we likely should remove it in the future | 18:18 |
sean-k-mooney | and make it a property that just gets it from the flavor | 18:18 |
sean-k-mooney | ganso: but the XML generation etc. will never use instance.vcpus | 18:19 |
ganso | sean-k-mooney: what if the flavor is edited? where will the original value be saved? | 18:19 |
sean-k-mooney | ganso: it's in the instance_extra table | 18:19 |
ganso | sean-k-mooney: oh I see, so it is not directly from the flavors table | 18:19 |
sean-k-mooney | we make a copy of the flavor per instance | 18:19 |
sean-k-mooney | ganso: no, it's not from the API DB | 18:19 |
sean-k-mooney | instance.flavor.vcpus is coming from the copy of the flavor created when the instance was created | 18:20 |
sean-k-mooney | instance.vcpus is identical | 18:20 |
sean-k-mooney | ganso: if others are OK with it, it should work | 18:20 |
sean-k-mooney | ganso: I just thought we had deprecated instance.vcpus already | 18:21 |
sean-k-mooney | along with instance.memory_mb and the other things that are in the flavor | 18:21 |
ganso | sean-k-mooney: I probably would need to retest a resize to see if instances.get_flavor().vcpus gets the old or the new flavor | 18:23 |
sean-k-mooney | well, for resize we have separate flavors | 18:23 |
ganso | sean-k-mooney: I was happy that instances.vcpu was consistent for resizes | 18:23 |
sean-k-mooney | ganso: I don't know if we have testing that enforces that, which is why I was nervous about using it | 18:24 |
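The data model discussed above (each instance carries its own copy of the flavor taken at creation time, `instance.vcpus` duplicates the copied flavor's vCPU count, and the suggestion is to turn it into a property derived from that copy) can be sketched as follows. This is a minimal, hypothetical Python model, not nova's `Instance` object; the `Flavor` and `Instance` classes are illustrative, and resize (which uses separate old/new flavors) is deliberately not modeled.

```python
import copy
from dataclasses import dataclass


@dataclass
class Flavor:
    name: str
    vcpus: int
    memory_mb: int


class Instance:
    """Hypothetical, simplified instance model (not nova's Instance)."""

    def __init__(self, flavor: Flavor) -> None:
        # Per-instance snapshot of the flavor taken at creation time, so
        # later edits to the flavor definition do not leak into it.
        self.flavor = copy.copy(flavor)

    @property
    def vcpus(self) -> int:
        # Derived from the snapshot instead of stored separately, so it
        # can never drift from self.flavor.vcpus.
        return self.flavor.vcpus

    @property
    def memory_mb(self) -> int:
        return self.flavor.memory_mb
```

For example, creating an instance from a 1-vCPU flavor and then editing that flavor definition to 4 vCPUs still leaves `inst.vcpus == 1`, because the property reads the instance's own copy.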
opendevreview | Rodrigo Barbieri proposed openstack/nova master: Fix 1vcpu error with multiqueue and vif_type=tap https://review.opendev.org/c/openstack/nova/+/804303 | 21:01 |