opendevreview | Sam Morrison proposed openstack/nova master: Filter out deleted instances when looking for build timouts https://review.opendev.org/c/openstack/nova/+/880125 | 06:48 |
bauzas | dansmith: gmann: sean-k-mooney: I added my thoughts on the deprecation patches, but I'm afraid a backport wouldn't help operators notice | 08:17 |
opendevreview | Jorge San Emeterio proposed openstack/nova master: WIP: Testing whether tests on bug#1998148 still fail. https://review.opendev.org/c/openstack/nova/+/880135 | 08:48 |
Uggla | bauzas , hi, if you have some time, can you have a look at the virtiofs series and especially the interfaces ? | 09:05 |
bauzas | sure I've seen your updates | 09:06 |
bauzas | but I need a jar of coffee first | 09:06 |
sean-k-mooney | Uggla: have you moved the libvirt driver changes for creating the xml config to the start of the series? | 09:34 |
sean-k-mooney | that should be the first set of patches in the series so it can be merged to allow the scaphanda work to start in parallel | 09:34 |
sean-k-mooney | while we wait for manila to add the lock api | 09:35 |
manuvakery1 | Hi, I have a 2 node cluster and I am running a rally scenario against it to create and delete servers. When i set the server count to 2 I see the response time as follows | 09:36 |
manuvakery1 | nova.boot_servers : 11.941 sec (Avg) | 09:36 |
manuvakery1 | nova.delete_servers: 14.675 sec (Avg) | 09:36 |
manuvakery1 | making the server count 5 adds a huge difference in delete_servers | 09:36 |
manuvakery1 | nova.boot_servers: 18.803 sec (Avg) | 09:36 |
manuvakery1 | nova.delete_servers: 42.988 sec (Avg) | 09:36 |
manuvakery1 | is this expected? | 09:36 |
manuvakery1 | the compute nodes are not under any load | 09:37 |
Uggla | sean-k-mooney, hum, not really. | 09:37 |
sean-k-mooney | ok, I'm not sure when we discussed that | 09:40 |
kashyap | bauzas: Do we have a conclusion on this? -- https://review.opendev.org/c/openstack/nova/+/879021 | 09:41 |
sean-k-mooney | but if we want to merge any code before the manila api is implemented, the libvirt driver change for the xml generation is the lowest risk | 09:41 |
bauzas | sean-k-mooney: I haven't gone further than the 4th change, but the Manila API should be the last change in the series | 09:42 |
noonedeadpunk | hey folks! We're reviewing our nova role, as we haven't touched it in a while, and now we're a bit /o\ about the logic we have there. so I'd be glad if you could share your perspective on some things :) | 09:42 |
sean-k-mooney | bauzas: yes, the api should be last and the xml changes in the driver should be first | 09:42 |
sean-k-mooney | driver, then db, then rpc, finally api | 09:42 |
bauzas | sean-k-mooney: for the libvirt driver, meh to me, since I think we should be quite okay with the first 4 patches (DB and objects) | 09:42 |
bauzas | noonedeadpunk: shoot | 09:43 |
noonedeadpunk | So first question - when `discover_hosts_in_cells_interval` is set to -1, I assume you need to run `nova-manage cell_v2 discover_hosts` before compute will appear in `openstack compute service list`? | 09:43 |
noonedeadpunk | As I recall, the compute itself does report back on startup as well | 09:43 |
sean-k-mooney | noonedeadpunk: no they will show up in the list | 09:43 |
sean-k-mooney | but they wont be in the cell mappings | 09:43 |
noonedeadpunk | and this option is for scheduler | 09:43 |
bauzas | noonedeadpunk: no, but they won't be scheduled | 09:43 |
bauzas | since they won't have a cell mapping | 09:44 |
noonedeadpunk | aha, gotcha | 09:44 |
bauzas | haha | 09:44 |
sean-k-mooney | discover_hosts tells us which rabbit to use to talk to the compute and what cell db it's in | 09:44 |
noonedeadpunk | ok, then we have this set correctly :D | 09:44 |
noonedeadpunk | but it discovers... through API? | 09:45 |
sean-k-mooney | no | 09:45 |
noonedeadpunk | not through conductor then? | 09:45 |
noonedeadpunk | ah, right | 09:45 |
sean-k-mooney | it discovers by connecting to all the cell dbs | 09:45 |
noonedeadpunk | ok-ok, yes, I recalled :) | 09:45 |
sean-k-mooney | https://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L1043 | 09:46 |
sean-k-mooney | calls https://github.com/openstack/nova/blob/ae42400b7663bc58d5562de99e976c95131b77a9/nova/objects/host_mapping.py#L247-L285 | 09:46 |
noonedeadpunk | and in order for a compute to be discovered by discover_hosts it must already be in the service list? | 09:47 |
opendevreview | Merged openstack/placement master: Db: Drop redundant indexes for columns already having unique constraint https://review.opendev.org/c/openstack/placement/+/856770 | 09:47 |
sean-k-mooney | correct | 09:47 |
sean-k-mooney | so the compute agent connects to the conductor on startup | 09:47 |
sean-k-mooney | and then it asks the conductor to register a service in the db | 09:47 |
sean-k-mooney | "the cell db" | 09:48 |
sean-k-mooney | the conductor may or may not have access to the api db | 09:48 |
sean-k-mooney | so we can't assume it can add the cell mapping | 09:48 |
noonedeadpunk | yeah, ok, thanks for the explanation | 09:50 |
noonedeadpunk | as I was mixing up the discovery flow | 09:50 |
bauzas | noonedeadpunk: maybe this would help you :p https://docs.openstack.org/nova/latest/admin/cells.html#faqs | 09:50 |
bauzas | but yeah, discover_hosts is idempotent so you can call it anytime you want | 09:51 |
bauzas | this is explained at the end of https://docs.openstack.org/nova/latest/admin/cells.html#configuring-a-new-deployment | 09:52 |
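For reference, a minimal sketch of the two ways to get new computes mapped, based on the option and commands discussed above (the interval value shown is only an illustration):

    # either: let the scheduler map new computes periodically
    # (nova.conf on the scheduler nodes)
    #   [scheduler]
    #   discover_hosts_in_cells_interval = 300
    # or: leave it at -1 and map hosts by hand once the compute
    # shows up in the service list
    openstack compute service list --service nova-compute
    nova-manage cell_v2 discover_hosts --verbose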
noonedeadpunk | well, I was unsure about the logic, as I couldn't fully understand why we wait for the compute to be present in the service list when nova_discover_hosts_in_cells_interval=-1 but only run cell_v2 discover_hosts after that | 09:52 |
bauzas | noonedeadpunk: we have two different databases now as a reminder | 09:53 |
bauzas | so we need to make sure we synchronize them | 09:53 |
noonedeadpunk | for some reason I thought that discover_hosts would also add to the service list, but now I know that's not how it works :) | 09:53 |
sean-k-mooney | bauzas: technically we have 3. we always have api + cell0 and cell1 in any real deployment :) | 09:54 |
bauzas | sean-k-mooney: we have two different table schemas, but yeah :) | 09:57 |
bauzas | cell0 is logically identical to any cell | 09:57 |
bauzas | only the usage of this cell is different | 09:57 |
bauzas | but we're bikeshedding :) | 09:58 |
noonedeadpunk | Another thing I am unsure about - db online_data_migrations and nova-status upgrade check and when they can/should be run. | 09:59 |
noonedeadpunk | As far as I understand, db online_data_migrations should be done only during upgrades | 10:00 |
noonedeadpunk | But is it recommended to run `nova-status upgrade check` before that? | 10:00 |
noonedeadpunk | It feels like the upgrade check should be run once all computes are upgraded, while online_data_migrations only requires the conductor to be done? | 10:01 |
bauzas | noonedeadpunk: this is correct | 10:05 |
bauzas | you should run nova-status upgrade check *before* the db upgrade as a pre-flight check | 10:05 |
bauzas | nova-status is a controller script, checking a few APIs and DBs | 10:06 |
bauzas | to see whether you're safe to upgrade | 10:06 |
bauzas | like, say we introduced a thing but it wasn't the default | 10:07 |
bauzas | before we flip the default and remove the old bits, we ask you to check that you opted in beforehand | 10:07 |
noonedeadpunk | aha, yes, exactly what I was unsure about. As currently we run upgrade check basically after cell_v2 discover_hosts at the very end | 10:07 |
noonedeadpunk | which didn't make much sense to me | 10:08 |
noonedeadpunk | is there any way to tell if online_data_migrations are needed at all? Like by some exit code or smth? | 10:08 |
bauzas | noonedeadpunk: see the upgrade docs, this explains it way better than me now https://docs.openstack.org/nova/latest/admin/upgrades.html#rolling-upgrade-process | 10:09 |
bauzas | noonedeadpunk: wait, I said something wrong | 10:10 |
bauzas | noonedeadpunk: https://docs.openstack.org/nova/latest/cli/nova-status.html#upgrade | 10:10 |
bauzas | you need to do DB schema expands/contracts before with the migrations | 10:11 |
bauzas | then, you run the status check | 10:11 |
bauzas | and only then you need to start the compute services | 10:11 |
* bauzas was used to those limbo dances before, but we haven't really touched nova-status in a while | 10:12 | |
bauzas | neither the online data migrations | 10:12 |
bauzas | noonedeadpunk: as a summary, say that during a Bobcat release we may add some data online migrations | 10:15 |
noonedeadpunk | ok, so status check before service restart but after online migrations? | 10:15 |
bauzas | we would provide the nova-manage command to perform such data migrations | 10:15 |
bauzas | accordingly, we would propose a nova-status check addition that would check whether the data is fully migrated | 10:16 |
bauzas | so when upgrading to C, you would run this nova-status check that would fail if you forgot to run the online data migrations command | 10:16 |
bauzas | same goes with Placement | 10:16 |
bauzas | say that by Bobcat, we start using a new Placement API call | 10:17 |
bauzas | we would need to continue supporting the old Placement calls | 10:17 |
bauzas | but if we start relying on that newer version of Placement only, then we would need to add a nova-status check that would fail if you forgot to update Placement | 10:18 |
bauzas | I just hope I haven't lost you :) | 10:19 |
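A quick sketch of what that pre-flight check looks like in practice; the return codes in the comment are the ones the nova-status docs linked above describe, as I read them:

    # run against the new code, before restarting the compute services
    nova-status upgrade check
    # 0 = all checks passed, 1 = warnings only, 2 = at least one failure
    echo $?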
wangrong | dansmith: sean-k-mooney: Thank you for your feedback on the generic-vdpa spec. I have addressed some of the issues you raised in my reply. Could you please take a look at the submission again? Thank you for your assistance. https://review.opendev.org/c/openstack/nova-specs/+/879338 | 10:25 |
sean-k-mooney | i saw it; i skimmed it briefly this morning | 10:25 |
gibi | sean-k-mooney: https://bugs.launchpad.net/nova/+bug/2008716 this seems to be a real bug in our migration code but you had comments before regarding different dbs being targeted. I don't see how that is possible given the error message states that the column is missing, not the table | 10:25 |
noonedeadpunk | bauzas: sorry, was side-pinged :( | 10:26 |
sean-k-mooney | wangrong: in general i think we will still need to change the design significantly from what you originally proposed | 10:26 |
noonedeadpunk | that makes total sense. | 10:27 |
noonedeadpunk | So yes, once I've upgraded the packages to Bobcat, I should do the migrations, and then run the upgrade check to ensure it's safe to restart | 10:27 |
noonedeadpunk | ok, now will try to make some patches out of this discussion | 10:35 |
wangrong | sean-k-mooney: ok, could you offer more detailed info about the problems with the current design in the review comments | 10:37 |
sean-k-mooney | yes, i need to also fully re-read your comments | 10:41 |
sean-k-mooney | wangrong: basically i think we need two features before we can implement the feature you are proposing | 10:42 |
sean-k-mooney | allow setting hw_vif_model per port and allow setting hw_disk_bus per cinder volume | 10:43 |
sean-k-mooney | if we have both features we can add hw_vif_model=vdpa and hw_disk_bus=vdpa | 10:43 |
sean-k-mooney | to enable generic vdpa support | 10:43 |
wangrong | sean-k-mooney: yes, I think so, if we plan to use hw_xxx. | 10:46 |
opendevreview | Oleksandr Klymenko proposed openstack/placement master: CADF audit support for Placement API https://review.opendev.org/c/openstack/placement/+/880145 | 11:07 |
noonedeadpunk | No, wait. I'm confused again. What's written in the docs is that I should run online_data_migrations _after_ the service restarts, and the upgrade check _before_ the restart. So the upgrade check should be run before the migrations | 11:38 |
bauzas | noonedeadpunk: because the upgrade check verifies something other than the online data migrations for the same cycle | 11:46 |
noonedeadpunk | yeah, ok, we just got things mixed up a bit | 11:51 |
noonedeadpunk | we were running online_data_migrations early, before first conductor is restarted, but upgrade_check when all services are | 11:51 |
noonedeadpunk | So i got really confused | 11:51 |
sean-k-mooney | so you should be able to do db sync before any restarts | 11:52 |
sean-k-mooney | i'm not sure about online migrations | 11:52 |
noonedeadpunk | yeah, db_syncs are done before, but were followed with online migrations... | 11:52 |
sean-k-mooney | part of me expects those to happen after the db models are updated, i.e. after all the controller services are restarted | 11:53 |
noonedeadpunk | What I'm not sure about is whether upgrade_check is going to pass if only conductor/api are available but no computes. I assume it should not care as long as the cell mappings are in place. | 11:53 |
sean-k-mooney | so i'm not sure that will break things, but it leaves a window where more data could be created against the old db model that is in use | 11:53 |
noonedeadpunk | Well. That is true. | 11:54 |
sean-k-mooney | so i think online migrations should be after the restarts | 11:54 |
noonedeadpunk | But docs you've shown say it should be done after everything | 11:54 |
sean-k-mooney | ya | 11:54 |
sean-k-mooney | i would generally put them right at the end | 11:54 |
noonedeadpunk | As data will be partially migrated on request even without running the migrations | 11:54 |
sean-k-mooney | when all services are fully upgraded | 11:54 |
noonedeadpunk | yup, great, thanks for help as usual! | 11:55 |
sean-k-mooney | so the thing is, as long as you are not trying to skip several releases | 11:55 |
sean-k-mooney | we will have the fallback code | 11:55 |
sean-k-mooney | so the online migrations could be done even in a separate maintenance window | 11:55 |
noonedeadpunk | well, it seemed to work nicely the old way for years as well, so I was quite hesitant to touch anything there :D | 11:55 |
sean-k-mooney | but it's better to do them as part of the normal upgrade after all the services are updated | 11:56 |
* noonedeadpunk always skipping releases | 11:56 | |
sean-k-mooney | where you would have a problem is if there is a db contraction | 11:56 |
* noonedeadpunk but has to deal with consequences | 11:56 | |
sean-k-mooney | although i don't think that would be part of the online migrations | 11:57 |
sean-k-mooney | i.e. if we removed a column | 11:57 |
noonedeadpunk | Yeah, that's what I was thinking about | 11:57 |
noonedeadpunk | that if the db schema changes it should not be done post, but I assume that's done during db_sync | 11:57 |
noonedeadpunk | which is done really at early stage | 11:58 |
sean-k-mooney | ya, that's a good point | 11:58 |
sean-k-mooney | schema changes are separate | 11:58 |
sean-k-mooney | the online migrations are just data migrations | 11:58 |
noonedeadpunk | ok, then will see how CI will fail dramatically with my changes :D | 11:59 |
bauzas | noonedeadpunk: as written in the docs I passed, db sync only changes the schemas | 12:04 |
bauzas | noonedeadpunk: then either we change the data internally (either by a lazy call, or when restarting the service), or we ask operators to run online_data_migrations | 12:05 |
bauzas | but online_data_migrations can run during the whole cycle, until you want to upgrade to a new release | 12:06 |
bauzas | then, once you want to upgrade, you can run nova-status upgrade check beforehand to verify that you're done with the online data migrations | 12:07 |
noonedeadpunk | with N-1 online data migrations, right? | 12:08 |
noonedeadpunk | so running upgrade check on N will verify that N-1 migrations are done | 12:10 |
noonedeadpunk | or well, N-2 with SLURP, but that's a different topic :D | 12:10 |
bauzas | yup | 12:13 |
noonedeadpunk | ok, awesome, thanks for your time and patience :) | 12:14 |
* bauzas needs to go off for 30 mins | 12:14 | |
dansmith | noonedeadpunk: yeah, can't run online migrations until you're past the point where conductors, api, scheduler are upgraded (or stopped before upgrade) | 13:35 |
dansmith | noonedeadpunk: the idea is that online migrations should be run *after* everything is done and back up | 13:36 |
dansmith | services will migrate what they need on-demand, and the online migrations are just there to push things that don't get migrated on demand | 13:36 |
noonedeadpunk | yup, thanks for confirming that! | 13:38 |
noonedeadpunk | will go bug cinder folks with the same question then :-) | 13:41 |
dansmith | ack, I didn't mean to repeat, just wasn't sure there was a clear summary of that convo, but sounds like you got it :) | 13:45 |
noonedeadpunk | yeah, I've pushed change as a result - I should have posted it https://review.opendev.org/c/openstack/openstack-ansible-os_nova/+/880147 | 13:46 |
dansmith | noonedeadpunk: I think running the status check after online migrations also makes sense as there are cases where it will tell you that there are still pending migrations to do if they're not complete | 13:57 |
dansmith | but regardless, the meat of that change sounds right yeah | 13:57 |
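Pulling the thread together, a hedged sketch of the ordering settled on in the discussion above; the commands are the standard nova-manage/nova-status ones, and the exit-code note for online_data_migrations reflects my reading of the nova-manage docs:

    # 1. upgrade the controller code, then sync schemas (safe before restarts)
    nova-manage api_db sync
    nova-manage db sync
    # 2. pre-flight check on the new code
    nova-status upgrade check
    # 3. restart conductors/api/scheduler, then upgrade and restart computes
    # 4. once everything is back up, push the remaining data migrations
    nova-manage db online_data_migrations --max-count 50
    # re-run until it exits 0 (1 means more batches remain; errors are nonzero too)
    # 5. optionally re-run nova-status upgrade check to confirm nothing is pending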
bauzas | dansmith: thanks for having explained it better than me :) | 14:52 |
bauzas | your summary is far easier :) | 14:52 |
dansmith | bauzas: when you're done with the current call you're on I want to chat about some resource tracker stuff | 15:12 |
noonedeadpunk | btw, not that long ago (at the PTG?) we discussed the enable_new_services option. And it should be applied not to computes at the end of the day, but somewhere on conductor/scheduler/api - not sure where exactly as I have them combined in the same place (with the same nova.conf) | 15:54 |
noonedeadpunk | so while it's defined in nova.conf.compute - it's not a compute option | 15:55 |
noonedeadpunk | (https://opendev.org/openstack/nova/src/branch/master/nova/conf/compute.py#L1456-L1474) | 15:55 |
dansmith | conductor | 15:57 |
dansmith | but we could also fix that | 15:57 |
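A small sketch of how that typically gets wired up, assuming (per dansmith's answer above) the option is read where the service record gets created, i.e. on the conductor side:

    # nova.conf on the conductor/api hosts, not on the computes
    #   [DEFAULT]
    #   enable_new_services = false
    # new nova-compute services then register disabled and have to be
    # enabled explicitly once you're happy with them:
    openstack compute service set --enable <host> nova-compute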
bauzas | dansmith: I'm here | 16:00 |
dansmith | bauzas: so... the RT has this notion of "disabled compute nodes" | 16:00 |
dansmith | which seems to cover things actually disabled (via ironic) as well as "no such compute node on this host" | 16:00 |
* bauzas reloads the RT context in his mind | 16:00 | |
dansmith | we do some weird things in move claims if we think the compute node is missing, | 16:01 |
dansmith | like accept the migration but do no claim | 16:01 |
bauzas | iirc, this was done for the latter, not the former case | 16:01 |
dansmith | the latter meaning a missing compute node? | 16:01 |
dansmith | I can't think of any reason why that would make sense, as we'll just mess up our accounting | 16:01 |
dansmith | https://review.opendev.org/c/openstack/nova/+/879682/2/nova/compute/resource_tracker.py#296 | 16:02 |
dansmith | so I'm looking at that ^ specifically | 16:02 |
dansmith | where we just do a nopclaim and return from move_claim | 16:02 |
dansmith | which means, AFAICT, we would have created the migration object on the target machine, claimed no resources, and agreed to accept the new incoming instance | 16:03 |
dansmith | even though it's for a node that we don't have or know anything about? | 16:03 |
bauzas | I'm just trying to understand the intent | 16:04 |
dansmith | also note this from mr. pipes: https://github.com/openstack/nova/blob/master/nova/compute/resource_tracker.py#L154 | 16:04 |
dansmith | in the instance claim, where we'll also do weird stuff for new instances with no such node | 16:04 |
dansmith | the earlier part of the comment says it shouldn't happen | 16:05 |
dansmith | so, like, I'm confused and feel like this is probably some very old cruft | 16:05 |
dansmith | like, surely we can't have these things happen in a placement world and not get everything messed up, right? | 16:05 |
bauzas | yeah, found the commit https://github.com/openstack/nova/commit/1c967593fbb0ab8b9dc8b0b509e388591d32f537 | 16:06 |
bauzas | ok, so now I think I remember where it comes | 16:07 |
bauzas | the virt drivers are responsible for creating the RTs | 16:07 |
bauzas | at compute startup, we were originally setting one RT per compute node resources dict that was returned by the virt driver | 16:08 |
bauzas | and then mr jay factorized that | 16:08 |
bauzas | by having one single RT instance | 16:08 |
dansmith | ...right | 16:08 |
bauzas | so the disabled property was originally coming from the fact there could be a race where the RT was instantiated but the virt resources not fully set up | 16:09 |
dansmith | that may be the case for startup, but we should never accept an instance boot or migration if we can't account for the resources right? | 16:10 |
bauzas | https://github.com/openstack/nova/commit/1c967593fbb0ab8b9dc8b0b509e388591d32f537#diff-ed9525d7ae319fd575249ed72daf634d27182e08fea7a4f740cb3164233612b7L435 | 16:10 |
bauzas | well, that whole thing predates placement | 16:12 |
bauzas | https://github.com/openstack/nova/commit/1c967593fbb0ab8b9dc8b0b509e388591d32f537#diff-ed9525d7ae319fd575249ed72daf634d27182e08fea7a4f740cb3164233612b7L412 | 16:12 |
dansmith | right | 16:12 |
dansmith | (those links do not link to a hunk for me, btw) | 16:12 |
bauzas | ah shit | 16:12 |
bauzas | so, the fact is, before placement, | 16:12 |
bauzas | we were relying on the ComputeNode records for scheduling | 16:12 |
bauzas | if there were no ComputeNode records, then the HostStates from the scheduler weren't generated | 16:13 |
bauzas | and no instances were able to schedule into them | 16:13 |
dansmith | your point being that we could (should) never get into that case where we do the NopClaims because of that? | 16:13 |
dansmith | I think that's probably also not true because of forced migrations | 16:14 |
bauzas | the NopClaims were there for saying 'meh' | 16:14 |
bauzas | if no compute record existed then basically we were tracking nothing | 16:14 |
dansmith | right but they also cause us to skip all the other stuff like PCI and migration objects, etc | 16:14 |
dansmith | but we allow the instance to be sent there | 16:15 |
bauzas | if the scheduler doesn't have a compute node record, then it shouldn't be able to send an instance to it | 16:15 |
bauzas | and I think this assumption is still valid | 16:15 |
dansmith | except for the times where we didn't go through the scheduler, which I think was the case back when this was done | 16:16 |
dansmith | but if you're right, why would we not raise an exception here instead of silently "meh" ? | 16:16 |
bauzas | good question | 16:16 |
dansmith | and also, | 16:16 |
dansmith | transient changes could result in the scheduler thinking a compute node was valid for a given host, send a node there, | 16:17 |
bauzas | the NopClaims even predates Mr Jay's work on RT | 16:17 |
bauzas | and even my presence on OpenStack since this decade :) | 16:17 |
dansmith | and the RT would literally create a migration record with a destination node name that does not match anything it knows about | 16:17 |
dansmith | the nopclaim stuff itself predates tons of stuff, yeah, but the usage of it in this case is much newer | 16:18 |
dansmith | here's my problem: | 16:18 |
bauzas | oh | 16:19 |
dansmith | in order to do the strict tying of instance->compute->service, we need migrations to have a strict tie from migration->compute | 16:19 |
dansmith | which means we need the compute id in the migration | 16:19 |
bauzas | I missed the fact we record a migration before we return the NopClaim | 16:19 |
dansmith | right | 16:19 |
bauzas | technically this isn't a noop | 16:19 |
dansmith | crazy right? | 16:19 |
bauzas | dansmith: I wonder something, lemme check | 16:19 |
dansmith | so my code moves that to below the disabled check and refuses to create a migration record for a destination compute node we don't know anything about (and thus don't have a node id) | 16:19 |
bauzas | my guess is that someone added the migration checks *after* the nopclaim implementation | 16:20 |
bauzas | without thinking about the implication | 16:20 |
dansmith | here, the nodename loose affiliation is us just lying about the destination, which is easy if the nodename is just a name, but the id requires stricter adherence | 16:20 |
bauzas | (gosh hope it isn't me :) ) | 16:20 |
dansmith | what "migration checks" ? | 16:22 |
dansmith | those lines I'm moving are the creation of the migration object | 16:22 |
dansmith | (or the assignment of the migration to a node) | 16:22 |
dansmith | like, creating the migration and then returning out of that early also means instance.migration_context isn't created | 16:23 |
bauzas | yeah, I guess we need to be smarter here | 16:25 |
bauzas | we probably piled a couple of modifications without really considering the impact | 16:25 |
dansmith | okay, that's what I'm hoping .. that we really never get here and that we're not either (a) depending on this behavior or (b) silently sometimes hitting this code path and losing resources or something | 16:26 |
bauzas | the fact is, say the virt driver suddenly returns nothing | 16:26 |
bauzas | we need to handle this case correctly | 16:26 |
dansmith | so my change there means we no longer update/create the migration if we're going to bail out early | 16:26 |
dansmith | but we might want to follow up with something to actually abort | 16:26 |
bauzas | yes | 16:26 |
dansmith | so I thought this might be for ironic for some reason | 16:27 |
bauzas | the fact is, we were allowing a migration to succeed even if we weren't tracking its resources | 16:27 |
dansmith | because they override their disabled check to mean more than just a missing compute | 16:27 |
dansmith | if not, I can update my comment there | 16:27 |
bauzas | do you feel brave enough to untangle this spaghetti code ? | 16:27 |
dansmith | I've been holding off writing tests for this change in case I was missing something but it sounds like I'm good to proceed | 16:28 |
bauzas | yeah and honestly I'm a bit afraid of the potential impacts we may create | 16:28 |
dansmith | all I have the appetite for at the moment is doing this change to it, because I've still got work to do on the rest of the series | 16:28 |
dansmith | potential impact of what? not creating the migration object or making this an error instead of a silent Nop ? | 16:28 |
bauzas | the potential impact of having a migration succeed as a noop without having a migration record set | 16:29 |
dansmith | okay I'm confused I thought you were sure we couldn't be actually running this | 16:30 |
dansmith | however, if we are, surely it's better to have it fail, find out how we're getting there, and fix it right? | 16:30 |
dansmith | because like I said, | 16:30 |
dansmith | if we hit this code, we're not even creating instance.migration_context | 16:30 |
dansmith | or the pci claims | 16:31 |
bauzas | there are two preconditions in order to have this disabled property returning True | 16:31 |
dansmith | and we won't have the numa stuff that goes along with it | 16:31 |
bauzas | the first precondition, which is having the variable set internally, seems trivial enough to me, as it's a race-condition hack at startup | 16:31 |
bauzas | I'm a bit more concerned by the second precondition, which is whether the virt driver returns yay or nay | 16:32 |
dansmith | yeah, and "disabled" is a very strange word to use for the first case | 16:32 |
dansmith | bauzas: for ironic it seems to be "if this node exists at all" which hopefully should never happen for transient reasons | 16:32 |
dansmith | for libvirt it would potentially be a rename issue I think, | 16:33 |
dansmith | which is a case where we might have been renamed and then accept a migration for a node we're not actually supposed to be hosting | 16:33 |
dansmith | which is another good reason to make this strict | 16:33 |
dansmith | like, if I send a message to a compute node and say "yo dawg, I'm looking to migrate an instance into your does-not-exist node", the compute node will happily create a migration object for that node, and punt on the rest of everything and silently say "roger that" | 16:38 |
dansmith | if I set CONF.host to something that matches another compute by accident, I'll do that for nodes they legit own | 16:38 |
bauzas | for libvirt, yup | 16:40 |
dansmith | right, I mean for libvirt | 16:40 |
bauzas | agreed, this is a weird exception handling | 16:40 |
bauzas | like, you are sent to a compute service where the virt driver attached to it reports that it doesn't know the compute node target you may want to use | 16:41 |
bauzas | this ^ is for all drivers | 16:41 |
dansmith | right | 16:42 |
dansmith | and it's crazy to just silently accept | 16:42 |
dansmith | since we don't even create the migration context in the current form, I'm sure no migrations could be working in this case | 16:43 |
bauzas | I also don't see any good reason for doing this, even with the FakeDriver | 16:44 |
dansmith | ack | 16:44 |
dansmith | I'll update the comments there to reflect this and proceed unless you have any other checking you want to do | 16:44 |
bauzas | dansmith: given all we discussed, my only open concern is, shouldn't we rather hardstop the migration call here? | 16:46 |
bauzas | instead of returning a noop ? | 16:46 |
dansmith | bauzas: we should, but I just don't want to tangle that up with my effort to get these objects linked in the database | 16:46 |
dansmith | because like I said, migration_context below is already skipped, and no migration can work in that case either, AFAIK | 16:47 |
dansmith | (instance.migration_context, even though the migration object is created at the top) | 16:47 |
bauzas | dansmith: that's the only concern I have, as said | 16:47 |
bauzas | if we don't return an exception, then the migration will be accepted | 16:48 |
dansmith | if we weren't creating the migration object at the top before we check for the node, I wouldn't even have had to ask about this, and it's not really related to the rest of the work | 16:48 |
dansmith | bauzas: right but it already is | 16:48 |
bauzas | eventually the instance status could be something like 'resize_confirm' | 16:48 |
bauzas | without having any migration records attached to it | 16:48 |
dansmith | bauzas: my point is that the create_migration() will proceed even with a missing node, so it's not like that is stopping us from returning the nopclaim | 16:49 |
bauzas | and revert resize wouldn't work, since no migration record exists | 16:49 |
dansmith | right but we can't even get that far without instance.migration_context right? | 16:49 |
bauzas | I have to admit this is out of my knowledge | 16:50 |
dansmith | actually, I think the conductor will fail earlier than that even, | 16:51 |
dansmith | okay you know what.. maybe we're missing more history here | 16:53 |
dansmith | I think maybe this _create_migration() is from before we had conductor-directed migrations? | 16:53 |
bauzas | honestly, tomorrow I'll set up a 2-node devstack and try doing crazypants migrations | 16:53 |
dansmith | so maybe we always pre-create the migration in the conductor (at least for resize) now? | 16:53 |
bauzas | oh, this has surely existed for 9 years | 16:53 |
dansmith | so maybe we should actually fail here if migration is None and not fall back to creating the migration | 16:54 |
dansmith | (need to check the other migration types though) | 16:54 |
bauzas | from what I recall, this was refactored by ndipanov when he wrote the NUMA migrations | 16:54 |
bauzas | which were something like Liberty IIRC | 16:54 |
bauzas | or even Kilo | 16:55 |
bauzas | way before we had the conductor engines managing the migration records (in Newton or Ocata IIRC) | 16:55 |
bauzas | so yeah, you may be right | 16:55 |
bauzas | maybe all those codepaths aren't walked anyway, if we fail earlier | 16:56 |
bauzas | or maybe the conductor fails after | 16:56 |
dansmith | I'm now realizing there is more in the conductor that needs to be node-id-focused | 16:56 |
dansmith | yeah my concern is live migration | 16:56 |
dansmith | and evac | 16:56 |
dansmith | cross-cell definitely happens in the superconductor because it mirrors the source cell migration | 16:57 |
bauzas | I think we still rely on conductor-based operations for allocations | 16:57 |
bauzas | for those two move ops | 16:57 |
dansmith | either way, the node id part is handled on the destination node, so it actually doesn't change anything | 16:57 |
dansmith | we'll call to the compute and it will stamp the nodename we gave it on the migration even if it doesn't know that node | 16:58 |
dansmith | the update half of that | 16:58 |
dansmith | okay you're probably EOD at this point right? | 16:59 |
dansmith | I have to run to something else for a bit, but maybe think on it a bit | 16:59 |
dansmith | I guess I could also put a DNM patch to raise there and make sure at least we don't see it in any of our multinode CI tests | 17:00 |
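For manually poking at this on the 2-node devstack mentioned above, a minimal cold-migration exercise might look like the following (assuming a reasonably recent python-openstackclient; "test-vm" is a placeholder server name):

    # cold-migrate and let the scheduler pick the other node
    openstack server migrate test-vm
    # the instance should land in VERIFY_RESIZE, then confirm (or revert)
    openstack server resize confirm test-vm
    # inspect the migration records that were created (admin only)
    openstack server migration list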
bauzas | dansmith: tbh, I was wanting to cycle again on your spec | 17:01 |
bauzas | I started it before the meeting | 17:01 |
dansmith | yes, that would also be appreciated | 17:01 |
bauzas | and then you puzzled me :) | 17:01 |
dansmith | I updated it for some of this migration stuff last week | 17:01 |
bauzas | so before I call it a day, I'll give it a round | 17:01 |
dansmith | ack | 17:02 |
bauzas | and done | 17:06 |
bauzas | this was quick, I was at the migration object change | 17:06 |
bauzas | Uggla: you're next in the review list :) | 17:09 |
bauzas | but this will be done tomorrow | 17:09 |
bauzas | see ya folks | 17:09 |
opendevreview | Dan Smith proposed openstack/nova master: Populate ComputeNode.service_id https://review.opendev.org/c/openstack/nova/+/879904 | 17:26 |
opendevreview | Dan Smith proposed openstack/nova master: Add compute_id columns to instances, migrations https://review.opendev.org/c/openstack/nova/+/879499 | 17:26 |
opendevreview | Dan Smith proposed openstack/nova master: Add dest_compute_id to Migration object https://review.opendev.org/c/openstack/nova/+/879682 | 17:26 |
opendevreview | Dan Smith proposed openstack/nova master: Add compute_id to Instance object https://review.opendev.org/c/openstack/nova/+/879500 | 17:26 |
opendevreview | Dan Smith proposed openstack/nova master: Online migrate missing Instance.compute_id fields https://review.opendev.org/c/openstack/nova/+/879905 | 17:26 |
opendevreview | Dan Smith proposed openstack/nova master: DNM: Look for disabled node claims https://review.opendev.org/c/openstack/nova/+/879687 | 17:26 |
opendevreview | sean mooney proposed openstack/nova master: [WIP] add hypervisor version weigher https://review.opendev.org/c/openstack/nova/+/880231 | 19:25 |
opendevreview | Karl Kloppenborg proposed openstack/placement stable/yoga: chore: back-merge os-traits version update to placement in stable/yoga to support owner traits https://review.opendev.org/c/openstack/placement/+/880249 | 23:14 |
opendevreview | Karl Kloppenborg proposed openstack/placement stable/zed: chore: update stable/zed OS-TRAITS version to 2.10.0 https://review.opendev.org/c/openstack/placement/+/880250 | 23:21 |