*** m75abrams has joined #openstack-glance | 04:11 | |
*** evrardjp has quit IRC | 04:33 | |
*** evrardjp has joined #openstack-glance | 04:33 | |
*** ratailor has joined #openstack-glance | 04:56 | |
*** udesale has joined #openstack-glance | 05:43 | |
*** udesale has quit IRC | 06:10 | |
*** nikparasyr has joined #openstack-glance | 06:22 | |
*** happyhemant has joined #openstack-glance | 06:40 | |
*** belmoreira has joined #openstack-glance | 06:45 | |
*** k_mouza has joined #openstack-glance | 08:03 | |
*** k_mouza has quit IRC | 08:41 | |
*** k_mouza has joined #openstack-glance | 08:45 | |
*** happyhemant has quit IRC | 08:50 | |
*** tkajinam has quit IRC | 09:55 | |
*** Luzi has joined #openstack-glance | 09:59 | |
*** k_mouza has quit IRC | 10:10 | |
*** k_mouza has joined #openstack-glance | 10:12 | |
*** priteau has joined #openstack-glance | 10:40 | |
*** k_mouza has quit IRC | 10:52 | |
*** k_mouza has joined #openstack-glance | 10:55 | |
*** udesale has joined #openstack-glance | 11:20 | |
*** happyhemant has joined #openstack-glance | 11:25 | |
*** tkajinam has joined #openstack-glance | 11:27 | |
*** k_mouza has quit IRC | 11:57 | |
*** udesale_ has joined #openstack-glance | 12:14 | |
*** udesale has quit IRC | 12:17 | |
*** ratailor has quit IRC | 12:21 | |
*** rosmaita has joined #openstack-glance | 12:26 | |
*** udesale_ has quit IRC | 13:09 | |
*** Luzi has quit IRC | 13:26 | |
dansmith | abhishekk: so last week while writing more tests for the locking thing, I discovered that image_update() will destroy the lock we're setting | 13:39 |
dansmith | so if the thread we're stealing the lock from is still alive and wakes up later, it will trample over the new one | 13:40 |
abhishekk | how is that? | 13:41 |
abhishekk | does image_update update the task as well? | 13:41 |
dansmith | two ways, but the most serious is that if I have my image already that has one value for os_glance_import_task, image_update() will change the property in the database to match that, reverting the newer task's lock | 13:42 |
dansmith | and if I don't have it set, it will delete the lock from the db because purge_props=True | 13:42 |
dansmith | which also seems to mean that someone doing an otherwise harmless update on an image will erase the lock for something in progress | 13:42 |
dansmith | so I have some changes locally to make image_update exclude some "atomic properties" from that list | 13:43 |
abhishekk | so at the moment, we only have one atomic property, right? | 13:43 |
dansmith | right | 13:44 |
abhishekk | ok | 13:44 |
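A minimal sketch of the fix dansmith describes, with hypothetical helper names rather than glance's actual DB-layer signatures: the property-update step skips a short list of "atomic" properties when applying and purging, so an unrelated image_update() can neither revert nor delete a concurrently-held lock:

```python
# Illustrative only: _get_property_rows/_add_property_row are made-up
# helpers standing in for glance's sqlalchemy property handling.
ATOMIC_PROPERTIES = ('os_glance_import_task',)

def _update_image_properties(session, image_id, new_props, purge_props=True):
    for row in _get_property_rows(session, image_id):
        if row.name in ATOMIC_PROPERTIES:
            # Never revert or purge an atomic property from a plain
            # image update; it only changes via a dedicated call.
            continue
        if row.name in new_props:
            row.value = new_props.pop(row.name)
        elif purge_props:
            row.deleted = True
    for name, value in new_props.items():
        if name not in ATOMIC_PROPERTIES:
            _add_property_row(session, image_id, name, value)
```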
dansmith | also, | 13:44 |
dansmith | it seems to me that the locations update is also very racy, and I have seen multiple times where competing threads will add or delete something from the locations list and drop something that another thread has added | 13:45 |
dansmith | because it pulls the list of locations, modifies, and then saves them as a whole, which means if anything else added a location in the meantime, the save will erase that one | 13:46 |
abhishekk | got it | 13:46 |
dansmith | and I imagine that is something you can hit via the api even | 13:46 |
dansmith | so writing some more tests has led me down a rabbit hole of these things | 13:46 |
abhishekk | yes | 13:46 |
abhishekk | So could we use this same locking mechanism for location updates as well? | 13:47 |
dansmith | meaning use a locking property to prevent anyone else from changing the locations list? that would be very heavyweight and probably error prone, especially if we were to leave a lock in place if an update failed or something | 13:49 |
dansmith | the locations thing should be fixed by making it only add or update the location rows that need changing I think | 13:50 |
dansmith | less desirable to make that use a lock, | 13:50 |
dansmith | but honestly I only debugged what was happening and kinda gave up for the week, so I haven't looked deep into the location code | 13:50 |
abhishekk | hmm, makes sense | 13:50 |
abhishekk | there will be many corner cases though | 13:51 |
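A rough sketch of the two approaches, with hypothetical db helper names: the racy version writes the whole list back, so a location another thread added between fetch and save gets erased; the delta version only touches the rows that actually changed:

```python
# Illustrative only: image_locations_* / image_location_* are made-up
# helper names, not glance's actual db api.
def save_locations_racy(db, image_id, locations):
    # Clobbers the whole list; anything added concurrently is lost.
    db.image_locations_replace(image_id, locations)

def save_locations_delta(db, image_id, old, new):
    # Persist only the additions and deletions we actually made.
    old_by_url = {loc['url']: loc for loc in old}
    new_by_url = {loc['url']: loc for loc in new}
    for url in new_by_url.keys() - old_by_url.keys():
        db.image_location_add(image_id, new_by_url[url])
    for url in old_by_url.keys() - new_by_url.keys():
        db.image_location_delete(image_id, old_by_url[url])
```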
dansmith | I assume "losing" an image location is a really bad situation right? likely causes us to leak storage, may leave data around after delete, and/or cause data loss if someone deletes another location and doesn't realize the last one has been lost? | 13:53 |
abhishekk | Yeah, if so it's really bad | 13:55 |
*** k_mouza has joined #openstack-glance | 13:57 | |
abhishekk | have you reported this issue in launchpad yet? | 13:59 |
abhishekk | AFAIK glance doesn't encourage cloud users to use the location API (apart from services like nova or cinder) | 14:01 |
*** k_mouza has quit IRC | 14:02 | |
dansmith | nope, I just made some local notes and stopped for the week | 14:03 |
abhishekk | ack | 14:03 |
*** jv_ has joined #openstack-glance | 14:03 | |
dansmith | maybe it's not a race that is likely to happen in the real world, because of workflow, but it can definitely happen if you have one import that is proceeding slowly, and the lock gets busted such that another import starts doing work | 14:04 |
dansmith | which is kinda why I was hesitant to have this automatic lock breaking anyway, because writing the tests for that case, where the original import is still alive, has uncovered several things like this | 14:05 |
dansmith | so maybe not a practical problem in the past, but it's likely more possible to hit some of this in the future with things like copy-image | 14:05 |
*** tkajinam has quit IRC | 14:06 | |
abhishekk | yes, it's possible with copy-image as well as multiple imports with allow_failure=True | 14:08 |
dansmith | yeah | 14:08 |
*** k_mouza has joined #openstack-glance | 14:27 | |
openstackgerrit | Dan Smith proposed openstack/glance master: Handle atomic image properties separately https://review.opendev.org/746518 | 14:35 |
*** m75abrams has quit IRC | 14:50 | |
abhishekk | dansmith, we have added os_glance_import_task to _reserved_properties, which will reject it if someone tries to update it explicitly, right? | 14:51 |
dansmith | yeah but that doesn't fix the image save problem | 14:51 |
openstackgerrit | Dan Smith proposed openstack/glance master: Functional test enhancement for lock busting https://review.opendev.org/746531 | 14:54 |
abhishekk | ok | 14:56 |
dansmith | abhishekk: it's because the image and props are fetched as a whole from the db, mutated, and then saved as a whole | 14:58 |
*** gyee has joined #openstack-glance | 14:58 | |
dansmith | if someone else saved a lock key in between your get and save, you will save back a prop dict without the lock in there, and purge_props=True will make sure to remove the lock from the database | 14:58 |
dansmith | even if you weren't touching the lock property | 14:59 |
abhishekk | yes, got it | 14:59 |
dansmith | cool | 14:59 |
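The interleaving under discussion, written out as pseudo-steps (illustrative only, not real glance calls):

```python
# T1 is an ordinary image update; T2 is a new import taking the lock.
#
#   T1: image = db.image_get(image_id)        # props: {} (no lock yet)
#   T2: db.set_property(image_id, 'os_glance_import_task', task2)
#   T1: image['properties']['foo'] = 'bar'    # unrelated edit
#   T1: db.image_update(image_id, image, purge_props=True)
#       -> os_glance_import_task is absent from T1's prop dict, so the
#          purge deletes T2's lock even though T1 never touched it.
```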
dansmith | abhishekk: so the image location race, I put a mitigation for this into the second patch above, but I think that's another good reason to make a failed import revert, even if we're close to the end, because we've given up our lock and so trying to work on the image when another thread is working will be problematic | 15:03 |
abhishekk | dansmith, looking | 15:09 |
abhishekk | dansmith, so you mean to say that if assert_lock fails in the finish task, then we should revert, right? | 15:16 |
dansmith | not just that, | 15:16 |
dansmith | but if you remove that assert, the task may have taken a very long time but still completed set_data(); now it no longer has the lock, yet it will try to continue doing things without it, like set_image_location, or other stuff that the task _with_ the lock may also be doing | 15:17 |
dansmith | so I'm saying if our lock is busted, we should clean up, revert, etc regardless of if we think we can still proceed | 15:18 |
*** belmoreira has quit IRC | 15:20 | |
dansmith | in addition to the racy location updating, there would also be racy os_glance_importing_to_stores updating (although the context will already prohibit that) | 15:21 |
abhishekk | yep, makes sense | 15:25 |
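A hedged sketch of that rule, with all names hypothetical (holds_task_lock, revert_import, and staged_location are stand-ins, not glance's actual task API): every step after set_data() re-checks lock ownership and reverts rather than proceeding:

```python
class LockBustedError(Exception):
    """Raised when os_glance_import_task no longer names our task."""

def finish_import(action, task_id):
    # Post-set_data() path: never keep going without the lock, even if
    # the remaining work (set_image_location, per-store status updates)
    # seems harmless on its own.
    if not action.holds_task_lock(task_id):
        action.revert_import(task_id)   # clean up stores and status
        raise LockBustedError(task_id)
    # We still own the lock, so it is safe to finish up.
    action.set_image_location(action.staged_location)
    action.set_image_status('active')
```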
* abhishekk going for dinner break, will be back in 30 mins | 15:30 | |
*** happyhemant has quit IRC | 15:35 | |
* abhishekk back from dinner | 15:57 | |
*** priteau has quit IRC | 15:57 | |
openstackgerrit | Dan Smith proposed openstack/glance master: Functional test enhancement for lock busting https://review.opendev.org/746531 | 16:01 |
openstackgerrit | Dan Smith proposed openstack/glance master: Cleanup import status information after busting a lock https://review.opendev.org/746554 | 16:01 |
openstackgerrit | Rajat Dhasmana proposed openstack/glance_store master: Support Cinder multiple stores https://review.opendev.org/746556 | 16:03 |
*** belmoreira has joined #openstack-glance | 16:33 | |
*** belmoreira has quit IRC | 16:36 | |
*** k_mouza has quit IRC | 16:37 | |
dansmith | abhishekk: if you can think of other cases or things to check to add to test_images_import_locking let me know | 17:00 |
abhishekk | dansmith, ack | 17:01 |
dansmith | I probably need to add some comments to that last functional enhancement too | 17:01 |
abhishekk | dansmith, yes, that will be more helpful | 17:03 |
openstackgerrit | Dan Smith proposed openstack/glance master: Functional test enhancement for lock busting https://review.opendev.org/746531 | 17:11 |
openstackgerrit | Dan Smith proposed openstack/glance master: Cleanup import status information after busting a lock https://review.opendev.org/746554 | 17:11 |
* abhishekk signing out for the day | 17:24 | |
*** nikparasyr has left #openstack-glance | 17:28 | |
openstackgerrit | Erno Kuvaja proposed openstack/glance_store master: Ramp up rbd resize to avoid excessive calls https://review.opendev.org/746579 | 18:01 |
openstackgerrit | Luigi Toscano proposed openstack/glance master: zuul: use the new barbican simple-crypto job https://review.opendev.org/746582 | 18:08 |
jokke | mnaser: around? | 18:20 |
mnaser | hi | 18:20 |
jokke | mnaser: hey, mind doing a sanity check on https://review.opendev.org/746579 ? I think you might have heavy interest in that ;) | 18:21 |
mnaser | seems aggressive but i don't know just how much of a performance benefit it gives | 18:22 |
jokke | based on the complaints we've received over the years, I'd assume quite a bit | 18:23 |
jokke | but I'd like you to have a look to make sure that we do not break anything. | 18:24 |
*** hoonetorg has joined #openstack-glance | 18:29 | |
jokke | like I think I did ;) | 18:30 |
openstackgerrit | Erno Kuvaja proposed openstack/glance_store master: Ramp up rbd resize to avoid excessive calls https://review.opendev.org/746579 | 18:40 |
*** belmoreira has joined #openstack-glance | 19:13 | |
*** smcginnis has quit IRC | 19:19 | |
*** smcginnis has joined #openstack-glance | 19:21 | |
dansmith | mnaser: what is aggressive about the resize change? the fact that it scales up to 8G above the size of the image before shrinking? | 19:51 |
dansmith | I expect ceph can resize without much overhead, but we'd still be calling resize ~2500 times for the current 8M resizes on a 20G image | 19:52 |
dansmith | if anything, just doubling the chunk size almost seems like it isn't aggressive enough because you'd still call it a bunch of times ramping 8M into the GB range | 19:53 |
dansmith | I guess it's like 14 calls for a 20G image vs. 2500 so that's clearly an improvement :) | 19:54 |
mnaser | dansmith: yeah, the 8gb scale up, but to be honest, i don't know the implications of the resize, but i agree it will happen a lot less often | 20:01 |
dansmith | yeah me either, I just expect the number of calls we have to make to a remote service is the limitation, not necessarily the actual resize itself | 20:02 |
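A back-of-the-envelope check of those numbers, assuming the patch doubles the resize chunk on every call starting from 8M (pure illustration, not glance_store's actual code):

```python
def resize_calls(image_size, chunk=8 * 1024 * 1024):
    # Count resizes needed when each call doubles the chunk size.
    size = calls = 0
    while size < image_size:
        size += chunk
        chunk *= 2
        calls += 1
    return calls

# ~12 calls for a 20G image with doubling, vs 20G / 8M = 2560 calls
# with fixed 8M growth -- the ~2500 dansmith mentions.
print(resize_calls(20 * 1024 ** 3))
```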
*** belmoreira has quit IRC | 20:43 | |
*** rcernin has joined #openstack-glance | 22:33 | |
*** tkajinam has joined #openstack-glance | 23:07 |