*** m75abrams has joined #openstack-glance | 04:11 | |
*** evrardjp has quit IRC | 04:33 | |
*** evrardjp has joined #openstack-glance | 04:33 | |
*** ratailor has joined #openstack-glance | 04:56 | |
*** udesale has joined #openstack-glance | 05:43 | |
*** udesale has quit IRC | 06:10 | |
*** nikparasyr has joined #openstack-glance | 06:22 | |
*** happyhemant has joined #openstack-glance | 06:40 | |
*** belmoreira has joined #openstack-glance | 06:45 | |
*** k_mouza has joined #openstack-glance | 08:03 | |
*** k_mouza has quit IRC | 08:41 | |
*** k_mouza has joined #openstack-glance | 08:45 | |
*** happyhemant has quit IRC | 08:50 | |
*** tkajinam has quit IRC | 09:55 | |
*** Luzi has joined #openstack-glance | 09:59 | |
*** k_mouza has quit IRC | 10:10 | |
*** k_mouza has joined #openstack-glance | 10:12 | |
*** priteau has joined #openstack-glance | 10:40 | |
*** k_mouza has quit IRC | 10:52 | |
*** k_mouza has joined #openstack-glance | 10:55 | |
*** udesale has joined #openstack-glance | 11:20 | |
*** happyhemant has joined #openstack-glance | 11:25 | |
*** tkajinam has joined #openstack-glance | 11:27 | |
*** k_mouza has quit IRC | 11:57 | |
*** udesale_ has joined #openstack-glance | 12:14 | |
*** udesale has quit IRC | 12:17 | |
*** ratailor has quit IRC | 12:21 | |
*** rosmaita has joined #openstack-glance | 12:26 | |
*** udesale_ has quit IRC | 13:09 | |
*** Luzi has quit IRC | 13:26 | |
dansmith | abhishekk: so last week while writing more tests for the locking thing, I discovered that image_update() will destroy the lock we're setting | 13:39 |
dansmith | so if the thread we're stealing the lock from is still alive and wakes up later, it will trample over the new one | 13:40 |
abhishekk | how is that? | 13:41 |
abhishekk | does image_update update the task as well? | 13:41 |
dansmith | two ways, but the most serious is that if I have my image already that has one value for os_glance_import_task, image_update() will change the property in the database to match that, reverting the newer task's lock | 13:42 |
dansmith | and if I don't have it set, it will delete the lock from the db because purge_props=True | 13:42 |
dansmith | which also seems to mean that someone doing an otherwise harmless update on an image will erase the lock for something in progress | 13:42 |
dansmith | so I have some changes locally to make image_update exclude some "atomic properties" from that list | 13:43 |
abhishekk | so at the moment, we only have one atomic property, right? | 13:43 |
dansmith | right | 13:44 |
abhishekk | ok | 13:44 |
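A minimal sketch of the fix dansmith describes, with hypothetical helper names rather than glance's actual DB-layer signatures: the property-update step skips a short list of "atomic" properties when applying and purging, so an unrelated image_update() can neither revert nor delete a concurrently-held lock:

```python
# Illustrative only: _get_property_rows/_add_property_row are made-up
# helpers standing in for glance's sqlalchemy property handling.
ATOMIC_PROPERTIES = ('os_glance_import_task',)

def _update_image_properties(session, image_id, new_props, purge_props=True):
    for row in _get_property_rows(session, image_id):
        if row.name in ATOMIC_PROPERTIES:
            # Never revert or purge an atomic property from a plain
            # image update; it only changes via a dedicated call.
            continue
        if row.name in new_props:
            row.value = new_props.pop(row.name)
        elif purge_props:
            row.deleted = True
    for name, value in new_props.items():
        if name not in ATOMIC_PROPERTIES:
            _add_property_row(session, image_id, name, value)
```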
dansmith | also, | 13:44 |
dansmith | it seems to me that the locations update is also very racy, and I have seen multiple times where competing threads will add or delete something from the locations list and drop something that another thread has added | 13:45 |
dansmith | because it pulls the list of locations, modifies, and then saves them as a whole, which means if anything else added a location in the meantime, the save will erase that one | 13:46 |
abhishekk | got it | 13:46 |
dansmith | and I imagine that is something you can hit via the api even | 13:46 |
dansmith | so writing some more tests has led me down a rabbit hole of these things | 13:46 |
abhishekk | yes | 13:46 |
abhishekk | So could we use this same locking mechanism for location updates as well? | 13:47 |
dansmith | meaning use a locking property to prevent anyone else from changing the locations list? that would be very heavyweight and probably error prone, especially if we were to leave a lock in place if an update failed or something | 13:49 |
dansmith | the locations thing should be fixed by making it only add or update the location rows that need changing I think | 13:50 |
dansmith | less desirable to make that use a lock, | 13:50 |
dansmith | but honestly I only debugged what was happening and kinda gave up for the week, so I haven't looked deep into the location code | 13:50 |
abhishekk | hmm, makes sense | 13:50 |
abhishekk | there will be many corner cases though | 13:51 |
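A rough sketch of the two approaches, with hypothetical db helper names: the racy version writes the whole list back, so a location another thread added between fetch and save gets erased; the delta version only touches the rows that actually changed:

```python
# Illustrative only: image_locations_* / image_location_* are made-up
# helper names, not glance's actual db api.
def save_locations_racy(db, image_id, locations):
    # Clobbers the whole list; anything added concurrently is lost.
    db.image_locations_replace(image_id, locations)

def save_locations_delta(db, image_id, old, new):
    # Persist only the additions and deletions we actually made.
    old_by_url = {loc['url']: loc for loc in old}
    new_by_url = {loc['url']: loc for loc in new}
    for url in new_by_url.keys() - old_by_url.keys():
        db.image_location_add(image_id, new_by_url[url])
    for url in old_by_url.keys() - new_by_url.keys():
        db.image_location_delete(image_id, old_by_url[url])
```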
dansmith | I assume "losing" an image location is a really bad situation right? likely causes us to leak storage, may leave data around after delete, and/or cause data loss if someone deletes another location and doesn't realize the last one has been lost? | 13:53 |
abhishekk | Yeah, if so it's really bad | 13:55 |
*** k_mouza has joined #openstack-glance | 13:57 | |
abhishekk | have you reported this issue in launchpad yet? | 13:59 |
abhishekk | AFAIK glance doesn't encourage cloud users to use the location API (apart from services like nova or cinder) | 14:01 |
*** k_mouza has quit IRC | 14:02 | |
dansmith | nope, I just made some local notes and stopped for the week | 14:03 |
abhishekk | ack | 14:03 |
*** jv_ has joined #openstack-glance | 14:03 | |
dansmith | maybe it's not a race that is likely to happen in the real world, because of workflow, but it can definitely happen if you have one import that is proceeding slowly, and the lock gets busted such that another import starts doing work | 14:04 |
dansmith | which is kinda why I was hesitant to have this automatic lock breaking anyway, because writing the tests for that case, where the original import is still alive, has uncovered several things like this | 14:05 |
dansmith | so maybe not a practical problem in the past, but it's likely more possible to hit some of this in the future with things like copy-image | 14:05 |
*** tkajinam has quit IRC | 14:06 | |
abhishekk | yes, it's possible with copy-image as well as multiple imports with allow_failure=True | 14:08 |
dansmith | yeah | 14:08 |
*** k_mouza has joined #openstack-glance | 14:27 | |
openstackgerrit | Dan Smith proposed openstack/glance master: Handle atomic image properties separately https://review.opendev.org/746518 | 14:35 |
*** m75abrams has quit IRC | 14:50 | |
abhishekk | dansmith, we have added os_glance_import_task to _reserved_properties, which will reject it if someone tries to update it explicitly, right? | 14:51 |
dansmith | yeah but that doesn't fix the image save problem | 14:51 |
openstackgerrit | Dan Smith proposed openstack/glance master: Functional test enhancement for lock busting https://review.opendev.org/746531 | 14:54 |
abhishekk | ok | 14:56 |
dansmith | abhishekk: it's because the image and props are fetched as a whole from the db, mutated, and then saved as a whole | 14:58 |
*** gyee has joined #openstack-glance | 14:58 | |
dansmith | if someone else saved a lock key in between your get and save, you will save back a prop dict without the lock in there, and purge_props=True will make sure to remove the lock from the database | 14:58 |
dansmith | even if you weren't touching the lock property | 14:59 |
abhishekk | yes, got it | 14:59 |
dansmith | cool | 14:59 |
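The interleaving under discussion, written out as pseudo-steps (illustrative only, not real glance calls):

```python
# T1 is an ordinary image update; T2 is a new import taking the lock.
#
#   T1: image = db.image_get(image_id)        # props: {} (no lock yet)
#   T2: db.set_property(image_id, 'os_glance_import_task', task2)
#   T1: image['properties']['foo'] = 'bar'    # unrelated edit
#   T1: db.image_update(image_id, image, purge_props=True)
#       -> os_glance_import_task is absent from T1's prop dict, so the
#          purge deletes T2's lock even though T1 never touched it.
```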
dansmith | abhishekk: so the image location race, I put a mitigation for this into the second patch above, but I think that's another good reason to make a failed import revert, even if we're close to the end, because we've given up our lock and so trying to work on the image when another thread is working will be problematic | 15:03 |
abhishekk | dansmith, looking | 15:09 |
abhishekk | dansmith, so you mean to say that if assert_lock fails in the finish task, then we should revert, right? | 15:16 |
dansmith | not just that, | 15:16 |
dansmith | but if you remove that assert, the task may have taken a very long time but still completed set_data(); now it no longer has the lock, yet it will try to continue doing things without it, like set_image_location, or other stuff that the task _with_ the lock may also be doing | 15:17 |
dansmith | so I'm saying if our lock is busted, we should clean up, revert, etc regardless of if we think we can still proceed | 15:18 |
*** belmoreira has quit IRC | 15:20 | |
dansmith | in addition to the racy location updating, there would also be racy os_glance_importing_to_stores updating (although the context will already prohibit that) | 15:21 |
abhishekk | yep, makes sense | 15:25 |
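A hedged sketch of that rule, with all names hypothetical (holds_task_lock, revert_import, and staged_location are stand-ins, not glance's actual task API): every step after set_data() re-checks lock ownership and reverts rather than proceeding:

```python
class LockBustedError(Exception):
    """Raised when os_glance_import_task no longer names our task."""

def finish_import(action, task_id):
    # Post-set_data() path: never keep going without the lock, even if
    # the remaining work (set_image_location, per-store status updates)
    # seems harmless on its own.
    if not action.holds_task_lock(task_id):
        action.revert_import(task_id)   # clean up stores and status
        raise LockBustedError(task_id)
    # We still own the lock, so it is safe to finish up.
    action.set_image_location(action.staged_location)
    action.set_image_status('active')
```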
* abhishekk going for dinner break, will be back in 30 mins | 15:30 | |
*** happyhemant has quit IRC | 15:35 | |
* abhishekk back from dinner | 15:57 | |
*** priteau has quit IRC | 15:57 | |
openstackgerrit | Dan Smith proposed openstack/glance master: Functional test enhancement for lock busting https://review.opendev.org/746531 | 16:01 |
openstackgerrit | Dan Smith proposed openstack/glance master: Cleanup import status information after busting a lock https://review.opendev.org/746554 | 16:01 |
openstackgerrit | Rajat Dhasmana proposed openstack/glance_store master: Support Cinder multiple stores https://review.opendev.org/746556 | 16:03 |
*** belmoreira has joined #openstack-glance | 16:33 | |
*** belmoreira has quit IRC | 16:36 | |
*** k_mouza has quit IRC | 16:37 | |
dansmith | abhishekk: if you can think of other cases or things to check to add to test_images_import_locking let me know | 17:00 |
abhishekk | dansmith, ack | 17:01 |
dansmith | I probably need to add some comments to that last functional enhancement too | 17:01 |
abhishekk | dansmith, yes, that will be more helpful | 17:03 |
openstackgerrit | Dan Smith proposed openstack/glance master: Functional test enhancement for lock busting https://review.opendev.org/746531 | 17:11 |
openstackgerrit | Dan Smith proposed openstack/glance master: Cleanup import status information after busting a lock https://review.opendev.org/746554 | 17:11 |
* abhishekk signing out for the day | 17:24 | |
*** nikparasyr has left #openstack-glance | 17:28 | |
openstackgerrit | Erno Kuvaja proposed openstack/glance_store master: Ramp up rbd resize to avoid excessive calls https://review.opendev.org/746579 | 18:01 |
openstackgerrit | Luigi Toscano proposed openstack/glance master: zuul: use the new barbican simple-crypto job https://review.opendev.org/746582 | 18:08 |
jokke | mnaser: around? | 18:20 |
mnaser | hi | 18:20 |
jokke | mnaser: hey, mind doing a sanity check on https://review.opendev.org/746579 ? I think you might have heavy interest in that ;) | 18:21 |
mnaser | seems aggressive but i don't know just how much of a performance benefit it gives | 18:22 |
jokke | based on the complaints we've received over the years, I'd assume quite a bit | 18:23 |
jokke | but I'd like you to have a look to make sure that we do not break anything. | 18:24 |
*** hoonetorg has joined #openstack-glance | 18:29 | |
jokke | like I think I did ;) | 18:30 |
openstackgerrit | Erno Kuvaja proposed openstack/glance_store master: Ramp up rbd resize to avoid excessive calls https://review.opendev.org/746579 | 18:40 |
*** belmoreira has joined #openstack-glance | 19:13 | |
*** smcginnis has quit IRC | 19:19 | |
*** smcginnis has joined #openstack-glance | 19:21 | |
dansmith | mnaser: what is aggressive about the resize change? the fact that it scales up to 8G above the size of the image before shrinking? | 19:51 |
dansmith | I expect ceph can resize without much overhead, but we'd still be calling resize ~2500 times for the current 8M resizes on a 20G image | 19:52 |
dansmith | if anything, just doubling the chunk size almost seems like it isn't aggressive enough because you'd still call it a bunch of times ramping 8M into the GB range | 19:53 |
dansmith | I guess it's like 14 calls for a 20G image vs. 2500 so that's clearly an improvement :) | 19:54 |
mnaser | dansmith: yeah, the 8gb scale up, but to be honest, i don't know the implications of the resize, but i agree it will happen a lot less often | 20:01 |
dansmith | yeah me either, I just expect the number of calls we have to make to a remote service is the limitation, not necessarily the actual resize itself | 20:02 |
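A back-of-the-envelope check of those numbers, assuming the patch doubles the resize chunk on every call starting from 8M (pure illustration, not glance_store's actual code):

```python
def resize_calls(image_size, chunk=8 * 1024 * 1024):
    # Count resizes needed when each call doubles the chunk size.
    size = calls = 0
    while size < image_size:
        size += chunk
        chunk *= 2
        calls += 1
    return calls

# ~12 calls for a 20G image with doubling, vs 20G / 8M = 2560 calls
# with fixed 8M growth -- the ~2500 dansmith mentions.
print(resize_calls(20 * 1024 ** 3))
```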
*** belmoreira has quit IRC | 20:43 | |
*** rcernin has joined #openstack-glance | 22:33 | |
*** tkajinam has joined #openstack-glance | 23:07 |