14:00:39 <croelandt> #startmeeting glance
14:00:39 <opendevmeet> Meeting started Thu May 22 14:00:39 2025 UTC and is due to finish in 60 minutes. The chair is croelandt. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:39 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:39 <opendevmeet> The meeting name has been set to 'glance'
14:00:44 <croelandt> #topic roll call
14:00:49 <croelandt> o/
14:00:57 <abhishekk_> o/
14:01:15 <mhen> o/
14:01:49 <croelandt> #link https://etherpad.openstack.org/p/glance-team-meeting-agenda
14:02:01 <croelandt> #topic Release/periodic job updates
14:02:20 <croelandt> glance-multistore-cinder-import-fip is still failing with the same nova-related packaging error as last week
14:02:29 <croelandt> do we know whether anybody is looking at this?
14:02:54 <abhishekk_> Not me
14:04:12 <croelandt> is there a channel where we can get help with stuff like this?
14:04:26 <croelandt> I'm wondering if maybe something is wrong with devstack-single-node-centos-9-stream
14:05:01 <abhishekk_> May be infra?
14:05:27 <abhishekk_> Or rosmaita
14:05:29 <croelandt> ok I'll try #openstack-infra
14:05:31 <dansmith> I thought all the centos jobs were broken
14:05:34 <croelandt> oh
14:05:44 <dansmith> (it's their normal state I think, I'm surprised when they work)
14:05:45 <croelandt> has there been an email about this?
14:05:47 <croelandt> haha
14:05:52 <croelandt> ok I see what you mean
14:05:55 <dansmith> I dunno I heard someone else talking about it
14:06:39 <croelandt> ok I'll dig around on #openstack-infra and try to figure it out
14:06:47 <croelandt> moving on
14:06:51 <croelandt> #topic cheroot to replace use of eventlet.wsgi
14:06:55 <croelandt> abhishekk_: ^
14:07:07 <abhishekk_> Yeah, do we want it?
14:07:31 <abhishekk_> I think someone suggested to us we can use cheroot if we want
14:07:41 <croelandt> so how many ways do we have to deploy Glance right now?
14:07:56 <croelandt> eventlet (to be removed soon) and uwsgi?
14:07:58 <dansmith> um what
14:08:03 <abhishekk_> I think uwsgi and mod_wsgi?
14:08:16 <dansmith> what is "cheroot" ?
14:08:59 <abhishekk_> Cheroot is a pure-Python HTTP server, used as the underlying server component for web frameworks like CherryPy.
14:09:29 <dansmith> ah, so not a direct replacement but yet another deployment mechanism I see
14:09:32 <abhishekk_> It’s WSGI-compliant
14:09:39 <abhishekk_> Yes
14:09:44 <croelandt> so why would we want that if we've got uwsgi and mod_wsgi?
14:10:14 <dansmith> right, and also, if we're wsgi compliant we don't really need to worry (much) about which wsgi container people use
14:10:26 <abhishekk_> If we still want to use WSGI based server i think
14:10:37 <croelandt> honestly I'm lost here
14:10:43 <dansmith> yeah
14:10:48 <croelandt> Aren't we using WSGI already?
14:10:51 <dansmith> yes
14:11:01 <croelandt> so why add yet another deployment mechanism?
14:11:44 <abhishekk_> i don’t recall but someone suggested that if we want then we can think of cheroot
14:11:51 <dansmith> well, if cheroot is just a wsgi container, then we're not and it should work in this one, and all the many others that are out there
14:12:06 <dansmith> but I don't know why we would focus on cheroot any more than gunicorn or mod_wsgi, etc etc
14:12:14 <abhishekk_> May be we can discuss this next PTG
14:12:33 <dansmith> okay but.. there shouldn't be anything we need to do
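For context on the cheroot question above: cheroot is just a WSGI container, so the integration surface is the same WSGI callable Glance already exposes. A minimal sketch, with a trivial hello-world app standing in for Glance's real WSGI application (which a deployer would load instead):

```python
# Minimal sketch: serve any WSGI callable with cheroot. The trivial `app`
# below is a placeholder, not Glance's actual WSGI entry point.
from cheroot import wsgi


def app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello from a WSGI app under cheroot\n']


server = wsgi.Server(('127.0.0.1', 8080), app)
try:
    server.start()
except KeyboardInterrupt:
    server.stop()
```

Which is the point made below: as long as Glance stays WSGI-compliant, cheroot works the same way uwsgi, mod_wsgi or gunicorn do, without any cheroot-specific code in Glance.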
14:12:46 <dansmith> if someone wants to use cheroot with glance, they probably can
14:12:50 <croelandt> I'd expect that if WSGI is a standard, we can plug anything
14:12:51 <abhishekk_> So right now can we remove wsgi which is using eventlet?
14:13:08 <croelandt> I would keep it for a few cycles in case things go south
14:13:15 <croelandt> so users can easily switch back to eventlet if needed
14:13:16 <dansmith> abhishekk_: in pure wsgi mode we're not really using eventlet
14:14:03 <abhishekk_> Right, but code is still there and as we are migrating our functional tests there are some files which are strictly using that
14:14:27 <abhishekk_> So should we not migrate those and keep it there in repo?
14:14:39 <dansmith> right, is your suggestion to use cheroot for the functional test harness?
14:14:51 <dansmith> i.e. not in production, but for the test spinup?
14:15:13 <abhishekk_> Kind of
14:15:51 <croelandt> how does that work since we're migrating away from using a "real" server?
14:16:34 <abhishekk_> Ok may be during next meeting i will come up with one example
14:16:51 <dansmith> maybe abhishekk_ is suggesting using cheroot to maintain some amount of standalone-like runtime?
14:17:25 <dansmith> otherwise I'm definitely confused about the overlap between eventlet, wsgi, and cheroot
14:17:28 <abhishekk_> Yeah
14:18:09 <croelandt> Now I'm confused whether this is related to deployment or functional testing
14:18:21 <abhishekk_> Ok, let's revisit this next week
14:18:23 <croelandt> also we're in the middle of migrating all the functional tests away from using a real server :D
14:18:37 <dansmith> I think I'd prefer not to have (and maintain) any sort of standalone thing in parallel to the way we expect glance to be run in production, unless it's very specifically more for just in the test environment or something
14:18:52 <croelandt> yeah maybe if you could put up a document on how we deploy, and what you want to change - something that we could review before the next meeting
14:19:01 <croelandt> dansmith: +1
14:19:08 <dansmith> but either way, many of the other api projects do functional testing without standing up a full server, and thus have a lot less complexity for doing this in general
14:19:15 <abhishekk_> Yes we are and there are still some tests where we need actual server unless we mock the usage of it
14:19:15 <croelandt> dansmith: I think we should also observe what stephenfin does for Devstack and make sure we make it easy for him
14:19:33 <croelandt> ok maybe let's identify those tests
14:19:39 <croelandt> and see why we cannot do like all the other projects
14:19:43 <dansmith> yeah
14:19:54 <abhishekk_> Ack
14:20:13 <croelandt> ok moving on
14:20:17 <croelandt> #topic RBD Delete issue while hash calculation is in progress
14:20:19 <croelandt> abhishekk_: still you :)
14:20:39 <abhishekk_> test_wsgi for example
14:21:02 <abhishekk_> What should we do about this case
14:21:27 <abhishekk_> Unless we solve this issue, new location api is of no use to consumers of glance
14:22:19 <croelandt> ok so this is the issue with turning on do_secure_hash by default, right?
14:22:59 <abhishekk_> I am still inclined towards adding new method in glance_store rbd driver to move image to trash and then from glance delete call we can check if any task is in progress for that image then move it to trash
14:23:01 <dansmith> it is on by default, as it should be
14:23:05 <abhishekk_> Yes
14:23:32 <croelandt> Is this only an issue with RBD?
14:23:37 <abhishekk_> Yes
14:23:51 <dansmith> abhishekk: is this not a problem also with cinder?
14:24:25 <abhishekk_> I don’t think we have encountered this with cinder as glance backend
14:24:45 <dansmith> but, shouldn't it be? If the hashing is running you can't delete and unmount right?
14:24:57 <abhishekk_> InUseByStore is only raised by rbd
14:25:29 <croelandt> we should probably check with Rajat/Brian
14:25:43 <dansmith> I understand that the exact problem and trace is rbd specific, I'm talking about the general problem of delete while hash is running
14:26:00 <abhishekk_> I am not sure about it, but as per my understanding only rbd restricts us from deleting the image (but actually it deletes it)
14:26:11 <croelandt> so if we send it to trash
14:26:20 <croelandt> will RBD allow us to do that while it's computing the has?
14:26:22 <croelandt> hash*
14:26:33 <dansmith> but the cinder driver will try to unmount and delete the volume during an image delete right?
14:26:39 <abhishekk_> I will check with cinder as glance backend
14:26:56 <dansmith> I'm saying, let's spend a minute to find out if there are other backends that might also be affected, so we can make sure we fix it right
14:27:06 <abhishekk_> Yes and it will delete it successfully imo
14:27:42 <abhishekk_> Also we have a code at location import in hash calculation task to catch not found and log a warning and continue rather than failing
14:27:49 <croelandt> so 1) the image goes to trash 2) the hash is still being computed 3) the image is "really" deleted?
14:28:35 <dansmith> croelandt: that's the expectation yes
14:28:44 <abhishekk> https://github.com/openstack/glance/blob/master/glance/async_/flows/location_import.py#L94
14:28:58 <dansmith> croelandt: I think rosmaita told me that he thought that might not work, that we can only trash something that just has other clones but isn't actually open for reading
14:29:06 <dansmith> croelandt: I also feel like this must be an existing issue with download..
14:29:24 <croelandt> This might sound stupid, but when Glance deletes an image, can't it start by cancelling the hash calculation?
14:29:25 <dansmith> croelandt: if I go to download an image and then slow-walk the data stream so it takes an hour, can't I block delete of that image?
14:29:41 <dansmith> croelandt: I've already suggested that too - we can, but it takes more work
14:29:42 <abhishekk> AFAIK pranali has tested this with Octopus version of ceph and it does not have that issue
14:29:53 <croelandt> dansmith: oh really?
14:29:57 <dansmith> croelandt: just like the import-from-worker, we need to know which worker and call to *that* one to stop the task
14:30:10 <dansmith> croelandt: if we recorded that, then yes, we could call to cancel and that would be better, IMHO
14:30:26 <croelandt> how hard is it to keep a list of workers and tasks?
14:30:36 <dansmith> croelandt: we don't even need to do that,
14:30:38 <croelandt> isn't that something we can log at some point?
14:30:48 <dansmith> croelandt: like the distributed import, we just record *which* one is hashing an image, on the image
14:30:55 <dansmith> we already do this for distributed import
14:31:23 <dansmith> we record on the image "hey it's me $conf.self_ref_url who staged this image, let me know if you want me to import it"
14:31:28 <abhishekk> so we should add another property?
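A rough sketch of what that extra property and the cancel call could look like, modeled on the distributed-import pattern described above; the property name, endpoint path and helper names here are assumptions, not existing Glance code:

```python
# Illustrative only: property name, endpoint and helpers are hypothetical.
import requests

HASH_WORKER_PROP = 'os_glance_hash_worker'  # hypothetical extra property


def record_hash_worker(image, image_repo, self_ref_url):
    # Mark the image with the worker that is computing its hash, the same
    # way distributed import records which worker staged the image.
    image.extra_properties[HASH_WORKER_PROP] = self_ref_url
    image_repo.save(image)


def cancel_hash_before_delete(image):
    # On delete, call back to the owning worker so it can stop the hash
    # task before the backend delete proceeds.
    worker = image.extra_properties.get(HASH_WORKER_PROP)
    if worker:
        # Hypothetical internal endpoint, mirroring how distributed import
        # proxies stage/import requests between workers.
        requests.delete('%s/v2/internal/images/%s/hash_task'
                        % (worker, image.image_id), timeout=10)
```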
14:31:31 <dansmith> we could do the same for the hash task
14:31:34 <croelandt> I mean
14:32:00 <croelandt> that seems easier to say "hey, let's forget about the hash" than to say "let's compute this hash we'll never use because the user changed their mind and decided to delete the image"
14:32:06 <dansmith> this is why I want to know if cinder suffers here, because the cinder backend has no "trash" like rbd does
14:32:21 <croelandt> cancelling the hash calculation would be backend-agnostic
14:32:28 <dansmith> exactly
14:32:28 <croelandt> which is a huge plus imo
14:32:42 <croelandt> I don't want to find out in 2 years that "lol s3 changed something and now we have s3-specific code"
14:32:59 <dansmith> not likely for s3 but substitute anything else in there, agreed :D
14:33:03 <croelandt> yeah
14:33:08 <croelandt> or $Shiny_new_backend
14:33:27 <croelandt> or $ai_backend_that_hallucinates_your_data
14:33:39 <croelandt> abhishekk_: you say this used to work, does Ceph know about this?
14:33:47 <croelandt> Is this a change that appears in the release notes?
14:34:14 <abhishekk> I don't know about ceph, but as per pranali it was working with octopus
14:34:25 <croelandt> ok I'll talk to Francesco maybe
14:34:41 <abhishekk> we can confirm that if we manage to add a job which deploys octopus for us?
14:35:05 <croelandt> hm it would be nice to do that
14:35:14 <croelandt> can we easily specify a ceph version for our -ceph jobs?
14:35:27 <dansmith> I want to know about cinder :)
14:35:40 <abhishekk> I think I will check dnm patches of pranali she might have added the octopus related job at that time
14:35:50 <croelandt> dansmith: writing that down as well
14:36:11 <abhishekk> may be rajat can help us with cinder case,
14:36:17 <croelandt> but truly avoiding driver-specific code would be nice
14:36:26 <abhishekk> we have cinder multistore job which does not break imo
14:36:34 <abhishekk> ack,
14:36:51 <croelandt> easier to understand & debug if all drivers behave similarly
14:36:53 <abhishekk> so as per dan we should add one more property to record a worker which is calculating hash
14:37:11 <dansmith> the ceph job isn't actually failing either right?
14:37:13 <croelandt> not sure how hard that is and maybe we'll find it impractical but that sounds good
14:37:18 <dansmith> I mean, reliably
14:37:23 <croelandt> dansmith: the Tempest tests were failing, weren't they?
14:37:33 <croelandt> https://review.opendev.org/c/openstack/tempest/+/949595
14:37:37 <dansmith> they must have been flaky
14:37:42 <croelandt> #link https://review.opendev.org/c/openstack/tempest/+/949595
14:37:45 <abhishekk> ceph nova job is failing intermittently for snapshot, backup tests
14:37:59 <croelandt> wasn't this patch a way to work around the issue?
14:38:02 <dansmith> right, so intermittent means the cinder one not being broken doesn't prove anything to me
14:38:49 <abhishekk> ack
14:38:55 <dansmith> looking at the cinder driver real quick,
14:39:01 <dansmith> I don't see how it could not be failing the same way
14:39:23 <dansmith> also remember that upstream we use basically zero-length images so the hash can complete faster than we can delete a server...
14:39:23 <abhishekk> May be I will deploy glance with cinder and test this manually
14:39:31 <croelandt> so maybe we're lucky and the hash calculation is just fast enough?
14:39:45 <croelandt> yeah ok
14:40:04 <croelandt> in real life, it's also probably best not to compute the hash for no reason, right?
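One way to picture the backend-agnostic "stop hashing once the image is gone" idea raised above: have the hash task periodically re-check the image and bail out. The check interval, statuses and repo API shape below are assumptions, not the current task code:

```python
# Sketch only: the check interval, statuses and repo calls are assumed.
import hashlib


def calculate_hash(image_repo, image_id, data_iter, algo='sha512',
                   check_every=64):
    hasher = hashlib.new(algo)
    for i, chunk in enumerate(data_iter):
        if i % check_every == 0:
            try:
                image = image_repo.get(image_id)
            except Exception:
                # Image already gone (e.g. NotFound once the delete landed).
                return None
            if image.status in ('deleted', 'pending_delete'):
                # User deleted the image mid-calculation; stop instead of
                # finishing a hash nobody will ever read.
                return None
        hasher.update(chunk)
    return hasher.hexdigest()
```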
14:40:17 <abhishekk> is it ok to test with lvm as backend or should test with glance and cinder both using rbd?
14:40:28 <dansmith> you mean keep hashing it after delete?
14:40:45 <dansmith> croelandt: ^
14:40:46 <croelandt> dansmith: yeah, if a user deletes an image and we keep computing the hash, it's basically useless
14:40:53 <croelandt> and wasted resources
14:40:56 <dansmith> croelandt: it's also a DoS opportunity for them
14:41:22 <dansmith> abhishekk: I don't think it matters, but you need to (a) make sure the hashing is working for cinder and (b) that it's actually running when you try to delete the image
14:41:26 <croelandt> by doing this N times simultaneously?
14:41:45 <dansmith> croelandt: I can create and delete images all day long yeah
14:42:02 <abhishekk> AFAIK this new location API will use service credentials so that's less possible
14:42:08 <dansmith> croelandt: especially if I use web-download to source the material it doesn't even cost me bandwidth :)
14:42:09 <croelandt> yeah and for one image the CPU load may not be high but if you do that enough...
14:42:33 <croelandt> so yeah maybe let's kill the hash computation task
14:42:38 <dansmith> abhishekk: right but I can create lots of instances and snapshot them
14:42:44 <abhishekk> :D
14:42:55 <dansmith> abhishekk: I'm not saying it's an acute problem, but croelandt is right that it's wasted resource
14:43:06 <abhishekk> agree
14:43:11 <croelandt> yeah and we think it might not be triggered, but some smart ass is going to find a workaround
14:43:25 <croelandt> so might as well not do it
14:43:31 <abhishekk> OK so action plan is
14:43:55 <abhishekk> 1, test with cinder
14:44:07 <abhishekk> 2. kill the hash computation during delete
14:44:20 <abhishekk> anything else?
14:44:37 <dansmith> I would still experiment with the rbd trash solution
14:44:54 <dansmith> rosmaita wasn't sure you could trash an active in-use image, but let's figure out if that's an option
14:45:22 <croelandt> also I'm worried this might change at some point in Ceph's life :)
14:45:27 <abhishekk> No, we will not trash active, we will mark it as deleted in glance and then move it to trash
14:45:28 <dansmith> #2 seems like a good idea to me, but it will be more work too and thus will take longer
14:45:47 <dansmith> croelandt: because it already has?
14:45:49 <croelandt> dansmith: I'm confused, didn't you like the driver-agnostic solution better?
14:46:03 <dansmith> croelandt: absolutely
14:46:16 <dansmith> croelandt: it's not just something abhishekk can write up in an afternoon (I suspect)
14:46:37 <abhishekk> may be 3-4 afternoons ?
14:46:47 <dansmith> and I just want to know what the trash solution is, although "marking as deleted in glance" is not really a thing AFAIK, so I'm curious about that
14:47:23 <abhishekk> that will be patchy solution :/
14:47:41 <dansmith> abhishekk: did you mean do what I suggested before, delete in glance, ignore the InUseByStore and make the hash task delete in glance when it finds out the image is deleted in glance?
14:48:09 <abhishekk> yeah
14:48:28 <dansmith> that's good too although it does have a leaky hole
14:48:45 <dansmith> if we mark as deleted and then the hash task crashes or stops because an operator kills it,
14:48:56 <dansmith> then we leak the image on rbd with no real good way to clean up (AFAIK)
14:49:24 <abhishekk> I think rbd deletes the image it does not keep it
14:49:45 <dansmith> why would it?
14:49:46 <abhishekk> it deletes the image and then raises InUse exception :/
14:49:52 <dansmith> oh right, that bug
14:49:55 <abhishekk> yeah
14:49:58 <dansmith> that's the buggy behavior,
14:50:08 <dansmith> but we can't depend on that forever,
14:50:14 <dansmith> so if they fix that behavior we start leaking
14:50:18 <abhishekk> exactly
14:50:58 <abhishekk> So should we stick with cancelling the hash computation before delete call goes to actual backend?
14:51:22 <dansmith> so, I was hoping that we could make delete move the image to trash only (if rosmaita is wrong). if it's unused it will go away immediately, and if it is, it will go away when finished (hash cancel aside)
14:51:35 <dansmith> so _that_ is what I was saying we should still investigate :)
14:52:19 <abhishekk> OK, I will write a POC for that
14:52:52 <abhishekk> So modified action plan
14:52:58 <abhishekk> 1, test with cinder
14:53:06 <abhishekk> 2, POC for moving image to trash
14:53:20 <abhishekk> 3, hash calculation cancellation before deleting the image
14:53:26 <dansmith> yep
14:53:27 <dansmith> the distributed import should be a good example for the hash cancellation thing
14:53:40 <abhishekk> IF 2 works then we can skip 3?
14:53:56 <croelandt> Ideally if 3 works can we skip 2? :)
14:53:57 <abhishekk> That depends on 1 I guess :P
14:54:05 <dansmith> idk, I think 3 is still worthwhile, but yeah.. only if 1 :)
14:54:28 <dansmith> croelandt: yeah, maybe we should just do 1, 3, and then 2 if 3 looks harder than we thought or something
14:54:33 <abhishekk> Ok, may be next Thursday we will have more data to decide on
14:54:53 <croelandt> yeah
14:54:59 <croelandt> let's move on
14:55:04 <croelandt> #topic Specs
14:55:12 <croelandt> On Monday I will merge https://review.opendev.org/c/openstack/glance-specs/+/947423
14:55:20 <croelandt> unless I see a -1 there :)
14:55:38 <dansmith> ugh, I should go review that
14:55:48 <croelandt> #topic One easy patch per core dev
14:55:54 <croelandt> #link https://review.opendev.org/c/openstack/glance/+/936319
14:56:00 <dansmith> I'm just really behind and swamped
14:56:07 <croelandt> ^ this is a simple patch by Takashi to remove a bunch of duplicated hacking checks
14:56:23 <croelandt> dansmith: yeah :-(
14:56:46 <abhishekk> We have mhen here I think, want to discuss something
14:56:55 <mhen> hi
14:56:56 <croelandt> I see there was a lengthy discussion about the spec
14:57:02 <croelandt> so feel free to go -1 if this has not been resolved
14:57:09 <croelandt> mhen: oh yeah, did you have something? I don't see a topic in the agenda
14:57:24 <mhen> no, just quick update for now
14:57:28 <mhen> I'm working on the image encryption again, currently looking into image import cases
14:57:32 <mhen> glance-direct from staging seems to be working fine and no impact, will look into cross-backend import of encrypted images next
14:57:45 <abhishekk> cool
14:58:05 <abhishekk> let us know if you need anything
14:58:09 <mhen> btw, noticed that `openstack image import --method glance-direct` has pretty bad UX: if the Glance API returns any Conflict or BadRequest (in glance/glance/api/v2/images.py there are a lot of cases for this!), the client simply ignores it and shows a GET output of the image stuck in "uploading" state, which can be repeated indefinitely
14:58:27 <mhen> even with `--debug` it only briefly shows the 409 but not the message
14:58:44 <croelandt> interesting
14:58:47 <croelandt> can you file a bug for that?
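Returning to item 2 of the action plan above (the rbd trash POC): a minimal sketch of what the new glance_store rbd call could look like, using trash_move from the rbd python bindings. The pool name is assumed, and whether trash_move succeeds while the image is still open for reading is exactly the open question rosmaita raised:

```python
# Sketch of an rbd trash-based delete; pool name and behavior with an
# in-use image are assumptions to be verified in the POC.
import rados
import rbd


def trash_image(image_name, pool='images', conffile='/etc/ceph/ceph.conf'):
    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            # Move the image to the trash instead of deleting it outright:
            # it disappears from the pool listing immediately and the data
            # is purged later, ideally even if a hash task still has it open.
            rbd.RBD().trash_move(ioctx, image_name, delay=0)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```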
14:58:50 <abhishekk> may be we need to look into that
14:59:05 <mhen> yes, I will put it on my todo list to file a bug
14:59:09 <abhishekk> could you possibly check with glance image-create-via-import as well?
14:59:22 <mhen> will try, noted
14:59:30 <abhishekk> cool, thank you!!
15:00:00 <croelandt> #topic Open Discussion
15:00:08 <croelandt> I won't be there for the next 2 Thursdays
15:00:16 <croelandt> so it's up to all of y'all whether there will be meetings :)
15:01:00 <abhishekk> Ok, I will chair the next meeting
15:01:16 <abhishekk> we will decide for the next one later
15:02:11 <croelandt> perfect
15:02:17 <croelandt> It's been a long one
15:02:22 <croelandt> see you on #openstack-glance :)
15:02:26 <croelandt> Thanks for joining
15:02:47 <abhishekk> thank you!
15:03:33 <croelandt> #endmeeting