14:00:39 <croelandt> #startmeeting glance
14:00:39 <opendevmeet> Meeting started Thu May 22 14:00:39 2025 UTC and is due to finish in 60 minutes. The chair is croelandt. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:39 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:39 <opendevmeet> The meeting name has been set to 'glance'
14:00:44 <croelandt> #topic roll call
14:00:49 <croelandt> o/
14:00:57 <abhishekk_> o/
14:01:15 <mhen> o/
14:01:49 <croelandt> #link https://etherpad.openstack.org/p/glance-team-meeting-agenda
14:02:01 <croelandt> #topic Release/periodic job updates
14:02:20 <croelandt> glance-multistore-cinder-import-fip is still failing with the same nova-related packaging error as last week
14:02:29 <croelandt> do we know whether anybody is looking at this?
14:02:54 <abhishekk_> Not me
14:04:12 <croelandt> is there a channel where we can get help with stuff like this?
14:04:26 <croelandt> I'm wondering if maybe something is wrong with devstack-single-node-centos-9-stream
14:05:01 <abhishekk_> May be infra?
14:05:27 <abhishekk_> Or rosmaita
14:05:29 <croelandt> ok I'll try #openstack-infra
14:05:31 <dansmith> I thought all the centos jobs were broken
14:05:34 <croelandt> oh
14:05:44 <dansmith> (it's their normal state I think, I'm surprised when they work)
14:05:45 <croelandt> has there been an email about this?
14:05:47 <croelandt> haha
14:05:52 <croelandt> ok I see what you mean
14:05:55 <dansmith> I dunno I heard someone else talking about it
14:06:39 <croelandt> ok I'll dig around on #openstack-infra and try to figure it out
14:06:47 <croelandt> moving on
14:06:51 <croelandt> #topic cheroot to replace use of eventlet.wsgi
14:06:55 <croelandt> abhishekk_: ^
14:07:07 <abhishekk_> Yeah, do we want it?
14:07:31 <abhishekk_> I think someone suggested to us we can use cheroot if we want
14:07:41 <croelandt> so how many ways do we have to deploy Glance right now?
14:07:56 <croelandt> eventlet (to be removed soon) and uwsgi?
14:07:58 <dansmith> um what
14:08:03 <abhishekk_> I think uwsgi and mod_wsgi?
14:08:16 <dansmith> what is "cheroot" ?
14:08:59 <abhishekk_> Cheroot is a pure-Python HTTP server, used as the underlying server component for web frameworks like CherryPy.
14:09:29 <dansmith> ah, so not a direct replacement but yet another deployment mechanism I see
14:09:32 <abhishekk_> It’s WSGI-compliant
14:09:39 <abhishekk_> Yes
14:09:44 <croelandt> so why would we want that if we've got uwsgi and mod_wsgi?
14:10:14 <dansmith> right, and also, if we're wsgi compliant we don't really need to worry (much) about which wsgi container people use
14:10:26 <abhishekk_> If we still want to use WSGI based server i think
14:10:37 <croelandt> honestly I'm lost here
14:10:43 <dansmith> yeah
14:10:48 <croelandt> Aren't we using WSGI already?
14:10:51 <dansmith> yes
14:11:01 <croelandt> so why add yet another deployment mechanism?
14:11:44 <abhishekk_> i don’t recall but someone suggested that if we want then we can think of cheroot
14:11:51 <dansmith> well, if cheroot is just a wsgi container, then we're not and it should work in this one, and all the many others that are out there
14:12:06 <dansmith> but I don't know why we would focus on cheroot any more than gunicorn or mod_wsgi, etc etc
14:12:14 <abhishekk_> May be we can discuss this next PTG
14:12:33 <dansmith> okay but.. there shouldn't be anything we need to do
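For context on the cheroot question above: cheroot is just a WSGI container, so the integration surface is the same WSGI callable Glance already exposes. A minimal sketch, with a trivial hello-world app standing in for Glance's real WSGI application (which a deployer would load instead):

```python
# Minimal sketch: serve any WSGI callable with cheroot. The trivial `app`
# below is a placeholder, not Glance's actual WSGI entry point.
from cheroot import wsgi


def app(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello from a WSGI app under cheroot\n']


server = wsgi.Server(('127.0.0.1', 8080), app)
try:
    server.start()
except KeyboardInterrupt:
    server.stop()
```

Which is the point made below: as long as Glance stays WSGI-compliant, cheroot works the same way uwsgi, mod_wsgi or gunicorn do, without any cheroot-specific code in Glance.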
14:12:46 <dansmith> if someone wants to use cheroot with glance, they probably can
14:12:50 <croelandt> I'd expect that if WSGI is a standard, we can plug anything
14:12:51 <abhishekk_> So right now can we remove wsgi which is using eventlet?
14:13:08 <croelandt> I would keep it for a few cycles in case things go south
14:13:15 <croelandt> so users can easily switch back to eventlet if needed
14:13:16 <dansmith> abhishekk_: in pure wsgi mode we're not really using eventlet
14:14:03 <abhishekk_> Right, but code is still there and as we are migrating our functional tests there are some files which are strictly using that
14:14:27 <abhishekk_> So should we not migrate those and keep it there in repo?
14:14:39 <dansmith> right, is your suggestion to use cheroot for the functional test harness?
14:14:51 <dansmith> i.e. not in production, but for the test spinup?
14:15:13 <abhishekk_> Kind of
14:15:51 <croelandt> how does that work since we're migrating away from using a "real" server?
14:16:34 <abhishekk_> Ok may be during next meeting i will come up with one example
14:16:51 <dansmith> maybe abhishekk_ is suggesting using cheroot to maintain some amount of standalone-like runtime?
14:17:25 <dansmith> otherwise I'm definitely confused about the overlap between eventlet, wsgi, and cheroot
14:17:28 <abhishekk_> Yeah
14:18:09 <croelandt> Now I'm confused whether this is related to deployment or functional testing
14:18:21 <abhishekk_> Ok, let's revisit this next week
14:18:23 <croelandt> also we're in the middle of migrating all the functional tests away from using a real server :D
14:18:37 <dansmith> I think I'd prefer not to have (and maintain) any sort of standalone thing in parallel to the way we expect glance to be run in production, unless it's very specifically more for just in the test environment or something
14:18:52 <croelandt> yeah maybe if you could put up a document on how we deploy, and what you want to change - something that we could review before the next meeting
14:19:01 <croelandt> dansmith: +1
14:19:08 <dansmith> but either way, many of the other api projects do functional testing without standing up a full server, and thus have a lot less complexity for doing this in general
14:19:15 <abhishekk_> Yes we are and there are still some tests where we need actual server unless we mock the usage of it
14:19:15 <croelandt> dansmith: I think we should also observe what stephenfin does for Devstack and make sure we make it easy for him
14:19:33 <croelandt> ok maybe let's identify those tests
14:19:39 <croelandt> and see why we cannot do like all the other projects
14:19:43 <dansmith> yeah
14:19:54 <abhishekk_> Ack
14:20:13 <croelandt> ok moving on
14:20:17 <croelandt> #topic RBD Delete issue while hash calculation is in progress
14:20:19 <croelandt> abhishekk_: still you :)
14:20:39 <abhishekk_> test_wsgi for example
14:21:02 <abhishekk_> What should we do about this case
14:21:27 <abhishekk_> Unless we solve this issue, new location api is of no use to consumers of glance
14:22:19 <croelandt> ok so this is the issue with turning on do_secure_hash by default, right?
14:22:59 <abhishekk_> I am still inclined towards adding new method in glance_store rbd driver to move image to trash and then from glance delete call we can check if any task is in progress for that image then move it to trash
14:23:01 <dansmith> it is on by default, as it should be
14:23:05 <abhishekk_> Yes
14:23:32 <croelandt> Is this only an issue with RBD?
14:23:37 <abhishekk_> Yes
14:23:51 <dansmith> abhishekk: is this not a problem also with cinder?
14:24:25 <abhishekk_> I don’t think we have encountered this with cinder as glance backend
14:24:45 <dansmith> but, shouldn't it be? If the hashing is running you can't delete and unmount right?
14:24:57 <abhishekk_> InUseByStore is only raised by rbd
14:25:29 <croelandt> we should probably check with Rajat/Brian
14:25:43 <dansmith> I understand that the exact problem and trace is rbd specific, I'm talking about the general problem of delete while hash is running
14:26:00 <abhishekk_> I am not sure about it, but as per my understanding only rbd restricts us from deleting the image (but actually it deletes it)
14:26:11 <croelandt> so if we send it to trash
14:26:20 <croelandt> will RBD allow us to do that while it's computing the has?
14:26:22 <croelandt> hash*
14:26:33 <dansmith> but the cinder driver will try to unmount and delete the volume during an image delete right?
14:26:39 <abhishekk_> I will check with cinder as glance backend
14:26:56 <dansmith> I'm saying, let's spend a minute to find out if there are other backends that might also be affected, so we can make sure we fix it right
14:27:06 <abhishekk_> Yes and it will delete it successfully imo
14:27:42 <abhishekk_> Also we have a code at location import in hash calculation task to catch not found and log a warning and continue rather than failing
14:27:49 <croelandt> so 1) the image goes to trash 2) the hash is still being computed 3) the image is "really" deleted?
14:28:35 <dansmith> croelandt: that's the expectation yes
14:28:44 <abhishekk> https://github.com/openstack/glance/blob/master/glance/async_/flows/location_import.py#L94
14:28:58 <dansmith> croelandt: I think rosmaita told me that he thought that might not work, that we can only trash something that just has other clones but isn't actually open for reading
14:29:06 <dansmith> croelandt: I also feel like this must be an existing issue with download..
14:29:24 <croelandt> This might sound stupid, but when Glance deletes an image, can't it start by cancelling the hash calculation?
14:29:25 <dansmith> croelandt: if I go to download an image and then slow-walk the data stream so it takes an hour, can't I block delete of that image?
14:29:41 <dansmith> croelandt: I've already suggested that too - we can, but it takes more work
14:29:42 <abhishekk> AFAIK pranali has tested this with Octopus version of ceph and it does not have that issue
14:29:53 <croelandt> dansmith: oh really?
14:29:57 <dansmith> croelandt: just like the import-from-worker, we need to know which worker and call to *that* one to stop the task
14:30:10 <dansmith> croelandt: if we recorded that, then yes, we could call to cancel and that would be better, IMHO
14:30:26 <croelandt> how hard is it to keep a list of workers and tasks?
14:30:36 <dansmith> croelandt: we don't even need to do that,
14:30:38 <croelandt> isn't that something we can log at some point?
14:30:48 <dansmith> croelandt: like the distributed import, we just record *which* one is hashing an image, on the image
14:30:55 <dansmith> we already do this for distributed import
14:31:23 <dansmith> we record on the image "hey it's me $conf.self_ref_url who staged this image, let me know if you want me to import it"
14:31:28 <abhishekk> so we should add another property?
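A rough sketch of what that extra property and the cancel call could look like, modeled on the distributed-import pattern described above; the property name, endpoint path and helper names here are assumptions, not existing Glance code:

```python
# Illustrative only: property name, endpoint and helpers are hypothetical.
import requests

HASH_WORKER_PROP = 'os_glance_hash_worker'  # hypothetical extra property


def record_hash_worker(image, image_repo, self_ref_url):
    # Mark the image with the worker that is computing its hash, the same
    # way distributed import records which worker staged the image.
    image.extra_properties[HASH_WORKER_PROP] = self_ref_url
    image_repo.save(image)


def cancel_hash_before_delete(image):
    # On delete, call back to the owning worker so it can stop the hash
    # task before the backend delete proceeds.
    worker = image.extra_properties.get(HASH_WORKER_PROP)
    if worker:
        # Hypothetical internal endpoint, mirroring how distributed import
        # proxies stage/import requests between workers.
        requests.delete('%s/v2/internal/images/%s/hash_task'
                        % (worker, image.image_id), timeout=10)
```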
14:31:31 <dansmith> we could do the same for the hash task
14:31:34 <croelandt> I mean
14:32:00 <croelandt> that seems easier to say "hey, let's forget about the hash" than to say "let's compute this hash we'll never use because the user changed their mind and decided to delete the image"
14:32:06 <dansmith> this is why I want to know if cinder suffers here, because the cinder backend has no "trash" like rbd does
14:32:21 <croelandt> cancelling the hash calculation would be backend-agnostic
14:32:28 <dansmith> exactly
14:32:28 <croelandt> which is a huge plus imo
14:32:42 <croelandt> I don't want to find out in 2 years that "lol s3 changed something and now we have s3-specific code"
14:32:59 <dansmith> not likely for s3 but substitute anything else in there, agreed :D
14:33:03 <croelandt> yeah
14:33:08 <croelandt> or $Shiny_new_backend
14:33:27 <croelandt> or $ai_backend_that_hallucinates_your_data
14:33:39 <croelandt> abhishekk_: you say this used to work, does Ceph know about this?
14:33:47 <croelandt> Is this a change that appears in the release notes?
14:34:14 <abhishekk> I don't know about ceph, but as per pranali it was working with octopus
14:34:25 <croelandt> ok I'll talk to Francesco maybe
14:34:41 <abhishekk> we can confirm that if we manage to add a job which deploys octopus for us?
14:35:05 <croelandt> hm it would be nice to do that
14:35:14 <croelandt> can we easily specify a ceph version for our -ceph jobs?
14:35:27 <dansmith> I want to know about cinder :)
14:35:40 <abhishekk> I think I will check dnm patches of pranali she might have added the octopus related job at that time
14:35:50 <croelandt> dansmith: writing that down as well
14:36:11 <abhishekk> may be rajat can help us with cinder case,
14:36:17 <croelandt> but truly avoiding driver-specific code would be nice
14:36:26 <abhishekk> we have cinder multistore job which does not break imo
14:36:34 <abhishekk> ack,
14:36:51 <croelandt> easier to understand & debug if all drivers behave similarly
14:36:53 <abhishekk> so as per dan we should add one more property to record a worker which is calculating hash
14:37:11 <dansmith> the ceph job isn't actually failing either right?
14:37:13 <croelandt> not sure how hard that is and maybe we'll find it impractical but that sounds good
14:37:18 <dansmith> I mean, reliably
14:37:23 <croelandt> dansmith: the Tempest tests were failing, weren't they?
14:37:33 <croelandt> https://review.opendev.org/c/openstack/tempest/+/949595
14:37:37 <dansmith> they must have been flaky
14:37:42 <croelandt> #link https://review.opendev.org/c/openstack/tempest/+/949595
14:37:45 <abhishekk> ceph nova job is failing intermittently for snapshot, backup tests
14:37:59 <croelandt> wasn't this patch a way to work around the issue?
14:38:02 <dansmith> right, so intermittent means the cinder one not being broken doesn't prove anything to me
14:38:49 <abhishekk> ack
14:38:55 <dansmith> looking at the cinder driver real quick,
14:39:01 <dansmith> I don't see how it could not be failing the same way
14:39:23 <dansmith> also remember that upstream we use basically zero-length images so the hash can complete faster than we can delete a server...
14:39:23 <abhishekk> May be I will deploy glance with cinder and test this manually
14:39:31 <croelandt> so maybe we're lucky and the hash calculation is just fast enough?
14:39:45 <croelandt> yeah ok
14:40:04 <croelandt> in real life, it's also probably best not to compute the hash for no reason, right?
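One way to picture the backend-agnostic "stop hashing once the image is gone" idea raised above: have the hash task periodically re-check the image and bail out. The check interval, statuses and repo API shape below are assumptions, not the current task code:

```python
# Sketch only: the check interval, statuses and repo calls are assumed.
import hashlib


def calculate_hash(image_repo, image_id, data_iter, algo='sha512',
                   check_every=64):
    hasher = hashlib.new(algo)
    for i, chunk in enumerate(data_iter):
        if i % check_every == 0:
            try:
                image = image_repo.get(image_id)
            except Exception:
                # Image already gone (e.g. NotFound once the delete landed).
                return None
            if image.status in ('deleted', 'pending_delete'):
                # User deleted the image mid-calculation; stop instead of
                # finishing a hash nobody will ever read.
                return None
        hasher.update(chunk)
    return hasher.hexdigest()
```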
14:40:17 <abhishekk> is it ok to test with lvm as backend or should test with glance and cinder both using rbd?
14:40:28 <dansmith> you mean keep hashing it after delete?
14:40:45 <dansmith> croelandt: ^
14:40:46 <croelandt> dansmith: yeah, if a user deletes an image and we keep computing the hash, it's basically useless
14:40:53 <croelandt> and wasted resources
14:40:56 <dansmith> croelandt: it's also a DoS opportunity for them
14:41:22 <dansmith> abhishekk: I don't think it matters, but you need to (a) make sure the hashing is working for cinder and (b) that it's actually running when you try to delete the image
14:41:26 <croelandt> by doing this N times simultaneously?
14:41:45 <dansmith> croelandt: I can create and delete images all day long yeah
14:42:02 <abhishekk> AFAIK this new location API will use service credentials so that's less possible
14:42:08 <dansmith> croelandt: especially if I use web-download to source the material it doesn't even cost me bandwidth :)
14:42:09 <croelandt> yeah and for one image the CPU load may not be high but if you do that enough...
14:42:33 <croelandt> so yeah maybe let's kill the hash computation task
14:42:38 <dansmith> abhishekk: right but I can create lots of instances and snapshot them
14:42:44 <abhishekk> :D
14:42:55 <dansmith> abhishekk: I'm not saying it's an acute problem, but croelandt is right that it's wasted resource
14:43:06 <abhishekk> agree
14:43:11 <croelandt> yeah and we think it might not be triggered, but some smart ass is going to find a workaround
14:43:25 <croelandt> so might as well not do it
14:43:31 <abhishekk> OK so action plan is
14:43:55 <abhishekk> 1, test with cinder
14:44:07 <abhishekk> 2. kill the hash computation during delete
14:44:20 <abhishekk> anything else?
14:44:37 <dansmith> I would still experiment with the rbd trash solution
14:44:54 <dansmith> rosmaita wasn't sure you could trash an active in-use image, but let's figure out if that's an option
14:45:22 <croelandt> also I'm worried this might change at some point in Ceph's life :)
14:45:27 <abhishekk> No, we will not trash active, we will mark it as deleted in glance and then move it to trash
14:45:28 <dansmith> #2 seems like a good idea to me, but it will be more work too and thus will take longer
14:45:47 <dansmith> croelandt: because it already has?
14:45:49 <croelandt> dansmith: I'm confused, didn't you like the driver-agnostic solution better?
14:46:03 <dansmith> croelandt: absolutely
14:46:16 <dansmith> croelandt: it's not just something abhishekk can write up in an afternoon (I suspect)
14:46:37 <abhishekk> may be 3-4 afternoons ?
14:46:47 <dansmith> and I just want to know what the trash solution is, although "marking as deleted in glance" is not really a thing AFAIK, so I'm curious about that
14:47:23 <abhishekk> that will be patchy solution :/
14:47:41 <dansmith> abhishekk: did you mean do what I suggested before, delete in glance, ignore the InUseByStore and make the hash task delete in glance when it finds out the image is deleted in glance?
14:48:09 <abhishekk> yeah
14:48:28 <dansmith> that's good too although it does have a leaky hole
14:48:45 <dansmith> if we mark as deleted and then the hash task crashes or stops because an operator kills it,
14:48:56 <dansmith> then we leak the image on rbd with no real good way to clean up (AFAIK)
14:49:24 <abhishekk> I think rbd deletes the image it does not keep it
14:49:45 <dansmith> why would it?
14:49:46 <abhishekk> it deletes the image and then raises InUse exception :/
14:49:52 <dansmith> oh right, that bug
14:49:55 <abhishekk> yeah
14:49:58 <dansmith> that's the buggy behavior,
14:50:08 <dansmith> but we can't depend on that forever,
14:50:14 <dansmith> so if they fix that behavior we start leaking
14:50:18 <abhishekk> exactly
14:50:58 <abhishekk> So should we stick with cancelling the hash computation before delete call goes to actual backend?
14:51:22 <dansmith> so, I was hoping that we could make delete move the image to trash only (if rosmaita is wrong). if it's unused it will go away immediately, and if it is, it will go away when finished (hash cancel aside)
14:51:35 <dansmith> so _that_ is what I was saying we should still investigate :)
14:52:19 <abhishekk> OK, I will write a POC for that
14:52:52 <abhishekk> So modified action plan
14:52:58 <abhishekk> 1, test with cinder
14:53:06 <abhishekk> 2, POC for moving image to trash
14:53:20 <abhishekk> 3, hash calculation cancellation before deleting the image
14:53:26 <dansmith> yep
14:53:27 <dansmith> the distributed import should be a good example for the hash cancellation thing
14:53:40 <abhishekk> IF 2 works then we can skip 3?
14:53:56 <croelandt> Ideally if 3 works can we skip 2? :)
14:53:57 <abhishekk> That depends on 1 I guess :P
14:54:05 <dansmith> idk, I think 3 is still worthwhile, but yeah.. only if 1 :)
14:54:28 <dansmith> croelandt: yeah, maybe we should just do 1, 3, and then 2 if 3 looks harder than we thought or something
14:54:33 <abhishekk> Ok, may be next Thursday we will have more data to decide on
14:54:53 <croelandt> yeah
14:54:59 <croelandt> let's move on
14:55:04 <croelandt> #topic Specs
14:55:12 <croelandt> On Monday I will merge https://review.opendev.org/c/openstack/glance-specs/+/947423
14:55:20 <croelandt> unless I see a -1 there :)
14:55:38 <dansmith> ugh, I should go review that
14:55:48 <croelandt> #topic One easy patch per core dev
14:55:54 <croelandt> #link https://review.opendev.org/c/openstack/glance/+/936319
14:56:00 <dansmith> I'm just really behind and swamped
14:56:07 <croelandt> ^ this is a simple patch by Takashi to remove a bunch of duplicated hacking checks
14:56:23 <croelandt> dansmith: yeah :-(
14:56:46 <abhishekk> We have mhen here I think, want to discuss something
14:56:55 <mhen> hi
14:56:56 <croelandt> I see there was a lengthy discussion about the spec
14:57:02 <croelandt> so feel free to go -1 if this has not been resolved
14:57:09 <croelandt> mhen: oh yeah, did you have something? I don't see a topic in the agenda
14:57:24 <mhen> no, just quick update for now
14:57:28 <mhen> I'm working on the image encryption again, currently looking into image import cases
14:57:32 <mhen> glance-direct from staging seems to be working fine and no impact, will look into cross-backend import of encrypted images next
14:57:45 <abhishekk> cool
14:58:05 <abhishekk> let us know if you need anything
14:58:09 <mhen> btw, noticed that `openstack image import --method glance-direct` has pretty bad UX: if the Glance API returns any Conflict or BadRequest (in glance/glance/api/v2/images.py there are a lot of cases for this!), the client simply ignores it and shows a GET output of the image stuck in "uploading" state, which can be repeated indefinitely
14:58:27 <mhen> even with `--debug` it only briefly shows the 409 but not the message
14:58:44 <croelandt> interesting
14:58:47 <croelandt> can you file a bug for that?
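Returning to item 2 of the action plan above (the rbd trash POC): a minimal sketch of what the new glance_store rbd call could look like, using trash_move from the rbd python bindings. The pool name is assumed, and whether trash_move succeeds while the image is still open for reading is exactly the open question rosmaita raised:

```python
# Sketch of an rbd trash-based delete; pool name and behavior with an
# in-use image are assumptions to be verified in the POC.
import rados
import rbd


def trash_image(image_name, pool='images', conffile='/etc/ceph/ceph.conf'):
    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            # Move the image to the trash instead of deleting it outright:
            # it disappears from the pool listing immediately and the data
            # is purged later, ideally even if a hash task still has it open.
            rbd.RBD().trash_move(ioctx, image_name, delay=0)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```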
14:58:50 <abhishekk> may be we need to look into that
14:59:05 <mhen> yes, I will put it on my todo list to file a bug
14:59:09 <abhishekk> could you possibly check with glance image-create-via-import as well?
14:59:22 <mhen> will try, noted
14:59:30 <abhishekk> cool, thank you!!
15:00:00 <croelandt> #topic Open Discussion
15:00:08 <croelandt> I won't be there for the next 2 Thursdays
15:00:16 <croelandt> so it's up to all of y'all whether there will be meetings :)
15:01:00 <abhishekk> Ok, I will chair the next meeting
15:01:16 <abhishekk> we will decide for the next one later
15:02:11 <croelandt> perfect
15:02:17 <croelandt> It's been a long one
15:02:22 <croelandt> see you on #openstack-glance :)
15:02:26 <croelandt> Thanks for joining
15:02:47 <abhishekk> thank you!
15:03:33 <croelandt> #endmeeting