openstackgerrit | Merged openstack/glance master: Change database migration version to wallaby https://review.opendev.org/c/openstack/glance/+/771907 | 00:17 |
*** k_mouza has joined #openstack-glance | 00:29 | |
*** k_mouza has quit IRC | 00:33 | |
*** lifeless has quit IRC | 00:55 | |
*** lifeless has joined #openstack-glance | 00:57 | |
*** priteau has quit IRC | 01:16 | |
*** baojg has joined #openstack-glance | 01:37 | |
*** gyee has quit IRC | 01:59 | |
*** nicolasbock_ has joined #openstack-glance | 02:21 | |
*** nicolasbock has quit IRC | 02:21 | |
*** irclogbot_2 has quit IRC | 02:21 | |
*** CeeMac has quit IRC | 02:21 | |
*** nicolasbock_ is now known as nicolasbock | 02:21 | |
*** jdillaman has quit IRC | 02:22 | |
*** jdillaman has joined #openstack-glance | 02:22 | |
*** CeeMac has joined #openstack-glance | 02:22 | |
*** irclogbot_2 has joined #openstack-glance | 02:23 | |
*** rcernin has quit IRC | 02:34 | |
*** rcernin has joined #openstack-glance | 02:36 | |
*** lifeless has quit IRC | 04:36 | |
*** lifeless has joined #openstack-glance | 04:38 | |
*** whoami-rajat__ has joined #openstack-glance | 04:59 | |
*** ratailor has joined #openstack-glance | 05:24 | |
*** udesale has joined #openstack-glance | 05:28 | |
*** udesale_ has joined #openstack-glance | 05:50 | |
*** udesale has quit IRC | 05:53 | |
*** k_mouza has joined #openstack-glance | 05:57 | |
*** k_mouza has quit IRC | 06:02 | |
*** baojg has joined #openstack-glance | 06:07 | |
*** rcernin has quit IRC | 06:18 | |
*** rcernin has joined #openstack-glance | 06:22 | |
*** baojg has quit IRC | 06:45 | |
*** udesale__ has joined #openstack-glance | 07:05 | |
*** m75abrams has joined #openstack-glance | 07:05 | |
*** udesale_ has quit IRC | 07:07 | |
*** udesale__ has quit IRC | 07:12 | |
*** rcernin has quit IRC | 07:42 | |
*** ralonsoh has joined #openstack-glance | 08:01 | |
* abhishekk will be back around 1400 UTC | 08:13 | |
*** rcernin has joined #openstack-glance | 08:15 | |
*** tkajinam has quit IRC | 08:58 | |
*** udesale has joined #openstack-glance | 10:20 | |
*** rcernin has quit IRC | 10:24 | |
*** udesale has quit IRC | 10:46 | |
*** baojg has joined #openstack-glance | 10:47 | |
*** udesale has joined #openstack-glance | 10:48 | |
*** k_mouza has joined #openstack-glance | 10:48 | |
*** rcernin has joined #openstack-glance | 10:50 | |
*** rcernin has quit IRC | 11:08 | |
*** udesale has quit IRC | 11:18 | |
*** k_mouza has quit IRC | 11:48 | |
*** k_mouza_ has joined #openstack-glance | 11:48 | |
*** ratailor has quit IRC | 11:51 | |
openstackgerrit | Felix Huettner proposed openstack/glance master: Fix missing backend deletion of disabled images https://review.opendev.org/c/openstack/glance/+/772872 | 12:23 |
*** priteau has joined #openstack-glance | 12:42 | |
whoami-rajat__ | hi dansmith abhishekk rosmaita, can you review these functional tests for cinder multiple stores? They've been sitting there for a long time for different reasons, but now the CI is finally passing. https://review.opendev.org/c/openstack/glance/+/750144 | 13:02 |
rosmaita | whoami-rajat__: ack | 13:05 |
whoami-rajat__ | thanks! | 13:12 |
*** udesale has joined #openstack-glance | 13:14 | |
*** baojg has quit IRC | 13:35 | |
abhishekk | whoami-rajat__, ack | 13:37 |
*** k_mouza has joined #openstack-glance | 13:43 | |
*** k_mouza_ has quit IRC | 13:46 | |
abhishekk | jokke, rosmaita, dansmith, smcginnis weekly meeting in 5 minutes at #openstack-meeting | 13:56 |
*** rajivmucheli has joined #openstack-glance | 14:05 | |
*** rajivmucheli has quit IRC | 14:06 | |
openstackgerrit | Pranali Deore proposed openstack/glance-tempest-plugin master: Add some test jobs for glance-tempest-plugin https://review.opendev.org/c/openstack/glance-tempest-plugin/+/772883 | 14:07 |
*** rajivmucheli has joined #openstack-glance | 14:07 | |
*** k_mouza has quit IRC | 14:42 | |
*** k_mouza has joined #openstack-glance | 14:43 | |
*** k_mouza has quit IRC | 14:45 | |
*** rcernin has joined #openstack-glance | 14:46 | |
*** k_mouza has joined #openstack-glance | 14:46 | |
*** rcernin has quit IRC | 15:00 | |
*** k_mouza has quit IRC | 15:09 | |
*** k_mouza_ has joined #openstack-glance | 15:09 | |
*** rajivmucheli has quit IRC | 15:14 | |
* abhishekk going for dinner | 15:23 | |
dansmith | gmann: this rbac thing actually broke the copy permissions that the job is setting out to test: https://review.opendev.org/c/openstack/glance/+/764074 | 15:42 |
gmann | dansmith: yeah, i opened it for debugging but forgot. thanks | 15:43 |
dansmith | okay, cool, just wasn't sure if it was obvious where the problem was | 15:43 |
gmann | yeah, something very suspicious there as it is breaking the existing policy override which it should not do | 15:45 |
dansmith | maybe because the job needs to enable that perm and it's doing it the old way? | 15:45 |
dansmith | well, it's good that we have at least one job with a policy override then :) | 15:45 |
gmann | yeah | 15:46 |
dansmith | echo $'{"copy_image": ""}' > /etc/glance/policy.json | 15:46 |
dansmith | is what it's doing | 15:46 |
gmann | ah yeah | 15:47 |
gmann | but policy.json should be picked if it exists, even though we changed the default | 15:47 |
dansmith | yeah, I haven't looked deeply at the patch so I dunno why | 15:48 |
dansmith | maybe because it's empty? | 15:48 |
dansmith | the perm? | 15:48 |
gmann | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f39/764074/9/check/nova-ceph-multistore/f39db38/controller/logs/etc/glance/policy.json | 15:48 |
gmann | so file is there with expected policy | 15:49 |
dansmith | that's not json | 15:49 |
dansmith | oh wait, | 15:49 |
dansmith | maybe firefox is rendering it for me | 15:49 |
dansmith | yeah, heh | 15:49 |
gmann | {"copy_image": ""} is there | 15:49 |
dansmith | yeah, firefox parsed it for me when I loaded it initially, so I wasn't seeing actual json | 15:50 |
gmann | hope devstack is not setting it in config | 15:51 |
gmann | i mean yaml | 15:51 |
dansmith | meaning creating an empty yaml file so this gets ignored? | 15:52 |
dansmith | I don't see one in etc/glance | 15:52 |
gmann | yeah but nothing here so all good until this - https://zuul.opendev.org/t/openstack/build/f39db383d98941cc998a0f25bfb0c15e/log/controller/logs/etc/glance/glance-api_conf.txt | 15:52 |
gmann | https://zuul.opendev.org/t/openstack/build/f39db383d98941cc998a0f25bfb0c15e/log/controller/logs/screen-g-api.txt#392 | 15:54 |
gmann | dansmith: ^^ it is looking for policy.yaml | 15:54 |
*** m75abrams has quit IRC | 15:55 | |
dansmith | gmann: so not falling back to json as it should? | 16:01 |
gmann | yeah, I think I need to add more logging to this, as there are a lot of if/else conditions for picking the file https://github.com/openstack/oslo.policy/blob/master/oslo_policy/policy.py#L363 | 16:01 |
gmann | let me add more logging and see why the fallback is not working | 16:02 |
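The fallback behavior being debugged here (prefer policy.yaml, fall back to policy.json) can be pictured with a minimal sketch. This is illustrative only, using the file names from the log; it is not the actual oslo.policy implementation, which has many more conditions, as gmann notes.

```python
# Illustrative sketch (NOT the real oslo.policy code) of the expected
# fallback: load policy.yaml if present, else fall back to policy.json.
import os


def pick_policy_file(config_dir="/etc/glance",
                     preferred="policy.yaml",
                     fallback="policy.json"):
    """Return the policy file path to load, or None if neither exists."""
    for name in (preferred, fallback):
        path = os.path.join(config_dir, name)
        if os.path.exists(path):
            return path
    return None
```

Under this expectation, a deployment that only writes `{"copy_image": ""}` to policy.json (as the job does) should still have its override picked up; the bug under discussion is that the real code was not falling back this way.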
* dansmith nods | 16:02 | |
dansmith | abhishekk: lmk when you're back from dinner | 16:09 |
*** mgagne has joined #openstack-glance | 16:10 | |
*** udesale has quit IRC | 16:17 | |
abhishekk | dansmith, I am now | 16:20 |
dansmith | abhishekk: so I've been looking at this situation where we have nodes that staged an image and then the image was deleted or imported from another | 16:21 |
dansmith | like if someone doesn't have a shared dir for staging, | 16:21 |
dansmith | or one node stages the image and then the image is deleted without being able to delete the staging file | 16:21 |
dansmith | I think we discussed this briefly back when doing the copy-to-store stuff | 16:21 |
dansmith | but as I mentioned, I confirmed that it's a problem, would you agree that's a bug? | 16:22 |
abhishekk | yes | 16:22 |
dansmith | if it happens persistently, you could end up filling the disk with abandoned images | 16:22 |
dansmith | okay, so I will file a bug for that | 16:22 |
abhishekk | ack | 16:22 |
dansmith | abhishekk: so here's my next question, related to this staging stuff: | 16:22 |
dansmith | why is staging a glance_store at all? | 16:23 |
dansmith | the one benefit I see is that we can do a store copy from $staging to $dest | 16:23 |
dansmith | but otherwise, alllll the code that deals with staging seems to have to snip off the file:/// and use os-level commands on the files themselves | 16:23 |
dansmith | and | 16:23 |
dansmith | AFAICT, we can never have staging be anything other than file-on-disk, if we want things like format inspect (during import) and format conversion to work | 16:24 |
abhishekk | as far as I recollect, we were using the upload call to upload the data to the staging area | 16:24 |
dansmith | and even if you discount those, making staging be a network location like swift would be massively inefficient because you could copy across the network multiple times | 16:24 |
jokke | I'd disagree with it being a bug; it's problematic, but the requirements are clearly laid down in the documentation for deployments to avoid it | 16:24 |
abhishekk | kind of known issue | 16:25 |
jokke | that it is | 16:25 |
dansmith | seems like a bug to me, because it can happen even if properly config'd, but nfs gets unmounted or something similar | 16:25 |
abhishekk | and that's why we are recommending to use web-download instead of glance-direct at the moment | 16:26 |
openstackgerrit | Ghanshyam proposed openstack/glance master: DNM: testing json->yaml migration failure https://review.opendev.org/c/openstack/glance/+/772912 | 16:26 |
dansmith | I forget, does web download download it first to staging? | 16:26 |
jokke | it's kind of like saying that VM stopping running because server breaks is nova bug | 16:27 |
abhishekk | web-download downloads to staging first, but all requests go to the same node | 16:27 |
dansmith | jokke: leaking resources is not the same as a vm crashing | 16:27 |
dansmith | abhishekk: but if I power off that worker in the middle of web-download, then when it starts back up, the image could be deleted by the user and the image will still be in staging on that node, forever right? | 16:28 |
abhishekk | yes | 16:29 |
dansmith | jokke: would you agree that is a bug? | 16:29 |
jokke | For the very specific case where we don't have means to automatically clean up after an unclean shutdown, I think the actual bug is that we have no means to resume where we left off | 16:31 |
jokke | whether the resume then completes or reverts is a different story | 16:31 |
dansmith | no, | 16:32 |
dansmith | I'm saying the following sequence: | 16:32 |
dansmith | 1. I start a web-download import, which downloads the image to staging and is in the process of uploading to, say, ceph | 16:32 |
dansmith | 2. The node running the import loses power | 16:32 |
dansmith | 3. I lose patience and delete the image via another worker which still has power | 16:32 |
dansmith | 4. That worker comes back online and leaks 1TB of data until the operator notices | 16:33 |
dansmith | (1TB because I get to choose the image size :) | 16:33 |
jokke | worst case it leaks almost 2TB | 16:33 |
dansmith | what we should do, is on startup scan our staging area, and if we find any images there that *also* have a deleted image record, we delete the staging file | 16:34 |
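The startup scan dansmith proposes can be sketched in a few lines. This is a hedged illustration, not Glance code: the staging layout (one file per image, named by image id) and the `get_image_status` lookup callback are assumptions made for the example.

```python
# Hypothetical sketch of the proposed startup cleanup: walk the local
# staging directory and remove residue whose image record is deleted
# (or no longer exists). Layout and callback are illustrative assumptions.
import os


def clean_staging_residue(staging_dir, get_image_status):
    """Remove staged files whose image record is deleted or gone.

    get_image_status(image_id) is assumed to return the record's status
    string, or None if the record no longer exists.
    """
    removed = []
    for image_id in os.listdir(staging_dir):
        if get_image_status(image_id) in (None, "deleted"):
            os.unlink(os.path.join(staging_dir, image_id))
            removed.append(image_id)
    return removed
```

The point of anchoring this at startup is that it runs exactly when the node knows it may have died mid-import, which is the window in which the 1TB-leak scenario above arises.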
jokke | as the location only gets registered when the upload finishes, so you will have the data in staging and partially in the store (I got to choose how far the upload got before the power went down) | 16:34 |
dansmith | presumably the delete of the image would have seen the partial location for the image and deleted that from ceph too, right? | 16:34 |
dansmith | so no leak of the 0.9TB we already wrote | 16:34 |
dansmith | well, that's a problem too then | 16:35 |
dansmith | I thought it created the location active=false so it could store the destination location, but if not that's another problem I guess | 16:35 |
dansmith | so, does that mean you're on board with this as a bug now? | 16:35 |
jokke | which one, that we allow the user to delete the image while there is supposedly an import going on, or that we don't proactively delete user data? ;) | 16:37 |
*** coreycb has quit IRC | 16:37 | |
dansmith | do we block the delete of the in-progress image? | 16:37 |
*** coreycb has joined #openstack-glance | 16:38 | |
jokke | Currently we don't; we just deal with it, but that's one way to approach the problem | 16:38 |
dansmith | I think that users find not being able to delete things pretty annoying | 16:38 |
dansmith | and generally, the philosophy is that you can delete anything in any state, and the server collects it if something is in progress | 16:39 |
dansmith | especially for an operation like this where we're deleting it because it's stuck or not progressing... | 16:39 |
jokke | there is indeed a bug somewhere there that we know of, but which is the actual bug (that we let the user delete a record we are supposedly processing, or that we don't automatically delete data on startup) is the debatable part of it | 16:39 |
dansmith | right now you can't get the delete back to the original worker, | 16:39 |
dansmith | which means you can't let them ever delete it even after the node comes back :) | 16:40 |
dansmith | to your other point, I don't think that deleting from the staging dir is "deleting user data" | 16:41 |
*** dosaboy has quit IRC | 16:41 | |
*** dosaboy has joined #openstack-glance | 16:41 | |
dansmith | but, if I work on a fix to make nodes clean their staging dir on restart, is that going to be acceptable? (cc: abhishekk ) | 16:42 |
abhishekk | at the moment I think this is the best possible solution we could have | 16:42 |
dansmith | I agree, it's very "eventual consistency" | 16:43 |
dansmith | which is supposedly what we're doing here | 16:43 |
dansmith | much like the OS doesn't scrub your bits from disk before unlink() returns | 16:43 |
jokke | Just out of curiosity, and not pointing fingers in any direction: back to the Nova example I made, when nova comes back up and sees that the VM isn't running, will it just go and delete the ephemeral data used by that VM, or how are you dealing with these cases there? | 16:43 |
dansmith | it absolutely does | 16:44 |
dansmith | if there is a matching instance record that is now deleted | 16:44 |
dansmith | so, | 16:44 |
dansmith | instance is running, compute crashes, client decides they'll recreate elsewhere, so they delete the original | 16:44 |
dansmith | we can't actually delete when the node is down, but we let them nuke the record right then | 16:44 |
dansmith | nova starts back up, scans the instance records, disk images, libvirt domains, and cleans up after the delete | 16:45 |
dansmith | there is a config that lets you tell it to leave those things in place, for forensics if you want, but it defaults to "reap" | 16:46 |
dansmith | since staging is temporary space, I'm not sure the same makes sense for glance, but if that's important to you then that's fine | 16:46 |
jokke | So this is exactly one of the reasons why I think it's important to have the location record for the staging data. 1) if delayed delete is enabled, we will let the user nuke the image and the scrubber will eventually clean it up, or 2) if the data cannot be deleted because the store is not available (aka the host where the data is sitting), the delete won't go through before we can act on it | 16:47 |
dansmith | I thought the scrubber was broken and potentially deprecated? | 16:49 |
dansmith | during the wsgi discussions, I asked abhishekk about not being able to make it work and I thought it was needing some love | 16:49 |
jokke | Well that's a bug | 16:50 |
jokke | I wasn't aware of | 16:50 |
abhishekk | the scrubber hasn't been tested in ages, at least not by me | 16:50 |
dansmith | so, the scrubber does what? deletes user data from actual stores when the image record is deleted instead of starting that synchronously with the image delete? | 16:51 |
dansmith | abhishekk: it was crashing when I tried it | 16:51 |
abhishekk | ack | 16:51 |
abhishekk | the concept of delayed delete is: if it is enabled, the image is soft-deleted when the user deletes it, and later the scrubber picks up those images and deletes them from the backend | 16:52 |
abhishekk | it was added so that the user could restore the image while it is still in the pending_delete state (soft deleted), but this restore part was never addressed | 16:53 |
jokke | yup, so the image status goes to deleted, the metadata gets whacked and the actual location removal from stores will happen out of band instead of keeping the client hostage for it | 16:53 |
dansmith | right, so if the scrubber was working does it properly handle stores that are non-local? | 16:53 |
jokke | no! the restore was never a real thing and was tried much later as a side effect of it | 16:54 |
abhishekk | ok | 16:54 |
jokke | it was implemented because synchronous deletes of big images were as taxing on the clients as uploads | 16:54 |
dansmith | like, does it say "well, I see a location for this, but I can't access it because it's not on my local disk, so I better leave this location in place so another node will try to clean it" ? | 16:54 |
jokke | and held the client hostage until the data was deleted from the store | 16:54 |
jokke | dansmith: the scrubber basically went and tried to delete the locations; if it failed, it just hopped to the next image on the list and tried again on the next cycle. Once all the locations of a deleted image were whacked, it stopped trying and the image was considered deleted | 16:57 |
jokke | everything else is cleaned during the actual delete call | 16:57 |
jokke | This allowed, for example, ceph-backed images with CoW usage to be deleted; the data was finally whacked when the ceph reference count dropped to 0 and ceph allowed the data to be deleted | 16:58 |
jokke | So you could delete an image without killing a VM still running from that image | 16:59 |
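The scrubber cycle jokke describes (try every location of every pending-delete image, keep the failures for the next pass, consider the image deleted once all locations are gone) can be sketched roughly like this. All names and data shapes here are hypothetical, not the real glance-scrubber code:

```python
# Rough sketch of the described scrubber pass. Each pending image carries
# a list of store locations; failed deletions simply stay queued for the
# next cycle. Data shapes are illustrative assumptions.
def scrub_cycle(pending_images, delete_location):
    """Run one scrubber pass; return the images still pending."""
    still_pending = []
    for image in pending_images:
        remaining = []
        for loc in image["locations"]:
            try:
                delete_location(loc)
            except Exception:
                remaining.append(loc)  # retry on the next cycle
        if remaining:
            image["locations"] = remaining
            still_pending.append(image)
        # otherwise: all locations whacked, image considered deleted
    return still_pending
```

This also shows why, as discussed below, solving the staging residue via location records effectively requires a working scrubber: something has to keep retrying the locations that could not be reached.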
dansmith | so, if we go the location route to solve this staging residue bug, then people without delayed delete have to turn on and run the scrubber to periodically scan everything in order to avoid leaking storage right? | 17:00 |
dansmith | as opposed to just cleaning the local staging dir on startup, right after we know we could have had a problem | 17:01 |
dansmith | but it sounds like the staging leak is a legit bug, regardless of the solution | 17:01 |
abhishekk | either way IMO it is beneficial to scan staging_dir and clean at the startup | 17:02 |
dansmith | agreed | 17:02 |
*** gyee has joined #openstack-glance | 17:02 | |
dansmith | I tend to think that it would also be good to clean files there that are related to import-unfinished image records, not only deleted | 17:02 |
dansmith | because if someone didn't delete the image while we were down, we still want to clean up the mess in staging | 17:03 |
jokke | I've been thinking that ^^ | 17:03 |
jokke | just wondering if we have a meaningful way to resume or if we just need to go and clean up ... the benefit of resume is that we would not be whacking user data without their agreement. | 17:05 |
dansmith | it's staging data.. it's /tmp | 17:05 |
dansmith | resume is always nice, but unless or until you're going to make that work reliably, I'd rather have a clean staging dir :) | 17:06 |
abhishekk | and that will be a lot of work as well | 17:07 |
dansmith | so should I spend time on this or not? | 17:12 |
abhishekk | yes | 17:13 |
jokke | dansmith: somehow this feels like you're looking to go to incredible lengths of work to avoid us having an actual record that there is data related to the image record that has not been cleaned up and will require action. Or am I missing something? | 17:13 |
dansmith | I assume you're talking about the location for staging data, | 17:16 |
dansmith | but as I noted above, that only helps if we're going to require people to turn on the scrubber (and fix the scrubber, and make sure the scrubber does the right thing), in order to avoid the leak | 17:16 |
jokke | Also, should we go with the locations route, be it Wallaby or Xena, all that work will be void, as we definitely should not go and delete data when we do have a location record for it | 17:17 |
dansmith | if that's the plan, then that's cool, but I think there's a lot of work in front of us for that to happen | 17:17 |
dansmith | in staging when we have a deleted image? I think that's totally fine | 17:17 |
dansmith | but okay, I'm hearing "don't work on this" so I won't | 17:17 |
abhishekk | jokke, even if we go with the locations approach, I think we need this check on startup: if delayed delete is not enabled, then clear the staging area?? | 17:18 |
jokke | abhishekk: there won't be deleted images with data left behind as the delete will fail if the location can't be removed. So that code should never be executed in either case | 17:20 |
dansmith | images should always be deletable | 17:21 |
jokke | dansmith: We can debate that with the Images API v3 spec, but that is not the case with v1 or v2 ;) | 17:22 |
dansmith | yep, well, I'll just drop it | 17:23 |
dansmith | I'd like to file the bug to document the concern, you can close it and I'll leave it alone | 17:23 |
jokke | dansmith: sounds good, we can either solve the bug in the refactoring happening on the distributed import to clean up the debt we've accumulated or if that does not happen then we have bug to follow up on | 17:26 |
jokke | Like I totally agree we need a solution to clean up after ourselves, I just don't think you running in the opposite direction again right now is a good use of either of our time | 17:27 |
* abhishekk going offline for the day | 17:38 | |
abhishekk | dansmith, jokke ^^ | 17:38 |
dansmith | o/ | 17:38 |
jokke | night o/~ | 17:38 |
abhishekk | o/~ | 17:38 |
*** k_mouza_ has quit IRC | 19:08 | |
*** k_mouza has joined #openstack-glance | 19:48 | |
*** k_mouza has quit IRC | 19:54 | |
*** whoami-rajat__ has quit IRC | 19:58 | |
*** rcernin has joined #openstack-glance | 20:03 | |
*** rcernin has quit IRC | 20:30 | |
*** rcernin has joined #openstack-glance | 20:41 | |
*** ralonsoh has quit IRC | 20:48 | |
*** zzzeek has quit IRC | 20:57 | |
*** zzzeek has joined #openstack-glance | 20:57 | |
*** rcernin has quit IRC | 21:34 | |
*** rcernin has joined #openstack-glance | 21:51 | |
*** k_mouza has joined #openstack-glance | 23:09 | |
*** k_mouza has quit IRC | 23:14 | |
*** knikolla has quit IRC | 23:36 | |
*** knikolla has joined #openstack-glance | 23:36 | |
*** PrinzElvis_ has joined #openstack-glance | 23:37 | |
*** PrinzElvis has quit IRC | 23:37 | |
*** PrinzElvis_ is now known as PrinzElvis | 23:37 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!