openstackgerrit | Merged openstack/glance master: Change database migration version to wallaby https://review.opendev.org/c/openstack/glance/+/771907 | 00:17 |
*** k_mouza has joined #openstack-glance | 00:29 | |
*** k_mouza has quit IRC | 00:33 | |
*** lifeless has quit IRC | 00:55 | |
*** lifeless has joined #openstack-glance | 00:57 | |
*** priteau has quit IRC | 01:16 | |
*** baojg has joined #openstack-glance | 01:37 | |
*** gyee has quit IRC | 01:59 | |
*** nicolasbock_ has joined #openstack-glance | 02:21 | |
*** nicolasbock has quit IRC | 02:21 | |
*** irclogbot_2 has quit IRC | 02:21 | |
*** CeeMac has quit IRC | 02:21 | |
*** nicolasbock_ is now known as nicolasbock | 02:21 | |
*** jdillaman has quit IRC | 02:22 | |
*** jdillaman has joined #openstack-glance | 02:22 | |
*** CeeMac has joined #openstack-glance | 02:22 | |
*** irclogbot_2 has joined #openstack-glance | 02:23 | |
*** rcernin has quit IRC | 02:34 | |
*** rcernin has joined #openstack-glance | 02:36 | |
*** lifeless has quit IRC | 04:36 | |
*** lifeless has joined #openstack-glance | 04:38 | |
*** whoami-rajat__ has joined #openstack-glance | 04:59 | |
*** ratailor has joined #openstack-glance | 05:24 | |
*** udesale has joined #openstack-glance | 05:28 | |
*** udesale_ has joined #openstack-glance | 05:50 | |
*** udesale has quit IRC | 05:53 | |
*** k_mouza has joined #openstack-glance | 05:57 | |
*** k_mouza has quit IRC | 06:02 | |
*** baojg has joined #openstack-glance | 06:07 | |
*** rcernin has quit IRC | 06:18 | |
*** rcernin has joined #openstack-glance | 06:22 | |
*** baojg has quit IRC | 06:45 | |
*** udesale__ has joined #openstack-glance | 07:05 | |
*** m75abrams has joined #openstack-glance | 07:05 | |
*** udesale_ has quit IRC | 07:07 | |
*** udesale__ has quit IRC | 07:12 | |
*** rcernin has quit IRC | 07:42 | |
*** ralonsoh has joined #openstack-glance | 08:01 | |
* abhishekk will be back around 1400 UTC | 08:13 | |
*** rcernin has joined #openstack-glance | 08:15 | |
*** tkajinam has quit IRC | 08:58 | |
*** udesale has joined #openstack-glance | 10:20 | |
*** rcernin has quit IRC | 10:24 | |
*** udesale has quit IRC | 10:46 | |
*** baojg has joined #openstack-glance | 10:47 | |
*** udesale has joined #openstack-glance | 10:48 | |
*** k_mouza has joined #openstack-glance | 10:48 | |
*** rcernin has joined #openstack-glance | 10:50 | |
*** rcernin has quit IRC | 11:08 | |
*** udesale has quit IRC | 11:18 | |
*** k_mouza has quit IRC | 11:48 | |
*** k_mouza_ has joined #openstack-glance | 11:48 | |
*** ratailor has quit IRC | 11:51 | |
openstackgerrit | Felix Huettner proposed openstack/glance master: Fix missing backend deletion of disabled images https://review.opendev.org/c/openstack/glance/+/772872 | 12:23 |
*** priteau has joined #openstack-glance | 12:42 | |
whoami-rajat__ | hi dansmith abhishekk rosmaita, can you review these functional tests for cinder multiple stores? They've been sitting there for a long time for different reasons, but now the CI is finally passing. https://review.opendev.org/c/openstack/glance/+/750144 | 13:02 |
rosmaita | whoami-rajat__: ack | 13:05 |
whoami-rajat__ | thanks! | 13:12 |
*** udesale has joined #openstack-glance | 13:14 | |
*** baojg has quit IRC | 13:35 | |
abhishekk | whoami-rajat__, ack | 13:37 |
*** k_mouza has joined #openstack-glance | 13:43 | |
*** k_mouza_ has quit IRC | 13:46 | |
abhishekk | jokke, rosmaita, dansmith, smcginnis weekly meeting in 5 minutes at #openstack-meeting | 13:56 |
*** rajivmucheli has joined #openstack-glance | 14:05 | |
*** rajivmucheli has quit IRC | 14:06 | |
openstackgerrit | Pranali Deore proposed openstack/glance-tempest-plugin master: Add some test jobs for glance-tempest-plugin https://review.opendev.org/c/openstack/glance-tempest-plugin/+/772883 | 14:07 |
*** rajivmucheli has joined #openstack-glance | 14:07 | |
*** k_mouza has quit IRC | 14:42 | |
*** k_mouza has joined #openstack-glance | 14:43 | |
*** k_mouza has quit IRC | 14:45 | |
*** rcernin has joined #openstack-glance | 14:46 | |
*** k_mouza has joined #openstack-glance | 14:46 | |
*** rcernin has quit IRC | 15:00 | |
*** k_mouza has quit IRC | 15:09 | |
*** k_mouza_ has joined #openstack-glance | 15:09 | |
*** rajivmucheli has quit IRC | 15:14 | |
* abhishekk going for dinner | 15:23 | |
dansmith | gmann: this rbac thing actually broke the copy permissions that the job is setting out to test: https://review.opendev.org/c/openstack/glance/+/764074 | 15:42 |
gmann | dansmith: yeah, i opened it for debugging but forgot. thanks | 15:43 |
dansmith | okay, cool, just wasn't sure if it was obvious where the problem was | 15:43 |
gmann | yeah, something very suspicious there as it is breaking the existing policy override which it should not do | 15:45 |
dansmith | maybe because the job needs to enable that perm and it's doing it the old way? | 15:45 |
dansmith | well, it's good that we have at least one job with a policy override then :) | 15:45 |
gmann | yeah | 15:46 |
dansmith | echo $'{"copy_image": ""}' > /etc/glance/policy.json | 15:46 |
dansmith | is what it's doing | 15:46 |
gmann | ah yeah | 15:47 |
gmann | but policy.json should be picked if it exists, even though we changed the default | 15:47 |
dansmith | yeah, I haven't looked deeply at the patch so I dunno why | 15:48 |
dansmith | maybe because it's empty? | 15:48 |
dansmith | the perm? | 15:48 |
gmann | https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f39/764074/9/check/nova-ceph-multistore/f39db38/controller/logs/etc/glance/policy.json | 15:48 |
gmann | so file is there with expected policy | 15:49 |
dansmith | that's not json | 15:49 |
dansmith | oh wait, | 15:49 |
dansmith | maybe firefox is rendering it for me | 15:49 |
dansmith | yeah, heh | 15:49 |
gmann | {"copy_image": ""} is there | 15:49 |
dansmith | yeah, firefox parsed it for me when I loaded it initially, so I wasn't seeing actual json | 15:50 |
gmann | hope devstack is not setting it in config | 15:51 |
gmann | i mean yaml | 15:51 |
dansmith | meaning creating an empty yaml file so this gets ignored? | 15:52 |
dansmith | I don't see one in etc/glance | 15:52 |
gmann | yeah but nothing here so all good until this - https://zuul.opendev.org/t/openstack/build/f39db383d98941cc998a0f25bfb0c15e/log/controller/logs/etc/glance/glance-api_conf.txt | 15:52 |
gmann | https://zuul.opendev.org/t/openstack/build/f39db383d98941cc998a0f25bfb0c15e/log/controller/logs/screen-g-api.txt#392 | 15:54 |
gmann | dansmith: ^^ it is looking for policy.yaml | 15:54 |
*** m75abrams has quit IRC | 15:55 | |
dansmith | gmann: so not falling back to json as it should? | 16:01 |
gmann | yeah, I think I need to add more logging to this, as there are a lot of if/else conditions for picking the file https://github.com/openstack/oslo.policy/blob/master/oslo_policy/policy.py#L363 | 16:01 |
gmann | let me add more logging and see why the fallback is not working | 16:02 |
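The fallback behavior being debugged here (prefer policy.yaml, fall back to policy.json) can be pictured with a minimal sketch. This is illustrative only, using the file names from the log; it is not the actual oslo.policy implementation, which has many more conditions, as gmann notes.

```python
# Illustrative sketch (NOT the real oslo.policy code) of the expected
# fallback: load policy.yaml if present, else fall back to policy.json.
import os


def pick_policy_file(config_dir="/etc/glance",
                     preferred="policy.yaml",
                     fallback="policy.json"):
    """Return the policy file path to load, or None if neither exists."""
    for name in (preferred, fallback):
        path = os.path.join(config_dir, name)
        if os.path.exists(path):
            return path
    return None
```

Under this expectation, a deployment that only writes `{"copy_image": ""}` to policy.json (as the job does) should still have its override picked up; the bug under discussion is that the real code was not falling back this way.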
* dansmith nods | 16:02 | |
dansmith | abhishekk: lmk when you're back from dinner | 16:09 |
*** mgagne has joined #openstack-glance | 16:10 | |
*** udesale has quit IRC | 16:17 | |
abhishekk | dansmith, I am now | 16:20 |
dansmith | abhishekk: so I've been looking at this situation where we have nodes that staged an image and then the image was deleted or imported from another | 16:21 |
dansmith | like if someone doesn't have a shared dir for staging, | 16:21 |
dansmith | or one node stages the image and then the image is deleted without being able to delete the staging file | 16:21 |
dansmith | I think we discussed this briefly back when doing the copy-to-store stuff | 16:21 |
dansmith | but as I mentioned, I confirmed that it's a problem, would you agree that's a bug? | 16:22 |
abhishekk | yes | 16:22 |
dansmith | if it happens persistently, you could end up filling the disk with abandoned images | 16:22 |
dansmith | okay, so I will file a bug for that | 16:22 |
abhishekk | ack | 16:22 |
dansmith | abhishekk: so here's my next question, related to this staging stuff: | 16:22 |
dansmith | why is staging a glance_store at all? | 16:23 |
dansmith | the one benefit I see is that we can do a store copy from $staging to $dest | 16:23 |
dansmith | but otherwise, alllll the code that deals with staging seems to have to snip off the file:/// and use os-level commands on the files themselves | 16:23 |
dansmith | and | 16:23 |
dansmith | AFAICT, we can never have staging be anything other than file-on-disk, if we want things like format inspect (during import) and format conversion to work | 16:24 |
abhishekk | as far as I recollect, we were using the upload call to upload the data to the staging area | 16:24 |
dansmith | and even if you discount those, making staging be a network location like swift would be massively inefficient because you could copy across the network multiple times | 16:24 |
jokke | I'd disagree with it being a bug; it's problematic, but the requirements are clearly laid down in the documentation for deployments to avoid it | 16:24 |
abhishekk | kind of known issue | 16:25 |
jokke | that it is | 16:25 |
dansmith | seems like a bug to me, because it can happen even if properly config'd, but nfs gets unmounted or something similar | 16:25 |
abhishekk | and that's why we are recommending to use web-download instead of glance-direct at the moment | 16:26 |
openstackgerrit | Ghanshyam proposed openstack/glance master: DNM: testing json->yaml migration failure https://review.opendev.org/c/openstack/glance/+/772912 | 16:26 |
dansmith | I forget, does web download download it first to staging? | 16:26 |
jokke | it's kind of like saying that VM stopping running because server breaks is nova bug | 16:27 |
abhishekk | web-download downloads to staging first, but all requests go to the same node | 16:27 |
dansmith | jokke: leaking resources is not the same as a vm crashing | 16:27 |
dansmith | abhishekk: but if I power off that worker in the middle of web-download, then when it starts back up, the image could be deleted by the user and the image will still be in staging on that node, forever right? | 16:28 |
abhishekk | yes | 16:29 |
dansmith | jokke: would you agree that is a bug? | 16:29 |
jokke | For the very specific case where we don't have means to automatically clean up after an unclean shutdown, I think the actual bug is that we have no means to resume where we left off | 16:31 |
jokke | whether the resume then completes or reverts is a different story | 16:31 |
dansmith | no, | 16:32 |
dansmith | I'm saying the following sequence: | 16:32 |
dansmith | 1. I start a web-download import, which downloads the image to staging and is in the process of uploading to, say, ceph | 16:32 |
dansmith | 2. The node running the import loses power | 16:32 |
dansmith | 3. I lose patience and delete the image via another worker which still has power | 16:32 |
dansmith | 4. That worker comes back online and leaks 1TB of data until the operator notices | 16:33 |
dansmith | (1TB because I get to choose the image size :) | 16:33 |
jokke | worst case it leaks almost 2TB | 16:33 |
dansmith | what we should do, is on startup scan our staging area, and if we find any images there that *also* have a deleted image record, we delete the staging file | 16:34 |
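The startup scan dansmith proposes can be sketched in a few lines. This is a hedged illustration, not Glance code: the staging layout (one file per image, named by image id) and the `get_image_status` lookup callback are assumptions made for the example.

```python
# Hypothetical sketch of the proposed startup cleanup: walk the local
# staging directory and remove residue whose image record is deleted
# (or no longer exists). Layout and callback are illustrative assumptions.
import os


def clean_staging_residue(staging_dir, get_image_status):
    """Remove staged files whose image record is deleted or gone.

    get_image_status(image_id) is assumed to return the record's status
    string, or None if the record no longer exists.
    """
    removed = []
    for image_id in os.listdir(staging_dir):
        if get_image_status(image_id) in (None, "deleted"):
            os.unlink(os.path.join(staging_dir, image_id))
            removed.append(image_id)
    return removed
```

The point of anchoring this at startup is that it runs exactly when the node knows it may have died mid-import, which is the window in which the 1TB-leak scenario above arises.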
jokke | as the location only gets registered when the upload finishes, so you will have the data in staging and partially in the store (I got to choose how far the upload got before the power went down) | 16:34 |
dansmith | presumably the delete of the image would have seen the partial location for the image and deleted that from ceph too, right? | 16:34 |
dansmith | so no leak of the 0.9TB we already wrote | 16:34 |
dansmith | well, that's a problem too then | 16:35 |
dansmith | I thought it created the location active=false so it could store the destination location, but if not that's another problem I guess | 16:35 |
dansmith | so, does that mean you're on board with this as a bug now? | 16:35 |
jokke | which one, that we allow the user to delete the image while there is supposedly an import going on, or that we don't proactively delete user data? ;) | 16:37 |
*** coreycb has quit IRC | 16:37 | |
dansmith | do we block the delete of the in-progress image? | 16:37 |
*** coreycb has joined #openstack-glance | 16:38 | |
jokke | Currently we don't; we just deal with it, but that's one way to approach the problem | 16:38 |
dansmith | I think that users find not being able to delete things pretty annoying | 16:38 |
dansmith | and generally, the philosophy is that you can delete anything in any state, and the server collects it if something is in progress | 16:39 |
dansmith | especially for an operation like this where we're deleting it because it's stuck or not progressing... | 16:39 |
jokke | there is indeed a bug somewhere there that we know of, but which is the actual bug (that we let the user delete a record we are supposedly processing, or that we don't automatically delete data on startup) is the debatable part of it | 16:39 |
dansmith | right now you can't get the delete back to the original worker, | 16:39 |
dansmith | which means you can't let them ever delete it even after the node comes back :) | 16:40 |
dansmith | to your other point, I don't think that deleting from the staging dir is "deleting user data" | 16:41 |
*** dosaboy has quit IRC | 16:41 | |
*** dosaboy has joined #openstack-glance | 16:41 | |
dansmith | but, if I work on a fix to make nodes clean their staging dir on restart, is that going to be acceptable? (cc: abhishekk ) | 16:42 |
abhishekk | at the moment I think this is the best possible solution we could have | 16:42 |
dansmith | I agree, it's very "eventual consistency" | 16:43 |
dansmith | which is supposedly what we're doing here | 16:43 |
dansmith | much like the OS doesn't scrub your bits from disk before unlink() returns | 16:43 |
jokke | Just out of curiosity, and not pointing fingers in any direction: back to the Nova example I made, when nova comes back up and sees that the VM isn't running, will it just go and delete the ephemeral data used by that VM, or how are you dealing with these cases there? | 16:43 |
dansmith | it absolutely does | 16:44 |
dansmith | if there is a matching instance record that is now deleted | 16:44 |
dansmith | so, | 16:44 |
dansmith | instance is running, compute crashes, client decides they'll recreate elsewhere, so they delete the original | 16:44 |
dansmith | we can't actually delete when the node is down, but we let them nuke the record right then | 16:44 |
dansmith | nova starts back up, scans the instance records, disk images, libvirt domains, and cleans up after the delete | 16:45 |
dansmith | there is a config that lets you tell it to leave those things in place, for forensics if you want, but it defaults to "reap" | 16:46 |
dansmith | since staging is temporary space, I'm not sure the same makes sense for glance, but if that's important to you then that's fine | 16:46 |
jokke | So this is exactly one of the reasons why I think it's important to have the location record for the staging data. 1) if delayed delete is enabled, we will let the user nuke the image and the scrubber will eventually clean it up, or 2) if the data cannot be deleted because the store is not available (aka the host where the data is sitting), the delete won't go through before we can act on it | 16:47 |
dansmith | I thought the scrubber was broken and potentially deprecated? | 16:49 |
dansmith | during the wsgi discussions, I asked abhishekk about not being able to make it work and I thought it was needing some love | 16:49 |
jokke | Well that's a bug | 16:50 |
jokke | I wasn't aware of | 16:50 |
abhishekk | the scrubber hasn't been tested in ages, at least not by me | 16:50 |
dansmith | so, the scrubber does what? deletes user data from actual stores when the image record is deleted instead of starting that synchronously with the image delete? | 16:51 |
dansmith | abhishekk: it was crashing when I tried it | 16:51 |
abhishekk | ack | 16:51 |
abhishekk | the concept of delayed delete is: if it is enabled, the image is soft-deleted when the user deletes it, and later the scrubber picks up those images and deletes them from the backend | 16:52 |
abhishekk | it was added so that the user could restore the image while it is still in the pending_delete state (soft deleted), but this restore part was never addressed | 16:53 |
jokke | yup, so the image status goes to deleted, the metadata gets whacked and the actual location removal from stores will happen out of band instead of keeping the client hostage for it | 16:53 |
dansmith | right, so if the scrubber was working does it properly handle stores that are non-local? | 16:53 |
jokke | no! the restore was never a real thing and was tried much later as a side effect of it | 16:54 |
abhishekk | ok | 16:54 |
jokke | it was implemented because synchronous deletes of big images were as taxing on the clients as uploads | 16:54 |
dansmith | like, does it say "well, I see a location for this, but I can't access it because it's not on my local disk, so I better leave this location in place so another node will try to clean it" ? | 16:54 |
jokke | and held the client hostage until the data was deleted from the store | 16:54 |
jokke | dansmith: the scrubber basically went and tried to delete the locations; if it failed, it just hopped to the next image on the list and tried again on the next cycle. Once all the locations of a deleted image were whacked, it stopped trying and the image was considered deleted | 16:57 |
jokke | everything else is cleaned during the actual delete call | 16:57 |
jokke | This allowed, for example, ceph-backed images with CoW usage to be deleted; the data was finally whacked when the ceph reference count dropped to 0 and ceph allowed the data to be deleted | 16:58 |
jokke | So you could delete an image without killing a VM still running from that image | 16:59 |
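The scrubber cycle jokke describes (try every location of every pending-delete image, keep the failures for the next pass, consider the image deleted once all locations are gone) can be sketched roughly like this. All names and data shapes here are hypothetical, not the real glance-scrubber code:

```python
# Rough sketch of the described scrubber pass. Each pending image carries
# a list of store locations; failed deletions simply stay queued for the
# next cycle. Data shapes are illustrative assumptions.
def scrub_cycle(pending_images, delete_location):
    """Run one scrubber pass; return the images still pending."""
    still_pending = []
    for image in pending_images:
        remaining = []
        for loc in image["locations"]:
            try:
                delete_location(loc)
            except Exception:
                remaining.append(loc)  # retry on the next cycle
        if remaining:
            image["locations"] = remaining
            still_pending.append(image)
        # otherwise: all locations whacked, image considered deleted
    return still_pending
```

This also shows why, as discussed below, solving the staging residue via location records effectively requires a working scrubber: something has to keep retrying the locations that could not be reached.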
dansmith | so, if we go the location route to solve this staging residue bug, then people without delayed delete have to turn on and run the scrubber to periodically scan everything in order to avoid leaking storage right? | 17:00 |
dansmith | as opposed to just cleaning the local staging dir on startup, right after we know we could have had a problem | 17:01 |
dansmith | but it sounds like the staging leak is a legit bug, regardless of the solution | 17:01 |
abhishekk | either way IMO it is beneficial to scan staging_dir and clean at the startup | 17:02 |
dansmith | agreed | 17:02 |
*** gyee has joined #openstack-glance | 17:02 | |
dansmith | I tend to think that it would also be good to clean files there that are related to import-unfinished image records, not only deleted | 17:02 |
dansmith | because if someone didn't delete the image while we were down, we still want to clean up the mess in staging | 17:03 |
jokke | I've been thinking that ^^ | 17:03 |
jokke | just wondering if we have a meaningful way to resume or if we just need to go and clean up ... the benefit of resume is that we would not be whacking user data without their agreement. | 17:05 |
dansmith | it's staging data.. it's /tmp | 17:05 |
dansmith | resume is always nice, but unless or until you're going to make that work reliably, I'd rather have a clean staging dir :) | 17:06 |
abhishekk | and that will be a lot of work as well | 17:07 |
dansmith | so should I spend time on this or not? | 17:12 |
abhishekk | yes | 17:13 |
jokke | dansmith: somehow this feels like you're looking to go to incredible lengths of work to avoid us having an actual record that there is data related to the image record that has not been cleaned up and will require action. Or am I missing something? | 17:13 |
dansmith | I assume you're talking about the location for staging data, | 17:16 |
dansmith | but as I noted above, that only helps if we're going to require people to turn on the scrubber (and fix the scrubber, and make sure the scrubber does the right thing), in order to avoid the leak | 17:16 |
jokke | Also, should we go with the locations route, be it Wallaby or Xena, all that work will be void, as we definitely should not go and delete data when we do have a location record for it | 17:17 |
dansmith | if that's the plan, then that's cool, but I think there's a lot of work in front of us for that to happen | 17:17 |
dansmith | in staging when we have a deleted image? I think that's totally fine | 17:17 |
dansmith | but okay, I'm hearing "don't work on this" so I won't | 17:17 |
abhishekk | jokke, even if we go with the locations approach, I think we need this check on startup: if delayed delete is not enabled, then clear the staging area?? | 17:18 |
jokke | abhishekk: there won't be deleted images with data left behind as the delete will fail if the location can't be removed. So that code should never be executed in either case | 17:20 |
dansmith | images should always be deletable | 17:21 |
jokke | dansmith: We can debate that with the Images API v3 spec, but that is not the case with v1 or v2 ;) | 17:22 |
dansmith | yep, well, I'll just drop it | 17:23 |
dansmith | I'd like to file the bug to document the concern, you can close it and I'll leave it alone | 17:23 |
jokke | dansmith: sounds good, we can either solve the bug in the refactoring happening on the distributed import to clean up the debt we've accumulated or if that does not happen then we have bug to follow up on | 17:26 |
jokke | Like I totally agree we need a solution to clean up after ourselves, I just don't think you running in the opposite direction again right now is a good use of either of our time | 17:27 |
* abhishekk going offline for the day | 17:38 | |
abhishekk | dansmith, jokke ^^ | 17:38 |
dansmith | o/ | 17:38 |
jokke | night o/~ | 17:38 |
abhishekk | o/~ | 17:38 |
*** k_mouza_ has quit IRC | 19:08 | |
*** k_mouza has joined #openstack-glance | 19:48 | |
*** k_mouza has quit IRC | 19:54 | |
*** whoami-rajat__ has quit IRC | 19:58 | |
*** rcernin has joined #openstack-glance | 20:03 | |
*** rcernin has quit IRC | 20:30 | |
*** rcernin has joined #openstack-glance | 20:41 | |
*** ralonsoh has quit IRC | 20:48 | |
*** zzzeek has quit IRC | 20:57 | |
*** zzzeek has joined #openstack-glance | 20:57 | |
*** rcernin has quit IRC | 21:34 | |
*** rcernin has joined #openstack-glance | 21:51 | |
*** k_mouza has joined #openstack-glance | 23:09 | |
*** k_mouza has quit IRC | 23:14 | |
*** knikolla has quit IRC | 23:36 | |
*** knikolla has joined #openstack-glance | 23:36 | |
*** PrinzElvis_ has joined #openstack-glance | 23:37 | |
*** PrinzElvis has quit IRC | 23:37 | |
*** PrinzElvis_ is now known as PrinzElvis | 23:37 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!