Wednesday, 2023-05-17

opendevreviewPranali Deore proposed openstack/glance master: Add new add location api  https://review.opendev.org/c/openstack/glance/+/88194007:57
opendevreviewPranali Deore proposed openstack/glance master: Add new get location api  https://review.opendev.org/c/openstack/glance/+/88249808:01
abhishekkdansmith, o/16:53
abhishekkwhoami-rajat, around?16:54
abhishekkdo you know whether consumer of location API (cinder/nova)  does any post processing on response based on image state at their end.17:00
dansmithabhishekk: currently? I dunno, but this is going to be a code change, so I'm not sure it matters17:03
abhishekkyeah, that is also true17:03
dansmithabhishekk: for snapshot I don't think it matters if we fire off the location add and don't care that the image is in importing or "digesting" or whatever state17:03
dansmithalso let me say I'm sorry to be so late on review for this17:04
dansmiththings have been quite busy lately :/17:04
abhishekk++, import workflow sounds better17:04
abhishekkno problem17:05
abhishekkI think its better to get it right at first place it self17:05
abhishekkI was initially wanted to persist on having this async but latter we added performance impact on spec to highlight the issue17:06
dansmithso another aspect to that,17:07
abhishekkEven though we will use import workflow, performance impact will be there but it will not keep consumer hostage for waiting17:07
dansmithis that the whole point of the accelerated ceph/glance/nova snapshot stuff is that nova can freeze the guest for a very short time, snapshot at ceph, let it go, and then just "tell" glance about it17:07
dansmithif we now have to wait for glance to eat and digest the whole thing, we lose a lot of that benefit17:07
dansmith(i.e. by setting our timeout to 30m or something crazy like that)17:08
abhishekk++17:08
dansmithalso, I should probably ask..17:08
dansmithwe likely want to *enable* the async hashing, but not require it before the image can be used17:09
dansmithmeaning, if you're doing cloud-fork kinda things, you want to snapshot an instance and immediately create N copies of it17:09
dansmithenabling hashing will slow that down if we can't use the image immediately17:09
dansmithit's not critical (like delaying the snapshot itself) but might be worth thinking about17:10
abhishekkthat will be configurable on consumer side, if they send do_secure_hash True then only we will calculate the hashing17:10
dansmithIIRC in the "import to multiple stores" workflow, once it's imported to one, the image is active and can be used and the rest of the uploads finish in the background17:11
abhishekkyes17:11
dansmithright but the (nova) user doesn't get to choose that17:11
dansmithso they can't say "I'd rather not have hashing because speed is important to me"17:11
abhishekkok, then what is the use of having this do_secure_hash request param17:13
dansmithwell, honestly, it's probably better to make that a glance config (default on), and then if they provide a hash, calculate and check, and if they don't provide one, calculate and store it17:13
dansmithdo we have any other image states other than active where you can download the image?17:14
abhishekknope17:14
dansmithokay17:15
dansmithwell, I don't want to explode everything about the design, but it's just worth thinking about I think17:15
dansmithhaving to choose between "having a hash, which must be calculated before I use it" and "having a hash once it's available" is kindof a bummer17:16
dansmithif they provide a hash value when they add, then waiting to go active until it's validated makes sense17:16
dansmithif they don't, presumably it should go active immediately, because it's not going to fail anyway right?17:17
dansmithand nova won't provide a hash for snapshots because we don't know what it should be anyway, so .. straight to active in that case17:17
abhishekkyes that is what it will do now because we have do_secure_hash False by default17:18
dansmith...go active immediately, but still calculate it17:18
abhishekkOne active we can not set these properties I guess17:19
abhishekk*once17:19
dansmiththat's a glance rule though (which we can fix) right?17:19
abhishekk:)17:19
dansmithmaybe we should chat with rosmaita about that aspect17:20
dansmithif that's a fundamental design point, that we should never add a hash value after the image goes active, then I guess that's okay17:20
dansmithbut it seems bad to block things on tasks that necessarily take a long time17:20
abhishekkI think that is the rule,17:22
dansmithbut today that leads to never having hashes on snapshots which kinda sucks :/17:23
abhishekkalso as I pointed if nova never gonna send secure_hash and validation data then image will go to active state immediately 17:23
abhishekkyeah17:23
abhishekkthat is the point we are adding this 17:23
abhishekkthis also blocks us from caching these types of images since during caching it calculates checksum and fails in the end since the image doesn't has it17:24
dansmithack17:25
abhishekkWe do not allow setting os_hash_* and checksum using image-update17:25
dansmithfrom the outside that makes sense for sure17:26
abhishekkalso while creating image there is no way to pass these images17:26
abhishekkthese values to image17:26
abhishekkSo somehow we need to make some compromise if we want to calculate the hash/checksum17:28
dansmithyeah I mean I think rosmaita has in the past expressed a desire to up the level of validation we're doing (either signatures or hashes) and I very much agree that we should be really airtight on that stuff17:34
abhishekkagree about later part17:34
dansmithupload and import are easy to calculate the right values. snapshots are just as important, they just need a different (async) workflow17:35
abhishekkSo for this workflow we should make some amends?17:35
dansmithpersonally, I think allowing this workflow to go active immediately (if no hash is provided) and set the hash after the image is active would be best. But, if that's not acceptable,17:36
dansmithI'd rather the snapshots calculate the hash and just not go active until it's done17:36
dansmithi.e. break/delay the instant snapshot-and-boot "cloud fork" model17:36
abhishekkthat is what we are currently doing but without async, if we move to import workflow we will be doing that 17:37
dansmithyeah17:38
abhishekkbut I think rosmaita will be around tomorrow and since we don't hve weekly meeting we can discuss this at that time17:38
abhishekkthe possibility of 1st option17:38
dansmithack17:38
rosmaitajust got back, will read through the scrollback17:39
abhishekkjust read from your 1st mention :D17:40
whoami-rajatabhishekk, when glance backend is cinder and we upload image, if the ADD location call fails/doesn't return success then we delete the cloned volume and fail the operation18:15
whoami-rajatabhishekk, so we kind of rely on the sync behavior of this workflow18:15
whoami-rajathttps://github.com/openstack/cinder/blob/456b6399bece9a9ac6274e56130cff6680d99096/cinder/volume/manager.py#L1746-L175518:15
abhishekk you don't check for image state though18:16
abhishekkSo if we decided to go option 1 then you will not be impacted18:17
abhishekkwhich is, , I think allowing this workflow to go active immediately (if no hash is provided) and set the hash after the image is active would be best. But, if that's not acceptable,18:17
whoami-rajatabhishekk, but what if the hash calculation fails? then the image state won't go to ACTIVE i.e. unusable image and we've a leftover cloned volume on cinder side18:18
abhishekkhmm18:18
whoami-rajatalso the cinder operation will say the operation was successful with no error logs, then operators would need to search glance logs as to why image upload didn't work as expected18:19
dansmithwhoami-rajat: will cinder provide the hash?18:19
whoami-rajati.e. image upload was successful but glance image is queued18:19
whoami-rajatdansmith, there are 2 values that cinder can provide, do_secure_hash and validation_data which makes 3 cases possible (actually 4 but 4th one is the current behavior)18:20
whoami-rajatdansmith, https://specs.openstack.org/openstack/glance-specs/specs/2023.2/approved/glance/new-location-info-apis.html#rest-api-impact18:20
dansmitheither way, anything that is sync for simplicity can be async with polling for robustness and you can handle it the same way if the image goes to error state as you would if you wanted 30m for one call to complete and got an error18:21
dansmithwhoami-rajat: right, but I'm asking _if_ cinder will, and if it will, how does it do that without reading the whole volume?18:21
abhishekkpolling side change at cinder makes sense18:21
dansmithif you had to clone the volume and upload it to a backing store then sure, but if you're doing it in ceph, it should be the same as nova - where you don't want to read the whole volume only to provide the hash right?18:21
whoami-rajatdansmith, ah ok, cinder won't actually provide the validation data, if the end user has it they can provide it via config or CLI18:22
dansmithexactly :)18:22
dansmithso in cinder's case it will never fail hash validation18:22
dansmithas in nova's18:22
whoami-rajatit shouldn't18:23
dansmiththe only thing that would fail is if you provide like a http url that can't be read (or which fails during hash validation)18:23
dansmithglance should retry retry-able errors, and if it's fatal, then you can't download the image anyway18:24
whoami-rajatabhishekk, dansmith, on a higher level, the idea is18:24
whoami-rajat1) abstract locations from end users18:24
whoami-rajat2) provide operators/users ability to choose performance vs security18:24
whoami-rajat3) create separate APIs for location operations to get more control with policies (service or owner checks)18:24
dansmithyep, my argument is that #2 should not require a 100% tradeoff18:24
dansmithif I "prefer" performance, I should get security one the hash is calculated18:25
dansmith*once18:25
dansmithif I upload an image and I know the hash and provide it, I'm choosing security (i.e. don't make this active until you validate it matches)18:25
dansmithif I don't, I either can't enforce the security during the add, or don't want to, but once the hash has been calculated, everything else should be able to validate subsequent operations on that hash18:26
abhishekkIf we allow to set these properties to active image for this operation/workflow then we will have no problem on both sides18:27
dansmithcorrect18:27
abhishekkI don't think there will be any issue in allowing that unless rosmaita thinks otherwise18:28
dansmithwhoami-rajat: going back to your "cinder won't provide the hash unless the user does" comment.. is there that sort of proxy where someone shows up to cinder with a volume image that gets passed through (with its hash) to glance18:28
dansmith?18:28
whoami-rajatbut those properties will be set by glance, maybe a background process calculating the hash but not allow it for an end user?18:28
dansmithwhoami-rajat: that's what we're talking about yes.. not available to users, but the import background task will update them18:29
abhishekkyeah18:29
whoami-rajatdansmith, I don't mind that idea since I'm not implementing it :P18:29
abhishekklol18:30
dansmithack, well, let's aim for the best solution regardless :)18:30
abhishekkthis will haunt you when you will implement consumer side :P18:30
whoami-rajatdansmith, sorry i didn't understand your previous question18:31
dansmithwhoami-rajat: [11:22:29]  <whoami-rajat> dansmith, ah ok, cinder won't actually provide the validation data, if the end user has it they can provide it via config or CLI18:31
dansmithwhoami-rajat: did you mean just the glance api/cli or is there a user->cinder->glance workflow that involves providing the hash values ahead of time?18:31
abhishekkformer part, glance api/cli can provide validation data for http store18:33
dansmithack okay just making sure I wasn't missing some other flow18:33
whoami-rajatdansmith, so we've upload volume to glance operation on cinder side that creates an image with same data as the volume, we've a user facing CLI for it -- if that's what you are asking about18:33
dansmithwhoami-rajat: right but the user wouldn't provide a hash value during that operation right?18:33
abhishekkcinder or nova can just provide do_secure_hash true if we keep it default fase18:34
whoami-rajatdansmith, right now they can't but if we want decide to pass validation data from cinder/nova to glance then we can add new parameters18:34
whoami-rajatdansmith, mainly it was for the http store case18:34
whoami-rajatI think nova/cinder will just use the do_secure_hash param18:34
dansmithwhoami-rajat: ack, okay, well, I'm not sure how the cinder user would know the hash for the volume (without a very laborious effort) but yeah I get it18:35
whoami-rajatmainly s/it/validation_data was for http store case18:35
dansmithwhoami-rajat: well, I think we should *always* hash (unless maybe disabled in config)18:35
dansmithwhoami-rajat: and wait for active if you provide the values ahead of time, and not wait if you don't18:35
whoami-rajatdansmith, if we go with the async flow, we should always. we wanted to make it configurable for the performance impact of sync operation https://specs.openstack.org/openstack/glance-specs/specs/2023.2/approved/glance/new-location-info-apis.html#performance-impact18:35
dansmithso if you provide a hash, you're saying "don't let this be active until you validate that it is correct" and if you don't, you say "calculate the hash in the background and set it when it's available"18:36
dansmithwhoami-rajat: ack yeah18:36
dansmiththe fact that we have no hashes for any of our snapshots today is really bad, IMHO, and this is an opportunity to fix that past mistake :)18:36
abhishekkfor this we all are on same page :D18:37
whoami-rajatdansmith, yes that understanding looks correct18:37
dansmith++18:37
whoami-rajatif jokke_ is around, he would be happy to know we are finally fixing one of the issues he used to mention in every cinder glance cross project :D18:38
dansmithlack of hashes?18:38
whoami-rajati.e. the nova/cinder created images lack security18:38
whoami-rajatyes18:38
dansmithyeah, it's almost embarrassing to be missing those :)18:38
dansmiths/almost//18:38
whoami-rajatyep, thanks dansmith for taking a look18:39
dansmithnp, sorry it's so late :)18:39
abhishekkSo in import workflow, state will be queued, active if validation data not provided but do secure hash is true, else it will be queued,importing,active18:39
dansmithjust drop do_secure_hash, IMHO18:39
dansmithmake it config, default on18:39
dansmithoperators can disable it for PoC or silly situations, but don't let the user choose per request18:40
whoami-rajatyeah agreed, if it's async and image will be active, cinder/nova won't mind glance doing extra work18:40
whoami-rajat:D18:40
dansmithhah, right :)18:40
dansmithmakes it simpler that way18:41
dansmithharder to mess up18:41
dansmithand, if we get this in place, you could make the glanceclient stop agreeing to download images with no hash unless --download-insecure is passed or something18:41
dansmiththen people can decide if they want to take the image while the hash is being calculated or not18:41
abhishekkhmm, that will be good improvement as well18:43
whoami-rajatone question, if the hash calculation fails, we will just won't update the hash in image right? or we will "deactivate" or "queue" the image again?18:43
dansmithwhoami-rajat: fails because it didn't validate against the provided one, or failed due to network or something else?18:43
abhishekkif validatation data is provided then we don't set image to active immediately18:44
abhishekkit will be in importing state18:44
dansmithright18:44
abhishekkif validation fails it will be rolled back to queued if succeeds then active18:44
dansmithif we fail to download the image (like via http) in order to calculate the hash, I think we should retry a few times and then fail the image18:44
dansmithbecause it won't likely be downloadable anyway18:44
abhishekkyep18:46
whoami-rajatdansmith, abhishekk I'm referring to the case when we don't provide validation data. we set the image to active and do the hash calculation in background. but if someone uses that image to create VMs or bootable volumes and later the hash calculation fails and we move the image to QUEUED, what will happen to those VMs and volumes?18:46
abhishekkhow will hash calculation fail unless service down or something else?18:47
dansmithabhishekk: right18:47
dansmithI would put it to error personally18:47
whoami-rajatthe process gets killed due to OOM? there could be several cases since it's a CPU intensive task18:47
whoami-rajatdansmith, but we've already created resources from that image ...18:48
dansmithwhoami-rajat: OOMing from a hash calculation would mean something is *very* wrong :)18:48
dansmithwhoami-rajat: well, that can happen today if you boot an instance from a snapshot and then delete the image18:49
whoami-rajatdansmith, don't we keep chunks in memory? and chunk size is configurable? sorry if I'm ignorant towards the hash calculation logic18:49
whoami-rajatdansmith, but in cinder case, if glance and cinder are both using ceph, bootable volumes are COW clones of image i.e. dependent on the image18:49
dansmithwhoami-rajat: we should not need more than one chunk's worth of memory at any one time.. 4MB at the *most* I would think, 64k in most other cases.. hashes are very favorable to streaming calculation18:50
whoami-rajatok18:50
dansmithwhoami-rajat: ack, well, we could just leave the hash value unset, but I personally don't think that's right18:50
dansmithwhoami-rajat: perhaps some way to re-trigger hash calculation once the issue is fixed or something18:51
whoami-rajatdansmith, so i just want to brainstorm the cases so we don't end up creating a regression in current operations18:51
abhishekkI think it will be better to leave it unset in worst possibility18:51
dansmithwhoami-rajat: ack, yep18:52
dansmithokay, then leave it unset after several retries, with a LOG.error() explaining the problem18:52
abhishekkseveral?18:52
abhishekkthat will introduce another config option :D18:53
abhishekkor 3 times hardcoded 18:53
whoami-rajatretry + unset (eventually) works for me, it's the same thing we have today, which is not great but at least works18:53
abhishekkits win win for all18:53
dansmithabhishekk: we probably need something like "http_retries" which could apply to this and other things with web-download right/18:54
whoami-rajatthough i think everyone has confidence that hash calculation shouldn't fail unless the deployment is in a  really bad state18:54
abhishekkdansmith, yes, later we can use if to enhance web-download18:54
dansmithwhoami-rajat: yep18:55
dansmithabhishekk: yep18:55
dansmithabhishekk: can we use taskflow to restart retry-failed tasks on glance-api restart or something? I think we don't persist the state, IIRC18:56
abhishekkno we don't18:57
dansmithack18:57
abhishekkwe are waiting to confirm from rosmaita about the approach whether to set checksum/hash on active image or not19:00
* abhishekk signing off now, will be back tomorrow19:00
dansmitho/19:02
abhishekko/~19:14

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!