09:01:01 <jiaopengju> #startmeeting karbor
09:01:02 <openstack> Meeting started Tue Oct 23 09:01:01 2018 UTC and is due to finish in 60 minutes. The chair is jiaopengju. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:01:03 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:01:06 <openstack> The meeting name has been set to 'karbor'
09:01:13 <jiaopengju> hi guys
09:01:15 <yinweiishere> hi
09:01:34 <luobin-smile> hi
09:01:44 <jiaopengju> hi yinweiishere, luobin-smile
09:01:57 <yinweiishere> hi pengju
09:02:15 <yinweiishere> have we got any schedule for S version new features?
09:02:39 <jiaopengju> I saw the meeting agenda has been updated here, https://wiki.openstack.org/wiki/Meetings/Karbor
09:02:47 <yinweiishere> yeah
09:03:02 <yinweiishere> we want to propose adding a snapshot feature there
09:03:19 <yinweiishere> and just now luobin and I checked the current karbor code
09:03:27 <jiaopengju> yinweiishere, I have made some plans for the S version, mainly focused on optimization
09:03:32 <yinweiishere> we found some problems with snapshot there
09:03:51 <yinweiishere> any wiki page for the S version plan?
09:04:53 <jiaopengju> I will send an etherpad link after the meeting
09:05:08 <yinweiishere> OK, good
09:05:26 <jiaopengju> We can talk about the feature in the agenda first
09:05:37 <yinweiishere> sure
09:05:41 <jiaopengju> #topic add snapshot feature to Karbor to support crash consistency and app consistency further
09:05:59 <yinweiishere> I'd also like to hear your ideas for optimization
09:06:06 <jiaopengju> :)
09:06:19 <yinweiishere> since as far as I know Yuval has done a lot of optimization there :)
09:06:25 <yinweiishere> ok
09:06:31 <yinweiishere> so for snapshot
09:06:39 <jiaopengju> now you can describe the snapshot feature
09:06:48 <yinweiishere> as we know, snapshot has two usages
09:07:02 <yinweiishere> one is to restore another instance from the snapshot
09:07:17 <yinweiishere> the other is to roll back the current instance to the snapshot
09:07:36 <yinweiishere> here, by instance I mean a server, a volume, or a volume group
09:07:47 <jiaopengju> got it
09:07:56 <yinweiishere> we checked the current implementation in karbor
09:08:25 <yinweiishere> we only provide a restore API for snapshot, so we can't roll back the current instance
09:09:23 <yinweiishere> if you check the actual nova libvirt snapshot or ceph rbd snapshot, the backends all provide snapshot rollback
09:10:25 <yinweiishere> so, for the API aspect, the snapshot feature is missing there
09:10:25 <jiaopengju> This means we should add a rollback operation in the protection plugins?
09:10:39 <jiaopengju> just like verify, copy and so on
09:10:45 <yinweiishere> first add a rollback API in the restore
09:10:47 <yinweiishere> part
09:11:19 <yinweiishere> then, for the snapshot protection plugin, support it
09:11:24 <jiaopengju> yes
09:11:39 <yinweiishere> so this is the API part
09:11:56 <yinweiishere> second, snapshot has consistency semantics
09:12:22 <yinweiishere> now Karbor's checkpoint doesn't support any level of consistency
09:12:27 <yinweiishere> do you agree?
09:13:06 <jiaopengju> yes, agree
09:14:54 <yinweiishere> when we initially started this project, the founder agreed to postpone the consistency issue. But without consistency, neither the snapshot nor the checkpoint (whatever you call it) actually satisfies the 'protect' semantics.
09:15:21 <yinweiishere> actually, there are two levels of consistency
09:15:39 <yinweiishere> the initial level: crash consistency
09:15:56 <yinweiishere> the higher level: app consistency
09:16:48 <yinweiishere> although Karbor has analyzed dependencies among resources, it still fails to support even crash consistency
09:17:23 <jiaopengju> do you have specs for these two levels?
09:17:34 <jiaopengju> or some documentation link
09:18:02 <yinweiishere> look at how we protect a server and its volumes/volume groups, we didn't maintain the consistency there
09:18:35 <yinweiishere> we propose to support snapshot with two levels of consistency step by step
09:18:51 <yinweiishere> first crash consistency and then app level consistency
09:19:01 <yinweiishere> yes, we do have some ideas there
09:19:21 <yinweiishere> we want to achieve consensus first before writing it into a spec
09:20:36 <yinweiishere> pengju, are you there?
09:21:16 <jiaopengju> yes, can you give more details about crash consistency?
09:21:27 <yinweiishere> sure
09:22:40 <yinweiishere> actually, I'm thinking that we need a pair of APIs for snapshot: take_snapshot(consistency_level) and rollback_snapshot(checkpoint_id)
09:23:29 <yinweiishere> to differentiate them from the existing protect/restore API, which places only loose restrictions on consistency
09:23:43 <jiaopengju> do you mean that, if we take a snapshot in karbor, then the checkpoint info that matches this snapshot will be saved?
09:24:15 <yinweiishere> yes, the snapshot will be the backend resource of the checkpoint
09:24:26 <yinweiishere> similar to the backup id
09:24:44 <jiaopengju> ok, I understand, that sounds useful
09:25:12 <yinweiishere> or we can merge it with the protect API with a consistency level param, where none means the current way
09:25:38 <yinweiishere> those are details, we can discuss them in the spec
09:25:51 <yinweiishere> and for crash consistency
09:26:42 <yinweiishere> we need to pause the server, flush the memory to disk, and then call the volume/volume group's backend snapshot methods
09:28:10 <yinweiishere> in contrast, currently we never pause the server, but only back up each volume attached to the server if it booted from a volume. If it booted from an image, we never back up the changes in the system volume.
09:28:29 <yinweiishere> that's the crash consistency
09:28:55 <yinweiishere> where the system could really boot from the snapshot, and apps
09:28:59 <yinweiishere> won't crash
09:29:38 <jiaopengju> I have talked about the VMs that boot from an image with yuval and chenying before
09:31:18 <jiaopengju> But if we pause the server in karbor, this will be a bit intrusive to the end user
09:31:23 <yinweiishere> our current way may back up at the wrong moment, when the system and the apps haven't saved the necessary state to disk, and they may fail to boot without that state.
09:31:45 <yinweiishere> that's the protection plugin
09:32:19 <jiaopengju> part of karbor
09:32:28 <yinweiishere> sure
09:32:49 <yinweiishere> that depends on the consistency param
09:33:22 <yinweiishere> if the user asks for crash consistency, it means he/she understands what it means
09:33:43 <yinweiishere> for crash consistency itself, that's the way it should be
09:34:26 <jiaopengju> ok, understood, this is the detailed info, you can write it in the spec
09:35:00 <jiaopengju> app consistency?
09:37:07 <jiaopengju> yinweiishere are you there?
09:37:33 <yinweiishere> yes, I'm here
09:37:43 <yinweiishere> app consistency is a bit complicated
09:38:31 <yinweiishere> Luobin, could you pls. elaborate here?
09:39:02 <yinweiishere> it's about global consistent snapshot
09:39:18 <yinweiishere> which requires some background knowledge
09:39:19 <jiaopengju> for an app, is the matching resource in openstack a couple of resources, or a group of resources?
09:39:55 <yinweiishere> have you heard of the chandy-lamport algorithm for distributed snapshot?
09:40:51 <yinweiishere> this algorithm was enhanced as the ABS algorithm and applied in the stream engine in flink
09:41:15 <yinweiishere> our idea is to enhance ABS and apply it in karbor
09:41:43 <yinweiishere> we can put it as a more long term target
09:42:52 <yinweiishere> the paper of ABS is here
09:42:54 <yinweiishere> https://arxiv.org/pdf/1506.08603.pdf
09:42:58 <yinweiishere> for your reference
09:43:33 <jiaopengju> actually I think we should define which openstack resources we protect in the app consistency scenario
09:43:53 <jiaopengju> and then how
09:44:41 <yinweiishere> resources are the same
09:45:04 <yinweiishere> the problem is how to make the app aware of the snapshot
09:45:21 <yinweiishere> and, as there are many apps in one plan's servers
09:45:41 <yinweiishere> how to make sure all apps are consistent
09:46:27 <yinweiishere> the distributed snapshot issue is to make sure the causal ordering is correct
09:46:49 <yinweiishere> as, APP1 is the input of APP2
09:47:02 <yinweiishere> we take a snapshot of the whole system
09:47:28 <yinweiishere> but the snapshot command may get triggered at different times on server1 and server2
09:47:38 <jiaopengju> ok, I understand a little, I will read the reference you sent after the meeting
09:47:47 <yinweiishere> since there are events in flight
09:47:58 <yinweiishere> yeah
09:48:14 <yinweiishere> similar to crash consistency
09:48:31 <yinweiishere> we need to maintain the dependencies among APPs
09:48:42 <yinweiishere> ok
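
A minimal sketch of the causal-ordering point raised at 09:46, assuming a two-stage pipeline where APP1 feeds APP2 through a FIFO channel. It shows the barrier idea that ABS uses for acyclic dataflows in a heavily simplified form; it is not Karbor code and not Flink's implementation. The names `BARRIER`, `App`, and `run_pipeline` are hypothetical and exist only for this illustration.

```python
from collections import deque

# Hypothetical marker that separates pre-snapshot events from post-snapshot ones.
BARRIER = object()


class App:
    """Toy application whose whole state is a running sum of the events it saw."""

    def __init__(self, name):
        self.name = name
        self.total = 0        # live state
        self.snapshot = None  # state recorded when the barrier is handled


def run_pipeline(events, barrier_after):
    """APP1 emits events into a FIFO channel that APP2 consumes later.

    APP1 records its state after `barrier_after` events (1-based count) and
    sends a barrier down the same channel. APP2 records its own state only
    when the barrier arrives, so both snapshots cover the same set of events
    even though they are taken at different wall-clock moments.
    """
    channel = deque()
    app1, app2 = App("APP1"), App("APP2")

    for count, value in enumerate(events, start=1):
        app1.total += value
        channel.append(value)
        if count == barrier_after:
            app1.snapshot = app1.total
            channel.append(BARRIER)

    # APP2 lags behind: it only starts draining once APP1 has finished.
    while channel:
        item = channel.popleft()
        if item is BARRIER:
            app2.snapshot = app2.total
        else:
            app2.total += item

    return app1, app2


app1, app2 = run_pipeline([1, 2, 3, 4], barrier_after=2)
print(app1.snapshot, app2.snapshot)  # prints: 3 3
```

If APP2 had instead been snapshotted on a timer, its recorded state could include events that APP1's snapshot does not know it sent, or miss events still in the channel. Tying the snapshot to the barrier's position in the stream avoids both cases in this toy setting.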
09:48:59 <yinweiishere> that's all for my part
09:49:34 <jiaopengju> thanks for sharing this useful idea
09:50:03 <yinweiishere> you can send out the schedule link
09:50:21 <jiaopengju> we can add this to the stein plan
09:50:30 <yinweiishere> and we can put the effort there to see if more people get interested
09:50:50 <yinweiishere> yes, we can support it step by step
09:52:16 <jiaopengju> I can briefly describe the plan for the stein cycle here: 1. optimization of multiple nodes of the operation engine (yuval provided the first version)
09:52:42 <jiaopengju> 2. cross-site backup and restore (cross keystone)
09:52:55 <yinweiishere> haha
09:52:57 <jiaopengju> 3. documentation
09:53:09 <yinweiishere> cross site is what I proposed long long ago
09:53:29 <jiaopengju> yes, some people have asked questions about it
09:53:36 <yinweiishere> it's really useful, right?
09:53:47 <jiaopengju> cross keystone seems not
09:54:29 <yinweiishere> I think cross region/AZ is necessary
09:54:46 <yinweiishere> cross keystone is a bit difficult
09:54:50 <jiaopengju> at the same time, we do not have enough documentation about it
09:55:02 <yinweiishere> again, step by step is more feasible
09:55:11 <jiaopengju> yes, agree
09:55:30 <yinweiishere> ok, I think time has run out
09:55:54 <jiaopengju> yeah, so I will end the meeting soon and we can talk about it in the karbor channel
09:56:13 <yinweiishere> if you have made the schedule link, pls. let us know in the karbor channel
09:56:19 <jiaopengju> ok
09:56:33 <jiaopengju> #endmeeting
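
The API pair proposed at 09:22:40 and the crash-consistency steps listed at 09:26:42 (pause the server, flush memory to disk, then snapshot the volumes) can be sketched roughly as below. This is an in-memory illustration under stated assumptions, not Karbor's implementation: `FakeServer`, `quiesced`, and the `checkpoints` store are hypothetical stand-ins, and no real Nova or Cinder calls are made. Only the names `take_snapshot(consistency_level)` and `rollback_snapshot(checkpoint_id)` come from the discussion; the extra `server` argument is an assumption added to keep the example self-contained.

```python
import uuid
from contextlib import contextmanager


class FakeServer:
    """Hypothetical stand-in for a server with attached volumes."""

    def __init__(self, volumes):
        self.volumes = dict(volumes)  # volume name -> data already on disk
        self.dirty = {}               # buffered writes not yet on disk
        self.paused = False

    def write(self, volume, data):
        self.dirty[volume] = data

    def flush(self):
        self.volumes.update(self.dirty)
        self.dirty.clear()


@contextmanager
def quiesced(server):
    """Pause the server and flush buffered writes; always unpause afterwards."""
    server.paused = True
    try:
        server.flush()
        yield server
    finally:
        server.paused = False


checkpoints = {}  # hypothetical checkpoint store: checkpoint id -> volume data


def take_snapshot(server, consistency_level=None):
    """Record the volumes; None keeps the current behaviour (no pause, no flush)."""
    if consistency_level == "crash":
        with quiesced(server):
            data = dict(server.volumes)
    elif consistency_level is None:
        data = dict(server.volumes)
    else:
        raise ValueError(f"unsupported consistency level: {consistency_level!r}")
    checkpoint_id = str(uuid.uuid4())
    checkpoints[checkpoint_id] = data
    return checkpoint_id


def rollback_snapshot(server, checkpoint_id):
    """Roll the same server back in place, instead of restoring a new one."""
    server.dirty.clear()
    server.volumes = dict(checkpoints[checkpoint_id])


server = FakeServer({"system": "v0"})
server.write("system", "v1")                   # still buffered, not on disk
loose = take_snapshot(server)                  # current way: misses the buffered write
crash = take_snapshot(server, "crash")         # pause + flush first: captures it
print(checkpoints[loose], checkpoints[crash])  # prints: {'system': 'v0'} {'system': 'v1'}
```

The context manager guarantees the server is unpaused even if the snapshot step fails, which limits how intrusive the pause is for the end user, the concern raised at 09:31:18.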