*** zengyingzhe has quit IRC | 00:48 | |
*** gampel has quit IRC | 01:45 | |
*** wangfeng_yellow has joined #openstack-smaug | 03:20 | |
*** zengyingzhe has joined #openstack-smaug | 06:07 | |
*** zhonghua-lee has quit IRC | 06:29 | |
*** zhonghua-lee has joined #openstack-smaug | 06:29 | |
*** gampel has joined #openstack-smaug | 07:22 | |
*** gampel1 has joined #openstack-smaug | 08:20 | |
yinwei | @gampel on line? | 08:21 |
yinwei | we're thinking about checkpoint lock mechanism | 08:23 |
*** gampel has quit IRC | 08:23 | |
yinwei | scenario like: an operation is under execution by the protection service, and a checkpoint is created. Here we need to lock the checkpoint until execution of the protection finishes. | 08:23 |
yinwei | The lock is to avoid deleting the checkpoint in parallel, and will be used when the service restarts and the workflow of the checkpoint is rolling back: we need to check that no other service is working on this checkpoint in parallel. If we need this distributed lock, I'm wondering whether we should introduce a lock service plugin and its backend, say zookeeper, or, since the lock is not contended frequently and doesn't have high performance requirements, we could implement | 08:25 |
yinwei | a distributed lock based on our bank itself, like S3 or swift. What do you think? The latter approach doesn't need to introduce another component. | 08:25 |
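A minimal sketch of the "put and check" lock idea floated here, written against python-swiftclient. The container layout, the checkpoints/<id>/lock key and the per-service OWNER_ID are illustrative assumptions, and because plain object PUTs are last-one-wins, the read-back check below is only best-effort rather than a true compare-and-set.

```python
import uuid

from swiftclient.exceptions import ClientException

# `conn` below is assumed to be a swiftclient.client.Connection.
OWNER_ID = uuid.uuid4().hex  # generated once per protection-service instance


def try_acquire_checkpoint_lock(conn, container, checkpoint_id):
    """Best-effort lock: write our id, read it back, we hold it if it stuck."""
    lock_obj = 'checkpoints/%s/lock' % checkpoint_id
    try:
        _, body = conn.get_object(container, lock_obj)
        if body.decode() != OWNER_ID:
            return False            # someone else already holds the lock
    except ClientException as e:
        if e.http_status != 404:
            raise
        # No lock object yet: "put" ours ...
        conn.put_object(container, lock_obj, contents=OWNER_ID)
    # ... then "check" by reading back (also makes the lock reentrant for us).
    _, body = conn.get_object(container, lock_obj)
    return body.decode() == OWNER_ID


def release_checkpoint_lock(conn, container, checkpoint_id):
    try:
        conn.delete_object(container, 'checkpoints/%s/lock' % checkpoint_id)
    except ClientException as e:
        if e.http_status != 404:
            raise
```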
*** gampel1 has quit IRC | 08:56 | |
*** gampel has joined #openstack-smaug | 08:57 | |
zengyingzhe | yinwei, if we lock the checkpoint the whole time the protection process is going on, how can we read the status of this checkpoint? | 09:47 |
zengyingzhe | I'm not sure whether S3 or swift is suitable for implementing a distributed lock, but I do know a DB can, because heat implements its lock mechanism with the DB. | 09:50 |
*** zengyingzhe has quit IRC | 09:55 | |
*** zengyingzhe has joined #openstack-smaug | 09:56 | |
*** zengyingzhe has quit IRC | 09:57 | |
*** openstackgerrit has quit IRC | 10:02 | |
*** openstackgerrit has joined #openstack-smaug | 10:02 | |
yinwei | same mechanism as a DB: put and check. But swift, as distributed storage, should scale better than a DB, unless you're talking about a distributed DB like cassandra. | 10:15 |
*** saggi has joined #openstack-smaug | 10:17 | |
yinwei | people are also discussing using zk to synchronize in heat: http://blogs.rdoproject.org/7685/zookeeper-part-1-the-swiss-army-knife-of-the-distributed-system-engineer-2 | 10:17 |
saggi | yinwei: I missed the beginning. What are we all talking about? | 10:18 |
yinwei | if we don't mind introducing another component, zk would be a more reliable way, and the nova service group has already accepted zk as one of its lock backends. | 10:18 |
yinwei | hi, saggi | 10:19 |
saggi | hi :) | 10:19 |
yinwei | we're talking about checkpoint lock | 10:19 |
yinwei | which seems to be a distributed lock | 10:19 |
yinwei | scenario like: an operation is under execution by the protection service, and a checkpoint is created. Here we need to lock the checkpoint until execution of the protection finishes. | 10:20 |
yinwei | The lock is to avoid deleting the checkpoint in parallel, and will be used when the service restarts and the workflow of the checkpoint is rolling back: we need to check that no other service is working on this checkpoint in parallel. | 10:20 |
yinwei | I think I got the scenario from your bank.md :) | 10:21 |
saggi | Yes, how it's implemented depends on the bank. | 10:21 |
saggi | So object store implementations support mechanisms that enable this locking. | 10:22 |
yinwei | so, two options: shall we introduce another lock service plugin and its backend, and make zk its lock | 10:22 |
yinwei | sorry, what semantics do you mean the object store supports? | 10:23 |
yinwei | AFAIK, object storage only ensures atomic put and last-one-wins | 10:23 |
saggi | First of all, the lock is only really important for deletion, since the checkpoint will be marked as "in progress" until it is done. This means we know that we can't restore from that point if it's not in the "done" state. The only problem is deletion. | 10:24 |
saggi | We want to know that: a) the protect operation crashed, so we can delete an "in progress" checkpoint; b) there is no restoration in progress, so we can delete a "done" checkpoint. | 10:24 |
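A tiny sketch of the deletion rule saggi spells out above; the state strings and the two inputs are hypothetical placeholders, not Smaug's actual API.

```python
def may_delete(checkpoint_state, owner_is_alive, restores_in_progress):
    if checkpoint_state == 'in progress':
        # (a) the protect operation crashed, so this checkpoint is abandoned
        return not owner_is_alive
    if checkpoint_state == 'done':
        # (b) no restoration is currently reading from this checkpoint
        return restores_in_progress == 0
    return False
```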
yinwei | IMHO, either we introduce a lock service like zk, or we implement it based on the bank in the bank plugin. | 10:25 |
saggi | I don't think we can use ZK since it needs to work cross site. | 10:25 |
yinwei | shall we deploy swift across sites? | 10:26 |
saggi | When restoring on site B site A needs to know it can't delete this checkpoint. | 10:26 |
saggi | I'd assume you use the target's swift via its northbound API. When you back up for DR you need to save off-site. That target site will need to run the object store. If the object store is in the local site we will lose it in the disaster. | 10:28 |
yinwei | how about this case: we deploy multiple protection services, and one service crashes while the checkpoint is 'in progress'. Later, shall we pick up this checkpoint and continue in another service instance once the crash is detected? | 10:29 |
saggi | You can't continue. You have to create a new checkpoint from scratch since the tree might have changed while you were down. | 10:30 |
yinwei | AFAIK, swift has implemented geo replication | 10:30 |
yinwei | why would one site failure lose the whole swift? | 10:31 |
saggi | If you don't have it replicated of course | 10:31 |
yinwei | here I mean: why not use its geo replication feature, instead of building two swift sites and replicating the data ourselves? | 10:32 |
saggi | how are collisions resolved? | 10:33 |
saggi | gampel suggested that for simplicity we might lock a bank to a single site. That would remove the need for cross-site locking. Each site could put its ID in the root of the bank. If you want write access you will need to use that site or steal access. We are vulnerable only during a forced transition, but we will warn the user when that happens. | 10:33 |
saggi | yinwei, I have to go, will you be here in 45 minutes? | 10:34 |
yinwei | I don't think so, maybe later | 10:34 |
saggi | yinwei, you are asking good questions and I would like to continue | 10:34 |
yinwei | 45 minutes later is my dinner time | 10:34 |
yinwei | thanks, saggi | 10:34 |
saggi | yinwei, ping me when you have time | 10:34 |
yinwei | sure | 10:34 |
yinwei | I still think we need to check all zombie checkpoints (to clean up, yes, delete them again), and we need to redo those operations | 10:38 |
yinwei | to check zombie checkpoints, a distributed lock would be a good semantic. | 10:38 |
yinwei | @zengyingzhe, to answer the read-status question: we build an index for reading checkpoint status. BTW, the lock should be reentrant, i.e. allow repeated entrance by the same lock owner | 10:41 |
*** zengyingzhe has joined #openstack-smaug | 10:51 | |
*** zengyingzhe_ has joined #openstack-smaug | 12:05 | |
*** zengyingzhe has quit IRC | 12:08 | |
*** yinweiphone has joined #openstack-smaug | 12:15 | |
yinweiphone | saggi: hello | 12:16 |
saggi | yinweiphone: Hello | 12:16 |
yinweiphone | happy | 12:16 |
yinweiphone | happy you're here | 12:16 |
saggi | yinweiphone: You are here in 3 different forms :) | 12:18 |
yinweiphone | hmm, I'm trying iPhone app | 12:18 |
yinweiphone | seems hard to type on | 12:18 |
saggi | yinweiphone: wrt locking. We would like to avoid distributed locking. What I thought was to use Swift's auto deletion feature to create lease objects. While they exist the checkpoint is locked. They will auto-delete if the host crashes and no longer extends their lifetime. | 12:22 |
*** yinweiphone has quit IRC | 12:23 | |
saggi | yinweiphone: It still doesn't solve the cross-site use case, as it will take time for geo-replication to copy these objects, so we can't trust them across sites. To solve this I suggest marking a checkpoint as deleted first and only deleting it after enough time has passed that we are sure that all sites are up to date. | 12:24 |
*** wei__ has joined #openstack-smaug | 12:25 | |
wei__ | wow, now i can log in through my mac, cool | 12:26 |
wei__ | much easier to type | 12:27 |
wei__ | saggi, shall we continue? | 12:28 |
*** yinweiphone has joined #openstack-smaug | 12:28 | |
*** yinweiphone has quit IRC | 12:28 | |
saggi | wei__: wrt locking. We would like to avoid distributed locking. What I thought was to use Swift's auto deletion feature to create lease objects. While they exist the checkpoint is locked. They will auto-delete if the host crashes and no longer extends their lifetime. | 12:29 |
saggi | wei__: It still doesn't solve the cross-site use case, as it will take time for geo-replication to copy these objects, so we can't trust them across sites. To solve this I suggest marking a checkpoint as deleted first and only deleting it after enough time has passed that we are sure that all sites are up to date. | 12:29 |
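A minimal sketch of the lease-object idea, assuming Swift's object-expiration header X-Delete-After: each refresh re-PUTs the lease and resets the clock, so the object disappears by itself if the owning service crashes and stops refreshing it. The lease object name is left as a parameter, since at this point in the discussion the lease is still per checkpoint; the TTL value is an assumption.

```python
from swiftclient.exceptions import ClientException

LEASE_TTL = 600  # seconds until Swift expires the lease object (assumption)


def refresh_lease(conn, container, lease_obj):
    # Each PUT resets the expiration clock; if the owning service stops
    # calling this (crash, hang), Swift deletes the lease object by itself.
    conn.put_object(container, lease_obj, contents='',
                    headers={'X-Delete-After': str(LEASE_TTL)})


def lease_exists(conn, container, lease_obj):
    # A missing lease means the owner stopped refreshing it (or never existed).
    try:
        conn.head_object(container, lease_obj)
        return True
    except ClientException as e:
        if e.http_status == 404:
            return False
        raise
```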
wei__ | auto deletion feature? sounds like an ephemeral node in zk | 12:31 |
saggi | wei__: I don't think we should deploy ZK cross site. I don't think it's built for that. I'd rather have the option to have a restore fail because someone deleted it (because it's not that bad) than have to configure and maintain a cross site ZK configuration. | 12:33 |
wei__ | I need to check more about auto deletion. AFAIK in S3, auto deletion only means an object can be deleted some time later, as the user sets, like 3 days later. | 12:33 |
wei__ | I understand your point, the cost of introducing another service. | 12:34 |
wei__ | Just not sure whether swift's auto deletion can do that | 12:34 |
wei__ | as you said, it seems to maintain a heartbeat between the client and the cluster; once the client crashes, the ephemeral lock is released. | 12:36 |
saggi | wei__: The only thing I'm really worried about is detecting abandoned (zombie) checkpoints. | 12:36 |
saggi | Deleting while restoring isn't very important, as the restore would fail and it's the user's problem if they decide to delete the checkpoint. | 12:37 |
wei__ | oh oh oh | 12:37 |
wei__ | I got it | 12:37 |
wei__ | you mean the client will keep updating the lifetime of the checkpoint key again and again | 12:38 |
wei__ | if the client crashes, the checkpoint is auto-deleted | 12:38 |
wei__ | ok | 12:38 |
saggi | wei__: Just the lease file. Since the checkpoint might be many objects and updating them all is too much work. | 12:39 |
*** zengyingzhe_ has quit IRC | 12:39 | |
wei__ | the lease file is kept per service instance? | 12:42 |
saggi | Per checkpoint. | 12:42 |
wei__ | ok. | 12:42 |
wei__ | then what do you mean by an abandoned checkpoint? | 12:43 |
wei__ | checkpoints under execution when the service crashes? | 12:43 |
saggi | The server stopped the checkpoint process spontaneously either by a bug or by a crash | 12:44 |
saggi | So the checkpoint is still "in progress" but it will never finish | 12:44 |
saggi | we need to clean it up | 12:44 |
saggi | wei__: OK? | 12:47 |
wei__ | hmm, do you mean the lock client that updates the lease file will be a separate process from the protection service? Otherwise, I can't see why the lease would still be kept when the service crashes | 12:47 |
wei__ | hmm, if there's a bug where the protection service is still alive but the lease keeps being extended... if that's the case, it's really hard to detect the zombie, since even if you put the lease-updating logic into the task flow, you can't make sure the granularity is fine enough to catch bugs. | 12:52 |
saggi | The lease will expire correctly. But another service wouldn't know which of the server leases are responsible for which checkpoints. | 12:52 |
wei__ | why? | 12:53 |
saggi | How would they know? | 12:53 |
wei__ | I thought the lease should be some key like /container/checkpoint/lease/client_id | 12:54 |
wei__ | client_id is a sha256 or something like that, generated from a timestamp when the service initializes | 12:54 |
saggi | hmm | 12:55 |
saggi | And then we only need to update one lease | 12:55 |
wei__ | a swift get with path /container/checkpoint/lease could tell if there's any lock | 12:55 |
saggi | I would call it owner_service_id | 12:55 |
wei__ | yes, much better | 12:56 |
saggi | and then we check that this owner is alive | 12:56 |
saggi | It's a good idea wei__ | 12:56 |
wei__ | thanks, saggi | 12:56 |
wei__ | I like your auto deletion idea too | 12:56 |
saggi | So we have one per service. | 12:57 |
wei__ | actually, I was thinking of implementing something like this on the client side but forgot about such a magic usage | 12:57 |
wei__ | ok, let me see | 12:57 |
saggi | And we make it long | 12:57 |
saggi | Much longer than the update time. | 12:57 |
saggi | So if the service fails to update for N minutes it abandons all checkpoints. | 12:58 |
saggi | Since we can't attest their integrity | 12:58 |
wei__ | how could we tell which checkpoints belong to one service owner? | 12:58 |
saggi | We don't need that | 12:59 |
saggi | we just go over all the in progress checkpoints and check their owners | 12:59 |
saggi | collect all the abandoned ones | 12:59 |
saggi | and delete them | 12:59 |
wei__ | I mean how could we tell the owner of each checkpoint? | 12:59 |
saggi | wei__: Could you comment on bank.md so I will remember to add it to the file. | 13:00 |
saggi | yes | 13:00 |
saggi | that is what you suggested | 13:00 |
saggi | putting it in the checkpoint MD | 13:00 |
wei__ | metadata of checkpoint object? | 13:00 |
saggi | in the bank | 13:00 |
wei__ | sure, i will comment it in bank.md | 13:01 |
saggi | /checkpoints/<checkpoint id>/owner_id | 13:01 |
saggi | and we will have /leases/clients/<client_id> | 13:01 |
saggi | or maybe /leases/owners/owner_id | 13:01 |
wei__ | you mean making another key under the checkpoint_id prefix | 13:01 |
saggi | yes | 13:02 |
saggi | We create it when we create the checkpoint | 13:02 |
wei__ | ok | 13:02 |
saggi | That way we only maintain one lease | 13:02 |
wei__ | hmm, got the mapping: checkpoint->owner-id->lease | 13:03 |
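A sketch of the key layout just agreed on, together with the scan described earlier (walk the in-progress checkpoints, look up each one's owner_id, and treat a checkpoint as abandoned when its owner's lease object has already expired). The container name and the per-checkpoint status key are assumptions for illustration; the owner_id and leases/owners paths mirror the ones proposed in the discussion.

```python
# Assumed bank layout:
#   /checkpoints/<checkpoint_id>/owner_id  -> id of the creating service
#   /checkpoints/<checkpoint_id>/status    -> "in progress" / "done" / ...
#   /leases/owners/<owner_id>              -> expiring lease object (one per service)

from swiftclient.exceptions import ClientException


def find_abandoned_checkpoints(conn, container):
    abandoned = []
    # List the pseudo-directories under checkpoints/ to enumerate checkpoint ids.
    _, listing = conn.get_container(container, prefix='checkpoints/',
                                    delimiter='/')
    for entry in listing:
        subdir = entry.get('subdir')          # e.g. 'checkpoints/<id>/'
        if not subdir:
            continue
        checkpoint_id = subdir.split('/')[1]
        _, status = conn.get_object(container, subdir + 'status')
        if status.decode() != 'in progress':
            continue
        _, owner_id = conn.get_object(container, subdir + 'owner_id')
        lease_obj = 'leases/owners/%s' % owner_id.decode()
        try:
            conn.head_object(container, lease_obj)   # lease still alive
        except ClientException as e:
            if e.http_status == 404:
                abandoned.append(checkpoint_id)      # owner lease expired
            else:
                raise
    return abandoned
```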
saggi | We would make the leases long. Since I don't foresee a lot of contention | 13:03 |
wei__ | yes, same here | 13:03 |
saggi | yes | 13:03 |
wei__ | another question: you said we actually delete the checkpoint only after some time, long enough for all sites to be updated | 13:04 |
wei__ | do you mean the geo replication of swift is an eventual consistency model? | 13:05 |
saggi | wei__: Just what I wanted to talk about now :) | 13:05 |
wei__ | nice | 13:05 |
wei__ | :) | 13:05 |
saggi | Yes, the problem is that we can't ensure consistency of the leases across sites. So leases don't work. | 13:06 |
wei__ | is there any way in swift to check whether consistency has been achieved? | 13:06 |
saggi | But this is only a problem for the delete while restore case | 13:06 |
wei__ | yes, it only happens when we delete from one site but read from another site | 13:07 |
saggi | wei__: I would prefer if swift wouldn't do anything. Since we also need to synchronize with resources outside of swift. | 13:07 |
saggi | There is also the issue of double delete | 13:08 |
wei__ | hmm, what swift offers is almost what other object storages offer. They all align with S3. | 13:08 |
saggi | So what we suggest is that deleting will only change the state of the checkpoint but won't actually delete it. Then there will be a single process, which we need to make sure only runs in one place, that actually deletes the checkpoints. | 13:09 |
saggi | Like a garbage collector. | 13:09 |
wei__ | yes, could only be that | 13:10 |
wei__ | eventual consistency introduces dirty reads | 13:10 |
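A sketch of the two-phase delete saggi proposes just above: "delete" only flips the checkpoint's state in the bank, and a single GC process performs the physical deletion after a grace period assumed to be long enough for geo-replication to converge. The key names, the grace period and the deleted_at marker are illustrative assumptions.

```python
import time

GRACE_PERIOD = 24 * 3600   # assumed long enough for geo-replication to settle


def request_delete(conn, container, checkpoint_id):
    # Phase 1: any site may "delete", but it only flips the state in the bank.
    prefix = 'checkpoints/%s/' % checkpoint_id
    conn.put_object(container, prefix + 'status', contents='deleted')
    conn.put_object(container, prefix + 'deleted_at',
                    contents=str(time.time()))


def gc_pass(conn, container, delete_checkpoint_resources):
    # Phase 2: the single GC process physically deletes checkpoints that were
    # marked "deleted" more than GRACE_PERIOD seconds ago.
    _, listing = conn.get_container(container, prefix='checkpoints/',
                                    delimiter='/')
    for entry in listing:
        subdir = entry.get('subdir')
        if not subdir:
            continue
        _, status = conn.get_object(container, subdir + 'status')
        if status.decode() != 'deleted':
            continue
        _, marked_at = conn.get_object(container, subdir + 'deleted_at')
        if time.time() - float(marked_at.decode()) > GRACE_PERIOD:
            # Clean up resources outside the bank first, then the bank keys.
            delete_checkpoint_resources(subdir.split('/')[1])
```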
saggi | But how do we ensure that it only runs on one site? | 13:10 |
saggi | That is what we haven't solved yet | 13:10 |
wei__ | sorry, could you describe the problem in more detail? | 13:11 |
saggi | Let's say someone at site A and someone at site B decide to delete the same checkpoint. You will obviously encounter issues. | 13:13 |
saggi | Even with the GC approach we need to somehow make sure only one site runs the GC | 13:13 |
saggi | Or we get issues while cleaning up resources outside the bank | 13:13 |
saggi | wei__: Do you understand the issue? | 13:14 |
wei__ | yeah | 13:14 |
wei__ | "Or we get issues while cleaning up resources outside the bank" - what issue? a delete error, 404 not found? | 13:15 |
wei__ | hmm, could it be an early exit since the key hasn't been replicated there yet? | 13:16 |
saggi | Let's say we also have a volume backed up somewhere else. When we delete the checkpoint we also need to delete this volume from the other storage. If two processes try to do it at once, one of them will fail. | 13:17 |
wei__ | but we wait long enough before deleting, right? we assume the key has already been replicated, then we start the GC | 13:17 |
saggi | wei__: But then two sites can start the GC at once. | 13:17 |
wei__ | why not delete the checkpoint key first? only the one that succeeds in deleting the checkpoint key will do the following steps to clean up the backup resources | 13:19 |
wei__ | swift will only allow one client to succeed in deleting the key, the others should get a 404, won't they? | 13:19 |
saggi | wei__: But that doesn't happen immediately with geo replication | 13:20 |
wei__ | but we wait long enough before deleting, right? we assume the key has already been replicated, then we start the GC | 13:20 |
wei__ | that's the assumption, isn't it? | 13:20 |
* saggi is thinking | 13:21 | |
wei__ | ok, you mean the delete is not immediate with geo replication | 13:21 |
wei__ | thinking | 13:22 |
saggi | What I'm saying is that two servers can decide to act on the deletion at once | 13:22 |
saggi | yes | 13:22 |
wei__ | saggi, the condition for the GC to delete a checkpoint is whether the lease of this checkpoint is still there | 13:25 |
saggi | Yes, if it's missing we can delete | 13:25 |
saggi | Since we know it was abandoned | 13:26 |
saggi | gampel suggested having a root lease that is the only one that can actually delete checkpoints. Everyone can mark deletions but only it can actually delete. | 13:27 |
wei__ | so in which site should the root lease be located? | 13:28 |
wei__ | we need to ensure this root lease won't be lost in any site failure | 13:28 |
saggi | It's in the bank. If it expires someone else will become root. | 13:28 |
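A sketch of the root-lease idea under the same assumptions as the lease sketch above: the GC role is guarded by one expiring lease object in the bank, whoever finds it missing claims it, and the holder keeps refreshing it while it runs the GC. With geo-replicated Swift the claim-then-read-back is only best-effort, which is exactly the caveat being discussed; the object name and TTL are assumptions.

```python
from swiftclient.exceptions import ClientException

ROOT_LEASE = 'leases/root'
ROOT_TTL = 900  # seconds; the holder must refresh well within this window


def try_become_root(conn, container, owner_id):
    try:
        _, current = conn.get_object(container, ROOT_LEASE)
        return current.decode() == owner_id      # already root, or not root
    except ClientException as e:
        if e.http_status != 404:
            raise
    # Root lease expired (or never existed): claim it, then read it back.
    conn.put_object(container, ROOT_LEASE, contents=owner_id,
                    headers={'X-Delete-After': str(ROOT_TTL)})
    _, current = conn.get_object(container, ROOT_LEASE)
    return current.decode() == owner_id
```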
wei__ | actually I was thinking we should have one GC per site to collect the checkpoint garbage of its own owners, so when it fails, others need to take over the ownership | 13:30 |
wei__ | there needs to be some arbitration service to tell who is the root or who is the owner? | 13:31 |
wei__ | what if we just tolerate the delete failure? | 13:32 |
saggi | wei__: We will need to write the plugins around that. Since we don't want to have leftover data outside the bank that is unreachable. | 13:32 |
wei__ | yes, we don't leave garbage | 13:33 |
wei__ | just let the loser among the GC competitors tolerate the delete error; what are the cons here? | 13:33 |
wei__ | sorry, have to leave. shall we continue tomorrow? | 13:34 |
saggi | sure | 13:34 |
*** wei__ has quit IRC | 13:54 | |
*** wei__ has joined #openstack-smaug | 14:27 | |
*** wei__ has quit IRC | 14:30 | |
*** chenying has quit IRC | 14:30 | |
*** chenying has joined #openstack-smaug | 14:30 | |
openstackgerrit | Eran Gampel proposed openstack/smaug: First draft of the API documentation https://review.openstack.org/255211 | 15:49 |
openstackgerrit | Eran Gampel proposed openstack/smaug: Add Smaug spec directory https://review.openstack.org/261913 | 16:00 |
*** smcginnis has joined #openstack-smaug | 16:12 | |
*** gampel has quit IRC | 16:18 | |
openstackgerrit | Merged openstack/smaug: First draft of the API documentation https://review.openstack.org/255211 | 16:43 |
openstackgerrit | Saggi Mizrahi proposed openstack/smaug: Pluggable protection provider doc https://review.openstack.org/262264 | 16:54 |
openstackgerrit | Merged openstack/smaug: Add Smaug spec directory https://review.openstack.org/261913 | 16:58 |
*** openstackgerrit has quit IRC | 18:32 | |
*** openstackgerrit has joined #openstack-smaug | 18:32 | |
*** zhonghua-lee has quit IRC | 22:13 | |
*** zhonghua-lee has joined #openstack-smaug | 22:14 | |
*** saggi has quit IRC | 23:11 | |
*** saggi has joined #openstack-smaug | 23:28 |