Sunday, 2024-04-21

mcape: Hello all, from the Sunday emergency department! 06:01
mcape: We have two-region Swift storage, with three replicas on R1Z1, R2Z1, and R2Z2. 06:02
mcape: For some still-unknown reason, three disks on R2Z2 were lost yesterday. 06:02
mcape: Our storage was 97% full, and now the object nodes on R1Z1 are running out of disk space; they are already at 98-99% and growing. 06:03
mcape: I do not understand why only the R1Z1 object servers are taking in handoff partitions from the lost disks. Why not R2Z1, or the remaining disks on R2Z2? 06:04
mcape: Or what is going on? 06:05
mcape: Perhaps someone could shed light on the processes that happen during such outages? 06:05
mcape: Another interesting observation is that before, when we did a datacenter evacuation from city A to city B, 06:05
mcape: and unplugged an entire zone at once for two days, this kind of swelling did not happen at all. 06:05
mcape: So currently we are thinking of doing something like that as a workaround: shut down the entire R2Z2 until the disks are fixed. Which sounds kinda unintuitive and perhaps plainly wrong. 06:05
