-openstackstatus- NOTICE: OpenStack CI is down due to hard drive failures | 08:48 | |
*** ChanServ changes topic to "OpenStack CI is down due to hard drive failures" | 08:48 | |
lifeless | oh yeah | 09:22 |
---|---|---|
lifeless | we should be here :) | 09:22 |
AJaeger | lifeless: we could - but there's not much noise in #openstack-infra, so no need to hide ;) | 09:25 |
*** jeblair has joined #openstack-infra-incident | 13:46 | |
*** fungi has joined #openstack-infra-incident | 13:47 | |
fungi | fsck of the logs volume is on pass 5 now. should be completed momentarily i think | 13:47 |
jeblair | fungi: mordred and i are in meetings this morning | 13:48 |
jeblair | fungi: is logs the only system affected now? | 13:48 |
fungi | yep | 13:48 |
fungi | as far as i'm aware anyway | 13:48 |
jeblair | i saw talk about backups problems, i guess that's just proactive discussion? | 13:48 |
fungi | right, for the trove instance migrations happening tomorrow | 13:48 |
*** bauzas has joined #openstack-infra-incident | 13:49 | |
jeblair | it's a good reminder that we did say we should do a restore test sometime this cycle :) | 13:49 |
fungi | indeed | 13:49 |
fungi | i've got a patch on the way to add database backups for openstackid.org since it seems to have not had that puppet module applied yet | 13:49 |
jeblair | fungi: thanks for handling | 13:50 |
fungi | of course! | 13:50 |
fungi | okay, fsck completed, server restarting to make sure everything mounts properly now | 14:00 |
fungi | however it's worth noting that https://status.rackspace.com/index/viewincidents?group=11&start=1435636800 implies the event is not yet resolved | 14:01 |
fungi | so we may see more volumes there or on other servers disconnect on us still | 14:01 |
fungi | okay, everything's looking good to me so far | 14:04 |
*** smcginnis has joined #openstack-infra-incident | 14:07 | |
fungi | AJaeger: jeblair: jhesketh: mordred: pleia2: SergeyLukjanov: unless any of you object, i'll go ahead and stand down the statusbot alert and follow up on the ml | 14:21 |
AJaeger | fungi: you have the overview ;) No objection from my side :) | 14:21 |
jhesketh | fungi: sounds good to me :-0 | 14:22 |
jhesketh | *:-) | 14:22 |
AJaeger | fungi, loading http://status.openstack.org/zuul/ takes a long time | 14:26 |
AJaeger | fungi: and the icons don't show up | 14:26 |
AJaeger | Does it work fine for you? | 14:26 |
fungi | AJaeger: it loaded instantly for me | 14:26 |
fungi | where instantly is somewhere between 0 and 1 seconds at least | 14:26 |
fungi | oh, the sparklines | 14:26 |
fungi | and graphs | 14:26 |
AJaeger | fungi: yes, sparklines! graphite.openstack.org/render/?from=-8hours&width=100&height=16&margin=0&hideLegend=true&hideAxes=true&hideGrid=true&target=color(stats.gauges.zuul.pipeline.gate.current_changes,%20%276b8182%27)&_t=0.4349146376458062 | 14:26 |
fungi | yep, i bet graphite is broken | 14:27 |
fungi | checking now | 14:27 |
*** smcginnis has left #openstack-infra-incident | 14:27 | |
fungi | the graphite server is taking a very long time to let me ssh in | 14:27 |
fungi | also not responding to ping | 14:27 |
fungi | i think that server has crashed or fallen off the network | 14:28 |
fungi | i see oom killer messages but not sure how recent those are. could be from before the last time i restarted carbon-cache | 14:29 |
fungi | i'm going to try to trigger a soft reboot | 14:30 |
fungi | the virtual console, while it has output, is unresponsive to carriage return so i suspect it's frozen | 14:30 |
fungi | looks like a hard reboot is my only option there | 14:31 |
fungi | it's back up and responding on the console, but still unreachable | 14:34 |
fungi | weird! i can reach its ipv6 address at 2001:4800:7810:512:3bc3:d7f6:ff04:8201 though we don't have that in dns | 14:36 |
AJaeger | once that's up, I have one more question: Why is the post queue not processing? We have the top job in the queue since 7 hours | 14:36 |
fungi | looks like graphite can't ping its ipv4 default gw | 14:38 |
fungi | oh, though that may be filtering. it's reachable in the arp table | 14:38 |
fungi | i've opened ticket 150630-ord-0000822 with fanatical support about graphite | 14:44 |
fungi | also they responded to my question about the database instance migrations. they will indeed keep the same hostnames/dns entries so we shouldn't need to reconfigure anything | 14:45 |
fungi | #status ok The log volume was repaired and brought back online at 14:00 UTC. Log links today from before that time may be missing, and changes should be rechecked if fresh job logs are desired for them. | 14:51 |
openstackstatus | fungi: sending ok | 14:51 |
*** ChanServ changes topic to "Discussion of OpenStack project infrastructure incidents | No current incident" | 14:53 | |
-openstackstatus- NOTICE: The log volume was repaired and brought back online at 14:00 UTC. Log links today from before that time may be missing, and changes should be rechecked if fresh job logs are desired for them. | 14:53 | |
fungi | following up on the ml now before my next meeting in 5 minutes | 14:54 |
openstackstatus | fungi: finished sending ok | 14:56 |
*** pleia2 has quit IRC | 22:23 | |
*** pleia2 has joined #openstack-infra-incident | 22:25 |
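
For reference, the sparkline request AJaeger pasted at 14:26 can be rebuilt from its query parameters, which is handy when checking by hand whether graphite answers. This is only a rough sketch: the `http://` scheme, the function name `sparkline_url`, and its defaults are assumptions for illustration (the log shows the URL without a scheme), and the `_t` cache-busting parameter from the pasted URL is left out.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Target expression decoded from the URL pasted in the log.
TARGET = "color(stats.gauges.zuul.pipeline.gate.current_changes, '6b8182')"


def sparkline_url(target, base="http://graphite.openstack.org/render/",
                  hours=8, width=100, height=16):
    """Build a Graphite render URL like the one in the log.

    The base URL scheme and the default values are assumptions taken
    from the single pasted example, not from any zuul source code.
    """
    params = {
        "from": f"-{hours}hours",
        "width": width,
        "height": height,
        "margin": 0,
        "hideLegend": "true",
        "hideAxes": "true",
        "hideGrid": "true",
        "target": target,
    }
    # urlencode percent-escapes the quotes, commas and parentheses
    # and encodes the space in the target expression as '+'.
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    url = sparkline_url(TARGET)
    print(url)
    # Round-trip check: decoding the query gives back the target.
    print(parse_qs(urlsplit(url).query)["target"][0])
```

Fetching that URL (for example with a browser or curl) would hang or fail while the graphite server is unreachable over IPv4, which matches the slow status page and missing sparklines described above.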