Tuesday, 2015-06-30

-openstackstatus- NOTICE: OpenStack CI is down due to hard drive failures [08:48]
*** ChanServ changes topic to "OpenStack CI is down due to hard drive failures" [08:48]
<lifeless> oh yeah [09:22]
<lifeless> we should be here :) [09:22]
<AJaeger> lifeless: we could - but there's not much noise in #openstack-infra, so no need to hide ;) [09:25]
*** jeblair has joined #openstack-infra-incident [13:46]
*** fungi has joined #openstack-infra-incident [13:47]
<fungi> fsck of the logs volume is on pass 5 now. should be completed momentarily i think [13:47]
<jeblair> fungi: mordred and i are in meetings this morning [13:48]
<jeblair> fungi: is logs the only system affected now? [13:48]
<fungi> yep [13:48]
<fungi> as far as i'm aware anyway [13:48]
<jeblair> i saw talk about backups problems, i guess that's just proactive discussion? [13:48]
<fungi> right, for the trove instance migrations happening tomorrow [13:48]
*** bauzas has joined #openstack-infra-incident [13:49]
<jeblair> it's a good reminder that we did say we should do a restore test sometime this cycle :) [13:49]
<fungi> indeed [13:49]
<fungi> i've got a patch on the way to add database backups for openstackid.org since it seems to have not had that puppet module applied yet [13:49]
<jeblair> fungi: thanks for handling [13:50]
<fungi> of course! [13:50]
<fungi> okay, fsck completed, server restarting to make sure everything mounts properly now [14:00]
<fungi> however it's worth noting that https://status.rackspace.com/index/viewincidents?group=11&start=1435636800 implies the event is not yet resolved [14:01]
<fungi> so we may see more volumes there or on other servers disconnect on us still [14:01]
<fungi> okay, everything's looking good to me so far [14:04]
*** smcginnis has joined #openstack-infra-incident [14:07]
<fungi> AJaeger: jeblair: jhesketh: mordred: pleia2: SergeyLukjanov: unless any of you object, i'll go ahead and stand down the statusbot alert and follow up on the ml [14:21]
<AJaeger> fungi: you have the overview ;) No objection from my side :) [14:21]
<jhesketh> fungi: sounds good to me :-0 [14:22]
<jhesketh> *:-) [14:22]
<AJaeger> fungi, loading http://status.openstack.org/zuul/ takes a long time [14:26]
<AJaeger> fungi: and the icons don't show up [14:26]
<AJaeger> Does it work fine for you? [14:26]
<fungi> AJaeger: it loaded instantly for me [14:26]
<fungi> where instantly is somewhere between 0 and 1 seconds at least [14:26]
<fungi> oh, the sparklines [14:26]
<fungi> and graphs [14:26]
<AJaeger> fungi: yes, sparklines! graphite.openstack.org/render/?from=-8hours&width=100&height=16&margin=0&hideLegend=true&hideAxes=true&hideGrid=true&target=color(stats.gauges.zuul.pipeline.gate.current_changes,%20%276b8182%27)&_t=0.4349146376458062 [14:26]
<fungi> yep, i bet graphite is broken [14:27]
<fungi> checking now [14:27]
*** smcginnis has left #openstack-infra-incident [14:27]
<fungi> the graphite server is taking a very long time to let me ssh in [14:27]
<fungi> also not responding to ping [14:27]
<fungi> i think that server has crashed or fallen off the network [14:28]
<fungi> i see oom killer messages but not sure how recent those are. could be from before the last time i restarted carbon-cache [14:29]
<fungi> i'm going to try to trigger a soft reboot [14:30]
<fungi> the virtual console, while it has output, is unresponsive to carriage return so i suspect it's frozen [14:30]
<fungi> looks like a hard reboot is my only option there [14:31]
<fungi> it's back up and responding on the console, but still unreachable [14:34]
<fungi> weird! i can reach its ipv6 address at 2001:4800:7810:512:3bc3:d7f6:ff04:8201 though we don't have that in dns [14:36]
<AJaeger> once that's up, I have one more question: Why is the post queue not processing? We have the top job in the queue since 7 hours [14:36]
<fungi> looks like graphite can't ping its ipv4 default gw [14:38]
<fungi> oh, though that may be filtering. it's reachable in the arp table [14:38]
<fungi> i've opened ticket 150630-ord-0000822 with fanatical support about graphite [14:44]
<fungi> also they responded to my question about the database instance migrations. they will indeed keep the same hostnames/dns entries so we shouldn't need to reconfigure anything [14:45]
<fungi> #status ok The log volume was repaired and brought back online at 14:00 UTC. Log links today from before that time may be missing, and changes should be rechecked if fresh job logs are desired for them. [14:51]
<openstackstatus> fungi: sending ok [14:51]
*** ChanServ changes topic to "Discussion of OpenStack project infrastructure incidents | No current incident" [14:53]
-openstackstatus- NOTICE: The log volume was repaired and brought back online at 14:00 UTC. Log links today from before that time may be missing, and changes should be rechecked if fresh job logs are desired for them. [14:53]
<fungi> following up on the ml now before my next meeting in 5 minutes [14:54]
<openstackstatus> fungi: finished sending ok [14:56]
*** pleia2 has quit IRC [22:23]
*** pleia2 has joined #openstack-infra-incident [22:25]
