Tuesday, 2015-06-30

-openstackstatus- NOTICE: OpenStack CI is down due to hard drive failures		08:48
*** ChanServ changes topic to "OpenStack CI is down due to hard drive failures"		08:48
lifeless	oh yeah	09:22
lifeless	we should be here :)	09:22
AJaeger	lifeless: we could - but there's not much noise in #openstack-infra, so no need to hide ;)	09:25
*** jeblair has joined #openstack-infra-incident		13:46
*** fungi has joined #openstack-infra-incident		13:47
fungi	fsck of the logs volume is on pass 5 now. should be completed momentarily i think	13:47
jeblair	fungi: mordred and i are in meetings this morning	13:48
jeblair	fungi: is logs the only system affected now?	13:48
fungi	yep	13:48
fungi	as far as i'm aware anyway	13:48
jeblair	i saw talk about backups problems, i guess that's just proactive discussion?	13:48
fungi	right, for the trove instance migrations happening tomorrow	13:48
*** bauzas has joined #openstack-infra-incident		13:49
jeblair	it's a good reminder that we did say we should do a restore test sometime this cycle :)	13:49
fungi	indeed	13:49
fungi	i've got a patch on the way to add database backups for openstackid.org since it seems to have not had that puppet module applied yet	13:49
jeblair	fungi: thanks for handling	13:50
fungi	of course!	13:50
fungi	okay, fsck completed, server restarting to make sure everything mounts properly now	14:00
fungi	however it's worth noting that https://status.rackspace.com/index/viewincidents?group=11&start=1435636800 implies the event is not yet resolved	14:01
fungi	so we may see more volumes there or on other servers disconnect on us still	14:01
fungi	okay, everything's looking good to me so far	14:04
*** smcginnis has joined #openstack-infra-incident		14:07
fungi	AJaeger: jeblair: jhesketh: mordred: pleia2: SergeyLukjanov: unless any of you object, i'll go ahead and stand down the statusbot alert and follow up on the ml	14:21
AJaeger	fungi: you have the overview ;) No objection from my side :)	14:21
jhesketh	fungi: sounds good to me :-0	14:22
jhesketh	*:-)	14:22
AJaeger	fungi, loading http://status.openstack.org/zuul/ takes a long time	14:26
AJaeger	fungi: and the icons don't show up	14:26
AJaeger	Does it work fine for you?	14:26
fungi	AJaeger: it loaded instantly for me	14:26
fungi	where instantly is somewhere between 0 and 1 seconds at least	14:26
fungi	oh, the sparklines	14:26
fungi	and graphs	14:26
AJaeger	fungi: yes, sparklines! graphite.openstack.org/render/?from=-8hours&width=100&height=16&margin=0&hideLegend=true&hideAxes=true&hideGrid=true&target=color(stats.gauges.zuul.pipeline.gate.current_changes,%20%276b8182%27)&_t=0.4349146376458062	14:26
fungi	yep, i bet graphite is broken	14:27
fungi	checking now	14:27
*** smcginnis has left #openstack-infra-incident		14:27
fungi	the graphite server is taking a very long time to let me ssh in	14:27
fungi	also not responding to ping	14:27
fungi	i think that server has crashed or fallen off the network	14:28
fungi	i see oom killer messages but not sure how recent those are. could be from before the last time i restarted carbon-cache	14:29
fungi	i'm going to try to trigger a soft reboot	14:30
fungi	the virtual console, while it has output, is unresponsive to carriage return so i suspect it's frozen	14:30
fungi	looks like a hard reboot is my only option there	14:31
fungi	it's back up and responding on the console, but still unreachable	14:34
fungi	weird! i can reach its ipv6 address at 2001:4800:7810:512:3bc3:d7f6:ff04:8201 though we don't have that in dns	14:36
AJaeger	once that's up, I have one more question: Why is the post queue not processing? The top job has been in the queue for 7 hours	14:36
fungi	looks like graphite can't ping its ipv4 default gw	14:38
fungi	oh, though that may be filtering. it's reachable in the arp table	14:38
fungi	i've opened ticket 150630-ord-0000822 with fanatical support about graphite	14:44
fungi	also they responded to my question about the database instance migrations. they will indeed keep the same hostnames/dns entries so we shouldn't need to reconfigure anything	14:45
fungi	#status ok The log volume was repaired and brought back online at 14:00 UTC. Log links today from before that time may be missing, and changes should be rechecked if fresh job logs are desired for them.	14:51
openstackstatus	fungi: sending ok	14:51
*** ChanServ changes topic to "Discussion of OpenStack project infrastructure incidents | No current incident"		14:53
-openstackstatus- NOTICE: The log volume was repaired and brought back online at 14:00 UTC. Log links today from before that time may be missing, and changes should be rechecked if fresh job logs are desired for them.		14:53
fungi	following up on the ml now before my next meeting in 5 minutes	14:54
openstackstatus	fungi: finished sending ok	14:56
*** pleia2 has quit IRC		22:23
*** pleia2 has joined #openstack-infra-incident		22:25
