*** Zara is now known as Zara__ | 10:31 | |
jeblair | infra-root: ping | 15:53 |
---|---|---|
*** yolanda has joined #openstack-infra-incident | 15:53 | |
mordred | jeblair: o/ | 15:53 |
* anteaya stands by | 15:53 | |
jeblair | i'm moving discussion of filesystem corruption over here | 15:53 |
jeblair | i don't quite understand what i'm seeing on nodepool yet | 15:54 |
yolanda | elasticsearch nodes look fine | 15:54 |
jeblair | [313589.385320] EXT4-fs (loop0p1): mounted filesystem with ordered data mode. Opts: (null) | 15:54 |
jeblair | there were lots of errors regarding loop devices | 15:54 |
jeblair | so i assume something went wrong while dib was running... | 15:54 |
jeblair | but this is also in there: | 15:54 |
jeblair | [318745.450555] EXT4-fs (xvde2): mounted filesystem with ordered data mode. Opts: (null) | 15:54 |
jeblair | i don't think we're using xvde2 for anything | 15:55 |
mordred | yah. I agree with that assumption | 15:55 |
jeblair | cause we are using the cinder volume for /opt | 15:55 |
jeblair | xvde2 is commented out in the fstab | 15:55 |
jeblair | it does not show up in /proc/mounts | 15:56 |
jeblair | i unmounted /opt (the cinder volume) and fsck'd; it seems fine | 15:58 |
jeblair | so i think perhaps we did not actually see an error with the cinder volume here... just some other weirdness | 15:58 |
jeblair | but i'm still getting loop device errors | 15:58 |
jeblair | losetup /dev/loop0p1 | 16:01 |
jeblair | loop: can't get info on device /dev/loop0p1: No such device or address | 16:01 |
jeblair | losetup /dev/loop0 | 16:01 |
jeblair | loop: can't get info on device /dev/loop0: No such device or address | 16:01 |
jeblair | i don't know what it thinks its attached to | 16:01 |
jeblair | hrm... i wonder if we did see an interruption to the cinder volume, but since most of the work was happening through loopback devices on very large files on the volume, it was the filesystem in the file as exposed through the loopback device that failed, rather than the filesystem on the real volume | 16:02 |
jeblair | (because we weren't performing ext4 filesystem operations on the volume's filesystem) | 16:03 |
yolanda | jeblair, have you tried dmsetup? i used to rely on that to clean stuff when i was testing dib and volumes | 16:03 |
jeblair | yolanda: will that show info about loop devices? | 16:03 |
anteaya | context for other folks reading logs: https://status.rackspace.com/index/viewincidents?group=21&start=1453784400 | 16:04 |
jeblair | dmsetup status only shows: | 16:04 |
jeblair | main-opt: 0 1073733632 linear | 16:04 |
jeblair | i think something is messed up with the loopback module, and i'd like to reboot nodepool to try to correct it | 16:05 |
anteaya | I support rebooting nodepool | 16:06 |
yolanda | ++ | 16:06 |
jeblair | okay, doing that now | 16:06 |
jeblair | the host is back up and everything looks okay with loop and /opt | 16:08 |
anteaya | woooot | 16:09 |
jeblair | clarkb: how did you start nodepool? | 16:09 |
jeblair | i don't see "no-builder" in the init script | 16:09 |
jeblair | ah | 16:10 |
jeblair | it's in /etc/default/nodepool | 16:10 |
anteaya | so bit of lag from hearing from clarkb expected this week, yeah? | 16:10 |
jeblair | so i will start it as normal | 16:10 |
jeblair | anteaya: yes, but have to ask anyway -- he deployed some changes manually that haven't landed yet | 16:11 |
anteaya | ah understood | 16:11 |
anteaya | figured you had remembered | 16:11 |
anteaya | would they be in his user history? | 16:11 |
jeblair | anteaya: but i believe they are documented in the pending change at https://review.openstack.org/271541 (which still needs an update before merging; though nobody wrote a comment on the change describing it) | 16:11 |
jeblair | anteaya: i figured it out | 16:11 |
* anteaya nods | 16:11 | |
jeblair | nodepool is running, but i have not started the builder | 16:13 |
anteaya | I understand better, thanks for sharing the link to that patch | 16:13 |
jeblair | i'm going to see if it looks like there's some un-cleaned-up dib stuff laying around | 16:13 |
jeblair | greghaynes: ^ | 16:13 |
jeblair | greghaynes: you may be interested in my theories about filesystem/device corruption in scrollback | 16:14 |
jeblair | i'm removing some stuff from dib_tmp | 16:14 |
jeblair | i'll start the builder now | 16:16 |
jeblair | it's running and idle; i think that's everything for now then | 16:16 |
anteaya | yay | 16:17 |
mordred | jeblair: back online - looks like we're back up and going now? | 17:21 |
jeblair | mordred: yeah | 17:25 |
*** AJaeger has joined #openstack-infra-incident | 18:50 | |
*** AJaeger has left #openstack-infra-incident | 21:00 | |