Tuesday, 2016-06-07

phil_hEric, are you there?19:30
eloyes. just jumped on this computer19:32
elowhat's up19:32
phil_hJust wanted to check to see if you have given the ansible stuff much thought?19:33
elonot off-hand19:34
phil_hI am looking at what plumgrid did and seeing if I can mod my astara ansible stuff to fit19:34
elolooked at it earlier.. the way they disable stuff is what we need to borrow19:35
phil_hI would like to get astara included in the OSA stuff19:35
elobrb. bio break19:36
phil_hI need to figure out how to use OSA to install OpenStack so I can then start to merge the stuff19:36
phil_hanyway I am trying to find some time to get started on the OSA stuff for Astara20:18
phil_hIs anyone else interested?20:18
stupidnicI am wondering if anybody is around to help me troubleshoot an issue I am having with a virtual network.21:45
stupidnicI have a client (only one) that seems to be having some sort of issue with their instances. Basically the instances stop pinging all of a sudden. Very random and intermittent.21:47
stupidnicI have a ping running externally to a floating IP as well as a ping from the router to the internal IP address. Randomly the internal ping stops responding.21:49
stupidnicI have used tcpdump on the hypervisor to confirm that the packets are getting through to the compute node, it's just that the instance suddenly stops responding.21:49
stupidnicIn the tcpdump I can see the packets going unanswered, and then suddenly the router does an ARP request for the IP address and it magically starts working again21:50
stupidnicThis seems to be happening to all their instances. I started up another instance just for testing and it is also impacted by this.21:51
fzylogic_any chance that they have a second router instance running somewhere? We've seen that happen a few times when an instance partially fails, but astara isn't able to delete the service VM (usually due to a nova bug/outage)22:11
fzylogic_doesn't seem *quite* right if you're seeing traffic hitting the VM, but that's the most frequent problem we see that affects only a single tenant.22:13
stupidnicfzylogic_: Hmmm... Perhaps. It seems to be really at the router level. So I went to each compute node and started a tcpdump on the bridge interface for the tenant22:22
stupidnicWhat is odd... when the ping stops on the router, I can still see the packets being passed between the two compute nodes22:22
stupidnicSo the pings keep going, it's just the router isn't seeing the traffic I think22:23
fzylogic_could definitely be the case that there's a duplicate router somewhere taking over that gateway IP22:23
stupidnicThat makes sense.22:24
stupidnicI have also seen some weird behavior like this a long long time ago where somebody assigned a broadcast IP as an IP on a server22:24
stupidnicIt had similar behavior22:25
stupidnicI just typed arp on the router22:26
stupidnicand it is very slow to respond22:26
fzylogic_dns not working?22:26
stupidnicdoesn't seem to be... I can't ping
stupidnicOkay... So looking at the public interface on the router... I am showing dropped packets on the public interface22:28
stupidnicfzylogic_: is there any way to find out if there is another instance hanging around?22:37
fzylogic_we've got a script that compares the list of routers in neutron with a list of VMs in nova22:40
fzylogic_or if your cluster's small enough, you can probably just eyeball it22:41
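
The script fzylogic_ mentions is not shown in the log. A minimal sketch of the same idea, assuming you have already exported the Neutron router IDs and the Nova server names as plain lists, and assuming (hypothetically) that each service VM name contains the ID of the router it serves:

```python
from collections import defaultdict


def audit_router_vms(router_ids, vm_names):
    """Cross-check Neutron router IDs against Nova service VM names.

    Assumption: a service VM's name contains the ID of its router.
    Adjust the matching rule to your deployment's naming scheme.
    """
    vms_by_router = defaultdict(list)
    unmatched = []
    for name in vm_names:
        # Find the first router whose ID appears in this VM name.
        owner = next((rid for rid in router_ids if rid in name), None)
        if owner is None:
            unmatched.append(name)
        else:
            vms_by_router[owner].append(name)

    # Routers served by more than one VM: candidates for an IP conflict.
    duplicates = {rid: names for rid, names in vms_by_router.items()
                  if len(names) > 1}
    # Routers with no VM at all.
    missing = [rid for rid in router_ids if rid not in vms_by_router]
    return duplicates, missing, unmatched
```

A check like this only sees VMs that Nova still knows about, so a clean result does not rule out a leftover domain that exists only in libvirt.
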
stupidnicOkay. So another data point.22:41
stupidnicI just went and checked another tenant's router22:42
stupidnicand I can ping google, the gateway, etc22:42
stupidnicBut in this tenant's router I can't do that22:42
stupidnicThat doesn't make any sense to me22:43
fzylogic_if you leave a ping going for a while, do you see intermittent success at all?22:44
fzylogic_if so, that sounds like 2 routers fighting over an IP22:44
stupidniclet me set up a ping to the router's external interface22:45
stupidnicI will say that I already rebooted the router in the tenant earlier today (was one of the first things I tried)22:48
stupidnicOkay... pinging the external IP of the router doesn't seem to show any issues with it22:54
stupidnicpinging clean22:54
fzylogic_do you see the packets on the router VM that you're logged into?22:54
stupidnicCross checking22:55
stupidnicOkay... the VM seems to be hung?22:55
stupidnicyep... the instance rebooted22:57
stupidnicthat's really weird22:57
stupidnicAlso... 8 routers, 8 instances22:58
stupidnicso no duplicates as near as I can tell22:58
stupidnicCan't ping the gateway though23:00
stupidnicfrom the new router that just started up23:00
stupidnicSo to cross check this... I have another router running on the same compute node, and everything works fine on that router23:03
stupidnicI can ping out23:04
stupidnicOkay. I confirmed that the two routers can ping each other23:06
stupidnicfzylogic_: okay. So looking at the router VM external interface... I am not seeing the ICMP packets23:12
stupidnicSo something else has that IP address I am guessing23:13
stupidnicAlright... getting somewhere now23:13
stupidnicChecking the core for the IP address assigned to the router VM ( I show a different MAC address than the one assigned to the router's public interface23:14
stupidnicDoes neutron have a way of searching by mac address?23:15
fzylogic_none that I can think of23:18
fzylogic_at least not directly23:18
fzylogic_but if you do a port-list, I think it should show the mac addresses on each one23:19
stupidnicOkay. That's the right track. Now to see if I can find that in the haystack23:20
stupidnicawesome... that mac address doesn't exist in the port list23:22
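
Searching the port list by eye is error-prone. A small sketch of the lookup, assuming the port list has been loaded as a list of dicts with a `mac_address` key (the field name Neutron uses for ports); the helper names are illustrative:

```python
def normalize_mac(mac):
    """Lower-case a MAC address and use ':' as the separator."""
    return mac.strip().lower().replace("-", ":")


def ports_with_mac(ports, mac):
    """Return every port dict whose mac_address equals the given MAC."""
    wanted = normalize_mac(mac)
    return [port for port in ports
            if normalize_mac(port.get("mac_address", "")) == wanted]
```

An empty result, as in this case, means Neutron has no record of the MAC, which points at something outside Neutron's view holding the address.
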
stupidnicOkay. So I finally found the mac address.23:26
stupidnicit belongs to a tap interface on one of the hypervisors23:27
fzylogic_grep for it in /etc/libvirt/qemu/* on each one23:27
fzylogic_sounds like you may have an instance that libvirt hung onto after nova tried deleting it23:28
fzylogic_I've seen hypervisor crashes do this if nova's not configured just right23:29
stupidnicSo there is an instance xml file that has that mac address23:30
stupidnicso... how do I get rid of this?23:30
fzylogic_the same xml file will have the UUID of the instance so you can ask nova if it's supposed to exist or not23:30
fzylogic_if not, you can just do `virsh destroy instance-whatever`23:30
stupidnicjust confirming here... that the main <uuid> is the one I should be looking for, right?23:32
stupidnicYeah looks like Nova doesn't know anything about that. I went to the Hypervisor list... and I don't see that instance id listed for the compute node at all23:33
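
The grep-then-read-the-UUID step above can be sketched in one pass. This assumes the usual libvirt domain XML layout, with `<name>` and `<uuid>` as direct children of `<domain>` and interface MACs in `<mac address='...'/>` elements; the function names are illustrative:

```python
import glob
import os
import xml.etree.ElementTree as ET


def match_domain(xml_text, mac):
    """Return (name, uuid) if the domain XML has an interface with this MAC."""
    root = ET.fromstring(xml_text)
    wanted = mac.strip().lower()
    macs = {el.get("address", "").lower() for el in root.iter("mac")}
    if wanted not in macs:
        return None
    # findtext() only looks at direct children, so this is the main <uuid>.
    return root.findtext("name"), root.findtext("uuid")


def scan_libvirt_dir(xml_dir, mac):
    """Check every *.xml file in xml_dir and collect the matching domains."""
    matches = []
    for path in sorted(glob.glob(os.path.join(xml_dir, "*.xml"))):
        try:
            with open(path, encoding="utf-8") as handle:
                found = match_domain(handle.read(), mac)
        except (OSError, ET.ParseError):
            continue  # unreadable or malformed file: skip it
        if found is not None:
            matches.append(found)
    return matches
```

Run on each hypervisor against `/etc/libvirt/qemu`, then ask Nova whether the returned UUID is supposed to exist before destroying anything.
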
stupidnicThis was probably related to the hostname change we made to the controller a while back23:34
stupidnicThat wrecked some stuff23:34
fzylogic_could definitely be23:34
stupidnicneed a blink tag on that one...23:34
fzylogic_might be interesting to look through old nova logs, but for now you can pretty safely just delete that VM with virsh23:34
stupidnicchanging the hostname on the controller is bad... very bad23:34
fzylogic_also, check what you have configured for the running_deleted_instance_action, running_deleted_instance_timeout, and running_deleted_instance_poll_interval options in nova.conf23:36
fzylogic_you want running_deleted_instance_action=reap23:36
stupidnicGood to know.23:36
stupidnicThank you23:36
fzylogic_and whatever values for the other 2 you consider sane for your environment23:36
fzylogic_sure thing23:37
stupidnicHot damn23:37
stupidnicthat's got it23:37
stupidnicman... that was a lot harder than it should have been23:38
stupidnicfzylogic_: is that option in the compute nova.conf or at the server?23:39
fzylogic_nova.conf on your hypervisors23:39
fzylogic_it's used by nova-compute23:39
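
To confirm what a hypervisor currently has set, a rough check along these lines may help. It assumes the three options sit in the `[DEFAULT]` section of nova.conf and that the file parses as ordinary INI; `running_deleted_settings` is an illustrative name, not part of Nova:

```python
import configparser

OPTIONS = (
    "running_deleted_instance_action",
    "running_deleted_instance_timeout",
    "running_deleted_instance_poll_interval",
)


def running_deleted_settings(conf_text):
    """Return the three running_deleted_instance_* values from [DEFAULT].

    Options that are not set in the file come back as None.
    """
    # strict=False tolerates repeated keys, which nova.conf may contain;
    # interpolation=None keeps '%' characters in values literal.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return {name: parser.get("DEFAULT", name, fallback=None)
            for name in OPTIONS}
```

Feed it the contents of nova.conf from each compute node; a `None` for the action means the option is not set in the file and nova-compute falls back to its built-in default.
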

