| *** AJaeger is now known as AJaeger_ | 06:08 | |
| *** hjensas has joined #openstack-infra-incident | 07:30 | |
| *** hjensas has quit IRC | 08:23 | |
| -openstackstatus- NOTICE: zuul was restarted due to an unrecoverable disconnect from gerrit. If your change is missing a CI result and isn't listed in the pipelines on http://status.openstack.org/zuul/ , please recheck | 08:51 | |
| *** hjensas has joined #openstack-infra-incident | 09:34 | |
| *** Daviey has quit IRC | 12:28 | |
| *** Daviey has joined #openstack-infra-incident | 12:55 | |
| jroll | https://nvd.nist.gov/vuln/detail/CVE-2016-10229#vulnDescriptionTitle "udp.c in the Linux kernel before 4.5 allows remote attackers to execute arbitrary code via UDP traffic that triggers an unsafe second checksum calculation during execution of a recv system call with the MSG_PEEK flag." | 13:00 |
| jroll | not sure if infra listens for UDP off the top of my head, but thought I'd drop that here | 13:00 |
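Editorial note: a quick way to check whether a given host listens on UDP at all is to dump its listening sockets. A minimal sketch; it assumes iproute2's ss is available (with net-tools netstat as a fallback on older hosts) and needs root to show the owning processes:

```sh
# Listening UDP sockets, numeric addresses, with the owning process for each.
sudo ss -lunp

# Fallback for older hosts that only have net-tools:
sudo netstat -lunp
```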
| *** hjensas has quit IRC | 14:57 | |
| *** lifeless_ has joined #openstack-infra-incident | 15:07 | |
| *** mordred has quit IRC | 15:10 | |
| *** lifeless has quit IRC | 15:10 | |
| *** EmilienM has quit IRC | 15:10 | |
| *** mordred1 has joined #openstack-infra-incident | 15:10 | |
| *** 21WAAA2JF has joined #openstack-infra-incident | 15:13 | |
| *** mordred1 is now known as mordred | 15:46 | |
| clarkb | jroll: thanks, we have an open snmp port we might want to close | 16:10 |
| clarkb | pabelanger: fungi ^ | 16:10 |
| pabelanger | ack | 16:10 |
| pabelanger | pbx might be one | 16:12 |
| pabelanger | since we use UDP for RTP | 16:12 |
| *** hjensas has joined #openstack-infra-incident | 16:12 | |
| jroll | clarkb: np | 16:13 |
| clarkb | actually snmp is source specific so fairly safe | 16:13 |
| clarkb | afs | 16:13 |
| clarkb | is udp | 16:13 |
| clarkb | mordred: ^ | 16:13 |
| pabelanger | ya, AFS might be our largest exposure | 16:15 |
| mordred | oy. that's awesome | 16:17 |
| pabelanger | https://people.canonical.com/~ubuntu-security/cve/2016/CVE-2016-10229.html | 16:17 |
| pabelanger | Linux afs01.dfw.openstack.org 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | 16:18 |
| pabelanger | so ya, might need a new kernel and reboot? | 16:18 |
| fungi | ugh | 16:19 |
| fungi | any reports it's actively exploited in the wild? | 16:19 |
| pabelanger | I am not sure | 16:21 |
| pabelanger | looks like android is getting the brunt of it however | 16:22 |
| fungi | keep in mind source filtering is still a lot less effective for udp than tcp | 16:23 |
| fungi | easier to spoof (mainly just need to guess a source address and active ephemeral port) | 16:23 |
| *** openstack has joined #openstack-infra-incident | 16:33 | |
| *** openstackstatus has joined #openstack-infra-incident | 16:34 | |
| *** ChanServ sets mode: +v openstackstatus | 16:34 | |
| *** 21WAAA2JF is now known as EmilienM | 17:05 | |
| *** EmilienM has joined #openstack-infra-incident | 17:05 | |
| clarkb | looking at https://people.canonical.com/~ubuntu-security/cve/2016/CVE-2016-10229.html says xenial is not affected? | 18:57 |
| clarkb | I also don't see a ubuntu security notice for it yet | 18:59 |
| clarkb | it looks like we may be patched in many places? | 19:06 |
| clarkb | trying to figure out what exactly is required, but if I read that correctly xenial is fine despite being 4.4? trusty needs kernel >=3.13.0-79.123 | 19:06 |
| clarkb | pabelanger: fungi ^ that sound right to you? if so maybe next step is generate a list of kernels on all our hosts via ansible then produce a needs to be rebooted list | 19:12 |
| pabelanger | clarkb: ya, xenial isn't affected from what I read. | 19:13 |
| pabelanger | ++ to ansible run | 19:13 |
| clarkb | pabelanger: Linux review 3.13.0-85-generic #129-Ubuntu SMP Thu Mar 17 20:50:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux | 19:48 |
| pabelanger | ++ | 19:48 |
| clarkb | on review.o.o which is newer than 3.13.0-79.123 | 19:48 |
| clarkb | so I think just restart the service? | 19:48 |
| pabelanger | ya, looks like just a restart then | 19:49 |
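Editorial note: a minimal sketch of the version check being done by eye above, assuming the running kernel comes from a stock linux-image-$(uname -r) package; the 3.13.0-79.123 threshold for trusty is the one quoted earlier in the discussion:

```sh
# Version of the package providing the currently running kernel,
# e.g. 3.13.0-85.129 for the 3.13.0-85-generic kernel seen on review.o.o.
running=$(dpkg-query -W -f='${Version}' "linux-image-$(uname -r)")

# Compare against the first fixed trusty kernel version for CVE-2016-10229.
if dpkg --compare-versions "$running" ge 3.13.0-79.123; then
    echo "running kernel is new enough"
else
    echo "needs a kernel upgrade and reboot"
fi
```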
| -openstackstatus- NOTICE: The Gerrit service on http://review.openstack.org is being restarted to address hung remote replication tasks. | 19:51 | |
| fungi | sorry for not being around... kernel update sounds right, too bad we didn't take the gerrit restart as an opportunity to reboot | 19:58 |
| clarkb | fungi: we don't need to reboot it | 20:00 |
| clarkb | fungi: gerrit's kernel is new enough I think ^ you can double check above. | 20:00 |
| fungi | oh | 20:00 |
| fungi | yep, read those backwards | 20:00 |
| clarkb | pabelanger: puppetmaster.o.o:/home/clarkb/collect_facts.yaml has a small playbook thing to collect kernel info, want to check that out? | 20:00 |
| clarkb | that is incredibly verbose, is there a better way to introspect facts? | 20:02 |
| pabelanger | clarkb: sure | 20:05 |
| clarkb | pabelanger: if it looks good to you I will run it against all the hosts and stop using my --limit commands to test | 20:06 |
| clarkb | its verbose but works so just going to go with it I think | 20:06 |
| pabelanger | clarkb: looks good | 20:06 |
| clarkb | pabelanger: ok I will run it and redirect output into ~clarkb/kernels.txt | 20:07 |
| clarkb | its running | 20:08 |
| pabelanger | clarkb: only seeing the ok: hostname bits | 20:10 |
| clarkb | pabelanger: ya its gathering all the facts before running the task I think | 20:10 |
| pabelanger | Ha, ya | 20:11 |
| pabelanger | gather_facts: false | 20:11 |
| clarkb | well we need the facts | 20:11 |
| pabelanger | but, we need them | 20:11 |
| pabelanger | ya | 20:11 |
| clarkb | I guess I could've uname -a 'd it instead :) | 20:11 |
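Editorial note: for a one-off query like this, an ad-hoc ansible run is a less verbose alternative to a full playbook. A sketch; the host pattern, fork count, timeout, and output file are assumptions:

```sh
# Just the kernel fact from every host, without the rest of the fact dump.
ansible '*' -m setup -a 'filter=ansible_kernel' | tee ~/kernels.txt

# Or, closer to the "uname -a" idea, with a shorter connection timeout so
# unreachable hosts don't hold things up as long.
ansible '*' -m command -a 'uname -r' --forks 20 --timeout 30 | tee ~/kernels.txt
```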
| pabelanger | okay, let me get some coffee | 20:11 |
| pabelanger | also, forgot about infracloud | 20:12 |
| pabelanger | that will be fun | 20:12 |
| clarkb | hrm why does the mtl01 internap mirror show up, I thought I cleaned that host up a while back | 20:16 |
| mgagne | mtl01 is the active region, nyj01 is the one that is now unused | 20:16 |
| * mgagne didn't read backlog | 20:16 | |
| clarkb | oh I got them mixed up, thanks | 20:18 |
| clarkb | looks like right now ansible is timing out trying to get to things like jeblairtest | 20:27 |
| clarkb | I'm just gonna let it timeout on its own, does anyone know how long that will take? | 20:27 |
| fungi | maybe 60 minutes in my experience | 20:36 |
| clarkb | fwiw most of my spot checking shows our kernels are new enough | 20:54 |
| clarkb | so don't expect to need to reboot much once ansible gets back to me | 20:54 |
| clarkb | its been an hour and they haven't timed out yet... | 21:17 |
| clarkb | done waiting, going to kill the ssh processes and hope that doesn't prevent the play from finishing | 21:20 |
| clarkb | pabelanger: https://etherpad.openstack.org/p/infra-reboots-old-kernel | 21:26 |
| clarkb | I'm just going to start picking off some of the easy ones | 21:28 |
| clarkb | mordred: are you around? how do we do the afs reboots? make afs01 rw for all volumes, reboot afs02, make 02 rw, reboot 01? | 21:28 |
| clarkb | then do each of the db hosts one at a time? what about kdc01? | 21:28 |
| clarkb | doing etherpad now so the etherpad will be temporarily unavailable | 21:34 |
| clarkb | for proposal.slave.openstack.org and others, is the zuul launcher going to gracefully pick those back up again after a reboot or will we have to restart the launcher too? | 21:37 |
| clarkb | I guess I'm going to find out? | 21:37 |
| clarkb | rebooting proposal.slave.o.o now as its not doing anything | 21:38 |
| clarkb | I'm going to try grabbing all the mirror update locks on mirror-update so that I can reboot it next | 22:09 |
| clarkb | pabelanger: gem mirroring appears to have been stalled since april 2nd but there is a process holding the lock. Safe to grab it and then reboot? | 22:26 |
| pabelanger | clarkb: ya, we'll need to grab lock after reboot | 22:33 |
| pabelanger | just finishing up dinner, and need to run an errand | 22:34 |
| pabelanger | I can help with reboots once I get back in about 90mins | 22:34 |
| clarkb | pabelanger: sounds good we can leave mirror-update and afs for then. I will keep working on the others | 22:34 |
| clarkb | afs didn't survive reboot on gra1 mirror. working to fix now | 22:39 |
| clarkb | oh maybe it did and it's just slow, things are cd-able now | 22:40 |
| clarkb | pabelanger: for when you get back, grafana updated on the grafana server, not sure if it matters or not? hopefully I didn't derp anything | 22:43 |
| clarkb | my apologies if it does :/ | 22:44 |
| clarkb | the web ui seems to be working though so going to reboot server now | 22:44 |
| clarkb | and its up and happy again | 22:50 |
| clarkb | now for the baremetal00 host for infra clouds running bifrost | 22:50 |
| pabelanger | looks like errands are pushed back a few hours | 23:00 |
| pabelanger | clarkb: looks like we might have upgraded grafana.o.o too | 23:00 |
| pabelanger | checking logs to see if there are any errors | 23:00 |
| pabelanger | but so far, seems okay | 23:00 |
| clarkb | pabelanger: yes it upgraded, sorry I didn't think to not do that until it was done | 23:00 |
| clarkb | but ya service seems to work | 23:00 |
| pabelanger | 2.6.0 | 23:01 |
| pabelanger | should be fine | 23:01 |
| clarkb | I have been running apt-get update and dist-upgrade before reboots to make sure we get the new stuff | 23:01 |
| pabelanger | we'll find out soon if grafyaml has issues | 23:01 |
| clarkb | :) | 23:01 |
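Editorial note: the per-host pre-reboot pass clarkb describes a few lines up amounts to roughly the following (a sketch; the reboot-required check assumes update-notifier-common is installed):

```sh
sudo apt-get update
sudo apt-get dist-upgrade -y

# Reboot so the newly installed kernel actually gets used; the flag file is
# only present when an installed update requested a reboot.
if [ -f /var/run/reboot-required ]; then
    sudo reboot
fi
```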
| clarkb | baremetal00 and puppetmaster are the two I want to do next | 23:01 |
| clarkb | then we are just left with afs things | 23:01 |
| pabelanger | k | 23:01 |
| clarkb | I think baremetal should be fine to just reboot | 23:01 |
| pabelanger | ya | 23:02 |
| clarkb | for puppetmaster we should grab the puppet run lock and then do it so we don't interrupt a bunch of ansible/puppet | 23:02 |
| pabelanger | okay | 23:02 |
| pabelanger | which do you want me to do | 23:02 |
| clarkb | I want you to do mirror-update if you can | 23:02 |
| pabelanger | k | 23:02 |
| clarkb | since you have the gem lock? I should have all the other locks at this point you can ps -elf | grep k5 to make sure nothing else is running | 23:02 |
| clarkb | I'm logged in but don't worry about it, the only processes I have are holding locks | 23:03 |
| clarkb | I'm going to reboot baremetal00 now | 23:03 |
| * clarkb crosses fingers | 23:03 | |
| pabelanger | ya, k5start process are not running | 23:03 |
| pabelanger | so think mirror-update.o.o is ready | 23:03 |
| clarkb | pabelanger: cool go for it | 23:04 |
| clarkb | then grab whatever locks you need | 23:04 |
| clarkb | since they shouldn't survive a reboot | 23:04 |
| pabelanger | rebooting | 23:04 |
| clarkb | then when thats done and baremetal comes back lets figure out puppetmaster, then figure out afs servers | 23:05 |
| clarkb | baremetal still not back. Real hardware is annoying :) | 23:06 |
| pabelanger | mirror-update.o.o good, locks grabbed again | 23:08 |
| clarkb | and still not up | 23:08 |
| pabelanger | ya, will take a few minutes | 23:08 |
| clarkb | pabelanger: can you start poking at puppetmaster maybe, see about grabbing lock for the puppet/ansible rotation? | 23:08 |
| clarkb | remember there are two now iirc | 23:08 |
| pabelanger | yup | 23:09 |
| clarkb | tyty | 23:09 |
| clarkb | at what point do I worry the hardware for baremetal00 is not coming back? :/ | 23:10 |
| clarkb | oh it just started pinging | 23:10 |
| clarkb | \o/ | 23:10 |
| pabelanger | okay, have both locks on puppetmaster.o.o | 23:12 |
| pabelanger | and ansible is not running | 23:12 |
| clarkb | I don't need puppetmaster if you want to go for it | 23:12 |
| pabelanger | k, rebooting | 23:12 |
| clarkb | baremetal is up now. and ironic node-list works | 23:13 |
| clarkb | well thats interesting | 23:13 |
| clarkb | it's running its old kernel | 23:13 |
| pabelanger | puppetmaster.o.o online | 23:14 |
| clarkb | I'm going to keep debugging baremetal and may have to reboot it again :( | 23:14 |
| pabelanger | ansible now running | 23:15 |
| clarkb | as for afs, can we reboot the kdc01 server safely? we just won't be able to get kerberos tokens while it's down? | 23:15 |
| clarkb | and can we reboot the db servers one at a time without impacting the service? | 23:16 |
| clarkb | then we just have to do the fileservers in a synchronized manner ya? | 23:16 |
| clarkb | mordred: corvus ^ | 23:16 |
| clarkb | I manually installed linux-image-3.13.0-116-generic on baremetal00, I do not know why a dist-upgrade was not pulling that in | 23:18 |
| clarkb | but its in there and in grub so thinking I will do a second reboot now | 23:18 |
| clarkb | pabelanger: ^ any ideas on that or concerns? | 23:18 |
| pabelanger | nope, go for it. We have access to iLo if needed | 23:19 |
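Editorial note: one plausible explanation for dist-upgrade not pulling in a newer kernel on baremetal00 (an assumption, not something confirmed in the log) is that only versioned linux-image-3.13.0-NN-generic packages were installed and not the linux-image-generic metapackage that tracks new kernel ABIs. A sketch of checking and fixing that on a trusty host:

```sh
# Show which kernel packages and metapackages are actually installed.
dpkg -l 'linux-image*' 'linux-generic*' | grep '^ii'

# Installing the metapackage makes future dist-upgrades pick up new kernel ABIs.
sudo apt-get install linux-image-generic
```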
| clarkb | we don't have to go through baremetal00 to get ilo? we can go through any of the other hosts ya? | 23:20 |
| clarkb | thats my biggest concern | 23:20 |
| pabelanger | I think we can do any now | 23:20 |
| pabelanger | they are all on same network | 23:20 |
| clarkb | ok rebooting | 23:20 |
| clarkb | I put some notes about doing the afs related servers on the etherpad. Does that look right to everyone? | 23:23 |
| clarkb | pabelanger: maybe you can check if one file server is already rw for all volumes and we can reboot the other one? | 23:23 |
| * clarkb goes to grab a drink while waiting for baremetal00 | 23:23 | |
| pabelanger | clarkb: vos listvldb show everything in symc | 23:25 |
| pabelanger | sync* | 23:25 |
| pabelanger | clarkb: http://docs.openafs.org/AdminGuide/HDRWQ139.html | 23:26 |
| pabelanger | we might want to follow that? | 23:26 |
| clarkb | vldb is what we run on afsdb0X? | 23:29 |
| pabelanger | I did it from afsdb01 | 23:30 |
| pabelanger | I think we use bos shutdown | 23:30 |
| pabelanger | to rotate things out | 23:30 |
| clarkb | gotcha thats the way you signal the other server to take over duties? | 23:30 |
| pabelanger | I think so | 23:30 |
| clarkb | definitely seems like what you are supposed to do according to the guide | 23:30 |
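Editorial note: per the OpenAFS admin guide linked above, the per-fileserver part of the reboot boils down to something like the following (a sketch; -localauth assumes the commands are run as root on one of the AFS server machines):

```sh
# Confirm which servers hold the RW and RO sites for each volume before touching anything.
vos listvldb -localauth

# Cleanly stop the AFS server processes on the machine about to be rebooted.
bos shutdown afs01.ord.openstack.org -localauth -wait

# ... reboot the machine ...

# After it comes back, bosserver restarts its instances; verify they are running.
bos status afs01.ord.openstack.org -localauth
```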
| pabelanger | maybe start with afs02.dfw.openstack.org | 23:31 |
| clarkb | does afs02 only have ro volumes? | 23:31 |
| pabelanger | yes | 23:31 |
| clarkb | and afs01 is all rw? if so then ya I think we do that one first | 23:31 |
| pabelanger | right | 23:31 |
| clarkb | (still waiting on baremetal00) | 23:31 |
| pabelanger | afs01 has rw and ro | 23:31 |
| pabelanger | err | 23:31 |
| pabelanger | afs01.dfw.openstack.org RW RO | 23:32 |
| pabelanger | afs01.dfw.openstack.org RO | 23:32 |
| pabelanger | afs02.dfw.openstack.org RO | 23:32 |
| pabelanger | afs01.ord.openstack.org | 23:32 |
| pabelanger | is still online, but not used | 23:32 |
| pabelanger | maybe we do afs01.ord.openstack.org first | 23:32 |
| clarkb | right ok. Then once afs02 is back up again we transition all the volumes to be swapped on the RW RO | 23:32 |
| pabelanger | npm volume locked atm | 23:33 |
| clarkb | ord's kernel is old too but not in my list | 23:33 |
| clarkb | maybe we skipped things in the emergency file? may need to double check that after we are done | 23:34 |
| clarkb | (I used hosts: '*' to try and get them all and not !disabled) | 23:34 |
| pabelanger | odd, okay we'll need to do 3 servers it seems. afs01.ord.openstack.org is still used by a few volumes | 23:34 |
| clarkb | still waiting on baremetal00 :/ | 23:35 |
| pabelanger | okay, so which do we want to shutdown first? | 23:35 |
| clarkb | I really don't know :( my feeling is the non file servers may be the lowest impact? | 23:36 |
| pabelanger | right, afs01.ord.openstack.org is the least used FS | 23:36 |
| clarkb | ok lets do that one first of the fileservers | 23:37 |
| clarkb | then question is do we want to do kdc and afsdb before fileservers or after? | 23:37 |
| clarkb | also still no baremetal pinging. This is much longer than the last time | 23:37 |
| clarkb | pabelanger: does the ord fs have an RW volumes? | 23:38 |
| pabelanger | mirror.npm still is locked | 23:38 |
| pabelanger | clarkb: no | 23:39 |
| pabelanger | just RO | 23:39 |
| clarkb | ok so what we want to do then maybe is grab all the flocks on mirror-update so that things stop updating volumes (like npm) | 23:40 |
| clarkb | then reboot ord fileserver first? | 23:40 |
| clarkb | see how that goes? | 23:40 |
| pabelanger | sure | 23:40 |
| clarkb | ok why don't you grab the flocks. I am working on getting ilo access to baremetal00 | 23:41 |
| pabelanger | ha | 23:42 |
| pabelanger | puppet needs to create the files in /var/run still | 23:42 |
| pabelanger | since they are deleted on boot | 23:42 |
| clarkb | the lock files are deleted? | 23:42 |
| pabelanger | /var/run is tmpfs | 23:43 |
| pabelanger | so /var/run/npm/npm.lock | 23:43 |
| pabelanger | will fail, until puppet create /var/run/npm | 23:43 |
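Editorial note: the mirror update jobs serialize on flock(1) lock files under /var/run, which is a tmpfs and therefore empty after a reboot, so the directory has to be recreated before a lock can be held by hand. A sketch using the npm lock path mentioned above; holding the lock open via a long sleep is just one convenient way to do it:

```sh
# Recreate the tmpfs-backed directory the reboot wiped out (normally puppet does this).
sudo mkdir -p /var/run/npm

# Hold the lock non-blockingly; while this runs, the npm mirror job cannot grab it.
sudo flock -n /var/run/npm/npm.lock -c 'sleep infinity'
```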
| clarkb | I can't seem to hit the ipmi adds with ssh | 23:44 |
| clarkb | s/adds/addrs/ | 23:44 |
| clarkb | I am able to hit the compute host's own ipmi but it's slow, maybe baremetal is slower /me tries harder | 23:45 |
| clarkb | ok I'm on the ilo now. I guess being persistent after multiple connection timeouts is the trick | 23:49 |
| pabelanger | k, have all the locks on mirror-update | 23:50 |
| clarkb | so I can see the text console. The server is running | 23:56 |
| clarkb | but no ssh | 23:56 |
| clarkb | I think I am going to reboot with the text console up | 23:57 |
| pabelanger | k | 23:57 |
| pabelanger | looks like I'm ready to bos shutdown afs01.ord.openstack.org | 23:58 |
| clarkb | pabelanger: I think if you are ready and willing then lets do it :) | 23:58 |