*** thorst_ has joined #openstack-powervm | 00:38 | |
*** thorst_ has quit IRC | 00:43 | |
*** thorst_ has joined #openstack-powervm | 02:12 | |
*** thorst_ has quit IRC | 02:17 | |
*** thorst_ has joined #openstack-powervm | 02:17 | |
*** thorst_ has quit IRC | 02:28 | |
*** thorst_ has joined #openstack-powervm | 02:28 | |
*** thorst_ has quit IRC | 02:37 | |
*** thorst_ has joined #openstack-powervm | 03:26 | |
*** thorst_ has quit IRC | 03:28 | |
*** thorst_ has joined #openstack-powervm | 03:28 | |
*** thorst_ has quit IRC | 03:37 | |
*** thorst_ has joined #openstack-powervm | 04:35 | |
*** thorst_ has quit IRC | 04:42 | |
*** thorst_ has joined #openstack-powervm | 05:40 | |
*** thorst_ has quit IRC | 05:47 | |
*** thorst_ has joined #openstack-powervm | 06:46 | |
*** thorst_ has quit IRC | 06:53 | |
*** thorst_ has joined #openstack-powervm | 07:50 | |
*** thorst_ has quit IRC | 07:57 | |
*** k0da has joined #openstack-powervm | 08:28 | |
*** k0da has quit IRC | 08:38 | |
-openstackstatus- NOTICE: Gerrit is going to be restarted due to slowness and proxy errors | 08:46 | |
*** openstackgerrit has quit IRC | 08:48 | |
*** openstackgerrit has joined #openstack-powervm | 08:48 | |
*** thorst_ has joined #openstack-powervm | 08:55 | |
*** thorst_ has quit IRC | 09:02 | |
*** kotra03 has joined #openstack-powervm | 09:12 | |
*** thorst_ has joined #openstack-powervm | 10:00 | |
*** k0da has joined #openstack-powervm | 10:02 | |
*** thorst_ has quit IRC | 10:07 | |
*** thorst_ has joined #openstack-powervm | 11:05 | |
*** thorst_ has quit IRC | 11:12 | |
*** thorst_ has joined #openstack-powervm | 12:10 | |
*** thorst_ has quit IRC | 12:17 | |
*** thorst_ has joined #openstack-powervm | 12:50 | |
*** thorst_ has quit IRC | 12:50 | |
*** thorst_ has joined #openstack-powervm | 12:52 | |
*** kylek3h has quit IRC | 12:57 | |
*** tblakes has joined #openstack-powervm | 13:02 | |
*** edmondsw has joined #openstack-powervm | 13:17 | |
*** apearson has joined #openstack-powervm | 13:23 | |
*** jwcroppe has quit IRC | 13:30 | |
*** kylek3h has joined #openstack-powervm | 13:30 | |
*** kylek3h has quit IRC | 13:30 | |
*** jwcroppe has joined #openstack-powervm | 13:30 | |
*** kylek3h has joined #openstack-powervm | 13:30 | |
*** jwcroppe has quit IRC | 13:35 | |
*** mdrabe has joined #openstack-powervm | 13:46 | |
*** jwcroppe has joined #openstack-powervm | 13:55 | |
*** dwayne_ has quit IRC | 14:10 | |
*** kotra03 has quit IRC | 14:17 | |
*** seroyer has joined #openstack-powervm | 14:40 | |
openstackgerrit | Shyama proposed openstack/nova-powervm: SSP Volume Adapter https://review.openstack.org/372254 | 14:42 |
*** esberglu has joined #openstack-powervm | 14:49 | |
*** kriskend has joined #openstack-powervm | 14:50 | |
*** dwayne_ has joined #openstack-powervm | 14:50 | |
*** k0da has quit IRC | 15:25 | |
*** apearson has quit IRC | 15:30 | |
*** apearson has joined #openstack-powervm | 15:37 | |
*** apearson_ has joined #openstack-powervm | 15:40 | |
*** adi___ has quit IRC | 15:41 | |
*** tjakobs has joined #openstack-powervm | 15:41 | |
*** apearson has quit IRC | 15:42 | |
*** toan has quit IRC | 15:43 | |
*** toan has joined #openstack-powervm | 15:47 | |
*** adi___ has joined #openstack-powervm | 15:54 | |
*** apearson_ has quit IRC | 15:59 | |
*** apearson_ has joined #openstack-powervm | 16:02 | |
*** adi___ has quit IRC | 16:06 | |
*** adi___ has joined #openstack-powervm | 16:09 | |
thorst_ | efried: asked in openstack-nova for them to take another look at the powervm blueprint. I think they will this week. | 16:23 |
adreznec | thorst_: We can bring it up in the open floor part of the nova meeting as well | 16:32 |
thorst_ | yeah, we'll see. I was going to do that last meeting but that was like a 30 second window to get it in | 16:37 |
thorst_ | and I missed it | 16:37 |
thorst_ | :-) | 16:37 |
thorst_ | esberglu: if we got you a patch to test in the CI...could you apply it? | 16:56 |
thorst_ | it's in addition to the local2remote | 16:56
esberglu | Yeah | 16:56 |
thorst_ | try 4458 | 16:57 |
thorst_ | I guess efried will give you the go ahead before doing it | 16:57 |
esberglu | The LU thing is hitting the undercloud again | 17:02 |
*** smatzek has joined #openstack-powervm | 17:13 | |
efried | esberglu: Try 4458 now. | 17:55 |
efried | esberglu: want me to look at "The LU thing"? | 17:56 |
*** jwcroppe has quit IRC | 18:00 | |
*** jwcroppe has joined #openstack-powervm | 18:01 | |
*** jwcroppe_ has joined #openstack-powervm | 18:04 | |
*** jwcroppe has quit IRC | 18:05 | |
*** jwcroppe_ has quit IRC | 18:06 | |
*** jwcroppe has joined #openstack-powervm | 18:07 | |
*** jwcroppe has quit IRC | 18:11 | |
thorst_ | esberglu: seems like efried is close to having this ready | 18:26 |
*** jwcroppe has joined #openstack-powervm | 18:42 | |
*** jwcroppe has quit IRC | 18:46 | |
*** jwcroppe has joined #openstack-powervm | 18:46 | |
*** jwcroppe has quit IRC | 18:51 | |
esberglu | efried: Sorry for the slow reply, out at lunch. | 18:57
esberglu | So nodes are failing to delete again, but only from 2 of the 5 SSP groups | 18:57 |
esberglu | *2 of the 4 SSP groups | 18:58 |
adreznec | Interesting | 18:59 |
esberglu | Seeing this in the logs | 18:59 |
esberglu | http://paste.openstack.org/show/588307/ | 19:00 |
adreznec | thorst_: all the SSPs are from the same SAN, right? | 19:00 |
thorst_ | adreznec: yep | 19:00 |
thorst_ | but usually when we reinstall the VIOSes we have to remove the luns on the storage for the host | 19:00 |
efried | esberglu, that looks like yet another VIOS flake. | 19:00 |
adreznec | That's... weird | 19:00 |
adreznec | Yeah | 19:00 |
thorst_ | so it doesn't install to the wrong thing | 19:00 |
adreznec | Sure | 19:00 |
adreznec | But I don't think esberglu has reinstalled anything here | 19:00 |
adreznec | Just the VIOS upgrades recently, right? | 19:01 |
esberglu | Yeah | 19:01 |
efried | Yeah, exactly. | 19:01 |
efried | Do we need another REST-side fix from apearson_ to retry when VIOS effs up? | 19:01 |
efried | What's the point of a clustered file system if you can't operate on it in a distributed fashion? | 19:02 |
efried | ^^ Captain Obvious ^^ | 19:02 |
adreznec | efried: Asking the tough questions | 19:04 |
efried | adreznec thorst_ Should we open a VIOS defect and chase it around for a few weeks before they tell us they "just provide building blocks" and we have to synchronize ourselves? | 19:06 |
adreznec | Sigh | 19:11 |
adreznec | probably | 19:11 |
adreznec | That was the point of being on the latest VIOS | 19:11 |
adreznec | esberglu: is this recreatable pretty easily? | 19:11 |
esberglu | If I leave the CI on for long enough it seems to hit it. Not sure what steps it would take to recreate on purpose | 19:13 |
efried | esberglu, how long ago did the error occur? | 19:13 |
efried | I.e. if we snap the offending VIOS(es), would that give the VIOS team enough information to nail down exactly what operation caused the bounce? | 19:14
esberglu | That error happened yesterday afternoon at some point | 19:17 |
efried | boo, that's probably not recent enough. | 19:18 |
efried | esberglu, see mriedem's comment in #openstack-nova | 19:23 |
efried | It's going to blow up our skip list, potentially, cause I sure would like to keep the test names in comments. | 19:23 |
esberglu | Yep | 19:23 |
esberglu | efried: Okay the error just popped again on neo8 | 19:29 |
efried | same error? | 19:29 |
esberglu | Yeah | 19:29 |
efried | Quick, run a snap on the offending VIOS | 19:29 |
efried | And open a VIOS defect. | 19:29 |
esberglu | How do I do that? | 19:30 |
efried | Which? | 19:30 |
adreznec | Run snap? | 19:30 |
esberglu | Yeah | 19:30 |
seroyer | snap | 19:30 |
efried | hm, on another glance, this actually looks like it may be much lower down in the stack - an actual ODM lock timeout. That's trickier. | 19:31 |
adreznec | Yeah... | 19:32 |
adreznec | that's an AIX bug | 19:32 |
efried | esberglu: snap -a | 19:33 |
esberglu | -a option flag is not valid | 19:35 |
efried | Try it as root. | 19:37 |
efried | oem_setup_env | 19:37 |
seroyer | You can snap as padmin | 19:37 |
seroyer | Can and should | 19:37 |
esberglu | Hmm it isn't letting me specify the -a flag as padmin, but it does for root | 19:38 |
efried | okay, do what seroyer says then. | 19:38 |
efried | Either way. | 19:39 |
esberglu | seroyer: Should I just do it without the -a as padmin? Or with -a as root? | 19:39 |
seroyer | I don't know what the -a does as root, so I'm not sure how to answer that. snap -help should give you a list of supported options. | 19:40 |
efried | But if we're seeing an ODM lock problem, I kinda doubt we're going to convince VIOS to implement the retry on their end. Or convert it to a "VIOS busy" error. | 19:40
seroyer | But I've always just run it without any args. | 19:40 |
seroyer | efried, +1 | 19:40 |
efried | So it's likely we'll have to have apearson_ do the retry magic keyed off that error code. | 19:41 |
efried | Although we have to hope that VIOS at least has atomicity/transactionalism around whatever that operation is. | 19:41 |
efried | I.e. it's not doing part of an operation and then failing. | 19:41 |
efried | All of that needs to be ferreted out via the defect before we take action in REST. | 19:41 |
esberglu | Okay so what actions do I need to take for this? | 19:48 |
efried | The snap command should have printed the location of a tarball at the end. | 19:49 |
efried | Open a CQ defect against VIOS and attach that snap. | 19:50 |
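If this keeps recurring, the collection step could be scripted roughly as in the sketch below. It assumes snap is run as padmin with no flags (per seroyer's suggestion) and that the tarball path shows up in the output as a .pax.Z file; both assumptions should be checked against real VIOS output before relying on this.

    import re
    import subprocess

    def collect_vios_snap():
        """Run the VIOS 'snap' command and return the reported tarball path, if any.

        Assumes execution as padmin with no flags; the .pax.Z pattern is a guess
        based on typical AIX snap output and needs verifying on a real system.
        """
        result = subprocess.run(["snap"], capture_output=True, text=True, check=True)
        match = re.search(r"(/\S+\.pax\.Z)", result.stdout)
        return match.group(1) if match else None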
efried | adreznec seroyer thorst_ - I can never remember how all the routing info goes for these defects - y'all have that? | 19:50 |
thorst_ | efried: I do not... | 19:51 |
efried | esberglu, once done, give me the defect number and I'll put in my rant. | 19:51 |
efried | esberglu, this won't be the last VIOS defect you open, so keep notes on the process for future reference ;-) | 19:51 |
thorst_ | efried: yeah, let's ask hsien and apearson to do the magic retry thing | 19:51
thorst_ | we'll get a quick fix then | 19:52 |
thorst_ | (sorry - I read 80% of that thread...so hopefully that was enough for me to go off of) | 19:52 |
efried | thorst_, want to do that right away? I think we should at least make sure VIOS is cleaning up properly first. | 19:52 |
thorst_ | efried: I just want to do the quickest thing to unwedge us | 19:52 |
thorst_ | we've got to get CI unwedged ASAP... | 19:53 |
efried | This is intermittent. | 19:53 |
efried | esberglu, when it happens, does it bring the whole CI crashing down around us? | 19:53 |
thorst_ | even if it kills a test run...that's kinda bad. Leads to an inconsistent CI... | 19:55 |
thorst_ | I'm assuming we can't do the retry in nova-powervm. | 19:55 |
thorst_ | too high level? | 19:55 |
esberglu | Eventually yeah. And as the nodes hanging on delete start accumulating, it messes nodepool up | 19:56
esberglu | Right now the CI is still "running" though | 19:57 |
efried | thorst_, we could start scraping the error message of every 500 we receive for a set of known error codes (in this case 0514-516) and spoof those as retries. That would be a pypowervm thing. But that doesn't seem like The Right Thing. | 20:04 |
thorst_ | efried: agree.. let's get this to hsien and apearson to fast path in. | 20:07
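As a rough client-side illustration of the scraping efried describes, a retry could be keyed off known error-code fragments in the body of a 500 response. The helper names and the response object shape below are invented for the sketch; this is not pypowervm's actual API, and the preferred fix remains the REST-side retry.

    import time

    # Error-code fragments treated as transient VIOS hiccups. 0514-516 is the
    # ODM lock timeout seen in the CI logs; the tuple is illustrative only.
    RETRYABLE_ERROR_CODES = ("0514-516",)

    def _is_retryable_500(status_code, body_text):
        """True when a 500 response body mentions a known transient error code."""
        return status_code == 500 and any(
            code in body_text for code in RETRYABLE_ERROR_CODES)

    def request_with_retries(send_request, max_attempts=3, delay_s=5):
        """Call send_request() and retry on responses that look like VIOS flakes.

        send_request is any callable returning an object with .status_code and
        .text attributes (a hypothetical stand-in for the real HTTP layer).
        """
        response = send_request()
        for _ in range(max_attempts - 1):
            if not _is_retryable_500(response.status_code, response.text):
                break
            time.sleep(delay_s)
            response = send_request()
        return response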
seroyer | adreznec, I have the defect routing info. I can pm to the one who needs it. | 20:12 |
adreznec | efried: esberglu ^^ | 20:12 |
esberglu | seroyer: I need it | 20:13 |
efried | thanks seroyer | 20:13 |
*** jwcroppe has joined #openstack-powervm | 20:27 | |
efried | adreznec thorst_, turns out these VIOSes were running 1614_61e. According to seroyer, that's actually the latest GAed version - but it's still pretty old - like from 1Q16. Do we want to be on a recent 61k? Apparently there's a GOLD level at 1640E_61k. | 20:31 |
adreznec | Blah | 20:34 |
adreznec | Probably | 20:34 |
adreznec | The VIOS team will tell us to recreate on that anyway I'm sure | 20:34 |
seroyer | No, what this was hit on was the latest GA level. They have to support it. | 20:37 |
thorst_ | ahhh | 21:02 |
thorst_ | nice pt seroyer | 21:02 |
*** smatzek has quit IRC | 21:23 | |
thorst_ | esberglu: did we get a test today with that new patch? | 21:36 |
esberglu | I was waiting for efried to give me the okay | 21:37 |
*** tblakes has quit IRC | 21:37 | |
efried | (11:55:50 AM) Eric F.: esberglu: Try 4458 now. | 21:37 |
*** tblakes has joined #openstack-powervm | 21:38 | |
*** kriskend_ has joined #openstack-powervm | 22:02 | |
*** kriskend has quit IRC | 22:04 | |
*** thorst_ has quit IRC | 22:10 | |
esberglu | thorst_: efried: I just redeployed the CI management node and it should have both pypowervm patches applied. | 22:10 |
esberglu | So runs should be going through at some point this evening with that | 22:11 |
*** edmondsw has quit IRC | 22:15 | |
*** tblakes has quit IRC | 22:28 | |
efried | esberglu: both? | 22:37 |
esberglu | There was 1 that got applied already. So just the 1 new one | 22:37 |
esberglu | That local2remote one that qinq wu is working on changing and then this one | 22:38 |
efried | okay. | 22:40 |
*** mdrabe has quit IRC | 22:42 | |
*** kylek3h has quit IRC | 22:51 | |
*** kriskend_ has quit IRC | 23:04 | |
*** tjakobs has quit IRC | 23:05 | |
*** esberglu has quit IRC | 23:12 | |
*** esberglu has joined #openstack-powervm | 23:13 | |
*** esberglu has quit IRC | 23:18 | |
*** esberglu has joined #openstack-powervm | 23:26 | |
*** apearson_ has quit IRC | 23:29 | |
*** esberglu has quit IRC | 23:30 | |
*** seroyer has quit IRC | 23:40 | |
*** thorst_ has joined #openstack-powervm | 23:47 | |
*** esberglu has joined #openstack-powervm | 23:49 |