*** thorst_ has joined #openstack-powervm | 00:19 | |
*** thorst_ has quit IRC | 01:19 | |
*** seroyer has joined #openstack-powervm | 01:20 | |
*** thorst_ has joined #openstack-powervm | 01:20 | |
*** thorst_ has quit IRC | 01:29 | |
*** edmondsw has quit IRC | 01:40 | |
*** seroyer has quit IRC | 02:01 | |
*** thorst_ has joined #openstack-powervm | 02:22 | |
*** thorst_ has quit IRC | 02:23 | |
*** toan has joined #openstack-powervm | 02:23 | |
*** thorst_ has joined #openstack-powervm | 02:24 | |
*** thorst_ has quit IRC | 02:32 | |
*** tsjakobs has joined #openstack-powervm | 02:33 | |
*** tsjakobs has quit IRC | 02:38 | |
*** thorst_ has joined #openstack-powervm | 03:30 | |
*** thorst_ has quit IRC | 03:37 | |
*** kotra03 has joined #openstack-powervm | 03:58 | |
*** kotra03 has quit IRC | 04:01 | |
*** thorst_ has joined #openstack-powervm | 04:34 | |
*** kotra03 has joined #openstack-powervm | 04:35 | |
*** thorst_ has quit IRC | 04:42 | |
*** kotra03 has quit IRC | 04:51 | |
*** kotra03 has joined #openstack-powervm | 05:18 | |
*** Cartoon has joined #openstack-powervm | 05:22 | |
*** thorst_ has joined #openstack-powervm | 05:40 | |
*** thorst_ has quit IRC | 05:47 | |
*** thorst_ has joined #openstack-powervm | 06:45 | |
*** thorst_ has quit IRC | 06:52 | |
*** kotra03 has quit IRC | 07:22 | |
*** thorst_ has joined #openstack-powervm | 07:50 | |
*** thorst_ has quit IRC | 07:56 | |
*** k0da has joined #openstack-powervm | 08:13 | |
*** kotra03 has joined #openstack-powervm | 08:35 | |
*** thorst_ has joined #openstack-powervm | 08:55 | |
*** thorst_ has quit IRC | 09:02 | |
*** Cartoon_ has joined #openstack-powervm | 10:12 | |
*** Cartoon has quit IRC | 10:15 | |
*** smatzek has joined #openstack-powervm | 10:26 | |
*** thorst_ has joined #openstack-powervm | 11:04 | |
*** Cartoon_ has quit IRC | 11:06 | |
openstackgerrit | Merged openstack/nova-powervm: Wrap console failure message https://review.openstack.org/356149 | 11:11 |
*** svenkat has joined #openstack-powervm | 11:52 | |
*** edmondsw has joined #openstack-powervm | 12:15 | |
*** mdrabe has joined #openstack-powervm | 12:30 | |
openstackgerrit | Drew Thorstensen (thorst) proposed openstack/networking-powervm: Enforce limit of VLAN clean ups in each pass https://review.openstack.org/357167 | 12:32 |
openstackgerrit | Drew Thorstensen (thorst) proposed openstack/networking-powervm: Enforce limit of VLAN clean ups in each pass https://review.openstack.org/357167 | 12:39 |
*** burgerk has joined #openstack-powervm | 12:43 | |
*** apearson has joined #openstack-powervm | 12:59 | |
*** mdrabe has quit IRC | 13:21 | |
*** mdrabe has joined #openstack-powervm | 13:33 | |
*** esberglu has joined #openstack-powervm | 13:39 | |
*** seroyer has joined #openstack-powervm | 13:42 | |
*** tblakes has joined #openstack-powervm | 13:42 | |
*** seroyer has quit IRC | 13:49 | |
*** burgerk has quit IRC | 14:02 | |
thorst_ | efried: when your SR-IOV change goes in, this can start to be a reality: https://review.openstack.org/#/c/357239/ | 14:05 |
thorst_ | esberglu: 32 concurrent jobs...nice! | 14:06 |
adreznec | dat silent pipeline | 14:08 |
esberglu | thorst: Regarding the heal_and_optimize_interval, we have it increased for the compute nodes but not the aio nodes. What do you think would be reasonable for that? Default is 30 min. | 14:08 |
adreznec | esberglu: thorst_ Once we get things stabilized with the new volume and things are passing regularly, how are we going to handle notifications for the nova/neutron pipeline failures? | 14:08 |
adreznec | Turn emails back on? | 14:09 |
adreznec | Right now we're running the jobs but we'd have to look manually to see failures | 14:09 |
thorst_ | esberglu: if that patch set works properly...I'm thinking we leave the heal and optimize intervals at the normal rate | 14:09 |
thorst_ | basically should be able to leave it as is. | 14:09 |
thorst_ | adreznec: nova/neutron pipeline failures...yeah, should e-mail to us | 14:10 |
thorst_ | though some are taking 3 hours?! | 14:10 |
adreznec | Yeah, I was just noticing that... | 14:10 |
thorst_ | :-( | 14:10 |
esberglu | Turn the compute nodes back down to default or leave them up? | 14:10 |
adreznec | It's still running tempest | 14:10 |
adreznec | So...yeah | 14:10 |
thorst_ | esberglu: no changes on your side basically | 14:10 |
thorst_ | just leave everything as is | 14:11 |
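The decision above is to leave the heal/optimize loop at its default rate everywhere. As a sketch, the agent config stanza would look like the fragment below; the group name and the 1800-second value are assumptions inferred from the "default is 30 min" remark, not copied from a real deployment:

```ini
# Illustrative networking-powervm agent config fragment (assumed layout):
# leave the heal/optimize loop at its 30-minute default.
[AGENT]
heal_and_optimize_interval = 1800
```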
thorst_ | adreznec: we have hit a queue limit | 14:11 |
thorst_ | so maybe these jobs are just taking a while cause they were queued? | 14:11 |
adreznec | max jobs? | 14:11 |
adreznec | Hmm | 14:11 |
adreznec | That's easy enough to determine | 14:12 |
thorst_ | yeah, the 3 hr 12 minute one has actually only been active for 2 hours 20 min | 14:12 |
adreznec | Yep | 14:12 |
thorst_ | still high...not ridiculous | 14:12 |
adreznec | 2h23m | 14:12 |
adreznec | We'll have to watch it | 14:12 |
thorst_ | yep | 14:12 |
esberglu | thorst: Part of those longer times is the VIF failures. Each time it happened it timed out at 5 min waiting. Hit a handful of those and it can really increase the time | 14:12 |
thorst_ | 30 as a parallel number is also too small | 14:12 |
thorst_ | esberglu: true...true | 14:12 |
adreznec | thorst_: 30 parallel jobs? | 14:12 |
thorst_ | I have a patch for that! | 14:12 |
thorst_ | adreznec: we should bump that to 3x the system count...maybe 4x | 14:13 |
esberglu | It shouldn’t be maxing out at 30 it should be 50 | 14:13 |
thorst_ | o, its set to 50? | 14:13 |
adreznec | Yeah | 14:13 |
adreznec | I thought that's what we were trying for | 14:13 |
esberglu | Might just be spawning more nodes right now | 14:13 |
esberglu | If volume went up really quick | 14:13 |
adreznec | Which it probably did | 14:14 |
adreznec | Because U.S. morning | 14:14 |
thorst_ | yeah, we have no ready nodes | 14:14 |
adreznec | Fire off patches, rechecks, etc and grab coffee | 14:14 |
thorst_ | well, seems we need more ready nodes. | 14:14 |
thorst_ | :-D | 14:14 |
adreznec | How many are we at now? | 14:14 |
esberglu | Ughh. There are a bunch of nodes stuck deleting again. Which contribute to the 50. | 14:14 |
thorst_ | 1.8 | 14:14 |
adreznec | We have 1.8 ready nodes? | 14:15 |
adreznec | No wonder we can't keep up | 14:15 |
thorst_ | I'm kidding. | 14:15 |
adreznec | :P | 14:15 |
thorst_ | delete issues... | 14:15 |
thorst_ | that's a bad. | 14:15 |
adreznec | Bleh | 14:15 |
adreznec | Yeah | 14:15 |
thorst_ | esberglu: which server is having trouble deleting a devstack node...and what's the name of the instance? | 14:16 |
*** tsjakobs has joined #openstack-powervm | 14:19 | |
esberglu | neo19: PowerVM_CI-PowerVM_DevStacked-6868, PowerVM_CI-PowerVM_DevStacked-6862, PowerVM_CI-PowerVM_DevStacked-6822 | 14:20 |
esberglu | thorst: Looks like it is only happening on a subset of the servers, but most of them have multiple cases. Neos 19, 21, 24, 25, 27, 28, and 30 all have at least 1 stuck | 14:22 |
efried | thorst_, confirmed vlan ID does indeed come through. | 14:24 |
efried | binding:profile does not - haven't figured that out yet. | 14:24 |
thorst_ | efried: sweet...maybe we push yours through initially | 14:25 |
thorst_ | so we can do that fancy SEA removal thing | 14:25 |
thorst_ | esberglu: got time to look now...going in... | 14:25 |
esberglu | Cool. Looks like it started almost 48 hours ago, but I didn’t notice because we weren’t at high volume. And just random nodes failing intermittently since | 14:27 |
thorst_ | yeah...so the logs are...neat | 14:29 |
thorst_ | -6868 was created today | 14:30 |
thorst_ | instance uuid: c9bf97c3-db3b-47fc-8275-634be9d06abb | 14:30 |
openstackgerrit | Eric Fried proposed openstack/nova-powervm: VIF driver implementation for SR-IOV https://review.openstack.org/343419 | 14:32 |
thorst_ | esberglu: do you know if these instances actually built? | 14:33 |
thorst_ | or did they attempt to build, kinda get hung, and then we are now trying to delete them? | 14:33 |
esberglu | The latter. They are still in build state. But running the deleting task | 14:34 |
thorst_ | yeah | 14:34 |
thorst_ | this looks like efried territory to be honest | 14:34 |
thorst_ | they're hung on the SSP upload. | 14:34 |
efried | Let me get this commit in place right quick and I can take a look. | 14:36 |
thorst_ | efried: thx | 14:37 |
openstackgerrit | Eric Fried proposed openstack/networking-powervm: WIP: Mechanism driver & agent for powervm SR-IOV https://review.openstack.org/343423 | 14:38 |
thorst_ | efried: is that still WIP? | 14:39 |
thorst_ | I think that bit could potentially go in... | 14:39 |
efried | thorst_, need to finish UT. | 14:41 |
thorst_ | efried: ah | 14:41 |
*** burgerk has joined #openstack-powervm | 14:46 | |
efried | thorst_, esberglu: So what is it that needs to be looked at SSP-wise? | 14:46 |
thorst_ | so go to neo19, just scp down /home/neo/n-cpu.log | 14:47 |
thorst_ | that's a snapshot | 14:47 |
thorst_ | filter on the instance-uuid: c9bf97c3-db3b-47fc-8275-634be9d06abb | 14:47 |
thorst_ | you can see it gets stuck in the SSP 'crt-disk' step | 14:47 |
thorst_ | here's what I suspect happened...we had a lock issue in the SSP upload | 14:48 |
thorst_ | and said lock has been hanging around all day | 14:48 |
efried | thorst_, esberglu: sorry, been multitasking, which I suck at. We should have REST investigate this: | 15:22 |
efried | 2016-08-18 03:24:49.638 41657 WARNING pypowervm.tasks.storage [req-94fbdad2-82e6-497c-8217-3545375dfd3a admin admin] HTTP error 500 for method DELETE on path /rest/api/uom/Tier/b0f7f3d0-41a8-336e-beee-4c1ceec29ae6/LogicalUnit/fe7646fe-dd69-33bb-b934-2b8be9f25b2e: Internal Server Error -- java.lang.NullPointerException | 15:22 |
efried | thorst_, esberglu: The upshot is that we've been bouncing off of this guy since 1:29 system time: | 15:27 |
efried | part65fd0cbbimage_template_PowerVM_Ubuntu_Base_1471493040_ab1d7cee5378f003d2749 | 15:27 |
efried | that's a "marker LU" indicating an in-progress (probably hung/failed) upload from another process. | 15:27 |
thorst_ | I've got to step out...someone able to follow up with apearson? | 15:27 |
efried | Did we ungracefully kill a compute process at some point? | 15:27 |
efried | At this point, if you want to get things moving, you can remove that LU. Not sure how assiduously we want to pursue the root cause of this particular incident. | 15:28 |
thorst_ | efried: I've actually seen this a few times | 15:29 |
thorst_ | so I think we do want to root it out. | 15:29 |
efried | recently? | 15:29 |
esberglu | There are 16 nodes hanging on delete right now | 15:29 |
thorst_ | in the past month or so...yeah | 15:29 |
efried | hanging on delete? That's weird - and likely unrelated. | 15:29 |
thorst_ | no no, the marker LU had a hiccup | 15:29 |
thorst_ | and things went ... bad | 15:29 |
thorst_ | ok...gotta run, back in an hour | 15:29 |
efried | Should still be able to delete the node. Which should interrupt the build process. | 15:30 |
efried | So esberglu, example of a node that's hung on delete? | 15:30 |
efried | vm, not node. | 15:30 |
esberglu | Well they are stuck in the deleting task but in build status. | 15:32 |
*** k0da has quit IRC | 15:32 | |
efried | Hm, I would expect delete to interrupt the build. Let's wait for thorst_ to get back and we can pursue that. | 15:33 |
efried | Meanwhile, I guess let's try to figure out why this upload failed, and what we might have done about it. | 15:34 |
efried | esberglu: It is clear to me at this point that the marker LU in question was created from a different node. What other nodes are sharing this SSP? | 15:50 |
esberglu | 19, 21, 24, 25. All of which have hanging vms | 15:51 |
efried | Offending LU was created on 25. Tracking it down... | 16:04 |
efried | esberglu, found the culprit | 16:09 |
efried | 2016-08-18 01:34:19.786 8769 ERROR nova.compute.manager [instance: 77770e51-5d82-4a60-9e20-56fefcbc54a9] HttpError: HTTP error 500 for method DELETE on path /rest/api/uom/Tier/b0f7f3d0-41a8-336e-beee-4c1ceec29ae6/LogicalUnit/5a8b2d26-8dca-3881-8ef9-acd8787924ca: Internal Server Error -- java.lang.NullPointerException | 16:09 |
efried | Attempting to delete the marker LU. | 16:09 |
efried | If we can't delete the marker LU, everything else trying to use the same image will hang. | 16:09 |
efried | We need REST to investigate the above. | 16:09 |
esberglu | Thanks for the assist | 16:10 |
thorst_ | efried esberglu: got it sorted? | 16:24 |
efried | See #novalink | 16:24 |
efried | @changh is working it. | 16:24 |
efried | Need to clear the env now and get things moving. Stand by... | 16:25 |
esberglu | thorst: Your networking_powervm change passed CI btw. With that change and fixing this issue we might be stable! | 16:26 |
thorst_ | ooo, I want to look at those logs | 16:26 |
thorst_ | will dig in after a bit | 16:26 |
efried | thorst_, esberglu: deleted the errant marker LU. Things should proceed now, one hopes. | 16:27 |
efried | thorst_, question is: should we try to recover from this somehow? | 16:29 |
efried | At the very least, I would expect an instance delete to interrupt the spawns that are wedged on that image upload. Why doesn't that happen? | 16:30 |
efried | Do we need an interrupt handler in that loop? Does the loop maybe need to check for instance state DELETING whenever it wakes up, and bail? | 16:31 |
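The interrupt efried is asking about — a wait loop that notices the instance has entered a DELETING state and bails instead of blocking on a wedged upload — can be sketched as below. The function and callable names here are illustrative stand-ins, not the nova-powervm API:

```python
import time

class InstanceDeleted(Exception):
    """Raised to bail out of a wedged spawn once a delete is requested."""

def wait_for_marker_lu(marker_gone, is_deleting, poll=0.01, timeout=1.0):
    # Poll until the in-progress-upload marker clears; bail early if the
    # instance has been marked for deletion, so a delete request can
    # interrupt a spawn that is stuck waiting on another node's upload.
    deadline = time.monotonic() + timeout
    while not marker_gone():
        if is_deleting():
            raise InstanceDeleted("delete requested; abandoning upload wait")
        if time.monotonic() > deadline:
            raise TimeoutError("marker LU never cleared")
        time.sleep(poll)

# Simulate the wedged case: the marker never clears, then a delete arrives.
polls = {"n": 0}
def is_deleting():
    polls["n"] += 1
    return polls["n"] >= 3   # delete request lands on the third check

try:
    wait_for_marker_lu(lambda: False, is_deleting)
    result = "completed"
except InstanceDeleted:
    result = "interrupted"
print(result)
```

Without the `is_deleting()` check, the loop only exits via the timeout, which matches the hung-in-BUILD-while-deleting symptom described above.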
esberglu | efried: This is also happening on the ssp with 27, 28, 30. Want to delete that marker too? | 16:38 |
efried | esberglu, or I could teach you to fish. | 16:38 |
esberglu | Sure | 16:39 |
efried | Have you found the log entry for the offending marker LU? | 16:39 |
efried | You're looking for something like: | 16:40 |
efried | Waiting for in-progress upload(s) to complete. Marker LU(s): ['part65fd0cbbimage_template_PowerVM_Ubuntu_Base_1471493040_ab1d7cee5378f003d2749'] | 16:40 |
esberglu | Yep. Looking at it now | 16:41 |
efried | That'd have the same req ID as the rest of the entries for the spawn you're trying to unblock. | 16:42 |
efried | So to get it moving, you delete the marker LU via e.g.: | 16:42 |
efried | pvmctl lu delete -i name=part65fd0cbbimage_template_PowerVM_Ubuntu_Base_1471493040_ab1d7cee5378f003d2749 | 16:42 |
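The fishing lesson above (find the "Waiting for in-progress upload(s)" log entry, pull out the marker LU name, feed it to `pvmctl lu delete`) can be sketched in a few lines. The log-line format is copied from the quote above; the shortened LU name and the parsing are illustrative:

```python
import re

# A n-cpu.log entry like the one quoted above (LU name shortened here).
line = ("Waiting for in-progress upload(s) to complete. "
        "Marker LU(s): ['part65fd0cbbimage_template_example']")

# Extract the marker LU name from between the quotes in the bracketed list.
match = re.search(r"Marker LU\(s\): \['([^']+)'\]", line)
lu_name = match.group(1)

# The unblocking step from the discussion, to be run on the affected neo host.
cmd = "pvmctl lu delete -i name=%s" % lu_name
print(cmd)
```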
thorst_ | esberglu: my change may have functionally worked but has a bug in the logging... | 16:45 |
thorst_ | will get another patch up | 16:45 |
esberglu | efried: Cool thanks | 16:48 |
openstackgerrit | Drew Thorstensen (thorst) proposed openstack/networking-powervm: Enforce limit of VLAN clean ups in each pass https://review.openstack.org/357167 | 16:59 |
adreznec | FYI thorst_ efried - Proposed Ocata schedule is up at https://review.openstack.org/#/c/357214/, release date ~Feb 20th | 17:00 |
thorst_ | whoa... | 17:01 |
adreznec | Shorter cycle | 17:01 |
adreznec | To align with the new schedule discussed at the last summit | 17:01 |
thorst_ | yeah, crazy | 17:01 |
adreznec | Take back to whoever internal as needed | 17:01 |
thorst_ | right right... | 17:01 |
adreznec | We'll have to decide development timeline | 17:01 |
thorst_ | well, certainly Ocata will have less content :-) | 17:01 |
adreznec | Yeah | 17:02 |
adreznec | I mean no, thorst_ | 17:02 |
adreznec | Same content, just faster! | 17:02 |
thorst_ | but I'm already working as fast as I can! | 17:02 |
* adreznec cracks whip | 17:02 | |
thorst_ | kinda wish we had 4 releases a year. | 17:04 |
thorst_ | lol | 17:04 |
*** k0da has joined #openstack-powervm | 17:05 | |
thorst_ | esberglu: so...where we at? Unwedged? | 17:10 |
esberglu | Yep. Nodes are spawning right now. I’m pretty sure we will hit capacity | 17:11 |
efried | thorst_, in light of https://review.openstack.org/#/c/357239 -- is the code around https://review.openstack.org/#/c/343423/18/networking_powervm/plugins/ibm/agent/powervm/sriov_agent.py@103 needed?? | 17:11 |
thorst_ | esberglu: hit capacity as in...hit our limit of number of nodes we can run at once? | 17:11 |
thorst_ | or just plain run out of capacity | 17:11 |
esberglu | Hit our limit of 50 | 17:12 |
thorst_ | esberglu: phew | 17:12 |
thorst_ | cool | 17:12 |
esberglu | thorst: What’s the status on those new systems? | 17:12 |
thorst_ | efried: yeah, totally. Notice that in 357239 the device up is in the sea_agent. It does it after it makes sure the SEA has the VLAN. In the SR-IOV agent you do the device up when the request comes in | 17:12 |
thorst_ | esberglu: that's on my todo next list. :-) | 17:13 |
thorst_ | just finishing easier stuff first. | 17:13 |
efried | thorst_, I'll pretend I understood that, and leave the code alone? | 17:15 |
thorst_ | efried: sounds good to me | 17:15 |
thorst_ | also, my code can't go in until some nova changes go in and 343423 | 17:15 |
*** esberglu has quit IRC | 17:17 | |
efried | thorst_, working up UT for the networking-powervm side now. | 17:21 |
efried | Lots of work pending there. | 17:21 |
thorst_ | efried: awesome...we'll be able to make things much simpler then | 17:22 |
*** esberglu has joined #openstack-powervm | 17:31 | |
efried | thorst_, gah!, I can never remember how you're supposed to override conf options in a unit test. Remind me? | 17:31 |
thorst_ | efried: I always forget too | 17:32 |
thorst_ | I think its the 'flags' thing in nova | 17:32 |
thorst_ | not sure about neutron. | 17:32 |
efried | ah, flags sounds familiar. But I'm in neutron, so... | 17:33 |
thorst_ | may still be flags... | 17:33 |
efried | cfg.CONF.set_override(...) | 18:01 |
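The answer efried lands on, `cfg.CONF.set_override(...)`, is oslo.config's test hook (with `clear_override` as the teardown counterpart). The toy `Conf` class below is a stdlib stand-in illustrating the shadow-and-restore pattern it implements, not the oslo.config API itself:

```python
class Conf:
    """Toy stand-in for oslo.config's CONF object (illustrative only):
    overrides shadow the configured default until cleared."""

    def __init__(self, **defaults):
        self._defaults = dict(defaults)
        self._overrides = {}

    def get(self, name):
        # An override, if present, wins over the configured default.
        if name in self._overrides:
            return self._overrides[name]
        return self._defaults[name]

    def set_override(self, name, value):
        self._overrides[name] = value

    def clear_override(self, name):
        self._overrides.pop(name, None)

# Option name is hypothetical; the pattern is what a unit test does.
CONF = Conf(vlan_cleanup_limit=10)
CONF.set_override("vlan_cleanup_limit", 2)    # test setUp
assert CONF.get("vlan_cleanup_limit") == 2    # code under test sees 2
CONF.clear_override("vlan_cleanup_limit")     # test tearDown
print(CONF.get("vlan_cleanup_limit"))
```

In a real neutron/nova test the same shape appears as `cfg.CONF.set_override('opt_name', value, group='GROUP')` paired with cleanup on teardown, so one test's config doesn't leak into the next.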
efried | thorst_, should the sriov_agent be using the CNAEventHandler? | 18:01 |
thorst_ | efried: I'd say no to start... | 18:02 |
thorst_ | 1) its not CNA's....so it would be a different event type | 18:02 |
thorst_ | 2) I'm not sure you care...you don't have to do anything | 18:02 |
thorst_ | 3) with the latest stuff...I'm not sure I care anymore in the sea one... | 18:02 |
*** k0da has quit IRC | 18:04 | |
efried | thorst_, 3) as in, gonna rip it out of the SEA agent too? | 18:04 |
thorst_ | efried: Not yet...I need to think it through. | 18:04 |
thorst_ | I'm not convinced I need it | 18:04 |
thorst_ | maybe...not 100% sure. | 18:04 |
thorst_ | the maybe is for live migration. | 18:05 |
efried | thorst_, okay - right now the setup for that guy is in agent_base. So I'm going to need to move it to sea_agent. Does the order matter? | 18:05 |
thorst_ | efried: I don't think so | 18:05 |
efried | Can I set it up after rpc_setup? | 18:05 |
efried | okay. | 18:05 |
thorst_ | adreznec: is this one still needed? https://review.openstack.org/#/c/345986/ | 18:08 |
adreznec | Yes | 18:10 |
adreznec | But we can't merge it until the dep goes it | 18:10 |
adreznec | *in | 18:10 |
adreznec | It can't pass jenkins before that... | 18:10 |
thorst_ | right right...just thought we only had one to go. But then I remembered Ashana's | 18:20 |
adreznec | Yeah | 18:24 |
adreznec | I forgot about that one | 18:24 |
*** k0da has joined #openstack-powervm | 18:25 | |
*** kotra03 has quit IRC | 18:31 | |
*** catintheroof has joined #openstack-powervm | 18:49 | |
thorst_ | esberglu: this hardware stuff is so time consuming... | 19:10 |
*** apearson_ has joined #openstack-powervm | 19:11 | |
efried | thorst_ mdrabe: 3764 ready for y'all. | 19:13 |
mdrabe | ight | 19:13 |
*** apearson has quit IRC | 19:14 | |
esberglu | thorst_: What do you have to do for it now? | 19:16 |
thorst_ | I have to rewire 4 ethernet switches, 4 san switches, etc... before I can even start talking about wiring the servers | 19:17 |
thorst_ | taking apart an old cloud | 19:17 |
thorst_ | its really unraveled into something amazing. | 19:17 |
*** tsjakobs has quit IRC | 19:28 | |
*** tjakobs has joined #openstack-powervm | 19:30 | |
*** apearson__ has joined #openstack-powervm | 19:54 | |
*** apearson_ has quit IRC | 19:58 | |
*** thorst_ has quit IRC | 21:16 | |
*** smatzek has quit IRC | 21:18 | |
*** catintheroof has quit IRC | 21:21 | |
*** svenkat has quit IRC | 21:25 | |
*** edmondsw has quit IRC | 21:25 | |
*** tblakes has quit IRC | 21:38 | |
*** thorst_ has joined #openstack-powervm | 22:06 | |
*** burgerk has quit IRC | 22:10 | |
*** thorst_ has quit IRC | 22:10 | |
*** thorst_ has joined #openstack-powervm | 22:14 | |
*** apearson__ has quit IRC | 22:14 | |
*** tjakobs has quit IRC | 22:15 | |
*** esberglu has quit IRC | 22:16 | |
*** mdrabe has quit IRC | 22:19 | |
*** thorst_ has quit IRC | 22:31 | |
*** thorst_ has joined #openstack-powervm | 22:32 | |
*** smatzek has joined #openstack-powervm | 22:38 | |
*** thorst_ has quit IRC | 22:40 | |
*** smatzek has quit IRC | 22:44 | |
*** svenkat has joined #openstack-powervm | 22:50 | |
*** k0da has quit IRC | 22:57 | |
*** svenkat has quit IRC | 23:01 | |
*** thorst_ has joined #openstack-powervm | 23:31 | |
openstackgerrit | Merged openstack/networking-powervm: Enforce limit of VLAN clean ups in each pass https://review.openstack.org/357167 | 23:40 |
*** thorst_ has quit IRC | 23:46 | |
*** thorst_ has joined #openstack-powervm | 23:47 | |
*** thorst_ has quit IRC | 23:55 |
Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!