*** thorst_afk has joined #openstack-powervm | 01:06 | |
*** thorst_afk has quit IRC | 01:11 | |
*** thorst_afk has joined #openstack-powervm | 01:11 | |
*** thorst_afk has quit IRC | 01:20 | |
*** edmondsw has quit IRC | 01:36 | |
*** svenkat has quit IRC | 01:45 | |
*** thorst_afk has joined #openstack-powervm | 01:56 | |
*** YuYangWang has joined #openstack-powervm | 02:06 | |
*** thorst_afk has quit IRC | 02:32 | |
*** thorst_afk has joined #openstack-powervm | 02:33 | |
*** thorst_afk has quit IRC | 02:37 | |
*** edmondsw has joined #openstack-powervm | 02:46 | |
*** edmondsw has quit IRC | 02:51 | |
*** thorst_afk has joined #openstack-powervm | 03:43 | |
*** thorst_afk has quit IRC | 04:02 | |
*** chhavi has joined #openstack-powervm | 04:03 | |
*** edmondsw has joined #openstack-powervm | 04:35 | |
*** edmondsw has quit IRC | 04:39 | |
*** thorst_afk has joined #openstack-powervm | 05:59 | |
*** thorst_afk has quit IRC | 06:04 | |
*** YuYangWang has quit IRC | 06:10 | |
*** edmondsw has joined #openstack-powervm | 06:23 | |
*** edmondsw has quit IRC | 06:28 | |
*** thorst_afk has joined #openstack-powervm | 07:01 | |
*** thorst_afk has quit IRC | 07:05 | |
*** k0da has joined #openstack-powervm | 07:13 | |
*** thorst_afk has joined #openstack-powervm | 08:01 | |
*** edmondsw has joined #openstack-powervm | 08:11 | |
*** edmondsw has quit IRC | 08:16 | |
*** thorst_afk has quit IRC | 08:21 | |
*** thorst_afk has joined #openstack-powervm | 09:18 | |
*** thorst_afk has quit IRC | 09:22 | |
*** edmondsw has joined #openstack-powervm | 09:59 | |
*** chhavi has quit IRC | 10:01 | |
*** chhavi has joined #openstack-powervm | 10:01 | |
*** edmondsw has quit IRC | 10:04 | |
*** smatzek has joined #openstack-powervm | 10:39 | |
*** smatzek has quit IRC | 10:43 | |
*** chhavi has quit IRC | 11:07 | |
*** smatzek has joined #openstack-powervm | 11:20 | |
*** edmondsw has joined #openstack-powervm | 11:27 | |
*** svenkat has joined #openstack-powervm | 11:40 | |
*** thorst_afk has joined #openstack-powervm | 11:49 | |
*** jpasqualetto has joined #openstack-powervm | 12:11 | |
*** jpasqualetto has quit IRC | 12:46 | |
*** chhavi has joined #openstack-powervm | 13:02 | |
*** mdrabe has joined #openstack-powervm | 13:13 | |
*** jpasqualetto has joined #openstack-powervm | 13:26 | |
edmondsw | efried please take a look at https://review.openstack.org/#/c/471773/ and let me know if that's what you were thinking | 13:27 |
edmondsw | efried then we need to talk about how to test it, and I need to write UTs | 13:27 |
efried | edmondsw Will do. | 13:27 |
*** smatzek has quit IRC | 13:40 | |
*** jwcroppe has joined #openstack-powervm | 13:45 | |
*** smatzek has joined #openstack-powervm | 14:03 | |
esberglu | efried: edmondsw: thorst_afk: Looking into the timeout errors. So far I've seen it on all of the systems of 1 of the SSP groups. And haven't seen it on any other systems | 14:10 |
esberglu | So it could be some sort of issue with the SSP | 14:10 |
edmondsw | esberglu interesting | 14:11 |
edmondsw | we all know SSP never has issues... ;) | 14:11 |
esberglu | I'm looking through some more failures to confirm the above | 14:11 |
edmondsw | esberglu one minor reword on 5406, but looking good overall, tx for the changes | 14:14 |
esberglu | edmondsw: And you're cool with the "none" thing? | 14:15 |
edmondsw | yeah, it's not ideal but it's not a deal breaker. Bigger fish to fry | 14:16 |
esberglu | edmondsw: Yeah I'm confident it's an SSP issue. 25+ timeouts among neo19, neo21, neo24, neo25 | 14:19 |
esberglu | vs. 1 on neo39 (which could be a different issue) and no other systems | 14:19 |
esberglu | And those 4 are in the same cluster | 14:20 |
edmondsw | esberglu let's look at that one issue on neo39... link? | 14:27 |
*** jwcroppe has quit IRC | 14:28 | |
esberglu | edmondsw: http://184.172.12.213/37/459737/6/check/nova-out-of-tree-pvm/de5b4e8/ | 14:28 |
edmondsw | esberglu that looks like the same issue to me | 14:29 |
esberglu | edmondsw: I wasn't saying it couldn't be the same issue. Just that other things could cause a build to timeout | 14:31 |
edmondsw | esberglu could also be one SSP cluster is more reliable than another but still not 100% reliable | 14:31 |
esberglu | edmondsw: Any idea where to start debugging SSP issues? | 14:32 |
edmondsw | esberglu other than pinging that team, no | 14:32 |
edmondsw | efried any suggestions? | 14:33 |
edmondsw | esberglu are the 4 tempest tests in your CI TODO etherpad still the only places we've hit this? | 14:35 |
*** jwcroppe has joined #openstack-powervm | 14:36 | |
efried | Well, the first thing to do is grep the logs for how long SSP-related ops are taking. LU creation, upload, etc. | 14:40 |
efried | Compare the results on the good cluster vs the bad, see what the pattern is. | 14:41 |
efried | Is it just LU creation? Just upload? Or both? | 14:41 |
efried | Is it every time, or intermittent? | 14:41 |
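A rough illustration of the log-timing pass efried suggests: scan a compute log for the start and end of SSP-related operations and print how long each took, then compare the numbers between the good and the bad cluster. The timestamp regex and the operation markers below are assumptions, not the real log strings.

```python
#!/usr/bin/env python
"""Print how long SSP-related operations take, per compute log file.

The timestamp format and the start/end markers are assumptions -- replace
them with whatever the LU-create and upload log lines actually say.
"""
import re
import sys
from datetime import datetime

TS_RE = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)')
MARKERS = {
    'lu_create': ('Creating logical unit', 'Logical unit created'),
    'upload': ('Starting upload', 'Upload complete'),
}


def parse_ts(line):
    m = TS_RE.match(line)
    return datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S.%f') if m else None


def main(path):
    starts = {}
    for line in open(path):
        ts = parse_ts(line)
        if ts is None:
            continue
        for op, (start_marker, end_marker) in MARKERS.items():
            if start_marker in line:
                starts[op] = ts
            elif end_marker in line and op in starts:
                elapsed = (ts - starts.pop(op)).total_seconds()
                print('%s: %s took %.1fs' % (path, op, elapsed))


if __name__ == '__main__':
    main(sys.argv[1])
```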
*** tjakobs has joined #openstack-powervm | 14:42 | |
efried | Then probably sanity check the config of the SSPs. Are they backed by the same SAN? Do they have the same kind & number of disks? Same number of VIOSes with the same kind of resources? | 14:43 |
*** mdrabe has quit IRC | 14:44 | |
efried | I believe it also may be possible for one bad VIOS to affect the performance of the whole cluster; so maybe try removing one VIOS at a time in turn and see if it gets better. | 14:44 |
esberglu | edmondsw: afaik | 14:45 |
efried | We can engage the VIOS team as soon as we have hard evidence (timings from the log) that prove the pattern and point to the specific op(s) taking too long. | 14:45 |
efried | Yeah, also that - dig into those four tests and see what's different about them. They shouldn't be doing any specific op that's unique to those tests, but maybe they're doing a particular op multiple times where other tests just do it once. | 14:46 |
edmondsw | efried the log color work y'all did a couple weeks ago.. how do I get vim to interpret that and give me colors instead of those nasty codes? | 14:48 |
efried | vim, no idea, would have to RTFM | 14:48 |
efried | Why do you need to edit logs? | 14:48 |
efried | edmondsw https://stackoverflow.com/questions/10592715/ansi-color-codes-in-vim (first google hit) | 14:49 |
edmondsw | efried I don't need to edit them... what do you use to view them? | 14:49 |
efried | edmondsw I just use terminal tools like less, grep, etc. Those will interpret the codes by default. | 14:50 |
efried | If not, use less -R | 14:50 |
esberglu | edmondsw: mac or windows? | 14:50 |
edmondsw | mac | 14:50 |
edmondsw | efried ah, -R | 14:51 |
edmondsw | tx | 14:51 |
efried | sweet | 14:51 |
esberglu | The console application has some nice filtering capabilities but 0 color support | 14:51 |
esberglu | But if I want color I just use less -R | 14:51 |
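If a viewer can't render the escape sequences at all, they can simply be stripped instead. A minimal sketch of such a filter (hypothetical script, not part of any project mentioned here):

```python
#!/usr/bin/env python
"""Strip ANSI color codes from a log: strip_ansi.py < colored.log > plain.log"""
import re
import sys

# CSI color sequences such as "\x1b[31m" (set color) or "\x1b[0m" (reset).
ANSI_RE = re.compile(r'\x1b\[[0-9;]*m')

for line in sys.stdin:
    sys.stdout.write(ANSI_RE.sub('', line))
```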
efried | esberglu btw, don't make any sudden moves on the prep_devstack.sh changes - I'm in mid-review. | 14:52 |
esberglu | efried: k | 14:52 |
*** mdrabe has joined #openstack-powervm | 14:54 | |
esberglu | efried: edmondsw: Seeing this in the logs | 15:12 |
esberglu | http://paste.openstack.org/show/611732/ | 15:12 |
esberglu | So the test in question is trying to rebuild the server and times out waiting for the server to become active | 15:13 |
esberglu | But for some reason it is being deleted here before it finishes rebuilding | 15:13 |
esberglu | Searching for why it's getting deleted | 15:14 |
efried | Possibly this is just the manifestation of that condition. It fails to build, so it deletes it, then the test proceeds and that's just how it fails. | 15:14 |
esberglu | efried: Yeah | 15:15 |
mdrabe | efried: Is there a get_instance_wrapper_from_uuid API in OOT or should I make one? | 15:46 |
efried | mdrabe From nova UUID or pvm UUID? | 15:46 |
mdrabe | nova | 15:46 |
efried | You're looking to get the LPAR wrapper, not the instance obj, right? | 15:47 |
mdrabe | yup | 15:47 |
efried | nova_powervm.virt.powervm.vm.get_instance_wrapper takes an instance. You could update it so the 'instance' arg can be either a UUID or an instance obj. | 15:48 |
efried | Which ultimately probably just means making nova_powervm.virt.powervm.vm.get_pvm_uuid accept either | 15:49 |
efried | (but update docstrings for things that call it) | 15:49 |
mdrabe | K will do | 15:49 |
efried | coo | 15:49 |
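A rough sketch of the change agreed on above: let the vm-level helper accept either a nova instance object or a bare nova UUID string. The conversion call (pvm_uuid.convert_uuid_to_pvm) is from recollection of the pypowervm helper and should be verified; the rest is illustrative, not the actual nova_powervm.virt.powervm.vm source.

```python
# Sketch only -- not the actual nova_powervm.virt.powervm.vm code.
from pypowervm.utils import uuid as pvm_uuid  # recollection; verify the import


def get_pvm_uuid(instance):
    """Return the PowerVM UUID for a nova instance object *or* a nova UUID string."""
    # If it quacks like an instance, use its uuid; otherwise assume we were
    # handed the nova UUID string directly.
    nova_uuid = getattr(instance, 'uuid', instance)
    return pvm_uuid.convert_uuid_to_pvm(nova_uuid).upper()
```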
tjakobs | efried thorst: can you take another look at https://review.openstack.org/#/c/462248/ when you get a chance (file-backed ephemeral) | 15:53 |
*** k0da has quit IRC | 15:55 | |
esberglu | efried: Seeing a bunch of 412 errors before the previous paste | 15:59 |
esberglu | http://paste.openstack.org/show/611736/ | 15:59 |
efried | esberglu Very much expected, especially with a tempest run doing stuff in parallel. | 16:01 |
edmondsw | esberglu I think 412s may be normal. It does appear to recover from those and continue on. But they could be a symptom of delays | 16:01 |
efried | If those retry counts get above three or four, may be worth looking into, but I don't expect that to be related to the issue at hand. | 16:02 |
edmondsw | yeah, I think I saw it 2-3 times | 16:02 |
efried | tjakobs Ack, sorry, don't know why that didn't float to the top when you responded. | 16:03 |
edmondsw | efried is it normal to see a whole lot of NotImplementedErrors, like "It took flow 'destroy' 110.22 seconds to finish.: NotImplementedError" ? | 16:06 |
esberglu | efried: edmondsw: Another thing I noticed in the logs | 16:08 |
esberglu | http://paste.openstack.org/show/611739/ | 16:08 |
edmondsw | esberglu yeah, that's similar to what I'm looking at | 16:08 |
efried | edmondsw That's new, not our code, probably related to dhellman's recent oslo.log changes to print exception contexts. He's still working through it. | 16:10 |
efried | If it's not blowing things up, don't worry about it. | 16:11 |
efried | esberglu What's the distribution of neos to SSPs? Do all SSPs have the same number of neos, or is this one "heavy"? | 16:12 |
esberglu | All of the SSPs have 4 systems except one which has 2 | 16:12 |
efried | esberglu Are the SSPs all the same size (number & size of disks) and backed by the same SAN? | 16:15 |
esberglu | Yeah. Same SAN. All have 4 250G disks | 16:16 |
efried | edmondsw Have you devstacked your victim neo yet? | 16:25 |
edmondsw | efried no, I spun up a VM on which I can do that, but waiting for you to tell me what I may need to do in my local.conf | 16:26 |
efried | spun up a vm.... | 16:27 |
edmondsw | looking through logs on timeouts (between meetings and interruptions) | 16:27 |
edmondsw | efried gotta run devstack somewhere... | 16:27 |
efried | edmondsw Normally on the neo. | 16:27 |
edmondsw | oh really... | 16:27 |
efried | Were you planning to set up remote/proxy? | 16:28 |
edmondsw | efried no... I assumed you'd have devstack setup on a VM for nova api, etc. and then just run the compute stuff on the neo... but I guess it would be nice to just run it all on the neo... | 16:28 |
edmondsw | wasn't thinking through it | 16:29 |
edmondsw | still thinking like a PowerVC developer :) | 16:29 |
efried | edmondsw I suppose you could split it up, though I can't really think of any benefit to that. | 16:29 |
edmondsw | yeah | 16:29 |
efried | So the best thing to do, local.conf-wise, is to grab the one from the CI. | 16:29 |
efried | Or I suppose you can grab mine from neo40:/opt/stack/devstack/local.conf - it was known to work within the past week ;-) | 16:30 |
efried | and you wouldn't have to edit it as much. | 16:30 |
efried | I think | 16:30 |
efried | edmondsw Can we talk through the disable-compute-service thing at some point? | 16:40 |
edmondsw | efried yes please | 16:40 |
efried | Now good? | 16:40 |
edmondsw | I pulled down both local confs and started diffing them, but I can do that later... now sounds good | 16:40 |
efried | ^^ make sure you get the .aio one from the CI. | 16:41 |
edmondsw | now called "intree" | 16:41 |
edmondsw | yes | 16:41 |
efried | edmondsw No, you'll want the OOT one. | 16:41 |
edmondsw | oh... even though I'm working on an intree change? | 16:41 |
efried | oh, right. Well, I find it easiest to stack with OOT, cause then you can flip to in-tree just by changing the driver in nova.conf and restarting compute. | 16:42 |
efried | And you're gonna want to port this change back to OOT when done anyway. | 16:42 |
edmondsw | oh, if that's the case, makes sense | 16:42 |
efried | yeah, you can't go the other way (flip in-tree to OOT if you stacked in-tree only) cause nova-powervm and networking-powervm will be missing. | 16:43 |
efried | Anyway, back to disable-compute-service. | 16:43 |
efried | From the top, we have two scenarios we want to be able to cover: init_host and periodic task. | 16:44 |
edmondsw | yep | 16:44 |
efried | Starting with init_host, I'm not actually sure we can get anywhere if this fails. | 16:45 |
efried | It runs only once, when the compute service starts. I don't believe it retries if it fails - it just bails. | 16:45 |
edmondsw | efried the periodic task will retry the guts of it | 16:45 |
edmondsw | default runs that periodic task ever 60s IIRC | 16:46 |
efried | Only if the driver can successfully report get_available_nodes | 16:46 |
efried | Which relies on self.host_wrapper | 16:46 |
efried | Which doesn't exist if we never initted properly. | 16:46 |
efried | So we would have to change what that guy reports, to something generic that we can ascertain without talking over the adapter. | 16:47 |
edmondsw | efried what get_available_nodes call are you referencing? | 16:48 |
efried | nova.virt.powervm.driver.PowerVMDriver#get_available_nodes | 16:48 |
edmondsw | efried oh, I see it | 16:48 |
edmondsw | https://github.com/openstack/nova/blob/5d95cb9dbca403790db4e9680919e6716fa5cb76/nova/compute/manager.py#L6630 | 16:49 |
efried | Kinda raises the question as to whether 'available' ~ 'enabled' in this context. I'm gonna assert 'no', precisely because we want this to work. | 16:49 |
efried | Anyway, as far as what that returns, we can probably get away with using the neo's hostname, gleaned by local cmd rather than pvm API. | 16:50 |
efried | Course then we give esberglu another log scrubbing task :) | 16:50 |
efried | Unless we use the shortname, which would probably be aiight. But not sure how globally unique these have to be. | 16:51 |
edmondsw | what does get_available_nodes return today when working? | 16:51 |
efried | MTMS string | 16:52 |
efried | side topic, what IDE do you use? | 16:53 |
edmondsw | efried can we get that from a cli call? | 16:53 |
efried | hm, possibly, or something like it, through RMC. | 16:53 |
edmondsw | efried do we have to worry about RMC being down? | 16:54 |
edmondsw | I would guess yes :) | 16:54 |
efried | Well, yeah, but if it is, I think we can safely declare ourselves dead. | 16:54 |
efried | maybe | 16:54 |
edmondsw | does it possibly take time to get RMC working after restart? | 16:55 |
efried | Yeah, that's one of the main things we're trying to account for here - I don't know how quickly RMC comes up relative to other stuff. | 16:55 |
edmondsw | right | 16:55 |
edmondsw | me either | 16:55 |
mdrabe | Only for the VIOSes should RMC matter I think, but idk about the SDE case | 16:55 |
mdrabe | NL in VIOS environment doesn't have RMC | 16:56 |
efried | It may not be an actual RMC command we need. The MTMS (or something like it) may be in a text file somewhere. | 16:56 |
efried | (edmondsw, talking it through on slack fyi) | 17:02 |
edmondsw | efried yeah, I'm following | 17:02 |
efried | k | 17:02 |
*** smatzek has quit IRC | 17:04 | |
efried | mdrabe What we're working towards is allowing init_host to complete fairly quickly, even if the NL services aren't up yet, in which case it will mark the compute host as disabled. | 17:05 |
efried | But then recheck when the periodic task for get_available_resource runs. | 17:06 |
efried | and if NL is now responsive, enable the compute host (and make sure all the init_host stuff is really done) | 17:06 |
efried | So the problem is that the periodic task calls get_available_nodes | 17:07 |
mdrabe | That seems a little weird to me | 17:07 |
efried | which today uses the managed system wrapper to generate a name. | 17:07 |
efried | which part? | 17:08 |
mdrabe | I guess it just makes more sense to me to have init_host wait for services | 17:08 |
mdrabe | Why have compute do periodic tasks and the like before then? | 17:09 |
efried | Yeah, so the whole point here is not to hold the compute service hostage waiting for NL services. We want to be able to say "alive, but disabled" fairly quickly, and then enable later when things smooth out. | 17:11 |
efried | The general case is to do that if the NL services go pear-shaped during the normal course of events. | 17:11 |
efried | But accounting for the reboot scenario is the kinda tricky edge case we're working on. | 17:12 |
mdrabe | It doesn't have to be tricky is all I'm sayin | 17:14 |
efried | mdrabe How long do we wait? | 17:15 |
efried | That's what raised nova core eyebrows to begin with. | 17:15 |
mdrabe | It's pypowervm right? | 17:15 |
mdrabe | 300 tries every 2 seconds? | 17:15 |
mdrabe | 1 try every 2 seconds for a maximum of up to 300 *** | 17:16 |
efried | Which works out to 10 minutes? Michael Still and Matt Riedemann both independently balked at that. | 17:16 |
efried | Swhat prompted this whole project. | 17:17 |
mdrabe | I do have some concerns, but I don't wanna hold anything up | 17:19 |
efried | Thing is, we can make this work if we can resolve the get_available_nodes thing. | 17:20 |
edmondsw | efried is this only about getting initialized, or did we also want to handle the NL services and/or VIOS going down after some period of working? | 17:21 |
efried | edmondsw Both. | 17:22 |
edmondsw | the current impl only handles the former | 17:22 |
efried | Yup, we'll get to that next :) | 17:22 |
edmondsw | vmware didn't handle the latter, just initialization | 17:23 |
edmondsw | but I do think that would be a good idea | 17:23 |
edmondsw | and we won't have a get_available_nodes problem there | 17:23 |
edmondsw | because we can cache that from the time that it was working | 17:23 |
edmondsw | I'm thinking it may be best to just use hostname for get_available_nodes | 17:24 |
edmondsw | doesn't nova refer to nodes by hostname? | 17:24 |
efried | The docstring even says you can use hypervisor_hostname | 17:25 |
efried | So yeah, I guess if that's the case, it's probably workable. | 17:25 |
edmondsw | efried https://developer.openstack.org/api-ref/compute/#show-host-details | 17:26 |
edmondsw | then again, that's deprecated... | 17:26 |
edmondsw | mdrabe, did you see that they've deprecated os-hosts APIs? | 17:26 |
edmondsw | and specifically say there will be no replacement for maintenance_mode | 17:26 |
edmondsw | :( | 17:26 |
efried | edmondsw So | 17:35 |
efried | Just use socket.gethostname() | 17:35 |
edmondsw | efried do we need to get esberglu to add some log scraping before I push that up? | 17:36 |
efried | Nope, that yields the short hostname. | 17:36 |
efried | and is what nova uses to default the ComputeManager.host / Service.host / etc. | 17:36 |
edmondsw | efried k | 17:37 |
edmondsw | mdrabe is this change to get_available_nodes going to affect PowerVC? | 17:37 |
edmondsw | efried or other things on upgrade? | 17:37 |
efried | So the logs will now have things like "Compute_service record updated for neo40:neo40" instead of "Compute_service record updated for neo40:8247-21L*215439A" | 17:37 |
efried | I'm really hoping pvc doesn't use get_available_nodes anywhere. That would be... silly. | 17:38 |
efried | Hah - esberglu we just figured out another way you could figure out what neo you're running against in the CI. | 17:39 |
efried | ...until this change drops, that is. | 17:39 |
*** smatzek has joined #openstack-powervm | 17:39 | |
efried | Okay, so now I *think* that'll sort us out for init_host. | 17:41 |
efried | Now edmondsw, what you were saying before about VCenter not supporting dynamic enable/disable during runtime - they do. | 17:42 |
edmondsw | efried not really... they start returning blank data, but they don't disable the service | 17:42 |
edmondsw | efried oh, my bad... you're right | 17:43 |
efried | Their get_available_resource calls self._vc_state.get_host_stats(refresh=True), which calls update_status, which calls _set_host_enabled(False) if they get an exception trying to discover stuff. | 17:43 |
edmondsw | I overlooked L85, thought the only call was L108 | 17:43 |
efried | Which is roughly what we might do. But before we settle on that, I'd like to bring up another possibility. | 17:44 |
efried | This methodology relies on get_available_resource. | 17:44 |
efried | Which means a couple of things: | 17:44 |
efried | We're at the mercy of whatever interval the consumer configured for that guy to run. | 17:44 |
efried | If something goes wrong in between hits, we won't disable until the next hit. | 17:45 |
efried | Which could result in more failures than necessary. | 17:46 |
efried | The other thing is that get_available_resource is gonna disappear pretty soon, so we'll probably have to move the logic to get_inventory() - but will need to make sure that's running in the same kind of periodic task. | 17:46 |
edmondsw | efried right, I assumed that if we need to disable when NL services go down, that will be done differently, on failures wherever they occur | 17:46 |
efried | So yeah, that's an option. | 17:47 |
efried | In fact, there's ways of doing that where we could avoid get_available_resource altogether. | 17:47 |
efried | One way to do that would be to add a helper to our Adapter. | 17:47 |
efried | Those guys get wrapped around the low-level requests, so they can trap specific HTTP error codes and whatnot, which is kinda what we want to be on the lookout for. | 17:48 |
efried | We get a "service unavailable", we disable right then. | 17:48 |
efried | The question then becomes: how do we switch back on? | 17:48 |
efried | We could keep that logic in get_available_resource. | 17:49 |
efried | Or (go with me for a sec here) we could spawn a thread that polls the service and re-enables when it comes back to life. | 17:49 |
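A minimal sketch of that adapter-helper option: wrap each low-level request and disable the compute service as soon as the REST layer looks unavailable. The helper shape (a callable wrapping the next request function) follows the pattern described here; the unavailability check and the disable_service() callback are placeholders, not real APIs.

```python
# Sketch: all names and the status check below are assumptions.

def _looks_unavailable(exc):
    # Placeholder: a real version would check the specific pypowervm
    # exception type / HTTP status (e.g. 503) rather than a generic attribute.
    return getattr(exc, 'status', None) == 503


def make_unavailable_helper(disable_service):
    def helper(func):
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                if _looks_unavailable(exc):
                    # Disable right now instead of waiting for the next
                    # periodic task; something else re-enables later.
                    disable_service()
                raise
        return wrapper
    return helper
```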
*** chhavi has quit IRC | 17:51 | |
edmondsw | efried if we spawned a thread, we could do the same during init and not have to change get_available_nodes | 17:56 |
efried | uh | 17:56 |
edmondsw | efried no? am I missing something? | 17:57 |
efried | Well, we would still have to account for the fact that get_available_resource could run before init_host has set up the host_wrapper. | 17:57 |
edmondsw | efried you mean get_available_nodes? | 17:57 |
efried | yeah, called by get_available_resource | 17:57 |
edmondsw | yes, we'd have to add some error handling logic there but we could just return [] | 17:58 |
efried | hm, that could actually work. | 17:58 |
edmondsw | efried no, called by update_available_resource | 17:58 |
edmondsw | I kinda like returning [] there until nodes are actually available :) | 17:59 |
efried | shit, that could just work anyway. | 17:59 |
efried | yeah | 17:59 |
efried | As long as the thing that re-enables the service is running NOT at the behest of update_available_resource | 18:00 |
edmondsw | actually... no, I think we'd still want to change get_available_resource to hostname | 18:01 |
edmondsw | I think we're getting hung up on the word available | 18:01 |
edmondsw | it's available in the sense that there is a compute service setup there. It's not available in the sense that the service is disabled | 18:01 |
edmondsw | I think nova means the former, though | 18:01 |
edmondsw | they seem to have coded that way, as in update_available_resource | 18:02 |
efried | Right | 18:02 |
edmondsw | I'm still a little concerned that switching from MTMS to hostname is going to break something somewhere | 18:03 |
efried | Anyway, let's table that and come back to it. | 18:03 |
efried | I'm 95% sure that is an opaque string that nobody cares about unless there's more than one on a compute node (which I think only applies to ironic) | 18:04 |
efried | We could return 'foo' and it would be okay. | 18:04 |
efried | But let's table it for now. | 18:04 |
efried | So I'm not a big fan of threads in general. | 18:05 |
efried | The get_available_resource periodic task thing is probably "good enough". | 18:05 |
efried | It's also worth noting that a recent change was made that automatically disables a compute service on which some number of consecutive deploys failed. | 18:06 |
efried | 10 by default, conf-able. | 18:06 |
edmondsw | yeah, would be nice to be automatically re-enabling after they've done that | 18:08 |
efried | Well, the stated design there is that the admin has to re-enable manually. | 18:13 |
efried | There are many reasons beyond nvl services that deploys could be failing. | 18:13 |
efried | edmondsw So the other thing I want to see done - possibly in a preceding change set - is factoring out the service enable/disable code, which is now used in at least four places I know of, including this one. | 18:17 |
edmondsw | efried sure | 18:19 |
efried | Okay, so to disable, we could a) rely on get_available_resource as the trigger; and/or b) add a helper to the adapter to disable on certain conditions | 18:21 |
efried | To re-enable, we could a) rely on get_available_resource as the trigger; and/or b) spawn a thread when we disable (however that is) to poll for live-ness. | 18:21 |
edmondsw | I'm working on (a) atm | 18:22 |
edmondsw | for both | 18:22 |
efried | Why I like relying on get_available_resource: It's freakin simple. Why I don't like it: we're at the mercy of periodic_task_interval/update_resources_interval | 18:25 |
efried | I'm gonna declare that's okay for now. If we get complaints (I can almost guarantee we never will) we can look into swapping it around. | 18:26 |
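A rough sketch of option (a) as declared "okay for now": the get_available_resource periodic task is the single trigger that both disables the service (on any REST failure) and re-enables it (on recovery). Every attribute and helper in the toy class below is a placeholder, not the real driver code.

```python
class _DriverSketch(object):
    """Toy stand-in for the driver; every helper below is a placeholder."""

    def __init__(self):
        self._initted = False
        self.enabled = True

    def _complete_deferred_init(self):
        # Whatever init_host could not do while pvm-rest was down.
        pass

    def _build_resource_stats(self, nodename):
        # Stands in for the real inventory/stats collection over the adapter.
        return {'hypervisor_hostname': nodename}

    def _set_host_enabled(self, enabled):
        # Stands in for the shared enable/disable helper discussed above.
        self.enabled = enabled

    def get_available_resource(self, nodename):
        # Option (a): this periodic task is the only disable/enable trigger,
        # so reaction time is bounded by the task interval.
        try:
            if not self._initted:
                self._complete_deferred_init()
                self._initted = True
            stats = self._build_resource_stats(nodename)
        except Exception:
            self._set_host_enabled(False)
            return {}
        self._set_host_enabled(True)
        return stats
```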
efried | I'd like there to be a record of the design alternatives, though. May be too heavyhanded to do it in code comments, but at least in the commit message. | 18:28 |
efried | ...so our posterity can git blame their way to finding it. | 18:28 |
efried | You can refer to this IRC log - we're on eavesdrop. | 18:28 |
efried | edmondsw In other news, I found out how to get the MTMS from /proc ;-) | 18:29 |
edmondsw | efried oh really? | 18:31 |
efried | yup, gimme a sec. | 18:31 |
mdrabe | edmondsw os-hosts being deprecated is gonna cause some problems | 18:45 |
mdrabe | And about this get_available_nodes business, can it just return CONF.host? | 18:46 |
mdrabe | What is CONF.host set to by default? | 18:46 |
edmondsw | mdrabe yeah, I'm worried about the os-hosts thing | 18:46 |
edmondsw | chatted with cjvolzka about that a bit, and will send a note | 18:46 |
edmondsw | mdrabe I was just looking for CONF.host... so stay tuned | 18:47 |
efried | edmondsw http://paste.openstack.org/show/611760/ | 18:49 |
efried | From what I can tell, if you don't set CONF.host, things that want it use socket.gethostname() | 18:50 |
esberglu | efried: edmondsw: thorst: Thinking about reworking the patching logic in prep_devstack | 18:51 |
esberglu | My idea is that instead of passing the patch lists through on the command line we provide a file that will have a project per line | 18:51 |
esberglu | <project1> <patch_list_1> | 18:51 |
esberglu | <project2> <patch_list_2> | 18:51 |
esberglu | So we can patch multiple projects in the same run. And keep this file in powervm-ci so we can live update super easily | 18:51 |
esberglu | Thoughts? | 18:51 |
edmondsw | mdrabe https://github.com/openstack/nova/blob/master/nova/conf/netconf.py#L55 | 18:51 |
edmondsw | default for CONF.host is socket.gethostname() | 18:52 |
edmondsw | efried that's what we'd talked about changing it to anyway | 18:52 |
efried | esberglu Yes. Make the format project:branch:patch_list | 18:52 |
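A sketch of parsing the proposed file in efried's project:branch:patch_list format; the file layout details (comment lines, comma-separated patch list) are assumptions.

```python
#!/usr/bin/env python
"""Parse a patch file with one "project:branch:patch_list" entry per line."""
import sys


def read_patch_file(path):
    entries = []
    for raw in open(path):
        line = raw.strip()
        if not line or line.startswith('#'):
            continue
        project, branch, patch_list = line.split(':', 2)
        entries.append((project, branch, patch_list.split(',')))
    return entries


if __name__ == '__main__':
    for project, branch, patches in read_patch_file(sys.argv[1]):
        print('%s (%s): %s' % (project, branch, ' '.join(patches)))
```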
mdrabe | CONF.host is what pvc does | 18:52 |
edmondsw | turns out PowerVC is already overwriting the pvm driver to use CONF.host | 18:52 |
edmondsw | mdrabe right | 18:52 |
edmondsw | so I'm thinking we just change the driver to use CONF.host and PowerVC can just stop overwriting that | 18:53 |
edmondsw | efried agreed? | 18:53 |
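What that proposal might look like in the driver, as a hedged sketch rather than the actual patch: report the node by CONF.host (which defaults to socket.gethostname(), per the netconf link above) instead of deriving the MTMS from the managed system wrapper.

```python
import nova.conf

CONF = nova.conf.CONF


class PowerVMDriver(object):  # stand-in; only this method is sketched
    def get_available_nodes(self, refresh=False):
        # CONF.host defaults to socket.gethostname(), so no PowerVM REST
        # call (and no host_wrapper) is needed to answer this.
        return [CONF.host]
```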
esberglu | efried: Ok. Will wait until the external prep_devstack is working to get started on that | 18:53 |
mdrabe | +1 | 18:53 |
efried | Shrug. | 18:54 |
efried | Sure. | 18:54 |
edmondsw | esberglu +1 | 18:54 |
*** k0da has joined #openstack-powervm | 19:02 | |
*** thorst_afk has quit IRC | 19:11 | |
*** thorst_afk has joined #openstack-powervm | 19:15 | |
edmondsw | efried how can I stop pvm-rest to simulate that going down? | 19:18 |
*** thorst_afk has quit IRC | 19:20 | |
mdrabe | edmondsw try `service pvm-rest stop` maybe | 19:20 |
edmondsw | mdrabe nope | 19:20 |
mdrabe | sudo? | 19:21 |
edmondsw | mdrabe yep... duh... | 19:24 |
*** thorst_afk has joined #openstack-powervm | 19:29 | |
edmondsw | efried I just found an infinite loop condition in validate_vios_ready... | 19:37 |
edmondsw | I'll propose something | 19:37 |
efried | edmondsw Well, we haven't hit it yet... | 19:38 |
edmondsw | efried not surprising, actually | 19:38 |
edmondsw | efried that method is currently called right after you setup the adapter, so it should be good to go, and you'd only hit this if the adapter wasn't working | 19:39 |
efried | edmondsw Okay, so if the adapter went belly up right after it was initialized successfully? | 19:40 |
edmondsw | in current usage... but we're going to start calling this differently, so it'd be more likely | 19:40 |
edmondsw | efried https://github.com/powervm/pypowervm/blob/master/pypowervm/tasks/partition.py#L263 | 19:41 |
edmondsw | if you always hit the Exception in L275, you never break out | 19:41 |
edmondsw | max_wait_time is ignored | 19:42 |
edmondsw | because rmc_down_vioses will always evaluate False in the if that can break out based on max_wait_time | 19:42 |
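A simplified reconstruction of the loop shape being described (not the actual pypowervm source), showing why the exception path can spin forever and where a time-based bound would fix it; _vioses_with_rmc_down is a hypothetical stand-in for the real query.

```python
import time


def _vioses_with_rmc_down(adapter):
    """Hypothetical stand-in for the real VIOS/RMC query."""
    raise NotImplementedError


def validate_vios_ready(adapter, max_wait_time=600, sleep_time=5):
    # Reconstruction of the described shape: the only exit that honors
    # max_wait_time is guarded by rmc_down_vioses, so if the query raises
    # on every pass the list stays empty and the loop never ends.
    waited = 0
    while True:
        rmc_down_vioses = []
        try:
            rmc_down_vioses = _vioses_with_rmc_down(adapter)
            if not rmc_down_vioses:
                return  # all VIOSes report RMC up
        except Exception:
            # Broken adapter: rmc_down_vioses stays [], making the bounded
            # exit below unreachable.
            pass
        if rmc_down_vioses and waited >= max_wait_time:
            return  # give up, some VIOSes still down
        # Fix sketch: also bail on elapsed time alone, i.e. check
        # "waited >= max_wait_time" without the rmc_down_vioses guard.
        time.sleep(sleep_time)
        waited += sleep_time
```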
esberglu | efried: Can I get a final review on 5406? | 19:42 |
efried | on it now | 19:43 |
esberglu | I want to deploy staging with that this afternoon | 19:43 |
esberglu | efried: thanks | 19:43 |
efried | esberglu done | 19:43 |
efried | esberglu Is 5405 just WIP because it's waiting for the other to merge? | 19:44 |
esberglu | efried: Yeah pretty much. And if i hit anything unexpected when testing on staging | 19:56 |
-openstackstatus- NOTICE: The Gerrit service on review.openstack.org is being restarted now to clear some excessive connection counts while we debug the intermittent request failures reported over the past few minutes | 20:06 | |
*** smatzek has quit IRC | 20:08 | |
efried | esberglu Heads up, the nova project just went bananas with spurious merge conflicts on every pending change set. | 21:09 |
efried | So dozens of rebases are gonna choke CIs everywhere. | 21:09 |
efried | It'll be a good scale test for us, I suppose. Have we ever hit full capacity with a wait queue before? | 21:09 |
efried | I count 38 in the last hour | 21:11 |
esberglu | efried: Yeah we have hit our max nodes before. There are 86 runs in the queue right now (current max is 50). We can up the max probably | 21:11 |
efried | Not sure if that's necessary, assuming they don't just drop off at some point. | 21:12 |
efried | We're not likely to be the last CI posting results after this glut. | 21:12 |
efried | The max in this context is the number of nodes running simultaneously? | 21:13 |
esberglu | efried: Yep. We very rarely get up to 50 under normal circumstances but it does happen every now and then | 21:13 |
efried | Increasing the number of nodes would put load on.. what? | 21:13 |
efried | The systems those VMs live on? Are they shared proc/mem? | 21:13 |
efried | And presumably the SSPs backing them. | 21:14 |
esberglu | efried: I don't know enough about the performance side of things to say | 21:14 |
efried | k. If you're saying it's really rare to hit the max, let's leave it alone. This is an anomaly, for sure. | 21:14 |
esberglu | I think I have seen it hit the max 2 (maybe 3) other times ever. So yeah very rare | 21:15 |
*** edmondsw_ has joined #openstack-powervm | 21:22 | |
*** zerick_ has joined #openstack-powervm | 21:25 | |
*** jpasqualetto has quit IRC | 21:27 | |
*** adi_____ has quit IRC | 21:29 | |
*** edmondsw has quit IRC | 21:29 | |
*** zerick has quit IRC | 21:29 | |
*** zerick_ is now known as zerick | 21:30 | |
openstackgerrit | Matt Rabe proposed openstack/nova-powervm master: Change NVRAM manager store to use uuid instead of instance object https://review.openstack.org/471926 | 21:30 |
*** esberglu has quit IRC | 21:34 | |
*** esberglu has joined #openstack-powervm | 21:35 | |
*** mdrabe has quit IRC | 21:38 | |
*** esberglu has quit IRC | 21:39 | |
*** esberglu has joined #openstack-powervm | 21:48 | |
*** esberglu has quit IRC | 21:53 | |
*** thorst_afk has quit IRC | 21:53 | |
*** edmondsw_ has quit IRC | 21:58 | |
*** svenkat has quit IRC | 22:12 | |
*** tjakobs has quit IRC | 22:22 | |
*** thorst_afk has joined #openstack-powervm | 22:28 | |
*** thorst_afk has quit IRC | 22:32 | |
*** jwcroppe has quit IRC | 22:36 | |
*** jwcroppe has joined #openstack-powervm | 22:36 | |
*** jwcroppe has quit IRC | 22:36 | |
*** jwcroppe has joined #openstack-powervm | 22:37 | |
*** jwcroppe has quit IRC | 22:41 | |
*** thorst_afk has joined #openstack-powervm | 23:02 | |
*** openstack has joined #openstack-powervm | 23:14 | |
*** thorst_afk has quit IRC | 23:17 | |
*** edmondsw has joined #openstack-powervm | 23:18 | |
*** adi_____ has joined #openstack-powervm | 23:20 | |
*** edmondsw has quit IRC | 23:23 | |
*** svenkat has joined #openstack-powervm | 23:24 | |
*** k0da has quit IRC | 23:46 |