13:30:34 <adreznec> #startmeeting PowerVM CI Meeting
13:30:35 <openstack> Meeting started Thu Nov 10 13:30:34 2016 UTC and is due to finish in 60 minutes.  The chair is adreznec. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:30:36 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:30:38 <openstack> The meeting name has been set to 'powervm_ci_meeting'
13:30:50 <adreznec> All right. Roll call?
13:30:53 <thorst_> o/
13:31:22 <wangqwsh> start?
13:31:25 <esberglu> yep
13:31:41 <adreznec> All right, looks like we have enough to get started
13:31:48 <adreznec> #topic Current Status
13:32:11 <adreznec> thorst_ or esberglu, want to kick things off here?
13:32:29 <efried> o/
13:32:48 <thorst_> so I think my (and really efried's) contribution is that the fifo_pipo is almost done
13:33:01 <thorst_> I had a working version yesterday (no UT yet), then efried did a rev
13:33:09 <thorst_> I'm not sure if we've validated that rev yet.  efried?
13:33:14 <efried> No chance it's working now ;-)
13:33:24 <thorst_> I have full faith!
13:33:32 <thorst_> but want proof as well
13:33:33 <thorst_> :-)
13:33:39 <efried> Can you try it out with whatever setup you used yesterday thorst_?
13:34:31 <thorst_> efried: yep - can do
13:34:35 <efried> coo
13:34:42 <adreznec> AWESOME
13:34:47 <adreznec> Oops
13:34:48 <thorst_> then we'll ask esberglu to take it in the CI while we work UT and what not
13:34:55 <esberglu> I loaded the devstack CI with the version Drew put up yesterday
13:34:55 <thorst_> well it is awesome.
13:35:02 <thorst_> oooo
13:35:02 <adreznec> Not quite that awesome, but still good to get it tested
13:35:04 <thorst_> how'd that go?
13:35:25 <esberglu> The tempest runs are still just looping, looking for a matching LU with checksum xxx
13:35:41 <esberglu> http://184.172.12.213/02/396002/1/silent/neutron-pvm-dsvm-tempest-full/3aed2d8/
13:35:48 <efried> Right, I was going to look at that this morning.
13:36:00 <thorst_> bah
13:36:16 <esberglu> So it's not even getting to the tempest stuff at this point
13:37:11 <efried> uh, is that the right log?
13:37:22 <efried> oh, is it the os_ci_tempest.sh that's doing that?
13:37:38 <thorst_> efried: right...we pre-seed the image in the SSP
13:37:39 <esberglu> Yeah
13:37:42 <adreznec> yikes
13:37:44 <adreznec> super early
13:38:13 <efried> oh, we've seen this before.
13:38:16 <thorst_> #action thorst to validate efried's rev of remote patch
13:38:41 <efried> #action efried to diag/debug "looking for LU with checksum" loop
13:39:03 <adreznec> We have? Hmm, must be forgetting
13:39:10 <adreznec> Ok
13:39:34 <esberglu> #action esberglu to put remote patch in CI when ready
13:39:44 <adreznec> What other status do we have? wangqwsh I saw an email from you with boot speed issues?
13:39:49 <esberglu> Yay I have an action this time
13:40:01 <wangqwsh> yes
13:40:14 <wangqwsh> not sure the reason
13:40:24 <adreznec> How slow are we talking here?
13:40:26 <thorst_> wangqwsh: how slow of a boot are we talking?
13:40:28 <thorst_> lol
13:40:35 <adreznec> echo echo echo
13:40:47 <wangqwsh> more than 3 hours
13:41:11 <wangqwsh> they are still in 'deleting'
13:41:14 <thorst_> whoa.
13:41:18 <adreznec> Wow, that's a lot worse than I expected
13:41:26 <esberglu> Ahh I bet they hit the marker LU thing
13:41:27 <thorst_> I thought we'd be talking a couple mins here.
13:41:28 <adreznec> Ok, and this is on neo14?
13:41:30 <thorst_> but it's deleting?
13:41:40 <wangqwsh> ye, neo14
13:41:44 <wangqwsh> yes
13:41:55 <wangqwsh> I am trying to delete them.
13:42:21 <wangqwsh> because of a spawning error.
13:42:51 <efried> Okay, the thing esberglu is seeing is because there's a stale marker LU somewhere that's blocking the "seed LPAR" creation.
13:43:13 <esberglu> Yeah. I will go in and delete that marker lu now
13:43:16 <efried> And by "stale" I mean "who knows?"
13:43:26 <adreznec> wangqwsh: I take it they never finished booting?
13:43:27 <efried> partf59362e8image_base_os_2f282b84e7e608d5852449ed940bfc51
13:43:44 <wangqwsh> yes.
13:43:47 <adreznec> Ok
13:43:49 <wangqwsh> never..
13:43:52 <adreznec> Yeah it sounds like they got blocked
13:44:19 <efried> So people.
13:44:40 <efried> Is it possible for the compute process to die completely without cleaning up?
13:44:56 <efried> Possible of course theoretically - but in the CI environment
13:45:34 <efried> Cause Occam's razor says that's the most likely way we end up in this stale-marker-LU state.
13:45:54 <thorst_> efried: sure - we hit the timeout limit...it'll leave it around
13:45:57 <thorst_> but we have a cleanup process
13:46:05 <thorst_> now whether or not that cleanup process is cleaning out the markers...
13:46:07 <efried> Right - the 'finally' block will still get executed.
13:46:15 <efried> Yes, 'finally' cleans up the marker LU.
13:46:39 <thorst_> no no...
13:46:50 <thorst_> I mean, we can just straight kill from zuul the run
13:46:55 <thorst_> shut down the VM
13:46:57 <thorst_> mid process
13:47:04 <thorst_> could I guess leave the marker around...
13:47:19 <adreznec> Yeah
13:47:26 <efried> oh, for sure.
13:47:27 <thorst_> I'd think that'd be rare
13:47:28 <adreznec> An edge case, but definitely possible
13:47:49 <efried> Well, I can assure you our cleanup process doesn't delete marker LUs.
13:47:56 <efried> At that level.
13:48:05 <efried> Cause how would we know which ones were safe to delete?
13:48:57 <thorst_> efried: agree.
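A minimal, purely conceptual sketch of the pattern being discussed (not the actual pypowervm code; create_marker, do_upload, and remove_marker are hypothetical placeholders). It shows why an ordinary failure still cleans up the marker LU, while a hard kill of the compute process never reaches the 'finally' and leaves it behind:

    def upload_with_marker(ssp, create_marker, do_upload, remove_marker):
        # Create the marker LU that tells other nodes an upload is in flight.
        marker = create_marker(ssp)
        try:
            # The image upload takes a while, so there is a wide window in
            # which zuul can tear the VM down or the process can be killed.
            return do_upload(ssp)
        finally:
            # Runs on exceptions and timeouts, but never if the whole
            # compute process is killed outright.
            remove_marker(ssp, marker)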
13:49:20 <efried> But it seems we're seeing this very frequently.
13:49:21 <thorst_> though I think that the CI could have a job that says 'hey...if we have a marker lu that's been around for more than 60 minutes...it's time to delete it'
13:49:30 <thorst_> that would be 100% specific to the CI though
13:49:36 <efried> yeah
13:49:41 <thorst_> because we know that nothing in there should take more than an hour.
13:49:56 <efried> But would that mask if we had a real bug that causes a real upload hang?
13:50:01 <thorst_> but I'd rather try to find the root cause (I don't think we have yet) before we do that
13:50:06 <thorst_> efried: yep...exactly
13:50:14 <efried> no, we definitely have not identified the root cause.
13:50:24 <thorst_> I wonder if we should add to the CI job something that prints out all the LUs before it runs
13:50:35 <thorst_> before the devstack even...
13:51:02 <esberglu> I can add that in, it would be super easy
13:51:05 <adreznec> You're thinking for debug?
13:51:12 <thorst_> adreznec: yah
13:51:20 <adreznec> Seems like a fair place to start
13:51:21 <thorst_> just to know, was it there before we even started
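A minimal sketch of what that pre-run listing might look like, assuming pypowervm's wrapper API (local Session/Adapter on the NovaLink, an SSP feed with a logical_units list); the attribute names should be verified against the installed pypowervm:

    from pypowervm import adapter as pvm_adpt
    from pypowervm.wrappers import storage as pvm_stor

    # Local (NovaLink) authentication; assumes this runs on the host itself.
    adpt = pvm_adpt.Adapter(pvm_adpt.Session())

    # Dump every LU in every SSP so the CI log shows what was already there
    # before devstack even ran.  Leftover marker LUs show up with the 'part'
    # prefix seen above (e.g. partf59362e8image_base_os_...).
    for ssp in pvm_stor.SSP.wrap(adpt.read(pvm_stor.SSP.schema_type)):
        print('SSP %s:' % ssp.name)
        for lu in ssp.logical_units:
            print('  %s (type=%s)' % (lu.name, lu.lu_type))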
13:51:31 <efried> And at the end of a run, grep the log for certain messages pertaining to marker LUs.
13:51:33 <adreznec> Is there anything else we'd want to add to that debug info
13:51:37 <thorst_> and *maybe* to the pypowervm bit
13:51:54 <thorst_> as part of a warning log, when we detect a marker lu...but I thought we had that.
13:52:06 <thorst_> down in pypowervm itself...
13:52:10 <efried> 2016-11-10 02:53:12.099 INFO pypowervm.tasks.cluster_ssp [req-dcce7f01-a1fd-470b-a174-9fc51f9a4a05 admin admin] Waiting for in-progress upload(s) to complete.  Marker LU(s): ['partf59362e8image_base_os_2f282b84e7e608d5852449ed940bfc51']
13:52:45 <efried> I'm not seeing pypowervm DEBUG turned on in this log, btw, esberglu.
13:52:53 <thorst_> yeah...
13:52:54 <efried> I thought we made that change.
13:53:12 <thorst_> efried: we had it off because of all the gorp.  Now that gorp is gone...did we make that change in CI yet?
13:53:20 <adreznec> I thought so...
13:53:48 <esberglu> We did. I wonder if I didn't pull down the newest neo-os-ci last redeploy
13:54:06 <efried> Merged on 11/3
13:54:16 <efried> 4420
13:54:27 <thorst_> #action esberglu to figure out where the pypowervm debug logs are
13:54:40 <esberglu> I definitely did not pull down the newest...
13:54:57 <adreznec> That sounds bad
13:55:37 <esberglu> If I go into jenkins and manually kill jobs would that leave the marker LUs around?
13:55:55 <adreznec> Does zuul delete the VM if the job ends?
13:56:38 <adreznec> Have to step away for a couple minutes, please continue without me
13:56:40 <efried> esberglu, I think the answer is yes
13:56:51 <thorst_> esberglu: well, if it's stuck in an upload
13:57:11 <thorst_> efried: I wonder if we could/should get the VM's name in the marker lu
13:57:13 <efried> Anything that aborts the compute process.
13:57:19 <esberglu> Ahh then I'm probably to blame. I do that sometimes when I need to redeploy but don't want to wait for all the runs to finish
13:57:38 <thorst_> esberglu: when we redeploy though, I thought we cleaned out the SSP?
13:57:39 <esberglu> Or just redeploy the management playbook which also just kills stuff
13:57:47 <esberglu> Yeah redeploys do
13:57:55 <esberglu> clean out the ssp
13:57:55 <efried> And since upload takes "a while", you can easily catch a thread that's in the middle of one.
13:58:07 <thorst_> maybe we need to add something to the management playbook to clean out the SSPs.
13:58:13 <thorst_> but efried, thoughts on adding a VM name?
13:58:29 <efried> Right - I had considered it, but it's kinda tough with name length restrictions.
13:58:59 <thorst_> yeah
13:59:02 <efried> I can make it work if we think it's critical.
13:59:12 <thorst_> not sure...lets see how these other actions pan out?
13:59:13 <efried> Though I was going to make it some part of the MTMS
13:59:20 <efried> The VM name doesn't really help us much.
13:59:29 <thorst_> host + lpar id...
13:59:36 <thorst_> that'd be a start
13:59:49 <thorst_> changing it though has implications in the field...though I think this is low-use code at the moment
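Purely illustrative, not current pypowervm behavior: one way a marker LU name could squeeze host and LPAR identity into the length limit (the helper, the 79-character cap, and the field widths are all made up here):

    MAX_LU_NAME_LEN = 79  # placeholder; use the real limit from pypowervm

    def marker_lu_name(host_serial, lpar_id, image_lu_name):
        # 'part' + tail of the host serial + LPAR id + the image LU name,
        # truncated to the (assumed) maximum LU name length.
        name = 'part%s_%s_%s' % (host_serial[-6:], lpar_id, image_lu_name)
        return name[:MAX_LU_NAME_LEN]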
14:00:20 <efried> But yeah, if esberglu is interrupting runs, that's going to be the culprit most of the time.
14:00:49 <thorst_> esberglu: can we add something in the management playbook that cleans out the compute node markers (or ssps)?
14:01:01 <thorst_> I mean, honestly, we want to clear out all of the computes
14:01:08 <thorst_> but just not redeploy.
14:01:26 <esberglu> So the computes get cleaned out when the compute playbook is run
14:01:33 <esberglu> But not in the management playbook
14:01:38 <efried> This is something that covers all the nodes sharing a given SSP?
14:01:57 <thorst_> efried: it'd be all nodes in the given cloud
14:02:02 <efried> Cool.
14:02:12 <efried> Are we close to done?  I've got another meeting.
14:02:15 <thorst_> esberglu: yeah, it's weird...because it's the management playbook.  But I think you're running that just to build new images.
14:02:30 <thorst_> so, I think you want to pull the clean of the compute nodes into that...or make a new playbook altogether for it
14:02:30 <esberglu> Yeah exactly
14:02:37 <thorst_> efried: yeah, me too
14:03:58 <esberglu> #action esberglu find a way to clean compute nodes from management playbook
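A rough sketch of what that cleanup step could call, under two assumptions: marker LUs are identifiable by the 'part' name prefix seen in the log above, and pypowervm.tasks.storage provides rm_ssp_storage with roughly this shape (both worth confirming before wiring it into the playbook). Since the playbook only runs when the CI is being redeployed or torn down, removing every marker LU there is safe even if an upload was in flight:

    from pypowervm import adapter as pvm_adpt
    from pypowervm.tasks import storage as tsk_stor
    from pypowervm.wrappers import storage as pvm_stor

    adpt = pvm_adpt.Adapter(pvm_adpt.Session())

    for ssp in pvm_stor.SSP.wrap(adpt.read(pvm_stor.SSP.schema_type)):
        # Simplification: treat any LU whose name starts with 'part' as a
        # leftover marker from an interrupted upload.
        stale = [lu for lu in ssp.logical_units if lu.name.startswith('part')]
        if stale:
            tsk_stor.rm_ssp_storage(ssp, stale)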
14:04:37 <esberglu> The only other thing I had was for wangqwsh
14:04:44 <esberglu> The read only filesystem was fixed
14:04:57 <thorst_> is that an assertion or a question?
14:05:31 <wangqwsh> esberglu: cool
14:05:38 <esberglu> And I redeployed the staging env. with the newest versions of both OSA CI patches and the latest local2remote. I'm still getting stuck trying to build the wheels
14:05:55 <esberglu> It looks like you got past all of the bootstrap stuff and to the tempest part?
14:06:13 <wangqwsh> yes, at the tempest
14:06:32 <wangqwsh> you can use this review:
14:07:03 <wangqwsh> let me find it
14:08:07 <wangqwsh> http://morpheus.rch.stglabs.ibm.com/#/c/4033/
14:08:36 <esberglu> That review is already deployed in the environment
14:08:36 <wangqwsh> I added some variables and scripts for OSA
14:09:13 <esberglu> Yeah I have the latest version of that deployed
14:10:08 <esberglu> I will send an email with more info about what I am hitting, too much info for irc
14:10:21 <esberglu> #endmeeting
14:10:23 <adreznec> Ok
14:10:25 <wangqwsh> ok
14:10:31 <adreznec> Sounds like we're done then
14:10:33 <adreznec> Thanks all
14:10:36 <adreznec> #endmeeting