13:30:34 <adreznec> #startmeeting PowerVM CI Meeting
13:30:35 <openstack> Meeting started Thu Nov 10 13:30:34 2016 UTC and is due to finish in 60 minutes. The chair is adreznec. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:30:36 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:30:38 <openstack> The meeting name has been set to 'powervm_ci_meeting'
13:30:50 <adreznec> All right. Roll call?
13:30:53 <thorst_> o/
13:31:22 <wangqwsh> start?
13:31:25 <esberglu> yep
13:31:41 <adreznec> All right, looks like we have enough to get started
13:31:48 <adreznec> #topic Current Status
13:32:11 <adreznec> thorst_ or esberglu, want to kick things off here?
13:32:29 <efried> o/
13:32:48 <thorst_> so I think my (and really efried's) contribution is that the fifo_pipo is almost done
13:33:01 <thorst_> I had a working version yesterday (no UT yet), then efried did a rev
13:33:09 <thorst_> I'm not sure if we've validated that rev yet. efried?
13:33:14 <efried> No chance it's working now ;-)
13:33:24 <thorst_> I have full faith!
13:33:32 <thorst_> but want proof as well
13:33:33 <thorst_> :-)
13:33:39 <efried> Can you try it out with whatever setup you used yesterday, thorst_?
13:34:31 <thorst_> efried: yep - can do
13:34:35 <efried> coo
13:34:42 <adreznec> AWESOME
13:34:47 <adreznec> Oops
13:34:48 <thorst_> then we'll ask esberglu to take it in the CI while we work UT and whatnot
13:34:55 <esberglu> I loaded the devstack CI with the version Drew put up yesterday
13:34:55 <thorst_> well it is awesome.
13:35:02 <thorst_> oooo
13:35:02 <adreznec> Not quite that awesome, but still good to get it tested
13:35:04 <thorst_> how'd that go?
13:35:25 <esberglu> The tempest runs are still just looping looking for a matching LU with checksum xxx
13:35:41 <esberglu> http://184.172.12.213/02/396002/1/silent/neutron-pvm-dsvm-tempest-full/3aed2d8/
13:35:48 <efried> Right, I was going to look at that this morning.
13:36:00 <thorst_> bah
13:36:16 <esberglu> So it's not even getting to the tempest stuff at this point
13:37:11 <efried> uh, is that the right log?
13:37:22 <efried> oh, is it the os_ci_tempest.sh that's doing that?
13:37:38 <thorst_> efried: right...we pre-seed the image in the SSP
13:37:39 <esberglu> Yeah
13:37:42 <adreznec> yikes
13:37:44 <adreznec> super early
13:38:13 <efried> oh, we've seen this before.
13:38:16 <thorst_> #action thorst to validate efried's rev of remote patch
13:38:41 <efried> #action efried to diag/debug "looking for LU with checksum" loop
13:39:03 <adreznec> We have? Hmm, I must be forgetting
13:39:10 <adreznec> Ok
13:39:34 <esberglu> #action esberglu to put remote patch in CI when ready
13:39:44 <adreznec> What other status do we have? wangqwsh, I saw an email from you about boot speed issues?
13:39:49 <esberglu> Yay, I have an action this time
13:40:01 <wangqwsh> yes
13:40:14 <wangqwsh> not sure of the reason
13:40:24 <adreznec> How slow are we talking here?
13:40:26 <thorst_> wangqwsh: how slow of a boot are we talking?
13:40:28 <thorst_> lol
13:40:35 <adreznec> echo echo echo
13:40:47 <wangqwsh> more than 3 hours
13:41:11 <wangqwsh> they are still in deleting
13:41:14 <thorst_> whoa.
13:41:18 <adreznec> Wow, that's a lot worse than I expected
13:41:26 <esberglu> Ahh, I bet they hit the marker LU thing
13:41:27 <thorst_> I thought we'd be talking a couple mins here.
13:41:28 <adreznec> Ok, and this is on neo14?
13:41:30 <thorst_> but it's deleting?
13:41:40 <wangqwsh> yeah, neo14
13:41:44 <wangqwsh> yes
13:41:55 <wangqwsh> I am trying to delete them.
13:42:21 <wangqwsh> because of a spawning error.
13:42:51 <efried> Okay, the thing esberglu is seeing is because there's a stale marker LU somewhere that's blocking the "seed LPAR" creation.
13:43:13 <esberglu> Yeah. I will go in and delete that marker LU now
13:43:16 <efried> And by "stale" I mean "who knows?"
13:43:26 <adreznec> wangqwsh: I take it they never finished booting?
13:43:27 <efried> partf59362e8image_base_os_2f282b84e7e608d5852449ed940bfc51
13:43:44 <wangqwsh> yes.
13:43:47 <adreznec> Ok
13:43:49 <wangqwsh> never..
13:43:52 <adreznec> Yeah, it sounds like they got blocked
13:44:19 <efried> So people.
13:44:40 <efried> Is it possible for the compute process to die completely without cleaning up?
13:44:56 <efried> Possible of course theoretically - but in the CI environment
13:45:34 <efried> Cause Occam's razor says that's the most likely way we end up in this stale-marker-LU state.
13:45:54 <thorst_> efried: sure - if we hit the timeout limit...it'll leave it around
13:45:57 <thorst_> but we have a cleanup process
13:46:05 <thorst_> now whether or not that cleanup process is cleaning out the markers...
13:46:07 <efried> Right - the 'finally' block will still get executed.
13:46:15 <efried> Yes, 'finally' cleans up the marker LU.
13:46:39 <thorst_> no no...
13:46:50 <thorst_> I mean, we can just straight kill the run from zuul
13:46:55 <thorst_> shut down the VM
13:46:57 <thorst_> mid process
13:47:04 <thorst_> could I guess leave the marker around...
13:47:19 <adreznec> Yeah
13:47:26 <efried> oh, for sure.
13:47:27 <thorst_> I'd think that'd be rare
13:47:28 <adreznec> An edge case, but definitely possible
13:47:49 <efried> Well, I can assure you our cleanup process doesn't delete marker LUs.
13:47:56 <efried> At that level.
13:48:05 <efried> Cause how would we know which ones were safe to delete?
13:48:57 <thorst_> efried: agree.
13:49:20 <efried> But it seems we're seeing this very frequently.
13:49:21 <thorst_> though I think that the CI could have a job that says 'hey...if we have a marker LU that's been around for more than 60 minutes...it's time to delete it'
13:49:30 <thorst_> that would be 100% specific to the CI though
13:49:36 <efried> yeah
13:49:41 <thorst_> because we know that nothing in there should take more than an hour.
13:49:56 <efried> But would that mask it if we had a real bug that causes a real upload hang?
13:50:01 <thorst_> but I'd rather try to find the root cause (I don't think we have yet) before we do that
13:50:06 <thorst_> efried: yep...exactly
13:50:14 <efried> no, we definitely have not identified the root cause.
13:50:24 <thorst_> I wonder if we should add to the CI job something that prints out all the LUs before it runs
13:50:35 <thorst_> before the devstack even...
13:51:02 <esberglu> I can add that in, it would be super easy
13:51:05 <adreznec> You're thinking for debug?
13:51:12 <thorst_> adreznec: yah
13:51:20 <adreznec> Seems like a fair place to start
13:51:21 <thorst_> just to know, was it there before we even started
13:51:31 <efried> And at the end of a run, grep the log for certain messages pertaining to marker LUs.
13:51:33 <adreznec> Is there anything else we'd want to add to that debug info?
13:51:37 <thorst_> and *maybe* to the pypowervm bit
13:51:54 <thorst_> as part of a warning log, when we detect a marker LU...but I thought we had that.
13:52:06 <thorst_> down in pypowervm itself...
13:52:10 <efried> 2016-11-10 02:53:12.099 INFO pypowervm.tasks.cluster_ssp [req-dcce7f01-a1fd-470b-a174-9fc51f9a4a05 admin admin] Waiting for in-progress upload(s) to complete. Marker LU(s): ['partf59362e8image_base_os_2f282b84e7e608d5852449ed940bfc51']
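The pre-run LU dump thorst_ proposes above could be a small script the CI job runs before devstack and again at the end of a run. A minimal sketch, not the team's actual tooling, assuming the pypowervm SSP wrapper exposes .name and .logical_units, LU wrappers expose .name and .lu_type, and marker LUs keep the 'part' name prefix seen in efried's paste:

    # Hypothetical pre-run debug step for the CI job: dump every LU in each
    # SSP and flag anything that looks like a leftover marker LU.
    import pypowervm.adapter as pvm_adpt
    import pypowervm.wrappers.storage as pvm_stg

    # Naming convention observed on the stale marker LU discussed above.
    MARKER_PREFIX = 'part'

    adapter = pvm_adpt.Adapter(pvm_adpt.Session())
    for ssp in pvm_stg.SSP.get(adapter):
        print('SSP %s:' % ssp.name)
        for lu in ssp.logical_units:
            flag = '  <-- possible stale marker LU' if lu.name.startswith(MARKER_PREFIX) else ''
            print('  %s (type=%s)%s' % (lu.name, lu.lu_type, flag))

Diffing that listing against one captured at the end of the previous run would also approximate the "marker LU older than 60 minutes" check thorst_ floats, without changing pypowervm itself.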
13:52:45 <efried> I'm not seeing pypowervm DEBUG turned on in this log, btw, esberglu.
13:52:53 <thorst_> yeah...
13:52:54 <efried> I thought we made that change.
13:53:12 <thorst_> efried: we had it off because of all the gorp. Now that the gorp is gone...did we make that change in the CI yet?
13:53:20 <adreznec> I thought so...
13:53:48 <esberglu> We did. I wonder if I didn't pull down the newest neo-os-ci last redeploy
13:54:06 <efried> Merged on 11/3
13:54:16 <efried> 4420
13:54:27 <thorst_> #action esberglu to figure out where the pypowervm debug logs are
13:54:40 <esberglu> I definitely did not pull down the newest...
13:54:57 <adreznec> That sounds bad
13:55:37 <esberglu> If I go into jenkins and manually kill jobs, would that leave the marker LUs around?
13:55:55 <adreznec> Does zuul delete the VM if the job ends?
13:56:38 <adreznec> Have to step away for a couple minutes, please continue without me
13:56:40 <efried> esberglu, I think the answer is yes
13:56:51 <thorst_> esberglu: well, if it's stuck in an upload
13:57:11 <thorst_> efried: I wonder if we could/should get the VM's name in the marker LU
13:57:13 <efried> Anything that aborts the compute process.
13:57:19 <esberglu> Ahh, then I'm probably to blame. I do that sometimes when I need to redeploy but don't want to wait for all the runs to finish
13:57:38 <thorst_> esberglu: when we redeploy though, I thought we cleaned out the SSP?
13:57:39 <esberglu> Or I just redeploy the management playbook, which also just kills stuff
13:57:47 <esberglu> Yeah, redeploys do
13:57:55 <esberglu> clean out the ssp
13:57:55 <efried> And since an upload takes "a while", you can easily catch a thread that's in the middle of one.
13:58:07 <thorst_> maybe we need to add something to the management playbook to clean out the SSPs.
13:58:13 <thorst_> but efried, thoughts on adding a VM name?
13:58:29 <efried> Right - I had considered it, but it's kinda tough with name length restrictions.
13:58:59 <thorst_> yeah
13:59:02 <efried> I can make it work if we think it's critical.
13:59:12 <thorst_> not sure...let's see how these other actions pan out?
13:59:13 <efried> Though I was going to make it some part of the MTMS
13:59:20 <efried> The VM name doesn't really help us much.
13:59:29 <thorst_> host + lpar id...
13:59:36 <thorst_> that'd be a start
13:59:49 <thorst_> changing it though has implications in the field...though I think this is low-use code at the moment
14:00:20 <efried> But yeah, if esberglu is interrupting runs, that's going to be the culprit most of the time.
14:00:49 <thorst_> esberglu: can we add something in the management playbook that cleans out the compute node markers (or SSPs)?
14:01:01 <thorst_> I mean, honestly, we want to clear out all of the computes
14:01:08 <thorst_> but just not redeploy.
14:01:26 <esberglu> So the computes get cleaned out when the compute playbook is run
14:01:33 <esberglu> But not in the management playbook
14:01:38 <efried> This is something that covers all the nodes sharing a given SSP?
14:01:57 <thorst_> efried: it'd be all nodes in the given cloud
14:02:02 <efried> Cool.
14:02:12 <efried> Are we close to done? I've got another meeting.
14:02:15 <thorst_> esberglu: yeah, it's weird...because it's the management playbook. But I think you're running that just to build new images.
14:02:30 <thorst_> so, I think you want to pull the cleaning of the compute nodes into that...or make a new playbook altogether for it
14:02:30 <esberglu> Yeah exactly
14:02:37 <thorst_> efried: yeah, me too
14:03:58 <esberglu> #action esberglu find a way to clean compute nodes from the management playbook
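The SSP cleanup discussed above could hang off the management playbook (or a new playbook) as a task that runs only when no CI jobs are active. A minimal sketch under assumptions, not the actual playbook change tracked in the #action above: it assumes the pypowervm SSP wrapper's logical_units list can be trimmed and pushed back with update(), and that marker LUs keep the 'part' prefix.

    # Hypothetical cleanup a management-playbook task could invoke between
    # runs: remove leftover marker LUs so an interrupted upload doesn't block
    # the next seed LPAR creation. Only safe when no CI run is in progress.
    import pypowervm.adapter as pvm_adpt
    import pypowervm.wrappers.storage as pvm_stg

    # Naming convention observed on the stale marker LU discussed above.
    MARKER_PREFIX = 'part'

    adapter = pvm_adpt.Adapter(pvm_adpt.Session())
    for ssp in pvm_stg.SSP.get(adapter):
        stale = [lu for lu in ssp.logical_units if lu.name.startswith(MARKER_PREFIX)]
        if not stale:
            continue
        for lu in stale:
            print('Removing stale marker LU %s from SSP %s' % (lu.name, ssp.name))
            ssp.logical_units.remove(lu)
        ssp.update()  # push the trimmed LU list back to the REST server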
14:04:37 <esberglu> The only other thing I had was for wangqwsh
14:04:44 <esberglu> The read-only filesystem was fixed
14:04:57 <thorst_> is that an assertion or a question?
14:05:31 <wangqwsh> esberglu: cool
14:05:38 <esberglu> And I redeployed the staging env with the newest versions of both OSA CI patches and the latest local2remote. I'm still getting stuck trying to build the wheels
14:05:55 <esberglu> It looks like you got past all of the bootstrap stuff and to the tempest part?
14:06:13 <wangqwsh> yes, at the tempest
14:06:32 <wangqwsh> you can use this review:
14:07:03 <wangqwsh> let me find it
14:08:07 <wangqwsh> http://morpheus.rch.stglabs.ibm.com/#/c/4033/
14:08:36 <esberglu> That review is already deployed in the environment
14:08:39 <wangqwsh> I added some variables and scripts for OSA
14:09:13 <esberglu> Yeah, I have the latest version of that deployed
14:10:08 <esberglu> I will send an email with more info about what I am hitting, too much info for IRC
14:10:21 <esberglu> #endmeeting
14:10:23 <adreznec> Ok
14:10:25 <wangqwsh> ok
14:10:31 <adreznec> Sounds like we're done then
14:10:33 <adreznec> Thanks all
14:10:36 <adreznec> #endmeeting