13:30:34 #startmeeting PowerVM CI Meeting
13:30:35 Meeting started Thu Nov 10 13:30:34 2016 UTC and is due to finish in 60 minutes. The chair is adreznec. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:30:36 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:30:38 The meeting name has been set to 'powervm_ci_meeting'
13:30:50 All right. Roll call?
13:30:53 o/
13:31:22 start?
13:31:25 yep
13:31:41 All right, looks like we have enough to get started
13:31:48 #topic Current Status
13:32:11 thorst_ or esberglu, want to kick things off here?
13:32:29 o/
13:32:48 so I think my (and really efried's) contribution is that the fifo_pipo is almost done
13:33:01 I had a working version yesterday (no UT yet), then efried did a rev
13:33:09 I'm not sure if we've validated that rev yet. efried?
13:33:14 No chance it's working now ;-)
13:33:24 I have full faith!
13:33:32 but want proof as well
13:33:33 :-)
13:33:39 Can you try it out with whatever setup you used yesterday thorst_?
13:34:31 efried: yep - can do
13:34:35 coo
13:34:42 AWESOME
13:34:47 Oops
13:34:48 then we'll ask esberglu to take it in the CI while we work UT and whatnot
13:34:55 I loaded the devstack CI with the version Drew put up yesterday
13:34:55 well it is awesome.
13:35:02 oooo
13:35:02 Not quite that awesome, but still good to get it tested
13:35:04 how'd that go?
13:35:25 The tempest runs are still just looping, looking for a matching LU with checksum xxx
13:35:41 http://184.172.12.213/02/396002/1/silent/neutron-pvm-dsvm-tempest-full/3aed2d8/
13:35:48 Right, I was going to look at that this morning.
13:36:00 bah
13:36:16 So it's not even getting to the tempest stuff at this point
13:37:11 uh, is that the right log?
13:37:22 oh, is it the os_ci_tempest.sh that's doing that?
13:37:38 efried: right...we pre-seed the image in the SSP
13:37:39 Yeah
13:37:42 yikes
13:37:44 super early
13:38:13 oh, we've seen this before.
13:38:16 #action thorst to validate efried's rev of remote patch
13:38:41 #action efried to diag/debug "looking for LU with checksum" loop
13:39:03 We have? Hmm, must be forgetting
13:39:10 Ok
13:39:34 #action esberglu to put remote patch in CI when ready
13:39:44 What other status do we have? wangqwsh, I saw an email from you about boot speed issues?
13:39:49 Yay, I have an action this time
13:40:01 yes
13:40:14 not sure the reason
13:40:24 How slow are we talking here?
13:40:26 wangqwsh: how slow of a boot are we talking?
13:40:28 lol
13:40:35 echo echo echo
13:40:47 more than 3 hours
13:41:11 they are still in deleting
13:41:14 whoa.
13:41:18 Wow, that's a lot worse than I expected
13:41:26 Ahh, I bet they hit the marker LU thing
13:41:27 I thought we'd be talking a couple mins here.
13:41:28 Ok, and this is on neo14?
13:41:30 but it's deleting?
13:41:40 yes, neo14
13:41:44 yes
13:41:55 I am trying to delete them.
13:42:21 because of a spawning error.
13:42:51 Okay, the thing esberglu is seeing is because there's a stale marker LU somewhere that's blocking the "seed LPAR" creation.
13:43:13 Yeah. I will go in and delete that marker LU now
13:43:16 And by "stale" I mean "who knows?"
13:43:26 wangqwsh: I take it they never finished booting?
13:43:27 partf59362e8image_base_os_2f282b84e7e608d5852449ed940bfc51
13:43:44 yes.
13:43:47 Ok
13:43:49 never..
13:43:52 Yeah, it sounds like they got blocked
13:44:19 So people.
13:44:40 Is it possible for the compute process to die completely without cleaning up?
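
(Aside for readers: below is a minimal sketch of the kind of manual marker-LU cleanup wangqwsh and esberglu describe above. It assumes a NovaLink-local pypowervm session, a single SSP, and the "part" name prefix seen in the log as the marker-LU convention; treat it as illustrative, not the team's actual tooling.)

    # Illustrative only: find and remove stale marker LUs from the SSP.
    import pypowervm.adapter as pvm_adpt
    import pypowervm.tasks.storage as tsk_stg
    import pypowervm.wrappers.storage as pvm_stg

    adpt = pvm_adpt.Adapter(pvm_adpt.Session())  # local auth on the neo
    ssp = pvm_stg.SSP.get(adpt)[0]               # assumes exactly one SSP

    # Marker LUs are tiny LUs whose names start with 'part'
    # (e.g. partf59362e8image_base_os_... above).
    stale = [lu for lu in ssp.logical_units if lu.name.startswith('part')]
    for lu in stale:
        print('removing marker LU %s' % lu.name)
    if stale:
        tsk_stg.rm_ssp_storage(ssp, stale)

Note that deleting a marker LU out from under a live upload would break that upload's coordination, so this is only safe when no deploys are in flight - which is exactly the question the meeting turns to next.
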
13:44:56 Possible of course theoretically - but in the CI environment
13:45:34 Cause Occam's razor says that's the most likely way we end up in this stale-marker-LU state.
13:45:54 efried: sure - we hit the timeout limit...it'll leave it around
13:45:57 but we have a cleanup process
13:46:05 now whether or not that cleanup process is cleaning out the markers...
13:46:07 Right - the 'finally' block will still get executed.
13:46:15 Yes, 'finally' cleans up the marker LU.
13:46:39 no no...
13:46:50 I mean, we can just straight kill the run from zuul
13:46:55 shut down the VM
13:46:57 mid process
13:47:04 could I guess leave the marker around...
13:47:19 Yeah
13:47:26 oh, for sure.
13:47:27 I'd think that'd be rare
13:47:28 An edge case, but definitely possible
13:47:49 Well, I can assure you our cleanup process doesn't delete marker LUs.
13:47:56 At that level.
13:48:05 Cause how would we know which ones were safe to delete?
13:48:57 efried: agree.
13:49:20 But it seems we're seeing this very frequently.
13:49:21 though I think that the CI could have a job that says 'hey...if we have a marker LU that's been around for more than 60 minutes...it's time to delete it'
13:49:30 that would be 100% specific to the CI though
13:49:36 yeah
13:49:41 because we know that nothing in there should take more than an hour.
13:49:56 But would that mask if we had a real bug that causes a real upload hang?
13:50:01 but I'd rather try to find the root cause (I don't think we have yet) before we do that
13:50:06 efried: yep...exactly
13:50:14 no, we definitely have not identified the root cause.
13:50:24 I wonder if we should add to the CI job something that prints out all the LUs before it runs
13:50:35 before the devstack even...
13:51:02 I can add that in, it would be super easy
13:51:05 You're thinking for debug?
13:51:12 adreznec: yah
13:51:20 Seems like a fair place to start
13:51:21 just to know, was it there before we even started
13:51:31 And at the end of a run, grep the log for certain messages pertaining to marker LUs.
13:51:33 Is there anything else we'd want to add to that debug info?
13:51:37 and *maybe* to the pypowervm bit
13:51:54 as part of a warning log, when we detect a marker LU...but I thought we had that.
13:52:06 down in pypowervm itself...
13:52:10 2016-11-10 02:53:12.099 INFO pypowervm.tasks.cluster_ssp [req-dcce7f01-a1fd-470b-a174-9fc51f9a4a05 admin admin] Waiting for in-progress upload(s) to complete. Marker LU(s): ['partf59362e8image_base_os_2f282b84e7e608d5852449ed940bfc51']
13:52:45 I'm not seeing pypowervm DEBUG turned on in this log, btw, esberglu.
13:52:53 yeah...
13:52:54 I thought we made that change.
13:53:12 efried: we had it off because of all the gorp. Now that gorp is gone...did we make that change in CI yet?
13:53:20 I thought so...
13:53:48 We did. I wonder if I didn't pull down the newest neo-os-ci last redeploy
13:54:06 Merged on 11/3
13:54:16 4420
13:54:27 #action esberglu to figure out where the pypowervm debug logs are
13:54:40 I definitely did not pull down the newest...
13:54:57 That sounds bad
13:55:37 If I go into jenkins and manually kill jobs, would that leave the marker LUs around?
13:55:55 Does zuul delete the VM if the job ends?
13:56:38 Have to step away for a couple minutes, please continue without me
13:56:40 esberglu, I think the answer is yes
13:56:51 esberglu: well, if it's stuck in an upload
13:57:11 efried: I wonder if we could/should get the VM's name in the marker LU
13:57:13 Anything that aborts the compute process.
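
(Aside: a rough sketch of the pre-run LU dump adreznec and esberglu discuss above - printed to stdout so the CI console log captures it, once before devstack starts and again after tempest. The single-session setup and field names are assumptions carried over from the previous sketch.)

    # Illustrative only: dump every LU on every SSP for CI debug.
    import pypowervm.adapter as pvm_adpt
    import pypowervm.wrappers.storage as pvm_stg

    adpt = pvm_adpt.Adapter(pvm_adpt.Session())
    for ssp in pvm_stg.SSP.get(adpt):
        print('SSP %s:' % ssp.name)
        for lu in ssp.logical_units:
            # lu_type distinguishes disk LUs from image (and marker) LUs.
            print('  %-20s %8s GB  %s' % (lu.lu_type, lu.capacity, lu.name))

Diffing the before/after listings, plus efried's suggested grep for the "Waiting for in-progress upload(s) to complete" message, would show whether a stale marker predated the run or was left behind by it.
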
13:57:19 Ahh, then I'm probably to blame. I do that sometimes when I need to redeploy but don't want to wait for all the runs to finish
13:57:38 esberglu: when we redeploy though, I thought we cleaned out the SSP?
13:57:39 Or just redeploy the management playbook, which also just kills stuff
13:57:47 Yeah, redeploys do
13:57:55 clean out the ssp
13:57:55 And since upload takes "a while", you can easily catch a thread that's in the middle of one.
13:58:07 maybe we need to add something to the management playbook to clean out the SSPs.
13:58:13 but efried, thoughts on adding a VM name?
13:58:29 Right - I had considered it, but it's kinda tough with name length restrictions.
13:58:59 yeah
13:59:02 I can make it work if we think it's critical.
13:59:12 not sure...let's see how these other actions pan out?
13:59:13 Though I was going to make it some part of the MTMS
13:59:20 The VM name doesn't really help us much.
13:59:29 host + lpar id...
13:59:36 that'd be a start
13:59:49 changing it though has implications in the field...though I think this is low-use code at the moment
14:00:20 But yeah, if esberglu is interrupting runs, that's going to be the culprit most of the time.
14:00:49 esberglu: can we add something in the management playbook that cleans out the compute node markers (or ssps)?
14:01:01 I mean, honestly, we want to clear out all of the computes
14:01:08 but just not redeploy.
14:01:26 So the computes get cleaned out when the compute playbook is run
14:01:33 But not in the management playbook
14:01:38 This is something that covers all the nodes sharing a given SSP?
14:01:57 efried: it'd be all nodes in the given cloud
14:02:02 Cool.
14:02:12 Are we close to done? I've got another meeting.
14:02:15 esberglu: yeah, it's weird...because it's the management playbook. But I think you're running that just to build new images.
14:02:30 so, I think you want to pull the cleanup of the compute nodes into that...or make a new playbook altogether for it
14:02:30 Yeah exactly
14:02:37 efried: yeah, me too
14:03:58 #action esberglu find a way to clean compute nodes from management playbook
14:04:37 The only other thing I had was for wangqwsh
14:04:44 The read-only filesystem was fixed
14:04:57 is that an assertion or a question?
14:05:31 esberglu: cool
14:05:38 And I redeployed the staging env with the newest versions of both OSA CI patches and the latest local2remote. I'm still getting stuck trying to build the wheels
14:05:55 It looks like you got past all of the bootstrap stuff and to the tempest part?
14:06:13 yes, at the tempest
14:06:32 you can use this review:
14:07:03 let me find it
14:08:07 http://morpheus.rch.stglabs.ibm.com/#/c/4033/
14:08:36 That review is already deployed in the environment
14:08:39 I added some variables and scripts for osa
14:09:13 Yeah, I have the latest version of that deployed
14:10:08 I will send an email with more info about what I am hitting, too much info for irc
14:10:21 #endmeeting
14:10:23 Ok
14:10:25 ok
14:10:31 Sounds like we're done then
14:10:33 Thanks all
14:10:36 #endmeeting
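
(Aside: the 14:03:58 cleanup action might bottom out in something like the sketch below, run from a new standalone playbook against one node per SSP while no deploys are active. Keeping image-type LUs preserves the pre-seeded image the tempest runs rely on; whether every non-image LU plus every 'part'-prefixed marker is safe to drop is an assumption the team would need to confirm, and LUType/rm_ssp_storage usage is carried over from the earlier sketches.)

    # Illustrative only: clear deploy leftovers from the SSPs without a
    # full redeploy. Keeps image LUs except 'part'-prefixed markers.
    import pypowervm.adapter as pvm_adpt
    import pypowervm.tasks.storage as tsk_stg
    import pypowervm.wrappers.storage as pvm_stg

    adpt = pvm_adpt.Adapter(pvm_adpt.Session())
    for ssp in pvm_stg.SSP.get(adpt):
        doomed = [lu for lu in ssp.logical_units
                  if (lu.lu_type != pvm_stg.LUType.IMAGE
                      or lu.name.startswith('part'))]
        if doomed:
            tsk_stg.rm_ssp_storage(ssp, doomed)

A hypothetical playbook task could ship and run this via Ansible's script module, keeping the compute playbook's full redeploy path untouched.
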