18:06:04 #startmeeting sahara
18:06:05 Meeting started Thu Feb 19 18:06:04 2015 UTC and is due to finish in 60 minutes. The chair is alazarev. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:06:06 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:06:09 The meeting name has been set to 'sahara'
18:06:19 now we
18:06:28 I was searching how to start it :)
18:06:31 we're cooking with gas, I was going to say
18:06:42 yeah, I forget :)
18:06:54 #topic sahara@horizon status (crobertsrh, NikitaKonovalov)
18:07:15 Reviews are still ongoing.
18:07:23 ok, so we've got some progress
18:07:26 Quite a few things have merged, which is good.
18:07:39 patches are getting reviews
18:07:42 sometimes
18:08:09 I've also started working on event logs in UI
18:08:42 my patches are still on review
18:08:45 so it will also be on review soon
18:09:16 thanks, Nikita
18:09:17 #topic News / updates
18:09:24 The edit node group/cluster template UI reviews will be back to workflow 0 soon. I'm working on the api side of those now. Hopefully, they will be done soon.
18:09:55 I have put up a Shell action spec, egafford will take it. Hoping the cdh intel guys will take the Hbase lib one
18:10:00 1) TripleO integration is deprioritized a bit (until the TripleO team has time to help address the integration.)
18:10:00 2) Working on Oozie shell job implementation as of this AM.
18:10:05 And, I'm putting up a spec for the job-types endpoint today
18:10:19 tmckay: Quite.
18:10:22 security doc is up for review, making good progress but it needs a few more eyes on it. also, just posted a spec for barbican integration, and working on testing secure mode hadoop with freeipa.
18:10:48 I'm working on sahara heat stuff, finished migration to HOT and realized that this shouldn't be the first patch since the code looks ugly with the current approach
18:11:01 there will be a series of patches before migration
18:11:34 still working on having more generic (== working without vanilla) tempest API tests
18:11:35 i'm working on new integration tests and migration to openjdk
18:12:19 tosky, working without vanilla?
18:12:42 egafford, do tripleo staff changes influence?
18:12:46 tmckay: if the cluster has no vanilla support, tempest API tests currently fail
18:12:54 ah
18:12:57 tmckay: Tests that work with any subset of plugins enabled.
18:13:04 gotcha
18:13:31 alazarev: That's a really good question, and I don't have a full answer yet.
18:14:22 mattf, meeting is here, there is a pause :)
18:15:10 #topic Open discussion
18:15:28 do we have topics to discuss today?
18:15:38 note, after the job-types endpoint I'll be helping crobertsrh with default templates
18:15:45 Any Summit talk proposals? I know that elmiko has a security related one.
18:15:53 job-types should be short, not too hard
18:16:02 i'd still like to ask folks to take a look at the security doc
18:16:08 #link https://review.openstack.org/#/c/155052/
18:16:10 oh, yes, read the security doc if you haven't already
18:16:18 jinx
18:16:31 and yea, what crobertsrh said
18:16:36 please vote for https://www.openstack.org/vote-vancouver/Presentation/secure-data-processing-building-a-hardened-sahara-installation
18:16:39 =)
18:17:17 I have a few. First, talked to a colleague about the sahara-stable-maint additions, and he said that now that slukjanov's notified stable-maint-core, we should be okay to add the new team members independently (which would be good; slukjanov's alone atm.)
18:17:29 nice
18:17:51 Second, re: integration tests: I don't believe that we currently have automatic retries on individual failed CI tests. Should we?
18:18:00 also vote for https://www.openstack.org/vote-vancouver/presentation/baremetal-and-hybrid-big-data-clusters-on-openstack-using-sahara :)
18:18:10 tmckay, i pulled a tmckay
18:18:32 I need to change my name
18:18:34 i'm curious about egafford's question as well, does anyone know?
18:18:56 At present we do a lot of sahara-ci-recheck to try to get the full set in one go, which could be avoided much more frequently by setting up, say, 2 automatic rechecks of only failing CI tests. It'd free up the cluster a lot.
18:18:56 elmiko, we've talked about it, but I don't believe we've ever implemented it
18:19:18 similar case, aignatov brought up retries on provisioning steps at Summit, I believe
18:19:42 ok, i thought there was some sort of built-in retry on the ci gate, is that not true?
18:19:47 Are there any philosophical objections to a small number of automatic retries? There are valid objections, certainly, but if we just do it manually anyway it only slows us down.
18:20:03 so, potentially two places to add retry. With the benefit of retries in provisioning going to the production user, too
18:20:20 no objection from me for adding a retry on fail, but we'd need some sort of limit
18:20:48 it would almost have to be context-sensitive
18:21:02 elmiko: Of course. 3 failures = fail seems pretty standard and reasonable for this sort of thing.
18:21:03 you would want to retry cases where for instance the cluster doesn't go "Active"
18:21:11 for an unspecified reason
18:21:36 but that means log parsing instead of just a simple retry
18:21:43 right but we can't start duplicating the zuul infrastructure. we need to be more dumb about the retries
18:22:03 simple retry because of an exception, "None type object has no attribute blah" would be a waste of time
18:22:11 elmiko: +1 to dumbness (this line is pretty great out of context)
18:22:18 lol
18:22:34 I'm on the fence
18:22:35 I see 4 places where retry could be added: 1. nova, 2. heat (they are working on that), 3. sahara (we have several), 4. CI
18:22:41 tmckay: Sure, it would be a slight waste of time, but compared with rerunning the whole suite multiple times?
18:22:56 i'm just saying that if we are going to talk about the conditions under which we make retries, and we start to code rules, we might as well help make the zuul infra better by filing tickets there
18:23:14 egafford, you're talking about a retry in the integration test itself, in the wrapper, not in the gate?
18:23:24 for production use it would be nice to have retries on 1-3, but I agree that we are doing 4 manually
18:23:44 tmckay: The case that brought this up was the stable/juno 1-line manifest addition change, which took 5 rechecks to succeed despite an effective non-change to the CI tests.
18:24:00 alazarev: yea, i think the intent is to take away some of the manual retry on #4
18:24:13 alazarev, tmckay: Right, and we're retrying the entire suite 4 times, rather than tactically retrying only the problem tests.
18:24:43 okay, so the proposal is to inject "test til pass, limit X" loops in the integration tests
18:24:57 if a test passes once, we call it success
18:25:18 so maybe I spin the hdp cluster 3 times, but it passes the 3rd time, so great
18:25:25 PASS
18:25:28 tmckay: On a per-test basis, yes. There absolutely are problems with this approach (frequently covers random failure.)
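
(A rough sketch of the "test til pass, limit X" loop proposed above, assuming a per-test wrapper inside the integration test code rather than the gate. The helper and its names are hypothetical, not actual sahara CI code; the 3-attempt limit follows the "3 total runs" figure the team settles on later in the log.)

import logging

LOG = logging.getLogger(__name__)

MAX_ATTEMPTS = 3  # 1 initial run + 2 automatic rechecks ("3 total runs")


def run_with_retries(test_func, *args, **kwargs):
    """Run one integration test; call it a success if any attempt passes."""
    last_error = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return test_func(*args, **kwargs)
        except Exception as exc:  # deliberately "dumb": no log parsing or rules
            last_error = exc
            LOG.warning("attempt %d/%d of %s failed: %s",
                        attempt, MAX_ATTEMPTS, test_func.__name__, exc)
    raise last_error  # after 3 total failures, a human should take a look
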
18:26:10 in that case, we could be a little choosy
18:26:17 tmckay: But, if we just hit the button anyway until retry 4 or 5, we're not doing ourselves any favors by forcing all the tests to rerun, and forcing manual intervention at each step.
18:26:33 I think we can tell when a test fails because a cluster never goes Active or an EDP job never completes
18:26:44 egafford, ack
18:26:52 it will make us think that a success rate of 1 in 3 is a good enough thing... which it is not, I believe
18:27:22 egafford, initially I thought we were talking about changes to the scripting around the Sahara ci tests, not the test.py code.
18:27:45 alazarev: good point, we might end up hiding a bug that occurs infrequently
18:27:55 alazarev, but we don't pay attention to that now
18:28:09 we just assume that some odd thing in openstack failed and recheck
18:28:22 tmckay, because failures are too frequent
18:28:28 well, we should be checking the logs though. you are checking the log before recheck aren't you .... ;)
18:28:42 elmiko, lol :)
18:28:53 alazarev, elmiko: Right, that's the problem. This kind of approach explicitly has problems with random failure, BUT if we don't address it because we can't trust that failure means failure (which we don't at present)...
18:29:17 elmiko, actually, a lot of the time, yes, but if I can't find something explicit very fast I ignore it
18:30:00 There are really good arguments in both directions, but it seems worth discussion. I lean on the side of automatic retries, or a very aggressive decision to address the current inconsistency in CI tests.
18:30:49 do we believe that one day openstack will be stable enough to have at least 95% of CI suites pass?
18:30:50 I've brought this up before. I think the only way to answer both questions is to find places for retry in Sahara itself
18:31:09 if not - I'm +1 on auto recheck
18:31:25 alazarev: Fully agreed.
18:31:31 Did a provisioning step timeout? Okay, try again. Still? try again, longer timeout ....
18:31:43 i'm ok with us implementing recheck, but i wonder if we shouldn't also be more active in creating bugs for zuul and elastic-recheck?
18:31:51 or like aignatov suggested, allow a partial cluster without "Error"
18:32:45 tmckay, we need such a feature from heat first
18:33:11 tmckay, they are working on it, but I'm not sure it will be ready in kilo
18:33:42 alazarev, ack. I'm okay with auto-recheck features in integration tests, but a small number. 2 or 3, no more.
18:33:52 After that a human should take a look
18:34:00 #link http://docs.openstack.org/infra/elastic-recheck/readme.html#adding-bug-signatures
18:34:08 i think we should also me contributing more to ^^
18:34:15 elmiko, good suggestion
18:34:18 s/me/be
18:34:28 tmckay: +1. More than 2 rechecks (3 total runs) is getting on into negligence.
18:34:48 agreed, yes, I meant 3 total
18:35:31 and we need https://review.openstack.org/#/c/142632/
18:35:41 put +1 please
18:35:46 tmckay: From what I've seen at other places with very asynchronous, large-scale, difficult tests (like ours) 3 total is pretty standard.
18:36:02 +1 to elmiko's suggestion
18:36:15 alazarev, will do
18:36:27 i just feel like we are reimplementing something that already has a huge effort behind it
18:36:41 granted, it's a pain to keep kicking the recheck machine
18:37:35 elmiko, that's partly why I like making Sahara itself more robust. I've seen a weird error occasionally where ssh can
18:37:46 can't read "the banner". Huh? Try again
18:37:53 I should document that the next time it happens
18:37:59 tmckay: +1
18:38:07 some ssh burp shouldn't derail the cluster.
18:38:44 tmckay, I saw this too
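
(A similarly rough sketch of the "try again, longer timeout" idea for making Sahara itself more robust, e.g. around an ssh step that occasionally fails to read the banner. The helper, exception types, and timeout values are hypothetical and not taken from the sahara code base.)

import socket
import time


def run_step_with_longer_timeouts(step, timeout=60, retries=2, backoff=2):
    """Run a provisioning step, retrying transient failures with a bigger timeout."""
    for attempt in range(retries + 1):
        try:
            return step(timeout=timeout)
        except (socket.timeout, IOError):
            if attempt == retries:
                raise  # only let the cluster go to Error after real persistence
            timeout *= backoff  # still failing? try again with a longer timeout
            time.sleep(5)       # brief pause so a transient ssh burp can clear
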
18:39:03 are we still in the News/update part of the meeting? I have a question about the diskimage-builder (future) version
18:39:15 tosky: open discussion, go for it !
18:39:28 makes me wonder why I haven't addressed it before. In the middle of chasing something else, I suppose. But next, I vow.
18:40:11 the cut of py2.6 compatibility could impact the usage of newer versions of diskimage-builder; for example, see this new module:
18:40:12 https://github.com/openstack/diskimage-builder/blob/master/elements/package-installs/bin/package-installs-v2
18:40:21 tmckay: probably because it's a really difficult non-deterministic problem ;)
18:40:47 elmiko, that might be it. Or it always happens on Friday ;-)
18:40:49 which uses new subprocess methods not available in py2.6 - which could be a problem if you want to generate a centos 6 image on centos 6 with a _newer_ dib
18:41:17 this is _not_ a problem now, as we use an older dib tag (0.1.29) but I wanted to ask what the plans are to handle that
18:41:23 or at least raise the issue
18:41:37 tosky: good issue to raise
18:42:08 we could always add something to check for the python version and grab the last compatible version of dib from that?
18:42:15 while throwing a warning up for the user
18:42:31 tosky, all of openstack doesn't support py26 now, diskimagebuilder is not an exception, so it was predicted
18:43:19 alazarev: sure it was, but then what to do? Distributions (RDO, but I bet also Ubuntu) are going to ship newer dib packages, I suspect there will be pressure to use a newer version
18:43:38 or we can push HWX and CDH people to switch to newer distributions :D
18:43:56 tosky: +1
18:44:24 alazarev: I think I already asked: how do you generate the centos/hdp images on the sahara site? From Ubuntu or from CentOS?
18:45:10 tosky, I don't see a way around this except for distros like RDO to package older dibs. Or fork and fix newer versions.
18:45:51 tmckay: the first one is quite difficult (unless you package *also* the old version as a secondary package)
18:46:02 tosky, I don't know, it is better to ask the deploy guys
18:46:02 not going to fly on Ubuntu or Debian, I suspect
18:46:05 i'm +1 for asking HWX and CDH to think about switching to newer distros, but that's a long uphill climb
18:46:18 OpenStack probably will move faster than centos. centos is tried and true, openstack is flashy and new every 6 months. Different mandate
18:46:53 well, at least after moving to centos7, we will have a few years of peace
18:47:00 hehe
18:47:07 that is true
18:47:27 tosky i generate CentOS images from Ubuntu
18:47:41 sreshetnyak: so does infra (though not with sahara's elements)
18:47:58 sreshetnyak: oh, I see, so it wouldn't be an issue for you when a new dib is used
18:48:31 clarkb: openstack-infra? What do they use, if not sahara's elements?
18:48:44 tosky: we have our own elements for our nodes
18:49:20 building in the other direction is where things get tricky, using an old build host to build images for new OSes
18:49:33 clarkb: is there some specific reason for not using the ones in sahara-image-elements? I'm curious
18:49:33 but using a new OS as build host to build images for old OSes has worked ok so far
18:50:06 tosky: we aren't doing anything with sahara
18:50:11 I am speaking generically about dib as a tool
18:50:16 clarkb: oh, I see, sorry
18:50:50 well, I've hit some issues building centos6 from centos7, that's why I was concerned, but I will recheck too
18:52:55 anything else to discuss?
18:53:06 nothing from me
18:53:07 nothing from me
18:53:18 final question about this: did you plan to bump the dib version at some point in the near future?
18:53:28 Nothing else; thanks.
18:55:01 tosky, as far as I know, there are no such plans
18:55:11 #endmeeting
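
(A minimal sketch of the python-version check suggested at 18:42:08 for the diskimage-builder py2.6 concern. The install helper and pip invocation are hypothetical, and 0.1.29 is simply the last-known-compatible tag mentioned in the log.)

import subprocess
import sys
import warnings

LAST_PY26_COMPATIBLE_DIB = "0.1.29"  # older dib tag referenced in the discussion


def install_diskimage_builder():
    """Install dib, pinning an older tag (with a warning) on python < 2.7."""
    if sys.version_info < (2, 7):
        warnings.warn("python 2.6 detected; pinning diskimage-builder to %s, "
                      "since newer releases use subprocess methods not "
                      "available in py2.6" % LAST_PY26_COMPATIBLE_DIB)
        requirement = "diskimage-builder==%s" % LAST_PY26_COMPATIBLE_DIB
    else:
        requirement = "diskimage-builder"
    subprocess.check_call([sys.executable, "-m", "pip", "install", requirement])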