18:06:04 <alazarev> #startmeeting sahara
18:06:05 <openstack> Meeting started Thu Feb 19 18:06:04 2015 UTC and is due to finish in 60 minutes.  The chair is alazarev. Information about MeetBot at http://wiki.debian.org/MeetBot.
18:06:06 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
18:06:09 <openstack> The meeting name has been set to 'sahara'
18:06:19 <tmckay> now we
18:06:28 <alazarev> I was searching how to start it :)
18:06:31 <tmckay> we're cooking with gas, I was going to say
18:06:42 <tmckay> yeah, I forget :)
18:06:54 <alazarev> #topic sahara@horizon status (crobertsrh, NikitaKonovalov)
18:07:15 <crobertsrh> Reviews are still ongoing.
18:07:23 <NikitaKonovalov> ok, so we've a got some progress
18:07:26 <crobertsrh> Quite a few things have merged, which is good.
18:07:39 <NikitaKonovalov> patches are getting review
18:07:42 <NikitaKonovalov> sometimes
18:08:09 <NikitaKonovalov> I've also started working on event logs in UI
18:08:42 <alazarev> my patches are still on review
18:08:45 <NikitaKonovalov> so it will also be on review soon
18:09:16 <alazarev> thanks, Nikita
18:09:17 <alazarev> #topic News / updates
18:09:24 <crobertsrh> The edit node group/cluster template UI reviews will be back to workflow 0 soon.  I'm working on api side of those now.  Hopefully, they will be done soon.
18:09:55 <tmckay> I have put up a Shell action spec, egafford will take it.  Hoping cdh intel guys will take the Hbase lib one
18:10:00 <egafford> 1) TripleO integration is deprioritized a bit (until TripleO team has time to help address the integration.)
18:10:00 <egafford> 2) Working on Oozie shell job implementation as of this AM.
18:10:05 <tmckay> And, I'm putting up a spec for job-types endpoint today
18:10:19 <egafford> tmckay: Quite.
18:10:22 <elmiko> security doc is up for review, making good progress but it needs a few more eyes on it. also, just posted a spec for barbican integration, and working on testing secure mode hadoop with freeipa.
18:10:48 <alazarev> I'm working on sahara heat stuff, finished migration to HOT and realized that this shouldn't be the first patch since code looks ugly with the current approach
18:11:01 <alazarev> will be a series of patches before migration
18:11:34 <tosky> still working on having more generic (== working without vanilla) tempest API tests
18:11:35 <sreshetnyak> i'm working on new integration tests and migration to openjdk
18:12:19 <tmckay> tosky, working without vanilla?
18:12:42 <alazarev> egafford, do tripleo staff changes influence?
18:12:46 <tosky> tmckay: if the cluster has no vanilla support, tempest API tests currently fail
18:12:54 <tmckay> ah
18:12:57 <egafford> tmckay: Tests that work with any subset of plugins enabled.
18:13:04 <tmckay> gotcha
18:13:31 <egafford> alazarev: That's a really good question, and I don't have a full answer yet.
18:14:22 <tmckay> mattf, meeting is here, there is a pause :)
18:15:10 <alazarev> #topic Open discussion
18:15:28 <alazarev> do we have topics to discuss today?
18:15:38 <tmckay> note, after the job-types endpoint I'll be helping crobertsrh with default templates
18:15:45 <crobertsrh> Any Summit talk proposals?  I know that elmiko has a security related one.
18:15:53 <tmckay> job-types should be short, not too hard
18:16:02 <elmiko> i'd still like to ask folks to take a look at the security doc
18:16:08 <elmiko> #link https://review.openstack.org/#/c/155052/
18:16:10 <tmckay> oh, yes, read the security doc if you haven't already
18:16:18 <tmckay> jinx
18:16:31 <elmiko> and yea, what crobertsrh said
18:16:36 <elmiko> please vote for https://www.openstack.org/vote-vancouver/Presentation/secure-data-processing-building-a-hardened-sahara-installation
18:16:39 <elmiko> =)
18:17:17 <egafford> I have a few. First, talked to a colleague about the sahara-stable-maint additions, and he said that now that slukjanov's notified stable-maint-core, we should be okay to add the new team members independently (which would be good; slukjanov's alone atm.)
18:17:29 <elmiko> nice
18:17:51 <egafford> Second, re: integration tests: I don't believe that we currently have automatic retries on individual failed CI tests. Should we?
18:18:00 <alazarev> also vote for https://www.openstack.org/vote-vancouver/presentation/baremetal-and-hybrid-big-data-clusters-on-openstack-using-sahara :)
18:18:10 <mattf> tmckay, i pulled a tmckay
18:18:32 <tmckay> I need to change my name
18:18:34 <elmiko> i'm curious about egafford's question as well, does anyone know?
18:18:56 <egafford> At present we do a lot of sahara-ci-recheck to try to get the full set in one go, which could be avoided much more frequently by setting up, say, 2 automatic rechecks of only failing CI tests. It'd free up the cluster a lot.
18:18:56 <tmckay> elmiko, we've talked about it, but I don't believe we've ever implemented
18:19:18 <tmckay> similar case, aignatov brought up retires on provisioning steps at Summit, I believe
18:19:42 <elmiko> ok, i thought there was some sort of built-in retry on the ci gate, is that not true?
18:19:47 <egafford> Are there any philosophical objections to a small number of automatic retries? There are valid objections, certainly, but if we just do it manually anyway it only slows us down.
18:20:03 <tmckay> so, potentially two places to add retry.  With the benefit of retries in provisioning going to the production user, too
18:20:20 <elmiko> no objection from me for adding a retry on fail, but we'd need some sort of limit
18:20:48 <tmckay> it would almost have to be context-sensitive
18:21:02 <egafford> elmiko: Of course. 3 failures = fail seems pretty standard and reasonable for this sort of thing.
18:21:03 <tmckay> you would want to retry cases where for instance the cluster doesn't go "Active"
18:21:11 <tmckay> for an unspecified reason
18:21:36 <tmckay> but that means log parsing instead of just a simple retry
18:21:43 <elmiko> right but we can't start duplicating the zuul infrastructure. we need to be more dumb about the retries
18:22:03 <tmckay> simple retry because of an exception, "None type object has no attribute blah" would be a waste of time
18:22:11 <egafford> elmiko: +1 to dumbness (this line is pretty great out of context)
18:22:18 <elmiko> lol
18:22:34 <tmckay> I'm on the fence
18:22:35 <alazarev> I see 4 places where retry cloud be added: 1. nova, 2. heat (they are working on that), 3. sahara (we have several), 4. CI
18:22:41 <egafford> tmckay: Sure, it would be a slight waste of time, but compared with rerunning the whole suite multiple times?
18:22:56 <elmiko> i'm just saying that if we are going to talk about the conditions under which we make retries, and we start to code rules, we might as well help make the zuul infra better by filling out tickets there
18:23:14 <tmckay> egafford, you're talking about a retry in the integration test itself, in the wrapper, not in the gate?
18:23:24 <alazarev> for production use it would be nice to have retries on 1-3, but I agree that we are making 4 manually
18:23:44 <egafford> tmckay: The case that brought this up was the stable/juno 1-line manifest addition change, which took 5 rechecks to succeed despite an effective non-change to the CI tests.
18:24:00 <elmiko> alazarev: yea, i think the intent is to take away some of the manual retry on #4
18:24:13 <egafford> alazarev, tmckay: Right, and we're retrying the entire suite 4 times, rather than tactically retrying only the problem tests.
18:24:43 <tmckay> okay, so the proposal is to inject "test til pass, limit X" loops in the integration tests
18:24:57 <tmckay> if a test passes once, we call it success
18:25:18 <tmckay> so maybe I spin the hdp cluster 3 times, but it passes the 3rd time, so great
18:25:25 <tmckay> PASS
18:25:28 <egafford> tmckay: On a per-test basis, yes. There absolutely are problems with this approach (frequently covers random failure.)
18:26:10 <tmckay> in that case, we could be a little choosy
18:26:17 <egafford> tmckay: But, if we just hit the button anyway until retry 4 or 5, we're not doing ourselves any favors by forcing all the tests to rerun, and forcing manual intervention at each step.
18:26:33 <tmckay> I think we can tell when a test fails because a cluster never goes Active or an EDP job never completes
18:26:44 <tmckay> egafford, ack
18:26:52 <alazarev> it will make us think that success rate 1 to 3 is a good enough thing... which is not I believe
18:27:22 <tmckay> egafford, initially I thought we were talking about changes to the scripting around the Saraha ci tests, not the test.py code.
18:27:45 <elmiko> alazarev: good point, we might end up hiding a bug that occurs infrequently
18:27:55 <tmckay> alazarev, but we don't pay attention to that now
18:28:09 <tmckay> we just assume that some odd thing in openstack failed and recheck
18:28:22 <alazarev> tmckay, because failures are too often
18:28:28 <elmiko> well, we should be checking the logs though. you are checking the log before recheck aren't you .... ;)
18:28:42 <alazarev> elmiko, lol :)
18:28:53 <egafford> alazarev, elmiko: Right, that's the problem. This kind of approach explicitly has problems with random failure, BUT if we don't address it because we can't trust that failure means failure (which we don't at present)...
18:29:17 <tmckay> elmiko, actually, a lot of the time, yes, but if I can't find something explicit very fast I ignore
18:30:00 <egafford> There are really good arguments in both directions, but it seems worth discussion. I lean on the side of automatic retries, or a very aggressive decision to address the current inconsistency in CI tests.
18:30:49 <alazarev> do we believe that one day openstack will be stable enough to have at least 95% of passed CI suites?
18:30:50 <tmckay> I've brought this up before.  I think the only way to answer both questions is to find places for retry in Sahara itself
18:31:09 <alazarev> if not - I'm +1 on auto recheck
18:31:25 <egafford> alazarev: Fully agreed.
18:31:31 <tmckay> Did a provisioning step timeout?  Okay, try again.  Still?  try again, longer timeout ....
18:31:43 <elmiko> i'm ok with us implementing recheck, but i wonder if we shouldn't also be more active in creating bugs for zuul and the elastic-recheck?
18:31:51 <tmckay> or like aignatov suggested, allow partial cluster without "Error"
18:32:45 <alazarev> tmckay, we need such feature from heat first
18:33:11 <alazarev> tmckay, they are working on it, but I don't sure it well be ready in kilo
18:33:42 <tmckay> alazarev, ack.  I'm okay with auto-recheck features in integration tests, but a small number.  2 or 3, no more.
18:33:52 <tmckay> After that a human should take a look
18:34:00 <elmiko> #link http://docs.openstack.org/infra/elastic-recheck/readme.html#adding-bug-signatures
18:34:08 <elmiko> i think we should also me contributing more to ^^
18:34:15 <tmckay> elmiko, good suggestion
18:34:18 <elmiko> s/me/be
18:34:28 <egafford> tmckay: +1. More than 2 rechecks (3 total runs) is getting on into negligence.
18:34:48 <tmckay> agreed, yes, I meant 3 total
18:35:31 <alazarev> and we need https://review.openstack.org/#/c/142632/
18:35:41 <alazarev> put +1 please
18:35:46 <egafford> tmckay: From what I've seen at other places with very asynchronous, large-scale, difficult tests (like ours) 3 total is pretty standard.
18:36:02 <crobertsrh> +1 to elmiko's suggestion
18:36:15 <tmckay> alazarev, will do
18:36:27 <elmiko> i just feel like we are reimplementing something that already has a huge effort behind it
18:36:41 <elmiko> granted, it's a pain to keep kicking the recheck machine
18:37:35 <tmckay> elmiko, that's partly why I like making Sahara itself more robust.  I've seen weird error occasionally where ssh can
18:37:46 <tmckay> can't read "the banner".  Huh?  Try again
18:37:53 <tmckay> I should document that the next time it happens
18:37:59 <elmiko> tmckay: +1
18:38:07 <tmckay> some ssh burp shouldn't derail the cluster.
18:38:44 <alazarev> tmckay, I saw this too
18:39:03 <tosky> are we still in the News/update part of the meeting? I have a question about diskimage-builder (future) version
18:39:15 <elmiko> tosky: open discussion, go for it !
18:39:28 <tmckay> makes me wonder why I haven't addressed it before.  In the middle of chasing something else, I suppose. But next, I vow.
18:40:11 <tosky> the cut of py2.6 compatibility could impact the usage of newer versions of diskimage-builder; for example, see this new module:
18:40:12 <tosky> https://github.com/openstack/diskimage-builder/blob/master/elements/package-installs/bin/package-installs-v2
18:40:21 <elmiko> tmckay: probably because it's a really difficult non-deterministic problem ;)
18:40:47 <tmckay> elmiko, that might be it.  Or it always happens on Friday ;-)
18:40:49 <tosky> which uses new subprocess methods not available in py2.6 - which could be a problem if you want to generate centos 6 image on centos 6 with a _newer_ dib
18:41:17 <tosky> this is _not_ a problem now, as we use an older dib tag (0.1.29) but I wanted to ask what are the plans to handle that
18:41:23 <tosky> or at least raise the issue
18:41:37 <elmiko> tosky: good issue to raise
18:42:08 <elmiko> we could always add something to check for python version and grab the last compatible version of dib from that?
18:42:15 <elmiko> while throwing a warning up for the user
18:42:31 <alazarev> tosky, all openstack doesn't support py26 now, diskimagebuilder is not an exception, so it was predicted
18:43:19 <tosky> alazarev: sure it was, but then what to do? Distributions (RDO, but I bet also on Ubuntu) are going to ship newer dib packages, I suspect there will be pressure to use a newer version
18:43:38 <tosky> or we can push HWX and CDH people to switch to newer distributions :D
18:43:56 <egafford> tosky: +1
18:44:24 <tosky> alazarev: I think I already asked: how do you generated the centos/hdp images on the sahara site? From Ubuntu or from CentOS?
18:45:10 <tmckay> tosky, I don't see a way around this except for distros like RDO to package older dibs.  Or fork and fix newer versions.
18:45:51 <tosky> tmckay: the first one is quite difficult (unless you package *also* the old version as a secondary package)
18:46:02 <alazarev> tosky, I don't know, it is better to ask deploy guys
18:46:02 <tosky> not going to fly on Ubuntu or Debian, I suspect
18:46:05 <elmiko> i'm +1 for asking HWX and CDH to think about switch to newer distros, but that's a long uphill climb
18:46:18 <tmckay> OpenStack probably will move faster than centos.  centos is tried and true, openstack is flashy and new every 6 months.  Different mandate
18:46:53 <tosky> well, at least after moving to centos7, we will have few years of peace
18:47:00 <elmiko> hehe
18:47:07 <tmckay> that is true
18:47:27 <sreshetnyak> tosky i generate CentOS images from Ubuntu
18:47:41 <clarkb> sreshetnyak: so does infra (though not with sahara's elements)
18:47:58 <tosky> sreshetnyak: oh, I see, so it wouldn't be an issue for you when a new dib will be used
18:48:31 <tosky> clarkb: openstack-infra? What do they use, if not sahara sahara's elements?
18:48:44 <clarkb> tosky: we have our own elements for our nodes
18:49:20 <clarkb> building in the other direction is where things get tricky, using an old build host to build images for new OSes
18:49:33 <tosky> clarkb: is there some specific reason for not using the ones in sahara-image-elements? I'm curious
18:49:33 <clarkb> but using new OS as build host to build images for old OSes has worked ok so far
18:50:06 <clarkb> tosky: we aren't doing anything with sahara
18:50:11 <clarkb> I am speaking generically to dib as a tool
18:50:16 <tosky> clarkb: oh, I see, sorry
18:50:50 <tosky> well, I've hit some issues building centos6 from centos7, that's why I was concerned, but I will recheck too
18:52:55 <alazarev> anything else to discuss?
18:53:06 <tmckay> nothing from me
18:53:07 <crobertsrh> nothing from me
18:53:18 <tosky> final question about this: did you plan to bump dib version at some point in the near future?
18:53:28 <egafford> Nothing else; thanks.
18:55:01 <alazarev> tosky, as I know, there are no such plans
18:55:11 <alazarev> #endmeeting