19:01:38 <mtaylor> #startmeeting
19:01:39 <openstack> Meeting started Tue Jul  3 19:01:38 2012 UTC.  The chair is mtaylor. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:40 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:43 <clarkb> ohai
19:02:13 <mtaylor> anybody around want to talk about so you think you can dance?
19:02:15 <mtaylor> oh
19:02:17 <mtaylor> I mean
19:02:19 <mtaylor> openstack CI stuff?
19:02:32 <jeblair> o/
19:03:14 <mtaylor> neat
19:03:32 <mtaylor> so - jeblair I believe wanted to talk about jenkinsy failure and retrigger stuff, yeah
19:03:51 <jeblair> yep.
19:04:11 <jeblair> my main concern is global dependency list and how that relates to getting the pypi mirror stable
19:04:31 <jeblair> i see there's a mailing list thread, which unfortunately has some confusion associated with it.
19:04:46 <jeblair> i certainly haven't seen a viable suggestion other than your initial one.
19:05:09 <jeblair> would it be productive to talk about that here (perhaps summoning some people), or should we leave that on the ML for now?
19:05:13 <mtaylor> no. and I intend, for the record, to ignore all of the irrelevant things that have been said so far
19:05:36 <mtaylor> the ML thread is supposed to be informative, and then to ask an opinion on the name "openstack-requires"
19:05:52 <mtaylor> the one counter suggestion I've heard is "openstack-packaging" - which I don't REALLY like
19:06:21 <jeblair> yeah, i don't see a justification for that.  i might say openstack-requirements but it's close enough.
19:06:29 <mtaylor> although I do think we could certainly put in a dpkg-selections file and an rpm list so that devstack could consume the current state of depends
19:06:30 <jeblair> (or openstack-dependencies)
19:06:43 <jeblair> indeed.
19:06:45 <mtaylor> I have to think too much to type dependencies
19:06:50 <jeblair> heh
19:07:08 <clarkb> that is what tab keys are for
19:07:17 <clarkb> or ^N
19:07:43 <jeblair> so do you have an estimate for when we might be fully utilizing that (and can use only our pypi mirror)?
19:08:01 <jeblair> (and are there things other ppl can do to help it along?)
19:08:08 <mtaylor> there's a couple of stages
19:08:31 <mtaylor> I could post the new repo today (and just assume that when markmc gets back from vacation that he'll be unhappy with whatever the name is ;) )
19:08:44 <mtaylor> but then we have to start actually aligning the projects
19:08:53 <mtaylor> I don't see that happening realistically until F3
19:09:37 <clarkb> and alignment is what will actually make this useful towards stability?
19:10:00 <mtaylor> it will ... because once we're aligned once, then all of the packages will have come from that list
19:10:13 <mtaylor> so future divergence (like the list moving forward but nova not tagging along immediately)
19:10:24 <mtaylor> will still have all of the prior versions in the mirror (since we don't reap)
19:10:34 <mtaylor> ACTUALLY - I'm lying
19:10:41 <jeblair> but in all cases, devstack is going to test with exactly one version of each thing.
19:10:44 <mtaylor> we don't need convergence. we have the complete set of packages _today_
19:11:08 <mtaylor> all we need is for the repo to exist and the _policy_ to be that all new package versions must hit it first
19:12:05 <jeblair> yeah, we don't actually need changes to each project to get this merged.
19:12:28 <mtaylor> correct
19:12:53 <mtaylor> we just need the repo, and to add its lists to our pypi mirror creation - and then we need to trigger a pypi mirror run on changes from the repo
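(For illustration, a minimal sketch of the mirror-build step being described here: pull everything named in the requirements list into a local package directory. The paths and the use of pip's download mode are assumptions, not the actual mirror script.)

    #!/usr/bin/env python
    # Hypothetical sketch: populate a local package directory from the
    # global requirements list so the PyPI mirror carries every version
    # the projects are allowed to depend on.  Not the real mirror script.
    import subprocess
    import sys

    REQUIREMENTS = "/opt/openstack-requires/tools/pip-requires"  # assumed path
    MIRROR_DIR = "/srv/static/pypi"                               # assumed path

    def build_mirror():
        # 'pip download' fetches the listed packages (and their
        # dependencies) into MIRROR_DIR without installing anything.
        return subprocess.call(
            ["pip", "download", "-r", REQUIREMENTS, "-d", MIRROR_DIR])

    if __name__ == "__main__":
        sys.exit(build_mirror())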
19:13:45 <jeblair> then perhaps we should go ahead and do that much, because it will make our mirror much more useful.
19:14:03 <jeblair> and then get devstack using the packages, and then get the copy-into-projects thing going.
19:14:39 <jeblair> you think we can get the first step done within a week or two?
19:14:48 <mtaylor> I do think so
19:15:24 <jeblair> okay.  so my second item was to explore an alternate plan in case we couldn't do that in a reasonable amount of time...
19:15:45 <mtaylor> I think if we can get vishy and bcwaldon and heckj and notmyname and danwent and devcamcar on board with at least attempting it
19:15:50 <jeblair> (something like build the mirror from the individual projects and use it exclusively except in the case of testing a change to the -requirements)
19:16:26 <jeblair> but perhaps we don't need to talk about the alternate plan if the main one looks viable.
19:16:27 <mtaylor> right. well - also, I should take this moment to point out that we were seeing a MUCH higher failure rate than normal because the mirror script had been silently failing for the last month
19:17:04 <jeblair> indeed, and thanks for fixing that!
19:17:29 <mtaylor> well... remind me next time _not_ to put 2>/dev/null in scripts that get run by cron :)
19:18:08 <jeblair> so point #3 i had was how to be more resilient to gerrit errors
19:18:26 <jeblair> i believe clarkb's exponential backoff script is in place now
19:18:56 <jeblair> and things seem to still work, so that's great.  that should help us avoid failing when a simple retry of the git command would succeed.
19:18:59 <clarkb> it is. I have been checking console output for jobs semi randomly to see if any of them have had to fetch more than once, but I haven't seen that happen
19:19:17 <jeblair> it might be useful to have that script log when it has to back off
19:19:36 <jeblair> perhaps it could syslog, and we could check up on it periodically
19:19:40 <jeblair> clarkb: what do you think?
19:19:45 <clarkb> sounds good. I will add that
19:19:48 <jeblair> (and maybe someday we'll have a syslog server)
19:20:08 <jeblair> cool, then we'll be able to track whether the incidences of transient gerrit failures are increasing or decreasing.
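(A rough sketch of the retry-with-backoff idea, assuming a shell-out to git and stdlib syslog; the script actually running on the slaves is its own implementation.)

    import subprocess
    import syslog
    import time

    def fetch_with_backoff(repo_url, max_attempts=5, base_delay=2):
        """Run 'git fetch', retrying with exponential backoff on failure."""
        for attempt in range(1, max_attempts + 1):
            if subprocess.call(["git", "fetch", repo_url]) == 0:
                return True
            delay = base_delay ** attempt
            # Log each backoff so transient Gerrit failures can be tracked.
            syslog.syslog(syslog.LOG_WARNING,
                          "git fetch of %s failed (attempt %d); retrying in %ds"
                          % (repo_url, attempt, delay))
            time.sleep(delay)
        return False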
19:20:39 <clarkb> you have also increased the http timeout from 5ms to 5s
19:20:42 <jeblair> and of course, after our badgering, spearce found a bug in gerrit
19:20:47 <jeblair> yes, that one
19:21:09 <jeblair> there was a tuning parameter which i would have changed had the default not already been a 5 minute timeout
19:21:09 <mtaylor> I think that'll help
19:21:29 <jeblair> the bug was that it was interpreted as a 5 millisecond timeout, so that was pretty much useless.
19:21:56 <jeblair> it's definitely a parameter that's right in the middle of where we thought the problem might be, so yeah, pretty optimistic.
19:22:33 <mtaylor> also, I've got some apache rewrite rules up for review that I need to test that would allow all of our anon-http fetching to be done by normal git and apache - with packs themselves served out as static files by apache with no cgi anything in the way
19:22:40 <clarkb> you also restarted all the things after the leap second bug which I am sure didn't hurt
19:22:51 <mtaylor> so I'm hoping that helps too
19:23:13 <jeblair> mtaylor: yep.  that system is basically idle, plenty of room for apache to do what it does best.
19:23:21 <jeblair> okay so #4 is how to handle retriggers, because no matter how awesome everything else is, something is going to break, or someone is going to improve the system.
19:23:47 <jeblair> and we need a not-ridiculous way for people to retrigger check and gate jobs.
19:24:42 <jeblair> so we've had two ideas about that
19:24:46 <clarkb> my idea which is a bit of a hack (but less so than pushing new patchsets) is to leave a comment with some string in it that zuul will interpret as meaning retrigger the jobs
19:25:38 <jeblair> and an earlier idea i had was to have a link in jenkins (maybe in the build description) that would retrigger the change in question.
19:25:53 <jeblair> my idea is not easily or elegantly implemented in zuul.
19:25:59 <jeblair> clarkb's idea is.
19:26:41 <jeblair> the only downside i see to clark's is that, by design, anyone would be able to say "reapprove" and launch the approve jobs, even before a change has been approved.  but that's really okay since gerrit won't merge them without the approved vote anyway.
19:27:05 <mtaylor> I'd say...
19:27:21 <mtaylor> we don't really need re-approve, since anyone with approval bit in gerrit can already re-approve
19:27:27 <jeblair> also, in magical pony world, i'd really like to have a button in gerrit, and clark's solution is more compatible with that possible future expansion.
19:27:42 <mtaylor> retrigger, on the other hand, meets a current missing need
19:27:49 <jeblair> well, before, anyone could retrigger an approval job
19:28:13 <jeblair> i think probably patchset authors want to be able to reapprove their own patches, since they're watching out for them, without bugging core reviewers
19:28:26 <mtaylor> good point
19:28:29 <mtaylor> ok. I'm fine with it
19:29:00 <jeblair> it's easy to do one, the other, or both with clarkb's change anyway, it's all just configuration.
19:29:06 <mtaylor> agree
19:29:11 * mtaylor is in favor of clark's change
19:29:17 * jeblair agrees
19:29:27 <mtaylor> and a long-term task to add a button to gerrit
19:29:35 <jeblair> so that just leaves 'what should the magic words be?'
19:29:44 <clarkb> https://review.openstack.org/#/c/9195/ adds this functionality to zuul
19:29:58 <jeblair> i'm not sure just 'retrigger' is a good idea, i mean, it might trigger jobs due to casual code reviews.
19:30:04 <mtaylor> I'd say that a comment left that is the text "retrigger" and only that text
19:30:10 <jeblair> ah ok.
19:30:25 <mtaylor> so: ^\s*retrigger\s*$
19:30:50 <jeblair> and retrigger itself is vague (retrigger what?)
19:31:05 <mtaylor> rebuild?
19:31:06 <jeblair> perhaps it should be recheck/reapprove
19:31:11 <mtaylor> recheck
19:31:12 <clarkb> the verbs I used when testing were reverify and recheck
19:31:12 <mtaylor> yeah
19:31:20 <jeblair> and we need distinct values for the two kinds of jobs
19:31:36 <clarkb> recheck and reapprove sound good to me
19:31:41 <mtaylor> recheck for pre-approval, reverify for post-approval
19:31:50 <mtaylor> ?
19:32:12 <jeblair> slight preference for recheck/reverify
19:32:14 <mtaylor> damn naming
19:32:17 <mtaylor> yeah. me too
19:32:21 <jeblair> (since jenkins isn't actually approving)
19:32:21 <clarkb> works for me
19:32:49 <mtaylor> cool. sold
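(A minimal sketch of the comment matching just agreed on; the real change is the zuul review linked above and may differ in detail.)

    import re

    # A comment consisting solely of one of these words retriggers jobs:
    # 'recheck' re-runs the check jobs, 'reverify' re-runs the gate jobs.
    RECHECK_RE = re.compile(r"^\s*recheck\s*$")
    REVERIFY_RE = re.compile(r"^\s*reverify\s*$")

    def retrigger_action(comment):
        """Return which pipeline, if any, a Gerrit comment should retrigger."""
        if RECHECK_RE.match(comment):
            return "check"
        if REVERIFY_RE.match(comment):
            return "gate"
        return None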
19:33:02 <jeblair> okay, i think that's all the decision making i needed today.  :)
19:33:07 <mtaylor> anybody in channel who isn't the three of us have an opinion? you have exactly one minute
19:34:26 <jeblair> (and i even told the ml we'd talk about this at the meeting today)
19:36:51 <mtaylor> cool. ok. done
19:37:02 <mtaylor> #topic bragging
19:37:13 <mtaylor> client libs are auto-uploading to PyPI now
19:37:23 <mtaylor> #topic open discussion
19:37:27 <mtaylor> anything else?
19:37:47 <jeblair> the devstack-gate job output is _much_ cleaner now
19:38:00 <LinuxJedi> oh, I have something
19:38:10 <clarkb> I do too once LinuxJedi is done
19:38:12 <jeblair> jaypipes: any chance you're around?
19:38:15 <LinuxJedi> Gerrit is now using my row color theme patch
19:38:23 <mtaylor> yay!
19:38:24 <LinuxJedi> and that has been pushed for review upstream
19:38:29 <LinuxJedi> along with the other theme patch
19:38:35 <clarkb> (and JavaMelody)
19:38:39 <mtaylor> oh - and the monitoring patch is live - although if you didn't know that already, you probably don't have access to see it
19:38:56 <mtaylor> jaypipes: yeah - how's that tempest stuff coming along?
19:39:12 * mtaylor doesn't know if that's what jeblair was pinging jaypipes about
19:39:16 <LinuxJedi> if you don't have access to see it, it is the pot of gold at the end of the rainbow you have all been looking for
19:39:45 <mtaylor> SO ... clarkb
19:39:56 <jeblair> yep.  we are so ready to run tempest on gates, but i don't think tempest is yet.
19:40:08 <devananda> chiming in randomly here, my openvz kernel scripts can now handle in-place kernel upgrades
19:40:17 <mtaylor> I _think_ there is some way to get melody to splat out its information in a form that collectd or icinga can pick up
19:40:23 <mtaylor> devananda: w00t!
19:40:52 <mtaylor> oh, I spoke with primeministerp earlier today and he's working on getting the hyper-v lab back up and going - so we might have more contrib testing from there
19:41:11 <jeblair> who is primeministerp?
19:41:22 <mtaylor> and Shrews may or may not be getting closer to or further away from nova openvz support, fwiw
19:41:43 <jeblair> notmyname: are you here or on the road?
19:41:46 <clarkb> mtaylor: it has pdf exports :) it's "enterprise"
19:41:48 <mtaylor> jeblair: can't think of real name - boston guy from suse/novell who worked with microsoft on interop
19:41:54 <Shrews> mtaylor: yeah, well, there's been a wrench thrown in that we should discuss
19:42:09 <mtaylor> Shrews: does the wrench involve buying me liquor?
19:42:17 <jeblair> ah, i remember him.
19:42:44 <Shrews> mtaylor: no. devananda gave me some news that the RS patch may be forthcoming soon
19:43:03 <mtaylor> o really?
19:43:15 <mtaylor> great. well, do you feel you've learned things?
19:43:30 <devananda> short version, it may arrive on github thursday, or it may not
19:43:33 <joearnold> jeblair: notmyname is on the road.
19:43:38 <mtaylor> on github?
19:43:43 <Shrews> github?
19:43:44 <mtaylor> why would it arrive on github?
19:43:47 <jeblair> joearnold: thanks.  bad day for getting updates from other people.  :)
19:44:00 <mtaylor> joearnold: unacceptable!
19:44:04 <joearnold> :)
19:44:05 <devananda> right. i don't know why.
19:44:12 <mtaylor> joearnold: notmyname is always to be available
19:44:24 <clarkb> I wanted to bring up cgroups and ulimits for jenkins slaves
19:44:29 <mtaylor> devananda: well, I suppose it's something :)
19:44:32 <mtaylor> clarkb: yes!
19:44:36 <jeblair> the change to add swift to the devstack gate worked without any particular drama, so it'd be nice to work on a plan to get that merged.
19:44:37 <LinuxJedi> clarkb: excellent!
19:44:45 <mtaylor> #topic cgroups and ulimits
19:44:56 <joearnold> mtaylor: true enough. He's on his way to flagstaff, az
19:45:00 <clarkb> the ulimits module was merged and is straightforward to use
19:45:30 <clarkb> I think we are fairly safe limiting the jenkins user to some reasonable process limit using that module
19:46:01 <clarkb> two questions though. what is a reasonably safe process limit? and how does the jenkins user log in - is it through su?
19:46:29 <jeblair> clarkb: via ssh actually
19:46:47 <clarkb> awesome. ssh login has security limits applied by default on ubuntu
19:47:00 <jeblair> jenkins master ssh's into the slave host, runs a java process, and that process runs jobs.
19:47:01 <clarkb> but not for su
19:47:34 <jeblair> on the devstack nodes, that _job_ would su to another user (stack) who might also su to root to run devstack..
19:47:50 <jeblair> but since that happens on single use slaves with job timeouts, it's not such a priority.
19:48:02 <clarkb> so other than determining what a sane number for a process limit is the ulimit stuff is not very scary
19:48:19 <clarkb> cgroups on the other hand have the potential to be great fun
19:48:29 <LinuxJedi> clarkb: 640kbytes should be enough for anyone!
19:48:49 <jeblair> so we should probably monitor process count during, say, a nova unit test run.
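(One throwaway way to do that, assuming the slave user is 'jenkins': sample the process count while a test job runs and note the peak.)

    import subprocess
    import time

    def peak_process_count(user="jenkins", interval=5, samples=120):
        """Sample the number of processes owned by a user; return the peak.

        Rough helper for picking a sane nproc ulimit: run it alongside a
        nova unit test job and see how high the count climbs.
        """
        peak = 0
        for _ in range(samples):
            out = subprocess.run(["ps", "-u", user, "--no-headers"],
                                 capture_output=True).stdout
            peak = max(peak, len(out.splitlines()))
            time.sleep(interval)
        return peak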
19:49:02 <clarkb> the current cgroup change https://review.openstack.org/#/c/9074/ adds support for memory limits for the jenkins user on jenkins slaves but does not apply them in site.pp
19:49:20 <clarkb> jeblair: good idea
19:49:58 <jeblair> clarkb: how do you think we should apply the cgroups change?
19:50:07 <jeblair> carefully or recklessly?  :)
19:50:12 <clarkb> the cgroup configuration sets a soft memory limit of 512MB for the jenkins user. This comes into play if there is any memory contention on the box
19:50:36 <clarkb> so jenkins would be limited to 512MB if something else was making the machine unhappy.
19:50:50 <clarkb> it also applies a hard limit of 75% of the physical memory on the machine
19:51:22 <clarkb> the hard limit is more dangerous, because by default OOM killer will be invoked to clean up jenkins' processes if it doesn't free memory when asked nicely
19:52:02 <clarkb> we can disable OOM killer which will cause memory overruns to force processes to sleep when they need more memory
19:52:41 <clarkb> or we can completely redo the numbers. I think not setting a hard limit and only setting a soft limit to 75% of physical memory would be safer
19:53:11 <clarkb> jeblair: I was thinking carefully would be best :)
19:53:32 <jeblair> so what happens if the soft limit is reached?
19:53:32 <clarkb> maybe add a node definition for a specific jenkins_slave (more specific than the current glob) and see how that host does
19:53:55 <jeblair> clarkb: that's a good idea.  then we can easily disable that node if it causes problems.
19:53:59 <clarkb> jeblair: soft limit only applies if there is memory contention on the host. In that case it acts like a hard limit
19:55:24 <LinuxJedi> clarkb: contention including swap?
19:55:34 <clarkb> LinuxJedi: I think so
19:55:59 <LinuxJedi> I know on HP Cloud this only applies to devstack, but we give those nodes something like 100GB of swap due to the way the disks are configured
19:56:27 <clarkb> http://www.mjmwired.net/kernel/Documentation/cgroups/memory.txt kernel documentation doesn't quite spell out all of the details
19:56:28 * LinuxJedi really doesn't want to be using 100GB of swap on anything ;)
19:57:00 <clarkb> in that case we can set hard limits that are larger than 75% of physical memory
19:57:08 <clarkb> maybe physical memory * 2
19:57:23 <LinuxJedi> jeblair: what do you think?
19:57:25 <jeblair> well, i don't want to be swapping at all really.  :)
19:58:09 <jeblair> perhaps a hard limit of 90%?
19:58:15 <clarkb> jeblair: ok
19:58:22 <LinuxJedi> sounds good to me
19:58:23 <jeblair> at 4G, that leaves 400M for the rest of the system, which seems reasonable.
19:58:33 <clarkb> I will update the change after lunch with what that looks like
19:58:34 <LinuxJedi> we can always tweak it if it causes pain
19:58:43 <LinuxJedi> but I feel safe with that
19:59:14 <mtaylor> ++
19:59:17 <jeblair> ok.  and let's do clark's idea of applying it to just one jenkins slave
19:59:30 <clarkb> sounds good
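(To make the numbers concrete, a hedged sketch of the limits just agreed on, written directly against the cgroup-v1 memory controller files; the actual change is the puppet module under review, and the cgroup path here is an assumption.)

    import os

    MEMORY_CGROUP = "/sys/fs/cgroup/memory/jenkins"  # assumed cgroup path

    def apply_limits(total_mem_bytes):
        """Soft-limit jenkins to 512MB under contention; hard-limit to 90%."""
        soft = 512 * 1024 * 1024
        hard = int(total_mem_bytes * 0.9)
        with open(os.path.join(MEMORY_CGROUP,
                               "memory.soft_limit_in_bytes"), "w") as f:
            f.write(str(soft))
        with open(os.path.join(MEMORY_CGROUP,
                               "memory.limit_in_bytes"), "w") as f:
            f.write(str(hard))

    # e.g. apply_limits(os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES"))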
19:59:42 <jeblair> 1 sec and i'll pick one.
20:00:18 * LinuxJedi watches jeblair use the scientific method of closing eyes and pointing to a random machine on the screen
20:00:52 <jeblair> precise8
20:01:12 <jeblair> LinuxJedi: close -- gate-nova-python27 runs there a lot.  :)
20:01:15 <ttx> hrm hrm.
20:01:24 <devcamcar> o/
20:01:34 <ttx> jeblair: time to call that meeting to an end :)
20:01:42 <jeblair> mtaylor: time to call that meeting to an end :)
20:01:46 <mtaylor> kk
20:01:50 <mtaylor> #stopmeeting
20:01:51 <jbryce> mtaylor: !
20:01:52 <mtaylor> thanks guys!
20:01:56 <mtaylor> #endmeeting