19:01:38 <mtaylor> #startmeeting
19:01:39 <openstack> Meeting started Tue Jul 3 19:01:38 2012 UTC. The chair is mtaylor. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:40 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:43 <clarkb> ohai
19:02:13 <mtaylor> anybody around want to talk about so you think you can dance?
19:02:15 <mtaylor> oh
19:02:17 <mtaylor> I mean
19:02:19 <mtaylor> openstack CI stuff?
19:02:32 <jeblair> o/
19:03:14 <mtaylor> neat
19:03:32 <mtaylor> so - jeblair I believe wanted to talk about jenkinsy failure and retrigger stuff, yeah
19:03:51 <jeblair> yep.
19:04:11 <jeblair> my main concern is the global dependency list and how that relates to getting the pypi mirror stable
19:04:31 <jeblair> i see there's a mailing list thread, which unfortunately has some confusion associated with it.
19:04:46 <jeblair> i certainly haven't seen a viable suggestion other than your initial one.
19:05:09 <jeblair> would it be productive to talk about that here (perhaps summoning some people), or should we leave that on the ML for now?
19:05:13 <mtaylor> no. and I intend, for the record, to ignore all of the irrelevant things that have been said so far
19:05:36 <mtaylor> the ML thread is supposed to be informative, and then to ask an opinion on the name "openstack-requires"
19:05:52 <mtaylor> the one counter suggestion I've heard is "openstack-packaging" - which I don't REALLY like
19:06:21 <jeblair> yeah, i don't see a justification for that. i might say openstack-requirements but it's close enough.
19:06:29 <mtaylor> although I do think we could certainly put in a dpkg-selections file and an rpm list so that devstack could consume the current state of depends
19:06:30 <jeblair> (or openstack-dependencies)
19:06:43 <jeblair> indeed.
19:06:45 <mtaylor> I have to think too much to type dependencies
19:06:50 <jeblair> heh
19:07:08 <clarkb> that is what tab keys are for
19:07:17 <clarkb> or ^N
19:07:43 <jeblair> so do you have an estimate for when we might be fully utilizing that (and can use only our pypi mirror)?
19:08:01 <jeblair> (and are there things other ppl can do to help it along?)
19:08:08 <mtaylor> there's a couple of stages
19:08:31 <mtaylor> I could post the new repo today (and just assume that when markmc gets back from vacation that he'll be unhappy with whatever the name is ;) )
19:08:44 <mtaylor> but then we have to start actually aligning the projects
19:08:53 <mtaylor> I don't see that happening realistically until F3
19:09:37 <clarkb> and alignment is what will actually make this useful towards stability?
19:10:00 <mtaylor> it will ... because once we're aligned once, then all of the packages will have come from that list
19:10:13 <mtaylor> so future divergence (like the list moving forward but nova not tagging along immediately)
19:10:24 <mtaylor> will still have all of the prior versions in the mirror (since we don't reap)
19:10:34 <mtaylor> ACTUALLY - I'm lying
19:10:41 <jeblair> but in all cases, devstack is going to test with exactly one version of each thing.
19:10:44 <mtaylor> we don't need convergence. we have the complete set of packages _today_
19:11:08 <mtaylor> all we need is for the repo to exist and the _policy_ to be that all new package versions must hit it first
19:12:05 <jeblair> yeah, we don't actually need changes to each project to get this merged.
19:12:28 <mtaylor> correct
19:12:53 <mtaylor> we just need the repo, and to add its lists to our pypi mirror creation - and then we need to trigger a pypi mirror run on changes from the repo
19:13:45 <jeblair> then perhaps we should go ahead and do that much, because it will make our mirror much more useful.
19:14:03 <jeblair> and then get devstack using the packages, and then get the copy-into-projects thing going.
19:14:39 <jeblair> you think we can get the first step done within a week or two?
19:14:48 <mtaylor> I do think so
19:15:24 <jeblair> okay. so my second item was to explore an alternate plan in case we couldn't do that in a reasonable amount of time...
19:15:45 <mtaylor> I think if we can get vishy and bcwaldon and heckj and notmyname and danwent and devcamcar on board with at least attempting it
19:15:50 <jeblair> (something like build the mirror from the individual projects and use it exclusively except in the case of testing a change to the -requirements)
19:16:26 <jeblair> but perhaps we don't need to talk about the alternate plan if the main one looks viable.
19:16:27 <mtaylor> right. well - also, I should take this moment to point out that we were seeing a MUCH higher failure rate than normal because the mirror script had been silently failing for the last month
19:17:04 <jeblair> indeed, and thanks for fixing that!
19:17:29 <mtaylor> well... remind me next time _not_ to put 2>/dev/null in scripts that get run by cron :)
19:18:08 <jeblair> so point #3 i had was how to be more resilient to gerrit errors
19:18:26 <jeblair> i believe clarkb's exponential backoff script is in place now
19:18:56 <jeblair> and things seem to still work, so that's great. that should help us avoid failing when a simple retry of the git command would succeed.
19:18:59 <clarkb> it is. I have been checking console output for jobs semi randomly to see if any of them have had to fetch more than once, but I haven't seen that happen
19:19:17 <jeblair> it might be useful to have that script log when it has to back off
19:19:36 <jeblair> perhaps it could syslog, and we could check up on it periodically
19:19:40 <jeblair> clarkb: what do you think?
19:19:45 <clarkb> sounds good.
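[Editor's note: clarkb's actual backoff script isn't quoted in the log. As a minimal sketch of the idea under discussion — retry a flaky git command with exponentially growing waits, logging each retry to syslog via logger(1) as jeblair suggests — something like the following would do. The function name, retry count, and delays are assumptions, not the deployed script.]

```shell
#!/bin/sh
# Retry a command with exponential backoff: wait 1s, 2s, 4s, ... between
# attempts, giving up after max_tries. Each retry is noted in syslog so
# the frequency of transient failures can be tracked over time.
retry_backoff() {
    max_tries=5
    delay=1
    n=1
    while true; do
        "$@" && return 0
        [ "$n" -ge "$max_tries" ] && return 1
        logger -t git-retry "attempt $n of '$*' failed; sleeping ${delay}s"
        sleep "$delay"
        delay=$((delay * 2))
        n=$((n + 1))
    done
}

# Example: retry_backoff git fetch origin master
```

Logging through logger(1) rather than stderr is what makes the "check up on it periodically" idea workable: the retries survive in syslog even when the job console scrolls away.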
I will add that
19:19:48 <jeblair> (and maybe someday we'll have a syslog server)
19:20:08 <jeblair> cool, then we'll be able to track whether the incidences of transient gerrit failures are increasing or decreasing.
19:20:39 <clarkb> you have also increased the http timeout from 5ms to 5s
19:20:42 <jeblair> and of course, after our badgering, spearce found a bug in gerrit
19:20:47 <jeblair> yes, that one
19:21:09 <jeblair> there was a tuning parameter which i would have changed had the default not already been a 5 minute timeout
19:21:09 <mtaylor> I think that'll help
19:21:29 <jeblair> the bug was that it was interpreted as a 5 millisecond timeout, so that was pretty much useless.
19:21:56 <jeblair> it's definitely a parameter that's right in the middle of where we thought the problem might be, so yeah, pretty optimistic.
19:22:33 <mtaylor> also, I've got some apache rewrite rules up for review that I need to test that would allow all of our anon-http fetching to be done by normal git and apache - with packs themselves served out as static files by apache with no cgi anything in the way
19:22:40 <clarkb> you also restarted all the things after the leap second bug which I am sure didn't hurt
19:22:51 <mtaylor> so I'm hoping that helps too
19:23:13 <jeblair> mtaylor: yep. that system is basically idle, plenty of room for apache to do what it does best.
19:23:21 <jeblair> okay so #4 is how to handle retriggers, because no matter how awesome everything else is, something is going to break, or someone is going to improve the system.
19:23:47 <jeblair> and we need a not-ridiculous way for people to retrigger check and gate jobs.
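[Editor's note: the rewrite rules mtaylor has up for review aren't shown here. As a sketch, serving git's "dumb" HTTP protocol with no CGI in the way needs little more than static file access to the repositories, plus `git update-server-info` run in each repo (usually from a post-update hook) so objects/info/packs stays current. The paths below, and the apache 2.2-era access directives, are assumptions:]

```
# Illustrative apache fragment -- repo path is an assumption.
# Requires `git update-server-info` to have run in each repository.
Alias /p /var/lib/git
<Directory /var/lib/git>
    Options Indexes FollowSymLinks
    Order allow,deny
    Allow from all
</Directory>
```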
19:24:42 <jeblair> so we've had two ideas about that
19:24:46 <clarkb> my idea which is a bit of a hack (but less so than pushing new patchsets) is to leave a comment with some string in it that zuul will interpret as meaning retrigger the jobs
19:25:38 <jeblair> and an earlier idea i had was to have a link in jenkins (maybe in the build description) that would retrigger the change in question.
19:25:53 <jeblair> my idea is not easily or elegantly implemented in zuul.
19:25:59 <jeblair> clarkb's idea is.
19:26:41 <jeblair> the only downside i see to clark's is that, by design, anyone would be able to say "reapprove" and launch the approve jobs, even before a change has been approved. but that's really okay since gerrit won't merge them without the approved vote anyway.
19:27:05 <mtaylor> I'd say...
19:27:21 <mtaylor> we don't really need re-approve, since anyone with approval bit in gerrit can already re-approve
19:27:27 <jeblair> also, in magical pony world, i'd really like to have a button in gerrit, and clark's solution is more compatible with that possible future expansion.
19:27:42 <mtaylor> retrigger, on the other hand, meets a current missing need
19:27:49 <jeblair> well, before, anyone could retrigger an approval job
19:28:13 <jeblair> i think probably patchset authors want to be able to reapprove their own patches, since they're watching out for them, without bugging core reviewers
19:28:26 <mtaylor> good point
19:28:29 <mtaylor> ok. I'm fine with it
19:29:00 <jeblair> it's easy to do one, the other, or both with clarkb's change anyway, it's all just configuration.
19:29:06 <mtaylor> agree
19:29:11 * mtaylor is in favor of clark's change
19:29:17 * jeblair agrees
19:29:27 <mtaylor> and a long-term task to add a button to gerrit
19:29:35 <jeblair> so that just leaves 'what should the magic words be?'
19:29:44 <clarkb> https://review.openstack.org/#/c/9195/ adds this functionality to zuul
19:29:58 <jeblair> i'm not sure just 'retrigger' is a good idea, i mean, it might trigger jobs due to casual code reviews.
19:30:04 <mtaylor> I'd say that a comment left that is the text "retrigger" and only that text
19:30:10 <jeblair> ah ok.
19:30:25 <mtaylor> so: ^\s*retrigger\s*$
19:30:50 <jeblair> and retrigger itself is vague (retrigger what?)
19:31:05 <mtaylor> rebuild?
19:31:06 <jeblair> perhaps it should be recheck/reapprove
19:31:11 <mtaylor> recheck
19:31:12 <clarkb> the verbs I used when testing were reverfiy and recheck
19:31:12 <mtaylor> yeah
19:31:19 <clarkb> *reverify
19:31:20 <jeblair> and we need distinct values for the two kinds of jobs
19:31:36 <clarkb> recheck and reapprove sound good to me
19:31:41 <mtaylor> recheck for pre-approval, reverify for post-approval
19:31:50 <mtaylor> ?
19:32:12 <jeblair> slight preference for recheck/reverify
19:32:14 <mtaylor> damn naming
19:32:17 <mtaylor> yeah. me too
19:32:21 <jeblair> (since jenkins isn't actually approving)
19:32:21 <clarkb> works for me
19:32:49 <mtaylor> cool. sold
19:33:02 <jeblair> okay, i think that's all the decision making i needed today. :)
19:33:07 <mtaylor> anybody in channel who isn't the three of us have an opinion? you have exactly one minute
19:34:26 <jeblair> (and i even told the ml we'd talk about this at the meeting today)
19:36:51 <mtaylor> cool. ok. done
19:37:02 <mtaylor> #topic bragging
19:37:13 <mtaylor> client libs are auto-uploading to PyPI now
19:37:23 <mtaylor> #topic open discussion
19:37:27 <mtaylor> anything else?
19:37:47 <jeblair> the devstack-gate job output is _much_ cleaner now
19:38:00 <LinuxJedi> oh, I have something
19:38:10 <clarkb> I do too once LinuxJedi is done
19:38:12 <jeblair> jaypipes: any chance you're around?
19:38:15 <LinuxJedi> Gerrit is now using my row color theme patch
19:38:23 <mtaylor> yay!
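[Editor's note: the exact matching in change 9195 isn't quoted in the log. A sketch of the pattern agreed above — the comment must be exactly "recheck" or "reverify", optionally padded with whitespace, following the shape of mtaylor's ^\s*retrigger\s*$ — might look like this; the function name is illustrative, and note that a multi-line comment containing such a line on its own would also match:]

```shell
#!/bin/sh
# Return success only when a comment line is exactly "recheck" or
# "reverify" (allowing surrounding whitespace), so that casual mentions
# of the words in ordinary review comments don't retrigger jobs.
is_retrigger_comment() {
    printf '%s\n' "$1" | grep -Eq '^[[:space:]]*(recheck|reverify)[[:space:]]*$'
}
```

Anchoring with ^ and $ is the whole trick: it is what separates a deliberate command from a sentence that merely contains the word.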
19:38:24 <LinuxJedi> and that has been pushed for review upstream
19:38:29 <LinuxJedi> along with the other theme patch
19:38:35 <clarkb> (and JavaMelody)
19:38:39 <mtaylor> oh - and the monitoring patch is live - although if you didn't know that already, you probably don't have access to see it
19:38:56 <mtaylor> jaypipes: yeah - how's that tempest stuff coming along?
19:39:12 * mtaylor doesn't know if that's what jeblair was pinging jaypipes about
19:39:16 <LinuxJedi> if you don't have access to see it, it is the pot of gold at the end of the rainbow you have all been looking for
19:39:45 <mtaylor> SO ... clarkb
19:39:56 <jeblair> yep. we are so ready to run tempest on gates, but i don't think tempest is yet.
19:40:08 <devananda> chiming in randomly here, my openvz kernel scripts can now handle in-place kernel upgrades
19:40:17 <mtaylor> I _think_ there is some way to get melody to splat out its information in a form that collectd or icinga can pick up
19:40:23 <mtaylor> devananda: w00t!
19:40:52 <mtaylor> oh, I spoke with primeministerp earlier today and he's working on getting the hyper-v lab back up and going - so we might have more contrib testing from there
19:41:11 <jeblair> who is primeministerp?
19:41:22 <mtaylor> and Shrews may or may not be getting closer to or further away from nova openvz support, fwiw
19:41:43 <jeblair> notmyname: are you here or on the road?
19:41:46 <clarkb> mtaylor: it has pdf exports :) it's "enterprise"
19:41:48 <mtaylor> jeblair: can't think of real name - boston guy from suse/novell who worked with microsoft on interop
19:41:54 <Shrews> mtaylor: yeah, well, there's been a wrench thrown in that we should discuss
19:42:09 <mtaylor> Shrews: does the wrench involve buying me liquor?
19:42:17 <jeblair> ah, i remember him.
19:42:44 <Shrews> mtaylor: no. devananda gave me some news that the RS patch may be forthcoming soon
19:43:03 <mtaylor> o really?
19:43:15 <mtaylor> great. well, do you feel you've learned things?
19:43:30 <devananda> short version, it may arrive on github thursday, or it may not
19:43:33 <joearnold> jeblair: notmyname is on the road.
19:43:38 <mtaylor> on github?
19:43:43 <Shrews> github?
19:43:44 <mtaylor> why would it arrive on github?
19:43:47 <jeblair> joearnold: thanks. bad day for getting updates from other people. :)
19:44:00 <mtaylor> joearnold: unacceptable!
19:44:04 <joearnold> :)
19:44:05 <devananda> right. i don't know why.
19:44:12 <mtaylor> joearnold: notmyname is always to be available
19:44:24 <clarkb> I wanted to bring up cgroups and ulimits for jenkins slaves
19:44:29 <mtaylor> devananda: well, I suppose it's something :)
19:44:32 <mtaylor> clarkb: yes!
19:44:36 <jeblair> the change to add swift to the devstack gate worked without any particular drama, so it'd be nice to work on a plan to get that merged.
19:44:37 <LinuxJedi> clarkb: excellent!
19:44:45 <mtaylor> #topic cgroups and ulimits
19:44:56 <joearnold> mtaylor: true enough. He's on his way to flagstaff, az
19:45:00 <clarkb> the ulimits module was merged and is straightforward to use
19:45:30 <clarkb> I think we are fairly safe limiting the jenkins user to some reasonable process limit using that module
19:46:01 <clarkb> two questions though. what is a reasonably safe process limit? and how does the jenkins user log in - is it through su?
19:46:29 <jeblair> clarkb: via ssh actually
19:46:47 <clarkb> awesome. ssh login has security limits applied by default on ubuntu
19:47:00 <jeblair> jenkins master ssh's into the slave host, runs a java process, and that process runs jobs.
19:47:01 <clarkb> but not for su
19:47:34 <jeblair> on the devstack nodes, that _job_ would su to another user (stack) who might also su to root to run devstack..
19:47:50 <jeblair> but since that happens on single use slaves with job timeouts, it's not such a priority.
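[Editor's note: the "security limits applied by default" clarkb mentions come from pam_limits, which reads /etc/security/limits.conf for ssh logins on Ubuntu; su only picks them up if pam_limits is enabled in /etc/pam.d/su. A sketch of capping the jenkins user's process count — the numbers are placeholders, since the sane limit was explicitly left to be measured:]

```
# /etc/security/limits.conf fragment (illustrative values)
# nproc = maximum number of processes for the jenkins user;
# soft is the enforced default, hard is the ceiling the user can raise to.
jenkins  soft  nproc  256
jenkins  hard  nproc  512
```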
19:48:02 <clarkb> so other than determining what a sane number for a process limit is, the ulimit stuff is not very scary
19:48:19 <clarkb> cgroups on the other hand have the potential to be great fun
19:48:29 <LinuxJedi> clarkb: 640kbytes should be enough for anyone!
19:48:49 <jeblair> so we should probably monitor process count during, say, a nova unit test run.
19:49:02 <clarkb> the current cgroup change https://review.openstack.org/#/c/9074/ adds support for memory limits for the jenkins user on jenkins slaves but does not apply them in site.pp
19:49:20 <clarkb> jeblair: good idea
19:49:58 <jeblair> clarkb: how do you think we should apply the cgroups change?
19:50:07 <jeblair> carefully or recklessly? :)
19:50:12 <clarkb> the cgroup configuration sets a soft memory limit of 512MB for the jenkins user. This comes into play if there is any memory contention on the box
19:50:36 <clarkb> so jenkins would be limited to 512MB if something else was making the machine unhappy.
19:50:50 <clarkb> it also applies a hard limit of 75% of the physical memory on the machine
19:51:22 <clarkb> the hard limit is more dangerous, because by default OOM killer will be invoked to clean up jenkins' processes if it doesn't free memory when asked nicely
19:52:02 <clarkb> we can disable OOM killer which will cause memory overruns to force processes to sleep when they need more memory
19:52:41 <clarkb> or we can completely redo the numbers. I think not setting a hard limit and only setting a soft limit to 75% of physical memory would be safer
19:53:11 <clarkb> jeblair: I was thinking carefully would be best :)
19:53:32 <jeblair> so what happens if the soft limit is reached?
19:53:32 <clarkb> maybe add a node definition for a specific jenkins_slave (more specific than the current glob) and see how that host does
19:53:55 <jeblair> clarkb: that's a good idea. then we can easily disable that node if it causes problems.
19:53:59 <clarkb> jeblair: soft limit only applies if there is memory contention on the host. In that case it acts like a hard limit
19:55:24 <LinuxJedi> clarkb: contention including swap?
19:55:34 <clarkb> LinuxJedi: I think so
19:55:59 <LinuxJedi> I know HP Cloud only applies to devstack but we give those like 100GB of swap due to the way the disks are configured
19:56:27 <clarkb> http://www.mjmwired.net/kernel/Documentation/cgroups/memory.txt kernel documentation doesn't quite spell out all of the details
19:56:28 * LinuxJedi really doesn't want to be using 100GB of swap on anything ;)
19:57:00 <clarkb> in that case we can set hard limits that are larger than 75% of physical memory
19:57:08 <clarkb> maybe physical memory * 2
19:57:23 <LinuxJedi> jeblair: what do you think?
19:57:25 <jeblair> well, i don't want to be swapping at all really. :)
19:58:09 <jeblair> perhaps a hard limit of 90%?
19:58:15 <clarkb> jeblair: ok
19:58:22 <LinuxJedi> sounds good to me
19:58:23 <jeblair> at 5G, that leaves 400M for the rest of the system, which seems reasonable.
19:58:27 <jeblair> 4G, that is.
19:58:33 <clarkb> I will update the change after lunch with what that looks like
19:58:34 <LinuxJedi> we can always tweak it if it causes pain
19:58:43 <LinuxJedi> but I feel safe with that
19:59:14 <mtaylor> ++
19:59:17 <jeblair> ok. and let's do clark's idea of applying it to just one jenkins slave
19:59:30 <clarkb> sounds good
19:59:42 <jeblair> 1 sec and i'll pick one.
20:00:18 * LinuxJedi watches jeblair use the scientific method of closing eyes and pointing to a random machine on the screen
20:00:52 <jeblair> precise8
20:01:12 <jeblair> LinuxJedi: close -- gate-nova-python27 runs there a lot. :)
20:01:15 <ttx> hrm hrm.
20:01:24 <devcamcar> o/
20:01:34 <ttx> jeblair: time to call that meeting to an end :)
20:01:42 <jeblair> mtaylor: time to call that meeting to an end :)
20:01:46 <mtaylor> kk
20:01:50 <mtaylor> #stopmeeting
20:01:51 <jbryce> mtaylor: !
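[Editor's note: the knobs being discussed are the cgroup (v1) memory controller's memory.soft_limit_in_bytes (reclaimed toward under contention) and memory.limit_in_bytes (the hard ceiling that can invoke the OOM killer). Change 9074 itself is Puppet and isn't quoted here; a libcgroup-style sketch reflecting the numbers settled on above — 512M soft, roughly 90% of a 4G slave hard — with group name and sizes illustrative:]

```
# /etc/cgconfig.conf sketch -- group name and sizes are illustrative
group jenkins {
    memory {
        memory.soft_limit_in_bytes = 512M;
        memory.limit_in_bytes = 3600M;
    }
}

# /etc/cgrules.conf: place the jenkins user's processes into that group
# user     controller   group
jenkins    memory       jenkins
```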
20:01:52 <mtaylor> thanks guys!
20:01:56 <mtaylor> #endmeeting