19:01:38 #startmeeting
19:01:39 Meeting started Tue Jul 3 19:01:38 2012 UTC. The chair is mtaylor. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:40 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:43 ohai
19:02:13 anybody around want to talk about so you think you can dance?
19:02:15 oh
19:02:17 I mean
19:02:19 openstack CI stuff?
19:02:32 o/
19:03:14 neat
19:03:32 so - jeblair I believe wanted to talk about jenkinsy failure and retrigger stuff, yeah
19:03:51 yep.
19:04:11 my main concern is the global dependency list and how that relates to getting the pypi mirror stable
19:04:31 i see there's a mailing list thread, which unfortunately has some confusion associated with it.
19:04:46 i certainly haven't seen a viable suggestion other than your initial one.
19:05:09 would it be productive to talk about that here (perhaps summoning some people), or should we leave that on the ML for now?
19:05:13 no. and I intend, for the record, to ignore all of the irrelevant things that have been said so far
19:05:36 the ML thread is supposed to be informative, and then to ask an opinion on the name "openstack-requires"
19:05:52 the one counter suggestion I've heard is "openstack-packaging" - which I don't REALLY like
19:06:21 yeah, i don't see a justification for that. i might say openstack-requirements but it's close enough.
19:06:29 although I do think we could certainly put in a dpkg-selections file and an rpm list so that devstack could consume the current state of depends
19:06:30 (or openstack-dependencies)
19:06:43 indeed.
19:06:45 I have to think too much to type dependencies
19:06:50 heh
19:07:08 that is what tab keys are for
19:07:17 or ^N
19:07:43 so do you have an estimate for when we might be fully utilizing that (and can use only our pypi mirror)?
19:08:01 (and are there things other ppl can do to help it along?)
19:08:08 there's a couple of stages
19:08:31 I could post the new repo today (and just assume that when markmc gets back from vacation that he'll be unhappy with whatever the name is ;) )
19:08:44 but then we have to start actually aligning the projects
19:08:53 I don't see that happening realistically until F3
19:09:37 and alignment is what will actually make this useful towards stability?
19:10:00 it will ... because once we're aligned once, then all of the packages will have come from that list
19:10:13 so future divergence (like the list moving forward but nova not tagging along immediately)
19:10:24 will still have all of the prior versions in the mirror (since we don't reap)
19:10:34 ACTUALLY - I'm lying
19:10:41 but in all cases, devstack is going to test with exactly one version of each thing.
19:10:44 we don't need convergence. we have the complete set of packages _today_
19:11:08 all we need is for the repo to exist and the _policy_ to be that all new package versions must hit it first
19:12:05 yeah, we don't actually need changes to each project to get this merged.
19:12:28 correct
19:12:53 we just need the repo, and to add its lists to our pypi mirror creation - and then we need to trigger a pypi mirror run on changes from the repo
19:13:45 then perhaps we should go ahead and do that much, because it will make our mirror much more useful.
19:14:03 and then get devstack using the packages, and then get the copy-into-projects thing going.
19:14:39 you think we can get the first step done within a week or two?
19:14:48 I do think so
19:15:24 okay. so my second item was to explore an alternate plan in case we couldn't do that in a reasonable amount of time...
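
(Illustrative aside: a minimal sketch of the requirements-driven mirror flow discussed above, assuming a plain pip-style requirements list in the proposed openstack-requires repo. The file name, mirror path, and the use of pip download in place of the actual mirror script are all assumptions, not the real implementation.)

    #!/bin/bash
    # Hypothetical sketch only: build a local package cache from a central
    # requirements list, then have jobs install exclusively from it.
    REQS=openstack-requires/global-requirements.txt   # placeholder path
    MIRROR=/srv/static/pypi                           # placeholder mirror dir

    # Fetch every pinned dependency (and its dependencies) into the mirror dir.
    pip download -r "$REQS" -d "$MIRROR"

    # A job (or devstack) then installs without ever touching upstream PyPI:
    pip install --no-index --find-links "file://$MIRROR" -r requirements.txt
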
19:15:45 I think if we can get vishy and bcwaldon and heckj and notmyname and danwent and devcamcar on board with at least attempting it
19:15:50 (something like build the mirror from the individual projects and use it exclusively except in the case of testing a change to the -requirements)
19:16:26 but perhaps we don't need to talk about the alternate plan if the main one looks viable.
19:16:27 right. well - also, I should take this moment to point out that we were seeing a MUCH higher failure rate than normal because the mirror script had been silently failing for the last month
19:17:04 indeed, and thanks for fixing that!
19:17:29 well... remind me next time _not_ to put 2>/dev/null in scripts that get run by cron :)
19:18:08 so point #3 i had was how to be more resilient to gerrit errors
19:18:26 i believe clarkb's exponential backoff script is in place now
19:18:56 and things seem to still work, so that's great. that should help us avoid failing when a simple retry of the git command would succeed.
19:18:59 it is. I have been checking console output for jobs semi randomly to see if any of them have had to fetch more than once, but I haven't seen that happen
19:19:17 it might be useful to have that script log when it has to back off
19:19:36 perhaps it could syslog, and we could check up on it periodically
19:19:40 clarkb: what do you think?
19:19:45 sounds good. I will add that
19:19:48 (and maybe someday we'll have a syslog server)
19:20:08 cool, then we'll be able to track whether the incidences of transient gerrit failures are increasing or decreasing.
19:20:39 you have also increased the http timeout from 5ms to 5s
19:20:42 and of course, after our badgering, spearce found a bug in gerrit
19:20:47 yes, that one
19:21:09 there was a tuning parameter which i would have changed had the default not already been a 5 minute timeout
19:21:09 I think that'll help
19:21:29 the bug was that it was interpreted as a 5 millisecond timeout, so that was pretty much useless.
19:21:56 it's definitely a parameter that's right in the middle of where we thought the problem might be, so yeah, pretty optimistic.
19:22:33 also, I've got some apache rewrite rules up for review that I need to test that would allow all of our anon-http fetching to be done by normal git and apache - with packs themselves served out as static files by apache with no cgi anything in the way
19:22:40 you also restarted all the things after the leap second bug which I am sure didn't hurt
19:22:51 so I'm hoping that helps too
19:23:13 mtaylor: yep. that system is basically idle, plenty of room for apache to do what it does best.
19:23:21 okay so #4 is how to handle retriggers, because no matter how awesome everything else is, something is going to break, or someone is going to improve the system.
19:23:47 and we need a not-ridiculous way for people to retrigger check and gate jobs.
19:24:42 so we've had two ideas about that
19:24:46 my idea which is a bit of a hack (but less so than pushing new patchsets) is to leave a comment with some string in it that zuul will interpret as meaning retrigger the jobs
19:25:38 and an earlier idea i had was to have a link in jenkins (maybe in the build description) that would retrigger the change in question.
19:25:53 my idea is not easily or elegantly implemented in zuul.
19:25:59 clarkb's idea is.
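
(Illustrative aside: a rough sketch of a retry wrapper like the one described above, with the syslog logging jeblair suggested added via logger. This is not the actual script in the CI repo; the tag, attempt count, and delays are made-up placeholders.)

    #!/bin/bash
    # Hypothetical sketch: retry a git command with exponential backoff and
    # log to syslog whenever a retry (or final failure) happens.
    MAX_ATTEMPTS=5
    DELAY=1
    ATTEMPT=1
    until git "$@"; do
        if [ "$ATTEMPT" -ge "$MAX_ATTEMPTS" ]; then
            logger -t gerrit-git-retry "giving up after $ATTEMPT attempts: git $*"
            exit 1
        fi
        logger -t gerrit-git-retry "attempt $ATTEMPT failed, retrying in ${DELAY}s: git $*"
        sleep "$DELAY"
        DELAY=$((DELAY * 2))
        ATTEMPT=$((ATTEMPT + 1))
    done
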
19:26:41 the only downside i see to clark's is that, by design, anyone would be able to say "reapprove" and launch the approve jobs, even before a change has been approved. but that's really okay since gerrit won't merge them without the approved vote anyway.
19:27:05 I'd say...
19:27:21 we don't really need re-approve, since anyone with approval bit in gerrit can already re-approve
19:27:27 also, in magical pony world, i'd really like to have a button in gerrit, and clark's solution is more compatible with that possible future expansion.
19:27:42 retrigger, on the other hand, meets a current missing need
19:27:49 well, before, anyone could retrigger an approval job
19:28:13 i think probably patchset authors want to be able to reapprove their own patches, since they're watching out for them, without bugging core reviewers
19:28:26 good point
19:28:29 ok. I'm fine with it
19:29:00 it's easy to do one, the other, or both with clarkb's change anyway, it's all just configuration.
19:29:06 agree
19:29:11 * mtaylor is in favor of clark's change
19:29:17 * jeblair agrees
19:29:27 and a long-term task to add a button to gerrit
19:29:35 so that just leaves 'what should the magic words be?'
19:29:44 https://review.openstack.org/#/c/9195/ adds this functionality to zuul
19:29:58 i'm not sure just 'retrigger' is a good idea, i mean, it might trigger jobs due to casual code reviews.
19:30:04 I'd say that a comment left that is the text "retrigger" and only that text
19:30:10 ah ok.
19:30:25 so: ^\s*retrigger\s*$
19:30:50 and retrigger itself is vague (retrigger what?)
19:31:05 rebuild?
19:31:06 perhaps it should be recheck/reapprove
19:31:11 recheck
19:31:12 the verbs I used when testing were reverfiy and recheck
19:31:12 yeah
19:31:19 *reverify
19:31:20 and we need distinct values for the two kinds of jobs
19:31:36 recheck and reapprove sound good to me
19:31:41 recheck for pre-approval, reverify for post-approval
19:31:50 ?
19:32:12 slight preference for recheck/reverify
19:32:14 damn naming
19:32:17 yeah. me too
19:32:21 (since jenkins isn't actually approving)
19:32:21 works for me
19:32:49 cool. sold
19:33:02 okay, i think that's all the decision making i needed today. :)
19:33:07 anybody in channel who isn't the three of us have an opinion? you have exactly one minute
19:34:26 (and i even told the ml we'd talk about this at the meeting today)
19:36:51 cool. ok. done
19:37:02 #topic bragging
19:37:13 client libs are auto-uploading to PyPI now
19:37:23 #topic open discussion
19:37:27 anything else?
19:37:47 the devstack-gate job output is _much_ cleaner now
19:38:00 oh, I have something
19:38:10 I do too once LinuxJedi is done
19:38:12 jaypipes: any chance you're around?
19:38:15 Gerrit is now using my row color theme patch
19:38:23 yay!
19:38:24 and that has been pushed for review upstream
19:38:29 along with the other theme patch
19:38:35 (and JavaMelody)
19:38:39 oh - and the monitoring patch is live - although if you didn't know that already, you probably don't have access to see it
19:38:56 jaypipes: yeah - how's that tempest stuff coming along?
19:39:12 * mtaylor doesn't know if that's what jeblair was pinging jaypipes about
19:39:16 if you don't have access to see it, it is the pot of gold at the end of the rainbow you have all been looking for
19:39:45 SO ... clarkb
19:39:56 yep. we are so ready to run tempest on gates, but i don't think tempest is yet.
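
(Illustrative aside on the recheck/reverify triggers settled earlier in the meeting: the "whole comment is exactly the keyword" rule could be checked as below. The real matching lives inside zuul via https://review.openstack.org/#/c/9195/; this sketch only demonstrates the agreed regexes and is not zuul's configuration syntax.)

    #!/bin/bash
    # Hypothetical sketch: test a Gerrit comment against the agreed trigger words.
    comment="$1"

    if echo "$comment" | grep -Eq '^[[:space:]]*recheck[[:space:]]*$'; then
        echo "re-run the check (pre-approval) jobs"
    elif echo "$comment" | grep -Eq '^[[:space:]]*reverify[[:space:]]*$'; then
        echo "re-run the gate (post-approval) jobs"
    else
        echo "ordinary review comment; no retrigger"
    fi
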
19:40:08 chiming in randomly here, my openvz kernel scripts can now handle in-place kernel upgrades
19:40:17 I _think_ there is some way to get melody to splat out its information in a form that collectd or icinga can pick up
19:40:23 devananda: w00t!
19:40:52 oh, I spoke with primeministerp earlier today and he's working on getting the hyper-v lab back up and going - so we might have more contrib testing from there
19:41:11 who is primeministerp?
19:41:22 and Shrews may or may not be getting closer to or further away from nova openvz support, fwiw
19:41:43 notmyname: are you here or on the road?
19:41:46 mtaylor: it has pdf exports :) it's "enterprise"
19:41:48 jeblair: can't think of real name - boston guy from suse/novell who worked with microsoft on interop
19:41:54 mtaylor: yeah, well, there's been a wrench thrown in that we should discuss
19:42:09 Shrews: does the wrench involve buying me liquor?
19:42:17 ah, i remember him.
19:42:44 mtaylor: no. devananda gave me some news that the RS patch may be forthcoming soon
19:43:03 o really?
19:43:15 great. well, do you feel you've learned things?
19:43:30 short version, it may arrive on github thursday, or it may not
19:43:33 jeblair: notmyname is on the road.
19:43:38 on github?
19:43:43 github?
19:43:44 why would it arrive on github?
19:43:47 joearnold: thanks. bad day for getting updates from other people. :)
19:44:00 joearnold: unacceptable!
19:44:04 :)
19:44:05 right. i don't know why.
19:44:12 joearnold: notmyname is always supposed to be available
19:44:24 I wanted to bring up cgroups and ulimits for jenkins slaves
19:44:29 devananda: well, I suppose it's something :)
19:44:32 clarkb: yes!
19:44:36 the change to add swift to the devstack gate worked without any particular drama, so it'd be nice to work on a plan to get that merged.
19:44:37 clarkb: excellent!
19:44:45 #topic cgroups and ulimits
19:44:56 mtaylor: true enough. He's on his way to flagstaff, az
19:45:00 the ulimits module was merged and is straightforward to use
19:45:30 I think we are fairly safe limiting the jenkins user to some reasonable process limit using that module
19:46:01 two questions though. what is a reasonably safe process limit? and how does the jenkins user log in - is it through su?
19:46:29 clarkb: via ssh actually
19:46:47 awesome. ssh login has security limits applied by default on ubuntu
19:47:00 jenkins master ssh's into the slave host, runs a java process, and that process runs jobs.
19:47:01 but not for su
19:47:34 on the devstack nodes, that _job_ would su to another user (stack) who might also su to root to run devstack..
19:47:50 but since that happens on single use slaves with job timeouts, it's not such a priority.
19:48:02 so other than determining what a sane number for a process limit is, the ulimit stuff is not very scary
19:48:19 cgroups on the other hand have the potential to be great fun
19:48:29 clarkb: 640kbytes should be enough for anyone!
19:48:49 so we should probably monitor process count during, say, a nova unit test run.
19:49:02 the current cgroup change https://review.openstack.org/#/c/9074/ adds support for memory limits for the jenkins user on jenkins slaves but does not apply them in site.pp
19:49:20 jeblair: good idea
19:49:58 clarkb: how do you think we should apply the cgroups change?
19:50:07 carefully or recklessly? :)
19:50:12 the cgroup configuration sets a soft memory limit of 512MB of memory for the jenkins user. This comes into play if there is any memory contention on the box
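
(Illustrative aside: a minimal sketch of the per-user process cap discussed above, assuming pam_limits applied on ssh logins as Ubuntu does by default. In practice the puppet ulimits module would manage this; the 2048 figure is a placeholder, since the meeting left the real number to be measured during a nova unit test run.)

    #!/bin/bash
    # Hypothetical sketch: cap the number of processes for the jenkins user.
    cat <<'EOF' | sudo tee /etc/security/limits.d/99-jenkins.conf
    jenkins  hard  nproc  2048
    EOF

    # pam_limits applies this on ssh logins (how the Jenkins master reaches the
    # slave), but not on su, which is why the devstack 'stack' user is unaffected.
    # Verify from a fresh ssh session as jenkins with:  ulimit -u
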
19:50:36 so jenkins would be limited to 512MB if something else was making the machine unhappy.
19:50:50 it also applies a hard limit of 75% of the physical memory on the machine
19:51:22 the hard limit is more dangerous, because by default OOM killer will be invoked to clean up jenkins' processes if it doesn't free memory when asked nicely
19:52:02 we can disable OOM killer which will cause memory overruns to force processes to sleep when they need more memory
19:52:41 or we can completely redo the numbers. I think not setting a hard limit and only setting a soft limit to 75% of physical memory would be safer
19:53:11 jeblair: I was thinking carefully would be best :)
19:53:32 so what happens if the soft limit is reached?
19:53:32 maybe add a node definition for a specific jenkins_slave (more specific than the current glob) and see how that host does
19:53:55 clarkb: that's a good idea. then we can easily disable that node if it causes problems.
19:53:59 jeblair: soft limit only applies if there is memory contention on the host. In that case it acts like a hard limit
19:55:24 clarkb: contention including swap?
19:55:34 LinuxJedi: I think so
19:55:59 I know HP Cloud only applies to devstack, but we give those nodes like 100GB of swap due to the way the disks are configured
19:56:27 http://www.mjmwired.net/kernel/Documentation/cgroups/memory.txt kernel documentation doesn't quite spell out all of the details
19:56:28 * LinuxJedi really doesn't want to be using 100GB of swap on anything ;)
19:57:00 in that case we can set hard limits that are larger than 75% of physical memory
19:57:08 maybe physical memory * 2
19:57:23 jeblair: what do you think?
19:57:25 well, i don't want to be swapping at all really. :)
19:58:09 perhaps a hard limit of 90%?
19:58:15 jeblair: ok
19:58:22 sounds good to me
19:58:23 at 5G, that leaves 400M for the rest of the system, which seems reasonable.
19:58:27 4G, that is.
19:58:33 I will update the change after lunch with what that looks like
19:58:34 we can always tweak it if it causes pain
19:58:43 but I feel safe with that
19:59:14 ++
19:59:17 ok. and let's do clark's idea of applying it to just one jenkins slave
19:59:30 sounds good
19:59:42 1 sec and i'll pick one.
20:00:18 * LinuxJedi watches jeblair use the scientific method of closing eyes and pointing to a random machine on the screen
20:00:52 precise8
20:01:12 LinuxJedi: close -- gate-nova-python27 runs there a lot. :)
20:01:15 hrm hrm.
20:01:24 o/
20:01:34 jeblair: time to call that meeting to an end :)
20:01:42 mtaylor: time to call that meeting to an end :)
20:01:46 kk
20:01:50 #stopmeeting
20:01:51 mtaylor: !
20:01:52 thanks guys!
20:01:56 #endmeeting
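
(Illustrative aside: a shell rendering of the memory limits agreed above - 512MB soft, hard limit at 90% of physical RAM - using the cgroup v1 memory controller directly. The real implementation is the puppet change at https://review.openstack.org/#/c/9074/; the group name, mount path, and attachment mechanism here are assumptions.)

    #!/bin/bash
    # Hypothetical sketch: apply the discussed memory limits to a "jenkins" cgroup.
    PHYS_KB=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
    HARD_BYTES=$((PHYS_KB * 1024 * 90 / 100))   # hard limit: 90% of physical RAM
    SOFT_BYTES=$((512 * 1024 * 1024))           # soft limit: 512MB, only enforced
                                                # when the host is under contention

    sudo mkdir -p /sys/fs/cgroup/memory/jenkins
    echo "$SOFT_BYTES" | sudo tee /sys/fs/cgroup/memory/jenkins/memory.soft_limit_in_bytes
    echo "$HARD_BYTES" | sudo tee /sys/fs/cgroup/memory/jenkins/memory.limit_in_bytes

    # The jenkins user's processes would then be attached to the group, e.g. by a
    # login hook writing their PIDs into /sys/fs/cgroup/memory/jenkins/cgroup.procs.
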