19:01:38 <jeblair> #startmeeting infra
19:01:39 <openstack> Meeting started Tue Aug 20 19:01:38 2013 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:40 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:42 <openstack> The meeting name has been set to 'infra'
19:01:57 <pleia2> o/
19:02:12 <anteaya> o/
19:02:39 <jeblair> hi, i'd like to scrap our normal agenda and instead check in on the things we're doing to deal with the high load this week, and then go back to doing those things.
19:02:46 <mordred> o/
19:02:51 <mordred> jeblair: ++
19:02:53 <jeblair> any objections?
19:03:05 <fungi> wfm
19:03:20 <jeblair> okay, i'll start it off then
19:03:23 <ttx> high load this week and the next and the next
19:03:25 <pleia2> wfm
19:03:27 <jeblair> #topic nodepool
19:03:29 <jeblair> ttx: indeed :)
19:03:45 <jeblair> so, the good news is that we managed to get nodepool running in time to meet monday's rush
19:03:51 <ttx> should ~calm down once feature frozen on september 4
19:03:55 <jeblair> the bad news is that nodepool was running during monday's rush
19:04:39 <jeblair> on the one hand, it did help us actually provide the test nodes that jenkins needed
19:04:51 <zaro> o/
19:05:07 <jeblair> so as far as our concern for being able to have enough test resources for this rush, i think we managed to do that
19:05:27 <clarkb> we (mostly jeblair) added 16 new precise slaves as well
19:05:34 <jeblair> but it showed us where the next problems are:
19:06:31 <jeblair> * nodepool is easily capable of hitting rate limits; i'm working on fixing that now (by serializing novaclient calls)
19:06:58 <jeblair> * it might be stressing jenkins a bit if it does a lot of deletes at once
19:07:50 <jeblair> * running so many jobs has increased the load on review.o.o, where all the jobs fetch repo updates (before they fetch changes from zuul)
19:08:26 <jeblair> so i'm pretty sure we can get a new version of nodepool that is a bit more stable and friendly to providers (first point) in today
19:08:39 <jeblair> after that, we may think about the second point (jenkins)
19:08:48 <jeblair> and we'll talk about the third point in a minute...
19:08:54 <jeblair> any questions about nodepool?
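One way to serialize the novaclient calls jeblair mentions in the first point, sketched with a shared lock and a minimum spacing between calls so concurrent node-launch and delete threads cannot burst past a provider's rate limit. This is an illustration of the idea only; the class and parameter names are not nodepool's actual code.

    # Sketch: force provider API calls through one lock, spaced out in time.
    import threading
    import time


    class SerializedAPI(object):
        def __init__(self, min_interval=1.0):
            self._lock = threading.Lock()
            self._min_interval = min_interval
            self._last_call = 0.0

        def call(self, func, *args, **kwargs):
            with self._lock:
                # Sleep off whatever remains of the interval since the last call.
                wait = self._min_interval - (time.time() - self._last_call)
                if wait > 0:
                    time.sleep(wait)
                try:
                    return func(*args, **kwargs)
                finally:
                    self._last_call = time.time()


    # Hypothetical usage from node-launching threads, assuming an existing
    # novaclient Client object named "nova":
    # api = SerializedAPI(min_interval=1.0)
    # server = api.call(nova.servers.create, name, image, flavor)
    # api.call(nova.servers.delete, server)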
19:09:04 <clarkb> jeblair: is it still running az1 in a decreased capacity?
19:09:05 <jeblair> #link http://tinyurl.com/kmotmns
19:09:21 <jeblair> there's a graph, btw, of what nodepool has been doing ^
19:09:31 <pleia2> wow
19:09:32 <fungi> what specific jenkins stress were we seeing during mass node deletion bursts?
19:09:45 <fungi> i think i must have missed that specific symptom
19:10:11 <jeblair> clarkb: yes, it's configured to only spawn up to 2 nodes at a time there, basically to keep us from hitting a rate limit (which seems to be different there)
19:10:28 <anteaya> is the graph time UTC?
19:11:13 <jeblair> fungi: we saw jenkins get very slow at one point, possibly related to node deletions. it's also possible it was related to a bug in the gearman-plugin.
19:11:24 <fungi> given the short time-to-deletion after completion of a job, i assume it's best to interpret "used" as really being "in-use" for the most part
19:11:40 <mordred> that's how I'm reading it
19:11:43 <fungi> unlike in the past where used slaves often stuck around for hours
19:11:43 <jeblair> fungi: i don't have much more specific than that, it's more of 'keep an eye on this' at this point
19:11:48 <mordred> like, 'not waiting in pool'
19:11:50 <fungi> jeblair: thanks
19:12:06 <jeblair> fungi: indeed, and even if it has trouble deleting, it tries to set the node to the delete state asap
19:12:15 <jeblair> you can see it had such trouble late yesterday (az1)
19:12:31 * fungi nods
19:12:48 <jeblair> anteaya: i have no idea what the timezone is. :(
19:13:04 <clarkb> jeblair: anteaya I think it is local to whatever your browser is giving out in its request headers
19:13:18 <jeblair> anteaya: it does not look like localtime or utc for me.
19:13:27 <anteaya> jeblair: okay, trying to understand the dip at 9pm chart time
19:13:35 <jeblair> to me, it looks like "now" is just past 2pm.
19:13:39 <anteaya> maybe that was when you restarted zuul?
19:13:52 <anteaya> clarkb: ah okay
19:14:12 <fungi> looks like utc-0500 to me, which is odd because my machine wants utc everywhere and i'm currently in utc-0700
19:14:13 <anteaya> jeblair: yes, I have "now" at just past 2pm too
19:14:35 <fungi> but not entirely relevant for now
19:14:40 <anteaya> k
19:14:41 <zaro> jeblair: are we deleting or just putting slaves offline?
19:14:43 <jeblair> we should probably do something about that.
19:14:49 <jeblair> anteaya: yes, that dip was when zuul got stuck.
19:14:56 <anteaya> jeblair: okay thanks
19:15:31 <jeblair> zaro: nodepool deletes them 60 seconds after the job finishes (and gearman-plugin takes them offline)
19:16:13 <jeblair> #topic zuul
19:16:30 <jeblair> so zuul got stuck last night; it was processing its queues VERY slowly
19:16:38 <jeblair> and our logging was insufficient to determine why
19:17:11 <jeblair> it could be related to review.o.o being slow, but i don't think it's a direct relation (because it didn't get stuck early when we were most busy)
19:17:22 <clarkb> I have written a change that will dump each running thread's stack into the debug log on SIGUSR2
19:17:31 <clarkb> #link https://review.openstack.org/#/c/42959/
19:17:48 <jeblair> clarkb: thanks, that should help us if we are in that kind of situation again
19:18:22 <jeblair> i also want (me or someone else) to go through the key points in the new queue processor and add some missing log entries
19:18:52 <jeblair> (because that will give us timing info, which would be very useful)
19:19:31 <jeblair> anything else re zuul?
19:19:55 <clarkb> Alex_Gaynor is reporting that jobs finishing and being recorded in zuul is slow
19:20:10 <clarkb> I believe this is potentially related to the blockage yesterday
19:20:14 <jeblair> clarkb: currently?
19:20:21 <clarkb> jeblair: yeah see -infra
19:20:51 <jeblair> clarkb: well, if we have to restart again, we'll make sure we get your patch in at least
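The change clarkb links (42959) dumps every running thread's stack into the debug log when the process receives SIGUSR2. A rough sketch of that kind of handler; the function and logger names here are illustrative and not necessarily what merged.

    import logging
    import signal
    import sys
    import threading
    import traceback

    log = logging.getLogger("zuul.stack_dump")


    def stack_dump_handler(signum, frame):
        # Map thread idents to names so the log entries are readable.
        names = dict((t.ident, t.name) for t in threading.enumerate())
        for ident, stack in sys._current_frames().items():
            log.debug("Thread %s (%s):\n%s", ident, names.get(ident, "unknown"),
                      "".join(traceback.format_stack(stack)))


    signal.signal(signal.SIGUSR2, stack_dump_handler)
    # Then "kill -USR2 <pid>" records all stacks without stopping the daemon.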
19:21:10 <jeblair> #topic git.o.o
19:21:17 <clarkb> it isn't stuck like before, but does seem to be slow. There are a few lost jobs that may be lost due to the slowness.
19:21:42 <fungi> i wonder if this slowness is a precursor to the extreme slowness we saw later yesterday
19:22:19 <pleia2> having some trouble putting the git daemon behind an haproxy because specifying an alternate port while running it from xinetd is a bit of a mess
19:22:32 <fungi> load on review.o.o is up >100 again, fwiw
19:22:45 <pleia2> so currently working on just turning it into a service (writing init script now) and calling that specifically from puppet
19:22:45 <mordred> yoy
19:22:58 <mordred> pleia2: cool
19:23:07 <pleia2> that way we can specify port and stuff without doing something awful like modifying /etc/services
19:23:13 <clarkb> pleia2: then we can rely on haproxy for DDoS mitigation as it gives us a slightly more flexible set of tools to work with
19:23:20 <pleia2> yep
19:23:55 <clarkb> jeblair: I think we should merge the d-g change to use git.o.o https
19:24:13 <clarkb> that will hopefully alleviate some of the stress on review.o.o
19:24:27 <jeblair> clarkb, pleia2: how should we work that into horizontally-scaled git.o.o?
19:24:40 <jeblair> clarkb: well, yes, but it won't merge for a long time at this point.
19:24:48 <fungi> clarkb: load on review.o.o looks like it may primarily be concurrent git-upload-pack and git-pack-object activities
19:24:58 <clarkb> jeblair: should we force it in?
19:25:40 <jeblair> clarkb: i'd be open to that. it did pass tests.
19:25:56 <clarkb> jeblair: I think there are two possible ways of using haproxy in a horizontally scaled git.o.o. Either use the current haproxy stuff as a model for balancing several git daemons and run the balancing ourselves. Or use lbaas to balance a bunch of haproxies in front of git daemon
19:26:23 <clarkb> jeblair: I like the second option, as it allows us to keep our haproxy simple while taking advantage of our cloud services
19:26:59 <jeblair> clarkb: in option 2, would our haproxies be doing anything?
19:27:15 <clarkb> jeblair: they would be doing queueing
19:27:29 <jeblair> clarkb: possibly sub-optimally, though, right?
19:28:02 <fungi> clarkb: what if we're doing http(s) instead of git protocol though... then does haproxy buy us anything over what apache's already doing?
19:28:33 <clarkb> jeblair: yes, however I think haproxy has a dynamic round robin balance type that should shift load away from a server with a backed up queue
19:28:46 <clarkb> fungi: load balancing across many apaches
19:28:49 <clarkb> fungi: and we can do both
19:29:28 <fungi> oh, i guess if haproxy on the server is intelligently communicating load information back to the central lbaas then that makes sense
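For reference on the balancing and queueing discussion above, an haproxy setup in front of several local git daemons might look roughly like the following. The ports, server names, and connection limits are placeholders, not the configuration that was actually deployed.

    defaults
        mode tcp
        timeout connect 5s
        timeout client  2m
        timeout server  2m

    frontend git_daemon
        bind *:9418
        default_backend git_daemons

    backend git_daemons
        balance leastconn     # steer new clients away from a backed-up daemon
        timeout queue 60s     # queue clients briefly instead of refusing them
        server git01 127.0.0.1:29418 check maxconn 32   # local git daemons on
        server git02 127.0.0.1:29419 check maxconn 32   # alternate ports (placeholders)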
19:29:49 <clarkb> jeblair: fungi pleia2 my vote is we push the d-g change in to use https://git.o.o. Then we can make a better assessment of whether or not we need to keep focusing on git:// to get past the feature freeze
19:30:00 <fungi> clarkb: i concur
19:30:00 <pleia2> clarkb: +1
19:30:07 <jeblair> sounds good to me
19:30:31 <jeblair> the other part of git.o.o i'd like to consider is 3 aspects of how the server should be tuned:
19:31:06 <jeblair> 1) the git repo itself; we should look into things like packed-refs and decide if there's any other tuning on the repo itself we should do
19:31:51 <jeblair> 2) concurrency -- we should find out how many simultaneous git operations a server can handle (this is probably mostly about filesystem access/locking, etc)
19:32:24 <jeblair> it's also probably a tradeoff between number of clients and speed, so i think there may need to be some benchmarking there
19:32:32 <jeblair> with different values
19:33:05 <jeblair> 3) server sizing -- we have a huge server for git.o.o, but it's possible we could serve just as many git repos with a smaller one
19:33:11 <jeblair> similar benchmarking there, i think
19:33:17 <clarkb> ++
19:33:17 <mordred> agree
19:33:22 <fungi> yep
19:33:27 <pleia2> yeah
19:34:03 <jeblair> ok, any other topics?
19:34:13 <clarkb> mysql backups
19:34:20 <jeblair> #topic mysql backups
19:34:28 <mordred> jeblair: re: 2) figuring out the trade off between artificially reducing concurrency
19:34:45 <mordred> and how much faster that may make individual operations - for reducing queue size
19:34:49 <mordred> I think should be factored in
19:34:51 <clarkb> the new mysql backup define is running on etherpad.o.o and etherpad-dev.o.o.
19:34:53 <jeblair> mordred: ++
19:34:56 <clarkb> #link https://review.openstack.org/#/c/42785/
19:35:22 <clarkb> that change will make the cron job quieter at the expense of possibly bit-bucketing something important on stderr. I am open to suggestions on how to deal with this better
19:35:56 <clarkb> I think once the problem 42785 addresses is corrected we are ready to start adding this to the gerrit servers and anywhere else we believe the mysql db is important
19:36:33 <fungi> it's possible to filter stderr through grep in the cron job to just ignore patterns you don't care about on an individual basis
19:37:58 <clarkb> I am not as familiar as I should be with bup, what is the process of adding off-host backups once we have the DB tarballs?
19:38:14 <jeblair> clarkb: http://ci.openstack.org/sysadmin.html#backups
19:38:19 <clarkb> jeblair: danke
19:38:56 <clarkb> fungi: yeah, could do that too, will need to brush up on bourne to figure out what they will look like
19:39:00 <jeblair> clarkb: we could probably puppet some of that too.
19:39:10 <jeblair> clarkb: fungi left a link in your review
19:39:25 <fungi> clarkb: i linked an example in that comment from another place i had tested it and gotten it working previously
19:40:56 <jeblair> anything else?
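Returning to jeblair's second tuning point (concurrency), the benchmarking could be as simple as timing N simultaneous clones against a candidate server for several values of N. A rough sketch; the repository URL is an assumed placeholder.

    import subprocess
    import tempfile
    import threading
    import time

    REPO = "https://git.openstack.org/openstack-infra/config"  # placeholder repo


    def clone_once(results, i):
        # Each worker clones into its own scratch directory and records its time.
        start = time.time()
        dest = tempfile.mkdtemp(prefix="git-bench-%d-" % i)
        subprocess.check_call(["git", "clone", "--quiet", REPO, dest])
        results[i] = time.time() - start


    def run(concurrency):
        results = {}
        threads = [threading.Thread(target=clone_once, args=(results, i))
                   for i in range(concurrency)]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        avg = sum(results.values()) / len(results) if results else float("nan")
        print("%2d clients: wall %5.1fs, avg per-clone %5.1fs"
              % (concurrency, time.time() - start, avg))


    for n in (4, 8, 16, 32):
        run(n)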
19:40:59 <clarkb> fungi: the problem here is I am already piping stdout to gzip so I need to figure out some tee magic or something along those lines
19:41:12 <clarkb> fungi: or write a proper script
19:41:17 <clarkb> jeblair: no I think that is it for backups
19:41:56 <fungi> clarkb: ahh, yeah i'm redirecting stdout to a file and stderr to a pipe in my example
19:42:13 <fungi> clarkb: two pipes may add a little extra complexity there
19:43:19 <dhellmann> clarkb: I think you need to use a fifo
19:43:50 <fungi> yeah, that's the first solution i'd reach for, but maybe overkill in this case
19:44:06 <fungi> perhaps mysqldump has options to skip specific tables
19:44:23 <mordred> it does
19:44:30 <mordred> and I'm thinking that might be a better option
19:44:39 <mordred> than writing crazy output filtering/processing things
19:45:02 <clarkb> ok I will hunt down that option
19:45:12 <fungi> yeah, if we can tell mysqldump it doesn't need to care about the events table, then so much the better
19:45:38 <mordred> clarkb: --ignore-table=name
19:45:43 <clarkb> thanks
19:49:04 <clarkb> Anyone else or are we ready to jump back into making things work better?
19:49:40 <jeblair> thanks everyone!
19:49:42 <jeblair> #endmeeting
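Combining the suggestions from the backup discussion above (mysqldump's --ignore-table plus clarkb's "write a proper script" option), the dump step could look roughly like the following. The database name, the skipped table, and the output path are placeholders.

    import gzip
    import shutil
    import subprocess
    import sys
    import tempfile

    DB = "etherpad"                              # placeholder database name
    OUT = "/var/backups/mysql/%s.sql.gz" % DB    # placeholder output path

    with tempfile.TemporaryFile() as errs:
        # Skip the noisy table entirely rather than filtering its warnings
        # out of stderr (table name assumed for illustration).
        proc = subprocess.Popen(
            ["mysqldump", "--ignore-table=%s.events" % DB, DB],
            stdout=subprocess.PIPE, stderr=errs)
        with gzip.open(OUT, "wb") as out:
            shutil.copyfileobj(proc.stdout, out)  # gzip the dump as it streams
        returncode = proc.wait()
        errs.seek(0)
        err = errs.read()

    if returncode != 0 or err:
        # Only make noise (and therefore cron mail) when something went wrong.
        sys.stderr.write(err.decode("utf-8", "replace"))
        sys.exit(returncode or 1)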