19:01:38 <jeblair> #startmeeting infra
19:01:39 <openstack> Meeting started Tue Aug 20 19:01:38 2013 UTC and is due to finish in 60 minutes.  The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:40 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:42 <openstack> The meeting name has been set to 'infra'
19:01:57 <pleia2> o/
19:02:12 <anteaya> o/
19:02:39 <jeblair> hi, i'd like to scrap our normal agenda and instead check in on the things we're doing to deal with the high load this week, and then go back to doing those things.
19:02:46 <mordred> o/
19:02:51 <mordred> jeblair: ++
19:02:53 <jeblair> any objections?
19:03:05 <fungi> wfm
19:03:20 <jeblair> okay, i'll start it off then
19:03:23 <ttx> high load this week and the next and the next
19:03:25 <pleia2> wfm
19:03:27 <jeblair> #topic nodepool
19:03:29 <jeblair> ttx: indeed :)
19:03:45 <jeblair> so, the good news is that we managed to get nodepool running in time to meet monday's rush
19:03:51 <ttx> should ~calm down once Featurefrozen on september 4
19:03:55 <jeblair> the bad news is that nodepool was running during monday's rush
19:04:39 <jeblair> on the one hand, it did help us actually provide the test nodes that jenkins needed
19:04:51 <zaro> o/
19:05:07 <jeblair> so as far as our concern about having enough test resources for this rush, i think we managed to do that
19:05:27 <clarkb> we (mostly jeblair) added 16 new precise slaves as well
19:05:34 <jeblair> but it showed us where the next problems are:
19:06:31 <jeblair> * nodepool is easily capable of hitting rate limits; i'm working on fixing that now (by serializing novaclient calls)
19:06:58 <jeblair> * it might be stressing jenkins a bit if it does a lot of deletes at once
19:07:50 <jeblair> * running so many jobs has increased the load on review.o.o, where all the jobs fetch repo updates (before they fetch changes from zuul)
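
A minimal sketch of what serializing the novaclient calls (the first point) could look like; the class and wrapper methods here are illustrative assumptions, not the actual nodepool change:

    import threading

    class SerializedProviderClient(object):
        """Wrap a novaclient client so only one API call runs at a time.

        Serializing calls keeps a burst of create/delete requests from
        tripping the provider's API rate limit.  (Names here are assumed
        for illustration; the real nodepool change may differ.)
        """

        def __init__(self, client):
            self._client = client
            self._lock = threading.Lock()

        def create_server(self, *args, **kwargs):
            # Hold the lock for the duration of the API call so concurrent
            # node launches are issued one at a time.
            with self._lock:
                return self._client.servers.create(*args, **kwargs)

        def delete_server(self, server_id):
            with self._lock:
                return self._client.servers.delete(server_id)
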
19:08:26 <jeblair> so i'm pretty sure we can get a new version of nodepool that is a bit more stable and friendly to providers (first point) in today
19:08:39 <jeblair> after that, we may think about the second point (jenkins)
19:08:48 <jeblair> and we'll talk about the third point in a minute...
19:08:54 <jeblair> any questions about nodepool?
19:09:04 <clarkb> jeblair: is it still running az1 in a decreased capacity?
19:09:05 <jeblair> #link http://tinyurl.com/kmotmns
19:09:21 <jeblair> there's a graph, btw, of what nodepool has been doing ^
19:09:31 <pleia2> wow
19:09:32 <fungi> what specific jenkins stress were we seeing during mass node deletion bursts?
19:09:45 <fungi> i think i must have missed that specific symptom
19:10:11 <jeblair> clarkb: yes, it's configured to only spawn up to 2 nodes at a time there, basically to keep us from hitting a rate limit (which seems to be different there)
19:10:28 <anteaya> is the graph time UTC?
19:11:13 <jeblair> fungi: we saw jenkins get very slow at one point, possibly related to node deletions.  it's also possible it was related to a bug in the gearman-plugin.
19:11:24 <fungi> given the short time-to-deletion after completion of a job, i assume it's best to interpret "used" as really being "in-use" for the most part
19:11:40 <mordred> that's how I'm reading it
19:11:43 <fungi> unlike in the past where used slaves often stuck around for hours
19:11:43 <jeblair> fungi: i don't have much more specific than that, it's more of 'keep an eye on this' at this point
19:11:48 <mordred> like, 'not waiting in pool'
19:11:50 <fungi> jeblair: thanks
19:12:06 <jeblair> fungi: indeed, and even if it has trouble deleting, it tries to set the node to the delete state asap
19:12:15 <jeblair> you can see it had such trouble late yesterday (az1)
19:12:31 * fungi nods
19:12:48 <jeblair> anteaya: i have no idea what the timezone is.  :(
19:13:04 <clarkb> jeblair: anteaya I think it is local to whatever your browser is giving out in its request headers
19:13:18 <jeblair> anteaya: it does not look like localtime or utc for me.
19:13:27 <anteaya> jeblair: okay, trying to understand the dip at 9pm chart time
19:13:35 <jeblair> to me, it looks like "now" is just past 2pm.
19:13:39 <anteaya> maybe that was when you restarted zuul?
19:13:52 <anteaya> clarkb: ah okay
19:14:12 <fungi> looks like utc-0500 to me, which is odd because my machine wants utc everywhere and i'm currently in utc-0700
19:14:13 <anteaya> jeblair: yes, I have "now" at just past 2pm too
19:14:35 <fungi> but not entirely relevant for now
19:14:40 <anteaya> k
19:14:41 <zaro> jeblair: are we deleting or just putting slaves offline?
19:14:43 <jeblair> we should probably do something about that.
19:14:49 <jeblair> anteaya: yes, that dip was when zuul got stuck.
19:14:56 <anteaya> jeblair: okay thanks
19:15:31 <jeblair> zaro: nodepool deletes them 60 seconds after the job finishes (and gearman-plugin takes them offline)
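
A rough sketch of the delayed-delete behavior described above, assuming a node object with state and state_time attributes (names are illustrative, not nodepool's actual code):

    import time

    DELETE_DELAY = 60  # seconds after a job finishes before the node is deleted

    def cleanup_used_nodes(nodes, delete_node):
        """Delete nodes whose jobs finished more than DELETE_DELAY ago.

        'nodes' is assumed to be a list of objects with .state and
        .state_time attributes; 'delete_node' is whatever actually
        removes the server from the provider.
        """
        now = time.time()
        for node in nodes:
            if node.state == 'used' and now - node.state_time > DELETE_DELAY:
                delete_node(node)
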
19:16:13 <jeblair> #topic zuul
19:16:30 <jeblair> so zuul got stuck last night; it was processing its queues VERY slowly
19:16:38 <jeblair> and our logging was insufficient to determine why
19:17:11 <jeblair> it could be related to review.o.o being slow, but i don't think it's a direct relation (because it didn't get stuck early when we were most busy)
19:17:22 <clarkb> I have written a change that will dump each running thread's stack into the debug log on SIGUSR2
19:17:31 <clarkb> #link https://review.openstack.org/#/c/42959/
19:17:48 <jeblair> clarkb: thanks, that should help us if we are in that kind of situation again
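
For reference, a generic version of the SIGUSR2 stack-dump technique in clarkb's change looks roughly like this (the logger name is an assumption; see the linked review for the real patch):

    import logging
    import signal
    import sys
    import threading
    import traceback

    log = logging.getLogger("zuul.stack_dump")  # logger name is an assumption

    def stack_dump_handler(signum, frame):
        # Write every running thread's current stack to the debug log so a
        # stuck process can be inspected without restarting it.
        threads = dict((t.ident, t.name) for t in threading.enumerate())
        for ident, stack in sys._current_frames().items():
            log.debug("Thread %s (%s):\n%s", ident,
                      threads.get(ident, "unknown"),
                      "".join(traceback.format_stack(stack)))

    signal.signal(signal.SIGUSR2, stack_dump_handler)
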
19:18:22 <jeblair> i also want (me or someone else) to go through the key points in the new queue processor and add some missing log entries
19:18:52 <jeblair> (because that will give us timing info, which would be very useful)
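
The timing information jeblair mentions can come from log lines bracketing each key step, along these lines (a generic sketch, not the actual zuul queue processor):

    import logging
    import time

    log = logging.getLogger("zuul.scheduler")  # logger name is an assumption

    def process_one_item(item, handler):
        # Log entry and exit with elapsed time so slow steps stand out
        # in the debug log.
        log.debug("Processing queue item: %s", item)
        start = time.time()
        handler(item)
        log.debug("Done with queue item %s in %0.3fs", item, time.time() - start)
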
19:19:31 <jeblair> anything else re zuul?
19:19:55 <clarkb> Alex_Gaynor is reporting that jobs finishing and being recorded in zuul is slow
19:20:10 <clarkb> I believe this is potentially related to the blockage yesterday
19:20:14 <jeblair> clarkb: currently?
19:20:21 <clarkb> jeblair: yeah see -infra
19:20:51 <jeblair> clarkb: well, if we have to restart again, we'll make sure we get your patch in at least
19:21:10 <jeblair> #topic git.o.o
19:21:17 <clarkb> it isn't stuck like before, but does seem to be slow. There are a few jobs that may have been lost due to the slowness.
19:21:42 <fungi> i wonder if this slowness is a precursor to the extreme slowness we saw later yesterday
19:22:19 <pleia2> having some trouble putting the git daemon behind an haproxy because specifying an alternate port while running it from xinetd is a bit of a mess
19:22:32 <fungi> load on review.o.o is up >100 again, fwiw
19:22:45 <pleia2> so currently working on just turning it into a service (writing init script now) and calling that specifically from puppet
19:22:45 <mordred> yoy
19:22:58 <mordred> pleia2: cool
19:23:07 <pleia2> that way we can specify port and stuff without doing something awful like modifying /etc/services
19:23:13 <clarkb> pleia2: then we can rely on haproxy for DDoS mitigation as it gives us a slightly more flexible set of tools to work with
19:23:20 <pleia2> yep
19:23:55 <clarkb> jeblair: I think we should merge the d-g change to use git.o.o https
19:24:13 <clarkb> that will hopefully alleviate some of the stress on review.o.o
19:24:27 <jeblair> clarkb, pleia2: how should we work that into horizontally-scaled git.o.o?
19:24:40 <jeblair> clarkb: well, yes, but it won't merge for a long time at this point.
19:24:48 <fungi> clarkb: load on review.o.o looks like it may primarily be concurrent git-upload-pack and git-pack-objects activities
19:24:58 <clarkb> jeblair: should we force it in?
19:25:40 <jeblair> clarkb: i'd be open to that.  it did pass tests.
19:25:56 <clarkb> jeblair: I think there are two possible ways of using haproxy in a horizontally scaled git.o.o. Either use the current haproxy stuff as a model for balancing several git daemons and run the balancing ourselves. Or use lbaas to balance a bunch of haproxies in front of git daemon
19:26:23 <clarkb> jeblair: I like the second option, as it allows us to keep our haproxy simple while taking advantage of our cloud services
19:26:59 <jeblair> clarkb: in option 2, would our haproxies be doing anything?
19:27:15 <clarkb> jeblair: they would be doing queueing
19:27:29 <jeblair> clarkb: possibly sub-optimally, though, right?
19:28:02 <fungi> clarkb: what if we're doing http(s) instead of git protocol though... then does haproxy buy us anything over what apache's already doing?
19:28:33 <clarkb> jeblair: yes, however I think haproxy has a dynamic round robin balance type that should shift load away from a server with a backed up queue
19:28:46 <clarkb> fungi: load balancing across many apaches
19:28:49 <clarkb> fungi: and we can do both
19:29:28 <fungi> oh, i guess if haproxy on the server is intelligently communicating load information back to the central lbaas then that makes sense
19:29:49 <clarkb> jeblair: fungi pleia2 my vote is we push the d-g change in to use https://git.o.o. Then we can make a better assessment of whether or not we need to keep focusing on git:// to get past the feature freeze
19:30:00 <fungi> clarkb: i concur
19:30:00 <pleia2> clarkb: +1
19:30:07 <jeblair> sounds good to me
19:30:31 <jeblair> the other part of git.o.o i'd like to consider is 3 aspects of how the server should be tuned:
19:31:06 <jeblair> 1) the git repo itself; we should look into things like packed-refs and decide if there's any other tuning on the repo itself we should do
19:31:51 <jeblair> 2) concurrency -- we should find out how many simultaneous git operations a server can handle (this is probably mostly about filesystem access/locking, etc)
19:32:24 <jeblair> it's also probably a tradeoff between number of clients and speed, so i think there may need to be some benchmarking there
19:32:32 <jeblair> with different values
19:33:05 <jeblair> 3) server sizing -- we have a huge server for git.o.o, but it's possible we could serve just as many git repos with a smaller one
19:33:11 <jeblair> similar benchmarking there, i think
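
For the concurrency question in point 2, a simple benchmark could time N simultaneous clones against a candidate server, something like the following (the repository URL and concurrency value are placeholders):

    import subprocess
    import tempfile
    import threading
    import time

    REPO = "https://git.openstack.org/openstack-infra/zuul"  # placeholder URL
    CONCURRENCY = 8  # number of simultaneous clones to test

    def timed_clone(results, index):
        # Clone into a scratch directory and record how long it took.
        start = time.time()
        dest = tempfile.mkdtemp()
        subprocess.check_call(["git", "clone", "--quiet", REPO, dest])
        results[index] = time.time() - start

    def benchmark():
        results = [None] * CONCURRENCY
        threads = [threading.Thread(target=timed_clone, args=(results, i))
                   for i in range(CONCURRENCY)]
        start = time.time()
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print("total %.1fs, per-clone times %s" % (time.time() - start, results))

    if __name__ == "__main__":
        benchmark()
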
19:33:17 <clarkb> ++
19:33:17 <mordred> agree
19:33:22 <fungi> yep
19:33:27 <pleia2> yeah
19:34:03 <jeblair> ok, any other topics?
19:34:13 <clarkb> mysql backups
19:34:20 <jeblair> #topic mysql backups
19:34:28 <mordred> jeblair: re: 2) figuring out the trade off between artificially reducing concurrency
19:34:45 <mordred> and how much faster that may make individual operations - for reducing queue size
19:34:49 <mordred> I think should be factored in
19:34:51 <clarkb> the new mysql backup define is running on etherpad.o.o and etherpad-dev.o.o.
19:34:53 <jeblair> mordred: ++
19:34:56 <clarkb> #link https://review.openstack.org/#/c/42785/
19:35:22 <clarkb> that change will make the cron job quieter at the expense of possibly bit bucketing something important on stderr. I am open to suggestions on how to deal with this better
19:35:56 <clarkb> I think once the problem 42785 addresses is corrected we are ready to start adding this to the gerrit servers and anywhere else we believe the mysql db is important
19:36:33 <fungi> it's possible to filter stderr through grep in the cron job to just ignore patterns you don't care about on an individual basis
19:37:58 <clarkb> I am not as familiar as I should be with bup, what is the process of adding off host backups once we have the DB tarballs?
19:38:14 <jeblair> clarkb: http://ci.openstack.org/sysadmin.html#backups
19:38:19 <clarkb> jeblair: danke
19:38:56 <clarkb> fungi: yeah, could do that too, will need to brush up on bourne to figure out what they will look like
19:39:00 <jeblair> clarkb: we could probably puppet some of that too.
19:39:10 <jeblair> clarkb: fungi left a link in your review
19:39:25 <fungi> clarkb: i linked an example in that comment from another place i had tested it and gotten it working previously
19:40:56 <jeblair> anything else?
19:40:59 <clarkb> fungi: the problem here is I am already piping stdout to gzip so I need to figure out some tee magic or something along those lines
19:41:12 <clarkb> fungi: or write a proper script
19:41:17 <clarkb> jeblair: no I think that is it for backups
19:41:56 <fungi> clarkb: ahh, yeah i'm redirecting stdout to a file and stderr to a pipe in my example
19:42:13 <fungi> clarkb: two pipes may add a little extra complexity there
19:43:19 <dhellmann> clarkb: I think you need to use a fifo
19:43:50 <fungi> yeah, that's the first solution i'd reach for, but maybe overkill in this case
19:44:06 <fungi> perhaps mysqldump has options to skip specific tables
19:44:23 <mordred> it does
19:44:30 <mordred> and I'm thinking that might be a better option
19:44:39 <mordred> than writing crazy output filtering/processing things
19:45:02 <clarkb> ok I will hunt down that option
19:45:12 <fungi> yeah, if we can tell mysqldump it doesn't need to care about the events table, then so much the better
19:45:38 <mordred> clarkb: --ignore-table=name
19:45:43 <clarkb> thanks
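
Roughly what the agreed-on backup step could look like; the database name, ignored table, and output path below are assumptions, and connection credentials are omitted:

    import gzip
    import subprocess

    DB = "etherpad"                # database to dump (assumption)
    IGNORE = "mysql.event"         # the noisy "events" table (assumed name)
    OUTPUT = "/var/backups/mysql_backups/%s.sql.gz" % DB  # assumed path

    def dump_database():
        # Run mysqldump with --ignore-table so the table producing the
        # stderr noise is skipped entirely, and gzip the output.
        proc = subprocess.Popen(
            ["mysqldump", "--ignore-table=" + IGNORE, DB],
            stdout=subprocess.PIPE)
        with gzip.open(OUTPUT, "wb") as out:
            for chunk in iter(lambda: proc.stdout.read(65536), b""):
                out.write(chunk)
        if proc.wait() != 0:
            raise RuntimeError("mysqldump exited %d" % proc.returncode)
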
19:49:04 <clarkb> Anyone else or are we ready to jump back into making things work better?
19:49:40 <jeblair> thanks everyone!
19:49:42 <jeblair> #endmeeting