19:02:53 <jeblair> #startmeeting infra
19:02:53 <openstack> Meeting started Tue Aug 27 19:02:53 2013 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:02:54 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:02:56 <openstack> The meeting name has been set to 'infra'
19:03:08 <jeblair> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting
19:03:08 <zaro> \o
19:03:12 <jeblair> #link http://eavesdrop.openstack.org/meetings/infra/2013/infra.2013-08-20-19.01.html
19:03:26 <jeblair> #topic Operational issues update (jeblair)
19:03:55 <jeblair> so i figured let's start with updates on all the exciting things from last week
19:04:18 <jeblair> nodepool is now running the code in git master
19:04:23 <jeblair> and the config file in puppet
19:04:39 <jeblair> so system administration on that host has returned to normal (fully puppet managed)
19:04:54 <jeblair> and it seems to be doing a pretty good job
19:05:16 <jeblair> there is one new change that went into a restart this morning that hasn't seen production testing, and that's fixing the image cleanup code
19:05:35 <jeblair> so that will either work, do nothing, or delete all the images nodepool uses and everything will stop.
19:05:49 <jeblair> we'll know soon. :)
19:05:58 <anteaya> let's hope for one of the first two options
19:06:06 <clarkb> jeblair: so much confidence :)
19:06:27 <clarkb> jeblair: can you link to the change?
19:07:01 <jeblair> #link https://review.openstack.org/#/c/43623/
19:07:48 <jeblair> clarkb: after we fixed git.o.o, i think the last lingering issues we know about were unstable jobs due to static.o.o and lost jobs due to jenkins not being able to talk to slaves
19:07:54 <jeblair> sound about right?
19:07:59 <clarkb> jeblair: yup
19:08:13 <clarkb> both of which should be addressed as of this morning right?
19:08:17 <jeblair> i believe we have worked around the lost jobs issue by having zuul detect that situation and re-launch the job
19:08:37 <jeblair> that change has been in production for a bit
19:09:03 <jeblair> long enough that i believe i've seen it work (it takes a bit to track down because it does try to be invisible to the user)
19:09:31 <jeblair> and then for static, we moved our intensive filesystem maintenance (compressing and deleting logs) to the weekend
19:09:43 <jeblair> which is a stopgap, but a good one, i think.
19:10:05 <clarkb> and you also spun up a new larger static node with working ipv6 and grew the filesystems that store data
19:10:06 <jeblair> it should be fine until we have a smarter log receiving/publishing service
19:10:24 <clarkb> did the new node get a AAAA record in DNS?
19:10:26 <jeblair> clarkb: yes, the additional cpus on static should help if we see contention there
19:10:40 <jeblair> clarkb: yes, that happened too
19:11:09 <jeblair> so the status and logs servers are now reachable via ipv6
19:11:14 <jeblair> and pypi
19:12:00 <jeblair> i think all of the bottlenecks we saw last week have been addressed, and so as we're pushing further up the stack
19:12:12 <jeblair> the current bottleneck is zuul preparing to merge changes
19:12:30 <jeblair> it takes 1 minute to process a change for nova before it even starts tests
19:12:43 <jeblair> we just merged some patches to zuul to make that much, much smaller
19:13:01 <jeblair> and i plan on restarting zuul this afternoon to pick it, and a bunch of other small bugfixes up
19:13:21 <jeblair> it will be a disruptive restart, because the graceful shutdown is currently broken, but that's one of the bugfixes
19:13:31 <jeblair> so hopefully it'll be better
19:13:53 <clarkb> it will be nice to get those fixes in
19:14:09 <jeblair> and i think after that, we'll probably be pretty selective about zuul upgrades as we approach h3
19:14:57 <jeblair> clarkb: want to describe the current git server config?
19:15:04 <clarkb> jeblair: I think we are getting really close to being selective about all changes
19:15:07 <clarkb> sure
19:15:22 <pleia2> http://ci.openstack.org/git.html has super basics
19:15:28 <pleia2> (might want to update it now :))
19:15:52 <clarkb> one of the bottlenecks we ran into last week was fetching git refs from review.openstack.org. It caused load averages >200 on review.o.o frequently which was bad for tests and reviewers
19:16:38 <clarkb> to work around this we pushed pleia2's cgit server into production quickly but it quickly got bogged down as well. To work around that we put an haproxy load balancer in front of 4 identical cgit servers
19:17:18 <clarkb> today we have one haproxy node balancing 4 git servers. The git servers are running git-daemon, apache, and cgit for all of your git needs (git:// http(s) cgit browsability)
19:17:51 <clarkb> In getting that going we discovered that having a lot of loose refs files made git on centos very slow. So we are packing all refs once per day and it makes a major difference
19:18:18 <clarkb> #link http://cacti.openstack.org/cacti/graph_view.php?action=tree&tree_id=2
19:18:21 <jeblair> and that's only on the mirror; the repos in gerrit still have their refs unpacked
19:18:37 <jeblair> (which has come in handy in the past, when we removed all the zuul refs)
19:18:47 <clarkb> #link https://git.openstack.org/cgit
19:19:21 <clarkb> thank you pleia2 for getting the base puppet stuff for that going. It ended up being quite flexible when we needed haproxy in front of it
19:19:21 <jeblair> yeah, that graph suggests we're a little overprovisioned, but i think that's a good place to be for h3, so i'm in favor of leaving it as is and seeing how those 4 servers perform
19:19:29 <clarkb> jeblair: ++
19:20:20 <jeblair> oh, according to rackspace there could be network disruption tomorrow
19:21:11 <jeblair> August 28th from 12:01 - 4:00 AM CDT
19:21:36 <clarkb> reading the announcement it didn't appear like it would be serious
19:21:50 <clarkb> but the possibility for network outages of up to 10 minutes is there
19:22:22 <jeblair> anything else about operational issues?
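The daily ref packing clarkb describes above is the key maintenance step for the git mirrors. The sketch below shows what such a job could look like; the production version is puppet-managed cron, and the /var/lib/git mirror path and flat repository layout here are assumptions for illustration only.

```python
#!/usr/bin/env python
# Minimal sketch of a daily ref-packing job: run `git pack-refs --all`
# in each mirrored repository so lookups do not have to crawl thousands
# of loose ref files. MIRROR_ROOT is a hypothetical path, not the
# production layout.
import os
import subprocess

MIRROR_ROOT = '/var/lib/git'  # hypothetical location of the cgit mirrors


def pack_all_refs(root):
    for dirpath, dirnames, filenames in os.walk(root):
        # a bare repository is recognizable by its HEAD file and objects dir
        if 'HEAD' in filenames and 'objects' in dirnames:
            subprocess.check_call(['git', 'pack-refs', '--all'], cwd=dirpath)
            dirnames[:] = []  # don't descend into the repository itself


if __name__ == '__main__':
    pack_all_refs(MIRROR_ROOT)
```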
19:22:59 <jeblair> ttx: ^ there's an update to catch you up
19:23:22 <jeblair> #topic Backups (jeblair)
19:23:31 <jeblair> this may be more of a clarkb topic at this point
19:23:51 <jeblair> (but i'll just add that with groups-dev, we may have our first trove database that we want to backup)
19:24:12 <clarkb> we have a puppet module that adds a define to mysqldump mysql servers and gzip that dump allowing us to do our own backups
19:24:17 <ttx> jeblair: ack
19:24:44 <clarkb> it is currently running on etherpad and etherpad-dev. A change to make the cron quiet merged Sunday so I think it is ready to go onto review-dev and review
19:25:03 <jeblair> clarkb: and maybe add it to wiki (which is already being backed up)?
19:25:05 <clarkb> it will need a little work to backup trove DBs but nothing major (use username and password instead of a defaults file)
19:25:24 <clarkb> jeblair: oh yes. Are there any other hosts running mysql that need backups?
19:25:28 <jeblair> clarkb: on what host do you think we should do the mysqldumps for trove?
19:25:39 <jeblair> clarkb: etherpad?
19:26:01 <clarkb> jeblair: I was thinking that running the trove backups on the server consuming the trove DB would be easiest to keep the DB backups with the backups for that server
19:26:14 <clarkb> jeblair: but that assumes one trove DB per app and not multitenancy
19:26:14 <jeblair> clarkb: sounds reasonable
19:26:34 <jeblair> clarkb: i think we can do that.
19:26:49 <clarkb> that way you don't have to think too hard in a recovery situation
19:27:46 <clarkb> list of things that need backups: review(-dev), wiki, etherpad(-dev)
19:27:50 <jeblair> #topic Tarballs move (jeblair)
19:28:03 <jeblair> i think we decided to defer this for a while, maybe till after h3, yeah?
19:28:21 <clarkb> yeah, it isn't super important but is definitely nice to have
19:28:28 <jeblair> #topic Asterisk server (jeblair, pabelanger, russelb)
19:28:36 * russellb perks up
19:28:39 <jeblair> so this took a back seat to everything blowing up last week
19:29:01 <jeblair> but i'm thinking i should be able to spin up those other servers today
19:29:12 <russellb> cool, sounds good
19:29:23 <jeblair> so we can test if the latency is better from a couple different network points
19:29:33 <clarkb> jeblair: will that include hpcloud server(s)?
19:29:49 <russellb> need to identify some sort of ... metric for how to compare the different test systems
19:29:57 <russellb> really what we're after is audio quality
19:30:09 <russellb> system load didn't seem to be a big concern
19:30:28 <russellb> but we don't really have a good tool other than our perceived quality of the call
19:30:35 <jeblair> clarkb: sure; it'll be good to collect data. if we _love_ them at hpcloud then we'll have to deal with the question of hpcloud's SLA, but we can kick that down the road
19:31:02 <jeblair> russellb: i agree, especially since actual network latency between the voip provider and pbx was minimal
19:31:13 <russellb> yeah, i don't think that was it ...
19:31:41 <jeblair> russellb: so we may need to have a series of calls and do a subjective test?
19:31:56 <russellb> yeah
19:32:25 <jeblair> i'm planning on varying the size and data center of the servers i spin up
19:32:33 <anteaya> where can we post the agreed upon times for testing the calls? can we put them on the wiki page for now?
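The backup define discussed under the Backups topic above boils down to a mysqldump piped through gzip on a cron. A rough sketch of that flow follows; the production version is a puppet define, and the defaults file, backup directory, and database name here are assumptions for illustration.

```python
#!/usr/bin/env python
# Rough sketch of a mysqldump + gzip backup run. The credentials file,
# destination directory, and database name are hypothetical. For trove
# databases, the defaults file would be replaced by explicit
# --user/--password/--host arguments, as noted in the meeting.
import gzip
import subprocess
from datetime import date

DEFAULTS_FILE = '/etc/mysql/backup.cnf'  # hypothetical credentials file
BACKUP_DIR = '/var/backups/mysql'        # hypothetical destination


def dump_database(db_name):
    out_path = '%s/%s-%s.sql.gz' % (BACKUP_DIR, db_name, date.today())
    dump = subprocess.Popen(
        ['mysqldump', '--defaults-extra-file=%s' % DEFAULTS_FILE,
         '--single-transaction', db_name],
        stdout=subprocess.PIPE)
    # stream the dump through gzip instead of buffering it in memory
    with gzip.open(out_path, 'wb') as out:
        for chunk in iter(lambda: dump.stdout.read(65536), b''):
            out.write(chunk)
    if dump.wait() != 0:
        raise RuntimeError('mysqldump failed for %s' % db_name)


if __name__ == '__main__':
    dump_database('etherpad')  # e.g. the etherpad-lite database
```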
19:33:07 <anteaya> #link https://wiki.openstack.org/wiki/Infrastructure/Conferencing
19:33:25 <jeblair> anteaya: or we could send out another email to the -infra list
19:33:31 <anteaya> okay
19:34:06 <jeblair> anything else on this topic?
19:34:43 <jeblair> #topic open discussion
19:35:02 <clarkb> This is a long weekend for those of us in the US
19:35:32 <zaro> i will be on vacation starting tomorrow until 9/4
19:36:08 <zaro> new patch was submitted to upstream gerrit.
19:36:13 <zaro> #link https://gerrit-review.googlesource.com/#/c/48254/8
19:37:27 <jeblair> zaro: neat, looks like david pursehouse is working with you on it
19:38:25 <zaro> jeblair: yes, looking good so far, got one +1, and one -1 since new patch. -1 was just a nit pick.
19:39:48 <jeblair> anything else?
19:40:05 <anteaya> great work this week jeblair and clarkb
19:40:16 <anteaya> *applause*
19:40:40 <clarkb> nothing from me
19:40:45 <jeblair> anteaya: thanks for your help :)
19:40:49 <pleia2> thanks jeblair
19:40:51 <anteaya> welcome
19:40:57 <anteaya> :D
19:41:02 <jeblair> #endmeeting