19:02:59 <jeblair> #startmeeting infra
19:03:01 <openstack> Meeting started Tue Oct 15 19:02:59 2013 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:03:02 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:03:04 <openstack> The meeting name has been set to 'infra'
19:03:39 <sdague> <- lurking
19:03:57 <zaro> o/
19:04:06 <jeblair> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting
19:04:26 <jeblair> #link http://eavesdrop.openstack.org/meetings/infra/2013/infra.2013-10-08-19.01.html
19:04:59 <mordred> o/
19:05:11 <krtaylor> o/
19:05:15 <jeblair> #topic Actions from last meeting
19:05:25 <jeblair> #action jeblair move tarballs.o.o and include 50gb space for heat/trove images
19:05:49 <jeblair> ok, so i still didn't do that. :( sorry. the 'quiet' periods where i think i'm going to do that keep being not quiet
19:06:06 <jeblair> but at least i think it's still low priority; i don't think it's impacting anything
19:06:14 <mordred> agree
19:06:22 <jeblair> clarkb announced and executed the etherpad upgrade!
19:06:34 <jeblair> clarkb: thank you so much for that
19:06:42 <pleia2> \o/
19:06:58 <clarkb> it seems to be holding up so far too
19:06:59 <jeblair> i'm unaware of any issues, other than needing to remind people to force-reload occasionally
19:07:02 <fungi> it's awesome! i can make text permanently monospace now and don't have to keep resetting my personal prefs instead
19:07:14 <mordred> ++
19:07:17 <clarkb> also we theoretically have bup backups for that host now, but I haven't done a recovery yet
19:07:21 * mordred enjoys our new etherpad overlords
19:07:40 <sdague> yes, also headings now enabled, which makes organizing pads nicer
19:08:37 <jeblair> #topic Trove testing (mordred, hub_cap)
19:09:13 <hub_cap> helo
19:09:17 <jeblair> mordred, hub_cap: what's the latest here, and are you blocked on any infra stuff?
19:09:19 <hub_cap> i just shot a msg to the room
19:09:22 <mordred> I have done no additional work in this area.
19:09:31 <hub_cap> im going to take over the caching work
19:09:35 <hub_cap> so, looking @ the "image caching" job for the heat/trove images. i was wondering if it'd make sense to go the easy route, and just put a few more #IMAGE_URLS= into stackrc, and let the job automagically grab them
19:09:45 <hub_cap> ^ ^ sent to -infra, we can discuss offline if u want
19:09:47 <mordred> I think that's probably a great first step
19:10:03 <mordred> since these ARE images that you're intending on using as part of a d-g run
19:10:08 <hub_cap> its a simple fix, and your nodepool stuff already grabs it
19:10:15 <jeblair> mordred, hub_cap: do we have an etherpad plan for this somewhere?
19:10:35 * mordred feels like we did, but is not sure where
19:10:43 <hub_cap> clarkb: had one
19:10:48 * hub_cap thought
19:10:58 <jeblair> i can't recall whether our current thinking is that the job runs on devstack-precise nodes, or are we making a new nodepool image type...
19:11:10 <jeblair> mordred: i think devstack-precise, right?
19:11:40 <mordred> I think that devstack-precise was the current thinking - until proven otherwise
19:12:05 <clarkb> I started one with the notes that were on my whiteboard /me finds a link
19:12:06 <jeblair> hub_cap: i think that would be fine, but it's also easy to throw them into the nodepool scripts. so either way; probably depends on what devstack core thinks is appropriate.
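
A minimal sketch of the stackrc/localrc addition hub_cap is proposing above, assuming the extra diskimage-builder base images are simply appended to devstack's existing IMAGE_URLS variable; the Fedora URL is the example quoted later in the meeting, the Ubuntu one is illustrative:

    # Hypothetical devstack localrc lines: extra base cloud images for
    # diskimage-builder, appended to IMAGE_URLS so the image-caching job
    # downloads them along with the images devstack already fetches.
    IMAGE_URLS+=",http://cloud.fedoraproject.org/fedora-19.x86_64.qcow2"
    IMAGE_URLS+=",http://cloud-images.ubuntu.com/precise/current/precise-server-cloudimg-amd64-disk1.img"
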
19:12:19 <clarkb> #link https://etherpad.openstack.org/p/testing-heat-trove-with-dib
19:12:35 <hub_cap> ill ping sdague / dtroyer on the matter and see what they think
19:12:46 <hub_cap> its definitely the path of least resistance in terms of nodepool
19:12:57 <jeblair> hub_cap: ok. know where the nodepool scripts are if the answer swings the other way?
19:13:00 <hub_cap> yer already wget'ing the IMAGE_URLs and caching them
19:13:05 <sdague> hub_cap: didn't we just merge a big chunk of trove code?
19:13:17 <hub_cap> sdague: yes, this is not trove specific per se
19:13:22 * ttx <- lurking too
19:13:26 <hub_cap> its diskimage-builder image locations
19:13:37 <sdague> ok
19:13:38 <hub_cap> jeblair: itd be pretty simple scripting too
19:13:43 <hub_cap> talked w/ lifeless
19:14:04 <hub_cap> he says its not worth our while to do more than a wget (as in, reading from the dib scripts like mordred and i talked about)
19:14:18 <lifeless> not yet anyhow. Crawl. Walk. Run.
19:14:21 <mordred> k. great. then I think getting motion at all is great
19:14:22 <hub_cap> fall
19:14:37 <jeblair> hub_cap: oh, so you're actually talking about the part where we get a dib image that has previously been published to tarballs.o.o, right?
19:14:46 <hub_cap> oh no im not
19:15:08 <jeblair> hub_cap: ok, so you're talking about the job to create that image?
19:15:15 <hub_cap> im talking about caching the dib images that would normally be downloaded
19:15:15 <jeblair> hub_cap: which is the first bullet point on etherpad
19:15:20 <hub_cap> yes correct jeblair
19:15:47 <jeblair> hub_cap: what are the images you're talking about putting in IMAGE_URLS then?
19:15:50 <hub_cap> then ill be working on some of the other bullet points
19:15:53 <hub_cap> jeblair: sec
19:15:59 <mordred> the base ubuntu and fedora images
19:16:11 <mordred> jeblair: the dib image build process starts with an upstream base cloud image
19:16:29 <jeblair> ah, ok. i don't think that changes any of the things we've said, except i understand them better now. :)
19:16:31 <hub_cap> example: http://cloud.fedoraproject.org/fedora-19.x86_64.qcow2
19:16:42 <mordred> jeblair: I agree :)
19:17:20 <jeblair> cool, anything else on this topic?
19:17:24 <hub_cap> nosah
19:17:27 <hub_cap> <3
19:17:34 <jeblair> hub_cap: thanks
19:17:43 <hub_cap> word
19:17:44 <jeblair> #topic Tripleo testing (mordred, clarkb, lifeless, pleia2)
19:18:02 <pleia2> so I have a test nodepool up per mordred's instructions yesterday, debugging :)
19:18:07 <mordred> woot!
19:18:39 <jeblair> pleia2: awesome; if you feel like committing those instructions to the repo, that'd be cool. :)
19:18:55 <pleia2> at the moment it's erroring with `No current image for tripleo-precise on tripleo-test-cloud` so I need to dig around a bit
19:19:07 <mordred> pleia2: so - that's normal
19:19:16 <mordred> it'll then run for a while and try to build a new image
19:19:26 <pleia2> jeblair: great, will do, also found a list of dpkg dependencies needed if installing it on a new 12.04 vm
19:19:27 <fungi> not an error, just a debug message
19:19:38 <mordred> pleia2: I'm assuming you have the creds for the grizzly cloud in your yaml file?
19:19:49 <pleia2> oh, it kept repeating so I thought it was spinning
19:19:52 <pleia2> mordred: yeah
19:20:00 <pleia2> so I should just start it and let it run?
19:20:04 <mordred> yes
19:20:05 <jeblair> mordred: ++ but also, if you end up shutting it down while it's doing that, if there's still a 'building' record in the db for that image, it _won't_ start trying to build a new one on restart, you'll need to 'nodepool image-delete' that record.
19:20:11 * pleia2 makes so
19:20:15 <mordred> jeblair: very true
19:20:31 <jeblair> pleia2: so make sure that ^ case doesn't apply, otherwise it really might not be doing anything
19:20:32 <mordred> pleia2: nodepool image-list and nodepool image-delete are your friends
19:20:49 <pleia2> great, thanks
19:20:50 <mordred> my instructions may not be full docs
19:21:05 <fungi> pretty much all of nodepool-tabtab is full of awesome
19:21:16 <pleia2> all the patches have merged, so it was nice not to have to apply those at least
19:21:23 <jeblair> pleia2, mordred: also worth knowing -- once you get to the stage where it's actually running the scripts, the stderr from the ssh commands is output _after_ the stdout
19:21:35 <pleia2> jeblair: thanks
19:21:40 <mordred> excellent
19:21:44 <jeblair> if anyone figures out the right way to get those interleaved correctly out of paramiko, i'll give them a cookie.
19:22:10 <jeblair> i sort of gave up and said "at least it's recorded" and moved on.
19:22:13 <mordred> I like cookies
19:22:45 <jeblair> there's some serious magic going on in paramiko with those streams.
19:23:12 <mordred> related to that, btw, now that I know how to nodepool test things, I'm going to nodepool against some of our potential other clouds
19:23:23 <mordred> that may not be released or in prod yet
19:23:57 <jeblair> mordred: nice. isn't it exciting how each new cloud requires source changes?
19:24:16 <pleia2> hah
19:24:18 <fungi> exciting is not the word i'd choose
19:24:32 <mordred> jeblair: yah. although - to be fair, I think that the grizzly cloud mostly just found issues for us - we didn't have to put in new behavior forks
19:24:57 <jeblair> mordred: ok cool
19:25:11 <pleia2> State: building :)
19:25:18 <mordred> pleia2: w00t
19:25:30 <jeblair> anything else related to tripleo?
19:25:34 <pleia2> I think that's it
19:25:44 <mordred> the tripleo cloud is deploying now
19:26:02 <mordred> which means we're getting closer to it being non-destructively updating - at which point infra should be able to consume vms from it
19:26:06 <mordred> or start thinking about doing that
19:26:08 <jeblair> mordred: yay
19:26:16 <mordred> just for those who haven't been following along
19:26:34 <jeblair> #topic Next bug day: Tuesday October 22nd at 1700UTC (pleia2)
19:26:49 <pleia2> just a reminder, next week!
19:26:59 <jeblair> i'm going to be at linuxcon eu then
19:27:11 <mordred> me too
19:27:24 <jeblair> i will try to show up if i can, but i may have to flake out this time
19:27:33 <mordred> me too
19:27:36 <fungi> i'll be here and bugtastic
19:27:43 * clarkb wonders if jeblair and mordred find conferences that conflict with bug days on purpose :{
19:27:46 <clarkb> * :P
19:27:47 <jeblair> (i doubt rescheduling would change that)
19:27:48 <pleia2> hehe
19:27:49 <clarkb> I will be around
19:28:01 <pleia2> jeblair: yeah, the next week gets mighty close to summit
19:28:34 <zaro> i will be on my way to jenkins conf
19:28:41 <jeblair> clarkb: s/bug days/work/ ? :)
19:29:00 <pleia2> zaro: good luck with that! I've been sending folks your way :)
19:29:01 <jeblair> zaro: where you will be speaking, yeah?
19:29:30 <zaro> yes
19:29:39 <jeblair> #topic Open discussion
19:30:02 <clarkb> this morning I started looking at upgrading a bunch of our logstash stuff
19:30:21 <jeblair> my schedule for the next month is roughly: linuxcon eu for a week, vacation for a week, summit for a week, vacation for a week.
19:30:34 <clarkb> basically want to upgrade logstash to 1.2.1 which requires an elasticsearch upgrade to 0.90.3 and may require an upgrade to kibana3
19:30:37 <jeblair> so i'm going to be in and out
19:30:52 <sdague> jeblair: when are you thinking about laying out the infra summit plan? there were a couple of sessions which could be either qa / infra and I wanted to figure out if I should add them to a track or you were
19:30:55 <fungi> i *will* be somewhat unavailable later in the week though. allthingsopen is wednesday and thursday and i want to catch at least some of it since it's local
19:31:08 <fungi> the week==next week
19:31:16 <jeblair> sdague: was planning on doing that next week, will that work for you?
19:31:17 <clarkb> there is a lot to change and while I *think* I can do it non destructively, I would like to be able to just do it more organically and if we lose data oh well
19:31:18 <mordred> my schedule is similar to jeblair's, except I'm doing many more conferences in that stretch
19:31:21 <pleia2> thankfully I'm home for pretty much the rest of the year aside from summit
19:31:22 <sdague> jeblair: sure
19:31:24 <clarkb> sdague: jog0: do you have opinions on that attitude?
19:31:24 <fallenpegasus> I will see you there fungi
19:31:25 <mordred> and by many, I mean 2x
19:31:37 <fungi> fallenpegasus: looking forward to it!
19:31:44 <sdague> clarkb: we can always rebuild logstash, right?
19:31:48 <mordred> so consider me useless as usual
19:31:59 <pleia2> but locally I'm speaking at balug tonight on 'code review for sysadmins' (same as oscon talk)
19:32:00 <sdague> so I consider temp data loss to be an "oh well"
19:32:06 <clarkb> sdague: yup and great
19:32:19 <clarkb> sdague: well its temp in that indexes may go away
19:32:25 <jeblair> clarkb: the havana release is sched for oct 17
19:32:26 <clarkb> and I don't feel like reindexing the data
19:32:39 <jeblair> clarkb: my feeling is maybe wait till next week, then go for it?
19:32:51 <sdague> clarkb: well for elastic-recheck, we'd want to reindex that last 7 days of data regardless
19:32:55 <clarkb> jeblair: yup, that was what I was thinking. After release is the best time for this sort of thing
19:33:07 <sdague> so we can figure out bug trends
19:33:08 <clarkb> sdague: yeah, I am asserting that I would like to not do that :)
19:33:08 <mordred> pleia2: ++
19:33:19 <sdague> clarkb: can't we do it as a background job after?
19:33:34 <jeblair> sdague: how critical is that after the release?
19:33:35 <sdague> clarkb: ok, I just assumed temp outage :)
19:33:49 <clarkb> sdague: we can, but if I start worrying about stuff like that I am worried I won't get this done in a timely manner
19:34:04 <sdague> clarkb: ok, so lets take the hit now
19:34:14 <clarkb> sdague: right I figured after release was the best time for that hit
19:34:29 <sdague> but in future it would be nice to have a reindex process for the data
19:34:33 <mordred> this: https://review.openstack.org/#/q/status:open+project:openstack-infra/config,n,z is a terrifying list
19:34:40 <clarkb> sdague: also, future upgrades will hopefully be less painful. logstash is doing a schema change and elasticsearch is changing major versions of lucene. It is a bit of a perfect storm
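
A rough sketch of the kind of bulk reindex pass sdague asks for above, assuming the elasticsearch-py client; the host and index names are hypothetical, and a real pass would also need to handle mapping differences between the old and new schema:

    # Copy every document from an old index into a new one via scan + bulk.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan, bulk

    es = Elasticsearch(["http://elasticsearch.example.org:9200"])

    def copy_index(source, target):
        # Stream documents out of the old index and feed them back in
        # through the bulk API under the new index name.
        actions = (
            {"_index": target, "_type": doc["_type"], "_source": doc["_source"]}
            for doc in scan(es, index=source, query={"query": {"match_all": {}}})
        )
        bulk(es, actions)

    copy_index("logstash-2013.10.08", "logstash-2013.10.08-v2")
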
19:34:52 <sdague> because - http://status.openstack.org/elastic-recheck/ even in its current form, it's super useful in understanding trending
19:34:56 <sdague> so we'll have a blind spot
19:35:18 <sdague> clarkb: yeh, though honestly, if they broke like that this time, I expect they'll break in the future
19:35:33 <sdague> so bulk reindex process is probably in order
19:35:59 <jeblair> sdague: it would be nice, but we've always said this stuff is transient
19:36:23 <jeblair> and given our current staffing levels vs workload, i think we're going to have to accept that some things like this will have rough edges
19:36:35 <jeblair> as mordred just pointed out
19:36:39 <clarkb> sdague: they might. How about this: if the old indexes don't derp due to the lucene upgrade (they shouldn't but it is a warning they give) I will work on reindexing after the upgrade
19:36:49 <sdague> clarkb: that would be awesome
19:36:54 <clarkb> if they do derp, we move on
19:37:04 <sdague> yeh, I'm fine on that for now
19:37:10 <mordred> jeblair: ++
19:38:04 <sdague> jeblair: I get it, just logstash has a lot of consumers now :) you'll hear it on irc if it comes back empty
19:38:58 <jeblair> sdague: yep, and their contributions to the maintenance of the system will be welcome. :)
19:39:15 <sdague> fair enough
19:39:17 <clarkb> yeah, I will make a best effort, but I think doing it perfectly will require far too much time
19:39:26 <sdague> clarkb: yeh, don't stress on it
19:39:38 <sdague> now is as good a time as ever to take the hit
19:39:40 <clarkb> and in theory since all of the upstreams are working together now this sort of pain will be less painful in the future (I really hope so)
19:40:30 <sdague> any idea how borky it's going to make e-r? like if there are enough data structure changes that we're going to need to do some emergency fixes there?
19:40:47 <clarkb> sdague: it won't be too bad, I will propose an updated query list
19:40:54 <sdague> ok, cool
19:41:01 <sdague> the metadata adds going to go in as part of this?
19:41:10 <clarkb> sdague: the schema is being flattened and silly symbols are being removed. so @message becomes message and @fields.build_foo is just build_foo
19:41:18 <sdague> cool
19:41:36 <clarkb> sdague: yeah, I was planning on looking at that as part of this giant set of changes :)
19:41:42 <sdague> great
19:41:59 <clarkb> I am also planning on trying the elasticsearch http output so that we can decouple elasticsearch upgrades from logstash
19:42:11 <clarkb> but we need to upgrade elasticsearch anyways
19:42:48 <sdague> cool, well just keep me in the loop as things upgrade, I'll see what I can do to hotfix anything on the e-r side to match
19:43:15 <clarkb> sdague: thanks
19:43:18 <clarkb> and will do
19:46:28 <jeblair> there's a thread on the infra list about log serving
19:46:34 <jeblair> Subject: "Log storage/serving"
19:47:20 <jeblair> it seems like the most widely accepted ideas are to store and statically serve directly from swift, and pre-process before uploading
19:48:03 <jeblair> if anyone else wants to weigh in, that would be great
19:48:18 <jeblair> sdague: ^ mentioned this the other day, just a reminder
19:48:36 <jeblair> jog0: ^ may be of interest to you as well
19:49:03 <sdague> oh, yeh, actually
19:49:17 <clarkb> can I get reviews for https://review.openstack.org/#/c/47928/ that is step one in this whole process of upgrading stuff?
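
The schema flattening clarkb describes above mostly amounts to field renames in the elastic-recheck queries; a before/after sketch (the query text and the build_name field are illustrative, not taken from the real query list):

    # logstash 1.1.x schema:
    @message:"Connection refused" AND @fields.build_name:"gate-tempest-devstack-vm-full"
    # logstash 1.2.x flattened schema:
    message:"Connection refused" AND build_name:"gate-tempest-devstack-vm-full"
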
19:49:19 <sdague> so my experience so far on the filters is that the dynamic nature is kind of important
19:49:52 <sdague> so pre-process is something I'd actually tend to avoid if we could (though we could maybe build filters out of swift?)
19:50:27 <jeblair> sdague: well, we won't be running our own swift, so i don't believe we can write a middleware to do it
19:50:33 <jog0> jeblair: nice, what about the issues about how swift doesn't fit this use case exactly
19:51:06 <jeblair> sdague: i agree, i like being able to process them as we serve them -- but considering that we tend to focus on the most recent logs...
19:51:08 <sdague> jeblair: so that being the case, I'd kind of lean on the dynamic filters against an FS model.
19:51:43 <jeblair> sdague: if we keep upgrading the pre-processing, the logs we're looking at most will very shortly have those updates
19:51:49 <sdague> jeblair: right, but if we change a filter output, to do something like link req ids in, then we have to go back and bulk process everything
19:52:08 <jeblair> sdague: or just accept that older logs aren't as featureful
19:52:36 <sdague> jeblair: that also means processing them multiple ways, as we do things like dynamic level selectors
19:52:51 <sdague> we doing a summit session on this?
19:53:01 <sdague> might be good to do it there
19:53:01 <mordred> why is it that we didn't want to do a log-serving app like jeblair was suggesting originally?
19:53:22 <clarkb> mordred: because then we have to run the thing. If we use swift and mostly just swift someone else deals with it :)
19:53:28 <mordred> it seems like storing the logs in swift, with an entry in a db that tells you pointers to the data blobs
19:53:29 <sdague> mordred: I think I'm arguing exactly for the log serving app approach
19:53:38 <mordred> so that if you want to get complex view, you go through the log view app
19:53:39 <jeblair> sdague: i guess i'm saying that it's worth weighing the benefit of being able to add new processing to old logs against the simplicity of being able to use swift more or less straight up.
19:53:42 <sdague> yeh, we could still put raw in swift
19:53:47 <mordred> but if you just want raw data, you pull the data directly from swift
19:54:27 <jeblair> sdague: you could get dynamic features by doing the filtering in javascript (and by encoding tags in the file, still make filtering easily available to logstash pusher)
19:54:42 <sdague> jeblair: with the size of these files... you really can't
19:55:06 <jeblair> sdague: ?
19:55:15 <sdague> n-cpu logs, htmlized, cripple anything but chrome
19:55:22 <sdague> 35MB html uncompressed
19:55:43 <jeblair> sdague: aren't they already htmlized?
19:55:52 <sdague> no, that's the point of the filter
19:56:09 <jeblair> sdague: we sometimes don't convert them to html?
19:56:17 <sdague> jeblair: right, for logstash
19:56:20 <sdague> or if you wget
19:56:30 <jeblair> sdague: but that doesn't cripple a non-chrome browser
19:56:33 <sdague> we're doing content negotiation, so you can get html or text/plain
19:56:36 <jeblair> i don't think i'm following
19:56:52 <sdague> the overhead of a 35 MB html dom kills most browsers
19:57:10 <jeblair> sdague: so you're saying it only works because it doesn't default to DEBUG?
19:57:12 <sdague> javascript manipulating it would be even worse
19:57:24 <jeblair> sdague: and if you click debug, it'll kill your browser
19:57:44 <sdague> we actually default to debug, and most people use chrome when it kills firefox
19:58:02 <jeblair> sdague: ok, so as far as the html goes, there would be no difference
19:58:11 <sdague> but a future enhancement I was going to add was detecting browser, and defaulting to a lower level
19:58:37 <sdague> anyway, face to face, maybe we can sort the various concerns
19:58:45 <jeblair> sdague: well, if you have a minute, could you add these thoughts to the thread?
19:58:59 <sdague> sure
19:59:14 <jeblair> sdague: because so far, the idea of running a log serving app, which i originally suggested, has very few supporters
19:59:53 <jeblair> sdague: and yeah, i'll propose a summit session on this
20:00:11 <jeblair> and i think we're at time
20:00:18 <jeblair> thanks everyone
20:00:19 <fungi> i suspect that's mainly because the main use cases for swift are directly serving files rather than using it purely as a storage backend
20:00:30 <jeblair> #endmeeting
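
For reference, a minimal sketch of the "pre-process before uploading, then serve statically from swift" idea discussed above, assuming python-swiftclient; the credentials, container, and object names are hypothetical:

    from swiftclient.client import Connection

    # Hypothetical credentials; a real upload job would read these from its
    # environment rather than hard-coding them.
    conn = Connection(
        authurl="https://identity.example.com/v2.0",
        user="logs-uploader",
        key="secret",
        tenant_name="infra",
        auth_version="2")

    # Pre-process once at upload time (htmlize/compress), then store the result
    # so swift can serve it statically with the right headers.
    with open("n-cpu.html.gz", "rb") as f:
        conn.put_object(
            "logs",                                 # container
            "check/1234/5/gate-job/n-cpu.html.gz",  # object name, illustrative
            contents=f,
            content_type="text/html",
            headers={"Content-Encoding": "gzip"})
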