19:01:22 <jeblair> #startmeeting infra
19:01:23 <openstack> Meeting started Tue Mar 4 19:01:22 2014 UTC and is due to finish in 60 minutes. The chair is jeblair. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:24 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:25 <mordred> o/
19:01:26 <openstack> The meeting name has been set to 'infra'
19:01:37 <jeblair> #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting
19:01:39 <jeblair> agenda ^
19:01:45 <jeblair> #link http://eavesdrop.openstack.org/meetings/infra/2014/infra.2014-02-25-19.02.html
19:01:48 <jeblair> last meeting ^
19:01:56 <mordred> morganfainberg: I've got split attention - can you ping me later about whatever you're talking about re: manpage above?
19:02:11 <jeblair> #topic Actions from last meeting
19:02:17 <jeblair> mordred send new-project service degradation announcement
19:02:17 <clarkb> o/
19:02:18 <morganfainberg> mordred, absolutely
19:02:20 <jeblair> i think he did that
19:02:33 <anteaya> I saw it
19:02:34 <mordred> I did something
19:02:43 <fungi> mordred: you sent an e-mail
19:02:43 <jeblair> and i think on friday some projects were created...
19:02:50 <jeblair> how did all that work out?
19:02:55 <clarkb> they were, thanks to fungi and anteaya's hard work
19:03:06 <jeblair> (a) do we need to alter the process?
19:03:20 <anteaya> clarkb has a patch
19:03:23 <jeblair> (b) did we learn anything that will help mordred in his work on manage_projects?
19:03:25 <fungi> spotted some bugs, fixes have been proposed (i don't have links handy)
19:03:38 <fungi> the process from last time worked well, i think
19:03:40 <jeblair> fungi: ah cool
19:03:42 <anteaya> that will be applied next time; that may mean we don't need to change the process, if it works
19:03:43 <clarkb> #link https://review.openstack.org/#/c/77314/
19:03:55 <fungi> we should repeat. clarkb has a pending patch we didn't test last time because it came moments too late
19:04:12 <clarkb> and I think it may fix the bulk of the non-orchestration-related trouble
19:04:17 <fungi> the other proposed patches were hand-applied on review.o.o and worked well
19:04:37 <clarkb> we still need some ordering around processes on different nodes, but the bugs outside of that seem to be getting killed \o/
19:04:44 <fungi> i think i +2'd them all with comments if i used them during the run
19:04:45 <jeblair> awesome!
19:05:08 <jeblair> since i think the orchestration bits were what mordred had the most concrete ideas about, this should help a lot
19:05:08 <fungi> all jeepyb patches
19:05:24 <mordred> agree
19:05:39 <anteaya> I have a request for next round
19:06:01 <anteaya> if we can start earlier I can be here for it; my taxi picks me up at 6pm EST on friday
19:06:04 <fungi> i have a couple of bugs flagged i need to pick up for initial group members for the projects we created on friday, but those will get wrapped up today
19:06:40 <fungi> the next round will probably go faster. a lot of the time was spent assembling the list and doing last-minute reviewing
19:06:47 <anteaya> agreed
19:06:55 <clarkb> fungi: I will try to review and approve those that were tested
19:07:03 <fungi> and we had a lot of changes waiting in the queue for that one since we'd put them off for some weeks
19:07:08 <anteaya> do we have an etherpad started for next round?
19:07:34 <jeblair> zaro: ping
19:07:58 <fungi> we should probably copy the "needs work" section from the last one to seed the next
19:08:02 <fungi> #link https://etherpad.openstack.org/p/new-projects-2014-02-28
19:08:02 <zaro> ohh yeah. here
19:08:11 <anteaya> fungi: k
19:08:20 <anteaya> I can mix up a new etherpad
19:08:30 <jeblair> anteaya, fungi: thanks
19:08:34 <jeblair> let's move on...
19:08:39 <jeblair> #topic Convert gerrit db tables to UTF8 (zaro)
19:08:50 <fungi> this time around it will be faster to build the list, since anybody who isn't on the old etherpad doesn't have an excuse for not setting an appropriate topic
19:08:53 <jeblair> zaro: what's the latest thinking on this?
19:08:54 <zaro> ok. i think all the info is in the bug
19:09:00 <zaro> let me find it
19:09:24 <zaro> #link https://bugs.launchpad.net/openstack-ci/+bug/979227
19:10:07 <zaro> so i believe we left off on jeblair wanting some more info on where the dups are in the conversion.
19:10:30 <zaro> #link https://launchpadlibrarian.net/165584391/case_insensitive_dups.txt
19:11:18 <mordred> oh - I think those are pretty easy to fix by hand :)
19:11:28 <jeblair> mordred: how can you fix them?
19:12:01 <zaro> did you see the last sentence? "I'm not sure why line 1590340 is a duplicate because I could not find a duplicate entry for it but there are many more like this one."
19:12:01 <mordred> the emails can just be fixed - you're right though - the usernames are a bit ugh
19:12:26 <jeblair> mordred: the emails can't be fixed, actually -- the localpart in emails is case sensitive
19:12:38 <jeblair> it seems that the problem is that utf8 is not case sensitive and there is not a general cs utf8 collation
19:13:29 <mordred> right. so, the local part is case-sensitive - but I don't believe that gmail behaves that way. that leaves us with just Daviey who might be a problem
19:13:40 <jeblair> mordred: or others in the future
19:13:50 <jeblair> there has to be a right solution to this
19:14:25 <jeblair> what's the deal with the utf8_general_cs collation? http://bugs.mysql.com/bug.php?id=65830
19:14:43 <jeblair> how does that manifest in current and next ubuntu lts?
19:15:40 <zaro> is there a problem with using utf8_bin?
19:16:00 <zaro> mordred suggested that before
19:16:37 <jeblair> mordred: what do you think about that?
19:16:40 <mordred> bin doesn't understand sorting of international characters properly - but it should at least keep the things separate
19:16:52 <mordred> I don't think username sorting is very important to us
19:17:26 <jeblair> mordred: i suspect you are right; i'm finding it difficult to come up with a place in gerrit that could bite us
19:17:58 <jeblair> zaro: do you want to try that on review-dev and see if any problems manifest?
19:18:21 <mordred> zaro: (thanks for reminding me)
19:18:22 <jeblair> (i still think knowing the answer to whether utf8_general_cs will be available in the next ubuntu lts would be useful)
19:18:44 <jeblair> we might be able to move to it later if it is
19:19:00 <mordred> well, Davi Arnaut seems to believe that it's an experimental collation anyway
19:19:00 <zaro> i believe that i was testing with review-dev data before testing with review data.
19:19:14 <jeblair> mordred: oh, so bad idea anyway?
19:19:15 <zaro> i didn't have a problem with either when using utf8_bin
19:20:27 <mordred> https://github.com/svagner/MM-Percona-Server/blob/master/config/ac-macros/character_sets.m4#L365
19:20:29 <zaro> i mean no errors popped up during the conversion.
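
[Editor's note: a minimal sketch of the duplicate check and conversion under discussion, assuming hypothetical table and column names (account_external_ids, email_address) rather than Gerrit's actual schema. utf8_bin compares byte-for-byte, so values differing only in case ("Daviey" vs "daviey") stay distinct under unique keys, which is why the conversion can proceed without first de-duplicating them.]

```python
#!/usr/bin/env python
# Hypothetical illustration of the collation discussion above; table,
# column, and connection details are made up, not the real deployment.
import MySQLdb

conn = MySQLdb.connect(db='reviewdb', user='gerrit', passwd='secret',
                       charset='utf8')
cur = conn.cursor()

# Find values that would collide under a case-insensitive collation
# such as utf8_general_ci -- the "dups" in case_insensitive_dups.txt.
cur.execute("""
    SELECT LOWER(email_address), COUNT(*)
      FROM account_external_ids
     GROUP BY LOWER(email_address)
    HAVING COUNT(*) > 1
""")
for email, count in cur.fetchall():
    print('%s appears %d times (case-insensitively)' % (email, count))

# Converting to utf8_bin instead: byte-wise comparison keeps
# case-sensitive values distinct, so unique keys keep working; the
# trade-off is that text sorting becomes "asciibetical".
cur.execute("""
    ALTER TABLE account_external_ids
  CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin
""")
conn.close()
```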
19:21:09 <mordred> _cs has troubling comments around it in the build files :)
19:21:15 <mordred> so I vote for just doing utf8_bin
19:22:02 <SergeyLukjanov> unfortunately, I don't see any problems with utf8_bin, should I?
19:22:03 <jeblair> i wonder what gerrit actually expects?
19:22:53 <jeblair> it seems very strange to have case-sensitive usernames. but it also seems strange to have case-insensitive change messages and emails...
19:23:02 <jeblair> perhaps there is no coherent intent. :/
19:23:11 <jeblair> anyone object to utf8_bin?
19:23:30 <fungi> seems okay to me
19:23:42 <mordred> it's really only relevant for sorting and for unique keys
19:23:47 <fungi> i suspect they just didn't put much thought into it
19:23:57 <mordred> utf8_bin will still work for unique keys
19:24:05 <mordred> sorting might be weird in some edge-case contexts
19:24:18 <clarkb> but would work in most cases as long as the fields are consistent-ish?
19:24:19 <mordred> because sorting will essentially be done numerically by the underlying hex code
19:24:30 <clarkb> ah, so anything ascii would be fine
19:24:34 <mordred> pretty much
19:24:42 <fungi> and non-text fields would still sort normally
19:24:47 <mordred> yes
19:24:48 <jeblair> clarkb: though it's asciibetical, not alphabetical
19:24:53 <jeblair> ABC,abc
19:25:03 * fungi prefers c sort order anyway ;)
19:25:15 <mordred> yeah
19:25:17 <mordred> but
19:25:26 <mordred> we don't REALLY show alpha-sorted lists
19:25:47 <clarkb> right, most sorts in gerrit are changenumber/changeid or date based
19:25:49 <jeblair> so it seems like it's worth a try, and if strange sorting does show up someplace, we can consider going back to latin1 or breaking out the case-sensitive fields
19:26:27 <fungi> i have to wonder whether any part of gerrit which cares about text sorting actually asks mysql to sort the results and uses that straight in the ui anyway
19:26:39 <jeblair> clarkb: changeid (eg hexsha) sorts could be affected, but sorting a uuid seems weird.
19:26:44 <fungi> in all probability they perform their own sorting on the query results
19:26:48 <jeblair> fungi: would not surprise me
19:27:07 <jeblair> #agreed convert gerrit tables to utf8_bin collation
19:27:19 <jeblair> #topic Upgrade gerrit (zaro)
19:27:32 <jeblair> zaro: i'm guessing you're blocked on the buck stuff, yeah?
19:27:50 <zaro> nope. although az2 has been a pain in the butt
19:28:04 <jeblair> no kidding
19:28:16 <clarkb> oh right, I was going to try and look into that more today
19:28:18 <zaro> anyways, i believe all patches that are required are ready for review and just waiting for you gents to review them
19:28:54 <jeblair> okay. nice.
19:29:20 <jeblair> #topic Removing openstack-ci-admins ML from LP (fungi)
19:29:23 <fungi> i've had shell loops running for the past 24 hours trying to get image rebuilds in az2 to stick
19:29:34 <zaro> once all the patches are merged then we can re-puppet review-dev to see if it all works.
19:29:35 <fungi> i should, more accurately, have titled that "deactivating openstack-ci-admins ml on lp"
19:30:03 <fungi> since the creation of the openstack-infra list on lists.o.o we have 4 messages in the archive for the openstack-ci-admins list
19:30:25 <fungi> but we keep getting people caught in moderation, e-mailing it about gate failures and requests for assistance on gerrit accounts
19:30:48 <fungi> so it's an attractive nuisance. i think we should disable it (the archives would still be published for historical purposes)
19:30:50 <jeblair> we might have that in some docs somewhere... :/
19:30:58 <fungi> but wanted to see whether there are objections to that
19:31:11 <jeblair> that works for me. the infra list seems a reasonable place for that now.
19:31:15 <fungi> fair point. i'll search the wiki and git repos
19:31:18 <clarkb> I don't object, but we should grep for places we may be advertising it
19:31:42 <jeblair> nibalizer: ping? (i think you said you had to run...)
19:31:46 <fungi> #action fungi check for remaining recommendations of openstack-ci-admins
19:31:55 <fungi> #action fungi disable openstack-ci-admins list
19:32:51 <jeblair> #topic Monitoring of Infra Resources / Systems (morganfainberg)
19:32:55 <morganfainberg> o/
19:33:12 <jeblair> morganfainberg: what's on your mind?
19:33:47 <nibalizer> jeblair: im about
19:33:58 <morganfainberg> There was a brief discussion that we might want to start adding monitoring of Infra resources (e.g. bots) and possibly some aggregation alarms (not pager duty, but at-a-glance "we've hit a threshold") that take into account more than the individual cacti graphs
19:34:01 <jeblair> nibalizer: cool, we'll come back to you in a bit
19:34:09 <morganfainberg> pleia2 said she had some thoughts on this as well
19:34:42 <pleia2> nothing has been hashed out yet, but I'm inclined to say we should set up Nagios for some monitoring
19:34:48 <morganfainberg> this was added as an introduction to the additional monitoring - unfortunately I don't have much more at this point.
19:34:58 <morganfainberg> with the milestone my focus is a little split :)
19:35:48 <clarkb> I see this potentially as supplementing the status pages
19:35:56 <morganfainberg> clarkb, ++
19:35:59 <pleia2> we don't actually have a bug for setting up monitoring beyond cacti
19:36:07 <jeblair> okay, let me share my thoughts -- i'm not opposed to monitoring; i think it's very important (i set up cacti and graphite after all)...
19:36:12 <clarkb> instead of needing humans to ping us and ask "is X broken?" 500 times during an outage, have automated checks that update a status page that everyone can check
19:36:12 <fungi> however, i think it would be a pretty significant time sink to tune, groom and polish
19:36:19 <morganfainberg> i also don't want pager-duty-esque stuff
19:36:25 <morganfainberg> we're volunteers, we don't need that
19:36:41 <jeblair> but i'm a little skeptical about the traditional nagios-style monitoring..
19:36:46 <jeblair> morganfainberg: agreed about pager duty
19:36:48 <morganfainberg> fungi, it won't happen overnight. and i wouldn't expect it to
19:36:49 <fungi> a monitoring system which is half red, where 95% of that is from false positives, is of use to no one
19:36:54 <pleia2> fungi: so I think we start out with a pretty minimal setup, checking disk space and ping kind of thing
19:36:57 <jeblair> i also just deleted 15,000 emails from our servers that i have not read
19:37:01 <pleia2> disk space has bitten us more than once
19:37:33 <fungi> sure -- we already have all that information published and available, but not enough people to sit and stare at it
19:37:35 <pleia2> there are some nagios bots for IRC, so instead of email it could alert to IRC
19:37:42 <jeblair> and my previous experience with things like nagios is that you spend a _lot_ of time adjusting paging thresholds for things like disk space and dealing with false positives..
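
[Editor's note: a minimal sketch of the "status page, not pager" idea being debated here, assuming made-up hostnames and a 90% disk threshold; it reflects the cheap checks pleia2 mentions (disk space, ping), not the team's actual tooling.]

```python
#!/usr/bin/env python
# Hypothetical status-page generator: run a couple of cheap checks and
# emit a plain-text summary anyone can read, rather than paging anyone.
import os
import subprocess

HOSTS = ['review.openstack.org', 'logs.openstack.org']  # illustrative
DISK_THRESHOLD = 0.90  # fraction of filesystem used before we flag it


def disk_ok(path='/'):
    # Fraction of blocks in use on the filesystem holding `path`.
    st = os.statvfs(path)
    used = 1.0 - float(st.f_bavail) / st.f_blocks
    return used < DISK_THRESHOLD, used


def ping_ok(host):
    # One ICMP echo with a short timeout; exit code 0 means reachable.
    return subprocess.call(['ping', '-c', '1', '-W', '2', host]) == 0


if __name__ == '__main__':
    ok, used = disk_ok()
    print('local disk: %s (%.0f%% used)' % ('OK' if ok else 'FULL',
                                            used * 100))
    for host in HOSTS:
        print('%s: %s' % (host, 'OK' if ping_ok(host) else 'UNREACHABLE'))
```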
19:37:56 * anteaya continues to stare at zuul status
19:38:11 <fungi> i've worked jobs where those sorts of systems were a great advantage. we also had a noc with 50+ people staffed around the clock and a department to keep things from firing incorrectly
19:38:28 <pleia2> last I used it I was managing over 100 servers, but we don't have that many and we can target specific ones to monitor that we might be concerned about
19:39:01 <fungi> well, we actually have way more than 100 servers, but we may not care about much in the way of live metrics on most of them
19:39:07 <jeblair> heh
19:39:20 <pleia2> right, well, "static" servers :)
19:39:49 <anteaya> here is my question: what will change as a result of this info?
19:40:03 <pleia2> anteaya: we get alerts if a server goes offline, or a disk fills up
19:40:07 <morganfainberg> fungi, i think this is something we start super small with, hit the bigger-ticket items and use it in addition to status pages to help us identify issues a bit earlier than "oops" or "hey, X is broken" (from 1000 people)
19:40:22 <pleia2> morganfainberg: yeah, that's what I'm thinking
19:40:32 <fungi> and yes, we do have actual disk space utilization data, network interface error stats, and so on trended and accessible in raw form from cacti, which could be consumed by anyone wanting to give us a heads up on broken things too, which might be a good place to start
19:40:35 <anteaya> but if a server goes offline, someone posts in infra
19:40:41 <pleia2> anteaya: sadly now, we sometimes find out when someone joins the channel and complains about something not working :\
19:40:48 <anteaya> and as a team, we have fairly good channel coverage
19:40:52 <jeblair> i like monitoring, but i'm not keen on alerting in our environment. i'd be more happy with status pages that anyone can check. i'm less keen on email or irc alerts.
19:40:57 <pleia2> anteaya: that's kind of embarrassing
19:41:07 <jeblair> pleia2: why? they usually tell us before nagios would anyway
19:41:09 <sdague> so the issue is there are some normal fail modes, which mostly I find at 6am over coffee. so sean-nagios is something I'd like to stop doing
19:41:10 <anteaya> I don't see how that is going to change
19:41:19 <morganfainberg> sdague, ++
19:41:28 <jeblair> sdague: normal failure modes should be corrected...
19:41:29 <anteaya> since someone telling us will probably happen at the same time as the alert anyway
19:41:32 <morganfainberg> sdague, not that we want you to stop being you... or stop enjoying coffee
19:41:32 <jeblair> take the logs as an example
19:41:45 <pleia2> jeblair: hurts my sysadmin feelings; I should know what my servers are up to, not have users tell me
19:41:49 <morganfainberg> jeblair, i think this can also be used to help identify the normal fail modes over a longer period of time.
19:41:52 <fungi> sdague: you would prefer to find those failures neatly organized on a status page over coffee instead?
19:41:53 <morganfainberg> jeblair, with the right tool.
19:42:03 <jeblair> sdague, morganfainberg: we have had the log server fill up on disk space before. our solution to that is to _stop using the log server and put logs in swift_.
19:42:07 <sdague> fungi: yeh, so I don't need to spend brain power deducing them
19:42:18 <morganfainberg> how many times has X failed?
19:42:28 <sdague> or - hey... is my irc bot dead?
19:42:40 <jeblair> sdague, morganfainberg: that's a big project and is going to take some time, but putting our _very_ limited resources into writing and reviewing and implementing that change is WAY better in my opinion than investing in monitoring it
19:42:41 <fungi> debug irc bot, apply fixes, repeat
19:42:45 <nibalizer> an irc bot for alerting could squawk in a different channel than the -infra channel
19:42:52 <nibalizer> that way it's very opt-in
19:43:08 <fungi> nibalizer: i especially love the circular meta concept of an irc bot warning you that your irc bots are broken ;)
19:43:10 <nibalizer> i would be more inclined to check that, especially it being event driven, than a status webpage
19:43:33 <pleia2> fungi: hehe
19:43:38 <nibalizer> fungi: no worse than the nagios email to tell you email is down
19:43:50 <morganfainberg> jeblair, this is why i brought it up. i know we have limited resources, but it's worth considering
19:44:01 <morganfainberg> jeblair, even if the answer is "not now, maybe later"
19:44:12 <fungi> nibalizer: like jeblair, i basically already have no time to read all the e-mails our systems send me. so sure, no worse than that ;)
19:44:19 <morganfainberg> jeblair, or "let's do something else and see if we still need it down the line"
19:44:29 <sdague> fungi: honestly, I don't want it as alerts, I want a status page
19:44:39 <morganfainberg> sdague, ++ that was my initial thought
19:44:42 <pleia2> nagios can do a public-ish status page
19:45:03 <morganfainberg> pleia2, we could use any number of tools for it.
19:45:05 <sdague> we already have all sorts of data we put in zuul status to let us know when things are crazy
19:46:02 <sdague> and I think there is a class of other things where knowing that something just broke ends up saving me 2-3 hrs of debugging before fungi gets up and can check on something
19:46:12 <jeblair> pleia2: will it offend your inner sysadmin if that page is red all the time? how much time will you spend adjusting filesystem usage thresholds?
19:46:31 <jeblair> sdague: let's try to identify what those are and see if we can expose them specifically...
19:46:38 <pleia2> jeblair: I guess I haven't had the same experiences as you; my Nagios is quite green
19:46:49 <fungi> i think if there are people who want to spearhead adding a nagios server and doing the tuning necessary to get it usable, then i'm not directly opposed... but that's probably lots and lots of little reviews to tweak the configuration accordingly
19:46:58 <morganfainberg> jeblair, pleia2, i've had both experiences
19:47:12 <sdague> jeblair: sure, my instinct would be this is incremental, in the same way that something like er was
19:47:22 <morganfainberg> jeblair, pleia2, the "everything red" experience is usually because you try to add everything at once and never get any of them right
19:47:27 <pleia2> morganfainberg: yeah
19:47:37 <morganfainberg> jeblair, or no time to spend on it at all
19:47:49 <morganfainberg> jeblair, some orgs are like that :P
19:47:56 <fungi> most of the problem is that there are things which are easy to monitor with very low false-positive rates, but those are also the things which just about never have a problem. the things which are more useful to find out about are also the things which need a lot of thought around thresholds
19:48:22 <jeblair> fungi: agreed (with the last 2 things you said)
19:48:38 <sdague> fungi: yeh, I don't care about the easy-to-monitor things. I care about the things that I bug you about. Like er bot.
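
[Editor's note: a minimal sketch of why the "see if this process is running" checks pleia2 describes below miss the failure mode jeblair goes on to name (bots that are alive but netsplit or wedged in an irc library bug). The pid-file and log paths and the staleness window are made up for illustration, not the real bot deployment.]

```python
#!/usr/bin/env python
# Hypothetical bot health check: a live process is not enough, since a
# bot can survive a netsplit or hang while its pid stays around, so we
# also require that it has written to its log recently.
import os
import time

PIDFILE = '/var/run/recheckwatchbot.pid'   # hypothetical path
LOGFILE = '/var/log/recheckwatchbot.log'   # hypothetical path
MAX_LOG_AGE = 30 * 60  # seconds of log silence before we call it stuck


def process_running(pidfile):
    try:
        pid = int(open(pidfile).read().strip())
        os.kill(pid, 0)  # signal 0: existence check only, nothing sent
        return True
    except (IOError, ValueError, OSError):
        return False


def recently_active(logfile, max_age):
    try:
        return time.time() - os.path.getmtime(logfile) < max_age
    except OSError:
        return False


if __name__ == '__main__':
    if not process_running(PIDFILE):
        print('bot: DEAD (no process)')
    elif not recently_active(LOGFILE, MAX_LOG_AGE):
        print('bot: STUCK (process alive, log silent)')
    else:
        print('bot: OK')
```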
19:48:49 <morganfainberg> sdague, ++
19:49:18 <jeblair> sdague: i suspect what you want to know about er-bot is almost impossible to monitor in nagios...
19:49:28 <fungi> sdague: and i think that's an example of "let's figure out why this is broken", while finding out a little more quickly that it broke isn't quite as beneficial over the long term
19:49:31 <sdague> jeblair: out of the box, for sure
19:49:33 <pleia2> yeah, I have bot monitors but they're all "see if this process is running"
19:50:27 <jeblair> pleia2: they are almost always running. sometimes they are netsplit, sometimes they are stuck due to an irc library bug. i just yesterday started upgrading irclib to see if we can fix that...
19:50:31 <sdague> yeh, I could write a plugin for this. the biggest thing right now is there was really no infrastructure to register against.
19:50:35 <pleia2> jeblair: yeah, that's what I was afraid of
19:50:39 <jeblair> but all that came from reading log files
19:50:44 * pleia2 nods
19:51:16 <jeblair> sdague: and the solution to "we hit an irclib bug" isn't to write a nagios plugin
19:51:18 <sdague> but I mostly consider this unit testing for some services
19:51:28 <sdague> jeblair: sure
19:51:41 <pleia2> sdague: maybe we should create a wishlist bug to brainstorm on?
19:51:51 <pleia2> then we can come back to this and evaluate
19:52:18 <morganfainberg> pleia2, sdague, i like that.
19:52:57 <jeblair> okay, i don't think we have consensus, but i think you know the concerns. and i'd like to move on to the next topic
19:52:59 <fungi> well, in some organizations "monitor to let us know every time this fails" is normal operating procedure, which stems from never having time to debug failures and just accepting that things break and you're going to fight fires, bring them back up asap and maybe sometimes discover why they broke in the process
19:53:15 <jeblair> #topic Infrastructure Priorities
19:53:16 <sdague> fungi: yeh, I don't think we are that org though
19:53:25 <jeblair> #link https://etherpad.openstack.org/p/infrastructure-priorities
19:53:33 <jeblair> there are a lot of new people around...
19:54:01 <jeblair> and when we say something is or isn't a priority for our limited resources, it might seem like we're making it up as we go along...
19:54:04 <jeblair> but we aren't. :)
19:54:19 <mordred> ++
19:54:21 <jeblair> it turns out we actually set priorities fairly clearly at the summits
19:54:42 <jeblair> but we're really bad about communicating those between summits
19:54:51 <jeblair> i'm hoping storyboard will make that better
19:54:58 <mordred> I was just about to say storyboard
19:55:07 <morganfainberg> yay storyboard!
19:55:11 <fungi> mordred: it's there!
19:55:14 <jeblair> but for now, here's a list of things based on the last (approximately) 2 summits
19:55:32 * ttx sighs
19:55:38 <jeblair> so if you're looking for something to work on in infra, check the bugs, but these are our highest-priority items
19:55:47 * mordred hands ttx some wine and cheese and a sausage
19:55:55 * mordred agrees
19:55:57 <jeblair> and if you're looking to prioritize reviews, this same list will help
19:56:10 * mordred supports everyone who is in channel working on the list
19:56:26 * mordred will consider kindly anyone who does
19:56:57 <jeblair> and understand that things not on the list may well be good ideas, but i'm personally always thinking about this list when i'm writing and reviewing
19:57:06 <nibalizer> ok
19:57:16 <jeblair> and would love it if we could focus on completing some of these before we get too distracted
19:57:47 <nibalizer> fwiw puppetboard 1 and 2 are done and 3 is in review
19:57:53 <pleia2> \o/
19:58:22 <clarkb> jeblair: thank you for the reminder; I need to hunker down on the last backup bits and close that bug
19:58:24 <jeblair> nibalizer: yeah, i think that's going to unblock a lot of work
19:58:29 <anteaya> fungi: did you ever have a patch up for this? "Write Jenkins job which sends the salt command from salt-trigger slave"
19:58:42 <clarkb> fungi: after feature freeze do you think we can sit together and get the sensitive-stuff backups sorted?
19:59:01 <fungi> anteaya: that part is a) extremely trivial at this point and b) useless until we can have the reactor-based dependencies implemented
19:59:05 <pleia2> with FF this week, I'm also hoping we can do another bug day next week - Tuesday March 11th at 17:00 UTC
19:59:09 <anteaya> fungi: k
19:59:18 <fungi> clarkb: sure thing
20:00:11 <jeblair> thanks everyone. sorry we didn't get to nibalizer's thing, but i think that topic will be more relevant next week; hopefully we'll have a puppetboard by then!
20:00:19 <jeblair> #endmeeting