*** SimonChung1 has quit IRC | 00:01 | |
*** SimonChung has joined #openstack-operators | 00:07 | |
*** SimonChung1 has joined #openstack-operators | 00:12 | |
*** SimonChung has quit IRC | 00:12 | |
*** dmsimard is now known as dmsimard_away | 00:17 | |
*** blair has joined #openstack-operators | 00:19 | |
*** Marga_ has quit IRC | 00:22 | |
*** furlongm_ has joined #openstack-operators | 00:22 | |
*** Marga_ has joined #openstack-operators | 00:23 | |
*** furlongm has quit IRC | 00:24 | |
*** Marga_ has quit IRC | 00:27 | |
*** furlongm_ has quit IRC | 00:29 | |
*** furlongm has joined #openstack-operators | 00:30 | |
*** Marga_ has joined #openstack-operators | 00:52 | |
*** Marga_ has quit IRC | 00:52 | |
*** Marga_ has joined #openstack-operators | 00:53 | |
*** SimonChung1 has quit IRC | 00:57 | |
*** VW_ has quit IRC | 01:02 | |
*** SimonChung has joined #openstack-operators | 01:15 | |
*** alop has quit IRC | 01:16 | |
*** david-lyle is now known as david-lyle_afk | 01:18 | |
klindgren | Does anyone here use the metadata feature of keystone? | 01:19 |
klindgren | IE the ability to attach arbitrary key=value stuff to keystone projects and the like? | 01:19 |
*** mdorman has quit IRC | 01:20 | |
klindgren | Example use case: Marking a project as "production" without having to make sure that the project name is "PROD-<name>" | 01:20 |
klindgren | or "test" or "dev" or something of that nature | 01:20 |
dvorak | pretty sure we don't, but that actually explains something I saw in a designate video I was watching earlier today | 01:21 |
dvorak | jlk: I assume you guys are building from internal repos using giftwrap? | 01:22 |
dvorak | klindgren: we use the description of the project to distinguish between internal projects and our (internal) customers' projects | 01:23
dvorak | seems like the metadata would be a lot better way to do that though | 01:23 |
klindgren | Asking because we wanted to use metadata - to, you know, set metadata stuff about a project. | 01:24
klindgren | but it's in the process of being removed in kilo with no replacement | 01:24
klindgren | Trying to see if anyone wants to or is currently using that feature. | 01:25 |
klindgren | Thinking that we might try to do something like the description as json blob to support that | 01:33 |
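(For reference, the description-as-JSON-blob idea above can be sketched against the Keystone v3 API roughly as follows; the endpoint and port, PROJECT_ID/ADMIN_TOKEN, and the "environment"/"owner" keys are placeholders and illustrative assumptions, not an OpenStack convention.)

```sh
# Stash arbitrary key=value tags in the project description as a JSON blob
curl -s -X PATCH "http://keystone.example.com:35357/v3/projects/$PROJECT_ID" \
  -H "X-Auth-Token: $ADMIN_TOKEN" -H "Content-Type: application/json" \
  -d '{"project": {"description": "{\"environment\": \"production\", \"owner\": \"team-x\"}"}}'

# Consumers read the project back and parse the description field
curl -s -H "X-Auth-Token: $ADMIN_TOKEN" \
  "http://keystone.example.com:35357/v3/projects/$PROJECT_ID"
```

The obvious trade-off, raised later in the log, is that the raw JSON then shows up anywhere the description is displayed.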
*** dmsimard_away is now known as dmsimard | 01:34 | |
*** SimonChung has quit IRC | 01:38 | |
*** markvoelker has joined #openstack-operators | 01:54 | |
*** dmsimard is now known as dmsimard_away | 02:03 | |
*** signed8bit has joined #openstack-operators | 02:22 | |
dvorak | well, the description shows up in horizon, so that'd be kind of ugly, but I imagine it'd work ok | 02:41 |
*** harlowja is now known as harlowja_away | 03:56 | |
*** VW_ has joined #openstack-operators | 03:56 | |
*** signed8bit has quit IRC | 04:08 | |
jlk | dvorak: mostly from upstream repos, but we have forked a few | 04:21 |
*** VW_ has quit IRC | 04:26 | |
*** blairo has joined #openstack-operators | 04:26 | |
*** VW_ has joined #openstack-operators | 04:27 | |
*** blair has quit IRC | 04:29 | |
*** blairo has quit IRC | 04:31 | |
*** VW_ has quit IRC | 04:39 | |
*** SimonChung has joined #openstack-operators | 05:19 | |
*** SimonChung1 has joined #openstack-operators | 05:21 | |
*** SimonChung has quit IRC | 05:23 | |
*** Gala-G has joined #openstack-operators | 05:25 | |
*** blair has joined #openstack-operators | 05:25 | |
klindgren | dvorak, in our particular case - we don't use horizon at all. We have our own frontend that we expose to end users. | 05:34
jlk | A common story ^ | 05:36 |
*** sanjayu has joined #openstack-operators | 05:50 | |
*** Marga_ has quit IRC | 05:54 | |
*** markvoelker has quit IRC | 07:06 | |
*** blair has quit IRC | 07:09 | |
*** racedo has quit IRC | 07:13 | |
*** harlowja_away has quit IRC | 07:23 | |
*** subscope has quit IRC | 07:29 | |
*** zerda has joined #openstack-operators | 07:29 | |
*** subscope has joined #openstack-operators | 07:44 | |
*** belmoreira has joined #openstack-operators | 08:02 | |
*** blair has joined #openstack-operators | 08:27 | |
*** zz_avozza is now known as avozza | 08:28 | |
*** subscope has quit IRC | 08:29 | |
beddari | ah .. too much is wrong about this -> http://cloudscaling.com/blog/openstack/vanilla-openstack-doesnt-exist-and-never-will/ | 08:30 |
*** matrohon has joined #openstack-operators | 08:38 | |
*** subscope has joined #openstack-operators | 08:45 | |
*** bvandenh has joined #openstack-operators | 08:55 | |
*** matrohon has quit IRC | 08:59 | |
*** derekh has joined #openstack-operators | 09:24 | |
*** Marga_ has joined #openstack-operators | 11:14 | |
*** markvoelker has joined #openstack-operators | 11:40 | |
*** markvoelker has quit IRC | 11:47 | |
*** Marga_ has quit IRC | 12:02 | |
*** Marga_ has joined #openstack-operators | 12:04 | |
*** reed has joined #openstack-operators | 12:06 | |
*** todin has joined #openstack-operators | 12:10 | |
*** zerda has quit IRC | 12:14 | |
*** markvoelker has joined #openstack-operators | 12:43 | |
*** markvoelker has quit IRC | 12:47 | |
*** VW_ has joined #openstack-operators | 12:50 | |
*** Marga_ has quit IRC | 12:53 | |
*** subscope has quit IRC | 12:55 | |
*** Marga_ has joined #openstack-operators | 12:56 | |
*** markvoelker has joined #openstack-operators | 13:01 | |
*** Marga_ has quit IRC | 13:08 | |
*** VW_ has quit IRC | 13:10 | |
*** subscope has joined #openstack-operators | 13:11 | |
*** matrohon has joined #openstack-operators | 13:11 | |
*** Ctina has joined #openstack-operators | 13:23 | |
*** pboros has joined #openstack-operators | 13:28 | |
*** sanjayu has quit IRC | 13:29 | |
*** matrohon has quit IRC | 13:40 | |
*** subscope has quit IRC | 13:55 | |
*** VW_ has joined #openstack-operators | 13:57 | |
*** subscope has joined #openstack-operators | 14:10 | |
*** signed8bit has joined #openstack-operators | 14:42 | |
*** signed8b_ has joined #openstack-operators | 14:44 | |
*** signed8bit has quit IRC | 14:47 | |
*** Gala-G has quit IRC | 14:58 | |
*** VW__ has joined #openstack-operators | 15:07 | |
*** VW_ has quit IRC | 15:09 | |
*** VW__ has quit IRC | 15:19 | |
*** david-lyle_afk is now known as david-lyle | 15:23 | |
*** VW_ has joined #openstack-operators | 15:24 | |
*** Marga_ has joined #openstack-operators | 15:28 | |
*** VW_ has quit IRC | 15:32 | |
*** VW_ has joined #openstack-operators | 15:33 | |
*** jaypipes has joined #openstack-operators | 15:43 | |
*** VW_ has quit IRC | 15:46 | |
*** Ctina_ has joined #openstack-operators | 15:48 | |
*** Ctina has quit IRC | 15:50 | |
*** VW_ has joined #openstack-operators | 15:51 | |
*** signed8b_ is now known as signed8bit_ZZZzz | 15:51 | |
*** signed8bit_ZZZzz is now known as signed8b_ | 15:53 | |
*** VW_ has quit IRC | 15:55 | |
jlk | beddari: I tend to agree with it. | 15:58 |
*** VW_ has joined #openstack-operators | 16:01 | |
*** Marga_ has quit IRC | 16:02 | |
*** Marga_ has joined #openstack-operators | 16:04 | |
*** avozza is now known as zz_avozza | 16:07 | |
klindgren | jlk - also agreed | 16:40 |
klindgren | besides most people I know already have some investment in either networking/san storage that they are happy with | 16:41 |
klindgren | and aren't really looking to replace what they know | 16:41 |
klindgren | they just want to be able to hook up what they already have/know and put an API/UI in front of it | 16:41
klindgren | The one thing that openstack is sorely missing is any sort of HA around the VMs that it spins up, and trying its best to keep a VM out of error state | 16:43
klindgren | it seems like any transient error immediately throws a vm into "ERROR" | 16:43
*** mdorman has joined #openstack-operators | 16:48 | |
jlk | That's Cloud | 16:50 |
jlk | autoscale can help there, by keeping a minimum # of roles running | 16:51 |
*** alop has joined #openstack-operators | 16:51 | |
jlk | so if one goes down, the scaler brings up a replacement. | 16:51 |
jlk | I'd rather see more effort around HA for control services. | 16:52 |
jlk | Too many single points of failure | 16:52 |
klindgren | >_> | 17:05 |
klindgren | *cough* rabbitmq *cough* | 17:06 |
klindgren | <_< | 17:06 |
mgagne | jaypipes: As an operator (or casual contributor), how can I make sure my change/blueprint gets reviewed before the feature freeze? Example: https://review.openstack.org/#/c/115409/ Should I poke cores directly until one takes care of it? I wish to avoid this bad experience in the future. | 17:17 |
jaypipes | mgagne: one moment... on a call. | 17:18 |
mgagne | jaypipes: sure, np | 17:18 |
*** VW__ has joined #openstack-operators | 17:19 | |
*** VW_ has quit IRC | 17:21 | |
*** signed8b_ has quit IRC | 17:23 | |
*** VW__ has quit IRC | 17:35 | |
jaypipes | mgagne: alrighty... so lemme take a looksie at the above patch. | 17:36 |
jaypipes | mgagne: generally if it's a smallish patch that is well-defined and contains unit tests, it shouldn't be a problem to ask a couple of cores on IRC for a review. | 17:37
jaypipes | mgagne: in this case, it looks like you got some feedback from a number of nova drivers team (sdague, jogo, johnthetubaguy, and mriedem) on the blueprint back in November and December | 17:38 |
mgagne | jaypipes: blueprint which was created months after the initial patch after someone commented: doesn't it need a blueprint? | 17:39 |
jaypipes | mgagne: one sec, still reading back through the comments :) | 17:39 |
mgagne | jaypipes: IIRC, I poked a core about that one which then required a blueprint | 17:39 |
*** belmoreira has quit IRC | 17:41 | |
jaypipes | mgagne: and, I agree with you that it's not fair that this was sitting for a long time, with an approved BP, and now has been blocked. it's just that it's not on the priority list of reviews, I'm afraid, and there's only a certain number of core reviewers :( That said, you may certainly apply for a feature freeze exception for this. You'll need 2 cores to sponsor the patch. Gimme a little while to review it and I will let you know if I can sponsor it, ok? | 17:42
*** Marga_ has quit IRC | 17:43 | |
jlk | klindgren: ugh, don't remind me. So many problems with rabbit. | 17:43 |
mgagne | jaypipes: I'm fine with going down the exception route. However, I wish to learn how to avoid this situation again, because I (honestly) think I did all I had to do to make it work. | 17:43
*** VW_ has joined #openstack-operators | 17:44 | |
*** VW_ has quit IRC | 17:44 | |
jaypipes | mgagne: the best way to avoid the situation is to pester cores early and often I'm afraid. | 17:44 |
*** VW_ has joined #openstack-operators | 17:44 | |
mgagne | jaypipes: and I feel befriending cores looks to be the only way to fast-forward stuff | 17:44
mgagne | jaypipes: alright then | 17:44 |
jaypipes | It's not about befriending :) it's about being a salesperson for your BP/patch | 17:44 |
jaypipes | and just being persistent. | 17:45 |
jaypipes | remember, at any given time, there are more than 600 patches in the review queue... | 17:45 |
jaypipes | so it's easy for cores to lose track of a patch. so it behooves you to gently remind them ;) | 17:45 |
mgagne | jaypipes: I understand/lived this unfortunate reality ^^' | 17:45
jaypipes | :) | 17:45 |
mgagne | jaypipes: thanks for your help | 17:46 |
jaypipes | any time! | 17:46 |
mfisch | klindgren: you here? | 17:48 |
klindgren | mfisch, I am | 17:48 |
mfisch | klindgren: wanted to talk more about rabbit and how you guys use it | 17:48 |
mfisch | we're currently using the "list all nodes in the config file" method | 17:48 |
klindgren | Sure - though mdorman did a lot of the config on that | 17:49
klindgren | we are currently in the clustered rabbitmq + LB | 17:49 |
klindgren | and it *SUCKS* | 17:49 |
mfisch | talking to some guys who worked for Pivotal on rabbit they recommend LB, but they have NFC about openstack | 17:49 |
mfisch | I was looking into switching to haproxy but everyone hates it | 17:49
klindgren | yea, suggestion is to not | 17:49
mfisch | so I was toying with haproxy + primary/backup | 17:49 |
klindgren | we have haproxy in dev/test | 17:49
klindgren | and we are going to pull it | 17:50
klindgren | and go back to all servers listed | 17:50
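(For context, the "list all servers in the config" approach discussed here looks roughly like the snippet below in each service's config. These are the Icehouse/Juno-era rabbit options as best remembered; newer releases move them under [oslo_messaging_rabbit], so verify the names against your release rather than treating this as authoritative.)

```ini
[DEFAULT]
# every cluster member listed directly, no load balancer in between
rabbit_hosts = rabbit1:5672,rabbit2:5672,rabbit3:5672
rabbit_ha_queues = true        # only if the cluster actually mirrors queues
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
```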
mfisch | the biggest issue I have, haproxy or not, is how to restart rabbitmq and not break all services | 17:50 |
klindgren | issue comes with haproxy long connection timeout stuff | 17:50 |
mfisch | openstack services go full retard when rabbit goes away | 17:50 |
klindgren | yea | 17:50 |
*** Marga_ has joined #openstack-operators | 17:50 | |
mfisch | thats an issue haproxy or no haproxy | 17:50 |
klindgren | full retard + sometimes no error messages about not being correctly connected to rabbitmq | 17:50 |
mfisch | yes | 17:51 |
mfisch | thats an openstack issue | 17:51 |
_nick | i've found rabbitmq to be a lot flakier with haproxy added into the mix | 17:51 |
mfisch | IMHO | 17:51 |
klindgren | yea | 17:51 |
klindgren | waiting on oslo.messaging | 17:51 |
_nick | but yeah, openstack goes batshit mental regardless if rabbitmq goes awol | 17:51 |
mfisch | so the basic idea I had before I thought about it more was that if I had haproxy I could bleed connections off before restarting a node | 17:51 |
mfisch | however the openstack connections last like forever | 17:51 |
mfisch | so thats not going to work | 17:51 |
dvorak | it does if you restart all the services :) | 17:52 |
mgagne | mfisch: no actual exp. with haproxy+rabbitmq but operators suggest to not do it. Rabbit knows better about the state of the cluster and queue replication than haproxy alone. | 17:52 |
mfisch | yeah that was your idea | 17:52 |
klindgren | 1.5.2(?) to get committed | 17:52
mfisch | the fundamental issue here imho is openstack's reaction to rabbit going away | 17:52 |
dvorak | if you list everything in the config file, how do you take down a node then? | 17:52 |
jlk | ooh rabbit talk | 17:52 |
jlk | yes | 17:52 |
dvorak | we do that now, and restarting rabbit is miserable | 17:52 |
mfisch | rabbit is on my time machine list | 17:53 |
jlk | We tried rabbit as a list in configs | 17:53 |
jlk | but when we failed over to another rabbit server, the services just sat there looking dumb | 17:53 |
klindgren | yea | 17:53 |
mfisch | I don't think haproxy solves that issue | 17:53
_nick | jlk: exactly what we've experienced as well | 17:53 |
jlk | we've got back to rabbit running on two systems | 17:53 |
klindgren | it's supposed to have a heartbeat (coming soon) | 17:53
jlk | but not as a cluster | 17:53 |
klindgren | that should figure out it's dead pretty fast and move to another server | 17:53
mgagne | dvorak: we are able to take down any node without problem. However if you restart the 3 of them one after the other, AFAIK, you could have problems | 17:53
jlk | when failover happens, queues get re-created on the waiting rabbit server | 17:53 |
jlk | and services seem to notice pretty quickly | 17:53 |
dvorak | mgagne: well, and that's usually what we want to do :) | 17:53 |
jlk | there is a tiny chance at lost messages | 17:53 |
mfisch | it seems that no matter what you do there's a chance of lost messages with rabbit | 17:54 |
mgagne | dvorak: then you take out your ansible toolbox and restart the openstack world ^^' | 17:54 |
dvorak | mgagne: that's kind of what I've been suggesting we do anyway :) | 17:54 |
mfisch | mgagne: 3 control nodes and 3 rabbit servers = 27 restarts to restart all of rabbit though | 17:54 |
dvorak | I don't want to have to schedule an API outage every time we need to reconfigure or upgrade rabbit | 17:55 |
jlk | so far, we've had good success with just a single rabbit server running at a time | 17:55 |
jlk | with services pointing at a floating IP for it | 17:55 |
mfisch | jlk: whats your 2nd server exactly in that environment? | 17:55 |
mgagne | mfisch: yep, I think we are all in the same boat regarding flaky support for heartbeat/failover | 17:55 |
jlk | at least in our limited juno testing | 17:55 |
*** Marga_ has quit IRC | 17:55 | |
jlk | mfisch: it's just a rabbit server, not configured for HA or clustering or anything | 17:55 |
mfisch | we're almost to juno, i'm sure all problems will be solved | 17:56 |
jlk | just sitting there unused | 17:56 |
mfisch | jlk: so how do you failover exactly? | 17:56 |
jlk | ucarp floating IP | 17:56 |
klindgren | basically, as it sits right now, our process for doing anything with rabbitmq is | 17:56
dvorak | klindgren: is there a specific patch in review for heartbeats? | 17:56 |
klindgren | make sure rabbitmq cluster is not fubar | 17:56
jlk | when the IP moves, services timeout on their old connection and reconnect, which re-creates all the necessary queues | 17:56 |
klindgren | restart world | 17:56 |
klindgren | and life continues | 17:56 |
mgagne | that never ending hunt for stability and bug fixes. You will find new bugs in juno and wish you could upgrade to kilo asap | 17:56 |
klindgren | dvorak, yes | 17:56 |
mfisch | mgagne: it pays the bills at least | 17:56 |
klindgren | https://review.openstack.org/#/c/146047/ | 17:57 |
mgagne | mfisch: can't argue with that haha | 17:57 |
klindgren | supposedly other people were saying oslo.messaging 1.5.1 had some fixes in it to make rabbit stuff better | 17:57
klindgren | and that guy should make things at least recover from full retard | 17:57
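(The review linked above adds consumer-side AMQP heartbeats to the rabbit driver. Once it lands, enabling it is expected to look roughly like the snippet below; the option names and section are taken from the proposed change and could differ in the released version, so treat them as assumptions.)

```ini
[oslo_messaging_rabbit]
# client-side AMQP heartbeats: dead broker connections get noticed in
# seconds instead of waiting for TCP timeouts
heartbeat_timeout_threshold = 60
heartbeat_rate = 2
```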
jlk | when we tried doing rabbit clustering with juno, it was not a good story | 17:58 |
jlk | when one of our rabbit nodes goes down, the services never seem to jump to the other one | 17:58 |
jlk | so we'd have to restart any service that uses rabbit | 17:58 |
dvorak | we're running 1.4.1 :( | 17:58
mgagne | but then you have your distro that is stuck to an archaic version of oslo.messaging and you wish you had venv instead. | 17:58 |
jlk | and wait for half of their worker threads to timeout to the downed rabbit server | 17:58 |
klindgren | yea from what someone else was saying the switch from whatever messaging to oslo-messaging in icehouse | 17:58
dvorak | mfisch: we need to ditch all these UCA packages. | 17:58 |
mgagne | dvorak: hahaha | 17:58 |
klindgren | was a huge setback in terms of stability | 17:58
*** Ctina_ has quit IRC | 17:59 | |
jlk | Mirantis seems to have found a bunch of stuff wrong with oslo rabbit | 17:59 |
klindgren | dvorak, we are running 1.4.1 as well | 17:59 |
jlk | and a bunch of PRs are up | 17:59 |
dvorak | yeah, 1.4.1 is the latest available from canonical | 17:59 |
mgagne | but unfortunately you developed stockholm syndrome with UCA | 17:59 |
jlk | I'm hoping they get merged in Kilo, and we'll try clustered rabbit then again | 17:59 |
mfisch | I don't feel like I've got a good sense that I should bother to change anything right now | 18:00
mfisch | more like stay where we're at and wait | 18:00 |
dvorak | mgagne: I really don't want to be using UCA, but I don't want to package all this crap myself either. Looking into giftwrap currently. | 18:00 |
mgagne | dvorak: the idea is floating here too. let me know how it goes. | 18:00 |
mfisch | hey dvorak if we dont want to use UCA we could go work for godaddy, problem solved! | 18:00 |
klindgren | personally - we are just going to roll a new version of oslo.messaging under juno | 18:01
klindgren | and see if things aren't instantly better | 18:01 |
dvorak | I'd argue that RPM packaging is significantly less stupid than debian packaging. | 18:01 |
klindgren | I can't take rabbitmq issues anymore | 18:01 |
klindgren | "reliable" messaging my ass | 18:01 |
dvorak | klindgren: well, I think at least half the problem is the applications | 18:01 |
mfisch | I'd say 90% | 18:01 |
mgagne | ^ | 18:01 |
dvorak | if you don't at least do basic things like turning on heartbeats... | 18:01 |
mfisch | I think that rabbit would say "use rabbit like this" and openstack said "whatever" | 18:02 |
dvorak | yes, that's what I suspect. | 18:02 |
*** Piet has joined #openstack-operators | 18:02 | |
mfisch | if rabbit was this bad why would anyone use it? plenty of apps run without issues | 18:02 |
*** Gala-G has joined #openstack-operators | 18:02 | |
mgagne | I wonder if any dev are actually running openstack in a production env. | 18:03 |
* mgagne takes cover | 18:03 | |
mfisch | we all know the answer to that one | 18:03 |
mfisch | a fresh built devstack every morning is prod to most people | 18:03 |
*** derekh has quit IRC | 18:03 | |
jlk | mfisch: there are few alternatives to rabbit | 18:03 |
jlk | in the amqp space | 18:03 |
klindgren | pretty sure if you upgrade oslo.messaging and everything uses oslo-messaging and you add the heartbeats and stuff to the configs - I think it should get better no? Not exactly sure how much of the rabbitmq implementation is in oslo.messaging | 18:03 |
dvorak | I know RDO used to use qpid, but I don't know if the still are. | 18:04 |
klindgren | qpid and zeromq? | 18:04 |
dvorak | klindgren: I'd expect that all of it is in oslo.messaging | 18:04 |
jlk | zeromq is a worse story actually | 18:04 |
jlk | not enough testing | 18:04 |
mgagne | "Have you tried to ./unstack.sh and ./stack.sh it back up?" | 18:04 |
klindgren | I dunno anyone running qpid? | 18:04 |
mfisch | are rabbitmq complaints on the mid-cycle meetup list? | 18:04
mfisch | mgagne: lol | 18:04 |
jlk | qpid is a huge pile of java, which we'd rather avoid | 18:04
klindgren | having run activemq before | 18:04
klindgren | not only no - but hell no on that one | 18:05
klindgren | activemq made another application switch to "unreliable messaging" | 18:05
klindgren | which imho is a great way to do it | 18:05 |
klindgren | if something hasn't picked up your request in x amount of time - send it again | 18:05
klindgren | I would be fine about a session on messaging in mid-cycle | 18:06 |
jlk | yeah, a mutual gripe session | 18:06 |
mfisch | we can make cars drive themselves but not reliable IPC in 2015... | 18:06 |
klindgren | I would be ok talking about what we have and how its working/pain points | 18:06 |
mfisch | I hope self-driving cars arent using rabbit now that I think about it | 18:07 |
jlk | same. | 18:07 |
dvorak | mfisch and I spent most of two days with some pivotal guys and picked up a lot. I might even remember some of it | 18:07 |
klindgren | honestly though - I don't know anyone who is happy with their rabbitmq/messaging setup | 18:07
mfisch | I'll add it if its not here | 18:07 |
mfisch | https://etherpad.openstack.org/p/PHL-ops-meetup | 18:07 |
klindgren | at least under openstack | 18:07
dvorak | klindgren: I've certainly talked to people that are really happy using rabbit for things that aren't openstack | 18:08 |
mfisch | added to the list, please +1 | 18:08 |
dvorak | but most of those people have developers on staff that wrote the apps that talk to rabbit and know what they're doing | 18:08 |
mfisch | dvorak: solution, we should go work on those apps | 18:08 |
dvorak | I've actually done development against rabbitmq, but it was a long time ago | 18:09 |
*** VW__ has joined #openstack-operators | 18:09 | |
klindgren | dvorak, we use rabbitmq in our logstash setup | 18:09 |
klindgren | and honestly - it jsut works | 18:09 |
klindgren | outside of standard rabbitmq clustering upgrade issues | 18:10 |
klindgren | eg: new version of erlang? | 18:10 |
*** VW_ has quit IRC | 18:10 | |
dvorak | yeah, that's the thing that's biting us right now | 18:11 |
mfisch | new version of erlang is going to cause restarts | 18:11 |
dvorak | there is a new version of erlang available from ubuntu and we're afraid to upgrade | 18:11
dvorak | it's really sad when rabbitmq makes galera look like an easy to manage application | 18:12 |
*** VW__ has quit IRC | 18:12 | |
jlk | the problem is that most development focuses on things working under optimal conditions | 18:13
mfisch | I'd marry galera after dating rabbit | 18:13 |
jlk | not focusing on things working when rabbit fails underneath them | 18:13 |
jlk | or fails over to another rabbit server. | 18:13 |
*** VW_ has joined #openstack-operators | 18:13 | |
mfisch | I don't think devs ever see the rabbit issues we have, at least not often | 18:13
mfisch | you can't simulate them well on a laptop | 18:13 |
jlk | you can if you launch two instances for rabbit | 18:14 |
jlk | and kill one of them :) | 18:14 |
jlk | need multi-node devstack | 18:14 |
mfisch | I wonder how often launching 2 rabbits and killing one is tested | 18:14 |
jlk | Rackspace just took a different approach. They only have one rabbit server and a hot spare. If one dies, well, it's restart the world time. | 18:15 |
jlk | mfisch: approaching 0 | 18:15 |
*** Gala-G has quit IRC | 18:15 | |
mfisch | thats the way I was headed with a primary/backup behind haproxy | 18:15 |
mfisch | until I heard the issues here | 18:15 |
jlk | nod | 18:15 |
jlk | I like our current approach | 18:15 |
jlk | "like" | 18:15 |
jlk | as in prefer it over the alternatives. But I don't really "like" it | 18:16 |
klindgren | jlk - honestly - looking to move to something like that | 18:16 |
*** Marga_ has joined #openstack-operators | 18:16 | |
jlk | my main motivation is to not have to have somebody have to restart a bunch of things if there is a network failover | 18:16 |
mfisch | klindgren: have you tried packaging up the newer oslo-messaging? | 18:16 |
*** signed8bit has joined #openstack-operators | 18:18 | |
mgagne | jlk: and it's done by having a active/standby setup? | 18:19 |
*** signed8bit is now known as signed8bit_ZZZzz | 18:19 | |
jlk | mgagne: kinda, more like solo active and solo active. Neither rabbit server knows about the other one | 18:19 |
mgagne | jlk: any other issue related to firewall session timeout or such? | 18:19 |
jlk | no, our firewall just allows the incoming rabbit port. It's a local firewall on the host, not a device. | 18:20 |
jlk | our clouds are very small so everything is fairly localized | 18:20 |
mgagne | right | 18:20 |
mfisch | so if rabbit gets restarted on a box, what happens jlk | 18:20 |
mfisch | or the node reboots | 18:20 |
jlk | if the node reboots, ucarp moves the floating IP over. All hosts are pointed at the floating IP | 18:21 |
jlk | on the non-active rabbit it'll suddenly start getting connections, and queues get re-created by all the producers/consumers | 18:21 |
mfisch | so messages already in queues are lost? | 18:21 |
jlk | anything unconsumed, yes | 18:21
mfisch | unconsumed messages | 18:21 |
mfisch | ok | 18:21 |
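(A rough sketch of the ucarp arrangement jlk describes: the same command runs on both rabbit hosts, only --srcip differs, and whichever host wins the CARP election runs the up-script that claims the floating IP that all the OpenStack services point at. The addresses, vhid, password, and script paths are placeholders.)

```sh
# run on each rabbit host; only --srcip differs per host
ucarp --interface=eth0 --srcip=10.0.0.11 --vhid=42 --pass=s3cret \
      --addr=10.0.0.10 \
      --upscript=/etc/ucarp/vip-up.sh --downscript=/etc/ucarp/vip-down.sh

# /etc/ucarp/vip-up.sh claims the VIP, vip-down.sh releases it:
#   ip addr add 10.0.0.10/24 dev eth0
#   ip addr del 10.0.0.10/24 dev eth0
```

rabbit_host in the OpenStack configs then points at the floating 10.0.0.10.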
klindgren | eh jlk - from experience - pretty sure that's not going to work out very well | 18:22
jlk | it's up to the producers to know that their message wasn't consumed | 18:22 |
jlk | either they don't care, or they do and handle it | 18:22 |
klindgren | we have the same thing happen with haproxy where it hits an idle timeout and a connection gets moved from one server to another | 18:22
mfisch | I agree in theory, not sure that's how openstack works | 18:22
*** signed8bit_ZZZzz is now known as signed8bit | 18:22 | |
klindgren | and the client doesn't really realize that it's been disconnected | 18:22
jlk | at least in juno, some of the clients do | 18:23 |
klindgren | and we're running clustered rabbitmq there so it should have access to the same queues on the new stuff as the old stuff | 18:23
jlk | like nova-conductor and nova-compute | 18:23 |
jlk | klindgren: we aren't running clustered, that may be the key | 18:23 |
klindgren | kk | 18:23 |
jlk | nor do we have HA queues turned on in the openstack configs | 18:23 |
*** radez_g0n3 is now known as radez | 18:23 | |
mfisch | we have ha queues enabled | 18:24 |
jlk | IIRC you're only supposed to use that with clustered rabbit | 18:24
mfisch | do any of you guys have TTL set on messages to help cleanup the random queues with no consumers? | 18:24 |
klindgren | we don't afaik | 18:24 |
mdorman | hey guys just getting back here. | 18:24 |
jlk | no, we haven't had any problems with unconsumed messages | 18:24 |
jlk | we have a sensu check for unconsumed messages even, and it never fires unless something really weird happens. | 18:24 |
mfisch | we occasionally will have queues that have say 900 messages with no producers and no consumers | 18:25
klindgren | mdorman, knows more about our rabbitmq setup | 18:25 |
jlk | (like a network failover without a server actually restarting, a couple queues could get left stale on the now no-longer active rabbit) | 18:25 |
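(On the TTL question: that is normally done with a RabbitMQ policy rather than anything in OpenStack's configs. A minimal sketch with placeholder policy name, vhost, and timeouts; "expires" removes queues that have gone unused for the given period, "message-ttl" drops messages that sit unconsumed longer than theirs. Note that dropping unconsumed RPC messages can itself surface as timeouts in the services.)

```sh
# one policy covering both: delete queues unused for an hour, and drop
# messages that sit unconsumed for more than 10 minutes
rabbitmqctl set_policy -p / cleanup ".*" \
    '{"expires": 3600000, "message-ttl": 600000}' --apply-to queues
```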
mfisch | mdorman: drive on up to FTC and I'll buy you a lemonade | 18:25 |
klindgren | mfisch, I haven't tried packaging a newer oslo.messaging - however when that heartbeat stuff makes it in - it's on the list to do | 18:25
klindgren | I have a story every sprint that I am moving so I keep track of it | 18:26 |
mdorman | actually i’m in arizona right now, so maybe in a couple weeks :) | 18:26 |
mfisch | klindgren: same here | 18:26 |
mdorman | anyways, got another meeting in 30 min so i gotta go get some lunch | 18:26
mfisch | jlk: whats your procedure for a "nice" failover in your environment? | 18:27 |
mfisch | jlk: like a scheduled maintenance | 18:27 |
mdorman | mfisch: i’ll tell you about RMQ if you tell me about federated keystone :) | 18:27 |
mfisch | ours is more like in quotes federated | 18:28 |
mdorman | haha | 18:28 |
jlk | mfisch: we would initiate a floating IP failover | 18:28 |
jlk | well, we'd do maint on the non-active one first, then do a falover and do maint on the other one | 18:28 |
dvorak | other than the galera crashes, it's kind of like federated keystone on "easy" setting | 18:28 |
mfisch | jlk: do you restart any services? shutdown rabbit etc? | 18:28 |
jlk | depends on what we're touching | 18:28 |
jlk | typically rabbit itself doesn't need to get messed with | 18:28 |
mfisch | let's assume rabbit has to restart so connections are lost | 18:28
jlk | only openstack services | 18:28 |
jlk | if you restart rabbit itself, but don't move the floating IP, I think clients notice | 18:29 |
mgagne | jlk: have you changed workplace recently? (looking at past openstack summit presenation) | 18:29 |
jlk | but I haven't tested that in a bit | 18:29 |
jlk | mgagne: yes, I was at Rackspace previously, I'm at Blue Box now | 18:29 |
jlk | in experience, we haven't had to do anything specific to rabbit in like... ever | 18:29 |
mfisch | jlk: I'm trying to figure out how your setup would differ from ours in that respect. If I assume connection to rabbit is lost I know we have to sometimes restart openstack services | 18:29 |
jlk | outside of needing to reboot the system rabbit runs on | 18:30 |
*** VW_ has quit IRC | 18:30 | |
mgagne | jlk: this explains the discrepancy between your workplace and profile text on sched ;) | 18:30 |
jlk | heh | 18:30 |
jlk | mfisch: yeah there are some scenarios where a restart may be necessary | 18:30 |
*** harlowja has joined #openstack-operators | 18:30 | |
mfisch | apt-get install --upgrade erlang is on our radar | 18:31 |
jlk | mfisch: at Rackspace we ran into that, when the rabbit server itself hiccuped or the network between the two hiccuped | 18:31
jlk | I designed an ansible playbook there that would reboot any "bus" service | 18:31 |
jlk | er not reboot, but restart. | 18:31 |
mgagne | jlk: sounds like we use the same workaround ^^' | 18:31 |
jlk | so things like nova-scheduler, nova-conductor, nova-compute. But not nova-api | 18:31 |
jlk | wasn't pretty | 18:32 |
jlk | I was hoping clustered rabbit would resolve that | 18:32 |
jlk | alas.... | 18:32 |
klindgren | nova-cells? | 18:32 |
jlk | oh yeah, nova-cells | 18:32 |
jlk | I forget about that, we don't use cells at BB | 18:32 |
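(The "restart the bus services" step can be done with plain ansible ad-hoc commands rather than a full playbook; a sketch, assuming inventory groups named "controllers" and "computes" - the group and service names are whatever your deployment actually uses.)

```sh
# restart the rabbit consumers, leave the API services alone
ansible controllers -m service -a "name=nova-scheduler state=restarted"
ansible controllers -m service -a "name=nova-conductor state=restarted"
# fan out across hypervisors a few at a time
ansible computes -f 10 -m service -a "name=nova-compute state=restarted"
```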
mgagne | jlk: any issue related to firewall? due to missing rabbit heartbeat, sessions often get dropped if we don't bump tcp keepalive in kernel =( | 18:32 |
klindgren | I thought that nova-api would stuff messages on the queue? | 18:32 |
jlk | the nice thing about all those services is that customers don't directly hit them, so they can be restarted at-will | 18:33 |
klindgren | eh | 18:33 |
klindgren | not really | 18:33 |
klindgren | if nova-compute is downloading a backing image | 18:33 |
jlk | mgagne: at Rackspace there was a firewall rule that allowed long lived connections to rabbit | 18:33 |
klindgren | and your reboot it - boom dead vm | 18:33 |
jlk | klindgren: graceful restart | 18:33 |
jlk | that's actually a thing in nova | 18:33 |
klindgren | shit | 18:34 |
klindgren | need to figure out how to do that with systemd | 18:34 |
mgagne | jlk: haha, our netadmin would go berserk if we asked him to do the same ^^' | 18:34 |
klindgren | I got 99 problems and systemd is one | 18:34 |
jlk | klindgren: nova accepts a TERM signal | 18:34 |
jlk | when it gets that, it'll go into cleanup mode. Do all its running stuff, don't take on anything new | 18:34
jlk | it'll run indefinitely in that mode until either all running stuff completes, or it gets a kill signal | 18:35 |
jlk | on debian platforms, start-stop-daemon was nice for that, since you could feed it a set of signals and a timer between them | 18:35 |
jlk | so we baked that into our init script | 18:36 |
jlk | graceful shutdowns, nova-conductor, kinda awesome for doing low impact sweeping restarts | 18:37 |
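(The start-stop-daemon trick looks roughly like this in the init script's stop action: send TERM, give the service a long grace period to drain in-flight work, then KILL as a last resort. The pidfile path and the 300-second grace period are placeholders.)

```sh
# in the init script's stop) action: TERM, wait up to 300s for in-flight
# work to drain, then KILL after a further 5s
start-stop-daemon --stop --oknodo \
    --pidfile /var/run/nova/nova-compute.pid \
    --retry TERM/300/KILL/5
```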
klindgren | ah - jlk do you have anything for systemd? :-) | 18:38 |
jlk | no. I was unsuccessful in convincing RAX to move to anything newer than debian squeeze when I was there. | 18:39 |
jlk | and BB is still on Ubuntu Precise | 18:39 |
jlk | I'm quietly making a play to move to CentOS7 but I haven't spent much time on that | 18:39 |
jlk | first up was getting from havana to juno | 18:39 |
klindgren | moved to cent7 and got chaos monkey for free (systemd) | 18:39 |
jlk | I'm not as angered by systemd, but I was a RHT employee for many years and was part of the transition | 18:40 |
*** mdorman is now known as mdorman_away | 18:40 | |
jlk | I'd take systemd over upstart, but mostly I just want consistency. | 18:41 |
jlk | with precise, some things are upstart, some things are traditional init.d stuff, and it's difficult to manage | 18:42 |
klindgren | eh - I am not as angry about systemd. But immediately ran into problems | 18:42 |
klindgren | like systemd nuking cgroups created by libvirt | 18:42 |
jlk | that would be a bummer | 18:42 |
klindgren | logstash not starting on bootup randomly | 18:42 |
klindgren | systemctl saying that everything is happy and fine when service really == dead | 18:42 |
klindgren | jlk, yea - it's been fixed now | 18:43
klindgren | but that was a fun one :-) | 18:43 |
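(For the systemd question above, the rough equivalent is a unit drop-in that leans on nova's TERM handling and stretches the stop timeout; a sketch only, with the unit name and the 300-second timeout as assumptions for your packaging.)

```ini
# /etc/systemd/system/openstack-nova-compute.service.d/graceful.conf
# (run `systemctl daemon-reload` after dropping this in)
[Service]
KillSignal=SIGTERM
KillMode=process
TimeoutStopSec=300
```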
*** bvandenh has quit IRC | 18:45 | |
klindgren | Is anyone from mirantis going to be at the ops-midcycle? | 18:49
*** signed8bit has quit IRC | 18:50 | |
klindgren | Since we see a bunch of patches re: oslo.messaging, and pretty much everyone who talked here had issues with openstack + rabbitmq, wondering if they have some secret sauce figured out for making it less of a pain | 18:51
*** signed8bit has joined #openstack-operators | 18:51 | |
*** zz_avozza is now known as avozza | 18:52 | |
mgagne | klindgren: +1 on that one | 18:53 |
*** signed8bit is now known as signed8bit_ZZZzz | 19:01 | |
*** avozza is now known as zz_avozza | 19:03 | |
*** signed8bit_ZZZzz is now known as signed8bit | 19:04 | |
*** bradm has quit IRC | 19:12 | |
*** VW_ has joined #openstack-operators | 19:14 | |
*** signed8b_ has joined #openstack-operators | 19:27 | |
*** signed8bit has quit IRC | 19:31 | |
jaypipes | mgagne: btw, I haven't forgotten about you. still going through a review. | 19:40 |
mgagne | jaypipes: sure, I know you are busy, we are all busy =) | 19:40 |
mgagne | guys, how about we document somewhere the common pitfalls with rabbit and openstack? If there are known solutions (or absence of solutions), let's write them down too. | 19:46
jlk | it should go in the HA guide | 19:46 |
jlk | which right now sets people up for failure (doing a rabbit cluster with HA queues) | 19:46 |
mgagne | I don't consult the guide much; lots of good ideas, but I often think I'm better or different and implement a different solution. But the pitfalls stay the same (and undocumented). | 19:48
jaypipes | mgagne: ++ on your suggestion about rabbit+openstack issues. | 19:49 |
klindgren | yea - there was an ops mailing list thing that went around a few weeks ago - that's what turned me on to the heartbeat stuff | 19:52
mgagne | Same with MySQL/MariaDB/Galera I would say. I'm still trying to understand the (not so clear) pitfalls of Galera (reading the topic on openstack-dev and jaypipes' blog post). I'm still on the fence about using Galera instead of traditional DRBD/heartbeat setup | 19:52 |
klindgren | I am responding to that to see if someone from mirantis who has been doing work on their HA stuff for fuel is going to be at the meetup - would like them to talk about their setup and the fixes they committed | 19:53
jaypipes | mgagne: we ran 12 availability zones at AT&T in a multi-writer Galera cluster across the WAN and never had issues with it. For the internal AZ clusters (for Nova, Neutron, Cinder dbs), we used a 4-node Galera multi-writer cluster, load balanced equally across all nodes, and again never had issues. | 19:53 |
mgagne | I just don't know how to reconcile the "ops ways" of doing thing (casual wiki) vs the openstack manual (formal docbook) | 19:53 |
jaypipes | mgagne: now... is it possible to run a stress test that deliberately tries to swamp the database with concurrent requests for hotspot data and get a bunch of retries due to deadlocks? yes. does it really happen in production sites? not really. | 19:54
klindgren | for us at least, rabbitmq seems to be the source of the majority of our issues. IE issues where consumers aren't picking up messages - openstack breaks after something with a rabbitmq node happens, or an LB failover or something like that, and a "restart world" solves the problem | 19:55
mgagne | jaypipes: those are the kind of stories I wish to hear. I don't care about endless discussions about theoretical situations which statistically never happen, just for the sake of arguing about something. | 19:55
jaypipes | exactly. | 19:55 |
jaypipes | klindgren: at AT&T, I would say keeping RabbitMQ up and happy was our #1 issue from an ops perspective as well. | 19:56 |
jaypipes | we ran it in cluster mode, active/active spread round robin with sticky sessions and mnesia persistence. | 19:56 |
klindgren | jaypipes, and did that work for you? We have tried the clustered rabbit behind a LB and clustered rabbit + listing all the rabbitmq nodes in the configuration | 20:00 |
klindgren | and none of them seem to failover correctly | 20:01 |
jaypipes | on phone one sec | 20:01 |
klindgren | though from experience - the loadbalancer config seemed to be worse than listing servers in config. | 20:01
klindgren | Mainly because of connections getting transferred to other nodes due to reasons, and the client being totally oblivious to the fact something changed | 20:02 |
*** Marga_ has quit IRC | 20:03 | |
*** Marga_ has joined #openstack-operators | 20:03 | |
*** mdorman_away is now known as mdorman | 20:04 | |
*** Marga_ has quit IRC | 20:08 | |
alop | Does anyone have any experience 'inspecting' images in glance? | 20:08 |
alop | I'm thinking this would be something worth writing a blueprint for | 20:08 |
alop | we've been talking about it at work, like, we allow users to upload whatever they want, and can't impose a naming scheme | 20:09 |
alop | and we've gotten ourselves into a situation where we've agreed to report to Microsoft on the number of instances running windows | 20:09 |
alop | even if we don't provide it | 20:09 |
alop | ^Which I thought was a huge fail on the lawyers' part | 20:09
alop | Like, if it was *my* image, that I was providing users, then I could do something with metadata, naming scheme, etc. to make it easy to report on usage | 20:10
klindgren | like people doing byoi and running windows with their own keys and the like? | 20:10 |
alop | ya | 20:10 |
alop | somehow, that' *my* problem | 20:11 |
klindgren | erm | 20:11 |
klindgren | suggestion to go back to microsoft and "ask" someone else | 20:11 |
alop | so, I'm looking at libguestfs, inspect-os | 20:11 |
alop | haha | 20:11 |
alop | I envision Monty Python and the Holy Grail | 20:11 |
alop | where King Arthur is talking to the French soldiers | 20:11 |
alop | "Is there someone else we can talk to?" | 20:12 |
klindgren | honestly - if you don't provide windows as an offering on your cloud | 20:12
klindgren | then people running images is not your problem | 20:12 |
klindgren | microsoft only cares about you using SPLA licenses and providing windows to customers | 20:12
alop | Exactly, either the business/legal people totally misunderstand, | 20:12 |
klindgren | if you do that then you have all sorts of bullshit to deal with | 20:12 |
alop | exactly | 20:13 |
alop | but talking to the Business/Product people here is like talking to a Horse | 20:13 |
alop | j05h: ^^^ | 20:13 |
klindgren | well - I am pretty sure that if you report people running their own windows images + their own licensing - microsoft is going to come after you to pay for licensing of those instances. | 20:14
alop | that's exactly what I said! | 20:14 |
alop | like, the users would have already paid for their own windows | 20:14 |
alop | and we're charging them again?!?! | 20:15 |
alop | Like, with Redhat, it's pretty easy. You don't register the instance, you get no package updates/repos/support | 20:15 |
klindgren | Anyway - it's been a while since I did anything with microsoft. I try to avoid windows as much as possible these days. But some general googling seems to indicate that customers can only license servers that are on-premises. | 20:25
klindgren | Though unsure whose problem that really becomes, unless you want to provide the license and also take on the support burden of microsoft stuff. Including the crazy licensing requirements | 20:26
alop | yeah, I'm going to suggest that we get to see what the actual agreement was. It's possible that the business people are misinterpreting it | 20:27
klindgren | the other problem - from experience 4+ years ago | 20:27
klindgren | is the answer changes depending on who you talk to | 20:27 |
klindgren | :-) | 20:27 |
klindgren | even in the same division @ Microsoft | 20:28 |
dvorak | mgagne: mfisch and I have both spent a lot of time on galera care and feeding, upgrades etc. We'd be glad to help on anything | 20:29 |
dvorak | we run a local 3 node galera cluster in each region (2 regions) and a 6 node + arbitrator cluster across the two regions for keystone + horizon sessions | 20:29 |
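(The cross-region arbitrator mentioned here is garbd, the Galera arbitrator daemon: it joins the cluster as a voting member to break ties but holds no data. A sketch with placeholder node addresses and cluster name; --group must match the cluster's wsrep_cluster_name.)

```sh
# on a third host/site: join as a data-less, tie-breaking voter
garbd --address gcomm://db1.region1:4567,db1.region2:4567 \
      --group keystone_cluster --daemon
```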
mfisch | I am missing context all I see is MSFT talk | 20:30 |
alop | oh, someone is telling me that I need to figure out a way to determine which instances are running windows | 20:31 |
dvorak | sorry, it was way back in scrollback | 20:31 |
alop | and not just from "my" images | 20:31 |
alop | but from images they might add to glance | 20:31 |
alop | so, I'm looking at libguestfs "inspect-os" | 20:31 |
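(A sketch of the libguestfs approach: pull the image out of glance and let the inspection tooling identify the OS, all read-only. The image ID and temp path are placeholders, and the grepped XML field names are from virt-inspector's usual output, so double-check against your libguestfs version.)

```sh
glance image-download --file /tmp/suspect.qcow2 $IMAGE_ID
virt-inspector -a /tmp/suspect.qcow2 | grep -E '<name>|<distro>|<product_name>'
# e.g. <name>windows</name> ... <product_name>Windows Server 2012 ...</product_name>
```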
dvorak | alop: I'm not sure if we actually have any windows instances, but my understanding is that if you buy datacenter licenses for a compute node, the instances hosted on it are automatically covered | 20:32 |
mfisch | yeah I think our plan is to pay per host | 20:32 |
dvorak | I believe the plan has been to license a specific number of compute hosts and use aggregates to schedule windows instances only on those compute hosts | 20:32 |
alop | I'm being told that there's some specific german law at play here | 20:32 |
alop | it just seems like a fools errand | 20:33 |
dvorak | ah, ok. I couldn't begin to cover that :) | 20:33 |
*** zz_avozza is now known as avozza | 20:33 | |
alop | I'd love to write a blueprint for a disk inspector for glance | 20:33 |
alop | but the reasoning is upsetting | 20:33 |
mgagne | dvorak: that's the solution we went with: aggregates, image props + KMS | 20:33 |
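(A sketch of the aggregate + image-property approach described here: tag the licensed hosts with an aggregate, tag Windows images with a matching property, and let the scheduler's AggregateImagePropertiesIsolation filter keep them together. The aggregate name, host names, and the os_distro value are examples rather than requirements.)

```sh
nova aggregate-create windows-licensed
nova aggregate-add-host windows-licensed compute-03
nova aggregate-set-metadata windows-licensed os_distro=windows
glance image-update --property os_distro=windows $WINDOWS_IMAGE_ID

# scheduler side (nova.conf): include the filter that matches image
# properties against aggregate metadata
# scheduler_default_filters = AggregateImagePropertiesIsolation,RetryFilter,...
```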
*** Marga_ has joined #openstack-operators | 20:34 | |
alop | one thing I could think of is to change all the root disk sizes to something much smaller, which would not be suitable for windows, then make a flavor named "m1.windows" or something, which has the large root disk windows likes | 20:36
alop | but that wouldn't cover all instances | 20:36
mgagne | I can only imagine a troll face: problem? :D | 20:38 |
*** avozza is now known as zz_avozza | 20:38 | |
*** Marga_ has quit IRC | 20:38 | |
*** matrohon has joined #openstack-operators | 20:43 | |
*** signed8b_ is now known as signed8bit_ZZZzz | 20:55 | |
*** Marga_ has joined #openstack-operators | 20:56 | |
*** signed8bit_ZZZzz is now known as signed8b_ | 20:59 | |
*** bvandenh has joined #openstack-operators | 21:04 | |
*** zz_avozza is now known as avozza | 21:06 | |
*** Marga_ has quit IRC | 21:10 | |
*** Marga_ has joined #openstack-operators | 21:11 | |
*** Marga__ has joined #openstack-operators | 21:15 | |
*** Marga_ has quit IRC | 21:15 | |
*** matrohon has quit IRC | 21:16 | |
*** matrohon has joined #openstack-operators | 21:16 | |
*** Marga__ has quit IRC | 21:17 | |
*** Marga_ has joined #openstack-operators | 21:18 | |
*** matrohon has quit IRC | 21:40 | |
*** Piet has quit IRC | 21:46 | |
*** blair has quit IRC | 21:50 | |
*** bvandenh has quit IRC | 21:51 | |
*** Rockyg has joined #openstack-operators | 21:54 | |
*** VW__ has joined #openstack-operators | 21:58 | |
*** Marga_ has quit IRC | 21:59 | |
*** Marga_ has joined #openstack-operators | 21:59 | |
*** VW_ has quit IRC | 22:01 | |
*** Marga_ has quit IRC | 22:04 | |
*** jaypipes has quit IRC | 22:04 | |
*** VW_ has joined #openstack-operators | 22:05 | |
*** avozza is now known as zz_avozza | 22:05 | |
*** VW__ has quit IRC | 22:08 | |
*** Gala-G has joined #openstack-operators | 22:12 | |
*** Marga_ has joined #openstack-operators | 22:16 | |
*** Marga_ has quit IRC | 22:16 | |
*** Marga_ has joined #openstack-operators | 22:17 | |
*** VW_ has quit IRC | 22:21 | |
*** blair has joined #openstack-operators | 22:27 | |
*** VW_ has joined #openstack-operators | 22:34 | |
*** signed8b_ has quit IRC | 22:44 | |
*** radez is now known as radez_g0n3 | 22:46 | |
*** pboros has quit IRC | 22:57 | |
*** VW__ has joined #openstack-operators | 23:01 | |
*** VW_ has quit IRC | 23:03 | |
*** VW__ has quit IRC | 23:10 | |
*** Rockyg has quit IRC | 23:16 | |
*** VW_ has joined #openstack-operators | 23:22 | |
*** j05h1 has joined #openstack-operators | 23:23 | |
*** j05h has quit IRC | 23:26 | |
*** VW_ has quit IRC | 23:39 | |
*** VW_ has joined #openstack-operators | 23:40 | |
*** VW_ has quit IRC | 23:41 | |
*** david-lyle is now known as david-lyle_afk | 23:41 | |
*** Piet has joined #openstack-operators | 23:50 | |
*** openstack has joined #openstack-operators | 23:57 |