14:00:31 <dmellado> #startmeeting kuryr
14:00:32 <openstack> Meeting started Mon Dec 17 14:00:31 2018 UTC and is due to finish in 60 minutes.  The chair is dmellado. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:33 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:35 <openstack> The meeting name has been set to 'kuryr'
14:00:45 <dmellado> Hi everyone! Who's around here for today's kuryr meeting? ;)
14:00:52 <dulek> o/
14:01:01 <yboaron_> o/
14:01:03 <dmellado> #chair dulek
14:01:04 <openstack> Current chairs: dmellado dulek
14:01:05 <dmellado> #chair yboaron_
14:01:06 <openstack> Current chairs: dmellado dulek yboaron_
14:01:54 <dmellado> All right, so let's do a quick roll call
14:02:13 <dmellado> first of all, due to the Christmas holidays there won't be kuryr meetings on the 24th or the 31st
14:02:14 * dulek is trying to summon celebdor as he might be able to give some insight on that kuryr-daemon mem usage.
14:02:26 <dmellado> I'll be sending an email as a reminder
14:02:36 <dmellado> dulek: could you give a brief summary about that?
14:02:41 <dmellado> on which gates is this happening?
14:02:43 <ltomasbo> o/
14:02:47 <dmellado> what's eating the memory, and so on?
14:02:48 <dulek> Oh maaan, we could wish ourselves happy holidays on 24th and happy new year on 31st. :D
14:02:52 <dmellado> #topic kuryr-kubernetes
14:03:02 <dmellado> dulek: lol
14:03:06 <dmellado> #chair ltomasbo
14:03:07 <openstack> Current chairs: dmellado dulek ltomasbo yboaron_
14:03:13 <dulek> Okay, let me explain then.
14:03:43 <dulek> So it's the issue of events being skipped and not passed to the kuryr-controller that is watching the K8s API.
14:03:56 <dulek> I've linked that to errors in kubernetes-api/openshift-master
14:04:05 <dulek> http://logs.openstack.org/27/625327/3/check/kuryr-kubernetes-tempest-daemon-openshift-octavia/cb22439/controller/logs/screen-openshift-master.txt.gz
14:04:07 <dmellado> like, random event types? or is it tied to specific ones?
14:04:09 <dulek> Here's an example.
14:04:16 <dmellado> #link http://logs.openstack.org/27/625327/3/check/kuryr-kubernetes-tempest-daemon-openshift-octavia/cb22439/controller/logs/screen-openshift-master.txt.gz
14:04:27 <dulek> dmellado: Like the updates on pods, services or endpoints.
14:04:45 <dulek> Okay, so then I've linked those issues with etcd warnings: http://logs.openstack.org/27/625327/3/check/kuryr-kubernetes-tempest-daemon-openshift-octavia/cb22439/controller/logs/screen-etcd.txt.gz
14:04:51 <maysams> o/
14:04:58 <celebdor> I am here
14:05:02 <celebdor> what's up
14:05:07 <dmellado> dulek: so that's linked with http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000906.html
14:05:08 <celebdor> sorry for the delay, I was on a meeting
14:05:14 <dulek> dmellado: Yes!
14:05:15 <dmellado> hiya folks
14:05:18 <dulek> And then with some insight from infra I've started looking at dstat files from those gate runs.
14:05:45 <dulek> What immediately caught my attention is the column that shows the process using the most memory.
14:05:55 <dmellado> celebdor: dulek has been seeing odd things on dstat, you might have some experience on that
14:06:00 <dulek> It's kuryr-daemon with a value of 2577G.
14:06:03 <dmellado> how much memory was the process eating, dulek? xD
14:06:22 <celebdor> that looks far too high
14:06:31 <dulek> http://logs.openstack.org/27/625327/3/check/kuryr-kubernetes-tempest-daemon-openshift-octavia/cb22439/controller/logs/screen-dstat.txt.gz - see for yourself, it's the third column from the last.
14:06:36 <celebdor> dulek: only on the gates or locally as well?
14:06:47 <dulek> celebdor: I'll be checking that after the meeting.
14:06:51 <celebdor> ok
14:06:53 <celebdor> thanks
14:06:54 <dulek> So first of all I don't understand why OOM is not happening.
14:07:03 <dmellado> dulek: is this also tied to specific cloud providers on the infra?
14:07:12 <dmellado> i.e. it happens on RAX but not on OVH and so
14:07:13 <dmellado> ?
14:07:25 <dulek> dmellado: Not really, no. The logs I'm sending you now are from INAP.
14:07:32 <dulek> Second - once the kuryr-daemon is started it drains all the swap.
14:07:55 <dulek> So this might make some sense - the issues in etcd might be due to some IO delays.
14:08:41 <celebdor> dulek: s/drains/occupies?
14:09:27 <ltomasbo> dulek, didn't we have a healthcheck that was restarting kuryr-daemon if it was eating more than X MB?
14:09:43 <dulek> celebdor: Well, swap usage rises. But now that I look again it's only using up to 3 GB of it, so not all.
14:09:49 <dulek> ltomasbo: That's only on containerized.
14:10:17 <celebdor> dulek: really?
14:10:33 <celebdor> a docker leak then?
14:10:39 <yboaron_> dulek, Q: Does finding this string in the etcd log file mean that etcd lost requests? 'avoid queries with large range/delete range!'
14:10:41 <celebdor> that would be unusual, I suppose
14:10:50 <dulek> celebdor: In that case kuryr-daemon runs as a process, not container.
14:11:04 <dmellado> dulek: I guess he's referring to etcd
14:11:09 <dulek> yboaron_: Not 100% sure really.
14:11:19 <dmellado> but if kuryr-daemon is the process eating up the memory, then it doesn't apply
14:11:23 <dulek> dmellado: We run etcd as process as well.
14:11:28 <celebdor> dulek: so does this happen on containerized or not? Not sure I understood
14:11:48 <dulek> celebdor: Ah, sorry. I saw it on non-containerized at the moment.
14:12:02 <dmellado> it could be due to the healthcheck
14:12:12 <dulek> But I can checkā€¦
14:12:18 <dmellado> it'd be interesting to see if we can replicate it if we disable that healthcheck on containerized
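The health check in question only runs in the containerized deployment and restarts kuryr-daemon once its memory use crosses a threshold. A minimal sketch of that kind of check, assuming psutil is available; the 2 GiB limit and the names used here are illustrative, not kuryr's actual configuration options:

    # Liveness-style check: report unhealthy once the daemon's RSS crosses
    # a configured limit so the orchestrator restarts the container.
    # The limit below is an assumed example value.
    import os

    import psutil

    MAX_RSS_BYTES = 2 * 1024 ** 3  # illustrative threshold


    def memory_usage_ok(pid=None, max_rss=MAX_RSS_BYTES):
        """Return True while the process RSS stays below the limit."""
        rss = psutil.Process(pid or os.getpid()).memory_info().rss
        return rss < max_rss


    if __name__ == '__main__':
        # A real health endpoint would return HTTP 500 here; as a standalone
        # sketch we just exit non-zero so a probe can act on it.
        raise SystemExit(0 if memory_usage_ok() else 1)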
14:12:58 <dmellado> dulek: did you freeze an upstream VM to try to fetch some data?
14:13:14 <dulek> Looks the same in containerized: kuryr-daemon 2589G
14:13:33 <dulek> celebdor: Note that we run kuryr-daemon as a privileged container, so it's in the host's PID namespace, right?
14:13:45 <dulek> dmellado: Not yet, I'll try this locally first.
14:14:12 <dulek> Okay, so if nobody has an idea, I'll try to poke it a bit more.
14:14:23 <dulek> Hopefully it's some dstat quirk
14:14:38 <dulek> But in any case I guess it's worth looking at.
14:14:38 <dmellado> dulek: in any case let me know if you get stuck trying this locally
14:14:50 <dmellado> and we'll take a look together in a vm either upstream or rdocloud
14:14:56 <celebdor> dulek: indeed
14:15:10 <dulek> Okay, cool!
14:15:30 <dmellado> thanks for bringing this dulek
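To tell whether the 2577G figure is a real leak or the dstat quirk dulek suspects, one quick cross-check is to compare the daemon's virtual size (VmSize) with its resident set (VmRSS) straight from /proc; a huge VmSize next to a modest VmRSS would point at dstat reporting address space rather than memory actually in use. A rough sketch, assuming the process can be found by matching 'kuryr-daemon' in its command line:

    # Compare VmSize (virtual) and VmRSS (resident) for every process whose
    # command line mentions kuryr-daemon. The name match is an assumption
    # about how the daemon shows up on the host.
    import glob


    def kuryr_daemon_mem(needle='kuryr-daemon'):
        results = []
        for proc_dir in glob.glob('/proc/[0-9]*'):
            try:
                with open(proc_dir + '/cmdline', 'rb') as f:
                    cmdline = f.read().replace(b'\0', b' ').decode(errors='replace')
                if needle not in cmdline:
                    continue
                with open(proc_dir + '/status') as f:
                    fields = dict(line.split(':', 1) for line in f if ':' in line)
            except (IOError, OSError):
                continue  # the process exited while we were scanning
            results.append((proc_dir,
                            fields.get('VmSize', '?').strip(),
                            fields.get('VmRSS', '?').strip()))
        return results


    if __name__ == '__main__':
        for path, vmsize, vmrss in kuryr_daemon_mem():
            print('%s  VmSize=%s  VmRSS=%s' % (path, vmsize, vmrss))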
14:16:50 <dmellado> so, I also wanted to let you know
14:17:01 <dmellado> that I've been working on getting python-openshift packaged for fedora
14:17:07 <dmellado> which involves a hell of a lot of bureaucracy
14:17:20 <dmellado> and being sponsored into fedora-packagers group
14:17:32 <celebdor> I thought it was just an rpmspec file
14:17:36 <dmellado> I'm not sure if besides me there's anyone on the team who also got those privileges
14:17:46 <dmellado> but if you could vote for this I'd be happy
14:17:58 <dmellado> celebdor: it involved a dependency hell xD
14:18:01 <dmellado> https://bugzilla.redhat.com/show_bug.cgi?id=1654833
14:18:02 <openstack> bugzilla.redhat.com bug 1654833 in Package Review "Review Request: python-kubernetes - add python-kuberenetes to EPEL 7" [Medium,New] - Assigned to nobody
14:18:04 <dmellado> https://bugzilla.redhat.com/show_bug.cgi?id=1654835
14:18:04 <openstack> bugzilla.redhat.com bug 1654835 in Package Review "Review Request: python-google-auth - Add python-google-auth to EPEL 7" [Medium,Post] - Assigned to zebob.m
14:18:06 <dmellado> https://bugzilla.redhat.com/show_bug.cgi?id=1659281
14:18:06 <openstack> bugzilla.redhat.com bug 1659281 in Package Review "Review Request: python-dictdiffer - Dictdiffer is a module that helps you to diff and patch dictionaries" [Medium,New] - Assigned to nobody
14:18:08 <dmellado> https://bugzilla.redhat.com/show_bug.cgi?id=1659282
14:18:09 <openstack> bugzilla.redhat.com bug 1659282 in Package Review "Review Request: python-string_utils - A python module containing utility functions for strings" [Medium,New] - Assigned to nobody
14:18:10 <dmellado> https://bugzilla.redhat.com/show_bug.cgi?id=1659286
14:18:11 <openstack> bugzilla.redhat.com bug 1659286 in Package Review "Review Request: python-openshift - Python client for the OpenShift API" [Medium,New] - Assigned to nobody
14:18:44 <dmellado> celebdor: ltomasbo maysams yboaron_ any topics from your side?
14:19:11 <celebdor> no
14:19:19 <celebdor> updates from NP?
14:19:25 <yboaron_> nothing from my side either
14:19:32 <maysams> dmellado: celebdor: yep
14:19:37 <dmellado> go for it maysams
14:19:54 <ltomasbo> I have a topic
14:20:04 <ltomasbo> https://bugs.launchpad.net/kuryr-kubernetes/+bug/1808506
14:20:06 <openstack> Launchpad bug 1808506 in kuryr-kubernetes "neutron-vif requires admin rights" [Medium,New] - Assigned to Luis Tomas Bolivar (ltomasbo)
14:20:29 <ltomasbo> sorry maysams, go ahead...
14:20:34 <maysams> I am working on updating the sg rules in the CRD when pods are created/deleted/updated
14:20:35 <maysams> https://review.openstack.org/#/c/625588
14:20:42 <maysams> ltomasbo: it's okay
14:20:54 <ltomasbo> great! I'll take a look asap!
14:20:56 <celebdor> ltomasbo: that's BM only
14:21:17 <celebdor> maysams: you also reported some bug, right?
14:21:37 <maysams> celebdor: yes
14:21:39 <maysams> https://bugs.launchpad.net/kuryr-kubernetes/+bug/1808787
14:21:41 <openstack> Launchpad bug 1808787 in kuryr-kubernetes "Pod Label Handler tries to handle pod creation causing controller restart" [Undecided,New]
14:21:41 <ltomasbo> celebdor, it's still broken... and it raises another concern, which is that we are using the wrong type of tenant both on devstack and on the gates
14:22:17 <yboaron_> ltomasbo, what do you mean by 'under a normal tenant' in the bug description, do we use tenant with extra permissions in our gates?
14:22:18 <dmellado> ltomasbo: is that related to the devstack tenant permissions you told me about?
14:22:30 <celebdor> ltomasbo: that is my biggest concern
14:22:39 <maysams> celebdor: basically the pod label handler is waiting for the vif handler to handle the event, but due to resource not ready exceptions this will not work
14:22:40 <ltomasbo> yboaron_, unfortunately, yes
14:23:05 <yboaron_> ltomasbo, can you elaborate on that?
14:23:05 <celebdor> maysams: why wait and not drop?
14:23:09 <dmellado> ltomasbo: I'll need to spend some time investigating it
14:23:16 <celebdor> if it drops the event when it has no annotation
14:23:25 <ltomasbo> dmellado, it will be great if you can take a look, yes!
14:23:33 <celebdor> it will eventually get the event from the patch that the vif handler does
14:23:35 <dulek> celebdor: Hey, that makes sense.
14:23:42 <dmellado> ltomasbo: I'll sync with you after the meeting to work into the details, thanks!
14:23:42 <ltomasbo> dmellado, it is not only on the gates, but also on the k8s tenant that we create on devstack
14:23:52 <maysams> celebdor: indeed makes sense
14:23:58 <celebdor> :O
14:24:11 <celebdor> let's try that then
14:24:14 <dmellado> incredible, something that celebdor says makes sense
14:24:17 <dmellado> xD
14:24:21 <maysams> celebdor: thanks :)
14:24:22 <celebdor> dmellado: I feel the same way
14:24:24 <celebdor> believe me
14:24:28 <dmellado> something really odd is going on
14:24:33 <ltomasbo> maysams, celebdor: yep, it is easy to just drop the event, but it's worth keeping in mind that we have a problem there
14:24:34 <dmellado> :D
14:24:47 <dmellado> ltomasbo: maysams we'll treat that as a bug
14:24:55 <dmellado> so if you haven't already, please open it in LP
14:24:58 <celebdor> ltomasbo: why is this a problem?
14:25:09 <ltomasbo> dmellado, it's a bug we hit on network policies, but it is not a bug produced by network policies...
14:25:11 <celebdor> label pod handler only cares about annotated pods
14:25:14 <celebdor> I think it is fair
14:25:16 <celebdor> on the other hand
14:25:35 <ltomasbo> celebdor, if we have different handlers listening to the same events, why is one blocked by the other?
14:25:38 <celebdor> ltomasbo: dmellado: we need the tenant to be neutron policy regular tenant
14:25:41 <celebdor> like right now
14:25:54 <maysams> ltomasbo is right
14:26:02 <ltomasbo> celebdor, I agree with the workaround
14:26:08 <dmellado> ltomasbo: totally agree
14:26:11 <ltomasbo> celebdor, but that is just avoiding the problem, not fixing it
14:26:13 <celebdor> ltomasbo: well, I don't think it's that odd for one handler to depend on stuff done by the other
14:26:27 <ltomasbo> celebdor, no no, I don't mean the depends on
14:26:29 <celebdor> just that the current handler design makes it weird
14:26:44 <ltomasbo> celebdor, what would happen in the case where that handler is not depending on the other?
14:27:12 <ltomasbo> celebdor, if that is the design, it does not make a lot of sense to have different handlers... perhaps simpler to have just one...
14:27:46 <dmellado> ltomasbo: but I think having several is better for granularity
14:27:54 <dmellado> if by any chance we have handlers a and b
14:27:58 <celebdor> ltomasbo: I don't understand the "what would happen if they do not depend on each other?" part
14:28:01 <dmellado> and b depends on a then it's fine
14:28:16 <dmellado> but maybe some user would like to use just 'a' and it would use less memory
14:28:17 <yboaron> sorry guys, I was disconnected for the last few minutes .. network issues
14:28:22 <celebdor> they will just annotate the resource (if they need to) in some random order
14:28:28 <celebdor> what's the prob with that
14:28:30 <celebdor> ?
14:29:21 <ltomasbo> celebdor, imagine the vif handler was not doing pod annotations and other neutron actions
14:29:27 <celebdor> ok
14:29:35 <ltomasbo> celebdor, it will never get executed due to the other handler being scheduled first
14:30:59 <ltomasbo> anyway, we can discuss this on the kuryr channel after the meeting
14:31:06 <dmellado> ltomasbo: yep, let's do that
14:31:13 <celebdor> ltomasbo: is the handler now serializing per resource?
14:31:23 <celebdor> ok, ok
14:31:27 <celebdor> we can take it up later
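The "drop instead of wait" idea celebdor describes boils down to the pod label handler returning early when the pod has no VIF annotation yet, and relying on the event generated by the VIF handler's own annotation patch to re-trigger it later. A sketch of that pattern follows; the class name, method name, and annotation key are illustrative assumptions, not the exact kuryr-kubernetes identifiers:

    # Sketch of dropping un-annotated pod events rather than waiting on the
    # VIF handler. Names here are assumptions for illustration only.
    VIF_ANNOTATION = 'openstack.org/kuryr-vif'  # assumed annotation key


    class PodLabelHandler(object):

        def on_present(self, pod):
            annotations = pod['metadata'].get('annotations', {})
            if VIF_ANNOTATION not in annotations:
                # The VIF handler has not annotated this pod yet. Dropping
                # the event is safe: the annotation patch will produce a new
                # event that reaches this handler once the VIF exists.
                return
            self._update_crd_security_group_rules(pod)

        def _update_crd_security_group_rules(self, pod):
            # Placeholder for the SG-rule/CRD update work discussed in
            # https://review.openstack.org/#/c/625588
            pass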
14:32:20 <dmellado> gcheresh: itzikb juriarte
14:32:22 <dmellado> o/
14:32:28 <dmellado> anything you'd like to add to the meeting?
14:32:45 <dmellado> I'm pretty sure you'd be interested in the python-openshift thing
14:33:38 <dmellado> https://media1.tenor.com/images/184a80bf791b211b0e2f3f02badab20e/tenor.gif?itemid=12469549
14:33:40 <dmellado> awesome!
14:33:43 <dmellado> xD
14:34:12 <gcheresh> dmellado: had network issues as well in TLV (everyone did)
14:34:27 <dmellado> gcheresh: ouch, hope you can get it fixed
14:34:40 <gcheresh> it just disconnected and now it's back again
14:34:53 <dmellado> just wanted to ask whether you can push for the python-openshift rpms to get accepted
14:35:11 <dmellado> I pointed the links out earlier and they'll be available in the logs
14:35:23 <dmellado> besides that, is there anything you'd like to cover? ;)
14:39:28 <dmellado> All right, so it seems that's it for the day!
14:39:41 <dmellado> ltomasbo: celebdor let's continue the discussion on the kuryr channel
14:39:43 <dmellado> thanks for attending!
14:39:48 <dmellado> #endmeeting