*** tosky has quit IRC | 00:49 | |
*** sambetts_ has quit IRC | 02:35 | |
*** sambetts_ has joined #openstack-oslo | 02:39 | |
openstackgerrit | Merged openstack-dev/oslo-cookiecutter master: change default python 3 env in tox to 3.5 https://review.openstack.org/617149 | 03:17 |
*** njohnston has quit IRC | 04:47 | |
*** njohnston has joined #openstack-oslo | 04:48 | |
*** e0ne has joined #openstack-oslo | 06:12 | |
*** e0ne has quit IRC | 06:17 | |
*** Luzi has joined #openstack-oslo | 07:03 | |
*** pcaruana has joined #openstack-oslo | 07:24 | |
*** hoonetorg has quit IRC | 07:55 | |
*** hberaud has joined #openstack-oslo | 07:56 | |
*** shardy has joined #openstack-oslo | 07:58 | |
*** hoonetorg has joined #openstack-oslo | 08:13 | |
*** a-pugachev has joined #openstack-oslo | 08:42 | |
*** tosky has joined #openstack-oslo | 08:50 | |
*** moguimar has joined #openstack-oslo | 09:32 | |
*** jaosorior has joined #openstack-oslo | 09:41 | |
*** cdent has joined #openstack-oslo | 09:51 | |
*** e0ne has joined #openstack-oslo | 09:57 | |
*** e0ne_ has joined #openstack-oslo | 09:59 | |
*** e0ne has quit IRC | 10:02 | |
*** sean-k-mooney has quit IRC | 10:15 | |
*** sean-k-mooney has joined #openstack-oslo | 10:23 | |
*** toabctl has quit IRC | 10:33 | |
*** cdent has quit IRC | 10:57 | |
*** tosky has quit IRC | 11:03 | |
*** tosky has joined #openstack-oslo | 11:06 | |
*** raildo has joined #openstack-oslo | 11:46 | |
*** njohnston has quit IRC | 11:50 | |
*** njohnston has joined #openstack-oslo | 11:51 | |
*** raildo has quit IRC | 12:01 | |
*** cdent has joined #openstack-oslo | 12:19 | |
*** dkehn has quit IRC | 12:33 | |
*** raildo has joined #openstack-oslo | 12:45 | |
*** cdent has quit IRC | 13:13 | |
*** kgiusti has joined #openstack-oslo | 13:37 | |
openstackgerrit | Juan Antonio Osorio Robles proposed openstack/oslo.policy master: Implement base for pluggable policy drivers https://review.openstack.org/577807 | 13:41 |
openstackgerrit | Josephine Seifert proposed openstack/oslo-specs master: Adding library for encryption and decryption https://review.openstack.org/618754 | 13:50 |
*** cdent has joined #openstack-oslo | 13:50 | |
*** cdent has quit IRC | 13:55 | |
*** pbourke has quit IRC | 14:04 | |
*** cdent has joined #openstack-oslo | 14:05 | |
*** pbourke has joined #openstack-oslo | 14:06 | |
*** lbragstad has joined #openstack-oslo | 14:09 | |
*** notq has joined #openstack-oslo | 14:21 | |
notq | for oslo messaging, when rabbitmq is restarted for the rpc bus, we get message timeouts and the service stays down. It's waiting on some message that was in process and is no longer getting a response back from the other service. It never seems to time out and move on, though. The service has to be restarted to recover. | 14:25 |
notq | Is anyone familiar with this type of issue, and can point me in the right direction? | 14:25 |
*** ansmith has joined #openstack-oslo | 14:26 | |
notq | I understand how to debug the issue, but there's no need to debug it; I know the rabbitmq is restarted. Is it something where rabbitmq, on restart, could stop accepting new messages and wait until the in-flight ones finish during shutdown? That would be one solution. | 14:26 |
notq | Another solution would be for the openstack service to understand this state, time out properly if a reply isn't received, and retry | 14:27 |
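A minimal sketch of that second approach, assuming oslo.messaging's RPCClient: the per-call timeout and connection retry count can be set explicitly so a call raises MessagingTimeout instead of blocking indefinitely. The transport URL, topic, and values below are hypothetical, not taken from the deployment being discussed.

```python
import oslo_messaging
from oslo_config import cfg

# Hypothetical transport URL and target; adjust for your deployment.
transport = oslo_messaging.get_rpc_transport(
    cfg.CONF, url='rabbit://guest:guest@rabbitmq:5672/')
target = oslo_messaging.Target(topic='compute', version='5.0')

# timeout: seconds to wait for the RPC reply before raising
# oslo_messaging.MessagingTimeout (mirrors rpc_response_timeout).
# retry: how many connection retries the driver attempts before
# giving up instead of retrying forever.
client = oslo_messaging.RPCClient(transport, target, timeout=60, retry=3)

def ping_service(ctxt):
    try:
        return client.call(ctxt, 'ping')
    except oslo_messaging.MessagingTimeout:
        # The caller can now decide to retry, fail over, or surface the
        # error rather than waiting on a reply that will never arrive.
        raise
```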
dhellmann | notq : which version of openstack and oslo.messaging are you using? there are timeout handling features but I wonder if you've found a buggy configuration | 14:35 |
notq | it's on rocky, not sure the oslo messaging version | 14:36 |
notq | can check | 14:36 |
kgiusti | notq: do you know what service is blocking? Ideally where the service is blocking? | 14:40 |
notq | this time it was nova, but it's happened with others as well. | 14:41 |
notq | where the service is blocking, it's always in a stack trace to amqp | 14:41 |
notq | and it keeps repeating the attempt for the message, and the service stops | 14:42 |
kgiusti | notq: can you pastebin a sample of the stack trace? | 14:42 |
notq | sure | 14:42 |
kgiusti | notq: it may be best to open a launchpad bug on this and upload any logs you have to it: https://bugs.launchpad.net/oslo.messaging/+filebug | 14:43 |
*** bobh has joined #openstack-oslo | 14:43 | |
kgiusti | notq: that way we can track this publicly and others can help out | 14:44 |
notq | ok. It's gone on over many releases. I wasn't on the topic, but the rest of the team always said it was rabbitmq being bad, or a password change, always some sort of weird answer. I was recently assigned to look after them, and this whole class of problems keeps happening the same way. Someone deploys a rabbitmq change, it updates the container, it restarts, and a service loses its connection in this exact same way. | 14:45 |
kgiusti | notq: does the service's tcp connection to the 'new' rabbitmq service come up? | 14:46 |
notq | I've had other rabbitmq connection issues, we're using kubernetes to deploy openstack for reference. | 14:46 |
notq | No, there's no consumer for it, it's waiting on this message id | 14:47 |
notq | the similar problems i've had were with like logstash and rabbitmq. Getting the retry settings just right solved that. | 14:48 |
notq | because if it's connection pooling, or somehow saving state, and the new rabbitmq comes up on the service ip, some settings handle that correctly and others don't | 14:49 |
notq | so for example, a java application caches the dns resolution, which causes a problem. you have to turn off caching. Golang http clients cache the connection as well, you have to turn that off in the http library. Etc etc. | 14:49 |
notq | elasticsearch has an altogether different caching problem, but the same story. Its connection pool saves state, and saving state when a new container comes up on the same service ip doesn't work. | 14:50 |
notq | from experience, i'd guess kombu is saving something with its connection pooling which is causing a problem | 14:52 |
notq | and it never happens with notification buses though. Only for the rpc case where it's waiting for a return message | 14:53 |
*** zaneb has joined #openstack-oslo | 14:55 | |
kgiusti | notq: I'm not familiar with the type of failure you describe - it's the first I've heard of it actually. Can you capture debug logs of the service that gets stuck? | 14:59 |
*** zaneb has quit IRC | 15:01 | |
kgiusti | notq: it could very well be some state in kombu - the oslo.messaging driver relies heavily on kombu to handle connection failures/retries | 15:01 |
notq | I can try. We only turn debug on in staging, so I'll have to reboot the rabbitmq repeatedly until I can catch it. Which is a bit harder given staging is not used to the same extent | 15:01 |
notq | and it will need the oslo debug, I'm guessing, right? not just debug=true | 15:02 |
kgiusti | notq: yeah - oslo messaging debug would be ideal | 15:02 |
*** Luzi has quit IRC | 15:03 | |
kgiusti | notq: usually I enable "oslo.messaging=DEBUG,oslo_messaging=DEBUG" in log_levels for oslo.log | 15:04 |
kgiusti | notq: but if you do have existing logs w/o debug that capture the error that would be helpful also - perhaps we start with that before venturing into debug hell... | 15:05 |
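For reference, a sketch of how a service could raise those two logger levels through oslo.log's defaults; the operator-facing equivalent is adding the same entries to the default_log_levels option in the service's config file. Where this setup code lives is an assumption, not part of the conversation above.

```python
from oslo_log import log as logging

# Bump the oslo.messaging loggers to DEBUG while keeping the other
# default levels untouched; call this before logging.setup().
extra_levels = ['oslo.messaging=DEBUG', 'oslo_messaging=DEBUG']
logging.set_defaults(
    default_log_levels=logging.get_default_log_levels() + extra_levels)
```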
notq | sure, but i doubt it's going to explain much. I think it's simply an issue of kombu and how it handles the service ip situation with kubernetes. Now, perhaps nova shouldn't just keep retrying and dying on the amqp driver failure... | 15:06 |
kgiusti | notq: re: nova - yes oslo.messaging is best effort. It cannot guarantee zero message loss. But it could be that nova is using exceptionally long timeouts. | 15:08 |
notq | well, i think it's failing past exceptionally long timeouts. I think the case is missing the timeout somehow, but I'd need to look at it. | 15:09 |
kgiusti | notq: we (oslo) recently implemented a heartbeat mechanism for faster detection of failed RPC calls specifically for nova, but I don't believe it is being used in Rocky | 15:09 |
notq | we have long timeouts due to recovering, but this will last hours until someone wakes up | 15:09 |
notq | yeah, a heartbeat would be a great solution | 15:10 |
kgiusti | notq: yeesh - hours... not what I'd expect for a retry timeout (more like 10 minutes) | 15:10 |
notq | but, the heartbeat would also have to not be caching the connection to retry, for it to work with the service ip | 15:10 |
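A rough sketch of the heartbeat mechanism kgiusti mentions, assuming a post-Rocky oslo.messaging that supports call_monitor_timeout: the server sends periodic heartbeats while it works on a long call, so the client can distinguish "still running" from "connection lost" without waiting out the full reply timeout. The values here are hypothetical.

```python
import oslo_messaging
from oslo_config import cfg

transport = oslo_messaging.get_rpc_transport(cfg.CONF)
target = oslo_messaging.Target(topic='compute', version='5.0')

# call_monitor_timeout: if no heartbeat arrives from the server within
# this many seconds, the call fails fast; timeout still caps the total
# time allowed for the reply.  (Requires a newer-than-Rocky release.)
client = oslo_messaging.RPCClient(
    transport, target, timeout=3600, call_monitor_timeout=60)
```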
*** zaneb has joined #openstack-oslo | 15:12 | |
kgiusti | notq: can you explain in a bit more detail what you mean by "caching the service ip" - specifically what metadata you believe is being cached? | 15:12 |
notq | I can only give examples of other cases I've sorted out, like with java, or golangs http client. | 15:12 |
notq | basically, there's a service ip, and a real ip. | 15:13 |
notq | if it caches the real ip for connection pooling, or anything else at any point, the dns for example | 15:13 |
kgiusti | notq: ok, gotcha | 15:13 |
notq | then it will keep trying the ip for the old container, and not the new container which is now up on the service ip | 15:13 |
notq | and when i work with library owners on this type of issue, i'm repeatedly told that nothing caches, until we find where it caches. because libraries, inside libraries, inside libraries, might be caching it | 15:14 |
kgiusti | notq: understood. Yeah oslo.messaging simply passes along whatever the value of "transport_url" is to kombu. | 15:16 |
kgiusti | notq: do you happen to know what version kombu is being used? | 15:16 |
notq | i don't, it's a great question. kombu also has issues, and critical bugs fixed in the newest version regarding its reconnection logic | 15:17 |
*** bobh has quit IRC | 15:18 | |
kgiusti | notq: in any case it sounds like a combination of bugs: failure to reconnect to rabbitmq, and failure to time out while waiting for an RPC reply message. | 15:19 |
notq | agreed. | 15:22 |
kgiusti | notq: just looking at the rocky releases for oslo.messaging, there's a requirements bump to py-amqp in release 8.1.0: | 15:26 |
kgiusti | notq: https://bugs.launchpad.net/oslo.messaging/+bug/1780992 | 15:26 |
openstack | Launchpad bug 1780992 in oslo.messaging "Trying to connect to Rabbit servers can timeout if only /etc/hosts entries are used" [Undecided,Fix released] - Assigned to Daniel Alvarez (dalvarezs) | 15:26 |
kgiusti | notq: plus a connection related bugfix: https://bugs.launchpad.net/oslo.messaging/+bug/1745166 | 15:28 |
openstack | Launchpad bug 1745166 in oslo.messaging "amqp.py warns about deprecated feature" [Low,Fix released] - Assigned to Ken Giusti (kgiusti) | 15:28 |
notq | @kgiusti that's very helpful, I wasn't aware that was the release name. | 15:28 |
kgiusti | notq: that last one ended up fixing some intermittent connection failures we were seeing in the python3 gates, FWIW | 15:28 |
*** bobh has joined #openstack-oslo | 15:29 | |
notq | it doesn't help that we have mostly moved off openstack-helm onto our own dedicated helm charts, but the one that failed this time was the openstack-helm rabbitmq, and that may be different | 15:29 |
*** notq has quit IRC | 15:36 | |
*** bobh has quit IRC | 15:58 | |
*** pcaruana has quit IRC | 16:07 | |
*** e0ne_ has quit IRC | 16:16 | |
*** e0ne has joined #openstack-oslo | 16:21 | |
*** e0ne has quit IRC | 16:27 | |
*** kgiusti has quit IRC | 17:24 | |
*** a-pugachev has quit IRC | 17:35 | |
*** zbitter has joined #openstack-oslo | 17:37 | |
*** zaneb has quit IRC | 17:39 | |
*** pcaruana has joined #openstack-oslo | 17:44 | |
openstackgerrit | Merged openstack/oslo.messaging master: Use ensure_connection to prevent loss of connection error logs https://review.openstack.org/615649 | 17:51 |
*** e0ne has joined #openstack-oslo | 18:47 | |
*** e0ne has quit IRC | 18:51 | |
*** e0ne has joined #openstack-oslo | 19:00 | |
*** shardy has quit IRC | 19:27 | |
*** kgiusti has joined #openstack-oslo | 19:45 | |
openstackgerrit | Merged openstack/sphinx-feature-classification master: Optimizing the safety of the http link site in HACKING.rst https://review.openstack.org/618351 | 20:06 |
*** e0ne has quit IRC | 20:07 | |
*** e0ne has joined #openstack-oslo | 20:10 | |
*** e0ne has quit IRC | 20:10 | |
*** zbitter is now known as zaneb | 20:18 | |
*** a-pugachev has joined #openstack-oslo | 20:34 | |
*** a-pugachev has quit IRC | 20:35 | |
*** efried has joined #openstack-oslo | 20:41 | |
efried | dhellmann: Are we allowed to edit renos that are out the door already? | 20:41 |
efried | dhellmann: à la https://review.openstack.org/#/c/617222/ ? | 20:41 |
dhellmann | efried : it's possible and allowed but "dangerous" if done wrong | 20:41 |
efried | dhellmann: Dangerous under what circumstances? | 20:42 |
efried | i.e. what's "done wrong"? | 20:42 |
dhellmann | https://docs.openstack.org/reno/latest/user/usage.html#updating-stable-branch-release-notes | 20:42 |
efried | ... | 20:42 |
efried | nyaha. So that one is bogus, needs to be reproposed to stable/rocky. | 20:43 |
*** raildo has quit IRC | 20:43 | |
dhellmann | efried : I commented; it's better to set up a redirect | 20:44 |
efried | nice, thanks dhellmann | 20:44 |
efried | in fact this reno was introduced in ocata | 20:45 |
efried | afaict | 20:45 |
*** a-pugachev has joined #openstack-oslo | 21:35 | |
*** e0ne has joined #openstack-oslo | 21:35 | |
*** e0ne has quit IRC | 21:36 | |
*** ansmith has quit IRC | 21:42 | |
*** pcaruana has quit IRC | 21:51 | |
openstackgerrit | Merged openstack/oslo.messaging master: Add a test for rabbit URLs lacking terminating '/' https://review.openstack.org/611870 | 21:55 |
*** HenryG has quit IRC | 22:12 | |
*** HenryG has joined #openstack-oslo | 22:14 | |
*** cdent has quit IRC | 22:19 | |
*** kgiusti has quit IRC | 22:39 | |
*** moguimar has quit IRC | 22:39 | |
*** jbadiapa has quit IRC | 22:39 | |
*** dhellmann has quit IRC | 22:39 | |
*** sambetts_ has quit IRC | 22:39 | |
*** dhellmann has joined #openstack-oslo | 22:40 | |
*** jbadiapa has joined #openstack-oslo | 22:40 | |
*** sambetts_ has joined #openstack-oslo | 22:43 | |
efried | dhellmann: More reno fun. See https://review.openstack.org/#/c/618708/ -- adding the release notes job to .zuul.yaml didn't trigger the release notes job to run on that patch. | 23:16 |
efried | I think I get that this is because the current conditionals are set up to only trigger the job if something under the releasenotes/ dir changes | 23:16 |
efried | just wondering how hard it would be to add a condition for this special case. | 23:16 |
efried | If not trivial, I wouldn't bother, since this should be quite rare | 23:17 |
*** a-pugachev has quit IRC | 23:29 | |
*** njohnston has quit IRC | 23:29 | |
*** njohnston has joined #openstack-oslo | 23:30 | |
*** gibi has quit IRC | 23:34 | |
*** gibi has joined #openstack-oslo | 23:35 | |
dhellmann | efried : we could change the job definition to also look at the zuul configuration files, but it would be just as easy to add an empty release notes file in that patch I think | 23:40 |
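For context, a rough sketch (not the actual openstack-zuul-jobs definition) of what widening the file matcher could look like in a repo's .zuul.yaml: Zuul's files attribute lists the path patterns that trigger a job, so adding the Zuul config files themselves would make a patch like 618708 run the release notes job. The job name and patterns below are assumptions.

```yaml
# Hypothetical project-pipeline variant; the real job and its matchers
# live in openstack-zuul-jobs and may differ.
- project:
    check:
      jobs:
        - build-openstack-releasenotes:
            files:
              - ^releasenotes/.*
              - ^\.zuul\.yaml$   # also run when the zuul config changes
```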
*** lbragstad has quit IRC | 23:40 | |
efried | dhellmann: I threw a patch on top to verify. No biggie. Just wondered if it was worth doing in the job def. | 23:40 |
dhellmann | efried : I'd hate to run that job every time we update some other settings :-/ | 23:53 |
efried | dhellmann: Right, it would have to be smart enough to just run the job for the specific line that was added (whatever that may be). | 23:54 |
efried | i.e. this isn't a reno-specific thing. | 23:54 |