*** tosky has quit IRC | 00:49 | |
*** sambetts_ has quit IRC | 02:35 | |
*** sambetts_ has joined #openstack-oslo | 02:39 | |
openstackgerrit | Merged openstack-dev/oslo-cookiecutter master: change default python 3 env in tox to 3.5 https://review.openstack.org/617149 | 03:17 |
*** njohnston has quit IRC | 04:47 | |
*** njohnston has joined #openstack-oslo | 04:48 | |
*** e0ne has joined #openstack-oslo | 06:12 | |
*** e0ne has quit IRC | 06:17 | |
*** Luzi has joined #openstack-oslo | 07:03 | |
*** pcaruana has joined #openstack-oslo | 07:24 | |
*** hoonetorg has quit IRC | 07:55 | |
*** hberaud has joined #openstack-oslo | 07:56 | |
*** shardy has joined #openstack-oslo | 07:58 | |
*** hoonetorg has joined #openstack-oslo | 08:13 | |
*** a-pugachev has joined #openstack-oslo | 08:42 | |
*** tosky has joined #openstack-oslo | 08:50 | |
*** moguimar has joined #openstack-oslo | 09:32 | |
*** jaosorior has joined #openstack-oslo | 09:41 | |
*** cdent has joined #openstack-oslo | 09:51 | |
*** e0ne has joined #openstack-oslo | 09:57 | |
*** e0ne_ has joined #openstack-oslo | 09:59 | |
*** e0ne has quit IRC | 10:02 | |
*** sean-k-mooney has quit IRC | 10:15 | |
*** sean-k-mooney has joined #openstack-oslo | 10:23 | |
*** toabctl has quit IRC | 10:33 | |
*** cdent has quit IRC | 10:57 | |
*** tosky has quit IRC | 11:03 | |
*** tosky has joined #openstack-oslo | 11:06 | |
*** raildo has joined #openstack-oslo | 11:46 | |
*** njohnston has quit IRC | 11:50 | |
*** njohnston has joined #openstack-oslo | 11:51 | |
*** raildo has quit IRC | 12:01 | |
*** cdent has joined #openstack-oslo | 12:19 | |
*** dkehn has quit IRC | 12:33 | |
*** raildo has joined #openstack-oslo | 12:45 | |
*** cdent has quit IRC | 13:13 | |
*** kgiusti has joined #openstack-oslo | 13:37 | |
openstackgerrit | Juan Antonio Osorio Robles proposed openstack/oslo.policy master: Implement base for pluggable policy drivers https://review.openstack.org/577807 | 13:41 |
openstackgerrit | Josephine Seifert proposed openstack/oslo-specs master: Adding library for encryption and decryption https://review.openstack.org/618754 | 13:50 |
*** cdent has joined #openstack-oslo | 13:50 | |
*** cdent has quit IRC | 13:55 | |
*** pbourke has quit IRC | 14:04 | |
*** cdent has joined #openstack-oslo | 14:05 | |
*** pbourke has joined #openstack-oslo | 14:06 | |
*** lbragstad has joined #openstack-oslo | 14:09 | |
*** notq has joined #openstack-oslo | 14:21 | |
notq | for oslo messaging, when rabbitmq is restarted for the rpc bus, we get message timeouts and the service stays down. It's waiting on some message that was in process and is no longer getting a response back from the other service. It never seems to time out and move on, though. The service has to be restarted to recover. | 14:25 |
notq | Is anyone familiar with this type of issue, and can point me in the right direction? | 14:25 |
*** ansmith has joined #openstack-oslo | 14:26 | |
notq | I understand how to debug the issue, but there's no need to debug it; I know the rabbitmq is restarted. Is it something where rabbitmq, on restart, could stop accepting new messages and wait until the in-flight ones finish during shutdown? That would be one solution. | 14:26 |
notq | Another solution would be for the openstack service to understand this state, time out properly if a reply isn't received, and retry | 14:27 |
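A minimal sketch of that second approach, assuming oslo.messaging's RPCClient: the per-call timeout and connection retry count can be set explicitly so a call raises MessagingTimeout instead of blocking indefinitely. The transport URL, topic, and values below are hypothetical, not taken from the deployment being discussed.

```python
import oslo_messaging
from oslo_config import cfg

# Hypothetical transport URL and target; adjust for your deployment.
transport = oslo_messaging.get_rpc_transport(
    cfg.CONF, url='rabbit://guest:guest@rabbitmq:5672/')
target = oslo_messaging.Target(topic='compute', version='5.0')

# timeout: seconds to wait for the RPC reply before raising
# oslo_messaging.MessagingTimeout (mirrors rpc_response_timeout).
# retry: how many connection retries the driver attempts before
# giving up instead of retrying forever.
client = oslo_messaging.RPCClient(transport, target, timeout=60, retry=3)

def ping_service(ctxt):
    try:
        return client.call(ctxt, 'ping')
    except oslo_messaging.MessagingTimeout:
        # The caller can now decide to retry, fail over, or surface the
        # error rather than waiting on a reply that will never arrive.
        raise
```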
dhellmann | notq : which version of openstack and oslo.messaging are you using? there are timeout handling features but I wonder if you've found a buggy configuration | 14:35 |
notq | it's on rocky, not sure the oslo messaging version | 14:36 |
notq | can check | 14:36 |
kgiusti | notq: do you know what service is blocking? Ideally where the service is blocking? | 14:40 |
notq | this time it was nova, but it's happened with others as well. | 14:41 |
notq | where the service is blocking, it's always in a stack trace to amqp | 14:41 |
notq | and it keeps repeating the attempt for the message, and the service stops | 14:42 |
kgiusti | notq: can you pastebin a sample of the stack trace? | 14:42 |
notq | sure | 14:42 |
kgiusti | notq: it may be best to open a launchpad bug on this and upload any logs you have to it: https://bugs.launchpad.net/oslo.messaging/+filebug | 14:43 |
*** bobh has joined #openstack-oslo | 14:43 | |
kgiusti | notq: that way we can track this publicly and others can help out | 14:44 |
notq | ok. It's gone on over many releases. I wasn't on the topic, but the rest of the team always said it was rabbitmq being bad, or a password change, always some sort of weird answer. I was recently assigned to look after them, and this whole class of problems keeps happening the same way. Someone deploys a rabbitmq change, it updates the container, it restarts, and a service loses its connection in this exact same way. | 14:45 |
kgiusti | notq: does the service's tcp connection to the 'new' rabbitmq service come up? | 14:46 |
notq | I've had other rabbitmq connection issues, we're using kubernetes to deploy openstack for reference. | 14:46 |
notq | No, there's no consumer for it, it's waiting on this message id | 14:47 |
notq | the similar problems i've had were with like logstash and rabbitmq. Getting the retry settings just right solved that. | 14:48 |
notq | because if it's connection pooling, or somehow saving state, and the new rabbitmq comes up on the service ip, some settings handle that correctly and others don't | 14:49 |
notq | so for example, a java application caches the dns resolution, which causes a problem. you have to turn off caching. Golang http clients cache the connection as well, you have to turn that off in the http library. Etc etc. | 14:49 |
notq | elasticsearch has an altogether different caching problem, but the same story. Its connection pool saves state, and saving state when a new container comes up on the same service ip doesn't work. | 14:50 |
notq | from experience, i'd guess kombu is saving something with its connection pooling which is causing a problem | 14:52 |
notq | and it never happens with notification buses though. Only for the rpc case where it's waiting for a return message | 14:53 |
*** zaneb has joined #openstack-oslo | 14:55 | |
kgiusti | notq: I'm not familiar with the type of failure you describe - it's the first I've heard of it actually. Can you capture debug logs of the service that gets stuck? | 14:59 |
*** zaneb has quit IRC | 15:01 | |
kgiusti | notq: it could very well be some state in kombu - the oslo.messaging driver relies heavily on kombu to handle connection failures/retries | 15:01 |
notq | I can try. We only turn debug on in staging, so I'll have to reboot the rabbitmq repeatedly until I can catch it. Which is a bit harder given staging is not used to the same extent | 15:01 |
notq | and it will need the oslo debug, I'm guessing, right? not just debug=true | 15:02 |
kgiusti | notq: yeah - oslo messaging debug would be ideal | 15:02 |
*** Luzi has quit IRC | 15:03 | |
kgiusti | notq: usually I enable "oslo.messaging=DEBUG,oslo_messaging=DEBUG" in log_levels for oslo.log | 15:04 |
kgiusti | notq: but if you do have existing logs w/o debug that capture the error that would be helpful also - perhaps we start with that before venturing into debug hell... | 15:05 |
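For reference, a sketch of how a service could raise those two logger levels through oslo.log's defaults; the operator-facing equivalent is adding the same entries to the default_log_levels option in the service's config file. Where this setup code lives is an assumption, not part of the conversation above.

```python
from oslo_log import log as logging

# Bump the oslo.messaging loggers to DEBUG while keeping the other
# default levels untouched; call this before logging.setup().
extra_levels = ['oslo.messaging=DEBUG', 'oslo_messaging=DEBUG']
logging.set_defaults(
    default_log_levels=logging.get_default_log_levels() + extra_levels)
```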
notq | sure, but i doubt it's going to explain much. I think it's simply an issue of kombu and how it handles the service ip situation with kubernetes. Now, perhaps nova shouldn't just keep retrying and dying on the amqp driver failure... | 15:06 |
kgiusti | notq: re: nova - yes oslo.messaging is best effort. It cannot guarantee zero message loss. But it could be that nova is using exceptionally long timeouts. | 15:08 |
notq | well, i think it's failing past exceptionally long timeouts. I think the case is missing the timeout somehow, but I'd need to look at it. | 15:09 |
kgiusti | notq: we (oslo) recently implemented a heartbeat mechanism for faster detection of failed RPC calls specifically for nova, but I don't believe it is being used in Rocky | 15:09 |
notq | we have long timeouts due to recovering, but this will last hours until someone wakes up | 15:09 |
notq | yeah, a heartbeat would be a great solution | 15:10 |
kgiusti | notq: yeesh - hours... not what I'd expect for a retry timeout (more like 10 minutes) | 15:10 |
notq | but, the heartbeat would also have to not be caching the connection to retry, for it to work with the service ip | 15:10 |
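A rough sketch of the heartbeat mechanism kgiusti mentions, assuming a post-Rocky oslo.messaging that supports call_monitor_timeout: the server sends periodic heartbeats while it works on a long call, so the client can distinguish "still running" from "connection lost" without waiting out the full reply timeout. The values here are hypothetical.

```python
import oslo_messaging
from oslo_config import cfg

transport = oslo_messaging.get_rpc_transport(cfg.CONF)
target = oslo_messaging.Target(topic='compute', version='5.0')

# call_monitor_timeout: if no heartbeat arrives from the server within
# this many seconds, the call fails fast; timeout still caps the total
# time allowed for the reply.  (Requires a newer-than-Rocky release.)
client = oslo_messaging.RPCClient(
    transport, target, timeout=3600, call_monitor_timeout=60)
```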
*** zaneb has joined #openstack-oslo | 15:12 | |
kgiusti | notq: can you explain in a bit more detail what you mean by "caching the service ip" - specifically what metadata you believe is being cached? | 15:12 |
notq | I can only give examples of other cases I've sorted out, like with java, or golangs http client. | 15:12 |
notq | basically, there's a service ip, and a real ip. | 15:13 |
notq | if it caches the real ip for connection pooling, or anything else at any point, the dns for example | 15:13 |
kgiusti | notq: ok, gotcha | 15:13 |
notq | then it will keep trying the ip for the old container, and not the new container which is now up on the service ip | 15:13 |
notq | and when i work with library owners on this type of issue, i'm repeatedly told that nothing caches, until we find where it caches. because libraries, inside libraries, inside libraries, might be caching it | 15:14 |
kgiusti | notq: understood. Yeah oslo.messaging simply passes along whatever the value of "transport_url" is to kombu. | 15:16 |
kgiusti | notq: do you happen to know what version kombu is being used? | 15:16 |
notq | i don't, it's a great question. kombu also has issues, and critical bugs fixed in the newest version regarding its reconnection logic | 15:17 |
*** bobh has quit IRC | 15:18 | |
kgiusti | notq: in any case it sounds like a combination of bugs: failure to reconnect to rabbitmq, and failure to time out while waiting for an RPC reply message. | 15:19 |
notq | agreed. | 15:22 |
kgiusti | notq: just looking at the rocky releases for oslo.messaging, there's a requirements bump to py-amqp in release 8.1.0: | 15:26 |
kgiusti | notq: https://bugs.launchpad.net/oslo.messaging/+bug/1780992 | 15:26 |
openstack | Launchpad bug 1780992 in oslo.messaging "Trying to connect to Rabbit servers can timeout if only /etc/hosts entries are used" [Undecided,Fix released] - Assigned to Daniel Alvarez (dalvarezs) | 15:26 |
kgiusti | notq: plus a connection related bugfix: https://bugs.launchpad.net/oslo.messaging/+bug/1745166 | 15:28 |
openstack | Launchpad bug 1745166 in oslo.messaging "amqp.py warns about deprecated feature" [Low,Fix released] - Assigned to Ken Giusti (kgiusti) | 15:28 |
notq | @kgiusti that's very helpful, I wasn't aware that was the release name. | 15:28 |
kgiusti | notq: that last one ended up fixing some intermittent connection failures we were seeing in the python3 gates, FWIW | 15:28 |
*** bobh has joined #openstack-oslo | 15:29 | |
notq | it doesn't help that we have mostly moved off openstack-helm onto our own dedicated helm charts, but the one that failed this time was the openstack-helm rabbitmq, and that may be different | 15:29 |
*** notq has quit IRC | 15:36 | |
*** bobh has quit IRC | 15:58 | |
*** pcaruana has quit IRC | 16:07 | |
*** e0ne_ has quit IRC | 16:16 | |
*** e0ne has joined #openstack-oslo | 16:21 | |
*** e0ne has quit IRC | 16:27 | |
*** kgiusti has quit IRC | 17:24 | |
*** a-pugachev has quit IRC | 17:35 | |
*** zbitter has joined #openstack-oslo | 17:37 | |
*** zaneb has quit IRC | 17:39 | |
*** pcaruana has joined #openstack-oslo | 17:44 | |
openstackgerrit | Merged openstack/oslo.messaging master: Use ensure_connection to prevent loss of connection error logs https://review.openstack.org/615649 | 17:51 |
*** e0ne has joined #openstack-oslo | 18:47 | |
*** e0ne has quit IRC | 18:51 | |
*** e0ne has joined #openstack-oslo | 19:00 | |
*** shardy has quit IRC | 19:27 | |
*** kgiusti has joined #openstack-oslo | 19:45 | |
openstackgerrit | Merged openstack/sphinx-feature-classification master: Optimizing the safety of the http link site in HACKING.rst https://review.openstack.org/618351 | 20:06 |
*** e0ne has quit IRC | 20:07 | |
*** e0ne has joined #openstack-oslo | 20:10 | |
*** e0ne has quit IRC | 20:10 | |
*** zbitter is now known as zaneb | 20:18 | |
*** a-pugachev has joined #openstack-oslo | 20:34 | |
*** a-pugachev has quit IRC | 20:35 | |
*** efried has joined #openstack-oslo | 20:41 | |
efried | dhellmann: Are we allowed to edit renos that are out the door already? | 20:41 |
efried | dhellmann: à la https://review.openstack.org/#/c/617222/ ? | 20:41 |
dhellmann | efried : it's possible and allowed but "dangerous" if done wrong | 20:41 |
efried | dhellmann: Dangerous under what circumstances? | 20:42 |
efried | i.e. what's "done wrong"? | 20:42 |
dhellmann | https://docs.openstack.org/reno/latest/user/usage.html#updating-stable-branch-release-notes | 20:42 |
efried | ... | 20:42 |
efried | nyaha. So that one is bogus, needs to be reproposed to stable/rocky. | 20:43 |
*** raildo has quit IRC | 20:43 | |
dhellmann | efried : I commented; it's better to set up a redirect | 20:44 |
efried | nice, thanks dhellmann | 20:44 |
efried | in fact this reno was introduced in ocata | 20:45 |
efried | afaict | 20:45 |
*** a-pugachev has joined #openstack-oslo | 21:35 | |
*** e0ne has joined #openstack-oslo | 21:35 | |
*** e0ne has quit IRC | 21:36 | |
*** ansmith has quit IRC | 21:42 | |
*** pcaruana has quit IRC | 21:51 | |
openstackgerrit | Merged openstack/oslo.messaging master: Add a test for rabbit URLs lacking terminating '/' https://review.openstack.org/611870 | 21:55 |
*** HenryG has quit IRC | 22:12 | |
*** HenryG has joined #openstack-oslo | 22:14 | |
*** cdent has quit IRC | 22:19 | |
*** kgiusti has quit IRC | 22:39 | |
*** moguimar has quit IRC | 22:39 | |
*** jbadiapa has quit IRC | 22:39 | |
*** dhellmann has quit IRC | 22:39 | |
*** sambetts_ has quit IRC | 22:39 | |
*** dhellmann has joined #openstack-oslo | 22:40 | |
*** jbadiapa has joined #openstack-oslo | 22:40 | |
*** sambetts_ has joined #openstack-oslo | 22:43 | |
efried | dhellmann: More reno fun. See https://review.openstack.org/#/c/618708/ -- adding the release notes job to .zuul.yaml didn't trigger the release notes job to run on that patch. | 23:16 |
efried | I think I get that this is because the current conditionals are set up to only trigger the job if something under the releasenotes/ dir changes | 23:16 |
efried | just wondering how hard it would be to add a condition for this special case. | 23:16 |
efried | If not trivial, I wouldn't bother, since this should be quite rare | 23:17 |
*** a-pugachev has quit IRC | 23:29 | |
*** njohnston has quit IRC | 23:29 | |
*** njohnston has joined #openstack-oslo | 23:30 | |
*** gibi has quit IRC | 23:34 | |
*** gibi has joined #openstack-oslo | 23:35 | |
dhellmann | efried : we could change the job definition to also look at the zuul configuration files, but it would be just as easy to add an empty release notes file in that patch I think | 23:40 |
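For context, a rough sketch (not the actual openstack-zuul-jobs definition) of what widening the file matcher could look like in a repo's .zuul.yaml: Zuul's files attribute lists the path patterns that trigger a job, so adding the Zuul config files themselves would make a patch like 618708 run the release notes job. The job name and patterns below are assumptions.

```yaml
# Hypothetical project-pipeline variant; the real job and its matchers
# live in openstack-zuul-jobs and may differ.
- project:
    check:
      jobs:
        - build-openstack-releasenotes:
            files:
              - ^releasenotes/.*
              - ^\.zuul\.yaml$   # also run when the zuul config changes
```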
*** lbragstad has quit IRC | 23:40 | |
efried | dhellmann: I threw a patch on top to verify. No biggie. Just wondered if it was worth doing in the job def. | 23:40 |
dhellmann | efried : I'd hate to run that job every time we update some other settings :-/ | 23:53 |
efried | dhellmann: Right, it would have to be smart enough to just run the job for the specific line that was added (whatever that may be). | 23:54 |
efried | i.e. this isn't a reno-specific thing. | 23:54 |