*** jaypipes_ has joined #openstack-placement | 01:18 | |
*** mriedem has quit IRC | 01:18 | |
*** jaypipes has quit IRC | 01:20 | |
*** tetsuro has joined #openstack-placement | 02:00 | |
*** bhagyashris has joined #openstack-placement | 02:04 | |
openstackgerrit | Jack Ding proposed openstack/nova-specs master: Select cpu model from a list of cpu models https://review.openstack.org/620959 | 03:03 |
---|---|---|
*** tetsuro has quit IRC | 04:09 | |
*** sean-k-mooney_ has joined #openstack-placement | 04:45 | |
*** sean-k-mooney has quit IRC | 04:45 | |
*** tetsuro has joined #openstack-placement | 05:39 | |
*** mgagne has quit IRC | 06:33 | |
*** mgagne has joined #openstack-placement | 06:40 | |
*** tetsuro has quit IRC | 07:04 | |
*** gryf is now known as _gryf | 07:15 | |
*** _gryf is now known as gryf | 07:16 | |
*** cdent has joined #openstack-placement | 07:42 | |
*** avolkov has joined #openstack-placement | 07:49 | |
*** tetsuro has joined #openstack-placement | 07:50 | |
*** helenafm has joined #openstack-placement | 08:14 | |
openstackgerrit | Yongli He proposed openstack/nova-specs master: add spec "show-server-numa-topology" https://review.openstack.org/612256 | 08:14 |
*** tetsuro_ has joined #openstack-placement | 08:46 | |
*** tetsuro has quit IRC | 08:48 | |
*** bhagyashris has quit IRC | 09:06 | |
*** tetsuro_ has quit IRC | 09:22 | |
*** ttsiouts has joined #openstack-placement | 09:26 | |
*** takashin has left #openstack-placement | 09:30 | |
*** e0ne has joined #openstack-placement | 09:53 | |
*** e0ne has quit IRC | 10:29 | |
*** e0ne has joined #openstack-placement | 10:30 | |
*** ttsiouts has quit IRC | 11:10 | |
*** ttsiouts has joined #openstack-placement | 11:11 | |
*** ttsiouts has quit IRC | 11:15 | |
*** fanzhang has joined #openstack-placement | 11:19 | |
fanzhang | hi, team placement, recently we used rally to run a concurrency test booting a total of 1000 vms with concurrency=45 and times=25. Only a very few vms (like 4/1000) got a NoValidHost error. From the scheduler log, we got http://paste.openstack.org/show/737782/ , but the placement service is up and good. Any ideas why this happens and how to solve it? | 11:23 |
cdent | fanzhang: thanks, reading the paste | 11:24 |
cdent | fanzhang: is this with master, or an earlier release? What web server setup are you using with placement? | 11:25 |
cdent | That error message suggests that there's a proxy in there somewhere, and the proxy is overloaded (perhaps apache2 in front of mod wsgi?) | 11:26 |
fanzhang | cdent thanks, we are currently using queens-17.0.3 and apache httpd for web server | 11:26 |
fanzhang | yes, we have haproxy in front of 3 controller nodes | 11:26 |
cdent | if that's the case you'll want to see where one of those requests is dying: at haproxy or at apache | 11:27 |
cdent | and adjust the configuration on those | 11:27 |
cdent | placement itself should be capable of scaling horizontally as much as you like, but the things in front of it need to have their configurations adjusted so they can accept or queue socket connections fast enough to handle what you are doing | 11:27 |
cdent | The BadStatusLine is almost always that a proxy gave up | 11:28 |
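A minimal sketch of how a client ends up with a BadStatusLine, assuming Python 3's `http.client` rather than the Python 2.7 `httplib` used in the deployment discussed here. The throwaway local listener stands in for a proxy that gives up; `serve_once_and_hang_up`, `probe`, and the request path are hypothetical names used only for illustration. The peer accepts the connection, reads the request, and closes the socket without writing a status line, so the client has nothing to parse.

```python
import http.client
import socket
import threading


def serve_once_and_hang_up(listener):
    """Accept one connection, read the request, then close without replying."""
    conn, _ = listener.accept()
    conn.recv(65536)  # consume the request so the close is a clean end-of-stream
    conn.close()


def probe():
    """Send one request to a peer that hangs up; return the exception seen."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    worker = threading.Thread(target=serve_once_and_hang_up, args=(listener,))
    worker.start()

    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        client.request("GET", "/hypothetical-path")
        client.getresponse()
        return None  # not reached in this sketch: the peer never answers
    except http.client.BadStatusLine as exc:
        # Python 3 raises RemoteDisconnected here, a BadStatusLine subclass.
        return exc
    finally:
        client.close()
        worker.join()
        listener.close()
```

The exception only says that the connection closed before a status line arrived. It does not say which hop closed it, which is why the logs of each hop have to be compared.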
cdent | I've had some luck using different mpm modules with apache, depending on the environment | 11:29 |
fanzhang | thanks, earlier today, we checked the haproxy logs, and filtered the log manually by time. Looks like haproxy was cool, and got response from placement with http code 200 and len(body) 16622. | 11:30 |
cdent | Another possibility is that the nova-scheduler process was unable to deal and dropped the connection before reading the results (from the haproxy) | 11:33 |
cdent | are you running more than one nova-scheduler process? | 11:33 |
fanzhang | yes, one nova-scheduler on a controller node | 11:34 |
fanzhang | total 3 nova-scheduler process | 11:34 |
fanzhang | oh, btw, there are many warnings in nova-scheduler logs, like 2018-12-20 15:04:50.073 95480 DEBUG nova.api.openstack.placement.wsgi_wrapper [req-2ec0e1b7-2e1f-4d1d-86a0-352b9059890e dae40408cd614d948d8b6756d698e18b 7e64659205d1432c9a49b94534db5344 - default default] Placement API returning an error response: Inventory changed while attempting to allocate: Another thread concurrently updated the data. Please retry your update call_func | 11:35 |
fanzhang | /usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi_wrapper.py:31 | 11:35 |
cdent | that's excpected | 11:35 |
cdent | sorry, expected | 11:35 |
cdent | what's happening there is that the state of the resource providers and their inventory is changing while you are making allocations and if your view and the server's view is different the server will tell you with a 409 response | 11:36 |
cdent | that handling gets a bit more clean and clear in rocky | 11:36 |
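The conflict described above can be sketched as an optimistic-concurrency retry loop. This is an illustration under stated assumptions, not placement's or nova's actual code: `FakeProvider`, `Conflict`, `allocate_with_retry`, and the `before_write` hook are hypothetical names, and an in-memory generation counter stands in for the server's view of a resource provider. A write that carries a stale generation is refused (the 409 case), and the client re-reads and tries again.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 response in this sketch."""


class FakeProvider:
    """In-memory stand-in for a resource provider with a generation counter."""

    def __init__(self, total):
        self.total = total
        self.used = 0
        self.generation = 0

    def read(self):
        return {"generation": self.generation, "free": self.total - self.used}

    def allocate(self, amount, generation):
        # Refuse the write if the caller's view is older than the server's.
        if generation != self.generation:
            raise Conflict("Inventory changed while attempting to allocate")
        if amount > self.total - self.used:
            raise ValueError("not enough capacity")
        self.used += amount
        self.generation += 1


def allocate_with_retry(provider, amount, attempts=3, before_write=None):
    """Re-read and retry on conflict; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        view = provider.read()
        if before_write is not None:
            before_write(attempt)  # hook used below to simulate a rival writer
        try:
            provider.allocate(amount, view["generation"])
            return attempt
        except Conflict:
            continue  # stale view: loop around and read fresh state
    raise Conflict("gave up after %d attempts" % attempts)


provider = FakeProvider(total=8)


def rival(attempt):
    # A concurrent writer sneaks in between our read and our first write.
    if attempt == 1:
        provider.allocate(1, provider.generation)


attempts_used = allocate_with_retry(provider, 2, before_write=rival)
```

In this run the first write is refused because the rival bumped the generation, and the second attempt succeeds with a fresh read. That is the sense in which the conflict is expected rather than a failure.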
fanzhang | ok...I see, an expected conflict exception it is, right? About the 1st question, I googled a lot, and I assumed that it's the 'server' who killed the connection, but had no clue why. | 11:37 |
*** tetsuro has joined #openstack-placement | 11:38 | |
cdent | fanzhang: yeah the notion of 'server' gets pretty messy when two proxies are involved. since you're seeing that the response is making it back to the haproxy, that means it is either dying there or at nova-scheduler | 11:40 |
cdent | since you're doing scale work with placement, you might find this (rather old) blog posting on the topic interesting: https://anticdent.org/placement-scale-fun.html | 11:41 |
*** cdent_ has joined #openstack-placement | 11:44 | |
fanzhang | cdent thanks, will read it later, but if currently we don't do scale work, would you please share some suggestions we can at least try to make assumptions above more clear? Like wanting to find out if it's haproxy or nova-scheduler doing the 'dirty' work? | 11:44 |
*** cdent has quit IRC | 11:46 | |
*** cdent_ is now known as cdent | 11:46 | |
fanzhang | in case that you miss the last reply: thanks, will read it later, but if currently we don't do scale work, would you please share some suggestions we can at least try to make assumptions above more clear? Like wanting to find out if it's haproxy or nova-scheduler doing the 'dirty' work? | 11:47 |
cdent | fanzhang: it's hard to guess what is going on. it sounds like you've already traced request ids on all the services and narrowed it down somewhat. If you can narrow a bit more that would be good. | 11:49 |
cdent | you might also try implementing the change in this patch: https://review.openstack.org/#/c/159382/ | 11:49 |
cdent | to see if the problem is an overloaded nova-scheduler. Do you have any data on the cpu load on the machines where these services are running? | 11:50 |
*** ttsiouts has joined #openstack-placement | 11:52 | |
fanzhang | 48 cpus, 512G ram as one controller node, cpu load is high as currently some folks are doing other concurrency tests, now the load average is 15.59, 17.90, 12.42. | 11:53 |
fanzhang | from zabbix dashboard, the time we met the warning, cpu utils of 3 nodes were avg 16,17,20 | 11:55 |
fanzhang | yes, I traced a lot by request id, and we are currently trying to get global request id work. The patch above looks useful, thanks so much cdent :) | 12:01 |
fanzhang | my head is total mess right now...I will go through your advice again and try to grab a cup of coffee :( | 12:06 |
cdent | I know how that can be :) | 12:09 |
fanzhang | thanks again cdent | 12:10 |
cdent | If you get a bit further and have more questions, I should be around, or post to the mailing list. I think it basically comes down to the fact that at a certain point something is giving up because there are too many active tcp sockets in action | 12:10 |
cdent | it's not a specific problem with placement, but rather the whole system | 12:11 |
fanzhang | yes, we adjusted several times about the whole openstack cluster, it's really not easy... | 12:12 |
cdent | there are so many moving parts | 12:13 |
fanzhang | oh, another question just came up, https://bugs.launchpad.net/openstack-manuals/+bug/1761649, any ideas why we should disable `option httpchk` for nova-placement in haproxy.cfg ? | 12:13 |
openstack | Launchpad bug 1761649 in openstack-manuals "HAProxy in openstackhaguide nova placement api" [Medium,Triaged] | 12:13 |
cdent | hmmm, I wasn't aware of that bug, let me read through it | 12:16 |
*** tssurya has joined #openstack-placement | 12:18 | |
fanzhang | we just hit this bug after adding `option httpchk` to haproxy.cfg and all placement services went crazy. Simply removing it helped; I googled and found this bug. | 12:19 |
cdent | It's unclear how that would be related, unless the issue with the bad status line is causing the httpchk to invalidate remotes incorrectly | 12:20 |
cdent | but again: that all suggests to me that the haproxy is itself overloaded | 12:21 |
fanzhang | hmmm, really need to take a break here. Thanks so much for your time cdent :) | 12:23 |
fanzhang | you really help a lot :) | 12:23 |
cdent | you're welcome. good luck. | 12:24 |
*** e0ne has quit IRC | 13:19 | |
*** e0ne has joined #openstack-placement | 13:33 | |
*** mriedem has joined #openstack-placement | 13:38 | |
*** helenafm has quit IRC | 13:59 | |
*** e0ne has quit IRC | 14:11 | |
*** e0ne_ has joined #openstack-placement | 14:11 | |
openstackgerrit | Merged openstack/nova-specs master: Spec: Support filtering by forbidden aggregate https://review.openstack.org/603352 | 14:30 |
*** ttsiouts has quit IRC | 14:33 | |
*** ttsiouts has joined #openstack-placement | 14:34 | |
*** ttsiouts has quit IRC | 14:38 | |
*** ttsiouts has joined #openstack-placement | 14:43 | |
*** helenafm has joined #openstack-placement | 14:53 | |
openstackgerrit | Merged openstack/nova-specs master: Expose virtual device tags in REST API https://review.openstack.org/393930 | 14:55 |
*** mriedem is now known as mriedem_afk | 15:11 | |
*** tssurya_ has joined #openstack-placement | 15:16 | |
*** tssurya has quit IRC | 15:18 | |
*** evrardjp has quit IRC | 15:37 | |
*** evrardjp has joined #openstack-placement | 15:42 | |
*** efried has joined #openstack-placement | 16:26 | |
*** tetsuro has quit IRC | 16:27 | |
*** mriedem_afk is now known as mriedem | 16:32 | |
*** tssurya_ has quit IRC | 16:52 | |
*** avolkov has quit IRC | 16:52 | |
*** gibi is now known as gibi_off | 16:55 | |
*** helenafm has quit IRC | 16:55 | |
*** e0ne_ has quit IRC | 16:59 | |
*** rubasov has quit IRC | 17:01 | |
openstackgerrit | Jack Ding proposed openstack/nova-specs master: Select cpu model from a list of cpu models https://review.openstack.org/620959 | 17:13 |
*** tssurya_ has joined #openstack-placement | 17:19 | |
*** cdent has quit IRC | 17:20 | |
*** tssurya_ is now known as tssurya | 17:32 | |
*** rubasov has joined #openstack-placement | 17:33 | |
openstackgerrit | sean mooney proposed openstack/nova-specs master: Add spec for sriov live migration https://review.openstack.org/605116 | 17:41 |
*** sean-k-mooney_ is now known as sean-k-mooney | 17:44 | |
*** ttsiouts has quit IRC | 17:50 | |
*** ttsiouts has joined #openstack-placement | 17:51 | |
*** ttsiouts has quit IRC | 17:55 | |
*** efried has quit IRC | 17:59 | |
*** efried has joined #openstack-placement | 17:59 | |
*** avolkov has joined #openstack-placement | 18:00 | |
*** avolkov has quit IRC | 18:13 | |
*** jaypipes_ is now known as jaypipes | 18:28 | |
*** avolkov has joined #openstack-placement | 18:34 | |
*** e0ne has joined #openstack-placement | 18:49 | |
*** tssurya has quit IRC | 19:07 | |
*** e0ne_ has joined #openstack-placement | 19:25 | |
*** e0ne has quit IRC | 19:25 | |
*** N3l1x has joined #openstack-placement | 19:47 | |
*** takashin has joined #openstack-placement | 20:51 | |
*** e0ne_ has quit IRC | 21:21 | |
openstackgerrit | Merged openstack/nova-specs master: Select cpu model from a list of cpu models https://review.openstack.org/620959 | 22:22 |
*** N3l1x has quit IRC | 22:26 |