Thursday, 2018-12-20

*** jaypipes_ has joined #openstack-placement		01:18
*** mriedem has quit IRC		01:18
*** jaypipes has quit IRC		01:20
*** tetsuro has joined #openstack-placement		02:00
*** bhagyashris has joined #openstack-placement		02:04
openstackgerrit	Jack Ding proposed openstack/nova-specs master: Select cpu model from a list of cpu models https://review.openstack.org/620959	03:03
*** tetsuro has quit IRC		04:09
*** sean-k-mooney_ has joined #openstack-placement		04:45
*** sean-k-mooney has quit IRC		04:45
*** tetsuro has joined #openstack-placement		05:39
*** mgagne has quit IRC		06:33
*** mgagne has joined #openstack-placement		06:40
*** tetsuro has quit IRC		07:04
*** gryf is now known as _gryf		07:15
*** _gryf is now known as gryf		07:16
*** cdent has joined #openstack-placement		07:42
*** avolkov has joined #openstack-placement		07:49
*** tetsuro has joined #openstack-placement		07:50
*** helenafm has joined #openstack-placement		08:14
openstackgerrit	Yongli He proposed openstack/nova-specs master: add spec "show-server-numa-topology" https://review.openstack.org/612256	08:14
*** tetsuro_ has joined #openstack-placement		08:46
*** tetsuro has quit IRC		08:48
*** bhagyashris has quit IRC		09:06
*** tetsuro_ has quit IRC		09:22
*** ttsiouts has joined #openstack-placement		09:26
*** takashin has left #openstack-placement		09:30
*** e0ne has joined #openstack-placement		09:53
*** e0ne has quit IRC		10:29
*** e0ne has joined #openstack-placement		10:30
*** ttsiouts has quit IRC		11:10
*** ttsiouts has joined #openstack-placement		11:11
*** ttsiouts has quit IRC		11:15
*** fanzhang has joined #openstack-placement		11:19
fanzhang	hi, team placement, we recently used rally to run a concurrency test booting a total of 1000 vms with concurrency=45 and times=25. Only a very few vms (about 4/1000) got a NoValidHost error. From the scheduler log we got http://paste.openstack.org/show/737782/ , but the placement service is up and healthy. Any ideas why this happens and how to solve it?	11:23
cdent	fanzhang: thanks, reading the paste	11:24
cdent	fanzhang: is this with master, or an earlier release? What web server setup are you using with placement?	11:25
cdent	That error message suggests that there's a proxy in there somewhere, and the proxy is overloaded (perhaps apache2 in front of mod_wsgi?)	11:26
fanzhang	cdent thanks, we are currently using queens-17.0.3 and apache httpd for web server	11:26
fanzhang	yes, we have haproxy in front of 3 controller nodes	11:26
cdent	if that's the case you'll want to see where one of those requests is dying: at haproxy or at apache	11:27
cdent	and adjust the configuration on those	11:27
cdent	placement itself should be capable of scaling horizontally as much as you like, but the things in front of it need to have their configurations adjusted so they can accept or queue socket connections fast enough to handle what you are doing	11:27
cdent	The BadStatusLine is almost always that a proxy gave up	11:28
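A minimal sketch of the failure mode described above, assuming Python 3's standard library (the scheduler in this log runs Python 2.7, where the equivalent exception is `httplib.BadStatusLine`). A stand-in "proxy" reads the request and then hangs up without sending a status line; the client sees a `BadStatusLine` subclass. The function names and the request path are illustrative only, not taken from nova or placement.

```python
import http.client
import socket
import threading


def hang_up_after_request(listener):
    """Accept one connection, read the request, then close without replying."""
    conn, _ = listener.accept()
    conn.recv(65536)  # consume the request so the close is a clean FIN
    conn.close()      # no status line is ever sent back


def provoke_bad_status_line():
    """Return the name of the exception a client sees when the peer gives up."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # any free local port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=hang_up_after_request, args=(listener,))
    worker.start()
    client = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        client.request("GET", "/")  # hypothetical path
        client.getresponse()
        return "no error"
    except http.client.BadStatusLine as exc:
        # RemoteDisconnected is a subclass of BadStatusLine in Python 3
        return type(exc).__name__
    finally:
        client.close()
        worker.join()
        listener.close()


print(provoke_bad_status_line())
```

The error only says that the peer closed the connection before a response arrived; it does not say which hop closed it, which is why the discussion turns to checking each hop in turn.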
cdent	I've had some luck using different mpm modules with apache, depending on the environment	11:29
fanzhang	thanks, earlier today, we checked the haproxy logs, and filtered the log manually by time. Looks like haproxy was cool, and got response from placement with http code 200 and len(body) 16622.	11:30
cdent	Another possibility is that the nova-scheduler process was unable to deal and dropped the connection before reading the results (from the haproxy)	11:33
cdent	are you running more than one nova-scheduler process?	11:33
fanzhang	yes, one nova-scheduler on a controller node	11:34
fanzhang	3 nova-scheduler processes in total	11:34
fanzhang	oh, btw, there are many warnings in nova-scheduler logs, like 2018-12-20 15:04:50.073 95480 DEBUG nova.api.openstack.placement.wsgi_wrapper [req-2ec0e1b7-2e1f-4d1d-86a0-352b9059890e dae40408cd614d948d8b6756d698e18b 7e64659205d1432c9a49b94534db5344 - default default] Placement API returning an error response: Inventory changed while attempting to allocate: Another thread concurrently updated the data. Please retry your update call_func	11:35
fanzhang	/usr/lib/python2.7/site-packages/nova/api/openstack/placement/wsgi_wrapper.py:31	11:35
cdent	that's excpected	11:35
cdent	sorry, expected	11:35
cdent	what's happening there is that the state of the resource providers and their inventory is changing while you are making allocations, and if your view and the server's view are different the server will tell you with a 409 response	11:36
cdent	that handling gets a bit more clean and clear in rocky	11:36
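The conflict-and-retry behaviour described above can be sketched as optimistic concurrency with a generation counter. This is a toy in-memory model, not placement's code or API: `FakeProvider`, `Conflict`, and `allocate_with_retry` are hypothetical names, and `Conflict` merely stands in for a 409 response.

```python
class Conflict(Exception):
    """Stands in for an HTTP 409 response (hypothetical, not a placement class)."""


class FakeProvider:
    """Toy resource provider whose generation changes on every write."""

    def __init__(self, total):
        self.total = total
        self.used = 0
        self.generation = 0

    def view(self):
        """Return the caller's snapshot: (generation, free capacity)."""
        return self.generation, self.total - self.used

    def allocate(self, amount, seen_generation):
        """Apply a write only if the caller's snapshot is still current."""
        if seen_generation != self.generation:
            raise Conflict("inventory changed while attempting to allocate")
        if amount > self.total - self.used:
            raise ValueError("not enough capacity")
        self.used += amount
        self.generation += 1


def allocate_with_retry(provider, amount, attempts=3, before_write=None):
    """Re-read the provider and retry whenever the write hits a conflict."""
    for attempt in range(1, attempts + 1):
        generation, _free = provider.view()
        if before_write is not None:
            before_write(attempt)  # test hook: lets a rival writer sneak in
        try:
            provider.allocate(amount, generation)
            return attempt
        except Conflict:
            continue
    raise Conflict("gave up after %d attempts" % attempts)


provider = FakeProvider(total=8)


def rival(attempt):
    # A concurrent writer changes the provider during the first attempt only.
    if attempt == 1:
        provider.allocate(1, provider.generation)


attempts_used = allocate_with_retry(provider, 2, before_write=rival)
print(attempts_used, provider.used, provider.generation)
```

The first attempt fails because the rival write changed the generation between the read and the write; the second attempt re-reads and succeeds, which is the "please retry" path the debug message refers to.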
fanzhang	ok...I see, an expected conflict exception it is, right? About the 1st question, I googled a lot, and I assumed that it's the 'server' who killed the connection, but had no clue why.	11:37
*** tetsuro has joined #openstack-placement		11:38
cdent	fanzhang: yeah the notion of 'server' gets pretty messy when two proxies are involved. since you're seeing that the response is making it back to the haproxy, that means it is either dying there or at nova-scheduler	11:40
cdent	since you're doing scale work with placement, you might find this (rather old) blog posting on the topic interesting: https://anticdent.org/placement-scale-fun.html	11:41
*** cdent_ has joined #openstack-placement		11:44
fanzhang	cdent thanks, will read it later, but since we are not currently doing scale work, could you please share some suggestions we could at least try in order to narrow down the assumptions above? For example, how to find out whether it's haproxy or nova-scheduler doing the 'dirty' work?	11:44
*** cdent has quit IRC		11:46
*** cdent_ is now known as cdent		11:46
fanzhang	in case you missed the last reply: thanks, will read it later, but since we are not currently doing scale work, could you please share some suggestions we could at least try in order to narrow down the assumptions above? For example, how to find out whether it's haproxy or nova-scheduler doing the 'dirty' work?	11:47
cdent	fanzhang: it's hard to guess what is going on. it sounds like you've already traced request ids on all the services and narrowed it down somewhat. If you can narrow a bit more that would be good.	11:49
cdent	you might also try implementing the change in this patch: https://review.openstack.org/#/c/159382/	11:49
cdent	to see if the problem is an overloaded nova-scheduler. Do you have any data on the cpu load on the machines where these services are running?	11:50
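One way to narrow things down by request id, sketched under loud assumptions: every hop logs the same id (which requires something like the global request id mentioned in this conversation), and the logs are already collected per hop. `last_hop_seen` is a hypothetical helper, and the sample lines are invented placeholders, not real apache, haproxy, or nova log formats.

```python
def last_hop_seen(request_id, logs_by_hop):
    """Return the last hop, in path order, whose log mentions request_id.

    logs_by_hop is a list of (hop_name, lines) pairs ordered along the
    path the response travels. Returns None if no hop logged the id.
    """
    last = None
    for hop, lines in logs_by_hop:
        if any(request_id in line for line in lines):
            last = hop
    return last


# Invented placeholder lines, ordered along the response path.
sample_logs = [
    ("apache/placement", ["req-1111 response sent status=200"]),
    ("haproxy", ["req-1111 response relayed status=200"]),
    ("nova-scheduler", ["req-2222 response received status=200"]),
]

print(last_hop_seen("req-1111", sample_logs))
```

If the last hop that saw the id is haproxy, the response was lost between haproxy and nova-scheduler, which matches the two remaining suspects named above.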
*** ttsiouts has joined #openstack-placement		11:52
fanzhang	each controller node has 48 cpus and 512G ram; cpu load is high as some folks are currently doing other concurrency tests, now the load average is 15.59, 17.90, 12.42.	11:53
fanzhang	from the zabbix dashboard, at the time we hit the warning, the average cpu utilization of the 3 nodes was 16, 17, 20	11:55
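One common way to read those load averages is to divide them by the cpu count, shown here as a small sketch. `load_per_cpu` is a hypothetical helper; the inputs are the figures quoted above (1, 5, and 15 minute load averages on a 48-cpu node).

```python
def load_per_cpu(load_averages, cpu_count):
    """Normalize load averages by cpu count, rounded to two decimals."""
    return [round(load / cpu_count, 2) for load in load_averages]


per_cpu = load_per_cpu([15.59, 17.90, 12.42], cpu_count=48)
print(per_cpu)
```

A per-cpu value below 1.0 suggests that, averaged over the window, there were fewer runnable tasks than cpus. It is only a rough indicator: short bursts and per-process limits (for example a single busy nova-scheduler process) are invisible at this granularity.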
fanzhang	yes, I traced a lot by request id, and we are currently trying to get global request id to work. The patch above looks useful, thanks so much cdent :)	12:01
fanzhang	my head is a total mess right now...I will go through your advice again and try to grab a cup of coffee :(	12:06
cdent	I know how that can be :)	12:09
fanzhang	thanks again cdent	12:10
cdent	If you get a bit further and have more questions, I should be around, or post to the mailing list. I think it basically comes down to the fact that at a certain point something is giving up because there are too many active tcp sockets in action	12:10
cdent	it's not a specific problem with placement, but rather the whole system	12:11
fanzhang	yes, we have tuned the whole openstack cluster several times, it's really not easy...	12:12
cdent	there are so many moving parts	12:13
fanzhang	oh, another question just came up: https://bugs.launchpad.net/openstack-manuals/+bug/1761649, any ideas why we should disable `option httpchk` for nova-placement in haproxy.cfg ?	12:13
openstack	Launchpad bug 1761649 in openstack-manuals "HAProxy in openstackhaguide nova placement api" [Medium,Triaged]	12:13
cdent	hmmm, I wasn't aware of that bug, let me read through it	12:16
*** tssurya has joined #openstack-placement		12:18
fanzhang	we just hit this bug after adding `option httpchk` to haproxy.cfg, and all placement services went crazy. Simply removing it helps; I googled and found this bug.	12:19
cdent	It's unclear how that would be related, unless the issue with the bad status line is causing the httpchk to invalidate remotes incorrectly	12:20
cdent	but again: that all suggests to me that the haproxy is itself overloaded	12:21
fanzhang	hmmm, really need to take a break here. Thanks so much for your time cdent :)	12:23
fanzhang	you really helped a lot :)	12:23
cdent	you're welcome. good luck.	12:24
*** e0ne has quit IRC		13:19
*** e0ne has joined #openstack-placement		13:33
*** mriedem has joined #openstack-placement		13:38
*** helenafm has quit IRC		13:59
*** e0ne has quit IRC		14:11
*** e0ne_ has joined #openstack-placement		14:11
openstackgerrit	Merged openstack/nova-specs master: Spec: Support filtering by forbidden aggregate https://review.openstack.org/603352	14:30
*** ttsiouts has quit IRC		14:33
*** ttsiouts has joined #openstack-placement		14:34
*** ttsiouts has quit IRC		14:38
*** ttsiouts has joined #openstack-placement		14:43
*** helenafm has joined #openstack-placement		14:53
openstackgerrit	Merged openstack/nova-specs master: Expose virtual device tags in REST API https://review.openstack.org/393930	14:55
*** mriedem is now known as mriedem_afk		15:11
*** tssurya_ has joined #openstack-placement		15:16
*** tssurya has quit IRC		15:18
*** evrardjp has quit IRC		15:37
*** evrardjp has joined #openstack-placement		15:42
*** efried has joined #openstack-placement		16:26
*** tetsuro has quit IRC		16:27
*** mriedem_afk is now known as mriedem		16:32
*** tssurya_ has quit IRC		16:52
*** avolkov has quit IRC		16:52
*** gibi is now known as gibi_off		16:55
*** helenafm has quit IRC		16:55
*** e0ne_ has quit IRC		16:59
*** rubasov has quit IRC		17:01
openstackgerrit	Jack Ding proposed openstack/nova-specs master: Select cpu model from a list of cpu models https://review.openstack.org/620959	17:13
*** tssurya_ has joined #openstack-placement		17:19
*** cdent has quit IRC		17:20
*** tssurya_ is now known as tssurya		17:32
*** rubasov has joined #openstack-placement		17:33
openstackgerrit	sean mooney proposed openstack/nova-specs master: Add spec for sriov live migration https://review.openstack.org/605116	17:41
*** sean-k-mooney_ is now known as sean-k-mooney		17:44
*** ttsiouts has quit IRC		17:50
*** ttsiouts has joined #openstack-placement		17:51
*** ttsiouts has quit IRC		17:55
*** efried has quit IRC		17:59
*** efried has joined #openstack-placement		17:59
*** avolkov has joined #openstack-placement		18:00
*** avolkov has quit IRC		18:13
*** jaypipes_ is now known as jaypipes		18:28
*** avolkov has joined #openstack-placement		18:34
*** e0ne has joined #openstack-placement		18:49
*** tssurya has quit IRC		19:07
*** e0ne_ has joined #openstack-placement		19:25
*** e0ne has quit IRC		19:25
*** N3l1x has joined #openstack-placement		19:47
*** takashin has joined #openstack-placement		20:51
*** e0ne_ has quit IRC		21:21
openstackgerrit	Merged openstack/nova-specs master: Select cpu model from a list of cpu models https://review.openstack.org/620959	22:22
*** N3l1x has quit IRC		22:26
