corvus | clarkb: oh; er, do you have the whole sequence handy? (or a pointer to docs?) | 00:00 |
---|---|---|
clarkb | ya let me get the email that was sent to the list | 00:00 |
clarkb | corvus: http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000489.html is what I started with | 00:00 |
*** yamamoto has quit IRC | 00:01 | |
clarkb | I performed the pull and update steps in that email for each of the kubernetes containers that showed up as running according to `atomic containers list --no-trunc` | 00:02
clarkb | the only thing outside of that I did was free up disk space so that the pull would work. I did that via `atomic images prune` and `journalctl --vacuum-size=100M` | 00:02
corvus | clarkb: do we need to restart anything after it's done? | 00:03 |
clarkb | corvus: the update command restarts the containers for us (you can confirm with the containers list command) | 00:03 |
corvus | ok i think i grok now | 00:04 |
clarkb | the pull brings down the image, the update restarts services as necessary aiui | 00:04 |
clarkb | corvus: note that the minion was updated too (it runs a subset of the services) | 00:08 |
clarkb | corvus: thinking out loud, does the json change need an executor restart? | 00:10 |
corvus | clarkb: yep. i did your foreach kube*; pull, update on all | 00:10 |
clarkb | that's part of the ansible processes which are forked on demand and should've seen the change as soon as we installed it? | 00:10
corvus | clarkb: i think it will need a restart because it's one of the things we copy into place on startup | 00:11 |
clarkb | ah | 00:11 |
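Pulling the steps above together, a minimal sketch of the update sequence clarkb describes (the kubelet image name, tag, and `--rebase` usage are assumptions drawn from the referenced openstack-discuss post, not from this log; repeat the pull/update pair for each running container):

```sh
# Free disk space first so the new images fit, as mentioned above.
sudo atomic images prune
sudo journalctl --vacuum-size=100M

# See which kubernetes system containers are running.
sudo atomic containers list --no-trunc

# For each container, pull the new image and rebase the container onto it
# (hypothetical image/tag shown; take the real ones from the mailing list post).
sudo atomic pull --storage ostree docker.io/openstackmagnum/kubernetes-kubelet:v1.11.5-1
sudo atomic containers update --rebase \
    docker.io/openstackmagnum/kubernetes-kubelet:v1.11.5-1 kubelet

# "update" restarts the container, which can be confirmed with:
sudo atomic containers list --no-trunc
```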
*** rlandy has quit IRC | 00:11 | |
corvus | infra-root: i've created another k8s cluster via magnum in vexxhost to experiment with running gitea in it | 00:12 |
corvus | clarkb: ^ and i just upgraded it, because i forgot to specify the version on create | 00:12
fungi | okay, i'm fed/inebriated and catching up | 00:12 |
clarkb | corvus: it was actually pretty painless once I figured out how to get around the disk constraints | 00:13
corvus | if this works, i'll be able to push some changes to system-config for it. but for now i'd just like to experiment and see what works | 00:13 |
corvus | clarkb: agreed; i'm done now :) | 00:13 |
clarkb | well that and knowing what to do in the first place but thankfully people sent email to the discuss list about it | 00:13 |
corvus | oh, the cluster is called "opendev" btw | 00:14 |
fungi | we're ~ready to do executor restarts nowish i guess? | 00:16 |
fungi | (unless i've misread) | 00:16 |
clarkb | I am ready if you are | 00:16 |
fungi | ze01 seemed to have the right (new) version installed when i looked a moment ago | 00:19 |
fungi | i'll double-check them all real fast | 00:19 |
corvus | wfm. i'd recommend an all-stop, then update fstab on ze12, then all start. | 00:20 |
corvus | does someone else want to do the all-stop, and i'll fixup ze12 when they're stopped? | 00:20 |
fungi | huh, suddenly i've lost (ipv6 only?) access to ze01 | 00:21 |
clarkb | corvus: I'm happy to do it, though maybe fungi would like to if he hasn't done one before? | 00:21 |
clarkb | fungi: ansible ze*.openstack.org -m shell -a 'sytemctl stop zuul-executor' should work iirf | 00:22 |
clarkb | if you fix the typos :/ | 00:22 |
clarkb | then we wait for all of them to stop and switch stop to start | 00:22 |
fungi | seems i'm suddenly able to connect again and finishing double-checking installed versions | 00:23 |
fungi | okay, ze01-ze12 report the expected versions via pbr freeze | 00:25 |
fungi | ze12 doesn't actually seem to reflect any zuul installed according to pbr freeze so digging deeper | 00:25 |
fungi | ze01-ze11 did | 00:25 |
clarkb | fungi: python3 $(which pbr) freeze ? | 00:26 |
clarkb | chances are pbr is currently installed under python2 so it looks there by default | 00:26 |
fungi | pip3 freeze works, so yeah | 00:26 |
*** bobh has joined #openstack-infra | 00:26 | |
fungi | well, chances are pbr is installed under _both_ (otherwise zuul wouldn't install under python3) but the executable entrypoint is the python2 version | 00:27 |
clarkb | right the "binary" executable | 00:27 |
fungi | pbr freeze indicates "zuul==3.3.2.dev32 # git sha e2520b9" on the others and pip3 freeze on ze12 reports "zuul==3.3.2.dev32" so i take that as a match | 00:28 |
fungi | unfortunately `python3 -m pbr freeze` isn't a thing | 00:29 |
fungi | perhaps mordred knows what needs to be added to make that also work | 00:29 |
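A quick sketch of the two version checks being compared here; the sample output is taken from the log above, and which interpreter owns the `pbr` entry point varies by host:

```sh
# The pbr console script may belong to python2, so run it under python3 explicitly.
python3 $(which pbr) freeze | grep -i zuul
# zuul==3.3.2.dev32  # git sha e2520b9

# pip3 freeze answers the same question without pbr, but drops the git sha.
pip3 freeze | grep -i zuul
# zuul==3.3.2.dev32
```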
fungi | anyway, i think we're good for executor restarts if we want | 00:30 |
corvus | i'm ready | 00:30 |
clarkb | I'm still here | 00:31 |
*** jmorgan1 has quit IRC | 00:31 | |
*** bobh has quit IRC | 00:31 | |
*** yamamoto has joined #openstack-infra | 00:32 | |
fungi | okay, so should we be manually restarting executors one-by-one or use the zuul_restart playbook in system-config? | 00:33 |
corvus | fungi: not that one -- it does a full system restart | 00:33 |
corvus | fungi: i'd just do what clarkb suggested above | 00:33 |
clarkb | fungi: I put an example command above that should work `ansible ze*.openstack.org -m shell -a 'systemctl stop zuul-executor'` | 00:34 |
clarkb | that probably needs sudo on bridge.o.o | 00:34 |
fungi | so `ansible ze*.openstack.org -m shell -a 'sytemctl stop zuul-executor'` (modulo typos) | 00:34
clarkb | ya | 00:34 |
fungi | i'll do that now | 00:35 |
fungi | obviously then we start too | 00:35 |
clarkb | fungi: no you wait before starting | 00:35 |
fungi | k | 00:35 |
clarkb | the executors will take some time to completely stop (so we can check ps -elf | grep zuul-executor or whatever incantation you prefer for that sort of thing) | 00:35
corvus | it takes like 15 minutes to stop | 00:35 |
clarkb | I do something like ps -elf | grep zuul | wc -l to get a countdown metric | 00:36 |
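Putting clarkb's pieces together before they are run below, a rough sketch of the whole stop/wait/start cycle from bridge.openstack.org (host pattern and service name come from the discussion; the polling step is illustrative):

```sh
# Stop the executor service everywhere.
sudo ansible 'ze*.openstack.org' -m shell -a 'systemctl stop zuul-executor'

# Poll until the processes drain; this can take ~15 minutes. The grep and its
# shell wrapper keep the count from ever reaching zero, as noted further down.
sudo ansible 'ze*.openstack.org' -m shell -a 'ps -elf | grep zuul- | wc -l'

# Once every host is down to the baseline count, start them back up.
sudo ansible 'ze*.openstack.org' -m shell -a 'systemctl start zuul-executor'
```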
fungi | on bridge.o.o i've run `sudo ansible ze*.openstack.org -m shell -a 'systemctl stop zuul-executor'` | 00:36 |
fungi | it reported "CHANGED" for all 12 executors, which i take as a good sign | 00:37 |
corvus | yep | 00:37 |
corvus | i see ze12 stopping | 00:37 |
corvus | i have already rsynced the necessary data, so the ze12 switcheroo shouldn't take long | 00:38 |
corvus | ze12 has stopped | 00:38 |
fungi | awesome | 00:39 |
clarkb | ze01 is down to 82 processes by my earlier command | 00:39 |
clarkb | now 80, so trending in the expected direction | 00:39 |
fungi | yeah, `sudo ansible ze*.openstack.org -m shell -a 'ps -elf | grep zuul | 00:40 |
fungi | | wc -l'` | 00:40 |
*** tosky has quit IRC | 00:40 | |
corvus | ze12 is ready to go | 00:40 |
fungi | is returning pretty high numbers (disregard the stray newline) | 00:40 |
fungi | returning 52 for ze12 | 00:41 |
clarkb | fungi: it likely won't go to zero fwiw due to ssh control persist processes being stubborn | 00:41
fungi | k | 00:41 |
corvus | grep for "zuul-" instead | 00:41 |
fungi | ahh | 00:41 |
corvus | as in "zuul-executor" | 00:41 |
fungi | much more reasonable | 00:41 |
fungi | 4 on all but ze12 which returns 2 | 00:42 |
fungi | 2 is the new 0? | 00:42 |
clarkb | fungi: according to puppet: yes | 00:42 |
corvus | with grep, i think so :) | 00:42 |
fungi | k | 00:42 |
corvus | ze12 has been stopped for a while | 00:42 |
fungi | i get it ;) | 00:42 |
clarkb | ze01 looks stopped | 00:43 |
fungi | several of them are returning 2 now, yes | 00:43 |
fungi | , 05, 08 and 10 still going | 00:44 |
fungi | now just 04 and 05 | 00:44 |
*** yamamoto has quit IRC | 00:44 | |
*** mriedem has quit IRC | 00:45 | |
fungi | and now just 05 left | 00:46 |
openstackgerrit | MarcH proposed openstack-infra/git-review master: doc: new testing-behind-proxy.rst; tox.ini: passenv = http[s]_proxy https://review.openstack.org/623361 | 00:47 |
*** yamamoto has joined #openstack-infra | 00:47 | |
fungi | 100% stopped now | 00:48 |
corvus | i cleaned up all old build dirs | 00:48 |
fungi | ready to start, or wait? | 00:48 |
corvus | fungi: ready | 00:48 |
fungi | finger hovering over the button | 00:48 |
fungi | clarkb: all clear? | 00:48 |
clarkb | fungi: ya | 00:49 |
fungi | running | 00:49 |
corvus | ze12 seems happy | 00:49 |
clarkb | I see new executor on ze01 | 00:49 |
fungi | should all be starting up now | 00:49 |
*** Swami has quit IRC | 00:49 | |
fungi | `ps -elf | grep zuul- | wc -l` is returning 4 for all | 00:49 |
openstackgerrit | MarcH proposed openstack-infra/git-review master: CONTRIBUTING.rst, HACKING.rst: fix broken link, minor flow updates https://review.openstack.org/623362 | 00:50 |
clarkb | swap has been stable so far | 00:51 |
corvus | looks like the restart did clear out swap usage too, so it should be easy to compare | 00:52 |
corvus | (ie, swap went to 0 at restart) | 00:52 |
mwhahaha | if anyone is around to promote https://review.openstack.org/#/c/623293/ in the tripleo gate that would be helpful (to stop resets due to nested virt crashes) | 00:54 |
corvus | looks pretty good; i'm going to eod now | 00:54 |
* mwhahaha wanders off | 00:54 | |
fungi | mwhahaha: i've promoted it now | 00:57 |
mwhahaha | Thanks | 00:58 |
*** rkukura has quit IRC | 00:59 | |
clarkb | there are no more queued jobs now | 01:06 |
*** bobh has joined #openstack-infra | 01:06 | |
clarkb | no more executor queued jobs | 01:07 |
clarkb | there may be jobs the scheduler is waiting for nodes on | 01:07 |
*** gyee has quit IRC | 01:09 | |
*** bobh has quit IRC | 01:10 | |
pabelanger | ze12.o.o looks to be running a different kernel for some reason | 01:11 |
pabelanger | Linux ze12 4.4.0-137-generic #163-Ubuntu SMP Mon Sep 24 13:14:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | 01:11 |
pabelanger | this is from ze01 | 01:11 |
pabelanger | Linux ze01 4.15.0-42-generic #45~16.04.1-Ubuntu SMP Mon Nov 19 13:02:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux | 01:11 |
clarkb | hrm how do we apply the hwe kernel? | 01:12 |
clarkb | I thought we puppeted that | 01:12 |
clarkb | maybe we haven't rebooted since puppet ran | 01:12 |
clarkb | I bet that is it | 01:12 |
pabelanger | yah, maybe | 01:12 |
clarkb | since now puppet is a step after launch | 01:12 |
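If the difference really is just a missed reboot after the HWE kernel landed, the usual fix on a Xenial host looks roughly like this (a sketch, assuming Ubuntu 16.04 and that the executor has been stopped first):

```sh
# Compare the running kernel with what is installed.
uname -r
dpkg -l | grep linux-generic

# Install the 16.04 hardware-enablement kernel if it is missing.
sudo apt-get update && sudo apt-get install linux-generic-hwe-16.04

# Reboot to pick up the new kernel.
sudo reboot
```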
pabelanger | but, it seems ze12.o.o is running more jobs right now | 01:12
pabelanger | and using less ram | 01:12 |
pabelanger | unsure if related to kernel or just smaller jobs | 01:13 |
clarkb | pabelanger: the jobs definitely have an impact on executor memory | 01:13 |
clarkb | so could be distribution of "expensive" jobs | 01:13 |
pabelanger | yah, the HDD usage on ze12 is low also | 01:13 |
pabelanger | so, likely smaller projects | 01:13 |
*** kjackal has quit IRC | 01:14 | |
clarkb | ze01 swap use is still pretty stable | 01:18 |
clarkb | ze12 is apparently swapping | 01:18 |
clarkb | which is maybe not surprising since the old kernel has/had issues with swap | 01:18 |
pabelanger | yah, we should stop it and reboot if we wanted | 01:19 |
pabelanger | but ze01.o.o swap looks good | 01:20 |
clarkb | I've got to go to a birthday dinner so I can't do that now but can help in the morning if we want to do that | 01:20 |
pabelanger | wfm | 01:20 |
*** jamesmcarthur has joined #openstack-infra | 01:21 | |
*** jamesmcarthur has quit IRC | 01:24 | |
*** jamesmcarthur has joined #openstack-infra | 01:25 | |
*** rkukura has joined #openstack-infra | 01:40 | |
*** yamamoto has quit IRC | 01:49 | |
*** bobh has joined #openstack-infra | 01:51 | |
*** wolverineav has quit IRC | 01:55 | |
*** bobh has quit IRC | 01:55 | |
*** betherly has joined #openstack-infra | 01:59 | |
*** dave-mccowan has quit IRC | 02:00 | |
*** jamesmcarthur has quit IRC | 02:01 | |
*** betherly has quit IRC | 02:04 | |
*** bobh has joined #openstack-infra | 02:10 | |
*** mrsoul has joined #openstack-infra | 02:12 | |
*** bobh has quit IRC | 02:15 | |
lbragstad | awesome write up to the mailing list clarkb | 02:26 |
*** hongbin has joined #openstack-infra | 02:41 | |
*** bobh has joined #openstack-infra | 02:47 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: add OpenAPI documentation https://review.openstack.org/535541 | 02:49 |
*** bobh has quit IRC | 02:51 | |
*** psachin has joined #openstack-infra | 02:53 | |
*** imacdonn has quit IRC | 02:53 | |
*** imacdonn has joined #openstack-infra | 02:53 | |
*** betherly has joined #openstack-infra | 03:01 | |
*** bhavikdbavishi has joined #openstack-infra | 03:02 | |
*** betherly has quit IRC | 03:05 | |
*** bobh has joined #openstack-infra | 03:06 | |
*** rh-jelabarre has quit IRC | 03:08 | |
*** bobh has quit IRC | 03:10 | |
*** dave-mccowan has joined #openstack-infra | 03:11 | |
*** dave-mccowan has quit IRC | 03:19 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: add OpenAPI documentation https://review.openstack.org/535541 | 03:20 |
*** jamesmcarthur has joined #openstack-infra | 03:23 | |
*** diablo_rojo has quit IRC | 03:25 | |
*** jamesmcarthur has quit IRC | 03:41 | |
*** bobh has joined #openstack-infra | 03:43 | |
*** bobh has quit IRC | 03:48 | |
*** ramishra has joined #openstack-infra | 03:50 | |
*** ykarel|away has joined #openstack-infra | 03:52 | |
*** bobh has joined #openstack-infra | 04:01 | |
*** jamesmcarthur has joined #openstack-infra | 04:02 | |
*** bobh has quit IRC | 04:06 | |
*** jamesmcarthur has quit IRC | 04:07 | |
*** neilsun has joined #openstack-infra | 04:15 | |
*** lbragstad has quit IRC | 04:23 | |
openstackgerrit | Jea-Min, Lim proposed openstack-infra/project-config master: add new project called ku.stella this project is unofficial openstack project https://review.openstack.org/623396 | 04:30 |
*** bobh has joined #openstack-infra | 04:37 | |
*** psachin has quit IRC | 04:41 | |
*** bobh has quit IRC | 04:42 | |
*** wolverineav has joined #openstack-infra | 04:44 | |
*** ykarel|away has quit IRC | 04:46 | |
*** janki has joined #openstack-infra | 04:46 | |
mrhillsman | is there a dib element for running a custom script? | 04:49 |
*** bobh has joined #openstack-infra | 04:56 | |
*** psachin has joined #openstack-infra | 04:58 | |
*** bobh has quit IRC | 05:01 | |
*** ykarel|away has joined #openstack-infra | 05:04 | |
ianw | mrhillsman: umm, every dib element is a custom script? | 05:05 |
ianw | that's kind of the point of them :) | 05:05 |
*** bobh has joined #openstack-infra | 05:14 | |
*** bobh has quit IRC | 05:19 | |
*** wolverineav has quit IRC | 05:21 | |
*** wolverineav has joined #openstack-infra | 05:32 | |
*** bobh has joined #openstack-infra | 05:32 | |
*** wolverineav has quit IRC | 05:36 | |
*** bobh has quit IRC | 05:36 | |
*** bobh has joined #openstack-infra | 05:51 | |
*** bobh has quit IRC | 05:55 | |
prometheanfire | I'm guessing it's been a busy day/week? | 06:02 |
*** bobh has joined #openstack-infra | 06:10 | |
*** bobh has quit IRC | 06:14 | |
*** bhavikdbavishi has quit IRC | 06:18 | |
*** adam_zhang has joined #openstack-infra | 06:22 | |
*** bobh has joined #openstack-infra | 06:28 | |
*** hongbin has quit IRC | 06:32 | |
*** bobh has quit IRC | 06:33 | |
*** kjackal has joined #openstack-infra | 06:42 | |
*** adam_zhang has quit IRC | 06:44 | |
*** bobh has joined #openstack-infra | 06:46 | |
*** bobh has quit IRC | 06:51 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: add OpenAPI documentation https://review.openstack.org/535541 | 06:52 |
*** hwoarang has quit IRC | 06:54 | |
*** hwoarang has joined #openstack-infra | 06:56 | |
openstackgerrit | Quique Llorente proposed openstack-infra/zuul master: Add default value for relative_priority https://review.openstack.org/622175 | 07:01 |
*** rcernin has quit IRC | 07:01 | |
*** bobh has joined #openstack-infra | 07:03 | |
openstackgerrit | Merged openstack-infra/zuul master: web: refactor status page to use a reducer https://review.openstack.org/621395 | 07:04 |
openstackgerrit | Merged openstack-infra/zuul master: web: refactor jobs page to use a reducer https://review.openstack.org/621396 | 07:06 |
*** yamamoto has joined #openstack-infra | 07:07 | |
*** bobh has quit IRC | 07:07 | |
*** betherly has joined #openstack-infra | 07:08 | |
*** dklyle has quit IRC | 07:09 | |
*** dklyle has joined #openstack-infra | 07:10 | |
*** yamamoto has quit IRC | 07:11 | |
*** betherly has quit IRC | 07:13 | |
*** pgaxatte has joined #openstack-infra | 07:19 | |
*** bobh has joined #openstack-infra | 07:21 | |
*** dpawlik has joined #openstack-infra | 07:24 | |
*** bobh has quit IRC | 07:26 | |
*** kjackal has quit IRC | 07:28 | |
*** jtomasek has joined #openstack-infra | 07:28 | |
*** ykarel|away is now known as ykarel | 07:35 | |
*** alexchadin has joined #openstack-infra | 07:35 | |
*** bobh has joined #openstack-infra | 07:38 | |
*** bobh has quit IRC | 07:43 | |
*** dims has quit IRC | 07:44 | |
*** dims has joined #openstack-infra | 07:47 | |
*** kjackal has joined #openstack-infra | 07:54 | |
*** bobh has joined #openstack-infra | 07:56 | |
*** bobh has quit IRC | 08:01 | |
*** ykarel is now known as ykarel|lunch | 08:02 | |
*** bobh has joined #openstack-infra | 08:14 | |
*** bobh has quit IRC | 08:19 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Use combined status for Github status checks https://review.openstack.org/623417 | 08:23 |
*** bobh has joined #openstack-infra | 08:30 | |
*** bobh has quit IRC | 08:35 | |
tobias-urdin | tonyb: thank you tony! | 08:38 |
*** jpena|off is now known as jpena | 08:43 | |
*** bobh has joined #openstack-infra | 08:48 | |
*** shardy has joined #openstack-infra | 08:52 | |
*** bobh has quit IRC | 08:53 | |
*** jpich has joined #openstack-infra | 08:57 | |
*** ykarel|lunch is now known as ykarel | 09:00 | |
*** yamamoto has joined #openstack-infra | 09:04 | |
*** bobh has joined #openstack-infra | 09:07 | |
*** ccamacho has joined #openstack-infra | 09:09 | |
*** ccamacho has quit IRC | 09:09 | |
*** bobh has quit IRC | 09:11 | |
tobiash | clarkb, corvus: when I look at grafana I think the starting builds graph looks odd. During the times when all executors deregistered it's constantly at 5. So maybe something changed in the starting jobs phase. | 09:14 |
*** ccamacho has joined #openstack-infra | 09:15 | |
tobiash | clarkb, corvus: to me it looks like job starting (maybe repo setup or gathering facts) is slower. So the swapping could be a red herring. | 09:16 |
*** bobh has joined #openstack-infra | 09:23 | |
*** alexchadin has quit IRC | 09:24 | |
*** bobh has quit IRC | 09:27 | |
*** bobh has joined #openstack-infra | 09:37 | |
*** gfidente has joined #openstack-infra | 09:37 | |
*** bobh has quit IRC | 09:41 | |
*** bobh has joined #openstack-infra | 09:51 | |
*** bobh has quit IRC | 09:55 | |
*** e0ne has joined #openstack-infra | 09:59 | |
*** electrofelix has joined #openstack-infra | 10:02 | |
*** jamesmcarthur has joined #openstack-infra | 10:03 | |
*** verdurin has quit IRC | 10:04 | |
*** jamesmcarthur has quit IRC | 10:07 | |
*** verdurin has joined #openstack-infra | 10:07 | |
*** bobh has joined #openstack-infra | 10:09 | |
*** bobh has quit IRC | 10:13 | |
stephenfin | I've noticed that I seem to be getting signed out of Gerrit each day. Has something changed in the past ~3 weeks? | 10:19 |
*** yamamoto has quit IRC | 10:22 | |
*** dpawlik has quit IRC | 10:23 | |
*** dpawlik has joined #openstack-infra | 10:23 | |
*** bobh has joined #openstack-infra | 10:24 | |
*** bhavikdbavishi has joined #openstack-infra | 10:27 | |
*** bobh has quit IRC | 10:29 | |
*** ccamacho has quit IRC | 10:31 | |
*** agopi is now known as agopi-pto | 10:38 | |
frickler | stephenfin: no changes that I know of, and I seem to stay logged in as long as I'm active once per day | 10:40 |
*** ccamacho has joined #openstack-infra | 10:41 | |
stephenfin | frickler: Ack. Must be a client side issue so. I'll investigate. Thanks! :) | 10:41 |
frickler | amorin: infra-root: I just saw a gate failure with three simultaneous timed_out jobs all on bhs1, so it seems that something is still not good there, even with the reduced load | 10:43 |
*** agopi-pto has quit IRC | 10:44 | |
*** bobh has joined #openstack-infra | 10:49 | |
*** yamamoto has joined #openstack-infra | 10:50 | |
*** pbourke has quit IRC | 10:50 | |
*** bobh has quit IRC | 10:53 | |
*** yamamoto has quit IRC | 10:56 | |
*** eernst has joined #openstack-infra | 10:58 | |
*** wolverineav has joined #openstack-infra | 11:00 | |
*** bobh has joined #openstack-infra | 11:01 | |
*** wolverineav has quit IRC | 11:05 | |
*** bobh has quit IRC | 11:06 | |
*** eernst has quit IRC | 11:08 | |
*** kjackal has quit IRC | 11:15 | |
*** bobh has joined #openstack-infra | 11:20 | |
*** bobh has quit IRC | 11:24 | |
*** pbourke has joined #openstack-infra | 11:27 | |
*** kjackal has joined #openstack-infra | 11:33 | |
*** sshnaidm|afk is now known as sshnaidm|off | 11:33 | |
*** yamamoto has joined #openstack-infra | 11:36 | |
*** bobh has joined #openstack-infra | 11:37 | |
*** bobh has quit IRC | 11:41 | |
*** yamamoto has quit IRC | 11:46 | |
openstackgerrit | Jens Harbott (frickler) proposed openstack-infra/project-config master: Disable ovh bhs1 and gra1 https://review.openstack.org/623457 | 11:46 |
frickler | amorin: infra-root: ^^ this would be my measure of last resort, unless you have a better idea. but gate queues seem to effectively be stuck due to the large number of timeouts and queue resets | 11:48
*** gfidente has quit IRC | 11:49 | |
*** bobh has joined #openstack-infra | 11:49 | |
*** tosky has joined #openstack-infra | 11:52 | |
*** bobh has quit IRC | 11:53 | |
*** dtantsur|afk is now known as dtantsur\ | 11:54 | |
*** dtantsur\ is now known as dtantsur | 11:54 | |
*** slaweq has joined #openstack-infra | 12:03 | |
*** yamamoto has joined #openstack-infra | 12:14 | |
*** bobh has joined #openstack-infra | 12:15 | |
*** bobh has quit IRC | 12:20 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Add timer for starting_builds https://review.openstack.org/623468 | 12:24 |
tobiash | corvus, clarkb: having metrics about the job startup times could be helpful ^ | 12:24 |
*** e0ne has quit IRC | 12:27 | |
fungi | stephenfin: there's a long standing (mis?)behavior with gerrit where if you have multiple tabs open and you try to use one you've gotten inadvertently signed out of and sign it back in, that will invalidate the session token used by other gerrit tabs in your browser. workaround is to make sure one is signed in and working, then go around and force refresh all your other gerrit tabs _before_ trying to | 12:38
fungi | click anything in them | 12:38 |
*** tobiash has quit IRC | 12:38 | |
*** rh-jelabarre has joined #openstack-infra | 12:39 | |
*** bobh has joined #openstack-infra | 12:39 | |
stephenfin | fungi: Ahhhh, I've seen that. It's probably the fact I have tabs open from before vacation that's throwing me so | 12:39 |
* stephenfin goes and refreshes everything manually | 12:39 | |
*** kaisers has quit IRC | 12:39 | |
fungi | frickler: are you sure the gate's stuck? we seem to be merging 10-15 changes an hour. looks to me like we're approving changes faster than we can get them through | 12:40 |
panda | is zuul in infra deployed using container images build with pbrx ? | 12:40 |
fungi | panda: not yet, no | 12:40 |
fungi | still deployed with the puppet-zuul module | 12:40 |
*** jpena is now known as jpena|lunch | 12:40 | |
*** bhavikdbavishi has quit IRC | 12:41 | |
*** psachin has quit IRC | 12:41 | |
panda | fungi: but is planned to be ? | 12:41 |
fungi | panda: i believe so, yes. https://specs.openstack.org/openstack-infra/infra-specs/specs/update-config-management.html outlines the transition plan | 12:43 |
fungi | the containers section mentions "For our Python services, a new tool is in work, pbrx, which has a command for making single-process containers from pbr setup.cfg files and bindep.txt." | 12:44 |
panda | fungi: ah, the link I was looking for. Thanks! | 12:48 |
*** ahosam has joined #openstack-infra | 12:49 | |
fungi | always happy when i can point someone to actual documentation! | 12:49 |
*** tobiash_ has joined #openstack-infra | 12:52 | |
*** tobiash has joined #openstack-infra | 12:53 | |
*** kaisers has joined #openstack-infra | 12:56 | |
*** bobh has quit IRC | 12:56 | |
*** bobh has joined #openstack-infra | 12:57 | |
*** tobiash has quit IRC | 13:02 | |
*** tobiash has joined #openstack-infra | 13:02 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Add timer for starting_builds https://review.openstack.org/623468 | 13:03 |
frickler | fungi: most changes that merge seem to be outside of integrated or tripleo queues. still things seem to work a bit better currently, will wait for the next set of results from integrated queue | 13:05 |
fungi | clarkb was helping track down a lot of failures which weren't related to performance in ovh regions | 13:08 |
*** boden has joined #openstack-infra | 13:08 | |
fungi | which were impacting the tripleo and integrated gate queues | 13:08 |
*** gfidente has joined #openstack-infra | 13:08 | |
frickler | I added the timeouts that I saw today at the bottom of https://etherpad.openstack.org/p/bhs1-test-node-slowness and they all were in ovh | 13:09 |
frickler | but I'm also fine with waiting to see how things evolve over the weekend | 13:10 |
fungi | i can run some more analysis of timeouts seen by logstash and break them down by provider weighted by max-servers in each | 13:11 |
*** chandan_kumar is now known as chkumar|off | 13:12 | |
*** e0ne has joined #openstack-infra | 13:13 | |
*** jamesmcarthur has joined #openstack-infra | 13:18 | |
*** bobh has quit IRC | 13:22 | |
*** EmilienM is now known as EvilienM | 13:22 | |
*** bobh has joined #openstack-infra | 13:24 | |
*** janki has quit IRC | 13:24 | |
frickler | fungi: https://ethercalc.openstack.org/jg8f4p7jow5o , seems to point to bhs1 still being bad, so maybe disable only that region | 13:25 |
frickler | that's from the results for the last 12h on http://logstash.openstack.org/#/dashboard/file/logstash.json?query=(message:%20%5C%22FAILED%20with%20status:%20137%5C%22%20OR%20message:%20%5C%22FAILED%20with%20status:%20143%5C%22%20OR%20message:%20%5C%22RUN%20END%20RESULT_TIMED_OUT%5C%22)%20AND%20tags:%20%5C%22console%5C%22%20AND%20voting:1&from=864000s | 13:26 |
*** bobh has quit IRC | 13:29 | |
fungi | frickler: thanks, and yeah that should only be since the most recent max-servers change to bhs1 so presumably fairly accurate | 13:31 |
fungi | i agree no need to drop gra1, it's doing better on this than all the rackspace regions according to your analysis | 13:31 |
fungi | hard to know if vexxhost-ca-ymq-1 is statistically significant there | 13:33 |
fungi | so i wouldn't go drawing any conclusions from that one | 13:34 |
*** jamesmcarthur has quit IRC | 13:35 | |
*** neilsun has quit IRC | 13:36 | |
*** jpena|lunch is now known as jpena | 13:37 | |
*** sshnaidm|off has quit IRC | 13:37 | |
*** bobh has joined #openstack-infra | 13:39 | |
*** jcoufal has joined #openstack-infra | 13:41 | |
*** eumel8 has joined #openstack-infra | 13:42 | |
*** edmondsw has quit IRC | 13:42 | |
openstackgerrit | Frank Kloeker proposed openstack-infra/project-config master: Re-activate translation job for Trove https://review.openstack.org/623492 | 13:42 |
openstackgerrit | Frank Kloeker proposed openstack-infra/openstack-zuul-jobs master: Add Trove to project doc translation https://review.openstack.org/623493 | 13:43 |
*** bobh has quit IRC | 13:43 | |
*** rlandy has joined #openstack-infra | 13:44 | |
*** tobiash has left #openstack-infra | 13:44 | |
*** tobiash_ is now known as tobiash | 13:45 | |
*** alexchadin has joined #openstack-infra | 13:47 | |
*** jamesmcarthur has joined #openstack-infra | 13:49 | |
frickler | fungi: vexxhost-ca-ymq-1 seems to be only special nodes for kata, so that shouldn't be relevant. | 13:50 |
*** bobh has joined #openstack-infra | 13:53 | |
fungi | agreed | 13:56 |
*** bobh has quit IRC | 13:58 | |
*** jpich has quit IRC | 14:02 | |
*** jpich has joined #openstack-infra | 14:03 | |
*** kgiusti has joined #openstack-infra | 14:04 | |
*** dave-mccowan has joined #openstack-infra | 14:05 | |
*** lbragstad has joined #openstack-infra | 14:06 | |
*** eharney has quit IRC | 14:06 | |
openstackgerrit | Jens Harbott (frickler) proposed openstack-infra/project-config master: Disable ovh bhs1 https://review.openstack.org/623457 | 14:06 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul master: Add nodepool.host_id variable to inventory file https://review.openstack.org/623496 | 14:07 |
*** dave-mccowan has quit IRC | 14:10 | |
openstackgerrit | Merged openstack-infra/nodepool master: Include host_id for openstack provider https://review.openstack.org/623107 | 14:10 |
*** edmondsw has joined #openstack-infra | 14:12 | |
mnaser | morning infra-root | 14:19 |
mnaser | we're having massive issues with our centos-7 builds with 7.6 being out | 14:19 |
mnaser | a lot of timeouts and really slow operations | 14:19 |
mordred | mnaser: how spectacular | 14:19 |
mnaser | could i request a node hold? | 14:19 |
mnaser | mordred: stable enterprise os they said | 14:20 |
mnaser | almost everything is timing out, it looks like slow io.. or slow network.. something is slow | 14:20 |
mnaser | i dont know what | 14:20 |
mordred | mnaser: yeah. value in not changing they said | 14:20 |
dmsimard | mnaser: which job for which project you'd like to hold ? | 14:20 |
mnaser | dmsimard: can we have it hold based on job? we have `openstack-ansible-functional-centos-7` or `openstack-ansible-functional-distro_install-centos-7` on any project | 14:21 |
mnaser | all roles are broken | 14:21 |
dmsimard | mnaser: yes, it can be based on job | 14:21 |
mnaser | dmsimard: awesome, any of those two should be ok. | 14:21 |
dmsimard | mnaser: what project? openstack/openstack-ansible ? | 14:22 |
dmsimard | I know you said any project, but I need one :p | 14:23 |
mnaser | dmsimard: it is affecting all of our roles so anything openstack/openstack-ansible-* .. the integrated is broken so i can find a job from openstack/openstack-ansible | 14:23 |
mnaser | ah if its any project then lets just do openstack/openstack-ansible but i have different job name for you | 14:23 |
dmsimard | ok | 14:23 |
mnaser | openstack-ansible-deploy-aio_lxc-centos-7 | 14:23 |
mnaser | and/or openstack-ansible-deploy-aio_metal-centos-7 | 14:23 |
dmsimard | mnaser: the holds are set, next time there's a failure they will be available | 14:24 |
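For context, an autohold like the ones just set is created on the Zuul scheduler with something along these lines (a sketch; the reason string and count are illustrative):

```sh
# Hold the nodes of the next failing build of this job for debugging.
sudo zuul autohold --tenant openstack \
    --project openstack/openstack-ansible \
    --job openstack-ansible-deploy-aio_lxc-centos-7 \
    --reason "mnaser debugging centos 7.6 slowness" \
    --count 1
```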
mnaser | dmsimard: cool, will they catch an already running job that might fail | 14:25
mnaser | or will it have to be one started now? | 14:25
dmsimard | not 100% sure | 14:25 |
mnaser | one literally just failed now, i wonder if were quick enough =P | 14:25 |
dmsimard | let me see | 14:25 |
mnaser | https://review.openstack.org/#/c/618711/ 1 minute ago | 14:25 |
*** mriedem has joined #openstack-infra | 14:25 | |
*** lbragstad has quit IRC | 14:25 | |
dmsimard | yeah it triggered the autohold, one min | 14:25 |
mnaser | perfect timing woo | 14:27 |
dmsimard | mnaser: root@104.130.117.58 | 14:28 |
dmsimard | mnaser: it's in rax-ord | 14:28 |
mnaser | awesome, works, thank you | 14:28 |
mnaser | ssh taking ages to log me in | 14:28 |
mnaser | so thats a good start | 14:28 |
*** lbragstad has joined #openstack-infra | 14:29 | |
dmsimard | looks like ansible-playbook is still running | 14:29 |
mnaser | yeah, the job didn't fail but timed out (it really shouldn't take this long) | 14:29 |
fungi | mnaser: any chance you're seeing these slow runs just in ovh-bhs1? i think we're about to disable it again because of a disproportionate amount of job timeouts there | 14:30 |
mnaser | fungi: i've seen some but in this case it's actually been centos and i've noticed it on rax-ord | 14:30 |
mnaser | which in my experience is a stable region but i dunno if 7.6 made weird things | 14:30 |
fungi | okay, so some more systemic issue i suppose | 14:30 |
dmsimard | mnaser: setup-infrastructure is 56 minutes in your patch: http://logs.openstack.org/11/618711/3/check/openstack-ansible-deploy-aio_lxc-centos-7/ec4a635/logs/ara-report/ | 14:30 |
mnaser | thing is even log collection | 14:31 |
dmsimard | it's 28 minutes on opensuse | 14:31 |
mnaser | times out after 30 minutes | 14:31 |
mnaser | so something is strange on centos machines | 14:31 |
*** jamesmcarthur has quit IRC | 14:31 | |
mnaser | we cant even collect logs because we timeout | 14:31 |
mnaser | http://logs.openstack.org/42/614342/4/check/openstack-ansible-functional-centos-7/517947d/job-output.txt.gz#_2018-12-06_17_57_16_163687 | 14:32 |
mnaser | have a look at that | 14:32 |
dmsimard | mnaser: looking at the ara report for that timeout'd job, there's definitely something going on | 14:33 |
guilhermesp | yeah mostly yesterday, around 3 rechecks with timeout collecting logs http://logs.openstack.org/20/618820/48/check/openstack-ansible-functional-centos-7/ca0f625/job-output.txt.gz#_2018-12-07_03_13_17_201631 | 14:33 |
mnaser | http://logs.openstack.org/11/618711/3/check/openstack-ansible-deploy-aio_lxc-centos-7/ec4a635/logs/ara-report/result/143042ae-27a8-4601-8506-1f4f7bea56a6/ | 14:33 |
dmsimard | mnaser: templating a file shouldn't take >5 minutes | 14:34 |
dmsimard | creating files/directories too | 14:34 |
mnaser | yeah.. | 14:34 |
mnaser | fungi: what was the dd command you've been using? | 14:34 |
mordred | wow. the systemd log verification failure is nice | 14:34 |
mnaser | :D | 14:34 |
mnaser | i cant remember if hwoarang or cloudnull worked on that | 14:35 |
fungi | mnaser: i took it from your swapfile setup example log: sudo dd if=/dev/zero of=/foo bs=1M count=4096 | 14:35 |
mnaser | but one is dealing with a new born and the other is enjoying far east asia | 14:35 |
mnaser | so dont think they can comment too much :) | 14:35 |
mnaser | ok running that to see what sort of numbers i get | 14:35 |
*** wolverineav has joined #openstack-infra | 14:36 | |
dmsimard | taking 1m30s to create a directory http://logs.openstack.org/11/618711/3/check/openstack-ansible-deploy-aio_lxc-centos-7/ec4a635/logs/ara-report/file/aee893b3-c5f4-4c21-943a-367431ef584c/#line-46 ... | 14:36 |
mnaser | 4294967296 bytes (4.3 GB) copied, 9.86048 s, 436 MB/s | 14:37 |
mnaser | hmm | 14:37 |
mnaser | see whats interesting is | 14:37 |
mnaser | when you login, it takes a while to get a terminal | 14:37 |
mnaser | im wondering if that has to do with it.. | 14:37 |
mnaser | pam_systemd(sshd:session): Failed to create session: Failed to activate service 'org.freedesktop.login1': timed out | 14:37 |
mordred | wow | 14:38 |
*** eernst has joined #openstack-infra | 14:38 | |
fungi | looks like we lost a merger around 09:00z | 14:38 |
*** bobh has joined #openstack-infra | 14:38 | |
logan- | following up on the nested virt issues we were seeing in limestone over the past 3-4 days: all of the HVs have been upgraded from xenial hwe kernel 4.15.0.34.56 to 4.15.0.42.63 now and my jobs that require nested virt are happy again, so I assume others are also. | 14:38
fungi | i'll see if i can figure out which merger is missing | 14:38 |
*** wolverineav has quit IRC | 14:41 | |
fungi | zuul-merger is running on all the dedicated zm hosts | 14:42 |
dmsimard | mnaser: not sure if related, but watching the processes, sudo commands seem to get stuck a lot? | 14:42
dmsimard | i.e, http://paste.openstack.org/raw/736820/ | 14:42 |
mnaser | dmsimard: that 100% is related to that logind stuff | 14:42 |
mnaser | journalctl | grep login1 | 14:43 |
dmsimard | oh yeah it's obvious now | 14:43 |
*** takamatsu has joined #openstack-infra | 14:43 | |
*** alexchadin has quit IRC | 14:43 | |
mnaser | now i dont know if we're breaking it or if 7.6 broke it | 14:43 |
fungi | actually, have we lost 4 mergers? we should have 20, right? 8 dedicated and 12 accessory to the executors? | 14:44 |
fungi | http://grafana.openstack.org/d/T6vSHcSik/zuul-status?panelId=30&fullscreen&orgId=1&from=now%2Fd&to=now%2Fd | 14:44 |
fungi | yep | 14:44 |
dmsimard | mnaser: from http://grafana.openstack.org/d/T6vSHcSik/zuul-status?orgId=1 | 14:45 |
dmsimard | mnaser: wrong name, meant fungi sorry :p | 14:45 |
mnaser | figured :) | 14:45 |
dmsimard | mnaser: in systemd-logind journal there's "Failed to abandon session scope: Transport endpoint is not connected" which leads to https://github.com/systemd/systemd/issues/2925 | 14:45 |
mnaser | dmsimard: thats super useful | 14:46 |
mnaser | it looks like something happened at 12:14:52 which restarted dbus | 14:46 |
mnaser | and since then it never came back | 14:46 |
frickler | "Restarting dbus is not supported generally. That disconnects all clients, and the system generally cannot recover from that. This is a dbus limitation." - poettering | 14:48 |
fungi | ze12 seems to have registered an oom-killer event in dmesg at 23:21:17z | 14:48 |
mnaser | looking here https://github.com/openstack/openstack-ansible-lxc_hosts/blob/6eee41f123dd49d73ad2851b878c11efd6cfffa2/tasks/lxc_cache_preparation_systemd_old.yml | 14:48 |
mnaser | ill move this to #openstack-ansible | 14:48 |
*** yamamoto has quit IRC | 14:49 | |
fungi | the other executors don't indicate any recent oom events, but then again they turn over their dmesg ring buffers rather rapidly | 14:49 |
openstackgerrit | Frank Kloeker proposed openstack-infra/project-config master: Add translation job for storyboard https://review.openstack.org/623508 | 14:50 |
fungi | not super urgent as the mergers seem to be keeping up. occasional spikes of 10-20 queued but they clear quickly | 14:50 |
eumel8 | fungi, mordred: ^^ We want to start with storyboard translation in that cycle. | 14:51 |
fungi | but a little worried that the merger threads on the executors may be dying inexplicably | 14:51 |
*** sshnaidm has joined #openstack-infra | 14:52 | |
*** jcoufal has quit IRC | 14:54 | |
*** jamesmcarthur has joined #openstack-infra | 14:57 | |
*** jcoufal has joined #openstack-infra | 14:58 | |
*** diablo_rojo has joined #openstack-infra | 14:58 | |
*** ccamacho has quit IRC | 15:00 | |
*** armstrong has joined #openstack-infra | 15:00 | |
*** sshnaidm has quit IRC | 15:03 | |
*** slaweq has quit IRC | 15:04 | |
*** eharney has joined #openstack-infra | 15:05 | |
*** ramishra has quit IRC | 15:06 | |
openstackgerrit | Frank Kloeker proposed openstack-infra/project-config master: Add translation job for storyboard https://review.openstack.org/623508 | 15:09 |
*** ykarel is now known as ykarel|away | 15:11 | |
pabelanger | fungi: ze12 needs to be rebooted as it is running a non HWE kernel. The only executor that is different | 15:12 |
AJaeger | eumel8: please ask storyboard cores to review that change and +1 | 15:12 |
*** dpawlik has quit IRC | 15:12 | |
fungi | pabelanger: yep, i assume that plays into the oom situation on that server | 15:13 |
eumel8 | AJaeger: thx | 15:14 |
mnaser | mriedem: i see you do this often, but how do you take a bug and list it across affecting different releases? | 15:15 |
mnaser | inside launchpad | 15:15 |
fungi | mnaser: you have to be in the bug supervisor group for those projects, i think | 15:16 |
mriedem | right | 15:16 |
mriedem | "Target to series" | 15:16 |
fungi | and if you are, there's a little icon where you can get into the project details for a given bugtask and add specific series | 15:16 |
mriedem | and the series has to be managed properly in launchpad | 15:17 |
mnaser | ok i see that, alright, looks like we need to add the new releases because it looks like the last series we have there is newton | 15:17
mriedem | yup https://launchpad.net/nova/+series | 15:17 |
mriedem | lots of projects in lp don't track the series | 15:17 |
fungi | yeah, it's rather a bit of setup so unless it's something you expect to make a lot of use of i'm not sure i'd bother | 15:17 |
mnaser | ack | 15:17 |
*** ykarel|away has quit IRC | 15:18 | |
*** ccamacho has joined #openstack-infra | 15:18 | |
dmsimard | For the record, we have identified what was causing the OSA CentOS timeouts and they're testing a fix right now. A task ended up restarting dbus and this apparently causes issues with systemd-logind which leads to 25s timeouts for every ssh, sudo (and ansible!) command | 15:20
dmsimard | It seems like a well documented bug and the consensus appears to be that dbus isn't meant to be restarted, it should only be reloaded if need be | 15:20 |
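A minimal sketch of the checks and the reload-instead-of-restart approach being described; the symptoms and the restart-vs-reload distinction come from the discussion above, while whether a reload is actually sufficient for the role's original purpose is an assumption:

```sh
# Symptoms: every ssh/sudo stalls ~25s and the journal shows logind failures.
journalctl -u systemd-logind --since=-1h | grep -i 'Failed to'
journalctl _COMM=sshd | grep 'Failed to create session'

# dbus generally cannot be restarted on a live system without breaking clients
# such as systemd-logind; reload its configuration instead.
sudo systemctl reload dbus    # rather than: sudo systemctl restart dbus
```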
mordred | dmsimard: awesome news! | 15:23 |
mordred | dmsimard: yay for debugging and finding errors | 15:23 |
dmsimard | mnaser: do you still need those two nodes ? | 15:23 |
mnaser | dmsimard: oh can i have access to the baremetal one? | 15:23 |
dmsimard | mnaser: yeah | 15:23 |
dmsimard | sec | 15:23 |
mnaser | we shouldn't run lxc_hosts on the metal jobs | 15:24 |
dmsimard | actually, 25secs :P | 15:24 |
dmsimard | mnaser: root@104.239.173.194 | 15:24 |
mnaser | aw man | 15:25 |
mnaser | i hope we're not running lxc_hosts on non-container jobs | 15:25 |
*** yamamoto has joined #openstack-infra | 15:26 | |
*** yamamoto has quit IRC | 15:29 | |
*** sshnaidm has joined #openstack-infra | 15:29 | |
*** yamamoto has joined #openstack-infra | 15:29 | |
openstackgerrit | Frank Kloeker proposed openstack-infra/project-config master: Re-activate translation job for Trove https://review.openstack.org/623492 | 15:30 |
fungi | if any infra-puppet-core has a moment to review https://review.openstack.org/623290 the storyboard team would appreciate it | 15:31 |
fungi | eumel8: i've mentioned https://review.openstack.org/623508 in #storyboard and given a couple of our primary maintainers a heads up about it | 15:33 |
*** ykarel|away has joined #openstack-infra | 15:33 | |
*** e0ne has quit IRC | 15:34 | |
mnaser | ok we found the root cause, we can lose those vms dmsimard | 15:34 |
mnaser | thank you | 15:34 |
dmsimard | \o/ | 15:34 |
*** eharney has quit IRC | 15:37 | |
*** dpawlik has joined #openstack-infra | 15:41 | |
*** sshnaidm is now known as sshnaidm|off | 15:42 | |
*** dpawlik has quit IRC | 15:46 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: web: add OpenAPI documentation https://review.openstack.org/535541 | 15:46 |
*** eharney has joined #openstack-infra | 15:52 | |
*** e0ne has joined #openstack-infra | 15:55 | |
*** janki has joined #openstack-infra | 16:00 | |
*** pgaxatte has quit IRC | 16:06 | |
clarkb | frickler: fungi: I also identified ~6 e-r bugs that correlated strongly to disabling ovh bhs1 | 16:13 |
clarkb | we should be able to use that to see if things are improving too | 16:13
*** tosky has quit IRC | 16:14 | |
*** e0ne has quit IRC | 16:23 | |
openstackgerrit | Merged openstack-infra/zuul master: Add default value for relative_priority https://review.openstack.org/622175 | 16:25 |
*** gyee has joined #openstack-infra | 16:25 | |
*** adriancz has quit IRC | 16:26 | |
fungi | clarkb: did hits for those reduce by half when we halved the max-servers too? | 16:28 |
clarkb | fungi: its hard to tell because they don't seem to run at a constant rate either? and looking at multiple graphs we'd need to merge them to answer that I think | 16:29 |
clarkb | fwiw they seem to have continued since yesterday's halving | 16:29 |
fungi | but disappeared when bhs1 was offline? | 16:30 |
clarkb | yes | 16:32 |
clarkb | fungi: http://status.openstack.org/elastic-recheck/index.html#1805176 is a really good example of that | 16:32 |
openstackgerrit | Merged openstack-infra/zuul master: web: refactor job page to use a reducer https://review.openstack.org/623156 | 16:32 |
clarkb | there is this giant hole there, and at first I was really concerned we had broken log indexing but I'm fairly positive its the bhs1 disabling instead | 16:32 |
openstackgerrit | Merged openstack-infra/zuul master: web: refactor tenants page to use a reducer https://review.openstack.org/623157 | 16:35 |
clarkb | fungi: but you'll also notice that it hasn't stopped since yesterday's halving | 16:37
fungi | that's "neat" | 16:39 |
*** bhavikdbavishi has joined #openstack-infra | 16:39 | |
clarkb | ya I started going down the hole of "where did we break indexing" | 16:40
clarkb | and after 15 minutes light bulb went off when I couldn't find anything obviously broken | 16:40 |
clarkb | unfortunately halving our use to curb being our own noisy neighbor was my last good idea :) I don't really know what we'd look at next particularly from our end. | 16:41 |
clarkb | maybe go back to the idea its unhappy hypervisors and maybe there are more of them than we expected? | 16:41 |
clarkb | we do know that there are jobs that run perfectly fine on bhs1 | 16:42 |
clarkb | dstat doesn't show the cpu sys and wai signature of the jobs that fail. These happy jobs run in a reasonable amount of time successfully | 16:42
pabelanger | https://review.openstack.org/623496/ is atleast ready to start exposing host_id into the inventory files, for when we next restart zuul | 16:42 |
clarkb | this implies to me it's not a consistent issue like "cpus are slow" or "disks are slow", deal with it. I think it also rules out overhead due to meltdown/spectre/l1tf | 16:43
pabelanger | as another data point for logstash | 16:43 |
clarkb | pabelanger: ++ | 16:43 |
clarkb | added to my review queue | 16:43 |
clarkb | Maybe a next step is to disable the region, then boot a couple VMs on every hypervisor (may need amorin help with this depending on how the scheduler places things) and run benchmarks on each VM and see if that shows a pattern | 16:45 |
clarkb | (I would just run devstack as a first pass benchmark) | 16:45 |
*** cgoncalves has quit IRC | 16:45 | |
clarkb | we've seen the db migrations and service startup for eg nova api trigger things in devstack | 16:45 |
fungi | i wonder if we could take bhs1 out of the line of fire for real jobs, but then load it with identical representative workloads we could use to benchmark instances and map them back to their respective hosts to identify any correlation | 16:45 |
clarkb | ya | 16:45 |
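A rough sketch of the kind of per-hypervisor survey being proposed; image, flavor, and key names are placeholders, and actually steering instances onto specific hypervisors would still need provider-side help as noted above:

```sh
# Boot a batch of identical probe instances and record which hypervisor each
# one landed on, then run the same benchmark (devstack, or the dd test used
# earlier in this log) on every probe and compare.
for i in $(seq 1 20); do
    openstack server create --image ubuntu-xenial --flavor probe-flavor \
        --key-name infra-root --wait "bhs1-probe-$i" >/dev/null
    echo "bhs1-probe-$i $(openstack server show bhs1-probe-$i -f value -c hostId)"
done
```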
clarkb | also maybe the nova team has ideas around how to make nova VMs happier | 16:47
*** bobh has quit IRC | 16:50 | |
*** bobh has joined #openstack-infra | 16:51 | |
*** cgoncalves has joined #openstack-infra | 16:53 | |
*** kjackal has quit IRC | 16:53 | |
*** boden has quit IRC | 16:54 | |
*** kjackal has joined #openstack-infra | 16:55 | |
clarkb | considering it is friday and we are entering a quieter period for many, maybe the next action is, as frickler suggests, to disable bhs1. Then we can regroup while things are slow and, rather than treat this like a fire, work through it carefully? | 16:56
clarkb | unfortunately my timezone doesn't overlap well with CET. Thank you to all who do have some overlap for helping with that. Curious what you think about ^ if we were to try and do more measured debugging and take our time (which likely requires being awake when amorin et al are) | 16:57
fungi | i concur, we should likely disable it again. frickler's analysis was pretty eye-opening | 16:59 |
clarkb | infra-root https://review.openstack.org/#/c/623457/2 is frickler's change to do ^. I've +2'd it and if anyone else would like to chime in now is a good time :) | 16:59 |
mordred | clarkb: lgtm - I think the plan above to disable it and then run synthetic tests sounds good | 17:00 |
fungi | i do think the frequent timeouts there are likely costing us more capacity than the nodes we lose by dropping it for now | 17:00 |
clarkb | ya the cascading effect of gate resets is painful | 17:02 |
*** ginopc has quit IRC | 17:02 | |
fungi | i think you can approve it. we have a lot of +2s on there now | 17:05 |
clarkb | done | 17:05 |
clarkb | then to be extra confident in those e-r holes being related to bhs1 we can watch for a new hole :) | 17:08 |
*** jpich has quit IRC | 17:10 | |
*** mriedem is now known as mriedem_lunch | 17:19 | |
clarkb | pabelanger: +2'd https://review.openstack.org/#/c/623496/1 but didn't approve as I'm always wary around the addition of reliance on new zk data. I'm not quite sure if the release note needs to add any additional info about updates or if that will just work if your nodepool is older | 17:19
clarkb | Shrews: ^ that might be a question for you? | 17:19 |
*** gfidente has quit IRC | 17:20 | |
pabelanger | clarkb: in older nodepool, I'd expect nodepool.host_id to be None | 17:20
Shrews | looking | 17:21 |
*** dims has quit IRC | 17:21 | |
Shrews | looks like that should just work | 17:22 |
openstackgerrit | Merged openstack-infra/project-config master: Disable ovh bhs1 https://review.openstack.org/623457 | 17:22 |
*** dtantsur is now known as dtantsur|afk | 17:22 | |
Shrews | older nodepool won't have host_id, so it should just remain None, as pabelanger says | 17:23 |
*** rlandy is now known as rlandy|brb | 17:23 | |
clarkb | pabelanger: ya I guess since you are just setting yaml/ansible var values and not acting on the attribute that should be fine | 17:23 |
*** wolverineav has joined #openstack-infra | 17:23 | |
clarkb | if the jobs don't handle that properly the job will have issues, but zuul itself will be fine | 17:23 |
Shrews | the gotchas for zk schema changes tend to be adding new states to pre-existing fields | 17:23 |
Shrews | pabelanger: just a reminder, but launchers will need a restart if you intend to use that new field | 17:25 |
Shrews | last restart didn't get that update | 17:25 |
clarkb | 'The recent tripleo gate reset was not bhs1 related, at least jobs didn't run there. Instead delorean is unhappy with DEBUG: cp: cannot stat 'README.md': No such file or directory' | 17:25
clarkb | er, that first ' should be just before DEBUG | 17:25
pabelanger | Shrews: ++ | 17:26 |
clarkb | Trying to call those out to dispel the idea that bhs1 is our only issue | 17:26
EvilienM | we're trying to figure how tripleo-ci-centos-7-undercloud-containers is not in our gate anymore | 17:26 |
EvilienM | it's in our layout | 17:27 |
corvus | EvilienM: give me a change id that should have run it and i can tell you | 17:27 |
*** wolverineav has quit IRC | 17:27 | |
corvus | (you can also use the debug:true feature, but this will save 3 hours :) | 17:27 |
EvilienM | corvus: I6754da1142e2ec865ef8c60a7e09df00300f791e | 17:27 |
EvilienM | yeah | 17:27 |
Shrews | "debug: corvus" > "debug: true" | 17:28 |
corvus | lol | 17:28 |
EvilienM | yes that^ | 17:28 |
mordred | all debug messages should just be "corvus: | 17:28 |
corvus | (i wonder if we can attach the debug info to buildsets in the sql db so folks could look this up post-facto through the web) | 17:29 |
EvilienM | corvus: it started 3 days ago | 17:29 |
EvilienM | (lgostash says) | 17:29 |
fungi | that sounds like a great idea. it doesn't really mean that much additional data in the db, and it's not like that db gets huge anyway | 17:29 |
fungi | plus, people often come asking for an explanation of why something didn't run | 17:30 |
*** janki has quit IRC | 17:33 | |
corvus | EvilienM: i haven't found that zuul even considered running that job on that change... where is it attached to the tripleo-ci project gate pipeline? | 17:33 |
corvus | EvilienM: http://git.openstack.org/cgit/openstack-infra/tripleo-ci/tree/zuul.d/undercloud-jobs.yaml#n6 has it attached to check but not gate | 17:36 |
corvus | (note that because of the clean-check gate requirement, a voting job in check but not in gate can wedge a co-gating project system) | 17:37 |
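A quick way to confirm that sort of thing from a clone of the repo, without waiting for a debug run (a sketch; the file location comes from the cgit link above):

```sh
git clone https://git.openstack.org/openstack-infra/tripleo-ci
cd tripleo-ci

# Show where the job is attached; the surrounding context makes it clear it
# only appears under the check pipeline, not gate.
git grep -n -B10 'tripleo-ci-centos-7-undercloud-containers' zuul.d/
```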
*** rlandy|brb is now known as rlandy | 17:41 | |
*** bhavikdbavishi has quit IRC | 17:48 | |
*** sean-k-mooney has joined #openstack-infra | 17:49 | |
*** ahosam has quit IRC | 17:50 | |
*** jpena is now known as jpena|off | 17:50 | |
*** boden has joined #openstack-infra | 17:50 | |
EvilienM | corvus: weird, let me see | 17:51 |
EvilienM | I think it's a leftover | 17:51 |
EvilienM | when we had to deal with fires | 17:51 |
EvilienM | I2753393cd7cdd64720d39360bebe9ddea2f20efc | 17:51 |
EvilienM | mwhahaha, corvus : https://review.openstack.org/#/c/623555/ | 17:52 |
EvilienM | damn, sorry for noise | 17:52 |
EvilienM | mwhahaha: are we missing other jobs in your finding? | 17:52 |
mwhahaha | no just that one | 17:52 |
mwhahaha | he removed the wrong one | 17:52 |
*** shardy has quit IRC | 17:57 | |
*** Swami has joined #openstack-infra | 18:13 | |
*** wolverineav has joined #openstack-infra | 18:13 | |
*** jamesmcarthur has quit IRC | 18:14 | |
*** ykarel|away has quit IRC | 18:17 | |
clarkb | integrated gate just merged ~9 changes. tripleo gate also merged a stack not too long ago. I think things are moving | 18:18 |
openstackgerrit | Merged openstack-infra/zuul master: Add nodepool.host_id variable to inventory file https://review.openstack.org/623496 | 18:21 |
*** udesale has joined #openstack-infra | 18:26 | |
*** udesale has quit IRC | 18:27 | |
sean-k-mooney | hi o/ | 18:28 |
sean-k-mooney | i ask this every 6 months or so but looking at https://docs.openstack.org/infra/manual/testing.html is the policy on nested virt still the same. | 18:28 |
sean-k-mooney | e.g. we can't guarantee it's available and it's usually disabled if detected | 18:29
*** jamesmcarthur has joined #openstack-infra | 18:31 | |
clarkb | sean-k-mooney: yes, in fact just yesterday we found it crashing and rebooting VMs | 18:34
clarkb | sean-k-mooney: seems to be just as unreliable as ever unfortunately | 18:34 |
sean-k-mooney | clarkb: its now on by default in the linux kernel going forward | 18:34 |
clarkb | sean-k-mooney: ya but none of our clouds or guests are running 4.19 yet (except our tumbleweed images and maybe f29) | 18:35 |
sean-k-mooney | clarkb: my experience is it's pretty reliable if you set the vms to host-passthrough but it's flaky if you set a custom cpu model or host-model | 18:35
*** jamesmcarthur has quit IRC | 18:35 | |
clarkb | and quite literally yesterday we found unexpected reboots in tests from centos7 crashing with nested guests and virt enabled. I wish it were reliable but it isn't | 18:36 |
clarkb | logan-: ^ not sure if you set host passthrough or not but that may be something to check if you don't need live migration in that region or have homogeneous cpus | 18:36
sean-k-mooney | ok the reason im asking is the intel nfv ci has broken again and im trying to figure out if i can set up a ci to replace it upstream or else where | 18:36 |
logan- | yes, we use host passthrough | 18:37 |
*** bobh has quit IRC | 18:38 | |
sean-k-mooney | huh strange, i guess i was just lucky; when we ran the intel nfv ci with it in our dev lab we never had any issues | 18:38
fungi | sean-k-mooney: in particular, over the past couple years we've seen that it's very sensitive to the combination of guest and host kernel used | 18:38 |
fungi | if you control both then you can get it to work | 18:38 |
fungi | _probably_ | 18:38 |
sean-k-mooney | fungi: we were using ubuntu 14.04/16.04 hosts with centos7 and ubuntu 14.04/16.04 guests | 18:39
sean-k-mooney | depending on the year | 18:39
fungi | so for "private cloud" scenarios it's likely viable. in our situation relying mostly on public cloud providers it's more like russian roulette | 18:39 |
sean-k-mooney | ya i get that | 18:39 |
*** ccamacho has quit IRC | 18:40 | |
sean-k-mooney | ok well that continues to rule out ovs-dpdk testing in the gate so | 18:40
sean-k-mooney | im going to set up a tiny cluster at home to do some third party ci for now. | 18:41 |
fungi | we get to periods where there is some working combination of guest and host kernels in our images and some of our providers, and then our images get a minor kernel update and suddenly it all goes sideways until the provider gets a newer kernel onto their hosts | 18:41 |
logan- | the base xenial 4.4 seemed to break a lot more often, so we run the 4.15 xenial hwe on the nodepool hvs starting a few months ago, and even with that we still see things like this week where cent7 and xenial guests running on these hvs were hard rebooting when they attempted to launch a nested virt guest | 18:41 |
sean-k-mooney | logan-: did you have host-passthrough on both the hosting cloud and the vms launched by tempest | 18:42
sean-k-mooney | i should have mentioned that you need it for both | 18:43
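For reference, a sketch of what enabling host-passthrough "on both levels" typically involves on an Intel libvirt host and again inside the test VM's nova; crudini and the modprobe option are common approaches, and the exact paths are assumptions for any particular deployment:

```sh
# On the hypervisor: pass the host CPU (including vmx) through to guests...
sudo crudini --set /etc/nova/nova.conf libvirt cpu_mode host-passthrough

# ...and make sure nested KVM is enabled for the kvm_intel module.
cat /sys/module/kvm_intel/parameters/nested            # expect Y (or 1)
echo 'options kvm_intel nested=1' | sudo tee /etc/modprobe.d/kvm-nested.conf

# Inside the test VM, the nova deployed by the job needs the same cpu_mode so
# the nested guest also sees vmx.
sudo crudini --set /etc/nova/nova.conf libvirt cpu_mode host-passthrough
```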
logan- | i don't control all of the guest jobs, but for my test jobs that were rebooting on xenial, yes passthrough was enabled on the guest also | 18:43 |
sean-k-mooney | oh well as i said i guess we were lucky with our configs | 18:43 |
sean-k-mooney | logan-: do you work for one of the cloud providers? | 18:44 |
logan- | i run the limestone cloud | 18:44 |
clarkb | mgagne_: had found that the cpu itself seemed to make a difference too iirc. But unsure if that was a bug in cpus or linux kernel issues or something else | 18:44 |
sean-k-mooney | ah ok not familiar with that one. | 18:44
clarkb | if 4.19 is indeed stable enough to have it turned on by default and evidence shows that to be the case I'll happily use it when we get there | 18:45 |
clarkb | but that is likely still a ways off | 18:45 |
clarkb | 4.19 has fixes for the fs corruption now too | 18:46 |
clarkb | which I should actually reboot for | 18:46 |
*** kjackal has quit IRC | 18:47 | |
*** e0ne has joined #openstack-infra | 18:47 | |
sean-k-mooney | clarkb: ya if it changes in the future ill maybe see about setting up some upstream nfv testing but for now ill look at third party solutions | 18:48 |
sean-k-mooney | that or review the effort to allow ovs-dpdk testing to work without nested virt but nova dont want to merge code that would only be used for testing | 18:49 |
*** e0ne has quit IRC | 18:49 | |
sean-k-mooney | *revive | 18:49 |
*** bobh has joined #openstack-infra | 18:49 | |
*** mriedem_lunch is now known as mriedem | 18:49 | |
fungi | i don't suppose that behavior could be made pluggable | 18:49 |
clarkb | another avenue which we've considered in the past but never got very far with is baremetal test resources | 18:50 |
mnaser | fyi i think our nested virt is working well | 18:50 |
*** dims has joined #openstack-infra | 18:50 | |
mnaser | i know its pretty stable for kata afaik | 18:50 |
clarkb | mnaser: have they recently tested with centos 7.6? | 18:50 |
mnaser | i dont know if kata uses 7.6 | 18:50 |
clarkb | mnaser: that seemed to be the recent trigger for us I think | 18:50 |
clarkb | mnaser: it was the new kernel in the host VM | 18:51 |
mnaser | we also run the latest 7.6 in sjc1 in the host | 18:51 |
sean-k-mooney | fungi: well it basically comes down to the fact that nova enables cpu pinning when you request hugepages and qemu does not support cpu pinning without kvm | 18:51 |
mnaser | and that resolved a lot of nested virt issues | 18:51 |
clarkb | mnaser: ah in that case it may work out | 18:51 |
mnaser | worth a shot. kata has a lot of resources in terms of nested virt folks so maybe working with them might be beneficial | 18:51 |
sean-k-mooney | fungi: hugepages and numa work with qemu but since i cant disable the pinning i cant do the testing upstream. | 18:52 |
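As a rough illustration of the flavor setup being discussed (flavor name and sizes are placeholders; the pinning side effect is per sean-k-mooney's explanation above, not something configured here):

    # create a small tempest flavor and request hugepages-backed guest memory
    # via the standard hw:mem_page_size extra spec
    openstack flavor create nfv.tiny --ram 1024 --vcpus 1 --disk 5
    openstack flavor set nfv.tiny --property hw:mem_page_size=large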
clarkb | mnaser: logan- found the new ubuntu hwe kernel seems to have stabilized it too. The bigger issue on my side is that it stopped working in tripleo. The tests were rebooting halfway through and no one noticed until days later when I dug into failures | 18:52 |
sean-k-mooney | maybe that will change when rhel8/centos8 come out next year | 18:53 |
sean-k-mooney | that said we will still have centos7 jobs for a few releases | 18:53 |
*** bobh has quit IRC | 18:54 | |
*** bobh has joined #openstack-infra | 18:57 | |
fungi | we will, but they'll probably be run a lot less frequently as most of the development activity goes on in master anyway and we'd be using 7 on (fewer and older over time) stable branches | 18:58 |
*** bobh has quit IRC | 19:01 | |
*** eernst has quit IRC | 19:06 | |
*** electrofelix has quit IRC | 19:11 | |
*** bobh has joined #openstack-infra | 19:16 | |
*** wolverineav has quit IRC | 19:19 | |
*** bobh has quit IRC | 19:20 | |
openstackgerrit | MarcH proposed openstack-infra/git-review master: CONTRIBUTING.rst, HACKING.rst: fix broken link, minor flow updates https://review.openstack.org/623362 | 19:28 |
*** armstrong has quit IRC | 19:30 | |
openstackgerrit | MarcH proposed openstack-infra/git-review master: doc: new testing-behind-proxy.rst; tox.ini: passenv = http[s]_proxy https://review.openstack.org/623361 | 19:31 |
*** bobh has joined #openstack-infra | 19:34 | |
clarkb | I'm caught up on email and the gate is looking relatively happy. I think I'm going to take this as an opportunity to reboot for kernel fixes (so my fs doesn't corrupt) and apply some patches to my local router. I expect I'll not be gone long but will let you know via the phone network if I managed to break something :) | 19:36 |
*** bobh has quit IRC | 19:39 | |
tobiash | clarkb: good luck :) | 19:39 |
clarkb | that was actually far less painful than I expected | 19:46 |
*** bobh has joined #openstack-infra | 19:53 | |
*** bobh has quit IRC | 19:57 | |
*** jcoufal has quit IRC | 20:10 | |
*** jamesmcarthur has joined #openstack-infra | 20:10 | |
*** jamesmcarthur has quit IRC | 20:11 | |
*** bobh has joined #openstack-infra | 20:11 | |
*** eernst has joined #openstack-infra | 20:13 | |
*** kjackal has joined #openstack-infra | 20:15 | |
*** bobh has quit IRC | 20:16 | |
fungi | huh, a bunch of failing changes near the front of the integrated gate queue now | 20:19 |
*** sean-k-mooney has quit IRC | 20:20 | |
fungi | all seem to be failures and timeouts for unit tests? weird | 20:20 |
fungi | the neutron change with the three unit test timeouts all ran in limestone-regionone | 20:21 |
fungi | cinder change failed a volume creation/deletion unit test, ran in ovh-gra1 | 20:24 |
*** mriedem has quit IRC | 20:25 | |
fungi | the nova change had two unit test jobs fail on assorted database migration/sync tests, and both those jobs ran in limestone-regionone | 20:25 |
*** kjackal has quit IRC | 20:25 | |
fungi | logan-: any chance the kernel updates have caused us to chew up more resources there? | 20:25 |
*** sean-k-mooney has joined #openstack-infra | 20:28 | |
*** mriedem has joined #openstack-infra | 20:28 | |
*** bobh has joined #openstack-infra | 20:29 | |
*** kgiusti has left #openstack-infra | 20:30 | |
*** bobh has quit IRC | 20:34 | |
*** eharney has quit IRC | 20:39 | |
*** ahosam has joined #openstack-infra | 20:41 | |
fungi | according to logstash, job timeouts for limestone-regionone really started picking up around 18:00z | 20:46 |
*** ralonsoh has quit IRC | 20:47 | |
*** bobh has joined #openstack-infra | 20:47 | |
fungi | that's something like 3.5 hours after logan- mentioned all the hypervisors had been upgraded | 20:48 |
fungi | so maybe not connected? | 20:48 |
*** bobh has quit IRC | 20:52 | |
mriedem | so, on that new zuul queueing behavior, is there any way that could negatively affect some changes from landing? | 20:54 |
mriedem | clarkb's email mentioned it could mean things in nova taking longer, | 20:54 |
mriedem | but i'm just wondering if we're re-enqueuing some approved nova changes that take about 16 hours just to fail on some slow node, then take extra long now to go back through | 20:54 |
mriedem | b/c it's taking us days to land code | 20:55 |
mriedem | i.e. it sounds like nova changes are deprioritized because there could be a lot of nova changes queued up at any given time right? | 20:56 |
pabelanger | each patch now enters the check pipeline at the same priority, across all projects. So the first patch set of each project gets nodes now, even if nova submitted 10 different patches lets say | 20:58 |
mnaser | mriedem: is this happening in big stacks? | 20:58 |
mriedem | define big | 20:58 |
mriedem | https://review.openstack.org/#/c/602804/15 | 20:58 |
pabelanger | this is amplified, if a large stack of patches are submitted together, as patches behind that won't get nodes until the patch series has nodes | 20:58 |
mnaser | yeah what pabelanger said.. i think this is something we should somehow address | 20:59 |
dansmith | define large? | 20:59 |
dansmith | and is this just stacks of patches? | 20:59 |
mnaser | well, large is relative to how many patches are in the nova queue right now | 20:59 |
mnaser | i.e. if i push a 30 stack change, then i've just effectively slowed down all of nova's development till all 30 are tested | 21:00 |
mnaser | which isn't ideal tbh | 21:00 |
dansmith | meaning it slows down all of nova, while trying to avoid slowing down, say, neutron? | 21:00 |
mnaser | pretty much dansmith | 21:00 |
mnaser | because then the 'queue' for nova is long and neutron is short so.. neutron gets a node while nova waits | 21:01 |
pabelanger | mnaser: well, not really. it is the same behavior to nova, but other projects get priority over nodes now | 21:01 |
dansmith | doesn't that kinda not work if one project has three people working on it and another has 30? | 21:01 |
dansmith | pabelanger: I'm not sure I see the difference | 21:01 |
mnaser | dansmith: yeah. unfortunately the busier projects somehow get *less* results and the more idle ones get more results quicker | 21:01 |
pabelanger | for node requests, now large and small project both have equal weight to the pool of resources | 21:02 |
mnaser | because $some_small_project always has 2-3 changes in queue, vs nova that might have 40 | 21:02 |
mnaser | so if i'm working on a project alone, i can almost always get a node right away.. but if i'm in nova, it might not be for a while | 21:02 |
dansmith | mnaser: okay, yeah, that seems frustrating.. because we've been waiting longer than a day to get a single read on a fix.. like, to even get logs to evaluate something | 21:02 |
pabelanger | the 4th paragraph in http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000482.html should give a little more detail | 21:02 |
mriedem | i just replied, | 21:03 |
mnaser | dansmith: i agree. i'm kinda on both sides of this, now our openstack-ansible roles get super quick feedback.. but i see working with other big projects that now it takes forever to get check responses | 21:03 |
mriedem | wondering if it would be possible to weigh nova changes in the queue differently based on their previous number of times through | 21:03 |
fungi | node assignments are round-robined between repositories, so if three nova changes are queued at the same time as two cinder changes and one neutron change then nodes will be doled out to the first nova, neutron and cinder changes, then the next two nova and neutron changes, and then the third nova change in that order | 21:03 |
pabelanger | dansmith: right now I would say the gate resets / reduced nodes from providers is also impacting the time here too | 21:03 |
dansmith | mnaser: honestly the last few days I'll make one change in the morning and just do other stuff for the whole day, then check on the patch the next morning | 21:03 |
dansmith | mnaser: which pretty much kills one's will to live and motivation to iterate on something | 21:04 |
mnaser | dansmith: i agree with you 100%. | 21:04 |
dansmith | pabelanger: I'm sure yeah | 21:04 |
mnaser | i do think at times a smaller project can iterate a lot faster than say.. nova | 21:04 |
fungi | a lot of this queuing model was driven by the desire to make tripleo changes wait instead of starving the ci system of resources to the point where most other projects were waiting forever | 21:04 |
mnaser | because they'll always get a highest priority if they only have one change in a queue | 21:04 |
mriedem | so, just curious, are the new non-openstack foundation projects contributing resources to node pool for CI resources? | 21:04 |
mriedem | is that just a bucket of money the foundation doles out for CI? | 21:05 |
fungi | the foundation doesn't purchase ci resources | 21:05 |
mnaser | mriedem: none do, but afaik, their utilization is very small | 21:05 |
dansmith | fungi: yeah, and I'm sure tripleo causes a lot of trouble for the pool | 21:05 |
mriedem | fungi: don't get me wrong, i'd like to see tripleo go full 3rd party CI at this point :) | 21:05 |
mriedem | has the idea of quotas ever come up? | 21:05 |
fungi | and yeah, the other osf projects use miniscule amounts of ci resources, as mnaser notes | 21:05 |
*** bobh has joined #openstack-infra | 21:06 | |
mriedem | e.g. each project gets some kind of quota so one project can't just add 30 jobs to run per change | 21:06 |
mriedem | or never clean up their old redundant jobs | 21:06 |
fungi | mriedem: yes, i think the idea is that we'd potentially implement quotas in zuul/nodepool per tenant, but we're still missing some of the mechanisms we'd need to do that | 21:07 |
mriedem | tenant == code repo/project requesting CI resources? | 21:08 |
*** jtomasek has quit IRC | 21:08 | |
mriedem | maybe project under governance | 21:08 |
*** rlandy has quit IRC | 21:08 | |
mriedem | i just know it's very easy today, especially with zuulv3 where jobs are defined per project, for a project to be like, "oh we should test knob x, let's copy job A, call it job B and set knob x to True even though 95% of the tests run will be the same" | 21:09 |
sean-k-mooney | fungi: out of interest what is missing for quotas. i played with having multiple zuul tenants using the same cloud resources provided by a shared node pool. | 21:09 |
sean-k-mooney | i did not have it deployed for long but had thought everything was supported to do that | 21:10 |
corvus | well from my pov what's missing is the will to set quotas. does anyone have a suggestion as to how we should set them? | 21:10 |
*** bobh has quit IRC | 21:10 | |
clarkb | seems I've missed the fun conversation running errands and finding lunch | 21:11 |
clarkb | my last email included a paste that breaks down usage | 21:11 |
*** jamesmcarthur has joined #openstack-infra | 21:11 | |
clarkb | tripleo is ~42% with nova and neutron around 10-15% iirc. All openstack official projects together are 98.2% iirc | 21:11 |
dansmith | I definitely get the intent here, and I think it's good, | 21:12 |
clarkb | I really want to get away from the idea that its new projects using all the resources, the data does not back that up | 21:12 |
dansmith | clarkb: we get it. it's not new projects, it's tripleo :) | 21:12 |
sean-k-mooney | corvus: well i dont know what the actual capacity is but you could start with one tenant per official team in the governance repo and oversubscribe the actual cloud resources by giving each tenant a quota of 100 instances and then adjust it down | 21:13 |
dansmith | 42% when the fat pig that is nova is a pretty huge scaling factor | 21:13 |
fungi | sean-k-mooney: sharing job configuration across tenants is inconvenient, i think we were looking at tenant per osf top-level project | 21:14 |
*** jamesmcarthur has quit IRC | 21:14 | |
*** jamesmcarthur has joined #openstack-infra | 21:14 | |
mriedem | but nova waiting a day for results when zaqar is prioritized higher is kind of....weird | 21:14 |
clarkb | right the biggest impact we can have is improving tripleo's fail rate and overall high demand | 21:15 |
corvus | dansmith: it's a little hard for me to separate whether you're seeing the effect of the new queueing behavior or just the general backlog. because while you may wait a day to get results on the change you push up this morning, if we turned off the new behavior the same would still be true. | 21:15 |
mriedem | i would think quota would be doled out by some kind of deployed / activity metrci | 21:15 |
mriedem | *metric | 21:15 |
corvus | yesterday, mid-day, nova was waiting 5 hours for results | 21:15 |
dansmith | er, I meant 42% when something as big as nova is only 10% is a big scaling factor | 21:15 |
mriedem | taking a different angle here, | 21:15 |
dansmith | corvus: 5 hours for one patch or five hours for the last patch in a small series? | 21:16 |
mriedem | are we aware of any voting jobs that have a fail rate of 50% or more? | 21:16 |
dansmith | corvus: and yeah, I know things are in shambles right now | 21:16 |
sean-k-mooney | why does tripleo use so many resources anyway. is its minimum deployment size just quite big or does it have a lot of jobs? or both i guess? | 21:16 |
mriedem | shitload of jobs | 21:16 |
*** tpsilva has quit IRC | 21:16 | |
mriedem | from what i could see | 21:16 |
corvus | sean-k-mooney: and they run very long | 21:16 |
mriedem | baremetal | 21:16 |
mriedem | it's like the worst kind of ci requirements | 21:16 |
sean-k-mooney | em, could we have a separate tenant for tripleo? | 21:17 |
corvus | we don't have any bare-metal resources, so they all run virtualized | 21:17 |
clarkb | they are big long running jobs that fail a lot | 21:17 |
mriedem | so back to my other question, do we know of voting jobs that are failing at too high a rate and should make them non-voting to avoid resets? | 21:18 |
clarkb | the fail a lot is important with how the gate works | 21:18 |
clarkb | mriedem: non voting jobs dont cause resets and shouldnt run in the gate | 21:18 |
mriedem | we used to make the ceph job non-voting if it failed consistently at something like 25% | 21:18 |
dansmith | maybe we have one queue for tripleo and one for everything else? | 21:18 |
mriedem | clarkb: that's my point, | 21:18 |
mriedem | do we have *voting* jobs with high fail rates that should be made non-voting | 21:18 |
mriedem | until they are sorted out | 21:18 |
dansmith | if they're currently 50% of the load, that would seem reasonable, | 21:18 |
clarkb | mriedem: I am not sure, you'd have to check graphite | 21:19 |
clarkb | but also that doesnt fix the issue of broken software | 21:19 |
sean-k-mooney | dansmith: ya that was basically the logic i had with suggesting making them be a separate tenant from the rest | 21:19 |
corvus | dansmith: they do have their own queue -- or do you mean quota? or... i may not understand... | 21:19 |
mriedem | clarkb: i realize that doesn't fix broken software, but holding the rest of openstack hostage while one project figures out why it's jobs are always failing is wrong | 21:20 |
dansmith | corvus: like, the per-project queue thing you have now, but put tripleo in one queue and everyone else in the same queue, | 21:20 |
dansmith | so that nova and neutron fight only against tripleo :) | 21:20 |
corvus | dansmith: gotcha -- so only test one tripleo change at a time across the whole system? | 21:20 |
clarkb | mriedem: except Im fairly certain some of the issue here is breakage rolling downhill | 21:20 |
mriedem | as i said, if the ceph job had a high failure rate and someone wasn't fixing it, we'd make it non-voting | 21:20 |
clarkb | tripleo takes nova and if it doesnt work, tests break | 21:20 |
clarkb | then clouds take tripleo and host the tests and break | 21:21 |
clarkb | its not as simple as x y and z are bad stop them | 21:21 |
dansmith | corvus: is that what the current queuing is doing? or you mean just because their jobs are so big, that they'd realistically only get one going at a time? | 21:21 |
corvus | dansmith: that's actually not that hard to do -- we can adjust the new priority system to operate based on the gate pipeline's shared changes queues, even in check. | 21:21 |
corvus | dansmith: it's not what the queueing is currently doing, because each tripleo project gets its own priority queue in check (in gate, they do share one queue). | 21:22 |
dansmith | corvus: oh cripes! | 21:23 |
dansmith | corvus: yeah, seems like they get priority inflation because they have a billion projects, amirite? | 21:23 |
fungi | again though, that's in check. in gate the tripleo queue and the integrated queue are basically on equal footing | 21:24 |
corvus | dansmith: right | 21:24 |
dansmith | presumably gate is less of an issue right? | 21:24 |
clarkb | well gate is what eats all the nodes | 21:24 |
corvus | well, gate has priority over check | 21:24 |
*** bobh has joined #openstack-infra | 21:24 | |
clarkb | especially when thing are flaky because a reset stops all jobs, then restarts them taking nodes from check | 21:24 |
dansmith | really? gate is always smaller it seems like | 21:24 |
clarkb | which is why I keep telling people please fix the tests | 21:24 |
dansmith | yeah I know gate resets kill everything | 21:24 |
corvus | so a gate reset (integrated or tripleo) starves check | 21:24 |
clarkb | you will merge features faster if we all just spend a little time fixing bugs | 21:25 |
sean-k-mooney | clarkb: well to get to gate they would have to go through check so check should be the limiting factor no? | 21:25 |
dansmith | sean-k-mooney: it's flaky tests | 21:25 |
dansmith | not failboat tests | 21:25 |
clarkb | right its using nested virt, crashing the VM, and failing the test that does it | 21:25 |
clarkb | and other not 100% failure issues | 21:25 |
fungi | yeah, tests with nondeterministic behaviors (or tests exercising nondeterministic features of the software) | 21:26 |
pabelanger | maybe in check, group the priority based on the project deliverables? I think that is what dansmith might be getting at | 21:26 |
dansmith | so, gate resets suck, we know that, and a very small section of people work on fixing those which sucks | 21:26 |
dansmith | but yeah, | 21:26 |
sean-k-mooney | so like 4 years ago we used to have both recheck and reverify to allow running check and gate independently | 21:26 |
dansmith | ffs put tripleo in one bucket at least, | 21:26 |
sean-k-mooney | do we have reverify anymore | 21:26 |
dansmith | and maybe put everyone else in another bucket together | 21:26 |
clarkb | dansmith: if the TC tells us to we will. But so far OpenStack the project seems to tolerate their behavior | 21:26 |
fungi | pabelanger: challenge there is probably in integrating governance data. zuul knows about a flat set of repositories, and knows about specific queues | 21:27 |
pabelanger | fungi: agree | 21:27 |
clarkb | fwiw I don't think tolerating is necessarily a bad thing | 21:27 |
pabelanger | sean-k-mooney: no, just recheck | 21:27 |
clarkb | tripleo testing has to deal with all the bugs in the rest of openstack | 21:27 |
dansmith | clarkb: really? Is the TC in charge of how to allocate infra resources? I didn't realize | 21:27 |
corvus | [since i'm entering the convesation in the middle, let me add a late preface -- the new queing behavior is an attempt to make things suck equally for all projects. it's just an idea and i'm very happy to talk about and try other ideas, or turn it off if it's terrible. it's behind a feature flag] | 21:27 |
clarkb | dansmith: the TC is in charge of saying nova is more important that tripleo | 21:27 |
dansmith | clarkb: but osa is the same and they use a much smaller fraction right? | 21:27 |
sean-k-mooney | pabelanger: ok could we bring back reverify so that if check passes and gate fails they could just rerun gate jobs and maybe reduce load that way | 21:28 |
dansmith | clarkb: we know they're never going to say that.. this doesn't seem like a politics thing, but rather a making the best use of the resources we all share | 21:28 |
clarkb | dansmith: yes, I pointed this out to the TC when I started sharing this data | 21:28 |
clarkb | dansmith: all of the other deployment projects use about 5% or less resources each | 21:28 |
*** bobh has quit IRC | 21:28 | |
dansmith | clarkb: if you make it about "okay who is prettier" you'll never get an answer from them and we'll all just keep sucking | 21:28 |
dansmith | clarkb: right, and they all deal with "all the bugs of openstack" in the same way right? | 21:29 |
corvus | sean-k-mooney: that's called the 'clean-check' requirement and it was instituted because too many people were approving changes with failing check jobs, and therefore causing gate resets or, if they merged, then merging flaky tests. | 21:29 |
clarkb | dansmith: sort of. I think what makes tripleo "weird" is that they have stronger dependencies on openstack working for tripleo to actually deploy anything | 21:29 |
clarkb | dansmith: ansible runs mistral which runs heat which runs ansible which runs openstack type of deal | 21:29 |
clarkb | whereas puppet and ansible and so on are just the last bit | 21:29 |
corvus | sean-k-mooney: sdague thought clean-check was very important and effective for eliminating that. | 21:29 |
dansmith | clarkb: yeah I get that their design has made them fragile :) | 21:30 |
sean-k-mooney | corvus: im suggesting that reverify would only work if check was clean and gate jobs failed; it would rerun the gate jobs only, keeping the clean check results | 21:30 |
corvus | sean-k-mooney: oh, when the gate runs we forget the check result | 21:30 |
*** eharney has joined #openstack-infra | 21:31 | |
clarkb | the good news is tripleo is reducing their footprint | 21:31 |
corvus | sean-k-mooney: so basically the issue is, how do you enqueue a change in gate with a verified=-2. i guess we could permit that.... | 21:31 |
fungi | sean-k-mooney: also, before clean-check one of the problems was reviewers approving changes which had all-green results from check jobs run 6 months ago | 21:31 |
clarkb | I think we are seeing positive change there, unfortunately it is still slow movement and I end up debugging a bunch of gate failures | 21:31 |
clarkb | Ideally tripleo would be doing that more actively | 21:31 |
clarkb | (as would openstack with its flaky queue) | 21:31 |
corvus | (verified=-2 would mean "this failed in gate, but it previously passed in check, so allow it to go back into gate) | 21:31 |
fungi | sean-k-mooney: which clean-check doesn't solve obviously, but at least forcing the change which caused a gate reset to re-obtain check results keeps it from continuing to get reapproved | 21:32 |
sean-k-mooney | fungi: ya i know it would reduce the quality of the gate in some respects | 21:32 |
sean-k-mooney | it would be harder to do but even if reverify would conditionally skip the check if it was run in the last day that might help | 21:33 |
fungi | sean-k-mooney: less about reducing the quality of the gate, it would actually slow development (these changes went in specifically because our ability to gate changes ground to a halt with nondeterministic jobs and people approving broken changes) | 21:33 |
mriedem | yeah i don't want to go back to that | 21:33 |
mriedem | it was changed for good reasons | 21:33 |
sean-k-mooney | fair enough | 21:34 |
mriedem | sdague isn't in his grave yet, but i can hear him rolling | 21:34 |
sean-k-mooney | i know there were issue with it | 21:34 |
clarkb | the reason I push on gate failures is the way the gate should work is that 20 changes go in, they grab X nodes to test those changes and do that for 2.5 hours. Then they all merge and the gate is empty leaving all the other resources for everyone else | 21:34 |
clarkb | what happens instead is we use N nodes for 20 minutes, reset then use N nodes for 20 minutes and reset and on and on | 21:35 |
corvus | s/2.5/1 :) | 21:35 |
fungi | yep, while forcing a change which fails in the gate pipeline to reobtain check results is painful, not doing it is significantly more painful in the long run | 21:35 |
clarkb | and never free up resources for check | 21:35 |
clarkb | unfortunately the ideal behavior sort of assumes we operate under the assumption that broken flaky code is bad | 21:35 |
pabelanger | clarkb: given how relative priority works now, if the gate window was smaller again, that would mean more nodes in check right? Meaning more feedback, however it does mean potentially longer for things to gate and merge | 21:36 |
sean-k-mooney | well we do in most projects | 21:36 |
pabelanger | right now, there is a lot of nodes servicing gate | 21:36 |
fungi | and assumes that we catch a vast majority of the failures in check, and that our jobs pass reliably | 21:36 |
clarkb | sean-k-mooney: there are certainly pockets that do, but overall from my perspective as being the person debugging things and beating this drum very few do | 21:36 |
corvus | pabelanger: relative_priority doesn't really change the gate/check balance. and clarkb shrunk the window a few weeks ago and observed no noticeable change in overall throughput. | 21:37 |
corvus | (so the window is back at 20) | 21:37 |
clarkb | I do think this feedback is worthwhile though. One thing that comes to mind is maybe the priority should be based on time rather than count (though nova would still suffer under that) | 21:37 |
corvus | shrinking the window should alter the check/gate balance though. so we might see some check results faster if we shrunk it again. | 21:37 |
clarkb | perhaps there should be a relief valve in the priority? | 21:37 |
fungi | yeah, i think when things get really turbulent, resetting a 10-change queue over and over is about as bad as a 20 | 21:38 |
clarkb | perhaps we do need to allocate quotas to projects given some importance value | 21:38 |
clarkb | my concern with this is I really don't want to be the person that says tripleo gets less resources than nova | 21:38 |
fungi | i have a feeling there are a lot of times where zuul never gets nodes allocated to more than 10 changes in the queue before another reset tanks them all anyway | 21:38 |
sean-k-mooney | clarkb: ya to be honest, ignoring the hassle of being involved in running a third party ci, one thing i did like about it is the random edge cases it exposed that i was then able to go report or fix upstream | 21:38 |
clarkb | I think we have an elected body over all the projects for that | 21:38 |
dansmith | clarkb: to some degree, it's currently being said that tripleo gets as many resources as all the rest of openstack | 21:39 |
pabelanger | clarkb: wasn't there also a suggestion for priority based on number of nodes too? | 21:39 |
clarkb | dansmith: yup, and we've asked tripleo nicely to change that and they have started doing so | 21:39 |
dansmith | and by them having a crapton of projects, they're N times more important than any one other single-repo project, if I understand correctly | 21:39 |
clarkb | pabelanger: ya node time I think I mentioned once | 21:39 |
clarkb | dansmith: ya using gate queues to determine allocations (I think corvus mentioned that above) may be an important improvement here | 21:40 |
clarkb | or the governance data? or some sort of aggregation | 21:40 |
dansmith | clarkb: that means treating tripleo as one thing instead of N things? definitely seems like an improvement | 21:40 |
*** bobh has joined #openstack-infra | 21:41 | |
clarkb | dansmith: ya and osa and others that are organized similarly | 21:41 |
* dansmith nods | 21:41 | |
sean-k-mooney | clarkb: how hard would it be to have a check queue per governance team | 21:41 |
pabelanger | I still like the suggestion from dansmith that project deliverables are counted together, not as individuals, for priority. The current setup does seem to give an advantage to a project with code spread over more repos versus a single-repo one | 21:41 |
clarkb | sean-k-mooney: we already do a check queue per change. I think this is less about the queue and more prioritizing how nodes are assigned | 21:41 |
sean-k-mooney | check queue per change? did you mean project? | 21:42 |
clarkb | sean-k-mooney: no per change is how its logically implemented in zuul | 21:42 |
clarkb | to represent dependencies | 21:42 |
fungi | each change in an independent pipeline (e.g., "check") gets its own queue | 21:43 |
fungi | changes in a dependent pipeline (e.g., "gate") get changes queued together based on jobs they share in common, or via explicit queue declaration | 21:43 |
sean-k-mooney | fungi: ya i realised when i said queue before i meant to say pipeline | 21:43 |
corvus | ideas so far: https://etherpad.openstack.org/p/QxXuCSdAoF | 21:43 |
fungi | or in v3 did we drop the "jobs in common" criteria for automatic queue determination? is it all explicit now? | 21:44 |
corvus | fungi: all explicit | 21:44 |
fungi | okay, that then ;) | 21:44 |
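For reference, a minimal sketch of what an explicit shared-queue declaration looked like in a repo's zuul config at the time (project layout and job name are illustrative):

    # .zuul.yaml sketch: gate changes for this repo share the "integrated"
    # queue with every other repo declaring the same queue name
    - project:
        gate:
          queue: integrated
          jobs:
            - openstack-tox-pep8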
fungi | unfortunately my next idea is to go get dinner, and beer | 21:45 |
*** bobh has quit IRC | 21:45 | |
fungi | or perhaps that's fortunately | 21:45 |
corvus | it sounds like grouping projects in check by the same criteria we use in gate for setting relative_priority (ie, group all tripleo projects together in check) seems popular, fair, and not-difficult-to-implement. | 21:45 |
*** jamesmcarthur has quit IRC | 21:45 | |
corvus | should i work on that? | 21:45 |
clarkb | corvus: and if not difficult to implement we can at least try it out and if it doesn't work well probably not a huge deal? | 21:46 |
clarkb | corvus: sounds like you should :) | 21:46 |
fungi | yes, my concern with trying to merge it with governance data is the complexity. zuul already has knowledge of queues | 21:46 |
corvus | yeah, i consider this whole thing an experiment and we should change it however we want :) | 21:46 |
pabelanger | +1 to experiment | 21:46 |
clarkb | dansmith: ^ does that seem like a reasonable place to start? That should treat aggregate tripleo as a unit rather than individual repos | 21:47 |
pabelanger | clarkb: my question would be, does that ignore the 'integrated' queue in gate, or include it? | 21:47 |
dansmith | I would think that'd be an improvement worth doing at least yeah | 21:47 |
fungi | should there be a multi-tiered prioritization decision within the queuing set too? or just treat all changes for that group equally even if there are 10 nova changes and 1 mistral change (assuming mistral is in integrated too) | 21:48 |
corvus | (oh, i added one more idea -- pabelanger suggested that we could treat the second and later patch in a patch series with a lower priority) | 21:48 |
corvus | fungi: that's a good point, this would lump nova/cinder/etc together | 21:49 |
clarkb | pabelanger: ya I guess thats the other question is if it will result in much change if nova + neutron + cinder + glance + swift are all together | 21:49 |
sean-k-mooney | clarkb: you know there is one other thing we could try | 21:49 |
sean-k-mooney | could we split the short zull jobs from the long ones | 21:49 |
clarkb | sean-k-mooney: thats sort of what we've done with the current priority | 21:50 |
clarkb | we are running a lot more short jobs because the less active projects tend to have less involved testing | 21:50 |
sean-k-mooney | e.g. run the unit,function,pep8,docs and release notes in one set and comment back and the tempest ones in another bucket | 21:50 |
clarkb | ah split on that axis | 21:50 |
sean-k-mooney | ya | 21:51 |
clarkb | ya though maybe thats an indication we all need to run `tox` before pushing more often :P | 21:51 |
sean-k-mooney | so that you get develop feed back quickly on at least the non integration tests | 21:51 |
clarkb | (I'm bad at it myself, but it does make a difference) | 21:51 |
pabelanger | clarkb: yah, i think for the impact, we'd need to group specific queues in check (via configuration?) and keep current behavior of per project realitive priority | 21:51 |
sean-k-mooney | clarkb: well i normally do tox -e py27,pep8,docs | 21:52 |
sean-k-mooney | but i rarely run py3 or functional tests locally | 21:52 |
sean-k-mooney | i do if i touch them but nova takes a while | 21:52 |
sean-k-mooney | clarkb: you could even wait to kick off the tempest test until the other test job passed | 21:53 |
corvus | it's worth keeping in mind that nothing we've discussed (including this idea) changes the fact that during the north-american day, we are trying to run about 2x the number of jobs at a time than we can support. | 21:53 |
clarkb | http://logs.openstack.org/37/611137/1/gate/grenade-py3/e454703/job-output.txt.gz#_2018-12-07_21_48_57_091229 just reset the integrated gate | 21:53 |
clarkb | sean-k-mooney: ^ its specifically test failures like that that I'm talking about needing more eyeballs fixing | 21:54 |
corvus | if we run 20,000 jobs in a day now, with any of these changes, we'll still run 20,000 jobs in a day. it's just re-ordering when we run them. the only thing that will change that is to run fewer jobs, run them more quickly, or have them fail less often. | 21:54 |
clarkb | that change modifies cinder unit tests | 21:54 |
clarkb | so shouldn't be anywhere near what grenade runs | 21:55 |
clarkb | and yet it fails >0% but <100% of the time | 21:55 |
*** boden has quit IRC | 21:55 | |
clarkb | corvus: ya I think what dansmith and mriedem were getting at is the turnaround time for a specific patch impacts their ability to fix and re-review today or wait for tomorrow | 21:55 |
clarkb | corvus: in the fifo system we had before that turnaround time was the same for everyone roughly (we'd get so far behind then everyone is waiting all day sort of thing) | 21:56 |
clarkb | in the current system only some subset of people are waiting for tomorrow | 21:56 |
dansmith | yep | 21:56 |
clarkb | which is an improvement for some and not for others | 21:56 |
mriedem | it's also impacting our ability to try and work on things to fix the gate | 21:56 |
mriedem | like the n-api slow start times | 21:56 |
dansmith | we just proposed removing the cellsv1 job from our regular run, btw.. I'm constantly thinking about what we can run less of, fwiw | 21:56 |
clarkb | ya nova is a fairly responsible project | 21:56 |
sean-k-mooney | hum the content type was application/octet-stream which i would have expected to work | 21:56 |
mriedem | i'm trying to merge the nova-multiattach job into tempest so we can kill that job as well | 21:56 |
dansmith | mriedem: right, this actually came up while trying to get results from gate-fixing patches | 21:56 |
*** wolverineav has joined #openstack-infra | 21:57 | |
mriedem | clarkb: can you say that again, but this time into my lapel? | 21:57 |
dansmith | clarkb: FAIRLY RESPONSIBLE | 21:57 |
*** wolverineav has quit IRC | 21:57 | |
*** wolverineav has joined #openstack-infra | 21:57 | |
clarkb | dansmith: mriedem heh I just mean in comparison to others yall seem to jump on your bugs without prompting | 21:57 |
clarkb | and dive in if prompted | 21:57 |
dansmith | clarkb: yeah, arguing the "fairly" part :) | 21:57 |
* mriedem lights a cigarette | 21:57 | |
fungi | our previous high-profile gate-fixers were mostly nova core reviewers too, historically | 21:58 |
dansmith | mriedem jumps without prompting, and I jump when prompted by mriedem | 21:58 |
dansmith | it's a good system. | 21:58 |
corvus | clarkb: yes, though one thing that wasn't pointed out was that a large project has more changes with results waiting at the end of the turnaround time, whereas a smaller project may only have the one change. so if you're in a large project and can work on multiple things, it's less of an issue. of course, if you're focused on one change in a large project, it's worse now. | 21:58 |
clarkb | corvus: ya | 21:58 |
corvus | i'll go look into the combine-stuff-in-check idea now | 21:58 |
clarkb | corvus: ok, thanks | 21:58 |
mriedem | dansmith: working on this shit allows me to procrastinate from working on cross-cell resize | 21:58 |
dansmith | mriedem: and you're generally a noble sumbitch to boot. | 21:59 |
* dansmith has to run | 21:59 | |
clarkb | sean-k-mooney: ya these bugs tend to be difficult to debug (though not always) which is one reason I think we have so few people that dig into them | 21:59 |
sean-k-mooney | so in general do people think there would be merit in a precheck pipeline for running all the non-tempest tests (pep8, py27...) and only kicking off the dsvm tests if the precheck jobs passed | 21:59 |
clarkb | but that digging is quite valuable | 21:59 |
clarkb | sean-k-mooney: about 5 years ago we did do that, and what we found was we had more round trips per patch as a result | 21:59 |
clarkb | sean-k-mooney: that doesn't mean we shouldn't try it again | 22:00 |
clarkb | but is something to keep in mind, the current thought is providing complete results in one go makes it easier to fix the complete set of bugs in a change before it goes through again | 22:00 |
sean-k-mooney | more round trips with a shorter latency for results until the quick jobs passed might be a saving on gate time overall | 22:00 |
clarkb | we should see about measuring that along with any more lag time on throughput | 22:00 |
pabelanger | Right, this was also the idea of doing fast-fail too, if one job fails, they all do. | 22:01 |
pabelanger | but means less results | 22:01 |
sean-k-mooney | clarkb: do we have visibility or an easy way to categorize what percentage of failures on a patch are from tempest jobs versus the rest? | 22:01 |
clarkb | sean-k-mooney: I think that is something to keep in our back pocket as an option if the reorg of priority aggregation continues to be sadness. I do want to avoid changing too many things at once | 22:01 |
fungi | okay, really running off to dinner now. *might* pop back on when i return, but... it is friday night | 22:01 |
pabelanger | sean-k-mooney: you could actually test that today, with using job dependencies in your zuul.yaml file | 22:01 |
clarkb | fungi: enjoy your evening and weekened | 22:02 |
fungi | thanks! | 22:02 |
sean-k-mooney | pabelanger: you could actually test it today. i could test it in a week after i read up on how this all works again :) | 22:02 |
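A rough sketch of the job-dependencies approach pabelanger mentions, assuming illustrative job names in a project's .zuul.yaml:

    # run the quick jobs first; the devstack/tempest job only starts (and
    # only consumes its nodes) once the listed jobs have succeeded
    - project:
        check:
          jobs:
            - openstack-tox-pep8
            - openstack-tox-py36
            - tempest-full:
                dependencies:
                  - openstack-tox-pep8
                  - openstack-tox-py36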
clarkb | sean-k-mooney: pabelanger I added it to https://etherpad.openstack.org/p/QxXuCSdAoF for completeness | 22:02 |
pabelanger | clarkb: +1 | 22:03 |
sean-k-mooney | clarkb: this is something we could maybe test on a per project basis too | 22:03 |
corvus | my changes will definitely use more nodes overall if we fast-fail :( | 22:04 |
sean-k-mooney | e.g. if we had a precheck pipeline we could try it on nova or something by changing just our zuul file | 22:04 |
clarkb | corvus: ya mine too | 22:04 |
*** wolverineav has quit IRC | 22:04 | |
clarkb | but maybe that is ok if the aggregate doesn't | 22:04 |
corvus | (typically my changes use 2x nodes because i always get something wrong the first time. expect x! nodes if we do fast-fail. :) | 22:05 |
clarkb | sean-k-mooney: re that test I called out. It failed due to a 502 from apache talking to cinder. Apache says AH01102: error reading status line from remote server 127.0.0.1:60999 at http://logs.openstack.org/37/611137/1/gate/grenade-py3/e454703/logs/apache/error.txt.gz | 22:06 |
clarkb | cinder api log doesn't immediately show me anything that would indicate why | 22:07 |
sean-k-mooney | clarkb: ya i saw that apache is proxying to mod_wsgi i am assuming | 22:07 |
clarkb | sean-k-mooney: I think only for apache, the rest of the services run uwsgi standalone and we just tcp to them? | 22:07 |
clarkb | apache is just terminating ssl for us | 22:07 |
sean-k-mooney | oh ok | 22:07 |
*** wolverineav has joined #openstack-infra | 22:08 | |
clarkb | stack exchange says set proxy-initial-not-pooled in apache | 22:08 |
clarkb | this will degrade performance but make things more reliable as it avoids a race between pooled connection being closed and new connection to frontend | 22:09 |
clarkb | I thought we had something like this in the apache config already | 22:09 |
clarkb | oh I remember it was the backends and apache not allowing connection reuse by the python clients because python requests has the same race | 22:10 |
sean-k-mooney | ya just reading https://httpd.apache.org/docs/2.4/mod/mod_proxy_http.html | 22:10 |
clarkb | chances are we do want this sort of thing added to devstack | 22:10 |
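A sketch of the sort of apache tweak being discussed, assuming a devstack-style vhost proxying to a local uwsgi backend over tcp (the path and port here are illustrative):

    # mod_proxy_http honours this per-request env var: don't hand the first
    # request on a new client connection an already-pooled backend
    # connection, avoiding the race where the backend just closed it
    SetEnv proxy-initial-not-pooled 1
    ProxyPass "/volume" "http://127.0.0.1:60999/" retry=0
    ProxyPassReverse "/volume" "http://127.0.0.1:60999/"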
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/project-config master: Add gate job for Slack notifier in zuul-jobs https://review.openstack.org/623593 | 22:11 |
*** slaweq has joined #openstack-infra | 22:11 | |
sean-k-mooney | that or we use something other than apache for ssl termination | 22:11 |
clarkb | ya we used apache because keystone already depped on it | 22:12 |
clarkb | avoided adding a dep | 22:12 |
sean-k-mooney | keystone can run under uwsgi now right? | 22:12 |
sean-k-mooney | i know glance still has some issues | 22:12 |
clarkb | maybe? its a good question | 22:12 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul-jobs master: Add a slack-notify role https://review.openstack.org/623594 | 22:13 |
sean-k-mooney | haproxy, nginx and caddy are all lighter weight solutions for ssl termination than apache but that option is probably a good place to start | 22:13 |
clarkb | ya devstack had broken support for some lightweight terminator that ended up being EOL'd and removed from the distros | 22:17 |
clarkb | and it was at that point stuff moved to apache because it was already a hard dep for keystone | 22:17 |
*** eernst has quit IRC | 22:17 | |
clarkb | it can certainly be updated again if it makes sense, though configuring apache is likely easier short term | 22:17 |
* clarkb updates devstack repo | 22:18 | |
sean-k-mooney | ya, i have several other experiments i want to do with devstack but i might add that to the list | 22:18 |
jrosser | is it right i seem to see a mix of centos 7.5 & 7.6 nodes? | 22:19 |
clarkb | jrosser: as of yesterday all but inap should have an up to date 7.6 image | 22:20 |
clarkb | I haven't checked yet today if inap image managed to get pushed | 22:21 |
jrosser | ok - i'll check they're all from there | 22:21 |
*** dmellado has quit IRC | 22:22 | |
*** stevebaker has quit IRC | 22:23 | |
*** gouthamr has quit IRC | 22:23 | |
*** bobh has joined #openstack-infra | 22:24 | |
jrosser | clarkb: looking at a few the 7.5 do indeed look to be inap nodes | 22:26 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Consider shared changes queues for relative_priority https://review.openstack.org/623595 | 22:28 |
clarkb | jrosser: ya I just checked and inap is still not getting a successful upload | 22:28 |
clarkb | we'll need to debug that | 22:28 |
clarkb | jrosser: does that cause problems? maybe the package updates take a long time? | 22:29 |
*** bobh has quit IRC | 22:29 | |
corvus | clarkb, pabelanger, sean-k-mooney, dansmith, mriedem, fungi: ^ i think https://review.openstack.org/623595 is our tweak (combined with a change to project-config to establish the queues in check) | 22:29 |
jrosser | ok thanks - it'll trip up osa jobs where we just fixed 7.6 host + 7.6 docker image | 22:29 |
jrosser | mismatching those doesnt work for us | 22:29 |
clarkb | oh right the venv thing mnaser mentioned | 22:30 |
jrosser | yes thats it | 22:30 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Consider shared changes queues for relative_priority https://review.openstack.org/623595 | 22:30 |
mriedem | corvus: ok, but i'll admit the wording in that commit message is greek to me | 22:31 |
mriedem | maybe you want to poke that out in the ML thread | 22:31 |
clarkb | openstack.exceptions.ConflictException: ConflictException: 409: Client Error for url: https://image.api.mtl01.cloud.iweb.com/v2/images/7cc7c423-dea6-4efb-b36d-bc3b7fbdee5e/file, Image status transition from saving to saving is not allowed: 409 Conflict | 22:33 |
clarkb | jrosser: ^ that is why our uploads are failing I think that is an openstacksdk bug | 22:33 |
clarkb | mordred: ^ fyi | 22:33 |
corvus | mriedem: it's greek for "lump tripleo together in check" :) | 22:33 |
clarkb | jrosser: I'm not in a great spot to debug that myself right now, but let me get a full traceback pasted so that someone can look if they have time | 22:34 |
clarkb | mordred: jrosser http://paste.openstack.org/show/736850/ | 22:35 |
mriedem | corvus: ok, but still probably good to say in that thread "this is what happens today and this is what's proposed after getting brow beaten for an hour in irc" | 22:35 |
jrosser | clarkb: thankyou | 22:35 |
clarkb | I want to finish up the train of thought with devstack apache, then review corvus' change, then maybe I'll get to that sdk thing | 22:35 |
mriedem | maybe leave out that last part | 22:35 |
corvus | mriedem: yeah, i'll reply, but it'll take me a minute because now i have to write about your suggestion ;) | 22:36 |
mriedem | oh weights on changes within a given project? | 22:37 |
mriedem | this all sounds like stuff people want nova to be doing all the time | 22:37 |
mriedem | now you know how i feel | 22:37 |
mriedem | oh what if we just used zuul in the nova-scheduler... | 22:37 |
corvus | well, i'm reading it as weigh higher changes which have failed alot, but yeah | 22:37 |
corvus | mriedem: they both have schedulers, they must be the same thing | 22:38 |
mriedem | correct | 22:38 |
clarkb | does that mean our work here is done? | 22:38 |
corvus | i'm pretty sure the zuul-scheduler is so called because that was the only word that came to mind after spending a year listening to people talk about nova | 22:39 |
clarkb | there are times that I wish devstack was more config managementy, this is one of them :P | 22:41 |
mriedem | clarkb: btw, please take it easy on me on monday after your boy russell prances all over my team this weekend | 22:41 |
clarkb | mriedem: I had them going 5-11 this year. I'm both happy and mad they proved me wrong | 22:41 |
clarkb | mriedem: whats crazy is the packers could be that 5-10-1 team | 22:41 |
mriedem | not crazy, great | 22:42 |
*** bobh has joined #openstack-infra | 22:43 | |
*** mriedem is now known as mriedem_afk | 22:46 | |
*** bobh has quit IRC | 22:47 | |
clarkb | sean-k-mooney: https://review.openstack.org/623597 fyi should set that env var | 22:52 |
*** slaweq has quit IRC | 22:52 | |
clarkb | I spent more time figuring out what the opensuse envvars file is than writing the patch :P | 22:52 |
clarkb | corvus: I think your change must've got caught by a pyyaml release? pep8 is complaining that the "safe" methods don't exist anymore | 22:57 |
corvus | clarkb: i don't see a new pyyaml release | 22:58 |
clarkb | I'm guessing the change to make safe the default and make unsafe explicit opt in hit? | 22:58 |
clarkb | hrm | 22:58 |
clarkb | http://logs.openstack.org/95/623595/2/check/tox-pep8/b4c0a8b/job-output.txt.gz#_2018-12-07_22_36_31_796233 | 22:58 |
corvus | there was a new mypy release though | 22:58 |
clarkb | oh | 22:58 |
clarkb | that must be it then | 22:58 |
corvus | i expect the unit test failures are separate from that, so i'll look into that before pushing up fixes for both | 22:59 |
clarkb | ok | 23:00 |
clarkb | corvus: the other thing I notice is that we'll set specific check queues which are different than those in gate (or could be at least?) | 23:00 |
clarkb | that seems like a good feature | 23:00 |
corvus | clarkb: yep; it'd be really messy to implement it otherwise | 23:00 |
clarkb | I think you only get the CSafeLoader attributes if the libyaml-dev headers are available | 23:01 |
clarkb | I wonder if mypy can be convinced to allow either type | 23:01 |
clarkb | another option is to install that package via bindep | 23:01 |
*** irdr has quit IRC | 23:03 | |
clarkb | " Image status transition from saving to saving is not allowed" | 23:04 |
clarkb | it only just occurred to me its mad that the state is transitioning to the same state | 23:04 |
*** gouthamr has joined #openstack-infra | 23:06 | |
*** wolverineav has quit IRC | 23:08 | |
*** wolverin_ has joined #openstack-infra | 23:08 | |
clarkb | this is the two step upload process, We create an image record, this first step is the one that gets passed all the image property data. Then we PUT the actual image file data to foo_image/file url | 23:09 |
clarkb | its the second one that is failing, I don't think we supply the property data there, it should just be the content type header and the content of the image itself | 23:09 |
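For context, the two-step image API v2 flow being described looks roughly like this (endpoint, token, and image values are placeholders):

    # step 1: POST creates the image record and carries all the properties
    curl -X POST "$GLANCE/v2/images" \
        -H "X-Auth-Token: $TOKEN" -H "Content-Type: application/json" \
        -d '{"name": "example", "disk_format": "qcow2", "container_format": "bare"}'
    # step 2: PUT uploads the bytes; only the octet-stream content type
    # header and the image data itself are sent
    curl -X PUT "$GLANCE/v2/images/$IMAGE_ID/file" \
        -H "X-Auth-Token: $TOKEN" -H "Content-Type: application/octet-stream" \
        --data-binary @example.qcow2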
*** slaweq has joined #openstack-infra | 23:09 | |
*** dmellado has joined #openstack-infra | 23:11 | |
*** slaweq has quit IRC | 23:14 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Consider shared changes queues for relative_priority https://review.openstack.org/623595 | 23:15 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul master: Cap mypy https://review.openstack.org/623598 | 23:15 |
clarkb | ya it looks like if we use the sdk native interface it may remember some object attributes but we consume things on the shade side which is just doing a pretty boring POST then PUT without any information about state at that level (shade manages the state, the client is clueless beyond the session from what I can tell) | 23:19 |
clarkb | possibly a change on the cloud side? | 23:19 |
*** slaweq has joined #openstack-infra | 23:19 | |
clarkb | mgagne_: ^ if you are around do you see a similar traceback? I can't tell if its the lib/sdk that is buggy or if maybe the server is? | 23:20 |
clarkb | mordred: chances are you just know off the top of your head if you have a sec too | 23:23 |
*** slaweq has quit IRC | 23:24 | |
*** stevebaker has joined #openstack-infra | 23:25 | |
sean-k-mooney | clarkb: im kind of surprised that we would set env vars in /etc/sysconfig, i would have assumed it would have either been in /etc/default/<apache name> or some systemd folder | 23:26 |
clarkb | sean-k-mooney: apparently on rhel/centos and suse this is how you do it | 23:28 |
clarkb | debuntu use /etc/apache2/envvars | 23:28 |
sean-k-mooney | ok i probably should know that ... | 23:28 |
clarkb | then the init system applies that data (via apachectl on debuntu?) | 23:28 |
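To illustrate the distro convention being described (EXAMPLE_VAR is purely a placeholder variable):

    # RHEL/CentOS: the httpd systemd unit reads /etc/sysconfig/httpd
    # SUSE: apache2 reads /etc/sysconfig/apache2
    # Debian/Ubuntu: apachectl sources /etc/apache2/envvars instead
    echo 'EXAMPLE_VAR=1' | sudo tee -a /etc/sysconfig/httpd
    sudo systemctl restart httpd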
*** apetrich has quit IRC | 23:47 | |
mnaser | hm | 23:47 |
mnaser | in nested virt situations, i assume folks have seen really slow io or traffic? | 23:48 |
mnaser | trying to debug functional tests for k8s .. http://logs.openstack.org/75/623575/1/check/magnum-functional-k8s/463ae1a/logs/cluster-nodes/master-test-172.24.5.203/cloud-init-output.txt.gz | 23:48 |
mnaser | it looks like its downloading quite slowly | 23:48 |
sean-k-mooney | mnaser: no i havent seen that before | 23:48 |
mnaser | i mean it could also not be using nested virt | 23:49 |
clarkb | mnaser: chances are its not nested virt | 23:49 |
mnaser | darn :( | 23:49 |
mnaser | i'm trying to make magnum functional tests for k8s actually start working again | 23:49 |
mnaser | but i think it might just be totally impossible without some sort of nested virt guarantee :( | 23:49 |
clarkb | http://logs.openstack.org/75/623575/1/check/magnum-functional-k8s/463ae1a/logs/etc/nova/nova-cpu.conf.txt.gz virt type qemu | 23:49 |
clarkb | which we set in devstack by default due to issues like centos 7 crashing under it :( | 23:50 |
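A minimal local.conf sketch for a job that wants to opt in to kvm instead of devstack's qemu default (only sensible where the provider reliably exposes nested virt):

    [[local|localrc]]
    # devstack passes this through to nova's [libvirt]/virt_type
    LIBVIRT_TYPE=kvm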
mnaser | it'd be nice if we can have not-so-third-party-third-party-ci | 23:50 |
sean-k-mooney | mnaser: so nested virt is on but kvm is not used so the nested vms are just slow | 23:50 |
mnaser | as in like "here's some credentials, can we use those nodesets in this project only please" | 23:50 |
mnaser | rather than us deploying a fully fledged zuul to do third party ci | 23:51 |
clarkb | sean-k-mooney: it may or not be on depending on the cloud that it is scheduled to | 23:52 |
* sean-k-mooney may or may not be deploying openstack and zuul at home to set up a third party ci... | 23:52 | |
clarkb | which is the other issue | 23:52 |
clarkb | mnaser: we actually can do that, no one has offered as far as I know. But thats roughly what we are doing with kata | 23:53 |
clarkb | the key thing is we can't gate in that setup (because its a spof) | 23:53 |
clarkb | but can provide informational results | 23:53 |
sean-k-mooney | clarkb: ya that is true, its not that hard to check if you can enable kvm, you just have to modprobe it and see if /dev/kvm is there | 23:53 |
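A quick sketch of that check, as it might be run inside a test node:

    # load the kvm module for the cpu vendor; this only succeeds when the
    # host actually exposes the virtualization extensions to the guest
    sudo modprobe kvm_intel || sudo modprobe kvm_amd
    # if /dev/kvm exists, nested kvm can be used; otherwise fall back to qemu
    test -c /dev/kvm && echo "kvm available" || echo "no kvm, plain qemu"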
mnaser | clarkb: .. so can we do that for magnum :D | 23:54 |
clarkb | mnaser: maybe? adrian otto actually brought it up a while back and then it went nowhere (idea was to use rax onmetal at the time) | 23:54 |
clarkb | sean-k-mooney: ya it even works if you've hidden vmx from the instance (which is why we don't bother doing that) | 23:54 |
mnaser | i mean i can do the work and provide the infra (..magnum is important for us, and its functional jobs are pretty dysfunctional because of this) | 23:54 |
sean-k-mooney | is barbican still doing terrible things in their ci jobs? | 23:55 |
*** rkukura_ has joined #openstack-infra | 23:55 | |
sean-k-mooney | before osic died they had a ci job that tried to enable kvm and powered off the host if it failed so zuul would reschedule them. | 23:56 |
clarkb | sean-k-mooney: a few projects try to use nested virt if its there (octavia and tripleo are/were doing this, its how we ran into those issues with centos) | 23:56 |
sean-k-mooney | at least i think it was barbican that had that job | 23:56 |
*** rkukura has quit IRC | 23:57 | |
*** rkukura_ is now known as rkukura | 23:57 | |
*** jamesmcarthur has joined #openstack-infra | 23:58 | |
sean-k-mooney | mnaser: by the way do you know with vexxhost if i create an account can i set a limit on my usage per month | 23:59 |
mnaser | sean-k-mooney: unfortunately the only way that's possible is by enforcing a quota on your account, we don't have a "cost" quota ;x | 23:59 |
sean-k-mooney | ok i assumed that would be the answer | 23:59 |