clarkb | johnsom: why is it resolving _http._tcp.mirror.regionone.fortnebula.opendev.org. SRV IN ? is that something apt does? | 00:00 |
---|---|---|
pabelanger | clarkb: is there no mirror info for vexxhost there? | 00:00 |
pabelanger | also +2 | 00:00 |
clarkb | pabelanger: only the mirrors that have been replaced with opendev.org ansible'd mirrors are there | 00:00 |
pabelanger | kk | 00:00 |
johnsom | Couldn't tell you. I thought that was odd too. | 00:00 |
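That SRV query is most likely apt's own mirror discovery: newer apt releases look up an `_http._tcp.<host>` SRV record before contacting an http source, controlled by the `Acquire::EnableSrvRecords` option. A minimal sketch for reproducing and, if needed, disabling it on a test node; the apt.conf.d file name below is just an example:

```shell
# Reproduce the SRV lookup apt performs before talking to an http mirror.
dig +short SRV _http._tcp.mirror.regionone.fortnebula.opendev.org @127.0.0.1

# Check whether the option is set explicitly anywhere (default is on).
apt-config dump | grep -i EnableSrvRecords || true

# Debugging aid only: turn the SRV lookups off on a test node
# (the file name is arbitrary).
echo 'Acquire::EnableSrvRecords "false";' | sudo tee /etc/apt/apt.conf.d/99-no-srv
```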
clarkb | also it is trying both cloudflare and google before it gives up | 00:01 |
johnsom | Yeah, upping the ttl seems like we are just going to fail the download anyway if the IPv6 routing is down. | 00:01 |
*** diablo_rojo has joined #openstack-infra | 00:02 | |
diablo_rojo | rm_work, yeah they already went out | 00:02 |
*** trident has joined #openstack-infra | 00:03 | |
clarkb | I can resolve google.com via those two resolvers right now against the two nameservers that failed from the mirror host in that cloud | 00:04 |
clarkb | johnsom: are we sure that the job itself isn't breaking the host's networking or unbound setup? | 00:04 |
clarkb | johnsom: we have had jobs do that in the past | 00:04 |
clarkb | I think either that is happening or we have intermittent routing issues out of the cloud | 00:04 |
johnsom | This happens over many different jobs/patches | 00:04 |
clarkb | (because right now it seems to be fine) | 00:04 |
johnsom | Looking at the unbound log it was fine at the start of the job too. | 00:05 |
clarkb | johnsom: ya then the job runs and does stuff to the host :) | 00:05 |
clarkb | according to that log it had successfully resolved records just 14 seconds prior | 00:05 |
clarkb | with this being udp is it deciding the network is unreachable because the timeout is hit? | 00:07 |
johnsom | Well it blew up devstack on the latest one. The patch doesn't change the devstack config (and frankly we haven't in a long time either). | 00:07 |
*** rh-jelabarre has quit IRC | 00:07 | |
clarkb | rtt to those two servers is 8ms currently if that timeout is in play | 00:07 |
johnsom | Or there is no route, like the RA expired | 00:07 |
clarkb | pabelanger: its direct via bgp | 00:10 |
clarkb | er I was scrolled up and thought that was familiar, sorry | 00:10 |
rm_work | diablo_rojo: what was the title? I don't think I saw mine? | 00:11 |
rm_work | and approx when? | 00:11 |
clarkb | johnsom: fwiw the syslog entries for the time when the unbound log starts logging things show updates to iptables and ebtables | 00:11 |
*** armax has quit IRC | 00:11 | |
johnsom | We dom | 00:12 |
clarkb | johnsom: http://paste.openstack.org/show/770959/ | 00:12 |
johnsom | We don't use either. We only use neutron SGs | 00:12 |
clarkb | is it possible neutron is nuking it then? | 00:12 |
*** stewie925 has quit IRC | 00:13 | |
diablo_rojo | rm_work, I pinged Kendall Waters (she sent them) I'll let you know as soon as she replies. | 00:14 |
rm_work | hmm k | 00:14 |
rm_work | what's her nick? | 00:14 |
clarkb | it is interesting that the console log shows only the apt stuff and not other tasks that could change iptables or ebtables or ovs | 00:14 |
clarkb | maybe neutron background stuff as it's turning on? /me checks neutron logs | 00:14 |
johnsom | Maybe. There is ovs openflow stuff in there too, so probably neutron devstack stuff. | 00:14 |
johnsom | If that was the case, why would the other providers and the py27 run of the same patch pass? | 00:16 |
johnsom | And the two node version on py3, etc... | 00:16 |
clarkb | it could be a race | 00:16 |
clarkb | assuming iptables/ebtables/ovs is at fault somehow, if they happen long enough after your apt stuff then apt will succeed and happily move on (and the job may not do anything else external after that point?) | 00:17 |
clarkb | I don't have enough evidence to say they are at fault but it is highly suspicious that we have network problems at around exactly the same time we muck with the network | 00:17 |
clarkb | johnsom: Sep 04 22:04:29.894751 ubuntu-bionic-fortnebula-regionone-0010774925 neutron-dhcp-agent[13633]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-89551ca5-1abb-484e-af0c-1a4f3c8e1d59', 'sysctl', '-w', 'net.ipv6.conf.default.accept_ra=0'] | 00:22 |
clarkb | do network namespaces mask off sysctls? | 00:23 |
clarkb | if not setting default there instead of the specific interface may be the problem | 00:23 |
diablo_rojo | rm_work, sounds like they would have come from Jimmy or Ashlee and would have had "Open Infrastructure Summit Shanghai" in the subject | 00:25 |
johnsom | They do mask some of them, yes. | 00:25 |
clarkb | internet says the answer is it depends | 00:26 |
clarkb | ya depends on the specific one | 00:26 |
rm_work | diablo_rojo: hmm ok i'll search again | 00:26 |
clarkb | https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt does not say if accept_ra is namespace specific or not | 00:27 |
johnsom | Yeah, they are listed somewhere else. | 00:27 |
clarkb | I feel like there is a good chance this is the problem | 00:27 |
rm_work | diablo_rojo: yeah I don't see mine... who should I contact? | 00:29 |
clarkb | neat, the k8s docs say it's even trickier than that. Some sysctls can be set within a namespace but take effect globally | 00:30 |
clarkb | so there are three types I think, those that are namespace specific, those that can be set in a namespace but take effect globally, and those that can only be set at the root namespace level | 00:31 |
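A minimal sketch of the scoping check being discussed (roughly what the paste below also shows), assuming root on a disposable host:

```shell
# Rough check of whether net.ipv6.conf.default.accept_ra is scoped to a
# network namespace or leaks into the root namespace.
ip netns add ra-test
sysctl net.ipv6.conf.default.accept_ra                        # value in the root ns
ip netns exec ra-test sysctl -w net.ipv6.conf.default.accept_ra=0
ip netns exec ra-test sysctl net.ipv6.conf.default.accept_ra  # should now be 0
sysctl net.ipv6.conf.default.accept_ra                        # unchanged if namespaced
ip netns del ra-test
```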
johnsom | I would be really surprised if the accept_ra isn't netns specific. It impacts the routing tables which are a key part of netns functionality. Still digging however. | 00:31 |
clarkb | johnsom: yes, however it is specifically the default one | 00:31 |
clarkb | johnsom: I would not be surprised if default took effect more globally because it's the default | 00:32 |
rm_work | diablo_rojo: ah, I found Kendall Waters' email. I'll shoot her a line. | 00:32 |
diablo_rojo | rm_work, you found your code? | 00:32 |
johnsom | I don't think so. | 00:32 |
diablo_rojo | rm_work, sounds like it might not have come from her there are a few people that work on them but it should have been around the beginning of the month 8/5 or 8/6? | 00:33 |
rm_work | diablo_rojo: no, I just mean I found her email *address*, and will email her directly instead of randomly trying to use you as an intermediary :D | 00:33 |
*** bobh has joined #openstack-infra | 00:35 | |
clarkb | docker default network setup doesn't let me change that sysctl | 00:35 |
clarkb | probably have to try and reproduce using netns constructed like neutron | 00:36 |
*** dychen has joined #openstack-infra | 00:38 | |
johnsom | clarkb I just tried it. It is locked into the netns | 00:38 |
*** zhurong has quit IRC | 00:38 | |
*** gyee has quit IRC | 00:39 | |
johnsom | Two different windows, but you can get the idea: http://paste.openstack.org/show/770960/ | 00:40 |
johnsom | Oh, and a double paste on the first one. sigh | 00:40 |
clarkb | johnsom: is line 79 after line 123 on the wall clock? | 00:42 |
*** bobh has quit IRC | 00:42 | |
johnsom | After 125 actually | 00:42 |
johnsom | I wanted to make sure it actually stuck and didn't just silently ignore it | 00:43 |
*** bobh has joined #openstack-infra | 00:45 | |
clarkb | cool I can replicate that too | 00:46 |
*** Goneri has quit IRC | 00:47 | |
openstackgerrit | Clark Boylan proposed opendev/base-jobs master: We need to set the build shard var in post playbook too https://review.opendev.org/680239 | 00:48 |
clarkb | infra-root ^ yay for testing. That was missed in my first change to base-test | 00:48 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Add build shard log path docs to upload-logs(-swift) https://review.opendev.org/680240 | 00:50 |
clarkb | johnsom: I'm betting someone like slaweq would be able to wade through these logs quicker to rule neutron stuff in or out | 00:52 |
johnsom | clarkb FYI: https://ae2e20a98890c5458e58-1659b40e5c03e7f989419d6178d67ae8.ssl.cf5.rackcdn.com/679332/3/check/ironic-standalone-ipa-src/50285fa/controller/logs/devstacklog.txt.gz | 00:52 |
johnsom | 2019-09-04 23:49:55.793 | 00:52 |
johnsom | Other projects seem to be having the same issue | 00:52 |
clarkb | johnsom: there are similar syslog'd network changes around the time that one fails dns too | 00:54 |
clarkb | I think that is our next step. See if slaweq can help us understand what neutron is doing there to either rule it in or out | 00:54 |
clarkb | and we can also have donnyd check for external routing problems | 00:54 |
johnsom | Seems to only happen with the fortnebula instances | 00:55 |
clarkb | fortnebula is our only ipv6 cloud right now (or did we turn on limestone again?) | 00:56 |
clarkb | however if it is a timing thing then it could still be related to the cloud | 00:56 |
johnsom | Yeah, 100% of the hits in the last 24 hours were all fortnebula | 00:57 |
clarkb | any idea why the worlddump doesn't seem to be working at the end of these devstack runs? it is supposed to capture things like network state for us | 00:57 |
clarkb | maybe we haven't listed that file as one to copy in the job? | 00:58 |
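One way to narrow this down is to run worlddump by hand on a held node and confirm both that it still works and where it writes; a sketch, with the devstack checkout and output directory paths as assumptions:

```shell
# Run worlddump manually and see where its output lands; adjust the devstack
# checkout and output directory to match the node (both are assumptions here).
sudo python3 /opt/stack/devstack/tools/worlddump.py -d /opt/stack/logs
ls -l /opt/stack/logs/worlddump-*.txt
```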
johnsom | Though I guess I was only looking at the last 500 out of the 986 hits in the last 24 hours | 00:58 |
johnsom | Hmm, I hadn't noticed that it disappeared | 00:59 |
*** bobh has quit IRC | 01:00 | |
*** markvoelker has joined #openstack-infra | 01:00 | |
johnsom | None of the projects I have looked at have that worlddump file anymore. | 01:01 |
donnyd | clarkb: how can i help | 01:01 |
clarkb | so I think those are the 3 next steps then. Fix worlddump, see if slaweq understands what neutron is doing and if that might play a part, and work with donnyd to check for ipv6 routing issues | 01:02 |
clarkb | donnyd: we have some jobs in fortnebula that fail to resolve against 2606:4700:4700::1111 and 2001:4860:4860::8888 at times | 01:02 |
clarkb | donnyd: https://79636ab1f0eaf6899dc0-686811860e75a35eabcecb5697905253.ssl.cf2.rackcdn.com/665029/30/check/octavia-v2-dsvm-scenario/51891fa/controller/logs/unbound_log.txt.gz shows this if you need examples | 01:03 |
clarkb | the timestamp there translates to 2019-09-04T22:06:01Z | 01:03 |
clarkb | johnsom: worlddump is exiting with return code 2? is that what the last line there means? | 01:04 |
donnyd | why do i see no timestamps | 01:04 |
donnyd | i haven't looked at the logs since the switch to swift | 01:05 |
clarkb | donnyd: in that file [1567634761] is the timestamp which is an epoch time | 01:05 |
*** markvoelker has quit IRC | 01:05 | |
clarkb | it is due to how unbound logs | 01:05 |
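For reference, the bracketed value unbound prefixes each line with is plain epoch seconds, so for example:

```shell
# The [1567634761] prefix in the unbound log is epoch seconds:
date -u -d @1567634761
# Wed Sep  4 22:06:01 UTC 2019
```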
donnyd | [1567634761] unbound[792:0] debug: answer from the cache failed | 01:05 |
clarkb | then you get [1567634761] unbound[792:0] notice: sendto failed: Network is unreachable and [1567634761] unbound[792:0] notice: remote address is ip6 2606:4700:4700::1111 port 53 (len 28) | 01:06 |
donnyd | [1567634761] unbound[792:0] notice: sendto failed: Network is unreachable | 01:07 |
clarkb | I've got to run to dinner now, but I expect fixing worlddump will help significantly | 01:07 |
donnyd | ok | 01:07 |
johnsom | clarkb I don't see any output from the worlddump.py line in the job I'm looking at now. So I don't know... | 01:07 |
donnyd | PING 2606:4700:4700::1111(2606:4700:4700::1111) 56 data bytes | 01:07 |
donnyd | 64 bytes from 2606:4700:4700::1111: icmp_seq=2 ttl=60 time=6.76 ms | 01:07 |
donnyd | 64 bytes from 2606:4700:4700::1111: icmp_seq=3 ttl=60 time=8.09 ms | 01:07 |
donnyd | 64 bytes from 2606:4700:4700::1111: icmp_seq=4 ttl=60 time=11.5 ms | 01:07 |
donnyd | 64 bytes from 2606:4700:4700::1111: icmp_seq=5 ttl=60 time=10.1 ms | 01:07 |
donnyd | 64 bytes from 2606:4700:4700::1111: icmp_seq=6 ttl=60 time=7.96 ms | 01:07 |
donnyd | 64 bytes from 2606:4700:4700::1111: icmp_seq=7 ttl=60 time=6.65 ms | 01:07 |
donnyd | 64 bytes from 2606:4700:4700::1111: icmp_seq=8 ttl=60 time=6.73 ms | 01:07 |
donnyd | 64 bytes from 2606:4700:4700::1111: icmp_seq=9 ttl=60 time=7.23 ms | 01:07 |
donnyd | i can turn on packet capture at the edge and find out what the dealio is | 01:08 |
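A minimal capture sketch for that, limited to IPv6 DNS toward the two resolvers in question (run at the edge or on the test node itself):

```shell
# Capture only IPv6 DNS to/from the two public resolvers while a job runs,
# so it is easy to see whether queries leave the node and what comes back.
sudo tcpdump -ni any -w dns6.pcap \
  'ip6 and port 53 and (host 2606:4700:4700::1111 or host 2001:4860:4860::8888)'
```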
*** mattw4 has joined #openstack-infra | 01:08 | |
clarkb | donnyd: ya I was able to resolve to the two resolvers from the mirror just fine | 01:09 |
clarkb | and they both respond in ~8ms rtt so not a timeout issue unless that rtt skyrockets | 01:09 |
*** slaweq has joined #openstack-infra | 01:11 | |
clarkb | sean-k-mooney: fyi the sync-devstack-data role is what is supposed to copy the CA stuff around but it runs after devstack runs on the subnodes | 01:12 |
clarkb | sean-k-mooney: I noticed this when debugging worlddump really quickly | 01:12 |
clarkb | I can write or test a change tonight but you might just try moving that role before the subnode devstack runs | 01:12 |
clarkb | *I can't | 01:12 |
clarkb | ianw frickler ^ fyi you may be interested in those two things (worlddump not working and sync-devstack-data not running before the subnodes) | 01:13 |
clarkb | and now really dinner | 01:13 |
*** slaweq has quit IRC | 01:15 | |
donnyd | johnsom: so resolution works the entire way through the job, but then fails at the end | 01:20 |
johnsom | It seems to fail part way through | 01:21 |
donnyd | where | 01:21 |
*** zhurong has joined #openstack-infra | 01:21 | |
*** mattw4 has quit IRC | 01:21 | |
*** calbers has quit IRC | 01:22 | |
donnyd | the resolve failure is at the end of the job, but I am curious if it is from switching between v6 and v4... | 01:22 |
clarkb | our job setup will switch the ipv6 clouds to ipv6 at the beginning of the job | 01:23 |
clarkb | before that the default is ipv4 | 01:23 |
johnsom | donnyd 2019-09-04 23:49:55.794760 in this ironic job: https://ae2e20a98890c5458e58-1659b40e5c03e7f989419d6178d67ae8.ssl.cf5.rackcdn.com/679332/3/check/ironic-standalone-ipa-src/50285fa/job-output.txt | 01:23 |
donnyd | 2019-09-04 23:49:55.161 | ++ /opt/stack/ironic/devstack/lib/ironic:create_bridge_and_vms:1868 : sudo ip route replace 10.1.0.0/20 via 172.24.5.219 | 01:24 |
johnsom | 2019-09-04 22:06:01.847407 in our Octavia job https://79636ab1f0eaf6899dc0-686811860e75a35eabcecb5697905253.ssl.cf2.rackcdn.com/665029/30/check/octavia-v2-dsvm-scenario/51891fa/job-output.txt | 01:24 |
*** yamamoto has joined #openstack-infra | 01:24 | |
johnsom | For the octavia job, it's still setting up devstack, it hasn't started our tests yet | 01:25 |
donnyd | Well that job is doing the same as the other | 01:27 |
donnyd | failing at the end | 01:27 |
donnyd | 2019-09-04 21:57:20.422065 | controller | Get:1 http://mirror.regionone.fortnebula.opendev.org/ubuntu bionic-updates/universe amd64 qemu-user amd64 1:2.11+dfsg-1ubuntu7.17 [7,354 kB] | 01:27 |
donnyd | works at the beginning of the job, but not at the end | 01:27 |
donnyd | i wonder if any of the other projects are having this issue | 01:27 |
johnsom | Well, the logs end because the devstack start failed.... | 01:27 |
donnyd | yes, the logs end | 01:28 |
johnsom | Those two were the ones I saw in a quick search of the last 24s. The job needs to attempt to pull packages in, i.e. a devstack bindep, for this error to show up. | 01:29 |
johnsom | 24hrs, sorry | 01:30 |
clarkb | ya and in those two neutron is making network changes around when it starts to fail | 01:30 |
clarkb | hence my suspicion that may be related | 01:30 |
johnsom | They appear time clustered, but the sample size is a bit small given we don't always have jobs running that need to call out | 01:31 |
johnsom | https://usercontent.irccloud-cdn.com/file/CtLQgDRl/image.png | 01:32 |
donnyd | I am also happy to get you some debugging instances if that will help | 01:32 |
donnyd | How long has it been doing this? | 01:35 |
johnsom | a month or two from what I remember. The gates were not logging the unbound messages before, so we were not sure why the DNS lookups were failing | 01:36 |
*** calbers has joined #openstack-infra | 01:37 | |
donnyd | 2019-09-04 22:05:04.371936 | controller | + lib/neutron_plugins/services/l3:_neutron_configure_router_v6:423 : sudo ip -6 addr replace 2001:db8::2/64 dev br-ex | 01:37 |
donnyd | 2019-09-04 22:05:04.382939 | controller | + lib/neutron_plugins/services/l3:_neutron_configure_router_v6:424 : local replace_range=fd4d:5343:322e::/56 | 01:37 |
donnyd | 2019-09-04 22:05:04.385798 | controller | + lib/neutron_plugins/services/l3:_neutron_configure_router_v6:425 : [[ -z 163f0224-ac00-44b8-931b-0ec01a216a4c ]] | 01:37 |
donnyd | 2019-09-04 22:05:04.388395 | controller | + lib/neutron_plugins/services/l3:_neutron_configure_router_v6:428 : sudo ip -6 route replace fd4d:5343:322e::/56 via 2001:db8::1 dev br-ex | 01:37 |
donnyd | what are these configuring? | 01:37 |
johnsom | That I don't know. That is neutron setup steps in devstack. | 01:39 |
donnyd | do you have a local.conf for this job? | 01:39 |
*** bobh has joined #openstack-infra | 01:39 | |
donnyd | Or a way to run it on a test box? | 01:39 |
donnyd | I think that is going to be the fastest way to the bottom | 01:40 |
johnsom | Here is one from an octavia job: https://ae2e20a98890c5458e58-1659b40e5c03e7f989419d6178d67ae8.ssl.cf5.rackcdn.com/679332/3/check/ironic-standalone-ipa-src/50285fa/controller/logs/local_conf.txt.gz | 01:41 |
johnsom | I don't think this is easy to reproduce, but I also haven't looked to see if we have had successes at fortnebula. Let me look at the other jobs | 01:42 |
*** bobh has quit IRC | 01:44 | |
donnyd | http://logstash.openstack.org/#/dashboard/file/logstash.json?query=node_provider:%5C%22fortnebula-regionone%5C%22%20AND%20message:%5C%22Temporary%20failure%20resolving%5C%22&from=3h | 01:48 |
donnyd | So it looks to me like ironic-standalone-ipa-src and octavia-v2-dsvm-scenario have issues | 01:50 |
donnyd | but i don't see any other jobs with that issue | 01:50 |
donnyd | i should say octavia-* | 01:50 |
johnsom | Right, most jobs don't pull in packages later in the stack process | 01:50 |
johnsom | Ok, yes, there have been successful runs at fortnebula: https://821d285ecb2320351bef-f1e24edd0ae51a8de312c1bf83189630.ssl.cf1.rackcdn.com/672477/7/check/octavia-v2-dsvm-scenario-amphora-v2/5f2878d/job-output.txt | 01:52 |
donnyd | do these jobs only test against ubuntu? | 01:52 |
johnsom | No, there are centos as well | 01:52 |
johnsom | At least for octavia, I can't talk to ironic | 01:52 |
*** bobh has joined #openstack-infra | 01:53 | |
donnyd | it looks to me like the failures are 100% ubuntu | 01:53 |
*** bobh has quit IRC | 01:54 | |
johnsom | Yeah, so it looks like the IPv6 DNS lookups are intermittent; I see a number that were successful over the last two days. | 01:54 |
johnsom | Our centos job has only landed on fortnebula three times in the last week, so, small sample size. | 01:58 |
donnyd | yea that is a small sample size | 01:59 |
donnyd | Hrm... well I surely need to keep an eye on it | 01:59 |
*** markvoelker has joined #openstack-infra | 02:01 | |
johnsom | Yeah, maybe fire up a script that does a IPv6 lookup of mirror.regionone.fortnebula.opendev.org on DNS server 2001:4860:4860::8888 every 5-10 minutes. See if you get some failures. | 02:01 |
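A rough sketch of that probe, assuming bash and dig are available on the monitoring host:

```shell
# Probe the mirror's AAAA record via Google's IPv6 resolver every 5 minutes
# and log only the failures.
while true; do
  if ! dig +time=3 +tries=1 AAAA mirror.regionone.fortnebula.opendev.org \
       @2001:4860:4860::8888 >/dev/null; then
    echo "$(date -u +%FT%TZ) lookup failed"
  fi
  sleep 300
done
```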
johnsom | I also need to step away now. Thanks for your time looking into it. | 02:02 |
donnyd | NP | 02:02 |
donnyd | lmk how else i can be helpful | 02:02 |
*** markvoelker has quit IRC | 02:05 | |
*** bobh has joined #openstack-infra | 02:09 | |
*** apetrich has quit IRC | 02:10 | |
*** jklare has quit IRC | 02:12 | |
*** jklare has joined #openstack-infra | 02:20 | |
*** bhavikdbavishi has joined #openstack-infra | 02:40 | |
*** bhavikdbavishi1 has joined #openstack-infra | 02:43 | |
*** bobh has quit IRC | 02:43 | |
*** bhavikdbavishi has quit IRC | 02:44 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 02:44 | |
*** bobh has joined #openstack-infra | 02:46 | |
*** nicolasbock has quit IRC | 02:49 | |
*** jamesmcarthur has quit IRC | 02:49 | |
*** jamesmcarthur has joined #openstack-infra | 02:50 | |
*** bobh has quit IRC | 02:50 | |
*** jamesmcarthur has quit IRC | 02:55 | |
*** xinranwang has joined #openstack-infra | 02:56 | |
*** slaweq has joined #openstack-infra | 03:11 | |
*** slaweq has quit IRC | 03:15 | |
*** jamesmcarthur has joined #openstack-infra | 03:20 | |
*** markvoelker has joined #openstack-infra | 03:31 | |
*** jklare has quit IRC | 03:35 | |
*** jklare has joined #openstack-infra | 03:35 | |
*** markvoelker has quit IRC | 03:41 | |
*** calbers has quit IRC | 03:44 | |
*** calbers has joined #openstack-infra | 03:47 | |
*** ricolin has joined #openstack-infra | 03:49 | |
*** exsdev has joined #openstack-infra | 04:17 | |
*** ramishra has joined #openstack-infra | 04:20 | |
*** jtomasek has joined #openstack-infra | 04:29 | |
*** markvoelker has joined #openstack-infra | 04:30 | |
*** markvoelker has quit IRC | 04:35 | |
*** jtomasek has quit IRC | 04:40 | |
*** jamesmcarthur has quit IRC | 04:47 | |
*** larainema has joined #openstack-infra | 04:48 | |
*** raukadah is now known as chandankumar | 04:49 | |
*** kjackal has joined #openstack-infra | 04:51 | |
*** jtomasek has joined #openstack-infra | 04:51 | |
*** udesale has joined #openstack-infra | 05:02 | |
*** yamamoto has quit IRC | 05:09 | |
*** slaweq has joined #openstack-infra | 05:11 | |
*** adriant has quit IRC | 05:11 | |
*** adriant has joined #openstack-infra | 05:12 | |
*** slaweq has quit IRC | 05:15 | |
*** jamesmcarthur has joined #openstack-infra | 05:17 | |
*** jamesmcarthur has quit IRC | 05:22 | |
*** dychen has quit IRC | 05:28 | |
*** markvoelker has joined #openstack-infra | 05:30 | |
*** ccamacho has quit IRC | 05:32 | |
*** dchen has quit IRC | 05:33 | |
*** markvoelker has quit IRC | 05:35 | |
*** psachin has joined #openstack-infra | 05:37 | |
openstackgerrit | Andreas Jaeger proposed zuul/zuul-jobs master: Add build shard log path docs to upload-logs(-swift) https://review.opendev.org/680240 | 05:40 |
*** georgk has quit IRC | 05:40 | |
*** fdegir has quit IRC | 05:40 | |
*** georgk has joined #openstack-infra | 05:40 | |
*** fdegir has joined #openstack-infra | 05:40 | |
AJaeger | config-core, please put https://review.opendev.org/679850 and https://review.opendev.org/679856 on your review queue | 05:51 |
AJaeger | config-core, and also https://review.opendev.org/679743 and https://review.opendev.org/676430 and https://review.opendev.org/#/c/678573/ | 05:52 |
AJaeger | infra-root, we have a Zuul error on https://zuul.opendev.org/t/openstack/config-errors: "philpep/testinfra - undefined (undefined)" | 05:56 |
*** kjackal has quit IRC | 06:03 | |
*** dchen has joined #openstack-infra | 06:04 | |
*** snecker has joined #openstack-infra | 06:04 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/680298 | 06:16 |
*** jamesmcarthur has joined #openstack-infra | 06:19 | |
*** jamesmcarthur has quit IRC | 06:23 | |
*** kopecmartin|off is now known as kopecmartion | 06:23 | |
*** slaweq has joined #openstack-infra | 06:24 | |
*** kopecmartion is now known as kopecmartin | 06:24 | |
*** ccamacho has joined #openstack-infra | 06:27 | |
*** slaweq has quit IRC | 06:28 | |
*** snecker has quit IRC | 06:29 | |
*** yamamoto has joined #openstack-infra | 06:29 | |
*** psachin has quit IRC | 06:31 | |
*** zhurong has quit IRC | 06:37 | |
*** yamamoto has quit IRC | 06:42 | |
*** yamamoto has joined #openstack-infra | 06:43 | |
openstackgerrit | Merged openstack/project-config master: Remove broken notification for starlingx https://review.opendev.org/679850 | 06:46 |
openstackgerrit | Merged openstack/project-config master: Finish retiring networking-generic-switch-tempest-plugin https://review.opendev.org/679743 | 06:48 |
*** openstackgerrit has quit IRC | 06:51 | |
*** slaweq has joined #openstack-infra | 06:52 | |
*** udesale has quit IRC | 06:56 | |
*** diga has joined #openstack-infra | 06:58 | |
*** janki has joined #openstack-infra | 06:58 | |
*** jamesmcarthur has joined #openstack-infra | 06:59 | |
*** yamamoto has quit IRC | 07:00 | |
*** snecker has joined #openstack-infra | 07:03 | |
*** yamamoto has joined #openstack-infra | 07:03 | |
*** markvoelker has joined #openstack-infra | 07:06 | |
*** jamesmcarthur has quit IRC | 07:06 | |
*** markvoelker has quit IRC | 07:10 | |
*** kjackal has joined #openstack-infra | 07:15 | |
*** tesseract has joined #openstack-infra | 07:15 | |
*** trident has quit IRC | 07:21 | |
*** tosky has joined #openstack-infra | 07:23 | |
*** pgaxatte has joined #openstack-infra | 07:29 | |
*** trident has joined #openstack-infra | 07:29 | |
*** rcernin has quit IRC | 07:33 | |
*** jpena|off is now known as jpena | 07:33 | |
*** trident has quit IRC | 07:34 | |
*** openstackgerrit has joined #openstack-infra | 07:38 | |
openstackgerrit | Merged openstack/project-config master: Normalize projects.yaml https://review.opendev.org/680298 | 07:38 |
*** trident has joined #openstack-infra | 07:43 | |
*** sshnaidm|afk is now known as sshnaidm|ruck | 07:43 | |
*** snecker has quit IRC | 07:46 | |
*** zhurong has joined #openstack-infra | 07:53 | |
*** dchen has quit IRC | 07:56 | |
*** dchen has joined #openstack-infra | 07:57 | |
*** dchen has quit IRC | 08:00 | |
openstackgerrit | Fabien Boucher proposed zuul/zuul master: Pagure - handle initial comment change event https://review.opendev.org/680310 | 08:01 |
*** jamesmcarthur has joined #openstack-infra | 08:02 | |
*** tkajinam has quit IRC | 08:05 | |
*** jamesmcarthur has quit IRC | 08:07 | |
*** ralonsoh has joined #openstack-infra | 08:20 | |
*** derekh has joined #openstack-infra | 08:31 | |
*** dtantsur|afk is now known as dtantsur | 08:38 | |
*** e0ne has joined #openstack-infra | 08:41 | |
*** markvoelker has joined #openstack-infra | 08:45 | |
*** markvoelker has quit IRC | 08:50 | |
amotoki | hi, when I open https://zuul.opendev.org/t/openstack/build/91b360daabc8453dba13129f78aca17b, I see an error "Payment Required Access was denied for financial reasons." when trying to open "Artifacts" links. | 08:50 |
amotoki | is it a known issue? | 08:50 |
frickler | amotoki: yes, see notice from yesterday. OVH is working on fixing this, you'll need to recheck to get new logs if you need to see them earlier | 08:56 |
frickler | clarkb: worlddump seems to be working fine, the errors that are logged are expected. we just fail to collect its output :(. I didn't find the reference to sync-devstack-data not working on subnodes, does someone have logs for that? | 08:57 |
amotoki | frickler: thanks. I missed that. | 08:57 |
*** noama has quit IRC | 08:58 | |
frickler | AJaeger: infra-root: that looks more like an issue with the github api, the repo itself seems to be in place for me: "404 Client Error: Not Found for url: https://api.github.com/installations/1549290/access_tokens" | 09:00 |
*** trident has quit IRC | 09:01 | |
*** jamesmcarthur has joined #openstack-infra | 09:04 | |
*** jamesmcarthur has quit IRC | 09:08 | |
*** trident has joined #openstack-infra | 09:09 | |
*** bexelbie has quit IRC | 09:11 | |
*** kjackal has quit IRC | 09:16 | |
*** kjackal has joined #openstack-infra | 09:18 | |
*** pkopec has joined #openstack-infra | 09:18 | |
*** ociuhandu has joined #openstack-infra | 09:23 | |
ianw | yeah i'd note that only recently we enabled the zuul ci side of things there | 09:24 |
ianw | we've got the system-config job running on it, but it has always failed; i need to look into it. should be back tomorrow | 09:24 |
ianw | (it being testinfra) | 09:24 |
*** gfidente has joined #openstack-infra | 09:29 | |
*** yamamoto has quit IRC | 09:35 | |
*** xenos76 has joined #openstack-infra | 09:36 | |
frickler | ianw: when did "only recently" happen and where? it seems we have patches in the check queue waiting for nodes since 14h, could that be related? e.g. https://review.opendev.org/680158 | 09:38 |
*** yamamoto has joined #openstack-infra | 09:40 | |
*** jamesmcarthur has joined #openstack-infra | 09:40 | |
frickler | seems we have a high rate of "deleting nodes" in grafana since 21:00 which would match that timeframe | 09:40 |
frickler | seeing quite a lot of these in nodepool-launcher.log on nl01: 2019-09-05 09:41:24,333 INFO nodepool.driver.NodeRequestHandler[nl01-21267-PoolWorker.rax-iad-main]: Not enough quota remaining to satisfy request 300-0005089060 | 09:42 |
frickler | also ram quota errors like http://paste.openstack.org/show/771269/ | 09:44 |
*** jamesmcarthur has quit IRC | 09:45 | |
frickler | hmm, this one looks like rackspace doesn't even let us use all of the quota. or can this happen with multiple launch requests in parallel? Quota exceeded for instances: Requested 1, but already used 220 of 222 instances | 09:46 |
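One quick way to tell whether those 220 "used" instances actually exist or are just stale quota accounting is to compare the compute limits against a live server list; the cloud name below is a made-up stand-in for whatever the rax-iad entry is called in clouds.yaml:

```shell
# Compare what nova reports as used against the servers we can actually see.
# "openstackci-rax" is an assumed clouds.yaml entry name for this sketch.
openstack --os-cloud openstackci-rax --os-region-name IAD \
  limits show --absolute | grep -i instances
openstack --os-cloud openstackci-rax --os-region-name IAD \
  server list -f value -c ID | wc -l
```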
*** bhavikdbavishi has quit IRC | 09:52 | |
frickler | infra-root: unless someone has a better idea, I'd suggest to try and gently lower the quota we use on rackspace in an attempt to reduce the ongoing churn | 10:07 |
sean-k-mooney | clarkb: thanks for looking into the ca issue. for now I'll stick with turning off the tls_proxy until after m3 but assuming moving sync-devstack-data resolves the issue then we can obviously turn it back on. | 10:09 |
*** udesale has joined #openstack-infra | 10:20 | |
*** udesale has quit IRC | 10:28 | |
*** udesale has joined #openstack-infra | 10:29 | |
ianw | frickler: i'd have to look back, but a few weeks. but i think maybe we might need to restart zuul to pick up the changes as it wasn't authorized to access the testinfra github project when it was launched. i *think* there may be some bits that don't actively reload themselves | 10:29 |
ianw | sorry no idea on the quota bits | 10:29 |
*** bexelbie has joined #openstack-infra | 10:39 | |
*** ramishra has quit IRC | 10:39 | |
*** nicolasbock has joined #openstack-infra | 10:41 | |
*** jamesmcarthur has joined #openstack-infra | 10:41 | |
*** yamamoto has quit IRC | 10:42 | |
*** jamesmcarthur has quit IRC | 10:46 | |
*** markvoelker has joined #openstack-infra | 10:46 | |
*** kjackal has quit IRC | 10:47 | |
*** iurygregory has joined #openstack-infra | 10:48 | |
*** gfidente has quit IRC | 10:52 | |
*** markvoelker has quit IRC | 10:52 | |
AJaeger | config-core, please review https://review.opendev.org/680240 https://review.opendev.org/680239 https://review.opendev.org/676430 , https://review.opendev.org/678573 and https://review.opendev.org/679652 | 10:56 |
*** ramishra has joined #openstack-infra | 11:06 | |
*** jamesmcarthur has joined #openstack-infra | 11:18 | |
*** kjackal has joined #openstack-infra | 11:19 | |
*** jpena is now known as jpena|lunch | 11:21 | |
*** jamesmcarthur has quit IRC | 11:23 | |
*** ociuhandu has quit IRC | 11:23 | |
*** beagles has joined #openstack-infra | 11:24 | |
*** lpetrut has joined #openstack-infra | 11:27 | |
*** bexelbie has quit IRC | 11:30 | |
*** rosmaita has left #openstack-infra | 11:35 | |
*** bexelbie has joined #openstack-infra | 11:36 | |
*** bexelbie has quit IRC | 11:36 | |
*** yamamoto has joined #openstack-infra | 11:36 | |
*** ociuhandu has joined #openstack-infra | 11:39 | |
*** ociuhandu has quit IRC | 11:40 | |
*** ociuhandu has joined #openstack-infra | 11:41 | |
*** ociuhandu has quit IRC | 11:41 | |
*** ociuhandu has joined #openstack-infra | 11:42 | |
*** ociuhandu has quit IRC | 11:43 | |
*** ociuhandu has joined #openstack-infra | 11:44 | |
*** gfidente has joined #openstack-infra | 11:45 | |
*** ociuhandu has quit IRC | 11:50 | |
AJaeger | infra-root, I agree, something is wrong with our nodes - we have a requirements change waiting since 14 hours for a node for example... | 11:53 |
*** yamamoto has quit IRC | 11:54 | |
Shrews | AJaeger: That may not necessarily be something wrong with nodepool. If zuul gets hung up on something, it probably already has the nodes it has requested, it just isn't progressing and using them. That's what we've seen most in these cases. Do you have a node request number for the change in question? | 11:57 |
AJaeger | Shrews: see backscroll and comments by frickler - I have no further information and I'm not an admin. | 11:58 |
Shrews | frickler: ^^^ | 11:58 |
AJaeger | Shrews: example change https://review.opendev.org/680107 in http://zuul.opendev.org/t/openstack/status | 11:59 |
AJaeger | "Not enough quota remaining to satisfy request 300-0005089060" - is that what you need, Shrews ? | 11:59 |
AJaeger | That is from backscroll | 12:00 |
*** yamamoto has joined #openstack-infra | 12:00 | |
*** jamesmcarthur has joined #openstack-infra | 12:00 | |
AJaeger | Shrews: and http://paste.openstack.org/show/771269/ has details | 12:00 |
*** roman_g has joined #openstack-infra | 12:01 | |
Shrews | 2019-09-05 09:46:34,399 DEBUG nodepool.driver.NodeRequestHandler[nl01-21267-PoolWorker.rax-iad-main]: Fulfilled node request 300-0005089060 | 12:01 |
Shrews | that request was satisfied hours ago. if the change for that is not doing anything still, that points to a zuul problem | 12:01 |
*** markvoelker has joined #openstack-infra | 12:02 | |
AJaeger | I don't know - frickler, do you? ^ | 12:02 |
Shrews | and if zuul is holding requested nodes longer than it needs to, it just puts excessive pressure on the entire system (causing the quota issues mentioned) | 12:03 |
frickler | Shrews: it's well possible that zuul would need a restart, possibly related to what ianw mentioned, but I'm not going to touch it myself | 12:04 |
Shrews | AJaeger: frickler: looks like the node used in that request was deleted not long after the request was fulfilled | 12:04 |
*** virendra-sharma has joined #openstack-infra | 12:04 | |
*** bhavikdbavishi has joined #openstack-infra | 12:10 | |
*** virendra-sharma has quit IRC | 12:13 | |
*** rh-jelabarre has joined #openstack-infra | 12:14 | |
Shrews | AJaeger: frickler: from what I can tell, nodepool never even received that request until 2019-09-05 09:41:14, so it processed it rather quickly. | 12:15 |
*** yamamoto has quit IRC | 12:16 | |
frickler | we do seem to be slowly progressing, not completely stuck. maybe indeed all is well except that we are running at or above capacity | 12:18 |
AJaeger | we have 8 hold nodes - do we still need those? | 12:19 |
Shrews | oh, that node request does not even belong to the referenced change | 12:19 |
frickler | corvus: pabelanger: mnaser: ianw: mordred: do you still need your held nodes? all of them > 10 days old | 12:20 |
pabelanger | frickler: nope, that can be deleted | 12:21 |
AJaeger | we have 50 nodes deleting in limestone: http://grafana.openstack.org/d/WFOSH5Siz/nodepool-limestone?orgId=1 | 12:22 |
AJaeger | and 201 in ovh: http://grafana.openstack.org/d/BhcSH5Iiz/nodepool-ovh?orgId=1 | 12:22 |
AJaeger | but 0 building. frickler, that might be the problem with the deleting you noticed ^ | 12:22 |
haleyb | clarkb: do you still have question on accept_ra sysctls in the dhcp namespace? I can't tell from scrollback. Don't think that should be an issue, but i'd be happy to look into it | 12:22 |
pabelanger | AJaeger: looks like outages | 12:23 |
AJaeger | so, that looks like the billing issue with ovh - | 12:23 |
AJaeger | I'm aware of a swift billing issue, didn't know it affected nodes as well | 12:23 |
AJaeger | is nodepool getting confused with trying to delete 200 ovh nodes and not succeeding? | 12:24 |
*** yamamoto has joined #openstack-infra | 12:26 | |
*** ociuhandu has joined #openstack-infra | 12:27 | |
logan- | limestone has the host aggregate disabled cloud-side as a precautionary measure while we're ironing out an outage with an upstream carrier. we've disabled the host aggregate this way in the past and it didn't have adverse effects with nodepool, but I wonder if maybe there's a scale issue with us + the ovh nodes? | 12:28 |
*** ociuhandu has quit IRC | 12:32 | |
*** eharney has joined #openstack-infra | 12:41 | |
frickler | indeed this is what nl04 sees from ovh-gra1: keystoneauth1.exceptions.http.Unauthorized: The account is disabled for user: 6b66bafa4e214d5ab62928c8d7372b2b. (HTTP 401) (Request-ID: req-3e92a20a-058e-4ff6-a1f0-745f341bb8fa) | 12:42 |
frickler | going to disable that zone | 12:42 |
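A quick check of whether the credentials are usable again before (or instead of) merging a disable change; the cloud and region names below are assumptions standing in for the real clouds.yaml entries:

```shell
# Verify the OVH credentials work again after the account is re-enabled.
# "openstackci-ovh"/GRA1 are stand-ins for the real cloud/region names.
openstack --os-cloud openstackci-ovh --os-region-name GRA1 token issue
openstack --os-cloud openstackci-ovh --os-region-name GRA1 server list
```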
*** yamamoto has quit IRC | 12:43 | |
openstackgerrit | Jens Harbott (frickler) proposed openstack/project-config master: Disable OVH in nodepool due to accounting issues https://review.opendev.org/680397 | 12:45 |
frickler | infra-root: ^^ we might need to force-merge if checks take too long | 12:45 |
*** rh-jelabarre has quit IRC | 12:46 | |
AJaeger | frickler: so, is that really a problem for nodepool? Doesn't it handle that error properly? | 12:47 |
AJaeger | frickler: changes ahead of it in the queue got nodes assigned in 2-5 minutes, so shouldn't take too long | 12:48 |
pabelanger | I'd be surprised if nodepool is having an issue because of it | 12:49 |
pabelanger | but, could be possible | 12:49 |
amorin | hey clarkb | 12:49 |
pabelanger | in the past, provider outages haven't been an impact to nodepool | 12:49 |
amorin | hey all | 12:50 |
amorin | is the tenant on ovh still down? | 12:51 |
amorin | the team is supposed to enabled it back this morning (french time) | 12:51 |
pabelanger | amorin: yes, we are just disabling it now | 12:51 |
pabelanger | might still be related to accounting? | 12:51 |
frickler | amorin: at least it was 10 minutes ago | 12:51 |
amorin | pabelanger, frickler spawn still not possible? | 12:52 |
frickler | amorin: keystone says account 6b66bafa4e214d5ab62928c8d7372b2b is disabled | 12:52 |
pabelanger | amorin: correct: http://grafana.openstack.org/dashboard/db/nodepool-ovh | 12:52 |
AJaeger | amorin, and swift is still asking for payment | 12:53 |
amorin | checking | 12:53 |
amorin | I have a guy on phone call checking ATM | 12:53 |
AJaeger | thanks, amorin . Much appreciated! | 12:54 |
*** Goneri has joined #openstack-infra | 12:55 | |
amorin | frickler: what is 6b66bafa4e214d5ab62928c8d7372b2b ? | 12:55 |
amorin | its not a tenant? | 12:55 |
amorin | what we have is: dcaab5e32b234d56b626f72581e3644c | 12:56 |
frickler | amorin: user id probably | 12:56 |
amorin | ok | 12:56 |
amorin | yes user is disabled | 12:56 |
amorin | we are checking | 12:56 |
*** bhavikdbavishi has quit IRC | 12:56 | |
*** yamamoto has joined #openstack-infra | 12:58 | |
AJaeger | frickler, pabelanger , I think we should not merge the ovh disable change while amorin is on it- do you agree and want to WIP it? | 12:58 |
frickler | AJaeger: ack, done | 12:59 |
*** jpena|lunch is now known as jpena | 13:01 | |
amorin | frickler: AJaeger pabelanger users are ok now | 13:04 |
amorin | could you check? | 13:04 |
*** ramishra has quit IRC | 13:04 | |
*** ramishra has joined #openstack-infra | 13:04 | |
*** jamesmcarthur has quit IRC | 13:04 | |
*** bhavikdbavishi has joined #openstack-infra | 13:04 | |
*** jchhatbar has joined #openstack-infra | 13:05 | |
*** jcoufal has joined #openstack-infra | 13:06 | |
*** janki has quit IRC | 13:06 | |
*** mriedem has joined #openstack-infra | 13:07 | |
frickler | amorin: I don't see any change yet | 13:07 |
amorin | ok, I still have some tasks ongoing, wait a minute | 13:07 |
*** jchhatba_ has joined #openstack-infra | 13:08 | |
amorin | frickler: could you test again? | 13:09 |
frickler | amorin: yep, looks better now | 13:09 |
amorin | yay | 13:10 |
*** pcaruana has quit IRC | 13:10 | |
*** jchhatbar has quit IRC | 13:11 | |
frickler | amorin: so at least I can manually list servers, our nodepool service seems to have a bit of a hiccup. thanks anyway for fixing this | 13:11 |
frickler | amorin: the swift issue still seems to persist, though, I'm assuming that will take longer to handle? | 13:12 |
amorin | ah, it's supposed to fix also | 13:13 |
amorin | checking | 13:13 |
rledisez | frickler: there is a few minutes of cache, so now that the tenant is enabled it should be ok in a few minutes | 13:13 |
amorin | thanks rledisez | 13:13 |
rledisez | frickler: is it the same tenant ? | 13:13 |
rledisez | just to double check | 13:13 |
amorin | rledisez: the user is on the same tenant | 13:13 |
frickler | rledisez: amorin: indeed, after another reload this seems to work now, too. thx a lot | 13:14 |
rledisez | frickler: perfect. let us know if you hit other issues | 13:14 |
AJaeger | yeah, swift works - thanks amorin and rledisez ! | 13:14 |
amorin | sorry for that | 13:15 |
AJaeger | frickler: nodepool is not yet using ovh according to http://grafana.openstack.org/d/BhcSH5Iiz/nodepool-ovh?orgId=1 ? | 13:15 |
*** eernst has quit IRC | 13:16 | |
frickler | AJaeger: nl04 seems to be recovering from suddenly getting useful responses, I'd tend to wait a bit and see what happens | 13:16 |
AJaeger | ok | 13:17 |
AJaeger | frickler: want to abandon https://review.opendev.org/#/c/680397/ now? | 13:18 |
*** ociuhandu has joined #openstack-infra | 13:18 | |
frickler | AJaeger: done | 13:21 |
*** bhavikdbavishi has quit IRC | 13:28 | |
AJaeger | frickler, pabelanger: Did either of you remove pabelanger's hold nodes? | 13:29 |
pabelanger | I have not, I can do in a bit | 13:30 |
pabelanger | or somebody else can, if they'd like | 13:30 |
mnaser | i dont need my holds | 13:30 |
mnaser | sorry i didnt update | 13:31 |
*** rosmaita has joined #openstack-infra | 13:32 | |
corvus | frickler: i deleted my nodes, thanks for the reminder | 13:33 |
corvus | frickler, AJaeger: i'd like to restart the executors, but that's going to create a bit of nodepool churn; do you think we're stable enough for that now, or should we defer that for a few hours? | 13:35 |
fungi | i am around today, modulo several hours of meetings starting shortly, but am declaring bankruptcy on the last 36 hours of irc scrollback in here | 13:35 |
corvus | fungi: glad to see you! | 13:36 |
fungi | thanks! glad to be high and dry here in the mountains for the next several days | 13:36 |
*** ginopc has quit IRC | 13:37 | |
smcginnis | Good to hear you are safe and dry. | 13:38 |
*** pcaruana has joined #openstack-infra | 13:39 | |
AJaeger | corvus: we just enabled ovh again - but they are not picking up nodes yet according to grafana. Let's wait until ovh is healthy | 13:39 |
AJaeger | corvus: so, my advice: Let's figure out ovh - and then restart. We might need to restart nl04 as well... | 13:41 |
*** smarcet has joined #openstack-infra | 13:42 | |
corvus | AJaeger: ack; i'll take a look at the nl04 logs | 13:42 |
corvus | AJaeger, frickler, Shrews: since frickler said that new commands work, but nothing is currently working on nl04, i wonder if there's something cached in the keystone session that's blocking us... how about i go ahead and restart nl04 and we can maybe learn whether that's the case? | 13:44 |
*** jchhatba_ has quit IRC | 13:44 | |
corvus | actually... new theory: | 13:45 |
corvus | i'm only seeing logs about deleting servers | 13:45 |
corvus | frickler: you said you were able to list servers -- do we still have a lot of servers in the account? | 13:46 |
AJaeger | http://grafana.openstack.org/d/BhcSH5Iiz/nodepool-ovh?orgId=1&refresh=1m is confusing, it shows deleting at 0 | 13:46 |
AJaeger | but if you look further down at "Test Node History", it still shows them as deleting | 13:47 |
corvus | i get a grafana error for that dashboard | 13:47 |
*** bnemec has quit IRC | 13:48 | |
AJaeger | does http://grafana.openstack.org/d/BhcSH5Iiz/nodepool-ovh work? | 13:48 |
corvus | (or maybe a graphite error) | 13:48 |
AJaeger | works for me... | 13:48 |
*** bnemec has joined #openstack-infra | 13:48 | |
corvus | nope, i get http://paste.openstack.org/ | 13:48 |
corvus | grr | 13:48 |
corvus | http://paste.openstack.org/show/771445/ | 13:48 |
AJaeger | ;/ | 13:49 |
* AJaeger has to step out now | 13:49 | |
corvus | i get that for every nodepool dashboard | 13:49 |
corvus | oh. privacy badger :) | 13:51 |
*** rh-jelabarre has joined #openstack-infra | 13:51 | |
corvus | one really has to get used to debugging the internet if using pb | 13:52 |
corvus | i've triggered the debug handler, and the objgraph part of that is taking a long time | 13:56 |
corvus | probably because we do in fact have a memory leak, and it has to page everything in | 13:56 |
Shrews | corvus: for zuul or the launcher? | 13:57 |
corvus | nl04 | 13:57 |
corvus | or... | 13:57 |
corvus | is it possible the debug handler crashed? | 13:58 |
fungi | corvus: yep, a while back i discovered i needed to okay grafana.openstack.org embedding images from graphite.openstack.org (again, i think it's the .openstack.org cookies causing it for me) | 13:58 |
corvus | cause the program seems to be running again now, with no iowait/swapping. but sigusr2 doesn't work anymore, and i still never saw anything after 2019-09-05 13:53:09,563 DEBUG nodepool.stack_dump: Most common types: | 13:58 |
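For reference, a sketch of how that dump gets triggered and where to read it back; the pid lookup and debug log path are assumptions:

```shell
# SIGUSR2 asks the launcher to log thread stack traces plus objgraph's
# "Most common types" summary; read them back from the debug log.
# (pid lookup and log path are guesses for this sketch)
sudo kill -USR2 "$(pgrep -f nodepool-launcher | head -n1)"
sudo grep -B2 -A20 'nodepool.stack_dump' /var/log/nodepool/launcher-debug.log | tail -n 60
```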
fungi | er, graphite.openDEV.org | 13:59 |
fungi | so maybe not the domain cookie | 13:59 |
*** smarcet has quit IRC | 13:59 | |
corvus | Shrews: these are the stack traces i got from the pool workers: http://paste.openstack.org/show/771446/ http://paste.openstack.org/show/771447/ | 14:00 |
corvus | i can't tell if they're stuck there in a deadlock or not (since i can't run the handler twice) | 14:00 |
corvus | i think we've got all we're going to from nl04 and we should restart it now | 14:00 |
corvus | (some NodeDeleter threads also have similar tracebacks) | 14:01 |
corvus | Shrews: any objection to a restart, or should i go for it? | 14:01 |
Shrews | corvus: go for it. i suspect changing something with the user auth caused some wonkiness within keystoneauth | 14:02 |
Shrews | but those stack traces seem... off | 14:03 |
corvus | yeah, i'm sure at least one dependency has changed from under us | 14:03 |
corvus | maybe even nodepool itself | 14:04 |
*** ociuhandu has quit IRC | 14:04 | |
corvus | okay, it's doing stuff now. running into a lot of quota issues, but hopefully that will even out | 14:04 |
corvus | (as the old, improperly accounted-for nodes are deleted) | 14:05 |
*** lpetrut has quit IRC | 14:05 | |
corvus | and there are nodes in-use now | 14:05 |
corvus | so looks like things are okay again | 14:06 |
*** pcaruana has quit IRC | 14:12 | |
*** yamamoto has quit IRC | 14:16 | |
*** yamamoto has joined #openstack-infra | 14:16 | |
johnsom | Any pointers on how I can debug "RETRY_LIMIT" job failures? There appear to be no logs: https://zuul.opendev.org/t/openstack/build/c612b1c80c0c4dd49d39706d2a6cc4a5 | 14:19 |
*** ociuhandu has joined #openstack-infra | 14:20 | |
*** smarcet has joined #openstack-infra | 14:20 | |
*** yamamoto has quit IRC | 14:21 | |
*** jamesmcarthur has joined #openstack-infra | 14:23 | |
corvus | johnsom: you might be able to recheck the change and then follow the streaming console log when it starts running | 14:25 |
*** yamamoto has joined #openstack-infra | 14:25 | |
*** ociuhandu has quit IRC | 14:25 | |
corvus | when the release queue clears i'm going to restart the executors | 14:25 |
*** smarcet has quit IRC | 14:29 | |
*** lpetrut has joined #openstack-infra | 14:29 | |
corvus | #status log restarted nl04 to deal with apparently stuck keystone session after ovh auth fixed | 14:32 |
openstackstatus | corvus: finished logging | 14:32 |
corvus | #status log restarted zuul executors with commit cfe6a7b985125325605ef192b2de5fe1986ef569 | 14:32 |
openstackstatus | corvus: finished logging | 14:32 |
*** armstrong has joined #openstack-infra | 14:35 | |
*** smarcet has joined #openstack-infra | 14:36 | |
*** yamamoto has quit IRC | 14:37 | |
*** yamamoto has joined #openstack-infra | 14:39 | |
frickler | corvus: sorry, was afk for a bit. when I checked there were only a few servers listed, 3 in one region and about 10 in the other | 14:42 |
frickler | corvus: there are also two ancient puppet apply processes on nl04, shall we just kill those? | 14:42 |
corvus | frickler: great, i think that matches the observed behavior | 14:43 |
corvus | frickler: ++ | 14:43 |
*** jamesmcarthur has quit IRC | 14:43 | |
frickler | corvus: finally, did you look at the config error? ianw mentioned that this might also need a restart to resolve http://zuul.openstack.org/config-errors | 14:43 |
corvus | frickler: has our app been installed in that repo? | 14:44 |
corvus | (or was it installed after we reconfigured to add the repo?) | 14:45 |
frickler | corvus: I have no idea. the repo seems to have been in the config for some months if I checked correctly | 14:45 |
*** mattw4 has joined #openstack-infra | 14:46 | |
*** udesale has quit IRC | 14:46 | |
*** lpetrut has quit IRC | 14:47 | |
frickler | https://review.opendev.org/657461 | 14:47 |
*** udesale has joined #openstack-infra | 14:47 | |
*** gyee has joined #openstack-infra | 14:50 | |
corvus | frickler: let's check with ianw | 14:51 |
corvus | frickler: but if it's a transient error, then we should be able to fix it with a full reconfiguration | 14:51 |
*** smarcet has quit IRC | 14:53 | |
*** armax has joined #openstack-infra | 14:56 | |
*** piotrowskim has quit IRC | 14:58 | |
*** cmurphy|afk is now known as cmurphy | 14:58 | |
donnyd | can someone take a look at the nodepool logs for FN. I am having an issue with the new labels I created for NUMA | 15:00 |
donnyd | It would be super helpful to know what nodepool is throwing as an error | 15:01 |
*** jamesmcarthur has joined #openstack-infra | 15:02 | |
*** e0ne has quit IRC | 15:05 | |
Shrews | donnyd: all i'm seeing are quota issues | 15:07 |
donnyd | Ok, i will just pull the quota entirely | 15:07 |
Shrews | donnyd: what are the new labels? | 15:08 |
donnyd | https://github.com/openstack/project-config/blob/master/nodepool/nl02.openstack.org.yaml#L348-L357 | 15:08 |
donnyd | multi-numa-ubuntu-bionic-expanded | 15:08 |
donnyd | multi-numa-ubuntu-bionic | 15:08 |
donnyd | those are the ones i am concerned with | 15:09 |
Shrews | donnyd: this is the most recent reference to one of those: http://paste.openstack.org/show/771452/ | 15:09 |
donnyd | thanks Shrews | 15:10 |
donnyd | that is what i needed | 15:10 |
clarkb | corvus: fungi can I get review on https://review.opendev.org/#/c/680239/ that is next step for swift container sharding | 15:11 |
clarkb | if tests look good for ^ I'll push up a change for base-minimal and base | 15:11 |
clarkb | also is there a change to add ovh back to those jobs? | 15:11 |
corvus | clarkb: no change i'm aware of | 15:12 |
donnyd | also I am not sure how up to date the dashboard is... but it says it's been deleting 29 instances for a while... | 15:14 |
donnyd | they have been gone since about 37 seconds after the request for delete was placed | 15:14 |
openstackgerrit | Clark Boylan proposed opendev/base-jobs master: Revert "Stop storing logs in OVH" https://review.opendev.org/680446 | 15:14 |
clarkb | corvus: ^ for OVH | 15:14 |
*** tosky has quit IRC | 15:15 | |
*** smarcet has joined #openstack-infra | 15:16 | |
*** mattw4 has quit IRC | 15:17 | |
*** jamesmcarthur has quit IRC | 15:21 | |
*** jamesmcarthur has joined #openstack-infra | 15:22 | |
*** ccamacho has quit IRC | 15:26 | |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Add build shard log path docs to upload-logs(-swift) https://review.opendev.org/680240 | 15:27 |
*** jamesmcarthur has quit IRC | 15:27 | |
clarkb | infra-root https://review.opendev.org/#/c/680236/ bumps mirror TTLs up to an hour which was something we noticed with johnsom's dns debugging last night | 15:29 |
clarkb | slaweq: if you have time today it would be great if you could look at these jobs logs from johnsom to double check that neutron isn't breaking host ipv6 networking which results in broken dns | 15:29 |
clarkb | slaweq: at around the same time that dns stops working syslog records a bunch of changes by neutron in iptables, ebtables, ovs, and sysctl | 15:30 |
slaweq | clarkb: sorry, I'm in a meeting now and will go offline just after it | 15:31 |
slaweq | clarkb: I can look at it tomorrow morning | 15:31 |
johnsom | Repaste: An ironic job example: https://ae2e20a98890c5458e58-1659b40e5c03e7f989419d6178d67ae8.ssl.cf5.rackcdn.com/679332/3/check/ironic-standalone-ipa-src/50285fa/job-output.txt | 15:31 |
slaweq | is it this link https://github.com/openstack/neutron/commit/f21c7e2851bc99b424bdc5322dcd0e3dee7ee5a3 ? | 15:31 |
*** jamesmcarthur has joined #openstack-infra | 15:31 | |
*** ociuhandu has joined #openstack-infra | 15:31 | |
johnsom | repaste: an octavia log example: https://79636ab1f0eaf6899dc0-686811860e75a35eabcecb5697905253.ssl.cf2.rackcdn.com/665029/30/check/octavia-v2-dsvm-scenario/51891fa/job-output.txt | 15:32 |
clarkb | slaweq: the one that johnsom just pasted. Job ends up failing because dns stops working | 15:32 |
clarkb | slaweq: cross checking timestamps in the unbound log and in syslog I see dns stops working about when neutron is making networking changes | 15:32 |
clarkb | there are a lot of moving parts there so I ended up getting lost but figured someone that knows neutron would be able to get through it more easily | 15:32 |
johnsom | clarkb Any advice on this? https://zuul.opendev.org/t/openstack/build/c612b1c80c0c4dd49d39706d2a6cc4a5 | 15:33 |
fungi | clarkb: 680236 needs a serial increase | 15:33 |
slaweq | clarkb: I will look into it later today or tomorrow morning | 15:33 |
clarkb | fungi: oh right | 15:33 |
clarkb | slaweq: thank you | 15:33 |
openstackgerrit | Clark Boylan proposed opendev/zone-opendev.org master: Increase mirror ttls to an hour https://review.opendev.org/680236 | 15:34 |
clarkb | fungi: ^ done and thank you | 15:34 |
clarkb | johnsom: did you see the suggestion of watching the job? typically that means it is failing badly enough that we lose network connectivity and fail to copy logs | 15:35 |
clarkb | johnsom: but if you watch the console log that is live streamed so you'll get the data before networking goes away | 15:35 |
johnsom | clarkb I did, I'm just not sure how to guess which job is going to have the RETRY_LIMIT, it seems to move around. | 15:36 |
clarkb | is it specific to that change? | 15:36 |
johnsom | no | 15:36 |
*** ociuhandu has quit IRC | 15:36 | |
johnsom | I will try to catch one, just hoped there was a better way | 15:37 |
clarkb | Not really. We could try to set up holds but if host networking is getting nuked we won't be able to login and debug with a hold | 15:37 |
clarkb | we can check the executor logs to see if there are any crumbs of info there but other than that catching one live is probably our best bet | 15:38 |
clarkb | johnsom: could it be nested virt causing crashes? | 15:38 |
johnsom | clarkb We disabled that over a year ago | 15:38 |
fungi | rules that out then i guess | 15:38 |
johnsom | clarkb Due to the nodepool image kernel issue at ovh | 15:38 |
johnsom | This just started around the 3rd | 15:39 |
*** smarcet has left #openstack-infra | 15:39 | |
clarkb | johnsom: http://paste.openstack.org/show/771454/ as suspected it appears to be crashing due to connectivity issues | 15:42 |
clarkb | during tempest | 15:42 |
clarkb | if you catch a live one you'll probably be able to narrow it down to the test that ran prior | 15:43 |
clarkb | maybe look for changes in your tempest tests the last few days | 15:44 |
johnsom | I did, we haven't changed them since the 16th | 15:45 |
clarkb | other things it could be (but usually this is more widespread) zuul OOMing and losing its ZK Locks on nodes | 15:45 |
* clarkb checks cacti | 15:46 | |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=64792&rra_id=all seems ok | 15:46 |
*** yamamoto has quit IRC | 15:53 | |
*** yamamoto has joined #openstack-infra | 15:54 | |
*** yamamoto has quit IRC | 15:54 | |
*** yamamoto has joined #openstack-infra | 15:57 | |
*** yamamoto has quit IRC | 15:57 | |
*** yamamoto has joined #openstack-infra | 15:57 | |
*** sshnaidm|ruck is now known as sshnaidm|afk | 16:00 | |
clarkb | should I be concerned that the opendev base jobs updates seem to just be sitting in zuul http://zuul.opendev.org/t/opendev/status ? | 16:00 |
*** altlogbot_3 has quit IRC | 16:01 | |
*** yamamoto has quit IRC | 16:02 | |
*** altlogbot_3 has joined #openstack-infra | 16:02 | |
*** smarcet has joined #openstack-infra | 16:03 | |
*** irclogbot_3 has quit IRC | 16:03 | |
*** irclogbot_2 has joined #openstack-infra | 16:04 | |
johnsom | I hate to say it, but if you look at the main zuul board, jobs are backed up 18+ hours and other projects are seeing the retries as well. | 16:05 |
haleyb | johnsom: so does the job failure happen at the same place every time? looks like it's when it's going to setup diskimage-builder from the link above | 16:05 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Remove support for ansible 2.5 https://review.opendev.org/650431 | 16:05 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Switch ansible_default to 2.8 https://review.opendev.org/676695 | 16:05 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: WIP: Support Ansible 2.9 https://review.opendev.org/674854 | 16:05 |
clarkb | johnsom: ya I'm wondering if this is executor restart fallout | 16:05 |
johnsom | The top job for cinder has more than a handful doing "2. attempt" | 16:05 |
clarkb | they aren't running | 16:05 |
clarkb | johnsom: ya because executors were stopped | 16:06 |
clarkb | corvus: ^ fyi I don't think they came back up again | 16:06 |
corvus | clarkb: oh, well that makes me sad | 16:06 |
*** irclogbot_2 has quit IRC | 16:07 | |
johnsom | haleyb Hi there. The common "timing" is that both jobs install additional packages later in the devstack setup. | 16:07 |
*** irclogbot_2 has joined #openstack-infra | 16:07 | |
clarkb | johnsom: haleyb ya one theory I had was most jobs don't have a problem because dns gets broken after they are done doing external work | 16:08 |
haleyb | i just don't see where any of the neutron agents are actively doing anything that would cause it | 16:08 |
clarkb | johnsom: haleyb but if projects like octavia and ironic do extra work after neutron comes up and sets up its networking, that could explain why they see it | 16:08 |
johnsom | haleyb When they attempt to go out and get those, the DNS queries fail. This only occurs when the DNS is using IPv6. | 16:08 |
clarkb | haleyb: check syslog | 16:08 |
slaweq | clarkb: I don't see anything in neutron that could break this IPv6 | 16:08 |
slaweq | clarkb: but I have to leave now, sorry | 16:09 |
clarkb | haleyb: there is a bunch of ovs, iptables, ebtables and such right around the same time period (starting at the same second and continuing after) | 16:09 |
clarkb | corvus: anything I can do to help? | 16:09 |
corvus | clarkb: i restarted them. was just a systemd timeout | 16:09 |
clarkb | ah | 16:09 |
clarkb | johnsom: haleyb I do still think fixing worlddump will help quite a bit | 16:11 |
clarkb | since that should give us the networking state after things break | 16:11 |
clarkb | and we can work backward from there if necessary (or it will say everything is fine) | 16:12 |
haleyb | clarkb: the iptables, etc calls are after the failure from what i see - 2019-09-04 22:06:01.847407 is dns failure, Sep 04 22:06:05 is first "ip netns exec..." call, so not sure it's the culprit yet | 16:14 |
clarkb | haleyb: in the agent log that starts earlier (I don't know why there is a mismatch in log timestamps. maybe one is at start and the other is at completion?) | 16:15 |
*** xinranwang has quit IRC | 16:15 | |
*** mattw4 has joined #openstack-infra | 16:15 | |
clarkb | but ya we are running worlddump and it does nothing, that should be fixed as it exists specifically to debug these problems | 16:16 |
haleyb | i've seen dns failure locally before, typically when "stack.sh" moves IPs to the OVS bridge, sometimes networkmanager trips over itself, but i just work around that by hardcoding dns servers in resolv.conf | 16:16 |
clarkb | we hardcode them in the unbound config then hardcode resolv.conf to resolve against the local unbound | 16:16 |
clarkb | (it is using the correct servers according to the unbound log, they aren't reachable though) | 16:16 |
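(For context, the setup clarkb describes here — upstream resolvers hardcoded in unbound, with resolv.conf pointing at the local unbound — typically looks something like the minimal sketch below. The file path and resolver addresses are assumptions for illustration, not the actual node configuration.)

```
# /etc/unbound/unbound.conf.d/forwarding.conf (illustrative path)
server:
    interface: 127.0.0.1
    interface: ::1

forward-zone:
    name: "."
    # upstream resolvers assumed for illustration (Cloudflare and Google, IPv6)
    forward-addr: 2606:4700:4700::1111
    forward-addr: 2001:4860:4860::8888

# /etc/resolv.conf then simply points at the local daemon:
# nameserver 127.0.0.1
```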
*** bdodd has joined #openstack-infra | 16:16 | |
*** pgaxatte has quit IRC | 16:17 | |
*** gfidente has quit IRC | 16:18 | |
*** kopecmartin is now known as kopecmartin|off | 16:18 | |
haleyb | it would be good to get a snapshot of what the system looks like right before this, like 'ip -6 a' and 'ip -6 r', etc, since i'm assuming we can't get in to look around when it fails? | 16:19 |
clarkb | haleyb: we can; that is what I'm trying to say. Worlddump is the tool we have for this | 16:23 |
clarkb | its broken | 16:23 |
clarkb | someone needs to fix it | 16:23 |
*** e0ne has joined #openstack-infra | 16:23 | |
clarkb | I'm working on a quick change to devstack to see if we can identify where it is broken | 16:24 |
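(A minimal sketch of the kind of network snapshot haleyb asks for and worlddump is meant to provide; the output path and exact command list are assumptions, not the actual worlddump implementation.)

```bash
#!/bin/bash
# Illustrative only: capture the host's network state so it can be inspected
# after a job loses connectivity.
OUT=/opt/stack/logs/net-snapshot-$(date +%s).txt
{
  echo "=== ip -6 addr ===";   ip -6 addr
  echo "=== ip -6 route ===";  ip -6 route
  echo "=== ip -4 route ===";  ip route
  echo "=== resolv.conf ===";  cat /etc/resolv.conf
  echo "=== ip6tables ===";    sudo ip6tables -L -n -v
} > "$OUT" 2>&1
```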
*** smarcet has quit IRC | 16:24 | |
openstackgerrit | Merged opendev/base-jobs master: We need to set the build shard var in post playbook too https://review.opendev.org/680239 | 16:25 |
openstackgerrit | Merged opendev/base-jobs master: Revert "Stop storing logs in OVH" https://review.opendev.org/680446 | 16:25 |
*** yamamoto has joined #openstack-infra | 16:26 | |
clarkb | haleyb: remote: https://review.opendev.org/680458 DO NOT MERGE debugging why worlddump logs are not collected | 16:27 |
AJaeger | config-core, please review https://review.opendev.org/679856 and https://review.opendev.org/679652 | 16:27 |
haleyb | clarkb: ack, i'll keep an eye on that, and look around some more | 16:27 |
openstackgerrit | Jimmy McArthur proposed zuul/zuul-website master: Update to page titles and Users https://review.opendev.org/680459 | 16:31 |
*** e0ne has quit IRC | 16:31 | |
johnsom | clarkb After that restart it looks like our jobs aren't RETRY_LIMIT anymore. They seem to have made it to devstack. | 16:32 |
*** chandankumar is now known as raukadah | 16:33 | |
*** smarcet has joined #openstack-infra | 16:39 | |
openstackgerrit | Jimmy McArthur proposed zuul/zuul-website master: CSS fix for ul/li in FAQ https://review.opendev.org/680465 | 16:41 |
openstackgerrit | Merged zuul/zuul-jobs master: Add build shard log path docs to upload-logs(-swift) https://review.opendev.org/680240 | 16:41 |
*** zigo has quit IRC | 16:43 | |
*** ramishra has quit IRC | 16:45 | |
openstackgerrit | Merged openstack/project-config master: New charm for cinder integration with Purestorage https://review.opendev.org/679652 | 16:46 |
clarkb | AJaeger: ^ there is one of them | 16:46 |
AJaeger | thanks, clarkb | 16:48 |
*** spsurya has quit IRC | 16:48 | |
*** igordc has joined #openstack-infra | 16:49 | |
zbr | clarkb: AJaeger any chance to persuade you about emit-job-header improvement? https://review.opendev.org/#/c/677971/ | 16:49 |
openstackgerrit | Jimmy McArthur proposed zuul/zuul-website master: CSS fix for ul/li in FAQ https://review.opendev.org/680465 | 16:49 |
*** ullbeking has joined #openstack-infra | 16:50 | |
*** jaosorior has quit IRC | 16:51 | |
AJaeger | zbr: you add molecule files, I do not see where those are used - and I agree with pabelanger that this is in the inventory already... So, I'm torn whether this is really beneficial... | 16:54 |
*** smarcet has quit IRC | 16:54 | |
openstackgerrit | Jimmy McArthur proposed zuul/zuul-website master: CSS fix for ul/li in FAQ https://review.opendev.org/680465 | 16:55 |
zbr | AJaeger: molecule files enable a developer to do local testing (did not want to force you to use that tool). Re the comment, users already remarked that the inventory is not available until the job finishes and also is not visible in the console log. Do we worry about adding a few bytes to the console? | 16:57 |
zbr | arguably even the summary page does not include any information about ansible version used, or which python interpreter was used. | 16:58 |
AJaeger | I'm missing the relevance, so will abstain here. Also, adding the molecule files is something we should discuss on #zuul. | 16:58 |
clarkb | a few extra bytes is unlikely to be a problem when compared to the vast quantity of log output by jobs | 16:59 |
zbr | yep, i am for reducing log size, but starting where it matters. | 17:00 |
AJaeger | mnaser: we merged the swift log changes that clarkb wrote, is your maintenance over and do you want us to use mtl1 again? your change https://review.opendev.org/678440 is still WIP | 17:00 |
AJaeger | bbl | 17:00 |
*** rosmaita has left #openstack-infra | 17:01 | |
clarkb | AJaeger: note its only for the base-test so far and I'm testing that it works as we speak (waiting on test results) | 17:01 |
zbr | regarding inclusion of test files, i am curious because we already do allow inclusion of stuff that is purely used for local testing, like the `venv` section in tox.ini files, or some other sections which are not run on CI. | 17:01 |
clarkb | AJaeger: mnaser but ya I'd love to push up a change that enables the build uuid sharding globally today and add vexxhost back to the mix if that is possible | 17:01 |
clarkb | mnaser: ^ let us know | 17:01 |
*** rh-jelabarre has quit IRC | 17:01 | |
*** ociuhandu has joined #openstack-infra | 17:01 | |
zbr | i am not trying to force anyone to adopt the tool, only reason why I included them was because I used them to test the change made to the role. | 17:02 |
clarkb | zbr: I think the concern is that if we aren't testing it then we'll likely break it via changes made over time | 17:02 |
zbr | and i do find them handy, but if that is a reason for not merging, i will remove them; it is easy to recreate them. | 17:02 |
*** markvoelker has quit IRC | 17:04 | |
*** igordc has quit IRC | 17:04 | |
fungi | zbr: we have historically used the testenv:venv definitions in some jobs, notably when producing release artifacts. at this stage they may be vestigial but i wouldn't wager a guess without additional research | 17:05 |
*** tesseract has quit IRC | 17:06 | |
zbr | sure. just put a comment with whatever i need to change, no hard feelings. the only thing really useful was to print python path/version and ansible version (also could prove useful for logstash searches in the future). | 17:07 |
*** jpena is now known as jpena|off | 17:07 | |
*** zigo has joined #openstack-infra | 17:10 | |
*** smarcet has joined #openstack-infra | 17:10 | |
clarkb | infra-root: config-core: https://openstack.fortnebula.com:13808/v1/AUTH_e8fd161dc34c421a979a9e6421f823e9/zuul_opendev_logs_141/680178/1/check/tox-py27/1416af4/ the build uuid prefixing seems to work | 17:10 |
*** markvoelker has joined #openstack-infra | 17:13 | |
*** igordc has joined #openstack-infra | 17:15 | |
openstackgerrit | Clark Boylan proposed opendev/base-jobs master: Shard build logs with build uuid in all base jobs https://review.opendev.org/680476 | 17:17 |
clarkb | infra-root ^ that will take us to production | 17:17 |
clarkb | and if mnaser gives us the go ahead we can enable vexxhost again once that is in? | 17:17 |
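(For readers following along: the sharding change takes the first three hex characters of the build UUID and appends them to the log container name, spreading builds across 4096 containers. A hedged sketch of the idea only — the role and variable names below are assumptions, not the exact contents of opendev/base-jobs; see https://review.opendev.org/680476 for the real change.)

```yaml
- hosts: localhost
  tasks:
    - name: Upload logs to a container sharded by build uuid
      include_role:
        name: upload-logs-swift
      vars:
        # e.g. zuul.build "1416af4..." -> container "zuul_opendev_logs_141"
        zuul_log_container: "zuul_opendev_logs_{{ zuul.build[:3] }}"
```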
*** udesale has quit IRC | 17:18 | |
*** armax has quit IRC | 17:20 | |
*** jaosorior has joined #openstack-infra | 17:21 | |
*** larainema has quit IRC | 17:22 | |
*** smarcet has quit IRC | 17:25 | |
johnsom | clarkb Got one: https://zuul.openstack.org/build/8478f4bf656b412b8c613d19e10b1c25 | 17:25 |
johnsom | https://www.irccloud.com/pastebin/5Mb14ZtR/ | 17:25 |
*** igordc has quit IRC | 17:26 | |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Flip the order of the emit-job-header tests https://review.opendev.org/680477 | 17:28 |
clarkb | johnsom: ya so something is breaking networking and that is ipv4 (not v6) | 17:29 |
*** smarcet has joined #openstack-infra | 17:29 | |
corvus | clarkb: did we delete all the logs_ containers in vexxhost? | 17:30 |
clarkb | corvus: I believe mnaser did it for us | 17:30 |
clarkb | because ceph was in a bad way | 17:30 |
corvus | kk | 17:31 |
johnsom | I still have that console open if you want me to look at the scrollback. On the surface it looks like ansible was happy before that. | 17:31 |
clarkb | corvus: note: I have not double checked that and that would be a good cleanup (we'll want to clean all the logs_ containers in all the providers in ~30 days too) | 17:31 |
corvus | clarkb: agree; let's just double check it then when we do that | 17:31 |
clarkb | ++ | 17:32 |
*** ociuhandu_ has joined #openstack-infra | 17:32 | |
AJaeger | clarkb, corvus , change I5af7749fefec61f1e9fe8379266e799184a13807 added minimal retention only to base, but not base-minimal and base-test jobs - could you double check the 1month retention, please? See my comment at 680476... | 17:33 |
openstackgerrit | Merged opendev/zone-opendev.org master: Increase mirror ttls to an hour https://review.opendev.org/680236 | 17:33 |
clarkb | AJaeger: are you ok with that as a followup? | 17:33 |
AJaeger | clarkb: it's unrelated to your change ;) I just noticed it when reviewing - so, followup is fine and your change is approved | 17:34 |
openstackgerrit | Clark Boylan proposed opendev/base-jobs master: Set container object expiry to 30 days https://review.opendev.org/680480 | 17:34 |
clarkb | AJaeger: ^ thank you for the reviews | 17:35 |
AJaeger | great! thanks | 17:35 |
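(Background on the expiry change just approved: Swift implements per-object expiry via the X-Delete-After / X-Delete-At headers set at upload time. The CLI example below only illustrates the mechanism; the actual change sets the equivalent option in the log upload role, and the container and object names are illustrative.)

```bash
# 30 days expressed in seconds
swift upload --header "X-Delete-After: $((30 * 24 * 3600))" \
  zuul_opendev_logs_141 job-output.txt
```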
*** ricolin has quit IRC | 17:35 | |
*** ociuhandu has quit IRC | 17:36 | |
*** ociuhandu_ has quit IRC | 17:36 | |
clarkb | johnsom: looking at that it happens after tempest starts but fairly early | 17:36 |
*** smarcet has quit IRC | 17:36 | |
clarkb | johnsom: that almost implies to me it is either an import time issue (testr will import all the test files to find the tests) or something separate that happens to just line up with that timing wise | 17:36 |
johnsom | Yeah, none of the tests completed. I had four other consoles open at the same time, they all went into the tests. | 17:37 |
clarkb | johnsom: unfortunately worlddump won't help us when connectivity has gone away completely | 17:37 |
clarkb | johnsom: we probably want to see if it is cloud or region specific as a next step | 17:40 |
clarkb | johnsom: if it is then it could also be a provider problem | 17:41 |
openstackgerrit | Merged opendev/base-jobs master: Shard build logs with build uuid in all base jobs https://review.opendev.org/680476 | 17:41 |
johnsom | That ssh IP implies RAX IAD | 17:41 |
*** smarcet has joined #openstack-infra | 17:48 | |
clarkb | oh neat that change merged already | 17:50 |
clarkb | exciting! | 17:50 |
*** jamesmcarthur has quit IRC | 17:51 | |
clarkb | mnaser: when you have a moment your feedback on whether or not these changes make you more comfortable running the log uploads against vexxhost again would be great | 17:51 |
mnaser | clarkb: i guess the only thing that still feels like a bother is the sheer number of ara-reports :( | 17:57 |
clarkb | mnaser: I think corvus may be fiddling with options around that? dmsimard is also brainstorming fixes here https://etherpad.openstack.org/p/Vz5IzxlWFz | 17:58 |
clarkb | mnaser: did we see problems with the logs_XY containers in addition to the periodic container? | 17:58 |
*** smarcet has quit IRC | 17:59 | |
corvus | i expect to be able to propose a change that removes the top-level ara generation that happens to every job by the end of the week, if that's the way we want to go. it won't affect nested aras (eg, devstack). it's not clear to me what the balance between the two is (ie, of the X ara files, what percentage comes from the zuul run versus nested runs) | 18:00 |
*** smarcet has joined #openstack-infra | 18:00 | |
clarkb | note devstack doesn't do a nested ara I don't think | 18:01 |
clarkb | but osa tripleo etc all do | 18:01 |
*** rh-jelabarre has joined #openstack-infra | 18:01 | |
*** jamesmcarthur has joined #openstack-infra | 18:04 | |
*** armax has joined #openstack-infra | 18:04 | |
*** mattw4 has quit IRC | 18:04 | |
*** mattw4 has joined #openstack-infra | 18:05 | |
*** smarcet has quit IRC | 18:06 | |
*** smarcet has joined #openstack-infra | 18:07 | |
*** jamesmcarthur has quit IRC | 18:07 | |
*** diga has quit IRC | 18:10 | |
*** smarcet has quit IRC | 18:16 | |
clarkb | I've approved the gus2019 change for ssh idle timeout updates | 18:19 |
clarkb | that should release the replication disabled on start change too | 18:20 |
clarkb | once that gets applied to the gerrit server I'll work on doing some restarts (review-dev first) | 18:20 |
*** jamesmcarthur has joined #openstack-infra | 18:23 | |
paladox | you may also want to enable change.disablePrivateChanges when you upgrade gerrit. | 18:24 |
clarkb | paladox: we effectively disabled them via preventing pushes to refs/for/draft or whatever that meta ref is | 18:25 |
clarkb | but I guess drafts went away and there is a new thing now? | 18:25 |
paladox | this is a new thing | 18:25 |
paladox | no longer uses refs. | 18:25 |
paladox | users can create a open change and put it as private | 18:25 |
paladox | like wip mode | 18:25 |
clarkb | ah | 18:25 |
* paladox enabled that at wikimedia | 18:25 | |
paladox | as we wanted everything to be open | 18:26 |
clarkb | ya also tends to create confusion when changes depend on a change that has since become private and so on | 18:26 |
clarkb | however it may be useful for security bug fixes | 18:26 |
paladox | yup | 18:26 |
clarkb | we'll likely have to experiment with it | 18:26 |
paladox | i've found that changes are not really hidden as the feature makes it out to be. | 18:27 |
clarkb | ah | 18:27 |
clarkb | in that case maybe not great for security bugs :) | 18:27 |
paladox | I filed a bug about this some where upstream | 18:27 |
paladox | https://bugs.chromium.org/p/gerrit/issues/detail?id=8111 | 18:28 |
*** kjackal has quit IRC | 18:28 | |
fungi | fwiw, that was one of the big problems about the old drafts feature as well, they claimed to be private but they were leaky | 18:30 |
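(The two Gerrit options under discussion both live in gerrit.config. The option names are real Gerrit settings; the values below are only illustrative.)

```ini
[sshd]
        # drop ssh connections (e.g. stream-events) after this idle period
        idleTimeout = 2d

[change]
        # refuse creation of private changes entirely
        disablePrivateChanges = true
```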
*** kjackal has joined #openstack-infra | 18:31 | |
*** mriedem has quit IRC | 18:31 | |
*** mriedem has joined #openstack-infra | 18:33 | |
*** dtantsur is now known as dtantsur|afk | 18:36 | |
clarkb | did we decide that https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L5263-L5267 is the cause of the openstack tenant config error because that repo doesn't have the zuul app on it? | 18:36 |
*** igordc has joined #openstack-infra | 18:36 | |
corvus | paladox: thanks -- added a note to https://etherpad.openstack.org/p/gerrit-upgrade | 18:37 |
paladox | yw :) | 18:37 |
corvus | clarkb: unsure; wanted to check with ianw upon his awakening | 18:38 |
clarkb | k | 18:38 |
paladox | corvus also native packaging is https://gerrit.googlesource.com/gerrit-installer/ :) | 18:40 |
corvus | paladox: yeah, we're going to make our own image builds (mostly so that we have the option to fork if necessary, and can include exactly the plugins we use). that's working now, i think we just need to tweak some ownership/permissions stuff to match our current server in order to upgrade | 18:41 |
*** mattw4 has quit IRC | 18:41 | |
paladox | ah, ok :) | 18:41 |
corvus | our dockerfile is heavily based on the gerritforce one :) | 18:41 |
paladox | heh | 18:41 |
corvus | gerritforge even | 18:41 |
clarkb | I want to say the old step 0 plan was to switch to 2.13 via docker | 18:41 |
clarkb | then upgrade via docker | 18:41 |
corvus | clarkb: yeah, i think that's still good | 18:42 |
paladox | corvus upgrading to NoteDB was cool! (dboritz added the notedb.config per my request as we use puppet at wikimedia) | 18:45 |
*** pkopec has quit IRC | 18:47 | |
paladox | All we did was set https://github.com/wikimedia/puppet/commit/06c8e4122c37508045d84840ac1cb23f4f7d9011#diff-4c58f684fb8a36946bc7616d35570c00 then after the upgrade https://github.com/wikimedia/puppet/commit/d0b08b9675438fe637374a165fdf28c375c3510a#diff-4c58f684fb8a36946bc7616d35570c00 | 18:49 |
*** eernst has joined #openstack-infra | 18:51 | |
openstackgerrit | Merged opendev/puppet-gerrit master: Set a default idle timeout for ssh connections https://review.opendev.org/678413 | 18:51 |
paladox | ah, nice to see ^^ | 18:54 |
paladox | we done similar https://github.com/wikimedia/puppet/commit/cf5c343cc787c46cce2d4d1f91b2ab0c09d3492f#diff-4c58f684fb8a36946bc7616d35570c00 | 18:54 |
paladox | (task is hidden) | 18:54 |
*** e0ne has joined #openstack-infra | 18:54 | |
openstackgerrit | Merged opendev/puppet-gerrit master: Add support for replicateOnStartup config option https://review.opendev.org/678486 | 18:55 |
openstackgerrit | Merged opendev/system-config master: Don't run replication on gerrit startup https://review.opendev.org/678487 | 18:55 |
corvus | yeah, that was an amusing moment at the hackathon... most people couldn't understand how our server functioned at all without a timeout set, and we were like "there's a timeout option?" | 18:55 |
*** eernst has quit IRC | 18:56 | |
paladox | corvus it fixes other issues too :) | 18:56 |
paladox | which i didn't realise at all could happen | 18:56 |
clarkb | I'm pretty sure the timeout was set by default in older gerrit | 18:56 |
clarkb | because I remember zuul losing its connection to our idle review-dev server | 18:56 |
clarkb | I'll watch for those to apply to review-dev and restart it | 18:57 |
*** eernst has joined #openstack-infra | 18:58 | |
paladox | corvus you may also want to enable https://github.com/wikimedia/puppet/commit/0564af76c6067f58d5622c8f81ec36d3793f2ddd#diff-4c58f684fb8a36946bc7616d35570c00 (affects gerrit 2.16+) | 18:59 |
openstackgerrit | Jimmy McArthur proposed zuul/zuul-website master: Removing erroneous og images https://review.opendev.org/680488 | 19:00 |
*** eernst has quit IRC | 19:02 | |
*** armstrong has quit IRC | 19:04 | |
*** e0ne has quit IRC | 19:07 | |
*** trident has quit IRC | 19:09 | |
openstackgerrit | Jimmy McArthur proposed zuul/zuul-website master: Replacing OG images with Zuul icon https://review.opendev.org/680490 | 19:10 |
*** eernst has joined #openstack-infra | 19:10 | |
*** e0ne has joined #openstack-infra | 19:12 | |
*** e0ne has quit IRC | 19:14 | |
*** eernst has quit IRC | 19:15 | |
*** e0ne has joined #openstack-infra | 19:16 | |
*** eernst has joined #openstack-infra | 19:17 | |
*** e0ne_ has joined #openstack-infra | 19:18 | |
*** trident has joined #openstack-infra | 19:20 | |
*** e0ne has quit IRC | 19:21 | |
*** eernst has quit IRC | 19:22 | |
*** ralonsoh has quit IRC | 19:23 | |
openstackgerrit | James E. Blair proposed zuul/zuul master: Web: rely on new attributes when determining task failure https://review.opendev.org/680498 | 19:34 |
mriedem | is it normal to have a patch queued for nearly 20 hours right now? | 19:34 |
mriedem | (679473,1) Handle VirtDriverNotReady in _cleanup_running_deleted_instances (19h18m/ | 19:34 |
AJaeger | mriedem: not normal - we had a cloud failure and needed some restart, so that bit us ;( | 19:36 |
AJaeger | mriedem: so, we have quite a backlog at the moment ;/ | 19:37 |
corvus | the trend is heading in the right direction: \ | 19:37 |
*** jamesmcarthur has quit IRC | 19:37 | |
corvus | it was / then - now \ (ascii sparklines) | 19:37 |
*** jamesmcarthur has joined #openstack-infra | 19:38 | |
AJaeger | ;) | 19:38 |
*** rosmaita has joined #openstack-infra | 19:42 | |
*** jamesmcarthur has quit IRC | 19:42 | |
*** eernst has joined #openstack-infra | 19:43 | |
openstackgerrit | James E. Blair proposed opendev/base-jobs master: Remove ara from base-test https://review.opendev.org/680500 | 19:43 |
openstackgerrit | James E. Blair proposed opendev/base-jobs master: Remove ara from base-test https://review.opendev.org/680501 | 19:43 |
openstackgerrit | James E. Blair proposed opendev/base-jobs master: Remove ara from base https://review.opendev.org/680501 | 19:44 |
corvus | clarkb, fungi, AJaeger, dmsimard: ^ that's an option i think we can consider now | 19:45 |
corvus | mnaser: ^ | 19:45 |
*** eernst has quit IRC | 19:46 | |
*** eernst has joined #openstack-infra | 19:47 | |
*** e0ne_ has quit IRC | 19:49 | |
slittle1 | now we need additional cores to be added to the ten new repos created for starlingx yesterday | 19:52 |
*** e0ne has joined #openstack-infra | 19:54 | |
slittle1 | Can I add them directly myself? Or do I need to go through an admin ? | 19:55 |
slittle1 | hmmm ... I'm not even on the core list. Guess I need an admin | 19:55 |
clarkb | we add the first user then you add the rest | 19:55 |
clarkb | give me a minute and I'll add you | 19:56 |
slittle1 | ok, great | 19:56 |
clarkb | just have to find the change again to get my list of repos | 19:57 |
*** e0ne_ has joined #openstack-infra | 19:59 | |
*** e0ne has quit IRC | 20:00 | |
clarkb | slittle1: and done. You should be able to self manage the group membership now | 20:00 |
slittle1 | great, thanks | 20:00 |
*** eharney has quit IRC | 20:01 | |
clarkb | I have restarted review-dev and it seemed to come up happy | 20:02 |
clarkb | checking the replication log there are no entries in it for today | 20:02 |
clarkb | I think that means the no replication on start flag is working | 20:03 |
clarkb | I'm going to start a stream events ssh connection and see if it goes away in an hour | 20:03 |
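(The manual check clarkb describes amounts to holding an idle Gerrit ssh session open and seeing whether the server drops it once the idle timeout elapses; a hedged example, with the username and host as placeholders.)

```bash
# port 29418 is Gerrit's standard ssh port; the server should close this
# connection once the configured idleTimeout passes with no activity
ssh -p 29418 someuser@review-dev.opendev.org gerrit stream-events
```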
*** pkopec has joined #openstack-infra | 20:03 | |
openstackgerrit | Paul Belanger proposed zuul/zuul master: WIP: Support Ansible 2.9 https://review.opendev.org/674854 | 20:04 |
AJaeger | corvus: Just checked the zuul UI for a job result - yes, I think we can drop ara | 20:04 |
clarkb | no objections from me re dropping the root ara. We should send a note to the mailing list if we merge the production change though as people will notice | 20:07 |
clarkb | but then maybe we can turn on vexxhost when mnaser is in a spot to keep an eye on his side of things and we monitor it? | 20:07 |
paladox | clarkb note that changing replication.config while gerrit is running effectively breaks replication until gerrit is restarted. (it's a known issue upstream, it bit us at wikimedia) | 20:08 |
clarkb | paladox: we've not experienced that (we even have the replication config set to reload the plugin on updates) | 20:08 |
clarkb | and we tested that and have made such changes multiple times | 20:08 |
paladox | oh | 20:08 |
*** jamesmcarthur has joined #openstack-infra | 20:09 | |
clarkb | could be the older plugin version is fine? | 20:09 |
paladox | it bit us, we didn't realise it was a bug until we investigated furthur. | 20:09 |
paladox | possibly | 20:09 |
clarkb | I plan on restarting production gerrit today too, though i do need to pop out for a bit now then do phone calls | 20:09 |
clarkb | wanting to verify the sshd idleTimeout first though | 20:10 |
fungi | clarkb: i thought what we observed was that if the replication config is modified while gerrit is running, it discards all pending replication events in its queue | 20:13 |
fungi | and so we noted that maybe we should just disable that "feature" instead | 20:14 |
fungi | (but i don't recall whether we got around to actually disabling it) | 20:14 |
openstackgerrit | Merged opendev/base-jobs master: Remove ara from base-test https://review.opendev.org/680500 | 20:16 |
clarkb | oh ya that may be | 20:16 |
*** e0ne_ has quit IRC | 20:19 | |
paladox | fungi that's been fixed i think unless it was reverted. | 20:26 |
fungi | well, odds are it wasn't fixed back in 2.13 (or we missed picking up a relevant backport) | 20:29 |
paladox | fungi https://bugs.chromium.org/p/gerrit/issues/detail?id=10260 | 20:32 |
*** lpetrut has joined #openstack-infra | 20:34 | |
openstackgerrit | Paul Belanger proposed zuul/zuul master: Switch ansible_default to 2.8 https://review.opendev.org/676695 | 20:38 |
openstackgerrit | Paul Belanger proposed zuul/zuul master: WIP: Support Ansible 2.9 https://review.opendev.org/674854 | 20:38 |
*** lpetrut has quit IRC | 20:38 | |
*** ociuhandu has joined #openstack-infra | 20:39 | |
*** Goneri has quit IRC | 20:44 | |
*** ociuhandu has quit IRC | 20:45 | |
*** ociuhandu has joined #openstack-infra | 20:46 | |
*** jamesmcarthur has quit IRC | 20:48 | |
*** iurygregory has quit IRC | 20:50 | |
*** ociuhandu has quit IRC | 20:50 | |
*** jamesmcarthur has joined #openstack-infra | 20:50 | |
clarkb | corvus: what is the next step for https://review.opendev.org/#/c/680501/ do we have a test change for that yet? I guess you can use https://review.opendev.org/#/c/680178/ which I have just rechecked. | 20:58 |
clarkb | about now is when I expected review-dev to kill my ssh connection | 21:02 |
clarkb | it hasn't happened yet. | 21:03 |
clarkb | oh it just happened \o/ | 21:03 |
clarkb | ok those two changes are confirmed to be working on review-dev I think | 21:03 |
clarkb | next step is restarting gerrit on review.o.o | 21:03 |
clarkb | are any other roots around? should I just go for it? | 21:04 |
clarkb | hrm there is one change in the release pipeline I'll go let the release team know | 21:04 |
*** markvoelker has quit IRC | 21:05 | |
*** diablo_rojo has quit IRC | 21:06 | |
*** diablo_rojo has joined #openstack-infra | 21:07 | |
clarkb | fungi: are you around enough to be a second set of hands/eyeballs if needed for gerrit restart? | 21:08 |
*** rh-jelabarre has quit IRC | 21:08 | |
fungi | clarkb: sure, i've just finished up post-election tasks (i hope) | 21:09 |
clarkb | fungi: maybe you can write up a status notice? and I'll login and double check configs are in place on that server and proceed with restarting it | 21:09 |
*** mattw4 has joined #openstack-infra | 21:10 | |
clarkb | configs are in place I'm ready to stop gerrit and start it when you are | 21:10 |
*** markvoelker has joined #openstack-infra | 21:11 | |
clarkb | how about #status notice Gerrit is being restarted to pick up configuration changes. Should be quick. Sorry for the interruption. | 21:11 |
fungi | that wfm, i had one half-typed | 21:11 |
fungi | but slow going as i'm not at my usual keyboard | 21:12 |
clarkb | I'll start asking systemd to do that nicely if you want to send the notice | 21:12 |
fungi | #status notice Gerrit is being restarted to pick up configuration changes. Should be quick. Sorry for the interruption. | 21:12 |
openstackstatus | fungi: sending notice | 21:12 |
-openstackstatus- NOTICE: Gerrit is being restarted to pick up configuration changes. Should be quick. Sorry for the interruption. | 21:14 | |
clarkb | the log file and systemd think it is running | 21:14 |
*** markvoelker has quit IRC | 21:15 | |
clarkb | and gerrit queue is largely empty | 21:15 |
clarkb | (so replication config appears to have worked here too) | 21:15 |
openstackstatus | fungi: finished sending notice | 21:16 |
clarkb | web ui is up and running for me | 21:16 |
openstackgerrit | Jimmy McArthur proposed zuul/zuul-website master: OK, trying again to update with the correct og image tags, removing the erroneous gitlab tags. https://review.opendev.org/680520 | 21:16 |
fungi | working for me | 21:16 |
clarkb | seems like people can push to it too ^ :) | 21:17 |
*** pkopec has quit IRC | 21:17 | |
fungi | indeed! | 21:17 |
clarkb | context switching: the ara removal seems to have worked fine. https://33989e35e43da1db0b96-a619ea89024f9935a8230ca8f397a8a1.ssl.cf2.rackcdn.com/680178/1/check/tox-py35/e5c7f47/ maybe want to double check the dashboard renders that once sql reporting happens | 21:17 |
clarkb | https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f83/680178/1/check/tox-py27/f839885/ as well | 21:17 |
openstackgerrit | Jimmy McArthur proposed zuul/zuul-website master: OK, trying again to update with the correct og image tags, removing the erroneous gitlab tags. https://review.opendev.org/680520 | 21:19 |
openstackgerrit | Jimmy McArthur proposed zuul/zuul-website master: OK, trying again to update with the correct og image tags, removing the erroneous gitlab tags. https://review.opendev.org/680520 | 21:20 |
clarkb | corvus: I +2'd https://review.opendev.org/#/c/680501/2 as well as AJaeger and fungi so I think that can happen as soon as we like | 21:20 |
clarkb | I can also send email about it to the mailing lists if you would prefer I did that | 21:20 |
clarkb | (but will wait for the change to be approved before doing that) | 21:20 |
*** bdodd_ has joined #openstack-infra | 21:26 | |
*** bdodd has quit IRC | 21:27 | |
clarkb | fungi: https://review.opendev.org/#/c/680480/ makes base-test and base-minimal match base on their object expiry dates. https://review.opendev.org/#/c/680477/ is a simple order change to make the last emit-job-header url match what will actually be uploaded to | 21:28 |
clarkb | those are the last two things I had related to swift and corvus' ara change is the other thing | 21:28 |
clarkb | looks like corvus just removed his WIP on that change. | 21:29 |
*** jamesmcarthur has quit IRC | 21:29 | |
corvus | yep -- though i guess you were thinking of waiting until 680178 reports | 21:30 |
clarkb | ya though I guess I'm not super concerned about that since we can confirm the dashboard works with ara in place too | 21:31 |
clarkb | at first I was thinking we needed the sql report to confirm that then realized we don't | 21:31 |
clarkb | since all jobs with or without ara get that dashboard stuff | 21:31 |
corvus | yeah, i think the thing to check would be if there's something weird about the file layout without the generated report | 21:32 |
corvus | it's more of an unknown unknown thing :) | 21:32 |
clarkb | k happy to wait for that then | 21:32 |
*** jcoufal has quit IRC | 21:32 | |
*** jamesmcarthur has joined #openstack-infra | 21:50 | |
openstackgerrit | Merged opendev/base-jobs master: Set container object expiry to 30 days https://review.opendev.org/680480 | 21:53 |
*** jamesmcarthur has quit IRC | 21:55 | |
*** slaweq has quit IRC | 22:03 | |
openstackgerrit | Merged opendev/base-jobs master: Remove ara from base https://review.opendev.org/680501 | 22:06 |
*** claudiub has joined #openstack-infra | 22:07 | |
*** AJaeger_ has joined #openstack-infra | 22:10 | |
*** kjackal has quit IRC | 22:10 | |
*** slaweq has joined #openstack-infra | 22:11 | |
*** AJaeger has quit IRC | 22:14 | |
*** jamesmcarthur has joined #openstack-infra | 22:16 | |
ianw | clarkb: re zuul errors for testinfra @ https://opendev.org/openstack/project-config/src/branch/master/zuul.d/projects.yaml#L5263-L5267 | 22:17 |
ianw | the project does have the zuul app on it -- but i don't think it did have the app when zuul actually started | 22:18 |
clarkb | ah so if we trigger a full reconfigure that may go away? | 22:18 |
ianw | maybe? | 22:18 |
clarkb | corvus: on 680178 zuul has decided it needs to run all the tests against that change now? | 22:19 |
clarkb | odd | 22:19 |
pabelanger | ianw: clarkb: IIRC, you might need to stop / start zuul, in that case. I've seen that with github, but cannot remember the fix ATM | 22:20 |
pabelanger | for now, I always try to make sure github app is installed before adding it to zuul tenant config | 22:20 |
*** jamesmcarthur has quit IRC | 22:21 | |
*** slaweq has quit IRC | 22:21 | |
corvus | we should try a full reconfiguration. if that does not fix the problem it's a bug. | 22:21 |
corvus | a restart should *never* be necessary. | 22:21 |
openstackgerrit | Clark Boylan proposed zuul/zuul-website master: Add Zuul FAQ page https://review.opendev.org/679670 | 22:22 |
pabelanger | ianw: have you started working on adding fedora-30 to nodepool-builders? or does that need dib release | 22:22 |
ianw | pabelanger: i just need to update for that gate missing you mentioned, then i think we can dib release | 22:22 |
openstackgerrit | Clark Boylan proposed zuul/zuul-website master: CSS fix for ul/li in FAQ https://review.opendev.org/680465 | 22:23 |
pabelanger | ianw: great! will hold off on testing until then | 22:23 |
*** eernst has quit IRC | 22:23 | |
*** jamesmcarthur has joined #openstack-infra | 22:25 | |
*** prometheanfire has quit IRC | 22:25 | |
*** prometheanfire has joined #openstack-infra | 22:26 | |
corvus | clarkb: yep, that's pretty weird, it shouldn't be doing that. | 22:26 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Add fedora-30 testing to gate https://review.opendev.org/680531 | 22:26 |
corvus | most of those jobs don't inherit from unittests | 22:26 |
ianw | clarkb / corvus: there does however seem to be something wrong with running the system-config job against upstream testinfra ... https://zuul.opendev.org/t/openstack/build/a018575d4d4e4710b763978069c9cf12/log/job-output.txt#3559 | 22:28 |
ianw | "Cloning into '/opt/system-config'...\nfatal: '/home/zuul/src/opendev.org/opendev/system-config' does not appear to be a git repository\ | 22:28 |
ianw | it doesn't run that often, but i think it's failed like that every time | 22:28 |
clarkb | ianw: you may need to add that repo as a required project on the job | 22:28 |
ianw | huh ... yeah, i didn't think of that ... different project | 22:29 |
*** jamesmcarthur has quit IRC | 22:30 | |
corvus | yeah, if you look here you can see the command: https://zuul.opendev.org/t/openstack/build/a018575d4d4e4710b763978069c9cf12/console#2/1/2/bridge.openstack.org | 22:30 |
corvus | it's what clarkb said | 22:30 |
ianw | haha, sorry yeah seems obvious now. here i am thinking that the base jobs have some missing matching or something crazy | 22:31 |
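(The fix ianw pushes next adds the repository the job clones to the job's required-projects so Zuul prepares it on the node. A hedged sketch of what that looks like — the job name here is hypothetical, not the real one in project-config.)

```yaml
- job:
    name: testinfra-system-config-integration  # hypothetical job name
    required-projects:
      - opendev/system-config
```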
openstackgerrit | Ian Wienand proposed openstack/project-config master: testinfra : add system-config as required project https://review.opendev.org/680534 | 22:36 |
*** kaisers has quit IRC | 22:37 | |
openstackgerrit | Merged zuul/zuul-jobs master: Flip the order of the emit-job-header tests https://review.opendev.org/680477 | 22:37 |
*** kaisers has joined #openstack-infra | 22:38 | |
*** eernst has joined #openstack-infra | 22:43 | |
*** eernst has quit IRC | 22:44 | |
openstackgerrit | Ian Wienand proposed openstack/project-config master: testinfra : add system-config as required project https://review.opendev.org/680534 | 22:44 |
ianw | ... duur not openstack/system-config ... get with the times :) | 22:45 |
*** eernst has joined #openstack-infra | 22:46 | |
*** igordc has quit IRC | 22:48 | |
*** mriedem has quit IRC | 22:50 | |
*** eernst has quit IRC | 22:51 | |
ianw | corvus / clarkb : so was the conclusion i should SIGHUP zuul? | 22:52 |
clarkb | ianw: or run the zuul-scheduler command that reconfigures it (I think it is a zuul-scheduler command) | 22:53 |
clarkb | either way should work | 22:53 |
corvus | zuul-scheduler full-reconfigure | 22:53 |
corvus | that's the future, but sighup should still work for now i think | 22:53 |
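(For reference, the two reconfiguration paths mentioned above, both run on the scheduler host; the pgrep lookup assumes the process is visible under the name zuul-scheduler.)

```bash
# preferred: ask the running scheduler to reload its full configuration
zuul-scheduler full-reconfigure

# older equivalent (still supported at the time of this log): send SIGHUP
sudo kill -HUP "$(pgrep -x zuul-scheduler)"
```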
*** exsdev has quit IRC | 22:54 | |
ianw | ok, just ran full-reconfigure | 22:54 |
*** exsdev has joined #openstack-infra | 22:54 | |
fungi | https://zuul-ci.org/docs/zuul/admin/components.html#operation | 22:55 |
fungi | for reference | 22:55 |
corvus | it'll take a while -- it's done when the timestamp on the bottom of the status page updates | 22:55 |
clarkb | https://etherpad.openstack.org/p/ara-removed-from-jobs how does that look for giving people notice of the ara change | 22:56 |
corvus | clarkb: seems good | 22:57 |
fungi | looks fine to me | 22:58 |
clarkb | I'll send that to the zuul airship and starlingx ml too | 23:01 |
clarkb | (but will send separate emails) | 23:01 |
ianw | so last reconfigured -> Fri Sep 06 2019 08:59:51 GMT+1000 (Australian Eastern Standard Time) | 23:02 |
*** tkajinam has joined #openstack-infra | 23:02 | |
ianw | (that was 2 minutes ago) | 23:02 |
corvus | ianw: so still no joy there. if a restart fixes it, we have a bug in zuul; if it doesn't, then something something github | 23:03 |
corvus | it is, however, not a good time for a restart | 23:03 |
ianw | yeah, just wait for something else to come up i guess. but it does report (despite the job dependency issue) ... so it is talking to github clearly | 23:04 |
clarkb | https://zuul.opendev.org/t/zuul/build/f83988533a4847b4ad6e7e1948755938/logs has no ara-report https://zuul.opendev.org/t/zuul/build/f83988533a4847b4ad6e7e1948755938/console is happy | 23:08 |
clarkb | I'll send the email now | 23:08 |
*** slaweq has joined #openstack-infra | 23:11 | |
clarkb | mnaser: to catch up, the major changes we've made are to: use an opendev specific container name prefix so that multiple zuul installs can run against a single ceph install (addressing the global container namespace), suffix container names with the first three characters of the build uuid (sharding all builds into 4096 containers), and remove the top level zuul ara report | 23:15 |
clarkb | some jobs will still run ara internally and we haven't changed those, but we did stop creating a report for every job | 23:16 |
*** slaweq has quit IRC | 23:16 | |
openstackgerrit | Merged openstack/project-config master: testinfra : add system-config as required project https://review.opendev.org/680534 | 23:21 |
*** rcernin has joined #openstack-infra | 23:23 | |
*** threestrands has joined #openstack-infra | 23:30 | |
ianw | can we trigger rechecks via github comments? | 23:33 |
clarkb | yes "recheck" should work just lik ewith gerrit | 23:34 |
ianw | hrm doesn't for testinfra, but perhaps that is related to the config error. maybe new events work but not update checks? | 23:36 |
aspiers | Is it possible to configure zuul to fail early if certain jobs fail? The docs suggest that pipeline.failure can do it | 23:37 |
aspiers | but as you can tell I haven't a clue about zuul config yet ... | 23:37 |
ianw | aspiers: this comes up a bit, you can setup dependencies ... | 23:42 |
*** dchen has joined #openstack-infra | 23:42 | |
aspiers | I mean, e.g. if the pep8 job fails, then immediately cancel a 2 hour tempest job | 23:43 |
aspiers | that kind of thing | 23:43 |
aspiers | so that CI resources aren't wasted on broken changes | 23:43 |
aspiers | although it would have to be cleverer than that | 23:43 |
clarkb | we actually did try this a long time ago | 23:43 |
clarkb | the problem is then you end up pushing many patchsets as you work through multiple failures | 23:44 |
clarkb | rather than getting as complete a picture as possible upfront | 23:44 |
aspiers | yes, I think there's a trade-off to be had somewhere | 23:44 |
clarkb | but zuul does still support that fail fast iirc | 23:44 |
clarkb | (we never removed the functionality just stopped using it) | 23:44 |
aspiers | e.g. in nova, always run all the shorter running unit/functional test suites, but if any of those fail, then cancel the really slow and expensive tempest / grenade jobs | 23:45 |
aspiers | I get the impression that 90% of nova failures are caught by the unit/functional tests | 23:45 |
aspiers | ICBW | 23:45 |
clarkb | oh ya for that you have to set up the dependencies | 23:45 |
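(A hedged sketch of the dependency setup clarkb refers to: the long tempest job only starts once the quick pep8 job has passed. The job names are illustrative, and as discussed above this trades a complete first-pass picture for shorter failed runs.)

```yaml
- project:
    check:
      jobs:
        - openstack-tox-pep8
        - tempest-full:
            dependencies:
              - openstack-tox-pep8
```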
aspiers | the problem with not cancelling is that the queue gets really backlogged, like it is right now (hence my asking) | 23:46 |
aspiers | For example 680296,1 has already failed openstack-tox-pep8 but some other jobs have been running for ALMOST 19 HOURS(!), taking up valuable CI resources | 23:46 |
clarkb | they have been in the queue for that long, not running for that long | 23:47 |
aspiers | oh OK | 23:47 |
aspiers | but still | 23:47 |
aspiers | anything waiting in the queue that long is not good | 23:47 |
aspiers | https://www.klipfolio.com/blog/cycle-time-software-development | 23:47 |
clarkb | yes but the problem is due to external factors which have been corrected and we are now waiting to catch back up again | 23:48 |
aspiers | actually https://codeclimate.com/blog/software-engineering-cycle-time/ is a better read | 23:48 |
aspiers | ohhhh OK | 23:48 |
aspiers | now you mention it, I did vaguely notice an IRC announcement about Gerrit earlier | 23:48 |
clarkb | but also gate failures have significantly larger impacts on queue times because of the cost of gate resets | 23:48 |
aspiers | brain too fried to make the connection ;-) | 23:48 |
clarkb | this is why EVERY time this topic comes up I point people to elastic-recheck | 23:49 |
clarkb | and ask people to focus on fixing the bugs in our software | 23:49 |
clarkb | as step 0 | 23:49 |
aspiers | yeah :) | 23:49 |
clarkb | because then we win with better software and shorter queues | 23:49 |
clarkb | instead of just shorter queues with just as broken software | 23:49 |
aspiers | I was assuming the backlog was due to nova being in the last week before feature freeze, hence a flurry of reviews | 23:49 |
aspiers | Thanks a lot for all the info. As usual this channel is like 19 steps ahead of my thoughts ;-) | 23:51 |
aspiers | Back to writing my first ever tempest tests \o/ | 23:51 |
*** rcernin is now known as rcernin|brb | 23:52 | |
clarkb | aspiers: the other way to make a large impact is to reduce job runtimes (which is the other thing I've brought up recently with devstack runtimes and OOMing jobs) | 23:53 |
clarkb | they get really slow due to lack of memory | 23:53 |
aspiers | makes sense | 23:53 |
*** jamesmcarthur has joined #openstack-infra | 23:58 |