fungi | i can't *quite* remember what openssl error 1416F086 is. i'm sure it'll come to me in a moment | 00:02 |
fungi | self-signed cert, i guess? | 00:03 |
fungi | and git-send-email isn't going to let the server present just any ol' cert, huh? | 00:04 |
*** ryohayakawa has joined #opendev | 00:05 | |
fungi | okay, jaraco/irc pr is all green now | 00:06 |
fungi | i substituted the python interpreter version in ctcp version replies, we'll see how that flies | 00:07 |
ianw | fungi: this is just set up to send via gmail (the corporate mail server) ... if anything should work i'd think it would be this :/ | 00:07 |
fungi | yikes. maybe something's wrong with the ca bundle on the client then? | 00:08 |
fungi | or maybe someone let a test cert leak into production at gmail | 00:08 |
ianw | ... no ... i had an old config file lying around from the last time i tried to use git send-email pointing to the old, internal corporate server, which must have been overriding my settings | 00:11 |
fungi | aha | 00:11 |
ianw | i wish i could just let this ipv6 thing go but it's my white whale :) | 00:13 |
*** shtepanie has joined #opendev | 00:42 | |
ianw | donnyd/fungi: this is something like what i'm proposing as a libvirt doc update for the nat address choice -> http://paste.openstack.org/show/796749/ ... seem about right? | 00:49 |
donnyd | I would maybe just use your real world example for the libvirt doc | 00:52 |
donnyd | other than that it LGTM | 00:53 |
ianw | donnyd: yeah, i guess the thing is https://tools.ietf.org/html/rfc4193#section-3.2.2 goes into great detail about how to generate a random /48 | 00:53 |
ianw | i figure if you put something in the doc, it just gets copied :) | 00:53 |
donnyd | That is correct | 00:54 |
donnyd | and its exactly what people do | 00:54 |
donnyd | I would say its less likely that people will go read the glyphs from ietf | 10:55 |
donnyd | Reading an RFC isn't really at the top of the list for most people on "things I will do with my evening" | 00:56 |
ianw | heh :) you can either have ipv6, or not read RFCs ... choose one :) | 00:56 |
ianw | i think in general, the world has chosen the latter | 00:57 |
clarkb | my ISP says that we may have ipv6 by the end of the year | 00:58 |
clarkb | will be the best outcome of them being bought out if it happens | 00:58 |
ianw | i don't know if it's a fedora bug that fd00: interfaces are not preferred over ipv4 | 00:59 |
ianw | from what i can tell, it's a practical decision: people had fc00::/7 addresses that didn't route anywhere, and it would cause all sorts of issues | 01:00 |
ianw | clarkb: odds that your ipv6 also comes with cgnat? :) | 01:18 |
clarkb | I doubt it will | 01:20 |
*** qchris has quit IRC | 01:45 | |
donnyd | Overall I think your post to the docs is a big value add; as ipv6 becomes less of a mystery to people over time, simple usable docs like this will only gain value. And I think that is what you wrote up. | 01:54 |
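As a concrete illustration of the RFC 4193 section 3.2.2 recipe ianw refers to, here is a minimal Python sketch of picking a random fd00::/8 /48. The exact inputs are an approximation (the RFC specifies an NTP-format timestamp concatenated with an EUI-64; uuid.getnode() plus some urandom bytes stand in for that here):

```python
import hashlib
import ipaddress
import os
import time
import uuid

def random_ula_prefix():
    # Roughly RFC 4193 3.2.2: hash a timestamp plus a host identifier and keep
    # the least significant 40 bits of the SHA-1 digest as the Global ID.
    seed = (time.time_ns().to_bytes(8, 'big')
            + uuid.getnode().to_bytes(6, 'big')
            + os.urandom(8))
    global_id = hashlib.sha1(seed).digest()[-5:]     # 40-bit Global ID
    packed = bytes([0xfd]) + global_id + bytes(10)   # fd00::/8 + Global ID, zero subnet/host bits
    return ipaddress.IPv6Network((ipaddress.IPv6Address(packed), 48))

print(random_ula_prefix())  # e.g. fdxx:xxxx:xxxx::/48
```

Generating a prefix this way, rather than copying the example out of a doc, is exactly the habit the paste above is trying to encourage.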
*** qchris has joined #opendev | 01:57 | |
*** shtepanie has quit IRC | 03:52 | |
*** dmsimard2 has joined #opendev | 04:12 | |
*** dmsimard has quit IRC | 04:13 | |
*** dmsimard2 is now known as dmsimard | 04:13 | |
*** ysandeep|away is now known as ysandeep | 04:14 | |
*** logan- has joined #opendev | 04:40 | |
*** weshay|pto has quit IRC | 06:12 | |
*** weshay_ has joined #opendev | 06:13 | |
*** DSpider has joined #opendev | 07:00 | |
*** openstackgerrit has joined #opendev | 07:00 | |
openstackgerrit | yatin proposed zuul/zuul-jobs master: Fix url for ARA report https://review.opendev.org/745792 | 07:00 |
*** ryohayakawa has quit IRC | 07:03 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: validate-host: skip linux nly tasks on windows machines https://review.opendev.org/745797 | 07:10 |
*** ssbarnea has joined #opendev | 07:18 | |
*** zbr has quit IRC | 07:29 | |
*** ssbarnea has quit IRC | 07:29 | |
*** zbr9 has joined #opendev | 07:29 | |
*** hashar has joined #opendev | 07:30 | |
*** tosky has joined #opendev | 07:41 | |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:01 | |
*** openstackgerrit has quit IRC | 08:09 | |
*** zbr9 has quit IRC | 08:12 | |
*** zbr has joined #opendev | 08:13 | |
*** tkajinam has quit IRC | 08:19 | |
*** Eighth_Doctor has quit IRC | 09:05 | |
*** mordred has quit IRC | 09:06 | |
*** Eighth_Doctor has joined #opendev | 09:14 | |
*** mordred has joined #opendev | 09:45 | |
*** openstackgerrit has joined #opendev | 10:04 | |
openstackgerrit | Riccardo Pittau proposed openstack/diskimage-builder master: Update name of ipa job https://review.opendev.org/743042 | 10:04 |
openstackgerrit | Riccardo Pittau proposed openstack/diskimage-builder master: Do not install python2 packages in ubuntu focal https://review.opendev.org/745665 | 10:16 |
openstackgerrit | Carlos Goncalves proposed openstack/diskimage-builder master: Add octavia-amphora-image-build-live jobs https://review.opendev.org/745823 | 10:20 |
cgoncalves | hey there! openstackgerrit is back online but has not joined #openstack-lbaas and #openstack-infra at least | 10:22 |
*** DSpider has quit IRC | 10:32 | |
*** DSpider has joined #opendev | 10:33 | |
*** hashar has quit IRC | 10:44 | |
*** lpetrut has joined #opendev | 10:45 | |
*** calcmandan has quit IRC | 10:48 | |
*** calcmandan has joined #opendev | 10:49 | |
AJaeger | cgoncalves: are you missing notifications? It only joins channels when there's something to notify about. There's a maximal channel limit a user/bot can be in, so it leaves/re-joins as needed. | 10:54 |
cgoncalves | AJaeger, definitely missing in #openstack-lbaas | 10:55 |
*** sshnaidm is now known as sshnaidm|afk | 10:57 | |
yoctozepto | it has not joined kolla either | 11:30 |
AJaeger | cgoncalves, yoctozepto: see above in this channel when it joined here - so, please give us a link to a change that should have been notified and didn't - and then somebody can check log files... | 11:46 |
cgoncalves | AJaeger, bot did not notify on #openstack-lbaas of https://review.opendev.org/#/c/745831 | 11:48 |
cgoncalves | other changes: https://review.opendev.org/#/c/745820/ & https://review.opendev.org/#/c/685337/ | 11:49 |
yoctozepto | I guess cgoncalves's changes are enough, in case of k&k-a it's a ton of these ;d | 11:50 |
AJaeger | thanks. Let's ask infra-root to investigate those ^ | 11:59 |
AJaeger | yoctozepto: yes, cgoncalves' are enough | 12:00 |
AJaeger | At least I hope so ;) | 12:00 |
*** hashar has joined #opendev | 12:11 | |
mnaser | infra-root: is http://mirror.ca-ymq-1.vexxhost.opendev.org having issues? it's taking a long time to respond, but i don't have visibility into the VM | 12:38 |
mnaser | things load but take a _very_ long time, enough to cause jobs to timeout | 12:39 |
mnaser | nothing in console log | 12:41 |
mnaser | load average on the hypervisor it's on is 1.98 so the system is fine | 12:42 |
openstackgerrit | Carlos Goncalves proposed openstack/project-config master: Update branch checkout for octavia-lib DIB element https://review.opendev.org/745877 | 12:52 |
frickler | mnaser: I can log in and don't see anything obviously bad. do you have logs? is it for the AFS mirror or some of the proxies? | 12:53 |
mnaser | frickler: seeing these "urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='mirror.ca-ymq-1.vexxhost.opendev.org', port=443): Read timed out. (read timeout=60.0)" | 12:54 |
mnaser | but also when i was opening it here on my side it took a long time for pages to load | 12:54 |
mnaser | it seems responsive again now though | 12:54 |
frickler | mnaser: hmm, cacti graphs are empty starting around 4:00, maybe some other infra-root can take a deeper look soon | 12:57 |
mnaser | frickler: ok good, i'm not going nuts :) | 12:57 |
*** Marcelo- has joined #opendev | 13:02 | |
frickler | infra-root: gerritbot indeed seems to assume it doesn't have to do any notification for most events, logging "INFO gerritbot: Potential channels to receive event notification: set()", so likely some kind of config issue with the new deployment | 13:02 |
mnaser | frickler, infra-root: to note though, i just saw it join #openstack-tc and post a notification about a merged change so.. | 13:11 |
fungi | cacti can ping both v4 and v6 addresses for mirror.ca-ymq-1.vexxhost.opendev.org | 13:27 |
fungi | i thought we were only using sjc1 though? | 13:28 |
fungi | snmpd is still running too | 13:29 |
*** sshnaidm|afk is now known as sshnaidm | 13:38 | |
fungi | it's configured to only log warnings and above, but journalctl doesn't have any for it other than restarts (most recent was a few weeks ago) | 13:45 |
fungi | using tcpdump now to see if snmp queries are getting there at all | 13:46 |
fungi | okay, so i think the tcpdump nails down the issue... cacti is sending snmp queries to the mirror, it's receiving them and responding, but then cacti is never receiving the responses | 13:57 |
fungi | it's over ipv6, so looks suspiciously like the unidirectional v6 packet loss we've been seeing with systems in rackspace, but which in particular seems to especially impact the cacti server for some reason | 13:58 |
openstackgerrit | Merged zuul/zuul-jobs master: Fix url for ARA report https://review.opendev.org/745792 | 14:04 |
*** ysandeep is now known as ysandeep|dinner | 14:10 | |
frickler | mnaser: yes, it isn't 100% broken, it just seems to serve only a very exquisite subset of events | 14:12 |
fungi | well, keep in mind that the majority of events it logs shouldn't generate notifications to channels | 14:28 |
fungi | it gets the full gerrit event stream, analyzes every event and logs its decisions in the debug log, then only sends notifications for the tiny subset which its configuration says should get them | 14:29 |
fungi | i'll see if i can tell why 745831 wasn't announced to #openstack-lbaas for a start | 14:29 |
frickler | fungi: ah, you're right of course, seems the docker log only holds about 1h worth of data, we might want to log to somewhere more persistent | 14:37 |
fungi | yep, once i can figure out how to get docker-compose to show me more logs... | 14:37 |
fungi | hrm, yeah i'm starting to suspect that docker-compose is just throwing away logs and not saving them anywhere | 14:39 |
fungi | aha! it also writes to syslog | 14:40 |
fungi | looks like event 64c3a1b1decf is the one we want | 14:42 |
fungi | no, nevermind, that's not an event, that's a process | 14:42 |
frickler | but it seems to claim to have logged a message for that. not sure why every line seems to be logged twice in syslog, though | 14:43 |
fungi | Aug 12 11:03:29 eavesdrop01 64c3a1b1decf[1386]: 2020-08-12 11:03:29,173 INFO gerritbot: Potential channels to receive event notification: {'openstack-lbaas'} | 14:44 |
fungi | Aug 12 11:03:29 eavesdrop01 64c3a1b1decf[1386]: 2020-08-12 11:03:29,173 INFO gerritbot: Compiled Message openstack-lbaas: Carlos Goncalves proposed openstack/octavia master: Set Grub timeout to 0 for fast boot times https://review.opendev.org/745831 | 14:44 |
fungi | Aug 12 11:03:29 eavesdrop01 64c3a1b1decf[1386]: 2020-08-12 11:03:29,174 INFO gerritbot: Sending "Carlos Goncalves proposed openstack/octavia master: Set Grub timeout to 0 for fast boot times https://review.opendev.org/745831" to openstack-lbaas | 14:44 |
fungi | so it thinks it sent it to the server | 14:44 |
fungi | but yeah, no sign of it in http://eavesdrop.openstack.org/irclogs/%23openstack-lbaas/%23openstack-lbaas.2020-08-12.log.html | 14:45 |
johnsom | The bot isn't in the #openstack-lbaas channel according to my client. | 14:45 |
johnsom | I think it used to lurk in the channel if I remember right. | 14:46 |
fungi | johnsom: yeah, it can't join all channels (it's configured for more than freenode allows) so it opportunistically joins channels if it has a message for them and then parts if it needs to free up available channels to be able to post messages in others | 14:46 |
cgoncalves | johnsom, AJaeger wrote earlier this: " It only joins channels when there's something to notify about. There's a maximal channel limit a user/bot can be in, so it leaves/re-joins as needed." | 14:46 |
johnsom | Ah, ok | 14:47 |
fungi | so at start it's present in no channels, and joins them on demand up to the (120?) channel limit, then starts leaving the least recently needed channels as it has to join others | 14:47 |
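For illustration, a small sketch of the join-on-demand, part-least-recently-needed behaviour fungi describes; this is not gerritbot's actual code, and the 120-channel cap is only the guess from the discussion above:

```python
from collections import OrderedDict

class ChannelJuggler:
    """Track joined channels, parting the least recently needed one
    when the network's channel cap would be exceeded (illustrative only)."""

    def __init__(self, join, part, max_channels=120):
        self.join = join              # callable taking a channel name
        self.part = part              # callable taking a channel name
        self.max_channels = max_channels
        self.joined = OrderedDict()   # channel -> None, ordered by last use

    def ensure_joined(self, channel):
        if channel in self.joined:
            self.joined.move_to_end(channel)        # recently needed again
            return
        if len(self.joined) >= self.max_channels:
            oldest, _ = self.joined.popitem(last=False)
            self.part(oldest)                       # free a slot
        self.join(channel)
        self.joined[channel] = None

# Usage sketch: call juggler.ensure_joined('#openstack-lbaas') before each notification.
```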
frickler | I think there is some issue with the bot wanting to log to "openstack-lbaas" instead of "#openstack-lbaas" | 14:47 |
frickler | for channels where it works, there is a "#" in the channel name | 14:48 |
*** mlavalle has joined #opendev | 14:49 | |
fungi | i concur | 14:49 |
fungi | Aug 12 14:48:29 eavesdrop01 64c3a1b1decf[1386]: 2020-08-12 14:48:29,154 INFO gerritbot: Potential channels to receive event notification: {'#openstack-release'} | 14:49 |
fungi | et cetera | 14:49 |
fungi | so looks like maybe a configuration error | 14:49 |
fungi | though the configuration doesn't use # in front of any channel names | 14:50 |
frickler | channel_config.yaml doesn't have a # for any channel | 14:50 |
fungi | yeah, filtering the logs i see it also incorrectly trying to send to a bunch of non-# channel names | 14:53 |
fungi | i wonder if this is a recent regression in gerritbot | 14:53 |
fungi | looks like the version running on review.o.o was e387941 from december | 14:56 |
fungi | luckily there have been only 6 commits since then | 14:57 |
*** priteau has joined #opendev | 14:59 | |
frickler | we did run with py2 on review, didn't we? so might be a py3 issue, the commits since dec don't look suspicious to me | 14:59 |
fungi | yeah, i've now gone over all the recent commits since what we had installed on review.o.o and i agree, none of those were significant in ways which should impact this | 15:01 |
fungi | so gonna need to roll up sleeves and dive deeper in the code | 15:01 |
*** ysandeep|dinner is now known as ysandeep | 15:03 | |
frickler | fungi: https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L403-L412 looks to mix up using data and self.data, I'd go for cleaning that up first | 15:04 |
fungi | https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L406-L407 is where it seems to prepend the # | 15:05 |
fungi | oh, you're already there-ish | 15:05 |
fungi | i feel like we're both covering the same ground | 15:05 |
frickler | fungi: my python-fu isn't strong, but that code makes me wonder whether changing data after setting "self.data=data" may behave differently with py3 | 15:06 |
fungi | yeah, modifying an iterable in place | 15:07 |
fungi | while iterating on it | 15:08 |
fungi | actually not a great idea. better to iterate on a copy while using it as a reference to modify the original | 15:08 |
fungi | i bet data.keys() returned a copy in python 2 but returns an iterable tied to the data object in python 3 | 15:09 |
fungi | we might want to do keys = list(data.keys()) there? | 15:09 |
AJaeger | "modifying an iterable in place" is not allowed anymore with python 3.7 and a hard failure. | 15:10 |
fungi | `python --version` in the container says "Python 3.7.8" | 15:12 |
fungi | so i guess that's not it | 15:12 |
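For reference, a minimal sketch of the pattern behind the fix that eventually merged below ("Iterate over a copy of the channel keys"). The channel-config shape here is made up, but the underlying point is the one discussed above: Python 2's dict.keys() returned a list copy, while Python 3 returns a live view, so mutating the dict mid-loop needs an explicit snapshot:

```python
# Hypothetical channel config; the real gerritbot structure may differ.
channels = {
    'openstack-lbaas': {'events': ['patchset-created']},
    '#opendev': {'events': ['change-merged']},
}

# Iterate over a snapshot of the keys so entries can be renamed (prepending
# '#') while looping; iterating the live view in Python 3 while popping and
# re-adding keys is unreliable and can raise RuntimeError.
for key in list(channels.keys()):
    if not key.startswith('#'):
        channels['#' + key] = channels.pop(key)

print(sorted(channels))  # ['#opendev', '#openstack-lbaas']
```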
* frickler needs to leave, will check back later | 15:15 | |
*** ysandeep is now known as ysandeep|away | 15:19 | |
fungi | i'll keep fiddling with it | 15:19 |
smcginnis | mnaser: I've seen two patches with retry failures to a Vexxhost mirror. Not sure if it's just flakiness in the network or something else, but thought I should mention it. | 15:20 |
mnaser | smcginnis: yeah, i noticed that this morning. i dont see anything in the system itself | 15:20 |
mnaser | is it happening recently? apparently the most recent i've seen it was around 9am-ish | 15:21 |
mnaser | smcginnis: it was _really_ slow to respond, don't have access to the machine. when did those patches fail? | 15:21 |
smcginnis | mnaser: Just hit it now with https://zuul.opendev.org/t/openstack/build/3ea6602420aa4c3abe80d497f57b5777 | 15:21 |
mnaser | yes. i can see that when i click http://mirror.ca-ymq-1.vexxhost.opendev.org/ it takes a while to open folders | 15:22 |
smcginnis | The last one I looked at before this (last night I think) it had some retries that eventually succeeded for an earlier package, ran fine for a few more, then timed out with retries on a later one. | 15:22 |
mnaser | everything is ok from our side :\ | 15:24 |
mnaser | cc infra-root ^ | 15:24 |
smcginnis | We can blame network gremlins for now. | 15:24 |
mnaser | smcginnis: it's repeated a few times today | 15:24 |
mnaser | plus it's stopped reporting into cacti too, so there's that | 15:25 |
mnaser | we really should get to the bottom of it otherwise we're just wasting compute power | 15:25 |
smcginnis | Yeah | 15:26 |
smcginnis | mnaser: When you said "on our side" above, were you referring to vexxhost or opendev as "our"? :) | 15:27 |
mnaser | smcginnis: sorry, vexxhost is not seeing any issues :) | 15:27 |
openstackgerrit | Thierry Carrez proposed opendev/system-config master: Redirect UC content to TC site https://review.opendev.org/744497 | 15:27 |
fungi | mnaser: you probably missed my investigation of the cacti situation above, but i don't see that it's likely to be related | 15:28 |
mnaser | it hurts us even more because that means any job running on our cloud will fail and just burn through systems | 15:28 |
mnaser | oh, i guess this might be the ipv6 thing happening eh :\ | 15:29 |
mnaser | the thing is, i am having problems accessing http://mirror.ca-ymq-1.vexxhost.opendev.org even over ipv4 (seeing it load slowly when i browse around) | 15:29 |
fungi | tcpdump shows snmp requests reaching the mirror, snmpd on the mirror replies, but those responses never reach cacti. that's over ipv6 and we've got similar situations with other systems not able to get v6 packets back to cacti (even from within rackspace's own network) | 15:29 |
fungi | afs, in contrast, is all over ipv4 | 15:29 |
fungi | it doesn't even support ipv6 | 15:29 |
mnaser | do apache logs show any slow requests? | 15:30 |
smcginnis | OK, just saw three more recent job failures matching this. Looks like it definitely is a bigger issue. | 15:31 |
mnaser | yeah, i am seeing failures here too | 15:31 |
fungi | apache doesn't log the time a request takes to satisfy, that i can find | 15:36 |
mnaser | ah, that's a bummer | 15:37 |
mnaser | nothing in error logs? | 15:37 |
fungi | it's constantly spewing file negotiation failures, that's generally just because it doesn't know the file type though i think | 15:37 |
fungi | like: | 15:38 |
fungi | AH00687: Negotiation: discovered file(s) matching request: /var/www/mirror/wheel/ubuntu-18.04-x86_64/a/appdirs/index.html (None could be negotiated)., | 15:38 |
fungi | also seeing some of: | 15:38 |
fungi | AH01401: Zlib: Validation bytes not present, | 15:38 |
fungi | i'll shift gears to look into this deeper, and finish worrying about fixing gerritbot later | 15:39 |
fungi | does someone have a link to a job failure? | 15:39 |
mnaser | fungi: https://zuul.opendev.org/t/vexxhost/build/26648c0867ab4f7eb4aa5567f60007e1 here is one | 15:40 |
fungi | thanks | 15:40 |
fungi | okay, so for starters, /pypi/ isn't anything we mirror, it's not served out of afs, this is a proxy to the nearest fastly cdn endpoint for pypi.org | 15:43 |
fungi | so i'll check to see whether that mirror host is having trouble reaching or getting responses from pypi | 15:43 |
mnaser | https://zuul.opendev.org/t/vexxhost/build/61c6cf30b09c4deb821e05108dc1b0d3 -- another breakage too, this one might not be related to the cache though fungi | 15:43 |
mnaser | i think /pypifiles/ is hosted locally | 15:43 |
fungi | nope, also a proxy | 15:44 |
mnaser | ah | 15:44 |
fungi | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror/templates/mirror.vhost.j2#L94 | 15:44 |
fungi | pypi is split between an index site and a file hosting site, so we have to proxy both | 15:45 |
fungi | but both sites use the same cdn network (fastly) so are likely winding up hitting the same endpoints for it | 15:45 |
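To make the index/files split concrete, a quick sketch using pypi's public JSON API directly (the mirror's /pypi/ and /pypifiles/ paths are just proxied views of these two hosts): the metadata comes from pypi.org, but every file URL it returns points at files.pythonhosted.org, which is why both need a proxy.

```python
import json
import urllib.request

# Fetch release metadata for an arbitrary example package from the index host...
with urllib.request.urlopen('https://pypi.org/pypi/requests/json', timeout=30) as resp:
    data = json.load(resp)

# ...and note that the downloadable artifacts live on the separate file host.
print(data['urls'][0]['url'])  # https://files.pythonhosted.org/packages/...
```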
fungi | Connecting to pypi.org (pypi.org)|2a04:4e42:200::223|:443... | 15:45 |
fungi | when i do `wget -O/dev/null 'https://pypi.org/'` | 15:46 |
fungi | it's just sitting there | 15:46 |
fungi | mnaser: can you reach pypi at all over ipv6 from vexxhost ca-ymq-1? | 15:47 |
mnaser | yep, its hanging indeed | 15:48 |
fungi | so that seems to be the crux of the problem | 15:48 |
mnaser | at this point it wouldn't even surprise me that someone got mad at someone and unpeered and routes are gone | 15:48 |
fungi | yup | 15:49 |
fungi | i don't miss my days doing isp backbone peering | 15:49 |
fungi | all the finger-pointing between carriers was atrocious | 15:49 |
fungi | we can dial max-servers down to 0 there in the short term if you'd like | 15:50 |
fungi | basically anything trying to pip install in a job is going to fail there for now, i think | 15:50 |
*** shtepanie has joined #opendev | 15:51 | |
fungi | i'll push something up | 15:53 |
mnaser | give me a few minutes to try and see whats going on before we stop the whole thing, if that's ok | 15:53 |
fungi | sure, works for me | 15:53 |
mnaser | i mean, maybe we could workaround it by adding pypi.org ipv4 address to /etc/hosts -- just for now i guess | 15:54 |
fungi | that might work. i think apache will check that. but since pypi and pythonhosted are using a cdn, hard-coding its ip addresses could be risky | 15:54 |
mnaser | i agree, but i am talking about doing that for an hour or two at most while i debug, just to get things flowing again | 15:55 |
mnaser | i agree, but i am talking about doing that for an hour or two at most while i debug, just to get things flowing again | 15:55 |
mnaser | oops, wrong arrow up enter window | 15:55 |
fungi | yeah, i also can't ping ipv6 addresses for some stuff in rackspace from the mirror there, so i guess it's a fairly broad set of routes affected | 15:56 |
mnaser | ping6 google.ca seems to work though | 15:57 |
mnaser | so it's a subset of things... | 15:57 |
*** lpetrut has quit IRC | 15:58 | |
fungi | yep, i mean, i'm ssh'd in over ipv6 so it's obviously working for some routes | 15:58 |
fungi | but it's clearly more than just the fastly cdn endpoint impacted | 15:59 |
mnaser | fungi: so you said that when you were doing your tcpdumps, incoming traffic arrived and was responded to, but the replies never reached rax (wrt snmp?) | 15:59 |
fungi | correct | 15:59 |
fungi | so this could be related, and it's an asymmetric route | 15:59 |
fungi | installing traceroute to see if i can spot a difference | 16:00 |
mnaser | fungi: if you don't mind, could you run a traceroute from rax to the mirror? | 16:00 |
mnaser | thank you :) | 16:00 |
fungi | heh | 16:00 |
mnaser | that way i can see at least which path it's unhappy about | 16:00 |
fungi | yeah, been in your shoes more times than i care to remember | 16:01 |
fungi | mnaser: oddly, i never get a response even from the gateway | 16:03 |
fungi | no hops responding | 16:03 |
fungi | oh wow. | 16:04 |
fungi | no default route? | 16:04 |
fungi | oh, nevermind, there are two | 16:04 |
fungi | looks like i'm seeing default routes announced through fe80::ce2d:e0ff:fe0f:74af and fe80::ce2d:e0ff:fe5a:d84e but neither are responding with ttl expired when trying to traceroute | 16:05 |
fungi | tried both udp (default) and icmp traceroute | 16:06 |
fungi | i can traceroute to my home ipv6 address just fine, the one i'm ssh'd in from | 16:07 |
fungi | but not to cacti's | 16:07 |
fungi | nor to pypi.org | 16:08 |
fungi | my home address is in 2600:6C00::/24 and traceroute shows responses starting from 2604:e100:1:0:ce2d:e0ff:fe5a:d84e in vexxhost | 16:10 |
fungi | but even that hop doesn't show up when tracerouting to cacti or pypi | 16:10 |
fungi | do you plumb ebgp all the way down into your igp there? or is there something shadowing those prefixes in your igp? | 16:11 |
mnaser | fungi: fe80::ce2d:e0ff:fe0f:74af is a link-local address, which is announced to all vms | 16:13 |
mnaser | they shouldnt be reachable from the vms there | 16:14 |
mnaser | i can see that it's pingable from another machine here | 16:14 |
fungi | pypi is? | 16:14 |
fungi | but yeah, for v6 destinations i can reach, 2604:e100:1:0:ce2d:e0ff:fe5a:d84e responds as the first hop. for v6 destinations i can't reach, there is no first hop (or any hops) responding | 16:15 |
mnaser | which means thats the same host as fe80::ce2d:e0ff:fe5a:d84e | 16:16 |
fungi | making me suspect that the first router is black-holing those prefixes somehow | 16:16 |
mnaser | which means that potentially fe80::ce2d:e0ff:fe0f:74af is the issue | 16:17 |
fungi | if it simply didn't have a route for them i'd expect an icmp no route to host or network unreachable | 16:17 |
mnaser | plot twist, 74af is the one that holds the closer route to pypi | 16:18 |
fungi | fwiw, i can ping 2604:e100:1:0:ce2d:e0ff:fe0f:74af just fine | 16:18 |
mnaser | oh i have an idea | 16:19 |
fungi | so if it were the problem i'd expect to be getting messages back from it in a traceroute | 16:19 |
mnaser | can you ping 2001:550:2:6::26:2 | 16:19 |
mnaser | and 2605:9000:400:107::c | 16:19 |
fungi | i get responses from 2605:9000:400:107::c but not 2001:550:2:6::26:2 | 16:19 |
mnaser | progress | 16:21 |
mnaser | fungi: what about 2001:550:2:6::26:1 ? | 16:22 |
fungi | no response | 16:22 |
mnaser | can i get a trace to :1 ? | 16:22 |
fungi | traceroute to 2001:550:2:6::26:1 (2001:550:2:6::26:1), 30 hops max, 80 byte packets | 16:23 |
fungi | 1 * * * | 16:23 |
fungi | nothing back from any hops | 16:23 |
mnaser | the heck | 16:23 |
mnaser | is there no route towards it? | 16:23 |
mnaser | fungi: i assume you are pinging/tracing from your local system right? | 16:23 |
fungi | or whatever is handling the next hop is eating it silently | 16:24 |
fungi | these are pings/traceroutes from the mirror instance in ca-ymq-1 | 16:24 |
mnaser | ah, i thought those were pings externally | 16:24 |
mnaser | are you able to reach those from rax or local at your side? | 16:25 |
fungi | from home i can reach all three of 2001:550:2:6::26:2 2605:9000:400:107::c 2001:550:2:6::26:1 | 16:26 |
fungi | same from bridge.openstack.org in rackspace dfw | 16:26 |
mnaser | whats interesting is | 16:27 |
mnaser | it stopped working exactly at 4am utc | 16:27 |
mnaser | which is 12am est | 16:27 |
fungi | i can also ping the mirror from bridge.o.o but can't ping bridge.o.o from the mirror | 16:27 |
mnaser | hmm, provider did an upgrade overnight | 16:28 |
fungi | similarly i can ping mirror.ca-ymq-1.vexxhost.opendev.org from cacti.openstack.org but not the reverse | 16:29 |
fungi | it's like replies to inbound flows for icmp echo request are set up and the returning echo replies get routed correctly (but the same is apparently not true of snmp/udp?) | 16:30 |
fungi | it's just odd to see this sort of stateful behavior at the carrier level. they must be doing some sort of flow-based balancing across their gear or something and not simple hash | 16:33 |
fungi | or maybe i'm just getting lucky and outbound icmp echo replies are getting hashed through a working router but not outbound echo requests | 16:35 |
mnaser | fungi: yeah.. something is weird there. i turned off the ipv6 peer to restart it | 16:35 |
mnaser | and now it's just stuck in opensent | 16:35 |
fungi | ew | 16:36 |
mnaser | fungi: ok, it's escalated with the network provider right now | 16:42 |
mnaser | bgp session is back up but this is still a problem | 16:42 |
fungi | want me to go forward with a max-servers=0 patch for now? | 16:43 |
mnaser | yeah lets do that and we can approve it together quickly to unblock world | 16:43 |
openstackgerrit | Jeremy Stanley proposed openstack/project-config master: Temporarily disable vexxhost ca-ymq-1 https://review.opendev.org/745929 | 16:45 |
fungi | mnaser: ^ feel free to single-core approve | 16:45 |
fungi | i can enqueue to the gate directly | 16:45 |
mnaser | fungi: done and think that's a good idea | 16:46 |
mnaser | as we may fail a few times along our way there | 16:46 |
fungi | it's in the gate now | 16:46 |
openstackgerrit | Jeremy Stanley proposed opendev/gerritbot master: Iterate over a copy of the channel keys https://review.opendev.org/745930 | 16:50 |
fungi | frickler: AJaeger: clarkb: ^ i hope that's the fix for gerritbot | 16:50 |
openstackgerrit | Merged openstack/project-config master: Temporarily disable vexxhost ca-ymq-1 https://review.opendev.org/745929 | 16:59 |
*** Baragaki has joined #opendev | 17:04 | |
*** priteau has quit IRC | 17:27 | |
*** Marcelo- has quit IRC | 17:59 | |
fungi | infra-root: if anyone happens to have a moment to spare, i'm hoping https://review.opendev.org/745930 will solve our latest gerritbot regression | 18:11 |
*** knikolla_ has joined #opendev | 18:13 | |
clarkb | fungi: one jetty based +2 | 18:17 |
corvus | +3 | 18:18 |
fungi | much thanks all! | 18:19 |
fungi | i'll give the logs a close watch after this merges | 18:20 |
*** mordred has quit IRC | 18:20 | |
*** gouthamr has quit IRC | 18:20 | |
*** knikolla has quit IRC | 18:20 | |
*** knikolla_ is now known as knikolla | 18:20 | |
fungi | clarkb: enjoy the jetty! hope you catch something without catching something | 18:20 |
corvus | heh, i thought clarkb was digging into java web servers, but this is better. :) | 18:21 |
*** Eighth_Doctor has quit IRC | 18:21 | |
*** gouthamr has joined #opendev | 18:22 | |
*** mordred has joined #opendev | 18:28 | |
openstackgerrit | Merged opendev/gerritbot master: Iterate over a copy of the channel keys https://review.opendev.org/745930 | 18:56 |
*** Eighth_Doctor has joined #opendev | 19:00 | |
*** shtepanie has quit IRC | 19:21 | |
mnaser | infra-root: would anyone be kind enough to run a traceroute6 from bridge.openstack.org to mirror.ca-ymq-1.vexxhost.opendev.org ? | 19:23 |
corvus | on it | 19:25 |
corvus | mnaser: http://paste.openstack.org/show/796794/ | 19:27 |
fungi | mnaser: traceroute back is unfortunately still blank, no response even from the first hop | 19:29 |
mnaser | corvus, fungi: thank you. yes, the no response from first hop is very confusing | 19:30 |
corvus | mnaser, fungi: fwiw, mtr from bridge: http://paste.openstack.org/show/796795/ | 19:32 |
mnaser | corvus: while you're in there, can you confirm if '2604:e100:1:0:ce2d:e0ff:fe0f:74af' and '2604:e100:1:0:ce2d:e0ff:fe5a:d84e' are indeed currently sending icmp requests to that system? | 19:43 |
mnaser | sorry, i just don't have access to a system that is on an 'unreachable' network :( | 19:43 |
corvus | working | 19:43 |
corvus | mnaser: yes receiving and replying | 19:48 |
corvus | seems fairly steady at 1hz each | 19:48 |
mnaser | hrm, ok. alright, so both outbound routes are working just fine | 19:48 |
corvus | and stopped :) | 19:48 |
mnaser | yep, as expected | 19:49 |
*** hashar has quit IRC | 20:43 | |
mnaser | sigh | 21:07 |
mnaser | infra-root: can someone run this on the mirror node -- `ip -6 addr list | grep 2001:db8 | awk '{ print $2 }' | xargs -I {} -n1 ip addr del {} dev eth0` | 21:08 |
mnaser | somehow 2001:db8:{0,1}::/64 addresses got dynamically configured, i'm still digging into this, but they'll need to be removed | 21:08 |
openstackgerrit | Mohammed Naser proposed openstack/project-config master: Revert "Temporarily disable vexxhost ca-ymq-1" https://review.opendev.org/745966 | 21:10 |
fungi | mnaser: probably you want ens3 instead of eth0, and done | 21:10 |
mnaser | infra-root: ^ appreciate a vote on that, once that's done and verified, we can land that, meanwhile i'll be investigating how the ra showed up | 21:10 |
mnaser | fungi: ah cool, wanna run/merge that? | 21:11 |
fungi | can do, also after clearing those routes i can reach stuff from the mirror again | 21:12 |
fungi | also https://mirror.ca-ymq-1.vexxhost.opendev.org/pypi/ is working | 21:12 |
fungi | mnaser: and approved, thanks for looking into it! | 21:13 |
mnaser | fungi: what i noticed when tcpdumping was that it was picking the 2001:db8:: address as src | 21:13 |
mnaser | when trying to reach pypi.org | 21:13 |
mnaser | but going to google, it wasn't | 21:13 |
fungi | got it, so there was a rogue prefix announced on that lan? | 21:16 |
fungi | fun stuff | 21:16 |
fungi | i think we thought we saw that once in limestone too, but couldn't repro it | 21:17 |
fungi | that also explains the symptoms we saw, as far as being able to reach the machine but it not being able to reach stuff | 21:18 |
fungi | it was responding from the address to which things were connecting, but initiating from a different address which wasn't routable | 21:18 |
fungi | 2001:db8:0:3::/64 dev ens3 proto ra metric 100 expires 2550846sec pref medium | 21:21 |
fungi | 2001:db8:1::/64 dev ens3 proto ra metric 100 expires 2550801sec pref medium | 21:21 |
fungi | we had both of those in the local routing table | 21:21 |
fungi | 2001:db8::/32 IPV6-DOC-AP "IPv6 prefix for documentation purpose" (This address range is to be used for documentation purpose only. For more information please see http://www.apnic.net/info/faq/ipv6-documentation-prefix-faq.html ) | 21:22 |
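A small diagnostic sketch along the lines of what the tcpdump showed: a connected UDP socket sends no packets, but getsockname() reveals which source address the kernel selected for a destination. Run on the affected mirror (an assumption, not something that was done above), a 2001:db8: source showing up for pypi.org but not for google.ca would match the observed behaviour:

```python
import socket

def v6_source_for(host, port=443):
    # Resolve an IPv6 address for the destination, "connect" a UDP socket
    # (no traffic is generated), and report the locally chosen source address.
    addr = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_DGRAM)[0][4]
    with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as s:
        s.connect(addr)
        return s.getsockname()[0]

for host in ('pypi.org', 'google.ca'):
    print(host, '->', v6_source_for(host))
```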
logan- | https://bugs.launchpad.net/neutron/+bug/1844712 | 21:23 |
openstack | Launchpad bug 1844712 in OpenStack Security Advisory "RA Leak on tenant network" [Undecided,Incomplete] | 21:23 |
logan- | that was a strange one. what youre seeing looks like a recurrence of that bug mnaser. the block is different, but that's only because the ipv6 blocks those jobs use were updated over the course of that bug: https://bugs.launchpad.net/neutron/+bug/1844712/comments/8 ...to the cidr you saw today :) | 21:28 |
openstack | Launchpad bug 1844712 in OpenStack Security Advisory "RA Leak on tenant network" [Undecided,Incomplete] | 21:28 |
clarkb | fungi: corvus got two keeper rock fish. we got a ling cod but it was below minimum size so went back in | 21:30 |
fungi | codesearch mostly turns up hits in our docs (unsurprisingly) but the prefix also gets heavy use in test for neutron, horizon, octavia, nova, tripleo, searchlight, ironic, manila, swift, zun, charms, designate, tempest, devstack, kuryr, osc, vitrage, watcher, cinder, monasca, sdk, puppet, several oslo libs... http://codesearch.openstack.org/?q=2001%3Adb8 | 21:31 |
fungi | hard to tell just from that what might be spewed via route announcements from misconfigured job nodes | 21:32 |
mnaser | logan-: urgh. but you run lxb right? | 21:32 |
logan- | yup, lxb cloud in that bug | 21:32 |
fungi | but since the mirror is in a different tenant... it's somewhat unexpected behavior | 21:33 |
mnaser | i guess this is more of a firewall driver issue. i think we have some systems that use iptables_hybrid and some with ovs driver | 21:33 |
mnaser | logan-: you dont have a repro i assume? | 21:35 |
logan- | possible way to approach it: find a timestamp for the RA / IP getting added on mirror, and then correlate that with jobs that were running at the time to try and identify the suspect VM(s), then look thru nova/neutron logs to try to find what went wrong | 21:36 |
logan- | nope, never was able to repro | 21:36 |
mnaser | i cant imagine this would be easy to reproduce | 21:36 |
logan- | and you'd think with all the vm launches going on we'd see it more often. it is crazy when it pops up | 21:36 |
fungi | gotta be a rare race with ports and filters or something | 21:37 |
mnaser | logan-: how many times have you hit this? i'm asking because we only recently upgraded to stein for this cloud | 21:37 |
logan- | once | 21:37 |
logan- | on rocky | 21:37 |
mnaser | this cloud was on queens for a little while and we never hit it, but hitting it once also isn't much of an indicator | 21:37 |
mnaser | logan-: what is interesting is this failed almost near exactly at 4am utc / 12am est | 21:37 |
mnaser | https://usercontent.irccloud-cdn.com/file/JGp4BKOx/image.png | 21:38 |
logan- | hmm, iirc ours was mid-morning EST so i didn't think much of the timing | 21:38 |
mnaser | 19th of september | 21:39 |
mnaser | let me see if cacti goes that far back | 21:39 |
mnaser | unfortunately not | 21:40 |
fungi | likely the server got rebuilt | 21:49 |
fungi | replaced, whatever | 21:49 |
fungi | we've been replacing a lot of our mirrors over the past year for ubuntu upgrades and newer domain name | 21:49 |
fungi | and ansibilification | 21:49 |
openstackgerrit | Merged openstack/project-config master: Revert "Temporarily disable vexxhost ca-ymq-1" https://review.opendev.org/745966 | 21:50 |
logan- | in the irc log from that bug, https://i.imgur.com/XsJTj6Y.jpg was linked. that's in central time.. so when it happened on our cloud it was around 9:15 AM Central. then it wasn't discovered until around 10:30-11, and the test node(s) that we guessed might have caused the issue were long gone by then. | 21:55 |
fungi | clarkb: if you're around, what triggers updating the docker image for gerritbot on eavesdrop? i see that we published the hopefully-fixed build to dockerhub but the hourly system-config deploy doesn't seem to be doing it. the daily deploy? | 22:38 |
clarkb | I think that is currently missing | 22:39 |
clarkb | we can have gerritbot changes themselves trigger them or do them hourly like zuul and nodepool | 22:39 |
clarkb | also need to tie in project-config to trigger infra-prod-service-eavesdrop when channel config updates | 22:40 |
*** tkajinam has joined #opendev | 22:57 | |
fungi | clarkb: so for now should i just docker-compose down/up -d? | 23:00 |
clarkb | you need to do a pull first I think | 23:12 |
clarkb | but ya that should do it | 23:12 |
fungi | ahh, right-o | 23:12 |
fungi | pulled | 23:12 |
*** openstackgerrit has quit IRC | 23:13 | |
fungi | downed | 23:13 |
fungi | upped | 23:13 |
fungi | watching syslog | 23:13 |
*** tosky has quit IRC | 23:13 | |
fungi | #status log manually pulled, downed and upped gerritbot container on eavesdrop for recent config parsing fix | 23:15 |
openstackstatus | fungi: finished logging | 23:15 |
*** ryohayakawa has joined #opendev | 23:49 | |
*** ryohayakawa has quit IRC | 23:56 | |
*** ryohayakawa has joined #opendev | 23:57 |