fungi | i can't *quite* remember what openssl error 1416F086 is. i'm sure it'll come to me in a moment | 00:02 |
fungi | self-signed cert, i guess? | 00:03 |
fungi | and git-send-email isn't going to let the server present just any ol' cert, huh? | 00:04 |
*** ryohayakawa has joined #opendev | 00:05 | |
fungi | okay, jaraco/irc pr is all green now | 00:06 |
fungi | i substituted the python interpreter version in ctcp version replies, we'll see how that flies | 00:07 |
ianw | fungi: this is just set up to send via gmail (the corporate mail server) ... if anything should work i'd think it would be this :/ | 00:07 |
fungi | yikes. maybe something's wrong with the ca bundle on the client then? | 00:08 |
fungi | or maybe someone let a test cert leak into production at gmail | 00:08 |
ianw | ... no ... i had an old config file lying around from the last time i tried to use git send-email pointing to the old, internal corporate server, which must have been overriding my settings | 00:11 |
fungi | aha | 00:11 |
ianw | i wish i could just let this ipv6 thing go but it's my white whale :) | 00:13 |
*** shtepanie has joined #opendev | 00:42 | |
ianw | donnyd/fungi: this is something like what i'm proposing as a libvirt doc update for the nat address choice -> http://paste.openstack.org/show/796749/ ... seem about right? | 00:49 |
donnyd | I would maybe just use your real world example for the libvirt doc | 00:52 |
donnyd | other than that it LGTM | 00:53 |
ianw | donnyd: yeah, i guess the thing is https://tools.ietf.org/html/rfc4193#section-3.2.2 goes into great detail about how to generate a random /48 | 00:53 |
ianw | i figure if you put something in the doc, it just gets copied :) | 00:53 |
donnyd | That is correct | 00:54 |
donnyd | and its exactly what people do | 00:54 |
donnyd | I would say its less likely that people will go read the glyphs from ietf | 10:55 |
donnyd | Reading an RFC isn't really at the top of the list for most people on "things I will do with my evening" | 00:56 |
ianw | heh :) you can either have ipv6, or not read RFCs ... choose one :) | 00:56 |
ianw | i think in general, the world has chosen the latter | 00:57 |
clarkb | my ISP says that we may have ipv6 by the end of the year | 00:58 |
clarkb | will be the best outcome of them being bought out if it happens | 00:58 |
ianw | i don't know if it's a fedora bug that fd00: interfaces are not preferred over ipv4 | 00:59 |
ianw | from what i can tell, it's a practical decision: people had fc00::/7 addresses that didn't route anywhere, and it would cause all sorts of issues | 01:00 |
ianw | clarkb: odds that your ipv6 also comes with cgnat? :) | 01:18 |
clarkb | I doubt it will | 01:20 |
*** qchris has quit IRC | 01:45 | |
donnyd | Overall I think your post to the docs is a big value add; as ipv6 becomes less of a mystery to people over time, simple usable docs like this will only gain value. And I think that is what you wrote up. | 01:54 |
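As a concrete illustration of the RFC 4193 section 3.2.2 recipe ianw refers to, here is a minimal Python sketch of picking a random fd00::/8 /48. The exact inputs are an approximation (the RFC specifies an NTP-format timestamp concatenated with an EUI-64; uuid.getnode() plus some urandom bytes stand in for that here):

```python
import hashlib
import ipaddress
import os
import time
import uuid

def random_ula_prefix():
    # Roughly RFC 4193 3.2.2: hash a timestamp plus a host identifier and keep
    # the least significant 40 bits of the SHA-1 digest as the Global ID.
    seed = (time.time_ns().to_bytes(8, 'big')
            + uuid.getnode().to_bytes(6, 'big')
            + os.urandom(8))
    global_id = hashlib.sha1(seed).digest()[-5:]     # 40-bit Global ID
    packed = bytes([0xfd]) + global_id + bytes(10)   # fd00::/8 + Global ID, zero subnet/host bits
    return ipaddress.IPv6Network((ipaddress.IPv6Address(packed), 48))

print(random_ula_prefix())  # e.g. fdxx:xxxx:xxxx::/48
```

Generating a prefix this way, rather than copying the example out of a doc, is exactly the habit the paste above is trying to encourage.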
*** qchris has joined #opendev | 01:57 | |
*** shtepanie has quit IRC | 03:52 | |
*** dmsimard2 has joined #opendev | 04:12 | |
*** dmsimard has quit IRC | 04:13 | |
*** dmsimard2 is now known as dmsimard | 04:13 | |
*** ysandeep|away is now known as ysandeep | 04:14 | |
*** logan- has joined #opendev | 04:40 | |
*** weshay|pto has quit IRC | 06:12 | |
*** weshay_ has joined #opendev | 06:13 | |
*** DSpider has joined #opendev | 07:00 | |
*** openstackgerrit has joined #opendev | 07:00 | |
openstackgerrit | yatin proposed zuul/zuul-jobs master: Fix url for ARA report https://review.opendev.org/745792 | 07:00 |
*** ryohayakawa has quit IRC | 07:03 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: validate-host: skip linux nly tasks on windows machines https://review.opendev.org/745797 | 07:10 |
*** ssbarnea has joined #opendev | 07:18 | |
*** zbr has quit IRC | 07:29 | |
*** ssbarnea has quit IRC | 07:29 | |
*** zbr9 has joined #opendev | 07:29 | |
*** hashar has joined #opendev | 07:30 | |
*** tosky has joined #opendev | 07:41 | |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:01 | |
*** openstackgerrit has quit IRC | 08:09 | |
*** zbr9 has quit IRC | 08:12 | |
*** zbr has joined #opendev | 08:13 | |
*** tkajinam has quit IRC | 08:19 | |
*** Eighth_Doctor has quit IRC | 09:05 | |
*** mordred has quit IRC | 09:06 | |
*** Eighth_Doctor has joined #opendev | 09:14 | |
*** mordred has joined #opendev | 09:45 | |
*** openstackgerrit has joined #opendev | 10:04 | |
openstackgerrit | Riccardo Pittau proposed openstack/diskimage-builder master: Update name of ipa job https://review.opendev.org/743042 | 10:04 |
openstackgerrit | Riccardo Pittau proposed openstack/diskimage-builder master: Do not install python2 packages in ubuntu focal https://review.opendev.org/745665 | 10:16 |
openstackgerrit | Carlos Goncalves proposed openstack/diskimage-builder master: Add octavia-amphora-image-build-live jobs https://review.opendev.org/745823 | 10:20 |
cgoncalves | hey there! openstackgerrit is back online but has not joined #openstack-lbaas and #openstack-infra at least | 10:22 |
*** DSpider has quit IRC | 10:32 | |
*** DSpider has joined #opendev | 10:33 | |
*** hashar has quit IRC | 10:44 | |
*** lpetrut has joined #opendev | 10:45 | |
*** calcmandan has quit IRC | 10:48 | |
*** calcmandan has joined #opendev | 10:49 | |
AJaeger | cgoncalves: are you missing notifications? It only joins channels when there's something to notify about. There's a maximal channel limit a user/bot can be in, so it leaves/re-joins as needed. | 10:54 |
cgoncalves | AJaeger, definitely missing in #openstack-lbaas | 10:55 |
*** sshnaidm is now known as sshnaidm|afk | 10:57 | |
yoctozepto | it has not joined kolla either | 11:30 |
AJaeger | cgoncalves, yoctozepto: see above in this channel when it joined here - so, please give us a link to a change that should have been notified and didn't - and then somebody can check log files... | 11:46 |
cgoncalves | AJaeger, bot did not notify on #openstack-lbaas of https://review.opendev.org/#/c/745831 | 11:48 |
cgoncalves | other changes: https://review.opendev.org/#/c/745820/ & https://review.opendev.org/#/c/685337/ | 11:49 |
yoctozepto | I guess cgoncalves's changes are enough, in case of k&k-a it's a ton of these ;d | 11:50 |
AJaeger | thanks. Let's ask infra-root to investigate those ^ | 11:59 |
AJaeger | yoctozepto: yes, cgoncalves' are enough | 12:00 |
AJaeger | At least I hope so ;) | 12:00 |
*** hashar has joined #opendev | 12:11 | |
mnaser | infra-root: is http://mirror.ca-ymq-1.vexxhost.opendev.org having issues? it's taking a long time to respond, but i don't have visibility into the VM | 12:38 |
mnaser | things load but take a _very_ long time, enough to cause jobs to timeout | 12:39 |
mnaser | nothing in console log | 12:41 |
mnaser | load average on the hypervisor it's on is 1.98 so the system is fine | 12:42 |
openstackgerrit | Carlos Goncalves proposed openstack/project-config master: Update branch checkout for octavia-lib DIB element https://review.opendev.org/745877 | 12:52 |
frickler | mnaser: I can log in and don't see anything obviously bad. do you have logs? is it for the AFS mirror or some of the proxies? | 12:53 |
mnaser | frickler: seeing these "urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='mirror.ca-ymq-1.vexxhost.opendev.org', port=443): Read timed out. (read timeout=60.0)" | 12:54 |
mnaser | but also when i was opening it here on my side it took a long time for pages to load | 12:54 |
mnaser | it seems responsive again now though | 12:54 |
frickler | mnaser: hmm, cacti graphs are empty starting around 4:00, maybe some other infra-root can take a deeper look soon | 12:57 |
mnaser | frickler: ok good, i'm not going nuts :) | 12:57 |
*** Marcelo- has joined #opendev | 13:02 | |
frickler | infra-root: gerritbot indeed seems to assume it doesn't have to do any notification for most events, logging "INFO gerritbot: Potential channels to receive event notification: set()", so likely some kind of config issue with the new deployment | 13:02 |
mnaser | frickler, infra-root: to note though, i just saw it join #openstack-tc and post a notification about a merged change so.. | 13:11 |
fungi | cacti can ping both v4 and v6 addresses for mirror.ca-ymq-1.vexxhost.opendev.org | 13:27 |
fungi | i thought we were only using sjc1 though? | 13:28 |
fungi | snmpd is still running too | 13:29 |
*** sshnaidm|afk is now known as sshnaidm | 13:38 | |
fungi | it's configured to only log warnings and above, but journalctl doesn't have any for it other than restarts (most recent was a few weeks ago) | 13:45 |
fungi | using tcpdump now to see if snmp queries are getting there at all | 13:46 |
fungi | okay, so i think the tcpdump nails down the issue... cacti is sending snmp queries to the mirror, it's receiving them and responding, but then cacti is never receiving the responses | 13:57 |
fungi | it's over ipv6, so looks suspiciously like the unidirectional v6 packet loss we've been seeing with systems in rackspace, but which in particular seems to especially impact the cacti server for some reason | 13:58 |
openstackgerrit | Merged zuul/zuul-jobs master: Fix url for ARA report https://review.opendev.org/745792 | 14:04 |
*** ysandeep is now known as ysandeep|dinner | 14:10 | |
frickler | mnaser: yes, it isn't 100% broken, it just seems to serve only a very exquisite subset of events | 14:12 |
fungi | well, keep in mind that the majority of events it logs shouldn't generate notifications to channels | 14:28 |
fungi | it gets the full gerrit event stream, analyzes every event and logs its decisions in the debug log, then only sends notifications for the tiny subset which its configuration says should get them | 14:29 |
fungi | i'll see if i can tell why 745831 wasn't announced to #openstack-lbaas for a start | 14:29 |
frickler | fungi: ah, you're right of course, seems the docker log only holds about 1h worth of data, we might want to log to somewhere more persistent | 14:37 |
fungi | yep, once i can figure out how to get docker-compose to show me more logs... | 14:37 |
fungi | hrm, yeah i'm starting to suspect that docker-compose is just throwing away logs and not saving them anywhere | 14:39 |
fungi | aha! it also writes to syslog | 14:40 |
fungi | looks like event 64c3a1b1decf is the one we want | 14:42 |
fungi | no, nevermind, that's not an event, that's a process | 14:42 |
frickler | but it seems to claim to have logged a message for that. not sure why every line seems to be logged twice in syslog, though | 14:43 |
fungi | Aug 12 11:03:29 eavesdrop01 64c3a1b1decf[1386]: 2020-08-12 11:03:29,173 INFO gerritbot: Potential channels to receive event notification: {'openstack-lbaas'} | 14:44 |
fungi | Aug 12 11:03:29 eavesdrop01 64c3a1b1decf[1386]: 2020-08-12 11:03:29,173 INFO gerritbot: Compiled Message openstack-lbaas: Carlos Goncalves proposed openstack/octavia master: Set Grub timeout to 0 for fast boot times https://review.opendev.org/745831 | 14:44 |
fungi | Aug 12 11:03:29 eavesdrop01 64c3a1b1decf[1386]: 2020-08-12 11:03:29,174 INFO gerritbot: Sending "Carlos Goncalves proposed openstack/octavia master: Set Grub timeout to 0 for fast boot times https://review.opendev.org/745831" to openstack-lbaas | 14:44 |
fungi | so it thinks it sent it to the server | 14:44 |
fungi | but yeah, no sign of it in http://eavesdrop.openstack.org/irclogs/%23openstack-lbaas/%23openstack-lbaas.2020-08-12.log.html | 14:45 |
johnsom | The bot isn't in the #openstack-lbaas channel according to my client. | 14:45 |
johnsom | I think it used to lurk in the channel if I remember right. | 14:46 |
fungi | johnsom: yeah, it can't join all channels (it's configured for more than freenode allows) so it opportunistically joins channels if it has a message for them and then parts if it needs to free up available channels to be able to post messages in others | 14:46 |
cgoncalves | johnsom, AJaeger wrote earlier this: " It only joins channels when there's something to notify about. There's a maximal channel limit a user/bot can be in, so it leaves/re-joins as needed." | 14:46 |
johnsom | Ah, ok | 14:47 |
fungi | so at start it's present in no channels, and joins them on demand up to the (120?) channel limit, then starts leaving the least recently needed channels as it has to join others | 14:47 |
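For illustration, a small sketch of the join-on-demand, part-least-recently-needed behaviour fungi describes; this is not gerritbot's actual code, and the 120-channel cap is only the guess from the discussion above:

```python
from collections import OrderedDict

class ChannelJuggler:
    """Track joined channels, parting the least recently needed one
    when the network's channel cap would be exceeded (illustrative only)."""

    def __init__(self, join, part, max_channels=120):
        self.join = join              # callable taking a channel name
        self.part = part              # callable taking a channel name
        self.max_channels = max_channels
        self.joined = OrderedDict()   # channel -> None, ordered by last use

    def ensure_joined(self, channel):
        if channel in self.joined:
            self.joined.move_to_end(channel)        # recently needed again
            return
        if len(self.joined) >= self.max_channels:
            oldest, _ = self.joined.popitem(last=False)
            self.part(oldest)                       # free a slot
        self.join(channel)
        self.joined[channel] = None

# Usage sketch: call juggler.ensure_joined('#openstack-lbaas') before each notification.
```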
frickler | I think there is some issue with the bot wanting to log to "openstack-lbaas" instead of "#openstack-lbaas" | 14:47 |
frickler | for channels where it works, there is a "#" in the channel name | 14:48 |
*** mlavalle has joined #opendev | 14:49 | |
fungi | i concur | 14:49 |
fungi | Aug 12 14:48:29 eavesdrop01 64c3a1b1decf[1386]: 2020-08-12 14:48:29,154 INFO gerritbot: Potential channels to receive event notification: {'#openstack-release'} | 14:49 |
fungi | et cetera | 14:49 |
fungi | so looks like maybe a configuration error | 14:49 |
fungi | though the configuration doesn't use # in front of any channel names | 14:50 |
frickler | channel_config.yaml doesn't have a # for any channel | 14:50 |
fungi | yeah, filtering the logs i see it also incorrectly trying to send to a bunch of non-# channel names | 14:53 |
fungi | i wonder if this is a recent regression in gerritbot | 14:53 |
fungi | looks like the version running on review.o.o was e387941 from december | 14:56 |
fungi | luckily there have been only 6 commits since then | 14:57 |
*** priteau has joined #opendev | 14:59 | |
frickler | we did run with py2 on review, didn't we? so might be a py3 issue, the commits since dec don't look suspicious to me | 14:59 |
fungi | yeah, i've now gone over all the recent commits since what we had installed on review.o.o and i agree, none of those were significant in ways which should impact this | 15:01 |
fungi | so gonna need to roll up sleeves and dive deeper in the code | 15:01 |
*** ysandeep|dinner is now known as ysandeep | 15:03 | |
frickler | fungi: https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L403-L412 looks to mix up using data and self.data, I'd go for cleaning that up first | 15:04 |
fungi | https://opendev.org/opendev/gerritbot/src/branch/master/gerritbot/bot.py#L406-L407 is where it seems to prepend the # | 15:05 |
fungi | oh, you're already there-ish | 15:05 |
fungi | i feel like we're both covering the same ground | 15:05 |
frickler | fungi: my python-fu isn't strong, but that code makes me wonder whether changing data after setting "self.data=data" may behave differently with py3 | 15:06 |
fungi | yeah, modifying an iterable in place | 15:07 |
fungi | while iterating on it | 15:08 |
fungi | actually not a great idea. better to iterate on a copy while using it as a reference to modify the original | 15:08 |
fungi | i bet data.keys() returned a copy in python 2 but returns an iterable tied to the data object in python 3 | 15:09 |
fungi | we might want to do keys = list(data.keys()) there? | 15:09 |
AJaeger | "modifying an iterable in place" is not allowed anymore with python 3.7 and a hard failure. | 15:10 |
fungi | `python --version` in the container says "Python 3.7.8" | 15:12 |
fungi | so i guess that's not it | 15:12 |
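For reference, a minimal sketch of the pattern behind the fix that eventually merged below ("Iterate over a copy of the channel keys"). The channel-config shape here is made up, but the underlying point is the one discussed above: Python 2's dict.keys() returned a list copy, while Python 3 returns a live view, so mutating the dict mid-loop needs an explicit snapshot:

```python
# Hypothetical channel config; the real gerritbot structure may differ.
channels = {
    'openstack-lbaas': {'events': ['patchset-created']},
    '#opendev': {'events': ['change-merged']},
}

# Iterate over a snapshot of the keys so entries can be renamed (prepending
# '#') while looping; iterating the live view in Python 3 while popping and
# re-adding keys is unreliable and can raise RuntimeError.
for key in list(channels.keys()):
    if not key.startswith('#'):
        channels['#' + key] = channels.pop(key)

print(sorted(channels))  # ['#opendev', '#openstack-lbaas']
```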
* frickler needs to leave, will check back later | 15:15 | |
*** ysandeep is now known as ysandeep|away | 15:19 | |
fungi | i'll keep fiddling with it | 15:19 |
smcginnis | mnaser: I've seen two patches with retry failures to a Vexxhost mirror. Not sure if it's just flakiness in the network or something else, but thought I should mention it. | 15:20 |
mnaser | smcginnis: yeah, i noticed that this morning. i dont see anything in the system itself | 15:20 |
mnaser | is it happening recently? apparently the most recent i've seen it was around 9am-ish | 15:21 |
mnaser | smcginnis: it was _really_ slow to respond, don't have access to the machine. when did those patches fail? | 15:21 |
smcginnis | mnaser: Just hit it now with https://zuul.opendev.org/t/openstack/build/3ea6602420aa4c3abe80d497f57b5777 | 15:21 |
mnaser | yes. i can see that when i click http://mirror.ca-ymq-1.vexxhost.opendev.org/ it takes a while to open folders | 15:22 |
smcginnis | The last one I looked at before this (last night I think) it had some retries that eventually succeeded for an earlier package, ran fine for a few more, then timed out with retries on a later one. | 15:22 |
mnaser | everything is ok from our side :\ | 15:24 |
mnaser | cc infra-root ^ | 15:24 |
smcginnis | We can blame network gremlins for now. | 15:24 |
mnaser | smcginnis: it's repeated a few times today | 15:24 |
mnaser | plus it's stopped reporting into cacti too, so there's that | 15:25 |
mnaser | we really should get to the bottom of it otherwise we're just wasting compute power | 15:25 |
smcginnis | Yeah | 15:26 |
smcginnis | mnaser: When you said "on our side" above, were you referring to vexxhost or opendev as "our"? :) | 15:27 |
mnaser | smcginnis: sorry, vexxhost is not seeing any issues :) | 15:27 |
openstackgerrit | Thierry Carrez proposed opendev/system-config master: Redirect UC content to TC site https://review.opendev.org/744497 | 15:27 |
fungi | mnaser: you probably missed my investigation of the cacti situation above, but i don't see that it's likely to be related | 15:28 |
mnaser | it hurts us even more because that means any job running on our cloud will fail and just burn through systems | 15:28 |
mnaser | oh, i guess this might be the ipv6 thing happening eh :\ | 15:29 |
mnaser | the thing is, i am having problems accessing http://mirror.ca-ymq-1.vexxhost.opendev.org even over ipv4 (seeing it load slowly when i browse around) | 15:29 |
fungi | tcpdump shows snmp requests reaching the mirror, snmpd on the mirror replies, but those responses never reach cacti. that's over ipv6 and we've got similar situations with other systems not able to get v6 packets back to cacti (even from within rackspace's own network) | 15:29 |
fungi | afs, in contrast, is all over ipv4 | 15:29 |
fungi | it doesn't even support ipv6 | 15:29 |
mnaser | do apache logs show any slow requests? | 15:30 |
smcginnis | OK, just saw three more recent job failures matching this. Looks like it definitely is a bigger issue. | 15:31 |
mnaser | yeah, i am seeing failures here too | 15:31 |
fungi | apache doesn't log the time a request takes to satisfy, that i can find | 15:36 |
mnaser | ah, that's a bummer | 15:37 |
mnaser | nothing in error logs? | 15:37 |
fungi | it's constantly spewing file negotiation failures, that's generally just because it doesn't know the file type though i think | 15:37 |
fungi | like: | 15:38 |
fungi | AH00687: Negotiation: discovered file(s) matching request: /var/www/mirror/wheel/ubuntu-18.04-x86_64/a/appdirs/index.html (None could be negotiated)., | 15:38 |
fungi | also seeing some of: | 15:38 |
fungi | AH01401: Zlib: Validation bytes not present, | 15:38 |
fungi | i'll shift gears to look into this deeper, and finish worrying about fixing gerritbot later | 15:39 |
fungi | does someone have a link to a job failure? | 15:39 |
mnaser | fungi: https://zuul.opendev.org/t/vexxhost/build/26648c0867ab4f7eb4aa5567f60007e1 here is one | 15:40 |
fungi | thanks | 15:40 |
fungi | okay, so for starters, /pypi/ isn't anything we mirror, it's not served out of afs, this is a proxy to the nearest fastly cdn endpoint for pypi.org | 15:43 |
fungi | so i'll check to see whether that mirror host is having trouble reaching or getting responses from pypi | 15:43 |
mnaser | https://zuul.opendev.org/t/vexxhost/build/61c6cf30b09c4deb821e05108dc1b0d3 -- another breakage too, this one might not be related to the cache though fungi | 15:43 |
mnaser | i think /pypifiles/ is hosted locally | 15:43 |
fungi | nope, also a proxy | 15:44 |
mnaser | ah | 15:44 |
fungi | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror/templates/mirror.vhost.j2#L94 | 15:44 |
fungi | pypi is split between an index site and a file hosting site, so we have to proxy both | 15:45 |
fungi | but both sites use the same cdn network (fastly) so are likely winding up hitting the same endpoints for it | 15:45 |
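To make the index/files split concrete, a quick sketch using pypi's public JSON API directly (the mirror's /pypi/ and /pypifiles/ paths are just proxied views of these two hosts): the metadata comes from pypi.org, but every file URL it returns points at files.pythonhosted.org, which is why both need a proxy.

```python
import json
import urllib.request

# Fetch release metadata for an arbitrary example package from the index host...
with urllib.request.urlopen('https://pypi.org/pypi/requests/json', timeout=30) as resp:
    data = json.load(resp)

# ...and note that the downloadable artifacts live on the separate file host.
print(data['urls'][0]['url'])  # https://files.pythonhosted.org/packages/...
```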
fungi | Connecting to pypi.org (pypi.org)|2a04:4e42:200::223|:443... | 15:45 |
fungi | when i do `wget -O/dev/null 'https://pypi.org/'` | 15:46 |
fungi | it's just sitting there | 15:46 |
fungi | mnaser: can you reach pypi at all over ipv6 from vexxhost ca-ymq-1? | 15:47 |
mnaser | yep, its hanging indeed | 15:48 |
fungi | so that seems to be the crux of the problem | 15:48 |
mnaser | at this point it wouldn't even surprise me that someone got mad at someone and unpeered and routes are gone | 15:48 |
fungi | yup | 15:49 |
fungi | i don't miss my days doing isp backbone peering | 15:49 |
fungi | all the finger-pointing between carriers was atrocious | 15:49 |
fungi | we can dial max-servers down to 0 there in the short term if you'd like | 15:50 |
fungi | basically anything trying to pip install in a job is going to fail there for now, i think | 15:50 |
*** shtepanie has joined #opendev | 15:51 | |
fungi | i'll push something up | 15:53 |
mnaser | give me a few minutes to try and see whats going on before we stop the whole thing, if that's ok | 15:53 |
fungi | sure, works for me | 15:53 |
mnaser | i mean, maybe we could workaround it by adding pypi.org ipv4 address to /etc/hosts -- just for now i guess | 15:54 |
fungi | that might work. i think apache will check that. but since pypi and pythonhosted are using a cdn, hard-coding its ip addresses could be risky | 15:54 |
mnaser | i agree, but i am talking about doing that for an hour or two at most while i debug, just to get things flowing again | 15:55 |
mnaser | i agree, but i am talking about doing that for an hour or two at most while i debug, just to get things flowing again | 15:55 |
mnaser | oops, wrong arrow up enter window | 15:55 |
fungi | yeah, i also can't ping ipv6 addresses for some stuff in rackspace from the mirror there, so i guess it's a fairly broad set of routes affected | 15:56 |
mnaser | ping6 google.ca seems to work though | 15:57 |
mnaser | so it's a subset of things... | 15:57 |
*** lpetrut has quit IRC | 15:58 | |
fungi | yep, i mean, i'm ssh'd in over ipv6 so it's obviously working for some routes | 15:58 |
fungi | but it's clearly more than just the fastly cdn endpoint impacted | 15:59 |
mnaser | fungi: so you said that when you were doing your tcpdumps, incoming traffic arrived and was responded to, but the replies never reached rax (wrt snmp?) | 15:59 |
fungi | correct | 15:59 |
fungi | so this could be related, and it's an asymmetric route | 15:59 |
fungi | installing traceroute to see if i can spot a difference | 16:00 |
mnaser | fungi: if you don't mind, could you run a traceroute from rax to the mirror? | 16:00 |
mnaser | thank you :) | 16:00 |
fungi | heh | 16:00 |
mnaser | that way i can see at least which path it's unhappy about | 16:00 |
fungi | yeah, been in your shoes more times than i care to remember | 16:01 |
fungi | mnaser: oddly, i never get a response even from the gateway | 16:03 |
fungi | no hops responding | 16:03 |
fungi | oh wow. | 16:04 |
fungi | no default route? | 16:04 |
fungi | oh, nevermind, there are two | 16:04 |
fungi | looks like i'm seeing default routes announced through fe80::ce2d:e0ff:fe0f:74af and fe80::ce2d:e0ff:fe5a:d84e but neither are responding with ttl expired when trying to traceroute | 16:05 |
fungi | tried both udp (default) and icmp traceroute | 16:06 |
fungi | i can traceroute to my home ipv6 address just fine, the one i'm ssh'd in from | 16:07 |
fungi | but not to cacti's | 16:07 |
fungi | nor to pypi.org | 16:08 |
fungi | my home address is in 2600:6C00::/24 and traceroute shows responses starting from 2604:e100:1:0:ce2d:e0ff:fe5a:d84e in vexxhost | 16:10 |
fungi | but even that hop doesn't show up when tracerouting to cacti or pypi | 16:10 |
fungi | do you plumb ebgp all the way down into your igp there? or is there something shadowing those prefixes in your igp? | 16:11 |
mnaser | fungi: fe80::ce2d:e0ff:fe0f:74af is a link-local address, which is announced to all vms | 16:13 |
mnaser | they shouldnt be reachable from the vms there | 16:14 |
mnaser | i can see that it's pingable from another machine here | 16:14 |
fungi | pypi is? | 16:14 |
fungi | but yeah, for v6 destinations i can reach, 2604:e100:1:0:ce2d:e0ff:fe5a:d84e responds as the first hop. for v6 destinations i can't reach, there is no first hop (or any hops) responding | 16:15 |
mnaser | which means thats the same host as fe80::ce2d:e0ff:fe5a:d84e | 16:16 |
fungi | making me suspect that the first router is black-holing those prefixes somehow | 16:16 |
mnaser | which means that potentially fe80::ce2d:e0ff:fe0f:74af is the issue | 16:17 |
fungi | if it simply didn't have a route for them i'd expect an icmp no route to host or network unreachable | 16:17 |
mnaser | plot twist, 74af is the one that holds the closer route to pypi | 16:18 |
fungi | fwiw, i can ping 2604:e100:1:0:ce2d:e0ff:fe0f:74af just fine | 16:18 |
mnaser | oh i have an idea | 16:19 |
fungi | so if it were the problem i'd expect to be getting messages back from it in a traceroute | 16:19 |
mnaser | can you ping 2001:550:2:6::26:2 | 16:19 |
mnaser | and 2605:9000:400:107::c | 16:19 |
fungi | i get responses from 2605:9000:400:107::c but not 2001:550:2:6::26:2 | 16:19 |
mnaser | progress | 16:21 |
mnaser | fungi: what about 2001:550:2:6::26:1 ? | 16:22 |
fungi | no response | 16:22 |
mnaser | can i get a trace to :1 ? | 16:22 |
fungi | traceroute to 2001:550:2:6::26:1 (2001:550:2:6::26:1), 30 hops max, 80 byte packets | 16:23 |
fungi | 1 * * * | 16:23 |
fungi | nothing back from any hops | 16:23 |
mnaser | the heck | 16:23 |
mnaser | is there no route towards it? | 16:23 |
mnaser | fungi: i assume you are pinging/tracing from your local system right? | 16:23 |
fungi | or whatever is handling the next hop is eating it silently | 16:24 |
fungi | these are pings/traceroutes from the mirror instance in ca-ymq-1 | 16:24 |
mnaser | ah, i thought those were pings externally | 16:24 |
mnaser | are you able to reach those from rax or local at your side? | 16:25 |
fungi | from home i can reach all three of 2001:550:2:6::26:2 2605:9000:400:107::c 2001:550:2:6::26:1 | 16:26 |
fungi | same from bridge.openstack.org in rackspace dfw | 16:26 |
mnaser | whats interesting is | 16:27 |
mnaser | it stopped working exactly at 4am utc | 16:27 |
mnaser | which is 12am est | 16:27 |
fungi | i can also ping the mirror from bridge.o.o but can't ping bridge.o.o from the mirror | 16:27 |
mnaser | hmm, provider did an upgrade overnight | 16:28 |
fungi | similarly i can ping mirror.ca-ymq-1.vexxhost.opendev.org from cacti.openstack.org but not the reverse | 16:29 |
fungi | it's like replies to inbound flows for icmp echo request are set up and the returning echo replies get routed correctly (but the same is apparently not true of snmp/udp?) | 16:30 |
fungi | it's just odd to see this sort of stateful behavior at the carrier level. they must be doing some sort of flow-based balancing across their gear or something and not simple hash | 16:33 |
fungi | or maybe i'm just getting lucky and outbound icmp echo replies are getting hashed through a working router but not outbound echo requests | 16:35 |
mnaser | fungi: yeah.. something is weird there. i turned off the ipv6 peer to restart it | 16:35 |
mnaser | and now it's just stuck in opensent | 16:35 |
fungi | ew | 16:36 |
mnaser | fungi: ok, it's escalated with the network provider right now | 16:42 |
mnaser | bgp session is back up but this is still a problem | 16:42 |
fungi | want me to go forward with a max-servers=0 patch for now? | 16:43 |
mnaser | yeah lets do that and we can approve it together quickly to unblock world | 16:43 |
openstackgerrit | Jeremy Stanley proposed openstack/project-config master: Temporarily disable vexxhost ca-ymq-1 https://review.opendev.org/745929 | 16:45 |
fungi | mnaser: ^ feel free to single-core approve | 16:45 |
fungi | i can enqueue to the gate directly | 16:45 |
mnaser | fungi: done and think that's a good idea | 16:46 |
mnaser | as we may fail a few times along our way there | 16:46 |
fungi | it's in the gate now | 16:46 |
openstackgerrit | Jeremy Stanley proposed opendev/gerritbot master: Iterate over a copy of the channel keys https://review.opendev.org/745930 | 16:50 |
fungi | frickler: AJaeger: clarkb: ^ i hope that's the fix for gerritbot | 16:50 |
openstackgerrit | Merged openstack/project-config master: Temporarily disable vexxhost ca-ymq-1 https://review.opendev.org/745929 | 16:59 |
*** Baragaki has joined #opendev | 17:04 | |
*** priteau has quit IRC | 17:27 | |
*** Marcelo- has quit IRC | 17:59 | |
fungi | infra-root: if anyone happens to have a moment to spare, i'm hoping https://review.opendev.org/745930 will solve our latest gerritbot regression | 18:11 |
*** knikolla_ has joined #opendev | 18:13 | |
clarkb | fungi: one jetty based +2 | 18:17 |
corvus | +3 | 18:18 |
fungi | much thanks all! | 18:19 |
fungi | i'll give the logs a close watch after this merges | 18:20 |
*** mordred has quit IRC | 18:20 | |
*** gouthamr has quit IRC | 18:20 | |
*** knikolla has quit IRC | 18:20 | |
*** knikolla_ is now known as knikolla | 18:20 | |
fungi | clarkb: enjoy the jetty! hope you catch something without catching something | 18:20 |
corvus | heh, i thought clarkb was digging into java web servers, but this is better. :) | 18:21 |
*** Eighth_Doctor has quit IRC | 18:21 | |
*** gouthamr has joined #opendev | 18:22 | |
*** mordred has joined #opendev | 18:28 | |
openstackgerrit | Merged opendev/gerritbot master: Iterate over a copy of the channel keys https://review.opendev.org/745930 | 18:56 |
*** Eighth_Doctor has joined #opendev | 19:00 | |
*** shtepanie has quit IRC | 19:21 | |
mnaser | infra-root: would anyone be kind enough to run a traceroute6 from bridge.openstack.org to mirror.ca-ymq-1.vexxhost.opendev.org ? | 19:23 |
corvus | on it | 19:25 |
corvus | mnaser: http://paste.openstack.org/show/796794/ | 19:27 |
fungi | mnaser: traceroute back is unfortunately still blank, no response even from the first hop | 19:29 |
mnaser | corvus, fungi: thank you. yes, the no response from first hop is very confusing | 19:30 |
corvus | mnaser, fungi: fwiw, mtr from bridge: http://paste.openstack.org/show/796795/ | 19:32 |
mnaser | corvus: while you're in there, can you confirm if '2604:e100:1:0:ce2d:e0ff:fe0f:74af' and '2604:e100:1:0:ce2d:e0ff:fe5a:d84e' are indeed currently sending icmp requests to that system? | 19:43 |
mnaser | sorry, i just don't have access to a system that is on an 'unreachable' network :( | 19:43 |
corvus | working | 19:43 |
corvus | mnaser: yes receiving and replying | 19:48 |
corvus | seems fairly steady at 1hz each | 19:48 |
mnaser | hrm, ok. alright, so both outbound routes are working just fine | 19:48 |
corvus | and stopped :) | 19:48 |
mnaser | yep, as expected | 19:49 |
*** hashar has quit IRC | 20:43 | |
mnaser | sigh | 21:07 |
mnaser | infra-root: can someone run this on the mirror node -- `ip -6 addr list | grep 2001:db8 | awk '{ print $2 }' | xargs -I {} -n1 ip addr del {} dev eth0` | 21:08 |
mnaser | somehow 2001:db8:{0,1}::/64 addresses got dynamically configured, i'm still digging into this, but they'll need to be removed | 21:08 |
openstackgerrit | Mohammed Naser proposed openstack/project-config master: Revert "Temporarily disable vexxhost ca-ymq-1" https://review.opendev.org/745966 | 21:10 |
fungi | mnaser: probably you want ens3 instead of eth0, and done | 21:10 |
mnaser | infra-root: ^ appreciate a vote on that, once that's done and verified, we can land that, meanwhile i'll be investigating how the ra showed up | 21:10 |
mnaser | fungi: ah cool, wanna run/merge that? | 21:11 |
fungi | can do, also after clearing those routes i can reach stuff from the mirror again | 21:12 |
fungi | also https://mirror.ca-ymq-1.vexxhost.opendev.org/pypi/ is working | 21:12 |
fungi | mnaser: and approved, thanks for looking into it! | 21:13 |
mnaser | fungi: what i noticed when tcpdumping was that it was picking the 2001:db8:: address as src | 21:13 |
mnaser | when trying to reach pypi.org | 21:13 |
mnaser | but going to google, it wasn't | 21:13 |
fungi | got it, so there was a rogue prefix announced on that lan? | 21:16 |
fungi | fun stuff | 21:16 |
fungi | i think we thought we saw that once in limestone too, but couldn't repro it | 21:17 |
fungi | that also explains the symptoms we saw, as far as being able to reach the machine but it not being able to reach stuff | 21:18 |
fungi | it was responding from the address to which things were connecting, but initiating from a different address which wasn't routable | 21:18 |
fungi | 2001:db8:0:3::/64 dev ens3 proto ra metric 100 expires 2550846sec pref medium | 21:21 |
fungi | 2001:db8:1::/64 dev ens3 proto ra metric 100 expires 2550801sec pref medium | 21:21 |
fungi | we had both of those in the local routing table | 21:21 |
fungi | 2001:db8::/32 IPV6-DOC-AP "IPv6 prefix for documentation purpose" (This address range is to be used for documentation purpose only. For more information please see http://www.apnic.net/info/faq/ipv6-documentation-prefix-faq.html ) | 21:22 |
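A small diagnostic sketch along the lines of what the tcpdump showed: a connected UDP socket sends no packets, but getsockname() reveals which source address the kernel selected for a destination. Run on the affected mirror (an assumption, not something that was done above), a 2001:db8: source showing up for pypi.org but not for google.ca would match the observed behaviour:

```python
import socket

def v6_source_for(host, port=443):
    # Resolve an IPv6 address for the destination, "connect" a UDP socket
    # (no traffic is generated), and report the locally chosen source address.
    addr = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_DGRAM)[0][4]
    with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as s:
        s.connect(addr)
        return s.getsockname()[0]

for host in ('pypi.org', 'google.ca'):
    print(host, '->', v6_source_for(host))
```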
logan- | https://bugs.launchpad.net/neutron/+bug/1844712 | 21:23 |
openstack | Launchpad bug 1844712 in OpenStack Security Advisory "RA Leak on tenant network" [Undecided,Incomplete] | 21:23 |
logan- | that was a strange one. what youre seeing looks like a recurrence of that bug mnaser. the block is different, but that's only because the ipv6 blocks those jobs use were updated over the course of that bug: https://bugs.launchpad.net/neutron/+bug/1844712/comments/8 ...to the cidr you saw today :) | 21:28 |
openstack | Launchpad bug 1844712 in OpenStack Security Advisory "RA Leak on tenant network" [Undecided,Incomplete] | 21:28 |
clarkb | fungi: corvus got two keeper rock fish. we got a ling cod but it was below minimum size so went back in | 21:30 |
fungi | codesearch mostly turns up hits in our docs (unsurprisingly) but the prefix also gets heavy use in test for neutron, horizon, octavia, nova, tripleo, searchlight, ironic, manila, swift, zun, charms, designate, tempest, devstack, kuryr, osc, vitrage, watcher, cinder, monasca, sdk, puppet, several oslo libs... http://codesearch.openstack.org/?q=2001%3Adb8 | 21:31 |
fungi | hard to tell just from that what might be spewed via route announcements from misconfigured job nodes | 21:32 |
mnaser | logan-: urgh. but you run lxb right? | 21:32 |
logan- | yup, lxb cloud in that bug | 21:32 |
fungi | but since the mirror is in a different tenant... it's somewhat unexpected behavior | 21:33 |
mnaser | i guess this is more of a firewall driver issue. i think we have some systems that use iptables_hybrid and some with ovs driver | 21:33 |
mnaser | logan-: you dont have a repro i assume? | 21:35 |
logan- | possible way to approach it: find a timestamp for the RA / IP getting added on mirror, and then correlate that with jobs that were running at the time to try and identify the suspect VM(s), then look thru nova/neutron logs to try to find what went wrong | 21:36 |
logan- | nope, never was able to repro | 21:36 |
mnaser | i cant imagine this would be easy to reproduce | 21:36 |
logan- | and you'd think with all the vm launches going on we'd see it more often. it is crazy when it pops up | 21:36 |
fungi | gotta be a rare race with ports and filters or something | 21:37 |
mnaser | logan-: how many times have you hit this? i'm asking because we only recently upgraded to stein for this cloud | 21:37 |
logan- | once | 21:37 |
logan- | on rocky | 21:37 |
mnaser | this cloud was on queens for a little while and we never hit it, but hitting it once also isn't much of an indicator | 21:37 |
mnaser | logan-: what is interesting is this failed almost near exactly at 4am utc / 12am est | 21:37 |
mnaser | https://usercontent.irccloud-cdn.com/file/JGp4BKOx/image.png | 21:38 |
logan- | hmm, iirc ours was mid-morning EST so i didn't think much of the timing | 21:38 |
mnaser | 19th of september | 21:39 |
mnaser | let me see if cacti goes that far back | 21:39 |
mnaser | unfortunately not | 21:40 |
fungi | likely the server got rebuilt | 21:49 |
fungi | replaced, whatever | 21:49 |
fungi | we've been replacing a lot of our mirrors over the past year for ubuntu upgrades and newer domain name | 21:49 |
fungi | and ansibilification | 21:49 |
openstackgerrit | Merged openstack/project-config master: Revert "Temporarily disable vexxhost ca-ymq-1" https://review.opendev.org/745966 | 21:50 |
logan- | in the irc log from that bug, https://i.imgur.com/XsJTj6Y.jpg was linked. that's in central time.. so when it happened on our cloud it was around 9:15 AM Central. then it wasn't discovered until around 10:30-11, and the test node(s) that we guessed might have caused the issue were long gone by then. | 21:55 |
fungi | clarkb: if you're around, what triggers updating the docker image for gerritbot on eavesdrop? i see that we published the hopefully-fixed build to dockerhub but the hourly system-config deploy doesn't seem to be doing it. the daily deploy? | 22:38 |
clarkb | I think that is currently missing | 22:39 |
clarkb | we can have gerritbot changes themselves trigger them or do them hourly like zuul and nodepool | 22:39 |
clarkb | also need to tie in project-config to trigger infra-prod-service-eavesdrop when channel config updates | 22:40 |
*** tkajinam has joined #opendev | 22:57 | |
fungi | clarkb: so for now should i just docker-compose down/up -d? | 23:00 |
clarkb | you need to do a pull first I think | 23:12 |
clarkb | but ya that should do it | 23:12 |
fungi | ahh, right-o | 23:12 |
fungi | pulled | 23:12 |
*** openstackgerrit has quit IRC | 23:13 | |
fungi | downed | 23:13 |
fungi | upped | 23:13 |
fungi | watching syslog | 23:13 |
*** tosky has quit IRC | 23:13 | |
fungi | #status log manually pulled, downed and upped gerritbot container on eavesdrop for recent config parsing fix | 23:15 |
openstackstatus | fungi: finished logging | 23:15 |
*** ryohayakawa has joined #opendev | 23:49 | |
*** ryohayakawa has quit IRC | 23:56 | |
*** ryohayakawa has joined #opendev | 23:57 |