Friday, 2021-03-26

fungigitea05 has recovered sufficiently that i'm readding it to the pool00:06
fungialso huge thanks to jralbert for helping us work out that this might be the fallout from pathological deployment failure scenarios with openstack-ansible. if it crops up again that may make it easier to work out the cause, and to design around that case00:07
fungiper discussion in #openstack-ansible00:07
fungiianw: the other unusual (almost certainly unrelated) situation we had earlier was that four nodes booted in ovh-bhs1 around 03:30 utc which the launcher never returned as completed or rejected (three were filled though had some retries, the fourth failed three tries in a row), so the node requests were still locked by the launcher and the change they were for was blocking other changes in a gate queue for01:04
fungisome 14 hours01:04
fungirestarting the nodepool-launcher container on nl04 released those locks and allowed the requests to be picked up by another launcher01:05
ianwfungi: huh ... do we suspect zuul changes?01:38
fungiseems fairly unlikely, unless zuul somehow replaced the request and didn't register the completion/rejection01:45
fungithe launcher never logged "Fulfilled node request" for them01:48
fungiwhich i think means it stopped short for some reason01:48
fungigrep 200-0013422532 /var/log/nodepool/launcher-debug.log.2021-03-24_2001:48
fungiyou'll see the last thing it logs is "Node is ready"01:49
fungifor normal fulfilled requests the launcher logs "Fulfilled node request" next01:49
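The check described above (a request that logs "Node is ready" but never "Fulfilled node request") could be applied across a whole launcher log rather than one request at a time. The following is a minimal sketch, not nodepool tooling: the function name `unfulfilled_requests` is hypothetical, and it assumes each relevant log line contains the request ID in the `200-0013422532` form used in the grep above.

```python
import re

# Assumption: request IDs look like "200-0013422532" (three digits, a hyphen,
# ten digits) and appear on every log line that concerns the request.
REQUEST_ID = re.compile(r"\b\d{3}-\d{10}\b")


def unfulfilled_requests(lines):
    """Return request IDs that logged "Node is ready" but never
    "Fulfilled node request" anywhere in the given lines."""
    ready = set()
    fulfilled = set()
    for line in lines:
        match = REQUEST_ID.search(line)
        if not match:
            continue
        request_id = match.group(0)
        if "Node is ready" in line:
            ready.add(request_id)
        elif "Fulfilled node request" in line:
            fulfilled.add(request_id)
    return sorted(ready - fulfilled)
```

Passing an open log file as `lines` would list candidate stuck requests. A request that is still in progress at the end of the log shows up as well, so the result is a starting point for inspection rather than proof of a stuck lock.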
openstackgerritIan Wienand proposed opendev/system-config master: gitea: switch to token auth for project creation
openstackgerritMerged opendev/ master: Remove review-dev
openstackgerritMerged opendev/ master: Add
openstackgerritIan Wienand proposed opendev/system-config master: Add
*** marios has joined #opendev06:09
*** ricolin has joined #opendev07:17
*** ykarel is now known as ykarel|lunch08:09
*** elod_afk is now known as elod08:52
openstackgerritMoshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse.
openstackgerritMoshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse.
*** tosky has joined #opendev09:30
fungi i've got a morning full of errands and won't be around much before 15:00 utc11:29
openstackgerritJeremy Stanley proposed opendev/system-config master: Clean up Gerrit image builds
openstackgerritJeremy Stanley proposed opendev/jeepyb master: Bump gerritlib requirement to 0.10.0
openstackgerritMoshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse.
openstackgerritMoshiur Rahman proposed openstack/diskimage-builder master: Fix: IPA image buidling with OpenSuse.
openstackgerritMerged opendev/base-jobs master: This is to test the changes made in
openstackgerritJames E. Blair proposed ttygroup/gertty master: Highlight WIP state in change view
corvusfungi: ^ that bit me too14:46
fungiooh, thanks!14:47
fungii saw the wip filter change up, but it was also conflicting with one of the other gertty patches i was trying out (i forget which one now)14:48
corvusi think that's an old one14:48
corvusi don't think there's a patch for filtering wip-state changes14:49
corvushowever, you can change the default queries i think, so you should be able to do that in the config file if you want14:49
fungiaha, yeah 19000114:49
corvus(i'd rather see them myself -- and then maybe add an indication in the listings that they're wip)14:49
fungiand yes, i'd also rather see wip changes listed, but some way to see they're wip in the change list/query result screen could be cool14:50
openstackgerritMerged ttygroup/gertty master: Add support for searching for hashtags
corvusi'm tempted to put in a 1-char wide column for state, but with states of N,M,W that will be hard to read.  like, you couldn't pick 3 letters harder to differentiate at a glance.  :)14:53
fungimaybe a different kind of row highlighting, similar to how already-reviewed changes are dimmed?14:57
corvusyeah, maybe brown?14:58
corvusfungi: do you run in 80 chars?14:59
fungii do15:00
fungii have a full 16 colors though (if you count black)!15:00
corvusk.  then if i add a new multi-char column, it wouldn't show up for you (80 chars already drops the topic column)15:00
corvus(and branch)15:01
corvusi'll cipher on this some more15:01
fungiyeah, i normally just see number, subject, owner, updated, and single-letter label columns15:01
fungiwhich is plenty for me15:01
fungithe only time the 80 columns becomes a challenge is when an insanely long patch series ends in changes where i can no longer see subjects15:02
fungisomething like what *top tools have, to cycle which columns are displayed, could be neat, but probably a lot of work15:03
johnsomHi OpenDev folks. We are seeing some DNS failures on jobs this morning.
johnsomComplete output from command python egg_info:\n    Download error on [Errno -3] Temporary failure in name resolution -- Some packages may not be found!15:24
johnsomIt's causing POST_FAILURE on the stackviz pip step15:25
fungijohnsom: thanks for the heads up, digging into it now15:25
johnsomThank you15:26
johnsomAlso, another job in the same check run: 2021-03-26 14:26:59.119795 | controller | E: Failed to fetch  Unable to connect to
fungia few possibilities... the job is for some reason not querying the local unbound daemon on the node, unbound has died, or opendns/googledns to which unbound is forwarding isn't responding15:27
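The "[Errno -3] Temporary failure in name resolution" text quoted above is what glibc reports for `EAI_AGAIN`, which Python exposes as `socket.gaierror`. As a rough sketch of how a client could tell that transient case apart from a permanent lookup failure, the helper below retries only on `EAI_AGAIN`. The name `resolve_with_retry`, its parameters, and the injectable `resolver` argument are assumptions for illustration; this is not what pip or the job roles do.

```python
import socket
import time


def resolve_with_retry(host, attempts=3, delay=1.0, resolver=socket.getaddrinfo):
    """Resolve host, retrying only on temporary failures (EAI_AGAIN).

    Returns a sorted list of unique addresses. Any other resolver error,
    or a temporary failure on the final attempt, is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            results = resolver(host, 443, proto=socket.IPPROTO_TCP)
            # Each result is (family, type, proto, canonname, sockaddr);
            # the address is the first element of sockaddr.
            return sorted({info[4][0] for info in results})
        except socket.gaierror as exc:
            if exc.errno != socket.EAI_AGAIN or attempt == attempts:
                raise
            time.sleep(delay)
```

Retrying only masks the symptom. It does not say whether the local unbound daemon, the job's resolver configuration, or the upstream forwarder was at fault.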
fungithat dns failure occurred in ovh-bhs1 whereas the mirror connection error was in limestone-regionone, so the two are probably unrelated15:29
johnsomAnother job just failed in that patch check pipeline: Err:170 bionic-updates/universe amd64 osinfo-db all 0.20180929-1ubuntu0.1 Unable to connect to
johnsomunbound shows "2606:4700:4700::1111#53" as the forwarded DNS server and it got no answer15:34
fungidns lookup on the first one (i'm still looking at that right now, i can only realistically investigate one failure at a time, sorry) happened between 15:18:16 and 15:18:59, and there's an entry in the unbound log for a similar lookup at 15:18:3615:35
johnsomSo either cloudflare has an issue or IPv6 out of that limestone region is having troubles15:35
johnsomYeah, no worries, was just trying to share the information.15:35
fungiwell, as i said, i'm still looking at the first one, and that didn't happen in limestone15:39
fungitrying to get to the bottom of why the lookup through unbound failed. i'm seeing a number of "Verified that unsigned response is INSECURE" messages toward the end of the processing for that query15:40
fungirelated to records for
*** chkumar|ruck is now known as raukadah15:41
is presently an alias for when i query it from home15:41
fungiyeah, not finding any ds records for either domain15:48
fungiso i expect that's normal15:49
johnsomfastly has a declared issue in Paris:
johnsomNot sure if that is close/related15:49
fungiinterestingly, ovh bhs1 is in quebec so probably not15:49
fungibut could be. fastly often doesn't direct traffic to nearby endpoints15:50
fungiour mirror server in bhs1 is able to retrieve without error presently, but it's possible the issue is intermittent there too15:51
fungiregardless, it does look like unbound in that first failure returned and 2a04:4e42:46::22315:52
fungidifferent addresses than i get when performing a lookup from right now15:53
fungiunrelated, i wonder why the process-stackviz role is hitting pypi directly instead of going through our pypi proxy-caching "mirror" server there15:54
corvusinfra-root: fyi i found a new gitea-based foss code hosting site and said hello:
fungicorvus: neat! glad to see we're not alone15:56
fungijohnsom: traceroute from bhs1 to shows it going to mae-east, so paris involvement is unlikely15:58
fungirtt is far too low to be anywhere farther15:59
fungiand not enough hops past there, so doubtful it's taking the transatlantic line out of mae-east15:59
johnsomlol,  yeah, transatlantic penalty is pretty obvious16:00
fungiso anyway, i'm a bit baffled. it looks like pip performed a host lookup to unbound which in turn queried external dns and returned valid records, so it seems like pip may be confused about the error (or somehow unbound didn't correctly return the results back through to libc)16:08
fungion the limestone errors, we've seen intermittent network connectivity issues there, particularly when connecting to the mirror instance which should be on an adjacent network. it's possible we've got more cases of bug 1844712 there16:11
openstackbug 1844712 in OpenStack Security Advisory "RA Leak on tenant network" [Undecided,Incomplete]
fungibut it wouldn't hurt to start collecting the neighbor table from the server periodically and seeing if that's showing new examples16:11
openstackgerritGomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Test changes made using base-test
openstackgerritGomathi Selvi Srinivasan proposed zuul/zuul-jobs master: DNM : Test changes made using base-test
openstackgerritGomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Test changes made using base-test
*** ysandeep|dinner is now known as ysandeep16:24
fungijohnsom: i've got this running in a root screen session on the limestone mirror instance now:16:25
fungiwhile :;do sleep 10;ip -6 ro sh|ts '%Y-%m-%dT%H:%M:%S'|tee gateways.log;done16:25
fungiwe can check that against observed connection failures and see if16:25
fungiit had a stray invalid gateway in that timeframe16:25
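Checking the collected log against a failure window could look something like the sketch below. It assumes each line is a `ts` timestamp in the `%Y-%m-%dT%H:%M:%S` format from the loop above, followed by `ip -6 route show` output in which default routes read `default via <gateway> ...`. The function name `stray_gateways` and the `expected` set of known-good gateways are hypothetical.

```python
import re
from datetime import datetime

# Assumed line shape, for example:
# 2021-03-26T16:25:10 default via fe80::1 dev ens3 proto ra metric 100
DEFAULT_ROUTE = re.compile(r"^(\S+)\s+default\s+via\s+(\S+)")


def stray_gateways(lines, expected, start, end):
    """Return (timestamp, gateway) pairs for default routes seen between
    start and end whose gateway is not in the expected set."""
    found = []
    for line in lines:
        match = DEFAULT_ROUTE.match(line)
        if not match:
            continue
        try:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # line did not start with a timestamp in the assumed format
        gateway = match.group(2)
        if start <= stamp <= end and gateway not in expected:
            found.append((stamp, gateway))
    return found
```

One caveat: `tee` without `-a` truncates its output file each time it starts, so this only helps if the samples were accumulated (for example with `tee -a`) or captured some other way.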
johnsomOk, cool, thanks!16:26
openstackgerritGomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Test changes made using base-test
fungijohnsom: yeah, that bug is pesky. would be good if we could figure out how to reliably reproduce it so neutron and/or openstack-ansible folks can finally have some hope of working on a fix16:29
fungii say openstack-ansible because both the clouds where we've observed it are deployed with it, no idea if it's actually involved or mere coincidence16:29
fungibut hard to say without a better understanding of what causes it as to whether it's an issue in the deployment, in what neutron configures, or even in one of the lower-level components neutron's relying on at the operating system level16:30
johnsomYeah, I know there have been some conflict issues with neutron routes, etc. in the past. I don't know if there is a way to get changes in those tables logged somehow or not. Short of a script like yours16:32
openstackgerritGomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Test changes made using base-test
fungiwell, in this case it's more about filtering i think16:35
fungibasically what we think is happening is that a job node starts errantly leaking route announcements onto the network, and they're arriving at the mirror server even though it's in a separate tenant, so the mirror dutifully adds a new default route to the job node, which not only can't actually route traffic for it, but also completely disappears shortly thereafter, leaving a bogus default route installed on16:37
fungithe mirror until it expires16:37
*** ysandeep is now known as ysandeep|away16:37
fungipossible there's a race in port setup, for example, where during a brief window some packets which should be getting filtered are making it through16:37
*** iurygregory has quit IRC17:12
*** eolivare has quit IRC17:39
*** diablo_rojo_phon has joined #opendev18:33
openstackgerritGomathi Selvi Srinivasan proposed opendev/base-jobs master: Revert
