*** dmellado7452 is now known as dmellado745 | 00:15
opendevreview | Xavier Coulon proposed openstack/diskimage-builder master: Add SUSE Linux Enterprise 15 SP5 basic support https://review.opendev.org/c/openstack/diskimage-builder/+/908421 | 13:30
opendevreview | Merged opendev/system-config master: Upgrade to Keycloak 23.0 https://review.opendev.org/c/opendev/system-config/+/907141 | 15:09 |
opendevreview | Merged opendev/system-config master: Add inventory entry for new Keycloak server https://review.opendev.org/c/opendev/system-config/+/908350 | 15:09 |
opendevreview | Jeremy Stanley proposed opendev/base-jobs master: Temporarily stop uploading logs to OVH regions https://review.opendev.org/c/opendev/base-jobs/+/908505 | 15:24 |
opendevreview | Jeremy Stanley proposed opendev/base-jobs master: Revert "Temporarily stop uploading logs to OVH regions" https://review.opendev.org/c/opendev/base-jobs/+/908506 | 15:24 |
fungi | just being safe ^ in case we don't hear back from amorin soon enough | 15:25 |
fungi | i haven't seen any reply from his ovh e-mail address, so i'll try sending to his gmail address next | 15:25 |
frickler | would that affect only uploads or also access to existing logs? if the latter, we should merge that soonish | 15:27 |
fungi | frickler: also access to logs we've already uploaded there, potentially, so yes merging sooner would be wise | 15:27 |
fungi | my primary concern is not causing half our jobs to start breaking with POST_FAILURE, of course | 15:28 |
fungi | it really depends on what happens when the account goes unpaid past the grace period, i'm not sure if they just block authentication initially or turn off everything, or maybe even delete resources | 15:29 |
frickler | ack, so let's just be on the safe side, merge this now and maybe also disable the nodepool regions tomorrow | 15:31 |
fungi | yes, i figure if we see nodepool start failing to boot there then we can temporarily disable (in case it comes back later with no mirror server or something) | 15:32 |
clarkb | fungi: I feel like I've only ever communicated to amorin through the contribution address which iirc is gmail | 16:14 |
fungi | yeah, i'm trying that next. i tried the ovh address first because it's the first one listed in his launchpad profile | 16:29 |
fungi | and we didn't have any preferred contact info listed where we keep our provider account info | 16:29 |
clarkb | but gerrit does have a preferred email addr :) | 16:32 |
fungi | once again i forgot that we should not approve other system-config changes immediately before changes which add servers to the inventory | 16:35 |
fungi | since if the change that adds to the inventory merges sooner than deploy starts for the change ahead of it, you'll get a base deploy failure where the to-be-added server is considered unreachable | 16:36 |
fungi | 907141 and then 908350 merged at basically the same time, then 907141 ran in deploy and the base job wanted to try to contact keycloak02 even though it was added to inventory by 908350 which hadn't deployed yet | 16:37 |
clarkb | I think that may be a relatively new behavior as we're trying to make infra-prod deploy jobs run less serially | 16:40 |
clarkb | I think the idea may have been that base can always run which I guess isn't true in the case of inventory updates? | 16:40
fungi | um, okay this is new... | 16:42 |
fungi | Feb 8 16:32:06 keycloak02 docker-keycloak[12780]: Fatal glibc error: CPU does not support x86-64-v2 | 16:42 |
fungi | https://github.com/keycloak/keycloak/issues/17290 | 16:43 |
clarkb | the node we booted on doesn't have a CPU new enough for the glibc in the keycloak container (which is probably a rhel 9 based ubi image?) | 16:43 |
fungi | maybe we can't run latest keycloak in rackspace? | 16:43 |
clarkb | I would expect we'd see that in CI too given the shared node types between prod and ci | 16:44 |
clarkb | why doesn't all the centos 9 stream testing also have this problem? Could it be a very specific hypervisor at fault and if we boot 10 nodes 9 will be fine? | 16:45 |
clarkb | do we still collect cpu info in ci jobs? maybe we can cross check some of that info with the node you have booted | 16:45 |
clarkb | side note "Universal Base Image" really failing to live up to the name here | 16:46 |
clarkb | (my distro has figured out how to mix in newer cpu functionality when available, though I'm not sure if that happens only at package install time or if it is a runtime selection) | 16:47 |
clarkb | if it is a package install time thing and that is also how rhel/ubi do it then maybe it has to do with where the keycloak image was built. It would also explain why our centos 9 images seemingly work | 16:49 |
fungi | https://zuul.opendev.org/t/openstack/build/4efddda8b1964226ae6479fe203cdcbf ran in rax-iad (so different region) but i can't find where we report the cpu flags | 16:49 |
fungi | ansible reports x86_64 as its architecture, but no idea if it differentiates v2 at all | 16:50 |
clarkb | I know we recorded that stuff back in the day of figuring out live migration testing with nova. But maybe that was devstack specific logging | 16:50 |
fungi | yeah, i wouldn't be surprised if devstack grabs /proc/cpuinfo or something | 16:51 |
clarkb | fungi: maybe compare https://zuul.opendev.org/t/openstack/build/4efddda8b1964226ae6479fe203cdcbf/log/zuul-info/host-info.keycloak99.opendev.org.yaml#535 to the booted node? though the processor model is fuzzy with VMs as I think hypervisors can still mask flags | 16:52 |
fungi | the last held keycloak test node, which i haven't deleted yet, is in rax-ord: 23.253.54.55 | 16:52 |
clarkb | and possibly emulate features unsupported by the cpu | 16:52 |
clarkb | oh that would be a good comparison then | 16:52 |
clarkb | since you can inspect the cpu flags directly | 16:53 |
clarkb | sse4.2 supposedly being the one keycloak wants according to that github link | 16:53 |
fungi | sse4_2 is present on the held node | 16:53 |
fungi | the production node only reports sse sse2 sse4a | 16:54 |
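A minimal sketch of the flag comparison being done by hand above, for reference: it reads /proc/cpuinfo and reports any missing x86-64-v2 flags. The flag list here is an assumption drawn from the general x86-64 microarchitecture-level definition (SSE3 shows up as "pni" in cpuinfo), not from the keycloak issue itself.

    #!/usr/bin/env python3
    # Sketch: report whether this host's CPU advertises the flags that make
    # up the x86-64-v2 level (the baseline RHEL 9 / UBI 9 builds target).
    # Flag names use the /proc/cpuinfo spelling; "pni" is SSE3.

    X86_64_V2_FLAGS = {"cx16", "lahf_lm", "popcnt", "pni",
                       "ssse3", "sse4_1", "sse4_2"}

    def cpu_flags(path="/proc/cpuinfo"):
        with open(path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
        return set()

    missing = X86_64_V2_FLAGS - cpu_flags()
    if missing:
        print("missing x86-64-v2 flags:", " ".join(sorted(missing)))
    else:
        print("x86-64-v2 flags all present")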
clarkb | interesting so maybe this is a luck of the draw thing. Or maybe it is region specific? | 16:54 |
clarkb | which region are my held etherpad and bridge in? If dfw you can compare those too | 16:54
clarkb | pretty sure they got rax IPs but not sure what region they landed in | 16:54
fungi | lists01 is in rax-dfw just like keycloak02 and reports sse4_2 in its flags | 16:55 |
clarkb | my held nodes for etherpad are in iad. | 16:55 |
fungi | so maybe luck of the draw with the hypervisor host we landed on | 16:55 |
clarkb | ya that seems likely. | 16:55 |
fungi | i can boot a keycloak03 and see if we have better luck | 16:55 |
clarkb | I think you can boot keycloak02s and we'll just update dns to point at whichever wins | 16:56 |
clarkb | and then maybe we can find a way to provide feedback back to rax about this | 16:56 |
fungi | mostly worried about ansible cache on bridge | 16:56 |
clarkb | good point. You'll likely have to delete that if you redeploy a new 02 host | 16:56 |
clarkb | 03 works too :) | 16:56 |
clarkb | fungi: idea: we can go ahead and update our launch node script to check for that and error if not present | 17:02 |
clarkb | we already do similar ipv6 pings iirc | 17:02 |
clarkb | that might be a good step before booting new nodes so that it can be semi automated for you | 17:03 |
fungi | well, i have one almost booted, but sure i can do that too | 17:03 |
clarkb | it will simplify cleanup for you in the failure case. Less important here with no floating ips and no volumes though | 17:04 |
fungi | the launch script does actually seem to have a step for zeroing the inventory cache | 17:06 |
fungi | yeah, the keycloak03 i just booted has the same problem | 17:12 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Check launched server for x86-64-v2/sse4_2 support https://review.opendev.org/c/opendev/system-config/+/908512 | 17:20 |
fungi | clarkb: something like that ^ ? | 17:20 |
clarkb | fungi: ya though I think the ipv6 ping test implies it will raise if a nonzero rc is returned? | 17:31
fungi | oh maybe | 17:31 |
fungi | i'll test | 17:31 |
clarkb | fungi: ya it's the error_ok flag in the ssh() method | 17:32
clarkb | by default it raises I think. if you set the flag to true then you can check the rc code like you do | 17:32
clarkb | which might be a good thing so that you can use your more descriptive error message | 17:32 |
fungi | agreed | 17:33 |
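For context, a rough sketch of the kind of check 908512 adds, not its actual contents: it assumes the launch tooling's ssh() helper raises on a nonzero exit status by default and, with error_ok=True, returns the exit status along with the output (that return shape is an assumption here).

    def check_x86_64_v2(client):
        # grep exits nonzero if sse4_2 is absent from the cpu flags line
        ret, _out = client.ssh("grep -q '^flags.* sse4_2' /proc/cpuinfo",
                               error_ok=True)
        if ret != 0:
            raise Exception(
                "Server CPU does not advertise sse4_2 (x86-64-v2); images "
                "built against an x86-64-v2 glibc baseline will not run here")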
fungi | huh, i wonder if we want to boot a "performance" flavor instead of "standard" | 17:43 |
clarkb | yes I think so | 17:45 |
clarkb | is the tool defaulting to standard? I thought our default was performance | 17:46
fungi | what's odd is that i created lists01 with a standard rather than performance flavor (not sure why, maybe i forgot we had a default), but it supports sse4_2 in the same rackspace region | 17:55 |
clarkb | probably just comes down to luck of the draw in scheduling with heterogeneous hypervisors | 18:01
clarkb | flavors can impact that if scheduling picks subsets of hypervisors for flavors | 18:02 |
fungi | https://docs.paramiko.org/en/latest/api/client.html doesn't seem to match the behavior i'm observing from the SSHClient object we get back | 18:09 |
fungi | it doesn't mention an "ssh" method (which we use), and describes an exec_command method which doesn't seem to exist when i attempt to call it | 18:10 |
fungi | we're using the latest release from pypi | 18:11 |
fungi | AttributeError: 'SSHClient' object has no attribute 'exec_command' | 18:18 |
fungi | i don't get it | 18:18 |
clarkb | fungi: we wrap paramiko | 18:21 |
clarkb | in launch node the object you get is the wrapper. Inside the wrapper is the paramiko object | 18:21 |
fungi | ohhhh, i see it | 18:22 |
fungi | from .sshclient import SSHClient | 18:22 |
fungi | okay, so we have our own SSHClient class which doesn't act like paramiko's in any way | 18:22 |
fungi | that's where the other confusing behavior i was seeing stems from too. i wanted to not have stdout copied to the terminal for a command, but looks like we do it with no option to turn that off | 18:23 |
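To illustrate the confusion: raw paramiko exposes exec_command(), while the launch tooling hands back its own wrapper object with an ssh() method. The wrapper below is a hypothetical sketch of that shape (including the unconditional echo of stdout mentioned above), not the actual launch/sshclient.py code.

    import paramiko

    class SSHClient:
        """Thin wrapper owning a paramiko client; only ssh() is exposed."""

        def __init__(self, host, username="root"):
            self.client = paramiko.SSHClient()
            self.client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            self.client.connect(host, username=username)

        def ssh(self, command, error_ok=False):
            # paramiko's documented API lives one level down
            stdin, stdout, stderr = self.client.exec_command(command)
            output = stdout.read().decode()
            print(output, end="")  # stdout is always echoed to the terminal
            ret = stdout.channel.recv_exit_status()
            if ret != 0 and not error_ok:
                raise RuntimeError("command failed (%d): %s" % (ret, command))
            return ret, output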
opendevreview | Jeremy Stanley proposed opendev/system-config master: Check launched server for x86-64-v2/sse4_2 support https://review.opendev.org/c/opendev/system-config/+/908512 | 18:43 |
fungi | switching the flavor from "4GB Standard Instance" to "4 GB Performance" seems to succeed for that check | 18:46 |
fungi | heading out to an early dinner shortly, but i'll get inventory and dns changes up to swap the server out when i return | 18:47 |
fungi | okay, heading out for a bite, shouldn't be more than an hour | 19:17 |
clarkb | fungi: I left a question on the launch node change | 20:55 |
fungi | k | 20:58 |
fungi | clarkb: replied | 21:10 |
clarkb | fungi: note that reply was not to my comment but to the top level of the change. I suspect some sort of gertty and gerrit representation mismatch but those can be confusing | 21:12 |
fungi | gertty currently lacks the ability to reply to inline comments. it can comment on the same line if that line was changed in the diff, but your comment seemed to be attached to a line that hadn't been changed | 21:16 |
fungi | so i fabricated a reply as a review comment instead | 21:16 |
clarkb | huh, I feel like corvus does reply that way, maybe via unpushed edits? | 21:16
clarkb | I say that because I think corvus has been able to resolve my comments | 21:17
fungi | otherwise an inline comment would have ended up associated with the "left side" (original) source line rather than the "right side" | 21:17 |
JayF | fungi: note that reply was not to my comment but to the top level of the change. I suspect some sort of gertty and gerrit representation mismatch but those can be confusing | 21:17 |
fungi | oh, there's a gertty patch available for that i think. i should double-check whether i've got it checked out currently | 21:17 |
JayF | oh whoops | 21:17 |
JayF | must have done a middle-click paste while scrolling | 21:17 |
JayF | and I'm a highlight-reader! | 21:17 |
corvus | yeah, there's a wip patch; it's like 80% done. it's workable, but still has some missing features and sometimes crashes. | 21:18 |
clarkb | gotcha | 21:18 |
corvus | i use it daily and just avoid the sharp edges | 21:18 |
corvus | (one of which is that you only get one chance to resolve a comment then it disappears never to return until you publish... so.... i accidentally resolve comments sometimes ooops) | 21:19 |
fungi | hah | 21:19 |
fungi | huh, there's two different changes to add sqla v2 support to gertty | 21:20
opendevreview | Merged ttygroup/gertty master: make gertty work with sqlalchemy-2 https://review.opendev.org/c/ttygroup/gertty/+/880123 | 21:24 |
corvus | now there's only one. | 21:24 |
fungi | i replied to your question in that change too late, but i suppose it's not super important | 21:26 |
fungi | took me a minute to research which sqla version introduced the syntax 2.0 requires | 21:26 |
corvus | fungi: ah yeah that would be a good change | 21:28 |
corvus | i just checked matthew's change out locally and it worked, so i took him at his word. but i have 1.4. | 21:29
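For reference, the syntax difference in question: SQLAlchemy 1.4 is where the 2.0-style select()/Session.execute() pattern first appeared, which is why pinning the minimum to 1.4 lets one code path run on both 1.4 and 2.x. A self-contained illustration follows; the Change model here is hypothetical, not gertty's actual schema.

    from sqlalchemy import Column, Integer, String, create_engine, select
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Change(Base):  # hypothetical table, not gertty's schema
        __tablename__ = "change"
        id = Column(Integer, primary_key=True)
        status = Column(String)

    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        # legacy Query style (works on 1.x, discouraged under 2.x):
        #   rows = session.query(Change).filter_by(status="NEW").all()

        # 2.0-style, available since 1.4 and the documented style in 2.x:
        rows = session.execute(
            select(Change).where(Change.status == "NEW")
        ).scalars().all()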
fungi | my connectivity to gitea seems to be sluggish | 21:29 |
fungi | there it goes | 21:30 |
clarkb | the backends respond quickly so whatever it is appears to be affecting the load balancer | 21:31 |
corvus | ditto | 21:31 |
opendevreview | Jeremy Stanley proposed ttygroup/gertty master: Set SQLAlchemy minimum to 1.4 https://review.opendev.org/c/ttygroup/gertty/+/908527 | 21:32 |
clarkb | I'm really trying to finish a review of a change that I don't want to page out though so won't be able to look closer for a bit | 21:32 |
opendevreview | Merged ttygroup/gertty master: Set SQLAlchemy minimum to 1.4 https://review.opendev.org/c/ttygroup/gertty/+/908527 | 21:32 |
corvus | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1 suggests something is afoot | 21:33 |
corvus | gitea11 may merit a closer look | 21:34 |
corvus | oh now it's starting to look like we're getting some null metrics from the lb | 21:35 |
fungi | cacti graphs for gitea11 don't seem that anomalous at least | 21:35 |
corvus | is our current lb gitea-lb02? it's not answering ssh for me | 21:36 |
clarkb | yes that is the current server | 21:36 |
fungi | agreed, responding !H | 21:37 |
fungi | though, again, cacti graphs look relatively normal | 21:38
fungi | suggesting the issue may be upstream from the load balancer | 21:38 |
corvus | i think there may be null data at the end of the cacti graphs | 21:38 |
clarkb | maybe check a nova server show to see if the server is up | 21:38 |
clarkb | cacti polls infrequently enough you won't see it be sad for a bit | 21:38 |
fungi | my !H responses are coming from 64.124.44.243.IDIA-176341-ZYO.zip.zayo.com | 21:38 |
fungi | could be more vexxhost zayo problems and bgp just recently started sending more traffic that way | 21:39 |
fungi | though traceroutes to the individual gitea servers make it all the way through | 21:40 |
fungi | so might also be a core networking problem affecting one rack or something | 21:41 |
fungi | almost looks like the network the gitea servers are in is being announced but not the network the load balancer is in | 21:42 |
fungi | guilhermesp: mnaser: anything untoward happening in sjc? | 21:44 |
fungi | though i can reach our mirror server there as well as the gitea backends, just not the haproxy server | 21:45 |
clarkb | right that is why I wonder if it is a problem with the server not with the networking | 21:46 |
fungi | just odd that we'd be getting a "no route to host" response from the backbone provider reaching that address | 21:47 |
clarkb | nova does say the server is active and running | 21:48 |
fungi | console log show is also returning a gateway timeout from nginx for that server, but other servers return a console log | 21:48 |
fungi | if their cloud is bgp from neutron up, then the !H coming from their backbone peer makes sense if something has gone awry with part of the internal network | 21:49 |
fungi | as opposed to architectures with a separate igp | 21:50 |
corvus | i pinged gitea-lb02 from gitea11 and it worked briefly but is now unreachable | 21:51 |
fungi | i got a console log after a few tries. nothing out of place with the console | 21:53 |
clarkb | ok so not the server itself then | 21:53 |
corvus | i think unable to connect to the lb from the backends suggests an internal cloud networking problem | 21:53 |
fungi | that's what it's seeming like, yes, and not one that's knocked their entire region offline either | 21:54 |
corvus | depending on our optimism we could launch a new lb | 21:55 |
clarkb | we could also set up the backends in a dns round robin and cross our fingers for a bit | 21:55 |
clarkb | oh except they don't expose 443 nevermind | 21:55 |
clarkb | vexxhost status page doesn't show anything amiss yet | 21:56
corvus | looks like our ttl is 1h | 21:56 |
clarkb | it seems to affect both ipv4 and ipv6 to the host as well | 21:57 |
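Since the published TTL determines how quickly any emergency DNS change (such as the round-robin idea above) would take effect, here is a small sketch of confirming it with dnspython; the record name is simply the frontend name from context.

    import dns.resolver  # dnspython

    # Clients could keep resolving the dead load balancer address for up to
    # the record's TTL after any DNS change.
    answer = dns.resolver.resolve("opendev.org", "A")
    print("TTL:", answer.rrset.ttl)
    for rdata in answer:
        print("A", rdata.address)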
guilhermesp | fungi: we've got a compute with kernel panic, but we restored already -- still seeing issues? | 21:59
clarkb | guilhermesp: ya the server says it is active and running according to server show but it doesn't ping | 22:01 |
clarkb | and there is no network connectivity generally | 22:01 |
clarkb | guilhermesp: the host is gitea-lb02.opendev.org (that will resolve in dns for A and AAAA records if addresses help too) | 22:07 |
guilhermesp | are you able to ping now clarkb ? | 22:16 |
corvus | lgtm now | 22:16 |
corvus | i think there maybe ipv6 connectivity issues still, but ipv4 looks good | 22:17 |
clarkb | yup ipv4 seems to work but ipv6 is still unreachable | 22:17 |
fungi | thanks guilhermesp! | 22:22 |
clarkb | ++ thank you guilhermesp ! | 22:47 |
clarkb | ipv6 seems to be working now too | 22:50 |