Thursday, 2024-02-08

*** dmellado7452 is now known as dmellado7450  00:15
opendevreviewXavier Coulon proposed openstack/diskimage-builder master: Add SUSE Linux Enterprise 15 SP5 basic support
opendevreviewMerged opendev/system-config master: Upgrade to Keycloak 23.0
opendevreviewMerged opendev/system-config master: Add inventory entry for new Keycloak server
opendevreviewJeremy Stanley proposed opendev/base-jobs master: Temporarily stop uploading logs to OVH regions
opendevreviewJeremy Stanley proposed opendev/base-jobs master: Revert "Temporarily stop uploading logs to OVH regions"
fungijust being safe ^ in case we don't hear back from amorin soon enough15:25
fungii haven't seen any reply from his ovh e-mail address, so i'll try sending to his gmail address next15:25
fricklerwould that affect only uploads or also access to existing logs? if the latter, we should merge that soonish15:27
fungifrickler: also access to logs we've already uploaded there, potentially, so yes merging sooner would be wise15:27
fungimy primary concern is not causing half our jobs to start breaking with POST_FAILURE, of course15:28
fungiit really depends on what happens when the account goes unpaid past the grace period, i'm not sure if they just block authentication initially or turn off everything, or maybe even delete resources15:29
fricklerack, so let's just be on the safe side, merge this now and maybe also disable the nodepool regions tomorrow15:31
fungiyes, i figure if we see nodepool start failing to boot there then we can temporarily disable (in case it comes back later with no mirror server or something)15:32
clarkbfungi: I feel like I've only ever communicated to amorin through the contribution address which iirc is gmail16:14
fungiyeah, i'm trying that next. i tried the ovh address first because it's the first one listed in his launchpad profile16:29
fungiand we didn't have any preferred contact info listed where we keep our provider account info16:29
clarkbbut gerrit does have a preferred email addr :)16:32
fungionce again i forgot that we should not approve other system-config changes immediately before changes which add servers to the inventory16:35
fungisince if the change that adds to the inventory merges sooner than deploy starts for the change ahead of it, you'll get a base deploy failure where the to-be-added server is considered unreachable16:36
fungi907141 and then 908350 merged at basically the same time, then 907141 ran in deploy and the base job wanted to try to contact keycloak02 even though it was added to inventory by 908350 which hadn't deployed yet16:37
clarkbI think that may be a relatively new behavior as we're trying to make infra-prod deploy jobs run less serially16:40
clarkbI think the idea may have been that base can always run which I guess isn't true in the case of inventory updates?16:40
fungium, okay this is new...16:42
fungiFeb  8 16:32:06 keycloak02 docker-keycloak[12780]: Fatal glibc error: CPU does not support x86-64-v2  16:42
clarkbthe node we booted on doesn't have a CPU new enough for the glibc in the keycloak container (which is probably a rhel 9 based ubi image?)16:43
fungimaybe we can't run latest keycloak in rackspace?16:43
clarkbI would expect we'd see that in CI too given the shared node types between prod and ci16:44
clarkbwhy doesn't all the centos 9 stream testing also have this problem? Could it be a very specific hypervisor at fault and if we boot 10 nodes 9 will be fine?16:45
clarkbdo we still collect cpu info in ci jobs? maybe we can cross check some of that info with the node you have booted16:45
clarkbside note "Universal Base Image" really failing to live up to the name here16:46
clarkb(my distro has figured out how to mix in newer cpu functionality when available, though I'm not sure if that happens only at package install time or if it is a runtime selection)16:47
clarkbif it is a package install time thing and that is also how rhel/ubi do it then maybe it has to do with where the keycloak image was built. It would also explain why our centos 9 images seemingly work16:49
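clarkb's aside about runtime selection matches a mechanism glibc itself provides: since glibc 2.33 the dynamic loader probes microarchitecture-level subdirectories ("glibc-hwcaps") of each library search path, so a distro can ship baseline and v2/v3 builds of a library side by side and pick the best one at run time. Note this only helps for optional optimized variants; it cannot rescue an image whose baseline libraries were themselves compiled for v2, which is what the fatal error above indicates. A sketch of the layout (the library name is hypothetical):

```
/usr/lib64/libfoo.so.1                         # baseline x86-64 build
/usr/lib64/glibc-hwcaps/x86-64-v2/libfoo.so.1  # loaded when the CPU supports v2
/usr/lib64/glibc-hwcaps/x86-64-v3/libfoo.so.1  # loaded when the CPU supports v3
```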
fungi ran in rax-iad (so different region) but i can't find where we report the cpu flags16:49
fungiansible reports x86_64 as its architecture, but no idea if it differentiates v2 at all16:50
clarkbI know we recorded that stuff back in the day of figuring out live migration testing with nova. But maybe that was devstack specific logging16:50
fungiyeah, i wouldn't be surprised if devstack grabs /proc/cpuinfo or something16:51
clarkbfungi: maybe compare to the booted node? though the processor model is fuzzy with VMs as I think hypervisors can still mask flags16:52
fungithe last held keycloak test node, which i haven't deleted yet, is in rax-ord:
clarkband possibly emulate features unsupported by the cpu16:52
clarkboh that would be a good comparison then16:52
clarkbsince you can inspect the cpu flags directly16:53
clarkbsse4.2 supposedly being the one keycloak wants according to that github link16:53
fungisse4_2 is present on the held node16:53
fungithe production node only reports sse sse2 sse4a  16:54
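The flag comparison above can be captured in a small helper; a minimal sketch, where the flag list is the feature set defining the x86-64-v2 level, spelled as /proc/cpuinfo spells them ("pni" is the cpuinfo name for SSE3):

```python
# CPU flags required for the x86-64-v2 microarchitecture level,
# as they appear in a /proc/cpuinfo "flags" line.
X86_64_V2_FLAGS = {"cx16", "lahf_lm", "popcnt", "pni",
                   "sse4_1", "sse4_2", "ssse3"}

def supports_x86_64_v2(flags: str) -> bool:
    """Check a space-separated cpuinfo flags line against the v2 set."""
    return X86_64_V2_FLAGS <= set(flags.split())
```

The production node's `sse sse2 sse4a` line would fail this check, while the held node's full flag set would pass.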
clarkbinteresting so maybe this is a luck of the draw thing. Or maybe it is region specific?16:54
clarkbwhich region is my held etherpad and bridge in? If dfw you can compare those too16:54
clarkbpretty sure it got a rax IP but not sure what region they landed in16:54
fungilists01 is in rax-dfw just like keycloak02 and reports sse4_2 in its flags16:55
clarkbmy held nodes for etherpad are in iad.16:55
fungiso maybe luck of the draw with the hypervisor host we landed on16:55
clarkbya that seems likely.16:55
fungii can boot a keycloak03 and see if we have better luck16:55
clarkbI think you can boot keycloak02s and we'll just update dns to point at whichever wins16:56
clarkband then maybe we can find a way to provide feedback back to rax about this16:56
fungimostly worried about ansible cache on bridge16:56
clarkbgood point. You'll likely have to delete that if you redeploy a new 02 host16:56
clarkb03 works too :)16:56
clarkbfungi: idea: we can go ahead and update our launch node script to check for that and error if not present17:02
clarkbwe already do similar ipv6 pings iirc17:02
clarkbthat might be a good step before booting new nodes so that it can be semi automated for you17:03
fungiwell, i have one almost booted, but sure i can do that too17:03
clarkbit will simplify cleanup for you in the failure case. Less important here with no floating ips and no volumes though17:04
fungithe launch script does actually seem to have a step for zeroing the inventory cache17:06
fungiyeah, the keycloak03 i just booted has the same problem17:12
opendevreviewJeremy Stanley proposed opendev/system-config master: Check launched server for x86-64-v2/sse4_2 support
fungiclarkb: something like that ^ ?17:20
clarkbfungi: ya though I think the ipv6 ping test implies it will raise if a nonzero rc is returned?17:31
fungioh maybe17:31
fungii'll test17:31
clarkbfungi: ya its the error_ok flag in the ssh() method17:32
clarkbby default it raises I think. if you set the flag to true then you can check the rc code like you do17:32
clarkbwhich might be a good thing so that you can use your more descriptive error message17:32
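The error_ok behavior being discussed could be wired into the launch check roughly like this; a hedged sketch, assuming an ssh() helper that raises on a nonzero exit status unless error_ok=True is passed and then returns the status for the caller to inspect (the name, signature, and return shape are assumptions, not the real launch-node API):

```python
# Hypothetical launch-time check; ssh() stands in for the wrapper's
# method discussed above.
def check_cpu_baseline(ssh):
    # error_ok=True suppresses the raise-on-nonzero behavior so we can
    # inspect the status and emit a more descriptive error ourselves.
    rc = ssh("grep -q sse4_2 /proc/cpuinfo", error_ok=True)
    if rc != 0:
        raise RuntimeError(
            "server CPU lacks sse4_2 (x86-64-v2); container images built "
            "against a v2 glibc baseline will not start on this host")
```

Failing fast here keeps the bad node from ever entering the inventory, which simplifies cleanup in the failure case.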
fungihuh, i wonder if we want to boot a "performance" flavor instead of "standard"17:43
clarkbyes I think so17:45
clarkbis the tool defaulting to standard? I thought our default was performance17:46
fungiwhat's odd is that i created lists01 with a standard rather than performance flavor (not sure why, maybe i forgot we had a default), but it supports sse4_2 in the same rackspace region17:55
clarkbprobably just comes down to luck of the draw in scheduling with heterogenous hypervisors18:01
clarkbflavors can impact that if scheduling picks subsets of hypervisors for flavors18:02
fungi doesn't seem to match the behavior i'm observing from the SSHClient object we get back18:09
fungiit doesn't mention an "ssh" method (which we use), and describes an exec_command method which doesn't seem to exist when i attempt to call it18:10
fungiwe're using the latest release from pypi18:11
fungiAttributeError: 'SSHClient' object has no attribute 'exec_command'18:18
fungii don't get it18:18
clarkbfungi: we wrap paramiko18:21
clarkbin launch node the object you get is the wrapper. Inside the wrapper is the paramiko object18:21
fungiohhhh, i see it18:22
fungifrom .sshclient import SSHClient18:22
fungiokay, so we have our own SSHClient class which doesn't act like paramiko's in any way18:22
fungithat's where the other confusing behavior i was seeing stems from too. i wanted to not have stdout copied to the terminal for a command, but looks like we do it with no option to turn that off18:23
opendevreviewJeremy Stanley proposed opendev/system-config master: Check launched server for x86-64-v2/sse4_2 support
fungiswitching the flavor from "4GB Standard Instance" to "4 GB Performance" seems to succeed for that check18:46
fungiheading out to an early dinner shortly, but i'll get inventory and dns changes up to swap the server out when i return18:47
fungiokay, heading out for a bite, shouldn't be more than an hour19:17
clarkbfungi: I left a question on the launch node change20:55
fungiclarkb: replied21:10
clarkbfungi: note that reply was not to my comment but to the top level of the change. I suspect some sort of gertty and gerrit representation mismatch but those can be confusing21:12
fungigertty currently lacks the ability to reply to inline comments. it can comment on the same line if that line was changed in the diff, but your comment seemed to be attached to a line that hadn't been changed21:16
fungiso i fabricated a reply as a review comment instead21:16
clarkbhuh, I feel like corvus does reply that way, maybe via unpushed edits?21:16
clarkbI say that because I think corvus has been able to resolve my comments21:17
fungiotherwise an inline comment would have ended up associated with the "left side" (original) source line rather than the "right side"21:17
JayF fungi: note that reply was not to my comment but to the top level of the change. I suspect some sort of gertty and gerrit representation mismatch but those can be confusing21:17
fungioh, there's a gertty patch available for that i think. i should double-check whether i've got it checked out currently21:17
JayFoh whoops21:17
JayFmust have done a middle-click paste while scrolling21:17
JayFand I'm a highlight-reader!21:17
corvusyeah, there's a wip patch; it's like 80% done.  it's workable, but still has some missing features and sometimes crashes.21:18
corvusi use it daily and just avoid the sharp edges21:18
corvus(one of which is that you only get one chance to resolve a comment then it disappears never to return until you publish... so.... i accidentally resolve comments sometimes ooops)21:19
fungihuh, there's two different changes to add sqla v2 support to gertty21:20
opendevreviewMerged ttygroup/gertty master: make gertty work with sqlalchemy-2
corvusnow there's only one.21:24
fungii replied to your question in that change too late, but i suppose it's not super important21:26
fungitook me a minute to research which sqla version introduced the syntax 2.0 requires21:26
corvusfungi: ah yeah that would be a good change21:28
corvusi just checked matthew's change out locally and it worked and took him at his word.  but i have 1.4.21:29
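For reference, the floor being pinned here reflects that SQLAlchemy 1.4 is the first release to accept the 2.0-style query API, so it is the lowest version the updated code can run against. A packaging fragment expressing that minimum (illustrative only, not gertty's actual requirements file):

```
# requirements.txt fragment: oldest release supporting 2.0-style syntax
SQLAlchemy>=1.4
```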
fungimy connectivity to gitea seems to be sluggish21:29
fungithere it goes21:30
clarkbthe backends respond quickly so whatever it is appears to be affecting the load balancer21:31
opendevreviewJeremy Stanley proposed ttygroup/gertty master: Set SQLAlchemy minimum to 1.4
clarkbI'm really trying to finish a review of a change that I don't want to page out though so won't be able to look closer for a bit21:32
opendevreviewMerged ttygroup/gertty master: Set SQLAlchemy minimum to 1.4
corvus suggests something is afoot21:33
corvusgitea11 may merit a closer look21:34
corvusoh now it's starting to look like we're getting some null metrics from the lb21:35
fungicacti graphs for gitea11 don't seem that anomalous at least21:35
corvusis our current lb gitea-lb02? it's not answering ssh for me21:36
clarkbyes that is the current server21:36
fungiagreed, responding !H21:37
fungithough, again, cacti graphs looks relatively normal21:38
fungisuggesting the issue may be upstream from the load balancer21:38
corvusi think there may be null data at the end of the cacti graphs21:38
clarkbmaybe check a nova server show to see if the server is up21:38
clarkbcacti polls infrequently enough you won't see it be sad for a bit21:38
fungimy !H responses are coming from
fungicould be more vexxhost zayo problems and bgp just recently started sending more traffic that way21:39
fungithough traceroutes to the individual gitea servers make it all the way through21:40
fungiso might also be a core networking problem affecting one rack or something21:41
fungialmost looks like the network the gitea servers are in is being announced but not the network the load balancer is in21:42
fungiguilhermesp: mnaser: anything untoward happening in sjc?21:44
fungithough i can reach our mirror server there as well as the gitea backends, just not the haproxy server21:45
clarkbright that is why I wonder if it is a problem with the server not with the networking21:46
fungijust odd that we'd be getting a "no route to host" response from the backbone provider reaching that address21:47
clarkbnova does say the server is active and running21:48
fungiconsole log show is also returning a gateway timeout from nginx for that server, but other servers return a console log21:48
fungiif their cloud is bgp from neutron up, then the !H coming from their backbone peer makes sense if something has gone awry with part of the internal network21:49
fungias opposed to architectures with a separate igp21:50
corvusi pinged gitea-lb02 from gitea11 and it worked briefly but is now unreachable21:51
fungii got a console log after a few tries. nothing out of place with the console21:53
clarkbok so not the server itself then21:53
corvusi think unable to connect to the lb from the backends suggests an internal cloud networking problem21:53
fungithat's what it's seeming like, yes, and not one that's knocked their entire region offline either21:54
corvusdepending on our optimism we could launch a new lb21:55
clarkbwe could also set up the backends in a dns round robin and cross our fingers for a bit21:55
clarkboh except they don't expose 443 nevermind21:55
clarkbvexxhost status page doesn't show anythhing amiss yet21:56
corvuslooks like our ttl is 1h21:56
clarkbit seems to affect both ipv4 and ipv6 to the host as well21:57
guilhermespfungi: we've got a compute with kernel panic, but we restored already -- still seeing issues? 21:59
clarkbguilhermesp: ya the server says it is active and running according to server show but it doesn't ping22:01
clarkband there is no network connectivity generally22:01
clarkbguilhermesp: the host is (that will resolve in dns for A and AAAA records if addresses help too)22:07
guilhermespare you able to ping now clarkb ? 22:16
corvuslgtm now22:16
corvusi think there maybe ipv6 connectivity issues still, but ipv4 looks good22:17
clarkbyup ipv4 seems to work but ipv6 is still unreachable22:17
fungithanks guilhermesp!22:22
clarkb++ thank you guilhermesp !22:47
clarkbipv6 seems to be working now too22:50
