*** dmellado7452 is now known as dmellado745 | 00:15
opendevreview | Xavier Coulon proposed openstack/diskimage-builder master: Add SUSE Linux Enterprise 15 SP5 basic support https://review.opendev.org/c/openstack/diskimage-builder/+/908421 | 13:30
opendevreview | Merged opendev/system-config master: Upgrade to Keycloak 23.0 https://review.opendev.org/c/opendev/system-config/+/907141 | 15:09 |
opendevreview | Merged opendev/system-config master: Add inventory entry for new Keycloak server https://review.opendev.org/c/opendev/system-config/+/908350 | 15:09 |
opendevreview | Jeremy Stanley proposed opendev/base-jobs master: Temporarily stop uploading logs to OVH regions https://review.opendev.org/c/opendev/base-jobs/+/908505 | 15:24 |
opendevreview | Jeremy Stanley proposed opendev/base-jobs master: Revert "Temporarily stop uploading logs to OVH regions" https://review.opendev.org/c/opendev/base-jobs/+/908506 | 15:24 |
fungi | just being safe ^ in case we don't hear back from amorin soon enough | 15:25 |
fungi | i haven't seen any reply from his ovh e-mail address, so i'll try sending to his gmail address next | 15:25 |
frickler | would that affect only uploads or also access to existing logs? if the latter, we should merge that soonish | 15:27 |
fungi | frickler: also access to logs we've already uploaded there, potentially, so yes merging sooner would be wise | 15:27 |
fungi | my primary concern is not causing half our jobs to start breaking with POST_FAILURE, of course | 15:28 |
fungi | it really depends on what happens when the account goes unpaid past the grace period, i'm not sure if they just block authentication initially or turn off everything, or maybe even delete resources | 15:29 |
frickler | ack, so let's just be on the safe side, merge this now and maybe also disable the nodepool regions tomorrow | 15:31 |
fungi | yes, i figure if we see nodepool start failing to boot there then we can temporarily disable (in case it comes back later with no mirror server or something) | 15:32 |
clarkb | fungi: I feel like I've only ever communicated to amorin through the contribution address which iirc is gmail | 16:14 |
fungi | yeah, i'm trying that next. i tried the ovh address first because it's the first one listed in his launchpad profile | 16:29 |
fungi | and we didn't have any preferred contact info listed where we keep our provider account info | 16:29 |
clarkb | but gerrit does have a preferred email addr :) | 16:32 |
fungi | once again i forgot that we should not approve other system-config changes immediately before changes which add servers to the inventory | 16:35 |
fungi | since if the change that adds to the inventory merges sooner than deploy starts for the change ahead of it, you'll get a base deploy failure where the to-be-added server is considered unreachable | 16:36 |
fungi | 907141 and then 908350 merged at basically the same time, then 907141 ran in deploy and the base job wanted to try to contact keycloak02 even though it was added to inventory by 908350 which hadn't deployed yet | 16:37 |
clarkb | I think that may be a relatively new behavior as we're trying to make infra-prod deploy jobs run less serially | 16:40 |
clarkb | I think the idea may have been that base can always run which I guess isn't true in the case of inventory updates? | 16:40
fungi | um, okay this is new... | 16:42 |
fungi | Feb 8 16:32:06 keycloak02 docker-keycloak[12780]: Fatal glibc error: CPU does not support x86-64-v2 | 16:42 |
fungi | https://github.com/keycloak/keycloak/issues/17290 | 16:43 |
clarkb | the node we booted on doesn't have a CPU new enough for the glibc in the keycloak container (which is probably a rhel 9 based ubi image?) | 16:43 |
fungi | maybe we can't run latest keycloak in rackspace? | 16:43 |
clarkb | I would expect we'd see that in CI too given the shared node types between prod and ci | 16:44 |
clarkb | why doesn't all the centos 9 stream testing also have this problem? Could it be a very specific hypervisor at fault and if we boot 10 nodes 9 will be fine? | 16:45 |
clarkb | do we still collect cpu info in ci jobs? maybe we can cross check some of that info with the node you have booted | 16:45 |
clarkb | side note "Universal Base Image" really failing to live up to the name here | 16:46 |
clarkb | (my distro has figured out how to mix in newer cpu functionality when available, though I'm not sure if that happens only at package install time or if it is a runtime selection) | 16:47 |
clarkb | if it is a package install time thing and that is also how rhel/ubi do it then maybe it has to do with where the keycloak image was built. It would also explain why our centos 9 images seemingly work | 16:49 |
fungi | https://zuul.opendev.org/t/openstack/build/4efddda8b1964226ae6479fe203cdcbf ran in rax-iad (so different region) but i can't find where we report the cpu flags | 16:49 |
fungi | ansible reports x86_64 as its architecture, but no idea if it differentiates v2 at all | 16:50 |
clarkb | I know we recorded that stuff back in the day of figuring out live migration testing with nova. But maybe that was devstack specific logging | 16:50 |
fungi | yeah, i wouldn't be surprised if devstack grabs /proc/cpuinfo or something | 16:51 |
clarkb | fungi: maybe compare https://zuul.opendev.org/t/openstack/build/4efddda8b1964226ae6479fe203cdcbf/log/zuul-info/host-info.keycloak99.opendev.org.yaml#535 to the booted node? though the processor model is fuzzy with VMs as I think hypervisors can still mask flags | 16:52 |
fungi | the last held keycloak test node, which i haven't deleted yet, is in rax-ord: 23.253.54.55 | 16:52 |
clarkb | and possibly emulate features unsupported by the cpu | 16:52 |
clarkb | oh that would be a good comparison then | 16:52 |
clarkb | since you can inspect the cpu flags directly | 16:53 |
clarkb | sse4.2 supposedly being the one keycloak wants according to that github link | 16:53 |
fungi | sse4_2 is present on the held node | 16:53 |
fungi | the production node only reports sse sse2 sse4a | 16:54 |
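A minimal sketch of the flag comparison being done by hand above, for reference: it reads /proc/cpuinfo and reports any missing x86-64-v2 flags. The flag list here is an assumption drawn from the general x86-64 microarchitecture-level definition (SSE3 shows up as "pni" in cpuinfo), not from the keycloak issue itself.

    #!/usr/bin/env python3
    # Sketch: report whether this host's CPU advertises the flags that make
    # up the x86-64-v2 level (the baseline RHEL 9 / UBI 9 builds target).
    # Flag names use the /proc/cpuinfo spelling; "pni" is SSE3.

    X86_64_V2_FLAGS = {"cx16", "lahf_lm", "popcnt", "pni",
                       "ssse3", "sse4_1", "sse4_2"}

    def cpu_flags(path="/proc/cpuinfo"):
        with open(path) as cpuinfo:
            for line in cpuinfo:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
        return set()

    missing = X86_64_V2_FLAGS - cpu_flags()
    if missing:
        print("missing x86-64-v2 flags:", " ".join(sorted(missing)))
    else:
        print("x86-64-v2 flags all present")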
clarkb | interesting so maybe this is a luck of the draw thing. Or maybe it is region specific? | 16:54 |
clarkb | which region are my held etherpad and bridge in? If dfw you can compare those too | 16:54
clarkb | pretty sure they got rax IPs but not sure what region they landed in | 16:54
fungi | lists01 is in rax-dfw just like keycloak02 and reports sse4_2 in its flags | 16:55 |
clarkb | my held nodes for etherpad are in iad. | 16:55 |
fungi | so maybe luck of the draw with the hypervisor host we landed on | 16:55 |
clarkb | ya that seems likely. | 16:55 |
fungi | i can boot a keycloak03 and see if we have better luck | 16:55 |
clarkb | I think you can boot keycloak02s and we'll just update dns to point at whichever wins | 16:56 |
clarkb | and then maybe we can find a way to provide feedback back to rax about this | 16:56 |
fungi | mostly worried about ansible cache on bridge | 16:56 |
clarkb | good point. You'll likely have to delete that if you redeploy a new 02 host | 16:56 |
clarkb | 03 works too :) | 16:56 |
clarkb | fungi: idea: we can go ahead and update our launch node script to check for that and error if not present | 17:02 |
clarkb | we already do similar ipv6 pings iirc | 17:02 |
clarkb | that might be a good step before booting new nodes so that it can be semi automated for you | 17:03 |
fungi | well, i have one almost booted, but sure i can do that too | 17:03 |
clarkb | it will simplify cleanup for you in the failure case. Less important here with no floating ips and no volumes though | 17:04 |
fungi | the launch script does actually seem to have a step for zeroing the inventory cache | 17:06 |
fungi | yeah, the keycloak03 i just booted has the same problem | 17:12 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Check launched server for x86-64-v2/sse4_2 support https://review.opendev.org/c/opendev/system-config/+/908512 | 17:20 |
fungi | clarkb: something like that ^ ? | 17:20 |
clarkb | fungi: ya though I think the ipv6 ping test implies it will raise if a nonzero rc is returned? | 17:31
fungi | oh maybe | 17:31 |
fungi | i'll test | 17:31 |
clarkb | fungi: ya it's the error_ok flag in the ssh() method | 17:32
clarkb | by default it raises I think. if you set the flag to true then you can check the rc code like you do | 17:32
clarkb | which might be a good thing so that you can use your more descriptive error message | 17:32 |
fungi | agreed | 17:33 |
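For context, a rough sketch of the kind of check 908512 adds, not its actual contents: it assumes the launch tooling's ssh() helper raises on a nonzero exit status by default and, with error_ok=True, returns the exit status along with the output (that return shape is an assumption here).

    def check_x86_64_v2(client):
        # grep exits nonzero if sse4_2 is absent from the cpu flags line
        ret, _out = client.ssh("grep -q '^flags.* sse4_2' /proc/cpuinfo",
                               error_ok=True)
        if ret != 0:
            raise Exception(
                "Server CPU does not advertise sse4_2 (x86-64-v2); images "
                "built against an x86-64-v2 glibc baseline will not run here")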
fungi | huh, i wonder if we want to boot a "performance" flavor instead of "standard" | 17:43 |
clarkb | yes I think so | 17:45 |
clarkb | is the tool defaulting to standard? I thought our default was performance | 17:46
fungi | what's odd is that i created lists01 with a standard rather than performance flavor (not sure why, maybe i forgot we had a default), but it supports sse4_2 in the same rackspace region | 17:55 |
clarkb | probably just comes down to luck of the draw in scheduling with heterogeneous hypervisors | 18:01
clarkb | flavors can impact that if scheduling picks subsets of hypervisors for flavors | 18:02 |
fungi | https://docs.paramiko.org/en/latest/api/client.html doesn't seem to match the behavior i'm observing from the SSHClient object we get back | 18:09 |
fungi | it doesn't mention an "ssh" method (which we use), and describes an exec_command method which doesn't seem to exist when i attempt to call it | 18:10 |
fungi | we're using the latest release from pypi | 18:11 |
fungi | AttributeError: 'SSHClient' object has no attribute 'exec_command' | 18:18 |
fungi | i don't get it | 18:18 |
clarkb | fungi: we wrap paramiko | 18:21 |
clarkb | in launch node the object you get is the wrapper. Inside the wrapper is the paramiko object | 18:21 |
fungi | ohhhh, i see it | 18:22 |
fungi | from .sshclient import SSHClient | 18:22 |
fungi | okay, so we have our own SSHClient class which doesn't act like paramiko's in any way | 18:22 |
fungi | that's where the other confusing behavior i was seeing stems from too. i wanted to not have stdout copied to the terminal for a command, but looks like we do it with no option to turn that off | 18:23 |
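To illustrate the confusion: raw paramiko exposes exec_command(), while the launch tooling hands back its own wrapper object with an ssh() method. The wrapper below is a hypothetical sketch of that shape (including the unconditional echo of stdout mentioned above), not the actual launch/sshclient.py code.

    import paramiko

    class SSHClient:
        """Thin wrapper owning a paramiko client; only ssh() is exposed."""

        def __init__(self, host, username="root"):
            self.client = paramiko.SSHClient()
            self.client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
            self.client.connect(host, username=username)

        def ssh(self, command, error_ok=False):
            # paramiko's documented API lives one level down
            stdin, stdout, stderr = self.client.exec_command(command)
            output = stdout.read().decode()
            print(output, end="")  # stdout is always echoed to the terminal
            ret = stdout.channel.recv_exit_status()
            if ret != 0 and not error_ok:
                raise RuntimeError("command failed (%d): %s" % (ret, command))
            return ret, output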
opendevreview | Jeremy Stanley proposed opendev/system-config master: Check launched server for x86-64-v2/sse4_2 support https://review.opendev.org/c/opendev/system-config/+/908512 | 18:43 |
fungi | switching the flavor from "4GB Standard Instance" to "4 GB Performance" seems to succeed for that check | 18:46 |
fungi | heading out to an early dinner shortly, but i'll get inventory and dns changes up to swap the server out when i return | 18:47 |
fungi | okay, heading out for a bite, shouldn't be more than an hour | 19:17 |
clarkb | fungi: I left a question on the launch node change | 20:55 |
fungi | k | 20:58 |
fungi | clarkb: replied | 21:10 |
clarkb | fungi: note that reply was not to my comment but to the top level of the change. I suspect some sort of gertty and gerrit representation mismatch but those can be confusing | 21:12 |
fungi | gertty currently lacks the ability to reply to inline comments. it can comment on the same line if that line was changed in the diff, but your comment seemed to be attached to a line that hadn't been changed | 21:16 |
fungi | so i fabricated a reply as a review comment instead | 21:16 |
clarkb | huh, I feel like corvus does reply that way, maybe via unpushed edits? | 21:16
clarkb | I say that because I think corvus has been able to resolve my comments | 21:17
fungi | otherwise an inline comment would have ended up associated with the "left side" (original) source line rather than the "right side" | 21:17 |
JayF | fungi: note that reply was not to my comment but to the top level of the change. I suspect some sort of gertty and gerrit representation mismatch but those can be confusing | 21:17 |
fungi | oh, there's a gertty patch available for that i think. i should double-check whether i've got it checked out currently | 21:17 |
JayF | oh whoops | 21:17 |
JayF | must have done a middle-click paste while scrolling | 21:17 |
JayF | and I'm a highlight-reader! | 21:17 |
corvus | yeah, there's a wip patch; it's like 80% done. it's workable, but still has some missing features and sometimes crashes. | 21:18 |
clarkb | gotcha | 21:18 |
corvus | i use it daily and just avoid the sharp edges | 21:18 |
corvus | (one of which is that you only get one chance to resolve a comment then it disappears never to return until you publish... so.... i accidentally resolve comments sometimes ooops) | 21:19 |
fungi | hah | 21:19 |
fungi | huh, there's two different changes to add sqla v2 support to gertty | 21:20
opendevreview | Merged ttygroup/gertty master: make gertty work with sqlalchemy-2 https://review.opendev.org/c/ttygroup/gertty/+/880123 | 21:24 |
corvus | now there's only one. | 21:24 |
fungi | i replied to your question in that change too late, but i suppose it's not super important | 21:26 |
fungi | took me a minute to research which sqla version introduced the syntax 2.0 requires | 21:26 |
corvus | fungi: ah yeah that would be a good change | 21:28 |
corvus | i just checked matthew's change out locally and it worked, so i took him at his word. but i have 1.4. | 21:29
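For reference, the syntax difference in question: SQLAlchemy 1.4 is where the 2.0-style select()/Session.execute() pattern first appeared, which is why pinning the minimum to 1.4 lets one code path run on both 1.4 and 2.x. A self-contained illustration follows; the Change model here is hypothetical, not gertty's actual schema.

    from sqlalchemy import Column, Integer, String, create_engine, select
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Change(Base):  # hypothetical table, not gertty's schema
        __tablename__ = "change"
        id = Column(Integer, primary_key=True)
        status = Column(String)

    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        # legacy Query style (works on 1.x, discouraged under 2.x):
        #   rows = session.query(Change).filter_by(status="NEW").all()

        # 2.0-style, available since 1.4 and the documented style in 2.x:
        rows = session.execute(
            select(Change).where(Change.status == "NEW")
        ).scalars().all()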
fungi | my connectivity to gitea seems to be sluggish | 21:29 |
fungi | there it goes | 21:30 |
clarkb | the backends respond quickly so whatever it is appears to be affecting the load balancer | 21:31 |
corvus | ditto | 21:31 |
opendevreview | Jeremy Stanley proposed ttygroup/gertty master: Set SQLAlchemy minimum to 1.4 https://review.opendev.org/c/ttygroup/gertty/+/908527 | 21:32 |
clarkb | I'm really trying to finish a review of a change that I don't want to page out though so won't be able to look closer for a bit | 21:32 |
opendevreview | Merged ttygroup/gertty master: Set SQLAlchemy minimum to 1.4 https://review.opendev.org/c/ttygroup/gertty/+/908527 | 21:32 |
corvus | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1 suggests something is afoot | 21:33 |
corvus | gitea11 may merit a closer look | 21:34 |
corvus | oh now it's starting to look like we're getting some null metrics from the lb | 21:35 |
fungi | cacti graphs for gitea11 don't seem that anomalous at least | 21:35 |
corvus | is our current lb gitea-lb02? it's not answering ssh for me | 21:36 |
clarkb | yes that is the current server | 21:36 |
fungi | agreed, responding !H | 21:37 |
fungi | though, again, cacti graphs look relatively normal | 21:38
fungi | suggesting the issue may be upstream from the load balancer | 21:38 |
corvus | i think there may be null data at the end of the cacti graphs | 21:38 |
clarkb | maybe check a nova server show to see if the server is up | 21:38 |
clarkb | cacti polls infrequently enough you won't see it be sad for a bit | 21:38 |
fungi | my !H responses are coming from 64.124.44.243.IDIA-176341-ZYO.zip.zayo.com | 21:38 |
fungi | could be more vexxhost zayo problems and bgp just recently started sending more traffic that way | 21:39 |
fungi | though traceroutes to the individual gitea servers make it all the way through | 21:40 |
fungi | so might also be a core networking problem affecting one rack or something | 21:41 |
fungi | almost looks like the network the gitea servers are in is being announced but not the network the load balancer is in | 21:42 |
fungi | guilhermesp: mnaser: anything untoward happening in sjc? | 21:44 |
fungi | though i can reach our mirror server there as well as the gitea backends, just not the haproxy server | 21:45 |
clarkb | right that is why I wonder if it is a problem with the server not with the networking | 21:46 |
fungi | just odd that we'd be getting a "no route to host" response from the backbone provider reaching that address | 21:47 |
clarkb | nova does say the server is active and running | 21:48 |
fungi | console log show is also returning a gateway timeout from nginx for that server, but other servers return a console log | 21:48 |
fungi | if their cloud is bgp from neutron up, then the !H coming from their backbone peer makes sense if something has gone awry with part of the internal network | 21:49 |
fungi | as opposed to architectures with a separate igp | 21:50 |
corvus | i pinged gitea-lb02 from gitea11 and it worked briefly but is now unreachable | 21:51 |
fungi | i got a console log after a few tries. nothing out of place with the console | 21:53 |
clarkb | ok so not the server itself then | 21:53 |
corvus | i think unable to connect to the lb from the backends suggests an internal cloud networking problem | 21:53 |
fungi | that's what it's seeming like, yes, and not one that's knocked their entire region offline either | 21:54 |
corvus | depending on our optimism we could launch a new lb | 21:55 |
clarkb | we could also set up the backends in a dns round robin and cross our fingers for a bit | 21:55 |
clarkb | oh except they don't expose 443 nevermind | 21:55 |
clarkb | vexxhost status page doesn't show anything amiss yet | 21:56
corvus | looks like our ttl is 1h | 21:56 |
clarkb | it seems to affect both ipv4 and ipv6 to the host as well | 21:57 |
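Since the published TTL determines how quickly any emergency DNS change (such as the round-robin idea above) would take effect, here is a small sketch of confirming it with dnspython; the record name is simply the frontend name from context.

    import dns.resolver  # dnspython

    # Clients could keep resolving the dead load balancer address for up to
    # the record's TTL after any DNS change.
    answer = dns.resolver.resolve("opendev.org", "A")
    print("TTL:", answer.rrset.ttl)
    for rdata in answer:
        print("A", rdata.address)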
guilhermesp | fungi: we've got a compute with kernel panic, but we restored already -- still seeing issues? | 21:59
clarkb | guilhermesp: ya the server says it is active and running according to server show but it doesn't ping | 22:01 |
clarkb | and there is no network connectivity generally | 22:01 |
clarkb | guilhermesp: the host is gitea-lb02.opendev.org (that will resolve in dns for A and AAAA records if addresses help too) | 22:07 |
guilhermesp | are you able to ping now clarkb ? | 22:16 |
corvus | lgtm now | 22:16 |
corvus | i think there maybe ipv6 connectivity issues still, but ipv4 looks good | 22:17 |
clarkb | yup ipv4 seems to work but ipv6 is still unreachable | 22:17 |
fungi | thanks guilhermesp! | 22:22 |
clarkb | ++ thank you guilhermesp ! | 22:47 |
clarkb | ipv6 seems to be working now too | 22:50 |