frickler | clarkb: tonyb: I think it would be good if the start of the announcement provided a bit more context for people who have no prior knowledge of what this is all about. like which wheel do we build and why, and how are they getting consumed. I'll try to put some words together for that | 06:14 |
---|---|---|
frickler | I also don't like the term "wheel mirror", which implies we copy those from some other location. "wheel cache" like in the job names seems better, although also not perfect | 06:15 |
opendevreview | Jan Marchel proposed openstack/project-config master: Add netdata configuration repo to NebulOuS https://review.opendev.org/c/openstack/project-config/+/905293 | 09:02 |
fungi | infra-root: we got another ticket from rackspace a few minutes ago saying they're moving zuul's trove instance *again* due to hardware issues | 14:38 |
fungi | i guess we should be on the lookout for more disruption | 14:38 |
clarkb | frickler: ack. I don't see edits yet. I'm about to join the gerrit community meeting but can also try and make some edits after | 15:51 |
frickler | clarkb: yes, I was trying to find some wording but with no success so far. after reading #openstack-infra I also wasn't sure if this is meant to be sent soonish or wait to see whether the jobs can be made to work again | 16:11 |
clarkb | frickler: I think tonyb wanted to send it soon | 16:11 |
jrosser | just an observation but it feels like jobs are spending a while queued (~15min) which i've not seen before | 16:12 |
fungi | that can sometimes mean one of our node providers is struggling/failing to satisfy node requests in a timely fashion (or at all) | 16:13 |
jrosser | i see there is a big spike in node usage at this moment, but it's been that way all day even when the usage in grafana looked small | 16:14 |
fungi | https://grafana.opendev.org/d/6c807ed8fd/nodepool?orgId=1 shows overall launch time and error counts | 16:14 |
fungi | ovh is taking upwards of half an hour to boot stuff in either region | 16:15 |
fungi | https://public-cloud.status-ovhcloud.com/ doesn't reflect any incidents currently | 16:16 |
frickler | there has been a huge burst of node requests at 16:05. I think it might be due to a gate reset with 11 patches in the integrated pipeline | 16:17 |
fungi | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1 does indicate a backlog of node requests which haven't been satisfied yet, and the test nodes graph looks like it topped out at max for maybe 10 minutes | 16:18 |
fungi | maybe this is the start of the openstack feature freeze rush | 16:19 |
hashar | clarkb: and indeed I am still connected to this channel \o/ (it is Antoine) | 16:29 |
clarkb | o/ | 16:30 |
clarkb | hashar joined the gerrit community meeting (which was quiet otherwise) and we talked about how we build and test gerrit within opendev | 16:30 |
hashar | as a result I now have hundreds of yaml / ansible playbooks files to read | 16:31 |
clarkb | frickler: maybe something as simple as referring to the mirror as a cache/mirror initially, then we can just talk about it as a mirror elsewhere with the implication we are talking about the same thing? | 16:33 |
clarkb | frickler: and I added a sentence to give some quick background. | 16:34 |
clarkb | hashar: definitely reach out if you have questions. Also feel free to even push changes to system-config if you'd like to see it in action. You can even depend on changes in gerrit's gerrit service | 16:36 |
hashar | clarkb: is that cause your Zuul is smart enough to resolve a change from a url like https://gerrit-review.googlesource.com/ ? | 16:36 |
fungi | hashar: yes, we have some examples i can dig up | 16:37 |
clarkb | hashar: yes, we have our zuul configured with an account that can talk to gerrit-review.googlesource and fetch changes | 16:37 |
hashar | - name: 'googlesource' | 16:37 |
hashar | driver: 'gerrit' | 16:37 |
hashar | server: 'gerrit-review.googlesource.com' | 16:37 |
hashar | that is quite nice. And I am reading playbooks/zuul/bootstrap-test-review.yaml and playbooks/zuul/test-review.yaml | 16:39 |
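The fragment hashar pasted above is part of how Zuul's gerrit driver gets pointed at an external Gerrit. As a minimal, hedged sketch only — the user and sshkey values are placeholders and the extra key names simply follow the Zuul gerrit driver documentation, not OpenDev's actual settings — a full connection entry might look like this:

```yaml
# Sketch of a Zuul gerrit-driver connection to an external Gerrit.
# user/sshkey values are illustrative placeholders, not OpenDev's real config.
- name: 'googlesource'
  driver: 'gerrit'
  server: 'gerrit-review.googlesource.com'
  canonical_hostname: 'gerrit-review.googlesource.com'
  user: 'example-zuul-account'        # account able to query/fetch changes
  sshkey: '/var/lib/zuul/ssh/id_rsa'  # key for that account
```

With a connection like this defined, Zuul can resolve change URLs on that Gerrit, which is what the following messages about cross-repo dependencies rely on.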
fungi | hashar: https://review.opendev.org/c/opendev/system-config/+/896290 is an example of doing a cross-repo change dependency between our system-config repo and a proposed upstream change for gerrit's replication plugin | 16:49 |
hashar | oh very nice, thanks for the example link! | 16:51 |
fungi | and yes, since the relevant jobs are set up to consume gerrit and its plugins from source, we're able to have zuul set up speculative versions of those repositories to incorporate proposed changes for them in those jobs | 16:52 |
fungi | we do the same sort of thing with some other projects we rely on too, for example we have jobs that use ansible's devel branch, and so can test with cross-repo dependencies on upstream ansible github pull requests | 16:54 |
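A rough sketch of the mechanism fungi describes: a job lists the upstream repositories under required-projects so Zuul prepares (possibly speculative) checkouts of them, and a change can then reference an upstream change with a Depends-On footer. Job and project names below are illustrative assumptions, not the real opendev definitions:

```yaml
# Illustrative only; not the actual opendev job configuration.
- job:
    name: example-build-gerrit
    required-projects:
      # Zuul checks these repos out itself, so a commit-message footer such as
      #   Depends-On: https://gerrit-review.googlesource.com/c/plugins/replication/+/<change>
      # results in a speculative checkout of that change in the job workspace.
      - name: gerrit-review.googlesource.com/gerrit
      - name: gerrit-review.googlesource.com/plugins/replication
```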
*** | travisholton9 is now known as travisholton | 16:59 |
clarkb | gmail flags googlegroup messages as spam. Or would if I didn't have specific rules for the gerrit mailing list in gmail to keep it out of spam | 17:05 |
clarkb | an interesting side note to our mailing list struggles with google. Even google doesn't like their own implementation | 17:06 |
frickler | clarkb: thx for the edit, it does state what I had in mind quite well | 19:24 |
frickler | I'm seeing a weirdness that may be related to the rackspace zuul db issue. https://zuul.opendev.org/t/openstack/buildset/e43341130f534d339440eb8245a7a578 is showing the openstack-tox-py311 job as being in progress, while it has been reported as successful on the change | 19:49 |
fungi | frickler: yeah, that seems wrong indeed | 19:52 |
fungi | thankfully it looks like the one voting job which failed in that buildset wasn't the one that didn't get its completion recorded in the db | 19:53 |
fungi | so logs are still discoverable | 19:53 |
fungi | and things which reported after that time seem to be working | 19:54 |
frickler | ah, the logs for that job are also not available, I hadn't even tried looking at them. let me try to check zuul logs | 19:57 |
frickler | there's "ERROR zuul.SQLReporter: Unable to update build, will retry" without any reference to which build failed updating, but the timestamp of that is exactly 5 min after the build finished, so I assume it matches | 20:01 |
fungi | sounds almost definitely related to the trove issues in rackspace in that case | 20:06 |
frickler | also this 5s later: 2024-01-11 15:55:38,500 ERROR zuul.SQLReporter: sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'b52b0ab7c5d18bd64c3a20f5ac0c84ac3408116f.rackspaceclouddb.com' ([Errno 111] Connection refused)") | 20:06 |
clarkb | that may not be fixable other than by editing the db directly, depending on how aggressively zuul will retry | 20:07 |
frickler | this seems to have recurred until 15:58, so ~ 3 mins outage | 20:07 |
frickler | or maybe 8 mins if we include the retrying timeout before that | 20:08 |
frickler | I think rechecking the affected jobs if needed should be fine. it seems zuul did report the results that it had in memory, regardless of whether they were written to the db, so for most jobs this won't even be noticed | 20:09 |
clarkb | ya I think it's mostly only an issue if you need the logs and/or historical info when querying the db (how many failures type of thing) | 20:09 |
clarkb | impact is definitely low and I agree for most a recheck should be fine | 20:10 |
frickler | I only saw it because I checked the timeline for the above buildset, to see whether the failing job was the first one that ran | 20:10 |
clarkb | the grafana graphs for dib image status show the focal image as failing but it built about 11-12 hours ago and was uploaded successfully to both arm64 cloud regions | 20:13 |
clarkb | not sure what is going on there but we are working through the arm64 image build backlog successfully despite what the graph seems to show | 20:13 |
frickler | I think the grafana dib page has been weird for some time already | 20:15 |
tonyb | In testing https://review.opendev.org/c/opendev/system-config/+/905183 I discovered that the JVB nodes can't communicate with the prosody on meetpad because of firewall rules | 23:06 |
tonyb | I *think* it's because in the test setup we don't have a jvb ansible group so https://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/meetpad.yaml#L8 never triggers | 23:07 |
tonyb | I kinda verified this on the held node by inserting the needed rule into iptables and there was much success | 23:08 |
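The group var linked above (group_vars/meetpad.yaml) is what feeds the iptables role with extra allowed groups. As an assumed illustration only — the key names and the protocol/port values here are guesses for the sake of the example; the real entry lives in that file — it has roughly this shape:

```yaml
# Assumed shape of the var; protocol/port/group values are illustrative.
iptables_extra_allowed_groups:
  - protocol: 'tcp'
    port: 5222
    group: 'jvb'
```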
clarkb | tonyb: the jvb group should include all hosts matching jvb[0-9]*.opendev.org | 23:08 |
clarkb | which I think includes the test jvb server? | 23:09 |
tonyb | so to set up the testing groups correctly, can I add something like that to playbooks/zuul/templates/gate-groups.yaml.j2? | 23:09 |
clarkb | tonyb: well it should already be happening. Your test node is called jvb99.opendev.org which should match jvb[0-9]*.opendev.org | 23:11 |
tonyb | clarkb: I don't think we include the full groups.yaml in the testing config | 23:11 |
clarkb | playbooks/zuul/templates/gate-groups.yaml.j2 should only be there to add nodes that don't match our existing production groups to the appropriate groups | 23:11 |
clarkb | tonyb: we should | 23:12 |
tonyb | clarkb: https://paste.opendev.org/show/b4d1dz0GSAsOvftCijfA/ | 23:12 |
tonyb | I don't see it there | 23:13 |
tonyb | Oh you're correct | 23:14 |
clarkb | tonyb: https://opendev.org/opendev/system-config/src/branch/master/inventory/service/groups.yaml this file is the one that defines the groups. Your paste is showing us the group var file definitions for things in the groups that are overridden for testing | 23:14 |
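To illustrate what clarkb means by pattern-based groups: an entry of roughly this shape (the exact syntax is whatever inventory/service/groups.yaml and its inventory plugin actually use) would put jvb99.opendev.org into the jvb group automatically:

```yaml
# Rough sketch only; see inventory/service/groups.yaml for the real syntax.
groups:
  jvb:
    - jvb[0-9]*.opendev.org
  meetpad:
    - meetpad[0-9]*.opendev.org
```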
tonyb | https://paste.opendev.org/show/bgjIJWVwbzB0zP6Un7h4/ | 23:15 |
clarkb | and I think that points to the actual problem which is that we have a meetpad.yaml group vars file which overrides the one in system-config which sets the iptables rules? | 23:15 |
clarkb | for some reason I thought it didn't override unless we provided colliding var names, though, and that they are merged, so maybe that isn't it | 23:15 |
tonyb | the meetpad group var only contains passwords | 23:15 |
clarkb | that would imply it is a full override which I thought it shouldn't be but we should probably investigate that thread further | 23:16 |
clarkb | I have to do a school run now though | 23:16 |
tonyb | okay. I'll poke | 23:16 |
clarkb | gitea also, for example, only provides passwords since those test group vars are basically substituting for private vars, and then they should be mixed in with the public vars just like our normal public/private split | 23:17 |
tonyb | It looks like the group vars have been merged. I see the expected entries in iptables_extra_allowed_groups on meetpad99 | 23:25 |
Clark[m] | Maybe the base playbook with the IPtables role didn't run or it added the wrong IP? | 23:35 |
tonyb | the generated rules file doesn't have matching entries | 23:38 |
tonyb | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/iptables/templates/rules.v4.j2#L31-L37 | 23:38 |
tonyb | so I kinda want to run that role manually to see what's going on in that template | 23:38 |
tonyb | Okay got it. | 23:43 |
tonyb | hostvars["jvb99.opendev.org"].public_v4 != the actual public IP of jvb99 | 23:44 |
tonyb | so yeah it added the wrong IP | 23:45 |
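To make the failure mode concrete: the rules template linked above loops over the allowed groups and interpolates each member's recorded public_v4 into an ACCEPT rule, so a stale or wrong public_v4 for jvb99 yields a rule for the wrong address. A simplified sketch of that idea follows (not the actual template; chain name and variable handling are assumptions):

```jinja
{# Simplified sketch of the idea behind rules.v4.j2, not the real template. #}
{% for rule in iptables_extra_allowed_groups %}
{% for host in groups.get(rule.group, []) %}
{% if hostvars[host].public_v4 is defined %}
-A INPUT -p {{ rule.protocol }} -s {{ hostvars[host].public_v4 }} --dport {{ rule.port }} -j ACCEPT
{% endif %}
{% endfor %}
{% endfor %}
```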