Thursday, 2024-01-11

06:14 <frickler> clarkb: tonyb: I think it would be good if the start of the announcement provided a bit more context for people who have no prior knowledge of what this is all about. like which wheels do we build and why, and how are they getting consumed. I'll try to put some words together for that
06:15 <frickler> I also don't like the term "wheel mirror", which implies we copy those from some other location. "wheel cache" like in the job names seems better, although also not perfect
<opendevreview> Jan Marchel proposed openstack/project-config master: Add netdata configuration repo to NebulOuS
14:38 <fungi> infra-root: we got another ticket from rackspace a few minutes ago saying they're moving zuul's trove instance *again* due to hardware issues
14:38 <fungi> i guess we should be on the lookout for more disruption
15:51 <clarkb> frickler: ack. I don't see edits yet. I'm about to join the gerrit community meeting but can also try and make some edits after
16:11 <frickler> clarkb: yes, I was trying to find some wording but with no success so far. after reading #openstack-infra I also wasn't sure if this is meant to be sent soonish or wait to see whether the jobs can be made to work again
16:11 <clarkb> frickler: I think tonyb wanted to send it soon
16:12 <jrosser> just an observation but it feels like jobs are spending a while queued (~15min) which i've not seen before
16:13 <fungi> that can sometimes mean one of our node providers is struggling/failing to satisfy node requests in a timely fashion (or at all)
16:14 <jrosser> i see there is a big spike in node usage at this moment, but it's been that way all day even when the usage in grafana looked small
16:14 <fungi> shows overall launch time and error counts
16:15 <fungi> ovh is taking upwards of half an hour to boot stuff in either region
16:16 <fungi> doesn't reflect any incidents currently
16:17 <frickler> there has been a huge burst of node requests at 16:05. I think it might be due to a gate reset with 11 patches in the integrated pipeline
16:18 <fungi> does indicate a backlog of node requests which haven't been satisfied yet, and the test nodes graph looks like it topped out at max for maybe 10 minutes
16:19 <fungi> maybe this is the start of the openstack feature freeze rush
16:29 <hashar> clarkb: and indeed I am still connected to this channel \o/  (it is Antoine)
16:30 <clarkb> hashar joined the gerrit community meeting (which was quiet otherwise) and we talked about how we build and test gerrit within opendev
16:31 <hashar> as a result I now have hundreds of yaml / ansible playbooks files to read
16:33 <clarkb> frickler: maybe something as simple as referring to the mirror as a cache/mirror initially then we can just talk about it as a mirror elsewhere with the implication we are talking about the same thing?
16:34 <clarkb> frickler: and I added a sentence to give some quick background.
16:36 <clarkb> hashar: definitely reach out if you have questions. Also feel free to even push changes to system-config if you like to see it in action. You can even depend on changes in gerrit's gerrit service
16:36 <hashar> clarkb: is that because your Zuul is smart enough to resolve a change from an url like ?
16:37 <fungi> hashar: yes, we have some examples i can dig up
16:37 <clarkb> hashar: yes, we have our zuul configured with an account that can talk to gerrit-review.googlesource and fetch changes
16:37 <hashar>   - name: 'googlesource'
16:37 <hashar>     driver: 'gerrit'
16:37 <hashar>     server: ''
16:39 <hashar> that is quite nice.  And I am reading playbooks/zuul/bootstrap-test-review.yaml and playbooks/zuul/test-review.yaml
16:49 <fungi> hashar: is an example of doing a cross-repo change dependency between our system-config repo and a proposed upstream change for gerrit's replication plugin
16:51 <hashar> oh very nice, thanks for the example link!
16:52 <fungi> and yes, since the relevant jobs are set up to consume gerrit and its plugins from source, we're able to have zuul set up speculative versions of those repositories to incorporate proposed changes for them in those jobs
16:54 <fungi> we do the same sort of thing with some other projects we rely on too, for example we have jobs that use ansible's devel branch, and so can test with cross-repo dependencies on upstream ansible github pull requests
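[Editor's note: the cross-repo dependency mechanism fungi describes works by adding a Depends-On footer to the commit message of the dependent change. A hypothetical example (the change number here is invented for illustration) could look like:]

```
Test system-config against a proposed replication plugin change

Depends-On: https://gerrit-review.googlesource.com/c/plugins/replication/+/123456
```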
16:59 *** travisholton9 is now known as travisholton
17:05 <clarkb> gmail flags googlegroup messages as spam. Or would if I didn't have specific rules for the gerrit mailing list in gmail to keep it out of spam
17:06 <clarkb> an interesting side note to our mailing list struggles with google. Even google doesn't like their own implementation
19:24 <frickler> clarkb: thx for the edit, it does state what I had in mind quite well
19:49 <frickler> I'm seeing a weirdness that may be related to the rackspace zuul db issue. is showing the openstack-tox-py311 job as being in progress, while it has been reported as successful on the change
19:52 <fungi> frickler: yeah, that seems wrong indeed
19:53 <fungi> thankfully it looks like the one voting job which failed in that buildset wasn't the one that didn't get its completion recorded in the db
19:53 <fungi> so logs are still discoverable
19:54 <fungi> and things which reported after that time seem to be working
19:57 <frickler> ah, the logs for that job are also not available, I hadn't even tried looking at them. let me try to check zuul logs
20:01 <frickler> there's "ERROR zuul.SQLReporter: Unable to update build, will retry" without any reference to which build failed updating, but the timestamp of that is exactly 5 min after the build finished, so I assume that to match
20:06 <fungi> sounds almost definitely related to the trove issues in rackspace in that case
20:06 <frickler> also this 5s later: 2024-01-11 15:55:38,500 ERROR zuul.SQLReporter:   sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '' ([Errno 111] Connection refused)")
20:07 <clarkb> that may not be fixable other than to edit the db directly depending on how aggressively zuul will retry
20:07 <frickler> this seems to have recurred until 15:58, so ~ 3 mins outage
20:08 <frickler> or maybe 8 mins if we include the retrying timeout before that
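[Editor's note: the "will retry" behaviour discussed above can be pictured with a minimal sketch. This is not Zuul's actual SQL reporter code; the function names, the attempt/delay values, and the use of ConnectionError as a stand-in for pymysql's OperationalError are all assumptions made for illustration:]

```python
import time

def write_with_retry(write, attempts=5, delay=1.0):
    """Call `write` until it succeeds, pausing between failures.

    Mimics the "Unable to update build, will retry" pattern: transient
    DB outages are ridden out, persistent ones eventually re-raise.
    """
    for attempt in range(attempts):
        try:
            return write()
        except ConnectionError:  # stand-in for pymysql.err.OperationalError
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

# Simulate an outage that clears after two failed attempts,
# like the ~3 minute trove blip described above.
calls = {"n": 0}
def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("Can't connect to MySQL server")
    return "row updated"

result = write_with_retry(flaky_write, delay=0)
```

[If the outage outlasts the retries, the final exception propagates and the row is simply never written, which matches clarkb's point that such cases can only be repaired by editing the db directly.]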
20:09 <frickler> I think rechecking the affected jobs if needed should be fine. it seems zuul did report the results that it had in memory, regardless of whether they were written to the db, so for most jobs this won't even be noticed
20:09 <clarkb> ya I think it's mostly only an issue if you need the logs and/or historical info when querying the db (how many failures type of thing)
20:10 <clarkb> impact is definitely low and I agree for most a recheck should be fine
20:10 <frickler> I only saw it because I checked the timeline for the above buildset, to see whether the failing job was the first one that ran
20:13 <clarkb> the grafana graphs for dib image status show the focal image as failing but it built about 11-12 hours ago and was uploaded successfully to both arm64 cloud regions
20:13 <clarkb> not sure what is going on there but we are working through the arm64 image build backlog successfully despite what the graph seems to show
20:15 <frickler> I think the grafana dib page has been weird for some time already
23:06 <tonyb> In testing I discovered that the JVB nodes can't communicate with the prosody on meetpad because of firewall rules
23:07 <tonyb> I *think* it's because in the test setup we don't have a jvb ansible group so it never triggers
23:08 <tonyb> I kinda verified this on the held node by inserting the needed rule into iptables and there was much success
23:08 <clarkb> tonyb: the jvb group should include all hosts matching jvb[0-9]*.opendev.org
23:09 <clarkb> which I think includes the test jvb server?
23:09 <tonyb> so to setup the testing groups correctly can I add something like that to playbooks/zuul/templates/gate-groups.yaml.j2
23:11 <clarkb> tonyb: well it should already be happening. Your test node is called which should match jvb[0-9]*.opendev.org
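[Editor's note: the `jvb[0-9]*.opendev.org` pattern clarkb quotes is a shell-style glob. Python's fnmatch is a rough analogue of (not identical to) Ansible's inventory matching, and makes it easy to check which hostnames would land in the group:]

```python
from fnmatch import fnmatch

pattern = "jvb[0-9]*.opendev.org"

# "[0-9]" requires exactly one digit right after "jvb"; "*" then
# matches the remainder of the host portion of the name.
matches = fnmatch("jvb99.opendev.org", pattern)           # in the group
no_digit = fnmatch("jvb.opendev.org", pattern)            # no digit after "jvb"
wrong_prefix = fnmatch("meetpad99.opendev.org", pattern)  # different prefix
```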
23:11 <tonyb> clarkb: I don't think we include the full groups.yaml in the testing config
23:11 <clarkb> playbooks/zuul/templates/gate-groups.yaml.j2 should only be there to add nodes that don't match our existing production groups to the appropriate groups
23:12 <clarkb> tonyb: we should
23:13 <tonyb> I don't see it there
23:14 <tonyb> Oh you're correct
23:14 <clarkb> tonyb: this file is the one that defines the groups. Your paste is showing us the group var file definitions for things in the groups that are overridden for testing
23:15 <clarkb> and I think that points to the actual problem which is that we have a meetpad.yaml group vars file which overrides the one in system-config which sets the iptables rules?
23:15 <clarkb> for some reason I thought it didn't override unless we provided colliding var names though and they are merged so maybe that isn't it
23:15 <tonyb> the meetpad group var only contains passwords
23:16 <clarkb> that would imply it is a full override which I thought it shouldn't be but we should probably investigate that thread further
23:16 <clarkb> I have to do a school run now though
23:16 <tonyb> okay.  I'll poke
23:17 <clarkb> gitea also for example only provides password since those test group vars are basically substituting for private vars and then they should be mixed in with the public vars just like our normal public/private split
23:25 <tonyb> It looks like the group vars have been merged.  I see the expected entries in iptables_extra_allowed_groups on meetpad99
23:35 <Clark[m]> Maybe the base playbook with the IPtables role didn't run or it added the wrong IP?
23:38 <tonyb> the generated rules file doesn't have matching entries
23:38 <tonyb> so I kinda want to run that role manually to see what's going on in that template
23:43 <tonyb> Okay got it.
23:44 <tonyb> hostvars[""].public_v4 != the actual public IP of jvb99
23:45 <tonyb> so yeah it added the wrong IP
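[Editor's note: tonyb's diagnosis can be illustrated with a toy version of the template logic. The IPs and the render function below are made up for illustration, not the actual OpenDev iptables role: a rule rendered from the inventory's recorded `public_v4` admits the wrong address whenever that value diverges from the host's real IP.]

```python
# Hypothetical inventory data: the recorded public_v4 is stale/wrong.
hostvars = {"jvb99.opendev.org": {"public_v4": "203.0.113.10"}}
actual_public_ip = "198.51.100.7"  # what the host really uses

def render_allow_rule(host):
    # Roughly what a template iterating iptables_extra_allowed_groups
    # does: trust whatever IP the inventory recorded for the host.
    return f"-A INPUT -s {hostvars[host]['public_v4']} -j ACCEPT"

rule = render_allow_rule("jvb99.opendev.org")
# The rule only admits the inventory IP, so traffic from the host's
# real address is still dropped: the symptom seen on the held node.
```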
