frickler | clarkb: tonyb: I think it would be good if the start of the announcement provided a bit more context for people who have no prior knowledge of what this is all about. like which wheel do we build and why, and how are they getting consumed. I'll try to put some words together for that | 06:14 |
---|---|---|
frickler | I also don't like the term "wheel mirror", which implies we copy those from some other location. "wheel cache" like in the job names seems better, although also not perfect | 06:15 |
opendevreview | Jan Marchel proposed openstack/project-config master: Add netdata configuration repo to NebulOuS https://review.opendev.org/c/openstack/project-config/+/905293 | 09:02 |
fungi | infra-root: we got another ticket from rackspace a few minutes ago saying they're moving zuul's trove instance *again* due to hardware issues | 14:38 |
fungi | i guess we should be on the lookout for more disruption | 14:38 |
clarkb | frickler: ack. I don't see edits yet. I'm about to join the gerrit community meeting but can also try and make some edits after | 15:51 |
frickler | clarkb: yes, I was trying to find some wording but with no success so far. after reading #openstack-infra I also wasn't sure if this is meant to be sent soonish or wait to see whether the jobs can be made to work again | 16:11 |
clarkb | frickler: I think tonyb wanted to send it soon | 16:11 |
jrosser | just an observation but it feels like jobs are spending a while queued (~15min) which i've not seen before | 16:12 |
fungi | that can sometimes mean one of our node providers is struggling/failing to satisfy node requests in a timely fashion (or at all) | 16:13 |
jrosser | i see there is a big spike in node usage at this moment, but it's been that way all day even when the usage in grafana looked small | 16:14 |
fungi | https://grafana.opendev.org/d/6c807ed8fd/nodepool?orgId=1 shows overall launch time and error counts | 16:14 |
fungi | ovh is taking upwards of half an hour to boot stuff in either region | 16:15 |
fungi | https://public-cloud.status-ovhcloud.com/ doesn't reflect any incidents currently | 16:16 |
frickler | there has been a huge burst of node requests at 16:05. I think it might be due to a gate reset with 11 patches in the integrated pipeline | 16:17 |
fungi | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1 does indicate a backlog of node requests which haven't been satisfied yet, and the test nodes graph looks like it topped out at max for maybe 10 minutes | 16:18 |
fungi | maybe this is the start of the openstack feature freeze rush | 16:19 |
hashar | clarkb: and indeed I am still connected to this channel \o/ (it is Antoine) | 16:29 |
clarkb | o/ | 16:30 |
clarkb | hashar joined the gerrit community meeting (which was quiet otherwise) and we talked about how we build and test gerrit within opendev | 16:30 |
hashar | as a result I now have hundreds of yaml / ansible playbooks files to read | 16:31 |
clarkb | frickler: maybe something as simple as referring to the mirror as a cache/mirror initially, then we can just talk about it as a mirror elsewhere with the implication we are talking about the same thing? | 16:33 |
clarkb | frickler: and I added a sentence to give some quick background. | 16:34 |
clarkb | hashar: definitely reach out if you have questions. Also feel free to even push changes to system-config if you'd like to see it in action. You can even depend on changes in gerrit's gerrit service | 16:36 |
hashar | clarkb: is that cause your Zuul is smart enough to resolve a change from a url like https://gerrit-review.googlesource.com/ ? | 16:36 |
fungi | hashar: yes, we have some examples i can dig up | 16:37 |
clarkb | hashar: yes, we have our zuul configured with an account that can talk to gerrit-review.googlesource and fetch changes | 16:37 |
hashar | - name: 'googlesource' | 16:37 |
hashar | driver: 'gerrit' | 16:37 |
hashar | server: 'gerrit-review.googlesource.com' | 16:37 |
hashar | that is quite nice. And I am reading playbooks/zuul/bootstrap-test-review.yaml and playbooks/zuul/test-review.yaml | 16:39 |
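The fragment hashar pasted above is part of how Zuul's gerrit driver gets pointed at an external Gerrit. As a minimal, hedged sketch only — the user and sshkey values are placeholders and the extra key names simply follow the Zuul gerrit driver documentation, not OpenDev's actual settings — a full connection entry might look like this:

```yaml
# Sketch of a Zuul gerrit-driver connection to an external Gerrit.
# user/sshkey values are illustrative placeholders, not OpenDev's real config.
- name: 'googlesource'
  driver: 'gerrit'
  server: 'gerrit-review.googlesource.com'
  canonical_hostname: 'gerrit-review.googlesource.com'
  user: 'example-zuul-account'        # account able to query/fetch changes
  sshkey: '/var/lib/zuul/ssh/id_rsa'  # key for that account
```

With a connection like this defined, Zuul can resolve change URLs on that Gerrit, which is what the following messages about cross-repo dependencies rely on.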
fungi | hashar: https://review.opendev.org/c/opendev/system-config/+/896290 is an example of doing a cross-repo change dependency between our system-config repo and a proposed upstream change for gerrit's replication plugin | 16:49 |
hashar | oh very nice, thanks for the example link! | 16:51 |
fungi | and yes, since the relevant jobs are set up to consume gerrit and its plugins from source, we're able to have zuul set up speculative versions of those repositories to incorporate proposed changes for them in those jobs | 16:52 |
fungi | we do the same sort of thing with some other projects we rely on too, for example we have jobs that use ansible's devel branch, and so can test with cross-repo dependencies on upstream ansible github pull requests | 16:54 |
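A rough sketch of the mechanism fungi describes: a job lists the upstream repositories under required-projects so Zuul prepares (possibly speculative) checkouts of them, and a change can then reference an upstream change with a Depends-On footer. Job and project names below are illustrative assumptions, not the real opendev definitions:

```yaml
# Illustrative only; not the actual opendev job configuration.
- job:
    name: example-build-gerrit
    required-projects:
      # Zuul checks these repos out itself, so a commit-message footer such as
      #   Depends-On: https://gerrit-review.googlesource.com/c/plugins/replication/+/<change>
      # results in a speculative checkout of that change in the job workspace.
      - name: gerrit-review.googlesource.com/gerrit
      - name: gerrit-review.googlesource.com/plugins/replication
```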
*** | travisholton9 is now known as travisholton | 16:59 |
clarkb | gmail flags googlegroup messages as spam. Or would if I didn't have specific rules for the gerrit mailing list in gmail to keep it out of spam | 17:05 |
clarkb | an interesting side note to our mailing list struggles with google. Even google doesn't like their own implementation | 17:06 |
frickler | clarkb: thx for the edit, it does state what I had in mind quite well | 19:24 |
frickler | I'm seeing a weirdness that may be related to the rackspace zuul db issue. https://zuul.opendev.org/t/openstack/buildset/e43341130f534d339440eb8245a7a578 is showing the openstack-tox-py311 job as being in progress, while it has been reported as successful on the change | 19:49 |
fungi | frickler: yeah, that seems wrong indeed | 19:52 |
fungi | thankfully it looks like the one voting job which failed in that buildset wasn't the one that didn't get its completion recorded in the db | 19:53 |
fungi | so logs are still discoverable | 19:53 |
fungi | and things which reported after that time seem to be working | 19:54 |
frickler | ah, the logs for that job are also not available, I hadn't even tried looking at them. let me try to check zuul logs | 19:57 |
frickler | there's "ERROR zuul.SQLReporter: Unable to update build, will retry" without any reference to which build failed updating, but the timestamp of that is exactly 5 min after the build finished, so I assume it matches | 20:01 |
fungi | sounds almost definitely related to the trove issues in rackspace in that case | 20:06 |
frickler | also this 5s later: 2024-01-11 15:55:38,500 ERROR zuul.SQLReporter: sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'b52b0ab7c5d18bd64c3a20f5ac0c84ac3408116f.rackspaceclouddb.com' ([Errno 111] Connection refused)") | 20:06 |
clarkb | that may not be fixable other than by editing the db directly, depending on how aggressively zuul will retry | 20:07 |
frickler | this seems to have recurred until 15:58, so ~ 3 mins outage | 20:07 |
frickler | or maybe 8 mins if we include the retrying timeout before that | 20:08 |
frickler | I think rechecking the affected jobs if needed should be fine. it seems zuul did report the results that it had in memory, regardless of whether they were written to the db, so for most jobs this won't even be noticed | 20:09 |
clarkb | ya I think it's mostly only an issue if you need the logs and/or historical info when querying the db (how many failures type of thing) | 20:09 |
clarkb | impact is definitely low and I agree for most a recheck should be fine | 20:10 |
frickler | I only saw it because I checked the timeline for the above buildset, to see whether the failing job was the first one that ran | 20:10 |
clarkb | the grafana graphs for dib image status show the focal image as failing but it built about 11-12 hours ago and was uploaded successfully to both arm64 cloud regions | 20:13 |
clarkb | not sure what is going on there but we are working through the arm64 image build backlog successfully despite what the graph seems to show | 20:13 |
frickler | I think the grafana dib page has been weird for some time already | 20:15 |
tonyb | In testing https://review.opendev.org/c/opendev/system-config/+/905183 I discovered that the JVB nodes can't communicate with the prosody on meetpad because of firewall rules | 23:06 |
tonyb | I *think* it's because in the test setup we don't have a jvb ansible group so https://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/meetpad.yaml#L8 never triggers | 23:07 |
tonyb | I kinda verified this on the held node by inserting the needed rule into iptables and there was much success | 23:08 |
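The group var linked above (group_vars/meetpad.yaml) is what feeds the iptables role with extra allowed groups. As an assumed illustration only — the key names and the protocol/port values here are guesses for the sake of the example; the real entry lives in that file — it has roughly this shape:

```yaml
# Assumed shape of the var; protocol/port/group values are illustrative.
iptables_extra_allowed_groups:
  - protocol: 'tcp'
    port: 5222
    group: 'jvb'
```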
clarkb | tonyb: the jvb group should include all hosts matching jvb[0-9]*.opendev.org | 23:08 |
clarkb | which I think includes the test jvb server? | 23:09 |
tonyb | so to set up the testing groups correctly, can I add something like that to playbooks/zuul/templates/gate-groups.yaml.j2? | 23:09 |
clarkb | tonyb: well it should already be happening. Your test node is called jvb99.opendev.org which should match jvb[0-9]*.opendev.org | 23:11 |
tonyb | clarkb: I don't think we include the full groups.yaml in the testing config | 23:11 |
clarkb | playbooks/zuul/templates/gate-groups.yaml.j2 should only be there to add nodes that don't match our existing production groups to the appropriate groups | 23:11 |
clarkb | tonyb: we should | 23:12 |
tonyb | clarkb: https://paste.opendev.org/show/b4d1dz0GSAsOvftCijfA/ | 23:12 |
tonyb | I don't see it there | 23:13 |
tonyb | Oh you're correct | 23:14 |
clarkb | tonyb: https://opendev.org/opendev/system-config/src/branch/master/inventory/service/groups.yaml this file is the one that defines the groups. Your paste is showing us the group var file definitions for things in the groups that are overridden for testing | 23:14 |
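To illustrate what clarkb means by pattern-based groups: an entry of roughly this shape (the exact syntax is whatever inventory/service/groups.yaml and its inventory plugin actually use) would put jvb99.opendev.org into the jvb group automatically:

```yaml
# Rough sketch only; see inventory/service/groups.yaml for the real syntax.
groups:
  jvb:
    - jvb[0-9]*.opendev.org
  meetpad:
    - meetpad[0-9]*.opendev.org
```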
tonyb | https://paste.opendev.org/show/bgjIJWVwbzB0zP6Un7h4/ | 23:15 |
clarkb | and I think that points to the actual problem which is that we have a meetpad.yaml group vars file which overrides the one in system-config which sets the iptables rules? | 23:15 |
clarkb | for some reason I thought it didn't override unless we provided colliding var names, though, and that they are merged, so maybe that isn't it | 23:15 |
tonyb | the meetpad group var only contains passwords | 23:15 |
clarkb | that would imply it is a full override which I thought it shouldn't be but we should probably investigate that thread further | 23:16 |
clarkb | I have to do a school run now though | 23:16 |
tonyb | okay. I'll poke | 23:16 |
clarkb | gitea also, for example, only provides passwords since those test group vars are basically substituting for private vars, and then they should be mixed in with the public vars just like our normal public/private split | 23:17 |
tonyb | It looks like the group vars have been merged. I see the expected entries in iptables_extra_allowed_groups on meetpad99 | 23:25 |
Clark[m] | Maybe the base playbook with the IPtables role didn't run or it added the wrong IP? | 23:35 |
tonyb | the generated rules file doesn't have matching entries | 23:38 |
tonyb | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/iptables/templates/rules.v4.j2#L31-L37 | 23:38 |
tonyb | so I kinda want to run that role manually to see what's going on in that template | 23:38 |
tonyb | Okay got it. | 23:43 |
tonyb | hostvars["jvb99.opendev.org"].public_v4 != the actual public IP of jvb99 | 23:44 |
tonyb | so yeah it added the wrong IP | 23:45 |
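To make the failure mode concrete: the rules template linked above loops over the allowed groups and interpolates each member's recorded public_v4 into an ACCEPT rule, so a stale or wrong public_v4 for jvb99 yields a rule for the wrong address. A simplified sketch of that idea follows (not the actual template; chain name and variable handling are assumptions):

```jinja
{# Simplified sketch of the idea behind rules.v4.j2, not the real template. #}
{% for rule in iptables_extra_allowed_groups %}
{% for host in groups.get(rule.group, []) %}
{% if hostvars[host].public_v4 is defined %}
-A INPUT -p {{ rule.protocol }} -s {{ hostvars[host].public_v4 }} --dport {{ rule.port }} -j ACCEPT
{% endif %}
{% endfor %}
{% endfor %}
```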