opendevreview | Dmitriy Rabotyagov proposed zuul/zuul-jobs master: Add role for uploading Ansible collections to Galaxy https://review.opendev.org/c/zuul/zuul-jobs/+/899230 | 08:12 |
pcheli | Hello, I'm setting up ThirdParty CI with jenkins and gerrit-trigger plugin. | 08:45 |
pcheli | Generally, it works. However, results posting fails with Too many concurrent connections (96) - max. allowed: 96. | 08:45 |
pcheli | Can anybody help with this? | 08:45 |
zigo | I'm really not sure what to do to re-trigger the puppet-nova release job and get the release notes in order ... can someone help? | 08:48 |
zigo | https://docs.openstack.org/releasenotes/puppet-heat/2023.2.html <--- 404 as well ... | 08:57 |
frickler | zigo: did you check the release jobs as clarkb suggested earlier? | 09:24 |
frickler | pcheli: seems you need to limit the number of connections your setup uses, no idea how to help with that. also please don't ask the same question in multiple channels if possible | 09:25 |
zigo | frickler: I'm really not sure how to do this ... :/ | 09:30 |
zigo | Altogether, we have puppet-{heat,nova,octavia} with broken release notes. | 09:31 |
tkajinam | frickler clarkb zigo, hmm it's strange that the promote job succeeded without any error after https://review.opendev.org/c/openstack/puppet-nova/+/898384 was merged | 10:45 |
zigo | Ah, thanks for looking into it! :) | 10:46 |
zigo | I had the same thinking and didn't get it too... | 10:46 |
tkajinam | I subscribe to the release-job-failures list but I've not seen any failures about these puppet repos, either | 10:47 |
tkajinam | (I mean release-job-failures@lists.openstack.org) | 10:47 |
tkajinam | https://zuul.opendev.org/t/openstack/builds?job_name=publish-openstack-releasenotes-python3&project=openstack%2Fpuppet-nova&skip=0 | 10:48 |
tkajinam | it looks like we have to trigger the job to build release notes from the latest master content to pick up the index change made by 898384, but I don't clearly understand why that hasn't happened | 10:50 |
tkajinam | sorry, I have to be disconnected for a while, but I'll check the status later (or tomorrow) | 10:52 |
frickler | seems the above publish job ran at the same time as the job for the update of the 2023.2 branch, which did not have the 2023.2 reno update yet https://review.opendev.org/c/openstack/puppet-nova/+/898383 | 11:24 |
frickler | so that may have overwritten the content from the master patch. I'm not sure whether we can simply reenqueue the promote job; another - maybe safer - solution would be to commit any new update to the release notes, like just a typo or formatting fix, which should cause the whole site to be republished in the correct form | 11:26 |
fungi | yes, the problem with those release notes jobs for different branches sharing the same file tree is that changes for different branches can race one another and publish content out of sequence compared to the order in which they were built/merged | 12:03 |
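As a rough sketch of the "reenqueue the promote job" option frickler mentions above, assuming operator access to the Zuul API; the pipeline name and revision handling are assumptions that would need to be checked against the real Zuul configuration:

```sh
# Hypothetical re-enqueue of the branch-tip job; requires Zuul tenant-admin
# credentials (auth configuration omitted here). Pipeline name and revision
# handling should be verified against the actual Zuul config before use.
NEWREV=$(git ls-remote https://opendev.org/openstack/puppet-nova refs/heads/master | cut -f1)
zuul-client --zuul-url https://zuul.opendev.org enqueue-ref \
    --tenant openstack \
    --pipeline promote \
    --project openstack/puppet-nova \
    --ref refs/heads/master \
    --newrev "$NEWREV"
```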
*** d34dh0r5- is now known as d34dh0r53 | 12:20 | |
*** Guest4496 is now known as diablo_rojo | 13:09 | |
clarkb | pcheli: I would use netstat/ss/lsof to determine how many connections you've got to gerrit from the Jenkins host. If it is a high number (near 96) then you'll need to debug the Jenkins server. If it is much smaller and you are traversing NAT then you may need to identify other sources of connections. | 13:57 |
clarkb | pcheli: however, I suspect they will be from the Jenkins server because the 96 connections limit is per username iirc and not per IP. We have a separate slightly higher limit for IPs | 13:57 |
fungi | yes, also it's likely you have a bug with something not correctly closing ssh sessions | 13:58 |
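To make clarkb's suggestion concrete, a minimal sketch for counting established connections from the Jenkins host to Gerrit's SSH API (port 29418, as seen in the netstat output below); either tool works if installed:

```sh
# Count established TCP connections from this host to Gerrit's SSH API
# (port 29418); run these on the Jenkins server.
ss -tn state established '( dport = :29418 )' | tail -n +2 | wc -l
# netstat equivalent, if ss is not available:
netstat -tn | awk '$6 == "ESTABLISHED" && $5 ~ /:29418$/' | wc -l
```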
pcheli | clarkb: I've found only one connection. tcp6 0 0 xxxx:34254 199.204.45.33:29418 ESTABLISHED 9210/java | 13:58 |
pcheli | that's why I'm asking :) | 13:59 |
fungi | 96 open connections to gerrit's ssh api is unlikely to represent normal behavior | 13:59 |
fungi | pcheli: is it possible you have a firewall in front of your jenkins server that is uncleanly dropping "idle" ssh connections? if it doesn't cleanly terminate the connection by sending a tcp/rst or fin on behalf of the client, then the gerrit server will assume those old connections are still open | 14:00 |
fungi | we can manually close them, but they'll just pile back up again if the problem isn't addressed | 14:00 |
clarkb | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gerrit/templates/gerrit.config.j2#L56 this is where the limit comes from and it is configured by user account | 14:01 |
clarkb | (just to be sure the 96 limit wasn't our IP limit) | 14:01 |
fungi | yeah, the connections per ip address limit we set with conntrack in iptables is 100, and if you hit that you'll start getting icmp port-unreachable errors rather than error messages from the api itself | 14:04 |
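For context, a connlimit-style iptables rule matching fungi's description could look roughly like the following; the exact numbers and the real rule on the production server may differ:

```sh
# Sketch of a per-source-IP connection cap as described above: once a single
# address exceeds the limit, new SYNs are rejected with icmp-port-unreachable.
iptables -A INPUT -p tcp --syn --dport 29418 \
    -m connlimit --connlimit-above 100 --connlimit-mask 32 \
    -j REJECT --reject-with icmp-port-unreachable
```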
pcheli | I've found the same issue in the mailing list, resolved by Clark Boylan by killing stale connections. May I ask you to do the same? | 14:04 |
fungi | like i said, doing that may temporarily stop the errors, but unless you know what caused you to end up with so many unclosed connections (like a poorly-configured firewall, for example) then it will start happening again at some point | 14:05 |
pcheli | fungi: I've updated gerrit trigger plugin. Hopefully, it will resolve the issue. | 14:16 |
fungi | pcheli: if it has an ssh keepalive option, or dead peer detection feature, make sure those are turned on | 14:17 |
pcheli | Hm, nothing like this. | 14:19 |
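For tooling that goes through the regular OpenSSH client (the Jenkins gerrit-trigger plugin uses its own SSH library, so this may not apply there), keepalives along the lines fungi describes would look like this; "exampleuser" is a placeholder account:

```sh
# OpenSSH client keepalives: probe every 60s and give up after 3 missed
# replies, so half-open sessions get torn down instead of lingering on
# the server side. "gerrit version" is just a harmless test command.
ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
    -p 29418 exampleuser@review.opendev.org gerrit version
```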
fungi | looks like the only account with 96 established ssh sessions is a/33746 | 14:21 |
pcheli | yep, this is mine | 14:23 |
fungi | i've got a loop going telling gerrit to close all those now | 14:23 |
fungi | this will take a few minutes to complete | 14:24 |
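A hedged sketch of what such a cleanup loop could look like using Gerrit's admin SSH commands; "admin" and "stale-user" are placeholders, and the awk column used for the username is an assumption about show-connections' output layout:

```sh
# Hypothetical cleanup loop: list open SSH sessions, pick the ones owned by
# one user, and ask Gerrit to close each session ID. Requires a Gerrit
# administrator account.
GERRIT="ssh -p 29418 admin@review.opendev.org"
$GERRIT gerrit show-connections -w | awk '$4 == "stale-user" {print $1}' |
while read -r session; do
    $GERRIT gerrit close-connection "$session"
done
```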
fungi | #status log Manually closed 96 stale SSH connections to Gerrit for account 33746 | 14:25 |
opendevstatus | fungi: finished logging | 14:25 |
fungi | pcheli: there's just 1 established session for that account now | 14:25 |
pcheli | fungi: can you check again pls? | 14:27 |
pcheli | just to be sure that everything is fine | 14:27 |
fungi | pcheli: still only 1 session for that account at the moment | 14:27 |
pcheli | Great | 14:28 |
pcheli | #thanks fungi | 14:28 |
fungi | i'll check again later in the day and see if the count starts to climb | 14:28 |
opendevstatus | pcheli: Added your thanks to Thanks page (https://wiki.openstack.org/wiki/Thanks) | 14:28 |
opendevreview | Merged opendev/system-config master: Stop building python3.9 container images https://review.opendev.org/c/opendev/system-config/+/898480 | 14:52 |
clarkb | infra-root https://review.opendev.org/c/opendev/system-config/+/898989 is ready for review, and there is a link in the comments of that change to a held test node where you can see that the conversion appears to be working | 14:52 |
clarkb | fungi: and I've marked the secondary email lookup thing in gerrit as a non-issue since the tools only use primary emails | 14:53 |
clarkb | fungi: if you are back today https://review.opendev.org/c/opendev/system-config/+/898505 might be a good one to try and get in. I've intentionally been waiting until more people are around so will defer on others' availability | 15:15 |
fungi | yep, i'm around enough today, parents are headed home but i have a repair tech coming to try to fix my washing machine | 15:20 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Add OpenInfra EU mailing lists https://review.opendev.org/c/opendev/system-config/+/898846 | 15:33 |
clarkb | fungi: for the ansible 8 change do you want to review it? | 15:52 |
clarkb | fungi: there are notes about the testing done in comments there as well | 15:52 |
fungi | clarkb: yep, i just approved it | 15:53 |
fungi | hoping it will also fix infra-prod-run-cloud-launcher | 15:53 |
clarkb | cool so the thing to check is that the virtualenv updates properly (it should) | 15:53 |
clarkb | and then monitor jobs | 15:53 |
fungi | yep | 15:53 |
opendevreview | Clark Boylan proposed opendev/system-config master: Revert "Cap ruamel.yaml install for ARA" https://review.opendev.org/c/opendev/system-config/+/899283 | 16:05 |
clarkb | testing if that cap is no longer necessary after some updates were made to ruamel.yaml | 16:05 |
fungi | oh, did they roll some stuff back or fix regressions? | 16:09 |
clarkb | fungi: they replaced a sys.exit() call with an exception throw | 16:10 |
clarkb | apparently they were hard crashing things previously by exiting 1 from inside the library... | 16:11 |
fungi | ouch | 16:12 |
fungi | yeah, sys.exit() is really never appropriate in a library | 16:13 |
opendevreview | Merged opendev/system-config master: Update to Ansible 8 on bridge https://review.opendev.org/c/opendev/system-config/+/898505 | 16:25 |
clarkb | ansible==8.5.0 | 16:31 |
clarkb | I believe the upgrade of ansible in the venv worked | 16:31 |
fungi | that was fast! | 16:31 |
clarkb | fungi: the merge for the list creation will probably be the first thing that runs under ansible 8 just fyi | 16:32 |
clarkb | I can execute ansible-playbook --version successfully as well so the install seems to be good | 16:33 |
clarkb | https://zuul.opendev.org/t/openstack/build/d095cf5cd898428982a71742f30a7c74/log/bridge99.opendev.org/ansible/install-root-key.2023-10-25T16:17:50.log this log shows the ruamel thing is no longer fatal (the rest of the playbook runs rather than stopping) | 16:36 |
clarkb | and we get an ara report https://44e79568cedacd253db2-e38ecce2b4446ed6b5d96caa6af2a2c7.ssl.cf2.rackcdn.com/899283/1/check/system-config-run-base/d095cf5/bridge99.opendev.org/ara-report/ | 16:36 |
fungi | oh nice | 16:36 |
clarkb | so ara is still working. I guess that isn't a super critical piece of code? | 16:36 |
clarkb | (I think it is in the ara server path which we don't really use maybe) | 16:36 |
clarkb | so ya https://review.opendev.org/c/opendev/system-config/+/899283 should be safe to merge | 16:39 |
opendevreview | Merged opendev/system-config master: Add OpenInfra EU mailing lists https://review.opendev.org/c/opendev/system-config/+/898846 | 16:42 |
clarkb | fungi: the lists playbook is running now | 16:59 |
fungi | thanks! looks like it worked | 17:03 |
clarkb | ya I see the public list that was created | 17:04 |
clarkb | there are a number of gerrit 3.8 changes that affect theming plugins and general ui plugins. https://217.182.143.183/c/x/test-project/+/3?tab=change-view-tab-header-zuul-results-summary looks fine though | 17:47 |
clarkb | I'll do some grepping of the removed/renamed methods across the two plugins we run to see if there are any hits but I suspect all that is a non issue based on the held node's behavior | 17:48 |
clarkb | fungi: can you check my notes for 358975 in https://etherpad.opendev.org/p/gerrit-upgrade-3.8? I think this is something we don't really care about, but it's a big enough change that I want another set of eyeballs on it. I tried to summarize the behavior change as well as my interpretation of why this doesn't affect us | 18:40 |
clarkb | If we can cross that one off then the commentlinks change is the only one out of that list to take action on. I'll have to look at the other changes listed next (the non-breaking but still called-out changes) | 18:42 |
fungi | clarkb: yeah, i think it'll be fine. if anything, tooling we have that queries such things may be able to drop some error checks because now they'll get well-formed empty responses | 18:43 |
clarkb | thanks I've struck it out. Leaving just commentlinks so far as something we need to address pre upgrade | 18:46 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Upgrade to latest Mailman 3 releases https://review.opendev.org/c/opendev/system-config/+/899300 | 19:39 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Merge production and test node mailman configs https://review.opendev.org/c/opendev/system-config/+/899304 | 19:46 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Clean up old Mailman v2 roles and vars https://review.opendev.org/c/opendev/system-config/+/899305 | 19:52 |
fungi | infra-root: ^ more post-migration changes for mailman v3 | 19:53 |
fungi | not urgent, just trying to make sure they didn't fall off my plate while it's still fresh in my mind | 19:54 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Merge production and test node mailman configs https://review.opendev.org/c/opendev/system-config/+/899304 | 20:11 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Clean up old Mailman v2 roles and vars https://review.opendev.org/c/opendev/system-config/+/899305 | 20:16 |
clarkb | fungi: I'm not seeing any special upgrade steps between these versions of the mm3 components. Is that your read too? | 20:25 |
clarkb | basically we stop the containers, then start the containers, which will run db migrations as necessary, and that should be it? (those steps are automated too iirc) | 20:25 |
fungi | right | 20:25 |
fungi | just like last upgrade | 20:26 |
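A minimal sketch of the stop/start cycle described above, assuming a docker-compose deployment; the directory and any service names are assumptions based on the compose project name seen in the job logs, and in production this is driven by the Ansible playbooks rather than run by hand:

```sh
# Manual equivalent of the upgrade flow described above.
cd /etc/mailman-compose
docker-compose pull        # fetch the updated images
docker-compose down        # stop the running containers
docker-compose up -d       # start again; pending db migrations run on startup
docker-compose logs -f     # watch the startup/migration output
```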
opendevreview | Jeremy Stanley proposed opendev/system-config master: Merge production and test node mailman configs https://review.opendev.org/c/opendev/system-config/+/899304 | 20:47 |
clarkb | fungi: looks like the upgrade change is failing on the db check for the auth user table being present | 20:51 |
clarkb | I wonder if that table has a new name | 20:51 |
fungi | the db container log shows auth errors | 20:52 |
fungi | still digging | 20:52 |
fungi | hard to tell from the console log what the real timestamps are for when it started and stopped trying to check for that table | 20:57 |
fungi | these are suspicious: https://zuul.opendev.org/t/openstack/build/f24a998cc95340bd82fc69f3e637b0e2/log/lists99.opendev.org/docker/mailman-compose_database_1.txt#87-116 | 20:59 |
clarkb | fungi: https://zuul.opendev.org/t/openstack/build/f24a998cc95340bd82fc69f3e637b0e2/log/job-output.txt#17764 that is connecting as the mailman user | 21:04 |
clarkb | fungi: looking in ara it seems to be saying we never get any stdout which implies to me that the database table just doesn't exist | 21:07 |
clarkb | could be something isn't creating it because there is an error or the db table was renamed | 21:07 |
fungi | yeah, i'll probably have to hold a node and inspect the db, or add a mysqldump | 21:27 |
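On a held node, the inspection fungi mentions could look roughly like this; the container name comes from the collected docker logs above, while the database name, credentials, and table name are assumptions to verify against the deployment's settings:

```sh
# Check whether the expected table exists at all inside the db container.
docker exec -it mailman-compose_database_1 \
    mysql -u mailman -p mailman -e "SHOW TABLES LIKE 'auth_user';"
# List every table if the name guess turns out to be wrong:
docker exec -it mailman-compose_database_1 \
    mysql -u mailman -p mailman -e "SHOW TABLES;"
```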
TheJulia | o/ Regarding glean, is the testing just image builds, or do we try to boot the image with say, static network config via configuration drive? | 21:52 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Merge production and test node mailman configs https://review.opendev.org/c/opendev/system-config/+/899304 | 22:01 |
clarkb | TheJulia: the integration testing with nodepool and dib does a full build, boot, and ssh-into-the-node test | 22:02 |
fungi | dib-nodepool-functional-openstack-centos-9-stream-src et cetera | 22:02 |
clarkb | TheJulia: the unittests simply rely on that os detection library to mock out /etc/os-release stuff and then we check output results for the config files | 22:02 |
TheJulia | clarkb: but do those nodes operate with full static metadata, or are we just doing dhcp? I ask because at least on centos9, I've noticed the static config isn't necessarily being applied on instance boot, which has me raising my eyebrow | 22:08 |
clarkb | TheJulia: oh is the question whether or not dhcp is used or static config? I'm not sure. It could be default dhcp. We would need to look at the nodepool config for the provider | 22:11 |
clarkb | also I think openstack actually makes it difficult to not do dhcp. Which makes the fact that multiple public clouds fail at dhcp all the more surprising | 22:11 |
TheJulia | Okay, I ask because I have been working on an advanced ironic job without dhcp | 22:11 |
TheJulia | and expecting simple-init/glean to just work, and it thinks it does things, but doesn't seem to | 22:11 |
TheJulia | At least, with the instance image, which is still a bit curious. | 22:12 |
clarkb | fwiw glean does work without dhcp on our images because they all boot in rackspace | 22:12 |
TheJulia | Yeah, that is a good data point | 22:12 |
TheJulia | I know this worked in the past, but maybe something changed. Dunno. It is also weird it just works with the ramdisk I boot, but not again when I reboot | 22:13 |
TheJulia | I can see it doing what it expects, I might just have to reproduce it locally | 22:13 |
clarkb | with centos 9 you have to use network manager with glean but I thought that was automatic when using simple-init | 22:16 |
TheJulia | ... yeah, that is what I was thinking as well. | 22:17 |
TheJulia | I might be grazing upon some problematic case | 22:17 |
TheJulia | so in my stack of changes, I can see where I explicitly re-run glean to extract the configuration, and then trigger networkmanager to refresh and it does the needful, it is an instance image though that fails | 22:21 |
TheJulia | which is built very similarly | 22:21 |
TheJulia | hmmmmm | 22:21 |
TheJulia | I wonder if this is centos vs centos-minimal... | 22:23 |
TheJulia | err, that makes no sense | 22:23 |
* TheJulia will look deeper tomorrow | 22:23 | |
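A rough sketch of manually re-applying glean's config-drive network configuration on a CentOS 9 Stream node, along the lines TheJulia describes; the interface name, systemd unit name, and connection name are assumptions to verify against the installed glean version:

```sh
# Re-run glean for one interface and ask NetworkManager to pick up the result.
systemctl restart "glean@ens3.service"   # regenerate config from the config drive
nmcli connection reload                  # re-read keyfiles from disk
nmcli connection up ens3                 # activate the (re)written connection
```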
clarkb | diablo_rojo: tonyb: the ptgbot etherpad for tomorrow doesn't have any agenda. IIRC that was a session frickler was interested in but requested a meetpad location instead of zoom? | 22:59 |
clarkb | I was planning to be there but wanted to call that out to make sure everyone could attend | 23:00 |
tonyb | I think it's on Friday sometime? | 23:04 |
diablo_rojo_phone | Heh I guess i don't remember signing up for that time but okay lol. | 23:05 |
diablo_rojo_phone | Yes we can definitely do meetpad instead. | 23:05 |
diablo_rojo_phone | I am happy to meet there instead. | 23:05 |
tonyb | https://meetpad.opendev.org/oct2023-ptg-ptgbot registered for tomorrow | 23:08 |
diablo_rojo_phone | Perfect. | 23:09 |
diablo_rojo_phone | frickler: should we do an hour earlier so you don't miss tc stuff? | 23:11 |
diablo_rojo_phone | Assuming that works for you clarkb and you tonyb | 23:11 |
diablo_rojo_phone | fungi: too. | 23:12 |
clarkb | that is fine with me. But I'm not sure if frickler is attending tc things due to zoom? | 23:12 |
clarkb | I don't mind either way | 23:12 |
tonyb | I thought the TC agreed to use meetpad rather than zoom | 23:16 |
clarkb | if they did it isn't in the schedule. The previous tc sessions were on zoom not meetpad | 23:16 |
tonyb | But that's not what's in the bot | 23:16 |
tonyb | so I guess I imagined it | 23:16 |
tonyb | diablo_rojo_phone: an hour earlier would be good for me as I'd like to be in the "leaderless projects retro/discussion" | 23:17 |
fungi | an hour earlier will conflict with openstack qa rather than tc, not sure if frickler wanted to attend both | 23:20 |
clarkb | also apologies if I misremembered frickler's interest in that session. I swear that was one frickler said they would attend if it were held on meetpad though | 23:21 |
fungi | two hours earlier wouldn't conflict with either one, but might be early for folks in pdt | 23:21 |