| -@gerrit:opendev.org- Zuul merged on behalf of Henrik Wahlqvist: [openstack/project-config] 975154: Add back volvocars repos in volvocars tenant https://review.opendev.org/c/openstack/project-config/+/975154 | 14:42 | |
| @fungicide:matrix.org | attempting to request apache server status on lists.o.o is timing out so i suspect all worker slots are in use yet again. i'm going to restart apache to clear out any that might be hung and haven't timed out yet | 15:28 |
| @fungicide:matrix.org | looks like gitea14 may be overloaded; my ssh may even time out trying to log in | 15:46 |
| @clarkb:matrix.org | fungi: do we know if any opendev stuff is affected by the setuptools release yet? Some naive greps on my side don't show any pkg_resources hits, so I think as long as the pbr updates we made late last year are happy then we may be ok | 15:49 |
| @clarkb:matrix.org | as for web crawler traffic load maybe we should land your docs.opendev.org update and see how that fares? | 15:49 |
| @fungicide:matrix.org | no clue, i'm still trying to catch up on morning moderation tasks and outages, haven't even gotten to the e-mail thread about setuptools on openstack-discuss yet | 15:50 |
| @clarkb:matrix.org | ack. I suspect if we do have any problems it is in our dependencies and not a direct issue | 15:50 |
| @fungicide:matrix.org | when my ssh into gitea14 finally connected, load averages reported on it were lower than on any other backends, fwiw | 15:50 |
| @clarkb:matrix.org | fungi: I think a behavior we see with those backends is one will get overloaded by a particularly aggressive client ip, then it gets removed from the pool and slowly returns to normal as it processes out the running requests. While that happens a new server in the pool gets hit since the old one was removed, and we do that in a cycle | 15:51 |
| @fungicide:matrix.org | i can't see what resource issues could have slowed down my ability to ssh into gitea14 though, cpu and memory look okay, maybe disk i/o is just super, super slow? | 15:54 |
| @clarkb:matrix.org | maybe? I think a lot of that git disk content manages to get cached, but if we're processing enough requests to load memory up with "real" data those disk caches get evicted and maybe we become very sensitive to disk io? | 15:55 |
| @clarkb:matrix.org | also possible that this isn't a crawler problem and instead some sort of VM hosting / VM problem? | 15:55 |
| @clarkb:matrix.org | fungi: for example I think live migrations might exhibit slow network behaviors | 15:56 |
| @fungicide:matrix.org | yeah, i'm wondering if there's something pathological with network to/from the server maybe | 15:56 |
| @clarkb:matrix.org | I think `openstack server event list` or similar will give you a history of migrations | 15:56 |
| @fungicide:matrix.org | oh, ipv6 ping is 100% packet loss for me to all the gitea servers, maybe something is happening with v6 routes to vexxhost, though that doesn't explain the problem with gitea14 | 15:57 |
| @clarkb:matrix.org | fungi: well, your ssh connection may have taken time to time out on ipv6 before falling back to ipv4? | 15:58 |
| @fungicide:matrix.org | 0% packet loss for ipv4 to gitea14 at least | 15:58 |
| @clarkb:matrix.org | that may explain what you saw connecting | 15:58 |
| @fungicide:matrix.org | i was able to ssh into the other 5 backends just fine though | 15:58 |
| @clarkb:matrix.org | ah | 15:59 |
| @fungicide:matrix.org | huh, though `ssh -4` does connect instantly | 15:59 |
| @fungicide:matrix.org | ssh over ipv6 is working for me to gitea09, just not icmp echo over ipv6 | 16:00 |
| @fungicide:matrix.org | so looks like ipv6 ping is broken to all the gitea servers, but ssh over v6 is working to all except gitea14. i guess that's the difference | 16:01 |
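The distinction fungi narrows down here (ICMP echo failing while TCP works, or one address family working but not the other) can be reproduced with a small connect test. A minimal sketch in Python, not what anyone in the log actually ran:

```python
import socket


def tcp_reachable(host: str, port: int = 22, timeout: float = 5.0,
                  family: int = socket.AF_UNSPEC) -> bool:
    """Try a TCP connect to host:port and report success.

    Pass family=socket.AF_INET for the equivalent of `ssh -4`, or
    socket.AF_INET6 to test v6 specifically. ICMP echo (ping) is a
    separate protocol, so this can succeed even when ping shows
    100% packet loss.
    """
    try:
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except OSError:
        # name resolution failed for this family
        return False
    for af, kind, proto, _, addr in infos:
        try:
            with socket.socket(af, kind, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                return True
        except OSError:
            continue
    return False
```

Running it twice, once with `AF_INET` and once with `AF_INET6`, separates "host is down" from "one address family's routing is broken".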
| @fungicide:matrix.org | and explains the long delay i see logging into that one specifically | 16:01 |
| @clarkb:matrix.org | that may also explain the failures that were seen. We can drop the ipv6 AAAA record for opendev.org if we think that is a good idea | 16:02 |
| @fungicide:matrix.org | but haproxy is only load balancing to v4 addresses of the backends, so this presumably isn't affecting actual users | 16:02 |
| @clarkb:matrix.org | fungi: right it would just be the ipv6 connectivity to the haproxy server | 16:02 |
| @fungicide:matrix.org | i'm able to ssh into gitea-lb03 over ipv6 just fine though | 16:04 |
| @fungicide:matrix.org | so i guess it's not that either | 16:04 |
| @clarkb:matrix.org | I'm trying to get some local updates done so that I can reboot, then I'll load keys and can get a second set of eyes on it to see if there is anything obviously wrong. Another thing to check: there is a way to ask haproxy for the active backends, which would identify what, if any, had been removed automatically | 16:16 |
| @clarkb:matrix.org | `echo "show stat" | sudo socat /var/lib/haproxy/run/stats stdio` says all six backends are up | 16:23 |
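The `show stat` output that socat command returns is CSV with a `# `-prefixed header line, so a per-backend status summary is a few lines of Python. This is a sketch: the proxy and server names in the test are invented, and real output has dozens more columns, which `csv.DictReader` simply carries along.

```python
import csv
import io


def backend_status(stat_csv: str) -> dict[str, str]:
    """Map 'pxname/svname' -> status from haproxy `show stat` CSV.

    The first line is a header prefixed with '# '; server rows carry
    a 'status' column (UP/DOWN/MAINT/...). FRONTEND/BACKEND summary
    rows are skipped so only individual servers are reported.
    """
    text = stat_csv.lstrip()
    if text.startswith("# "):
        text = text[2:]
    reader = csv.DictReader(io.StringIO(text))
    return {
        f"{row['pxname']}/{row['svname']}": row["status"]
        for row in reader
        if row.get("svname") not in (None, "FRONTEND", "BACKEND")
    }
```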
| @clarkb:matrix.org | system load on all six backends looks fine too | 16:23 |
| @clarkb:matrix.org | I don't think I'll bother with a socks proxy to check them all directly. I suspect that whatever issue there was is either no longer an issue or is related to the ipv6 connectivity you ran into | 16:24 |
| @fungicide:matrix.org | yeah, i never saw the gitea problem, stephenfin reported it in #openstack-infra irc | 16:25 |
| @fungicide:matrix.org | i've just been trying to track down a potential cause | 16:25 |
| @fungicide:matrix.org | (system-config-run-base) `TASK [pip3 : Install latest pip and virtualenv] An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ModuleNotFoundError: No module named 'pkg_resources'` https://zuul.opendev.org/t/openstack/build/4fa10b98b2b844e98786e78b65483bd1 | 16:45 |
| @fungicide:matrix.org | just noticed that failure | 16:45 |
| @clarkb:matrix.org | looks like we're failing to set up the ansible virtualenv there | 16:50 |
| @clarkb:matrix.org | because pip or virtualenv themselves depend on pkg_resources? | 16:50 |
| @clarkb:matrix.org | `item=virtualenv` | 16:50 |
| @clarkb:matrix.org | isn't it ironic that even virtualenv doesn't work | 16:51 |
| @clarkb:matrix.org | oh maybe it is the ansible module checking if the packages are installed via pkg_resources? | 16:51 |
| @clarkb:matrix.org | ok I think I get it. We install ansible to a virtualenv. Then later we run ansible out of that virtualenv and ansible within that virtualenv expects setuptools to have pkg_resources available | 16:55 |
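The failure mode Clark pieces together here comes from code importing `pkg_resources` unconditionally. A hedged sketch of the defensive pattern for the common version-lookup case, using the stdlib `importlib.metadata` replacement; this is illustrative, not what ansible's pip module does internally:

```python
try:
    # Removed from setuptools in newer releases; an unguarded import
    # like this is what raises the ModuleNotFoundError seen in the job.
    import pkg_resources

    def installed_version(dist_name: str) -> str:
        return pkg_resources.get_distribution(dist_name).version
except ModuleNotFoundError:
    # Stdlib replacement, available since Python 3.8.
    from importlib import metadata

    def installed_version(dist_name: str) -> str:
        return metadata.version(dist_name)
```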
| @fungicide:matrix.org | yeah, it looks like it's ansible or the pip module from ansible's stdlib doing it | 16:55 |
| @clarkb:matrix.org | I can push a change up that pins setuptools in the ansible venv install there | 16:56 |
| @clarkb:matrix.org | I half wonder if bridge updated its venv (can't remember if we do that or not) | 16:56 |
| @fungicide:matrix.org | it's likely this was already fixed in ansible for a while and we're just running an older version | 16:57 |
| @clarkb:matrix.org | possible. There is a comment in the install-ansible role indicating that we upgrade the venv once a day | 16:58 |
| @clarkb:matrix.org | so it's possible that we are already in a chicken-and-egg situation here and need to manually fix bridge | 16:59 |
| @clarkb:matrix.org | yes we have setuptools 82 installed in that venv on bridge | 17:01 |
| @clarkb:matrix.org | we should check the hourly jobs | 17:01 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 976149: Pin setuptools in our bridge ansible virtualenv https://review.opendev.org/c/opendev/system-config/+/976149 | 17:04 | |
| @clarkb:matrix.org | I added the setuptools pin/install using the same conditions that cause normal installation to update once a day, so that may automatically update. However, if the problem is in pip itself then we may need to do surgery on bridge to build a new virtualenv by hand (or maybe just move the old virtualenv aside and let it rebuild in the hourly jobs?) | 17:06 |
| @clarkb:matrix.org | infra-prod-service-bridge is failing but the other hourly jobs seem to be ok | 17:08 |
| @clarkb:matrix.org | I think that is because we're doing containers for so many services now that we aren't doing a lot of pip. So ya, I think maybe we land 976149 or something like it once that passes testing, then see if hourly updates can correct it. If not we can probably move the virtualenv aside and have it rebuilt from scratch on the next hourly run | 17:09 |
| @clarkb:matrix.org | that hit a merge failure | 17:10 |
| @clarkb:matrix.org | I think github may be having problems I see a zuul image build hit a 500 error trying to get skopeo. Maybe our ansible dev job which pulls ansible hit something similar. I'll check merger logs | 17:11 |
| @clarkb:matrix.org | zm05 handled refstate job 74143b6bdf5c47188ef07cad0ce6a9a9 which failed due to `stderr: 'fatal: unable to access 'https://github.com/pytest-dev/pytest-testinfra/': Operation too slow. Less than 1000 bytes/sec transferred the last 30 seconds'` | 17:16 |
| @clarkb:matrix.org | https://www.githubstatus.com/ and they are having a sad | 17:17 |
| @clarkb:matrix.org | If there was ever a signal from the universe that I should go do some of that paperwork that I've been putting off this may be it | 17:19 |
| @clarkb:matrix.org | the latest update says git operations are normal again. I'll give it a few more minutes then recheck that change | 17:28 |
| @fungicide:matrix.org | https://discuss.python.org/t/106079 is covering the setuptools v82 pkg_resources removal, and also mentions the github outage | 17:30 |
| @clarkb:matrix.org | I rechecked and jobs seem to be queued up now | 17:35 |
| @clarkb:matrix.org | re upgrading ansible, iirc the problem is that the next version of ansible wants a newer version of python than is on bridge, so we'd need to upgrade bridge or run ansible out of a container or something | 17:35 |
| @clarkb:matrix.org | I think as long as this pin works we'll get by reasonably well. And if a pkg_resources package shows up on pypi we can start installing that instead | 17:36 |
| @fungicide:matrix.org | agreed | 17:43 |
| @clarkb:matrix.org | fungi: that change is in the gate now so I think that did fix it in CI. I don't think it will merge before the next hourly run but that is ok; it should deploy itself, and if that fails then we can move the venv aside and see if the next hourly run fixes it? | 17:58 |
| @fungicide:matrix.org | yep, sounds right | 17:59 |
| -@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 976149: Pin setuptools in our bridge ansible virtualenv https://review.opendev.org/c/opendev/system-config/+/976149 | 18:14 | |
| @clarkb:matrix.org | that is running the bootstrap bridge job right now. I'll check pip list in that venv when it is done | 18:16 |
| @clarkb:matrix.org | doesn't look like that updated the venv. So we need to wait for service-bridge to run first I guess. But I suspect that is in the chicken-and-egg state | 18:18 |
| @fungicide:matrix.org | so 40 minutes, i guess | 18:19 |
| @fungicide:matrix.org | or do we run that on deploy? | 18:19 |
| @clarkb:matrix.org | we did not run that job as part of the deployment | 18:20 |
| @clarkb:matrix.org | hrm but the bootstrap bridge playbook seems to be what runs install-ansible | 18:21 |
| @clarkb:matrix.org | I need to find logs | 18:21 |
| @clarkb:matrix.org | oh! the reason is the condition that causes us to update the venv once a day | 18:22 |
| @clarkb:matrix.org | fungi: I think I can force it to run on the next hourly run by updating the requirements.txt file on bridge for that | 18:22 |
| @clarkb:matrix.org | then we can reenqueue the deployment buildset | 18:22 |
| @clarkb:matrix.org | /usr/ansible-venv/requirements.txt has a date in a comment in it. I'll append clarkb to the comment and that should cause the file to change on the next run, which will run because I'll reenqueue the buildset | 18:24 |
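The trick works because the venv-rebuild step only fires when the requirements file's contents actually change, so appending to a comment is enough. A rough sketch of that change-detection pattern (not the actual install-ansible role logic):

```python
from pathlib import Path


def write_if_changed(path: Path, content: str) -> bool:
    """Write content to path only if it differs; return True if written.

    Loosely mirrors the ansible pattern where a task only reports
    'changed' (and only triggers the rebuild handler) when the
    rendered requirements.txt differs from what is on disk, which is
    why editing a comment in the file forces an update.
    """
    if path.exists() and path.read_text() == content:
        return False
    path.write_text(content)
    return True
```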
| @fungicide:matrix.org | ah okay | 18:25 |
| @clarkb:matrix.org | It is reenqueued | 18:26 |
| @clarkb:matrix.org | setuptools 81 is installed now. Now we wait for the next hourly runs to confirm that things are happy again | 18:29 |
| @clarkb:matrix.org | The reason that deployment worked is that the ansible env for the zuul executor is what updated the python venv for ansible on bridge. it looks like Ansible 9 and newer fix that via packaging.requirements (though it isn't clear yet if packaging.requirements is installed by default) | 18:36 |
| @clarkb:matrix.org | all that to say I don't think the zuul upgrade Friday/Saturday will break us if the ansible install there includes packaging, which I'll check against the container images shortly | 18:37 |
| @clarkb:matrix.org | but I also realize that we could update bridge ansible to ansible 9 or 10 but not 11 with the python versions we have | 18:37 |
| @clarkb:matrix.org | so maybe that is a good next step for us as that will also fix this issue I think | 18:37 |
| @clarkb:matrix.org | packaging 26.0 is installed into the zuul ansible 11 venv and the zuul ansible 9 venv in the executor container. So ya I think we'll be ok on the zuul side even when setuptools updates there | 18:42 |
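For the requirement-parsing case, `packaging` offers a drop-in replacement for the old pkg_resources API. A small sketch, assuming the third-party `packaging` distribution is installed (per the log it is present in the zuul images):

```python
from packaging.requirements import Requirement

# Roughly equivalent to the old pkg_resources.Requirement.parse(...)
req = Requirement("setuptools<82")
name = req.name                         # the distribution name
inside = req.specifier.contains("81.0")   # a version inside the pin
outside = req.specifier.contains("82.0")  # a version outside the pin
```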
| @clarkb:matrix.org | Looking at https://zuul.opendev.org/t/openstack/buildset/b1e214783230427b924007ab3630868b only the service-bridge job failed (if we ignore image mirroring jobs) so I think once we get this sorted system-config should be happy | 18:51 |
| @clarkb:matrix.org | still possible a tool like git review or bindep or something is mad but I haven't seen evidence of that yet | 18:51 |
| @clarkb:matrix.org | the hourly service-bridge has succeeded so I think that is fixed for now | 19:05 |
| @fungicide:matrix.org | yeah, so far that's the only issue i've seen in our stuff | 19:05 |
| @fungicide:matrix.org | so well-prepared, i suppose | 19:06 |
| @fungicide:matrix.org | github seems to still be struggling, i just did a `git remote update` on a repository with an origin there, and it hung for several minutes then came back with a http/502 error | 19:10 |
| @fungicide:matrix.org | then ran again and it succeeded ~instantly, so seems to be hit-or-miss | 19:11 |
| @clarkb:matrix.org | exciting | 19:23 |
| @clarkb:matrix.org | I'd like to keep updating container images to trixie (like say gitea) but with the github errors it is probably best to wait a bit | 19:24 |
| @clarkb:matrix.org | for tomorrow's meeting agenda I'll add pkg_resources to the agenda. Drop gerrit upgrades. Any other edits to make? | 19:25 |
| @clarkb:matrix.org | I've updated the meeting agenda as noted above. I also removed the zuul-registry item as the fixes last week appear to be working (I just rechecked the lodgeit image test change and it was able to find the image build last week without an issue) | 23:09 |
| @jim:acmegating.com | huzzah! | 23:14 |
Generated by irclog2html.py 4.0.0 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!