*** dmellado_ is now known as dmellado | 00:36 | |
clarkb | I'm still able to ssh into servers like codesearch and nl01 and eavesdrop01 | 00:56 |
clarkb | they all have a single port 22 entry now | 00:56 |
clarkb | (I think it was infra-prod-base that applied the updates) | 00:56 |
clarkb | I don't know that I'll still be functional when this deploy finishes in order to clean up old ansible stuff on bridge. Might have to just be very very careful tomorrow morning and clean up stuff that is a day old? | 00:58 |
clarkb | also the matrix oftc bridge seems to have died | 00:58 |
fungi | yep, the servers lgtm | 02:06 |
fungi | matrix bridge less so :/ | 02:07 |
Unit193 | fungi: Thanks for fixing things, btw! | 05:14 |
*** corvus is now known as Guest4101 | 09:45 | |
*** dviroel|ruck|out is now known as dviroel|ruck | 11:13 | |
*** diablo_rojo is now known as Guest4115 | 11:36 | |
fungi | Unit193: it's not fixed yet, is it? we're still working on debugging the problem afaik | 12:06 |
frickler | if mordred happens to show up again later, would be great if someone can point him to today's backlog in #openstack-sdks where he might be able to help (regression in sdk's caching caused by major release of decorator lib) | 12:38 |
frickler | of course anyone else who might be able to help is also welcome, the issue seems to be out of reach for my mediocre python skills | 12:39 |
opendevreview | Merged opendev/system-config master: Upgrade etherpad to 1.8.14 https://review.opendev.org/c/opendev/system-config/+/804136 | 14:31 |
Clark[m] | fungi: I'm making tea now but will be at a proper keyboard soon | 14:35 |
fungi | no worries, i think we've still got a while before the deployment happens | 14:35 |
fungi | etherpad just restarted | 14:39 |
fungi | i'm able to load an existing pad just fine | 14:40 |
clarkb | oh that was quicker than I expected but I'm at a keyboard now | 14:46 |
clarkb | let me load ssh keys and all that | 14:46 |
fungi | there's no rush, i doubt anyone's using meetpad right this moment | 14:47 |
fungi | and the infra-prod-service-etherpad job is still running | 14:47 |
clarkb | It loads in meetpad, but I'm not sure this is using the newer version as the colors don't touch between lines | 14:48 |
fungi | we haven't cleaned up the ansible processes on bridge yet, eh? load average there is ~12 | 14:48 |
clarkb | ya ansible was still going last night when I needed to call it a day | 14:49 |
fungi | i guess we can disable-ansible after this and clean up | 14:49 |
clarkb | fungi: ya the image we are running is not updated on prod | 14:50 |
fungi | wonder why it restarted in that case... checking the deploy log on bridge | 14:50 |
clarkb | I wonder if that is a race with the dockerhub indexes' eventual consistency | 14:51 |
fungi | it's at the docker-compose pull phase just now | 14:51 |
clarkb | oh weird are we going to restart twice? | 14:52 |
fungi | i wonder | 14:52 |
clarkb | fungi: looks like the pull started almost 15 minutes ago | 14:52 |
clarkb | related to the high system load maybe? | 14:52 |
fungi | that's what i suspect, yeah | 14:52 |
clarkb | I think the pull may have completed and then the restart just hasn't been properly logged yet? Because the timing lines up for the restart I think | 14:54 |
clarkb | I half suspect that we restarted after the pull because the mariadb image updated but the pull of the etherpad image didn't update due to docker hub races | 14:54 |
clarkb | if that is the case we should be able to safely pull and up -d manually once ansible is out of the way | 14:54 |
clarkb | fungi: it seems like ansible is making no progress at all | 14:59 |
clarkb | I'm going to start looking at process listings on bridge | 14:59 |
clarkb | fungi: `ps -elf | grep ansible | grep Aug12 | grep remote_puppet_else` maybe we start by killing those processes? | 15:00 |
clarkb | I'm going to start there. There are no remote puppet else jobs running and those are all from yesterday | 15:02 |
fungi | yeah, that should be plenty safe | 15:02 |
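A rough sketch of the cleanup being discussed; the grep pattern is the one quoted above, and the pkill step is illustrative (only sensible after confirming nothing current matches):

```bash
# Leftover ansible-playbook processes from Aug 12 still tied to the
# remote_puppet_else playbook (the pattern quoted above).
ps -elf | grep ansible | grep Aug12 | grep remote_puppet_else

# After confirming no remote_puppet_else job is actually running in zuul,
# one way to clear them is by pattern; double-check the ps output first.
pkill -f remote_puppet_else
```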
clarkb | fungi: also we should look for ssh connectivity problems as these tend to start from that | 15:03 |
*** Guest4101 is now known as notcorvus | 15:03 | |
*** notcorvus is now known as corvus | 15:03 | |
clarkb | I think if you ps and grep for the controlmaster processes you can find old ones that might indicate bad connectivity | 15:03 |
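A sketch of the kind of check being described, assuming ansible's default control path on bridge:

```bash
# ansible keeps its persistent ssh control sockets under ~/.ansible/cp/ by
# default; stale master processes that all point at one host suggest that
# host has connectivity trouble.
ps -ef | grep '[s]sh' | grep '\.ansible/cp'
```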
fungi | i also see a couple from Aug11 | 15:03 |
*** corvus is now known as reallynotcorvus | 15:04 | |
*** reallynotcorvus is now known as corvus | 15:04 | |
fungi | some of these are showing up as defunct too, so not sure if they'll be killable | 15:04 |
clarkb | logstash-worker11 and elasticsearch06 are maybe sad hosts | 15:05 |
clarkb | fungi: do you think you can check on those and reboot them while I dig through processes that we might be able to clean up? | 15:05 |
fungi | yeah, looking into them | 15:06 |
clarkb | elasticsearch02 maybe as well | 15:06 |
fungi | Connection closed by 2001:4800:7819:103:be76:4eff:fe04:b9d7 port 22 | 15:06 |
clarkb | next I'll clean out the base.yaml playbooks from august 12. That playbook doesn't seem to be running in zuul either | 15:07 |
fungi | yeah, all three of them are resetting connections on 22/tcp | 15:07 |
fungi | i'll check their oob consoles | 15:07 |
corvus | tristanC: http://eavesdrop01.opendev.org:9001/ is answering ... what's the URI for prometheus stats? and are you monitoring it now? do you have enough data to see if the connection issue is resolved? | 15:08 |
clarkb | All of the august 12 ansible processes seem to be cleaned up and load has fallen significantly | 15:11 |
clarkb | Looking at etherpad the job finished and it is still running the old image. I think we should manually pull and up -d as soon as we are happy with bridge | 15:11 |
fungi | all three of the servers you mentioned were showing hung kernel tasks reported on their consoles, i've hard rebooted them | 15:12 |
clarkb | thanks | 15:12 |
clarkb | those were the three IPs I saw with stale ssh control processes | 15:12 |
fungi | i can ssh into all three of them now, though i expect the elasticsearch data is shot | 15:13 |
clarkb | fungi: I would give it a bit to try and recover on its own (but check if the processes need to start) | 15:14 |
clarkb | then we can delete any corrupted indexes once it has had a chance to recover | 15:14 |
fungi | #status log Hard rebooted elasticsearch02, elasticsearch06, and logstash-worker11 as all three seemed to be hung | 15:14 |
opendevstatus | fungi: finished logging | 15:14 |
clarkb | ansible is busy now but all of the processes related to ansible on bridge seem to be current | 15:15 |
clarkb | ya elasticsearch doesn't seem to be running on 02 | 15:16 |
clarkb | fungi: should I start those processes? | 15:16 |
fungi | oh, yeah i forgot it doesn't start them automatically | 15:16 |
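For reference, bringing elasticsearch back by hand is roughly the following (the stock service unit name is assumed; the log doesn't show the exact command used):

```bash
# Start the service, then give the cluster a chance to recover on its own
# before deciding any indexes are corrupted.
sudo systemctl start elasticsearch
curl -s 'http://localhost:9200/_cluster/health?pretty'
```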
fungi | i guess once the prod-hourly builds complete, we can see if there are any lingering ansible processes | 15:17 |
fungi | fatal: [etherpad01.opendev.org]: FAILED! => { ... "cmd": "docker-compose up -d" | 15:18 |
clarkb | there are a set of base.yaml playbooks running with current timestamps however I see no associated job. I half wonder if we unstuck those processes by rebooting the servers | 15:19 |
fungi | maybe | 15:19 |
clarkb | fungi: ya but it definitely restarted the containers. I think from ansible's perspective it sees it as a failure but it did restart | 15:19 |
clarkb | fungi: that said I think our next step is to rerun pull on etherpad and up -d to get the image | 15:19 |
clarkb | fungi: do you want to do that or should I? | 15:19 |
fungi | i'll do that now | 15:19 |
clarkb | thanks | 15:19 |
fungi | ERROR: readlink /var/lib/docker/overlay2/l: invalid argument | 15:20 |
clarkb | you get that when trying to up the service? | 15:21 |
fungi | yes | 15:21 |
fungi | the compose file looks fine though, not truncated | 15:21 |
clarkb | stackoverflow says that is a corrupted image. https://stackoverflow.com/questions/55334380/error-readlink-var-lib-docker-overlay2-invalid-argument | 15:21 |
fungi | argh | 15:22 |
clarkb | fungi: can you up just the mariadb container and see if that starts? | 15:22 |
clarkb | if that starts then we can delete and repull the etherpad image | 15:22 |
fungi | yeah, that works | 15:22 |
clarkb | `sudo docker image rm 5dbd5f4908bd` then docker-compose pull again? | 15:23 |
fungi | already done, almost finished pulling | 15:23 |
fungi | that's better | 15:23 |
fungi | that looks like newer etherpad now | 15:24 |
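Roughly the recovery sequence described above, run on the etherpad host from the directory holding its compose file (the service name and image ID are the ones mentioned in the conversation):

```bash
sudo docker-compose up -d mariadb        # confirm the db container is fine on its own
sudo docker image rm 5dbd5f4908bd        # drop the corrupted etherpad image
sudo docker-compose pull                 # re-fetch a clean copy
sudo docker-compose up -d                # recreate the containers
```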
clarkb | https://etherpad.opendev.org/p/project-renames-2021-07-30 loads for me now and ya looks newer | 15:24 |
fungi | i loaded the same one | 15:25 |
fungi | i see you active on it | 15:25 |
clarkb | https://meetpad.opendev.org/isitbroken loads that etherpad for me and I can add text | 15:25 |
clarkb | I'm not too worried about the actual call as long as the pad loads there and it seems to | 15:25 |
fungi | joining | 15:25 |
fungi | also looks like recent improvements in jitsi-meet or etherpad (or both) have made the window embedding a bit more serviceable | 15:32 |
clarkb | meetpad and etherpad both seem happy. If you notice anything feel free to mention it | 15:32 |
fungi | clarkb: for the kata listserv, should i go ahead and start trying to create a server snapshot? | 15:33 |
clarkb | fungi: if you want to. The thing I'm always confused about is what do we need to do on the server to make it safe to boot the resulting snapshot? Do we disable and stop exim and mailman? | 15:34 |
clarkb | that is my biggest concern and I'm not completely up to speed on how all the file spooling works there to feel confident in doing it myself | 15:34 |
fungi | clarkb: i guess it's a question of what we want to do with the snapshot. if we just keep it as insurance in case the in-place upgrade goes sideways, we shouldn't need to disable anything because we wouldn't boot them both at the same time | 15:34 |
clarkb | oh ya I was thinking we would boot the snapshot and run through an upgrade on the booted snapshot | 15:35 |
clarkb | then do the upgrade on the actual server and it will serve as both a fallback and a test system | 15:35 |
fungi | in that case we could stop and disable the exim and mailman services while snapshotting, i guess | 15:36 |
clarkb | I figured doing that sort of thing with the lower traffic lists.kc.io would be less impactful | 15:37 |
clarkb | but then we'd get basically the same confidence out of upgrading it vs the prod snapshot | 15:37 |
fungi | sure, versions would be the same, though our multi-site setup wouldn't | 15:38 |
fungi | i also need to work out what to tweak in a dnm change to break lodgeit testing so i get a held paste equivalent for further troubleshooting the pastebinit regression | 15:39 |
clarkb | fungi: put an assert False in system-config testinfra/test_paste.py | 15:40 |
fungi | oh, yeah that'd do it | 15:40 |
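A minimal sketch of that suggestion; the test name is made up, and the point is only to make the system-config paste job fail so zuul's autohold keeps the node:

```bash
# From a system-config checkout, append an always-failing test, then push
# the result as a DNM change with an autohold set for the job.
cat >> testinfra/test_paste.py <<'EOF'

def test_dnm_force_failure(host):
    # deliberately fail so the held node stays around for debugging
    assert False
EOF
```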
clarkb | I'm going to go find something to eat now that etherpad seems happy, but then after will look at the lists.kc.io stuff if you haven't already done it | 15:40 |
tristanC | corvus: `curl http://eavesdrop01.opendev.org:9001/metrics | grep ssh_errors` shows no errors | 15:40 |
fungi | clarkb: sounds good, i'll get started temporarily disabling things there shortly | 15:41 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM: Break paste for an autohold https://review.opendev.org/c/opendev/system-config/+/804535 | 15:42 |
corvus | tristanC: great, thanks! i'll work on an email / timeline for moving #zuul :) | 15:43 |
fungi | clarkb: i guess i can clear your autohold for the etherpad upgrade testing? | 15:45 |
fungi | mnaser: do you still need held nodes for multi-arch container debugging in node-labeler or uwsgi build errors in loci-keystone? | 15:47 |
Clark[m] | fungi yes you can clear my etherpad autohold | 15:50 |
fungi | thanks, done | 15:51 |
fungi | it's fun that autohold and autohold-list need --tenant but autohold-delete errors if you supply --tenant | 15:51 |
fungi | i should probably be using the standalone zuul-client instead of the rpc client anyway | 15:52 |
fungi | prod-hourly jobs are almost done, it's on the last one now. though it likely won't complete before the top of the hour | 15:53 |
fungi | regardless, there are no ansible processes on bridge older than a minute, so looks like cleanup was thorough | 15:54 |
fungi | and load average has dropped from 12 to around 1, so lots better | 15:55 |
fungi | the last job did wrap up its deployment tasks before the top of the hour, and i caught bridge with 0 ansible processes | 15:59 |
fungi | squeaky clean | 15:59 |
clarkb | just in time for the next hourly run | 16:00 |
fungi | indeed | 16:00 |
fungi | i've put lists.katacontainers.io into the deployment disable list, disabled and stopped the exim4.service and mailman.service units, and initiated image creation in the provider for the server now | 16:07 |
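The preparation steps described above, sketched with the service unit names from the log; the snapshot itself was taken through the provider's web UI, so the API equivalent is only a hedged illustration:

```bash
# Keep ansible away from the host during the snapshot (the disable list
# lives on bridge; its exact path isn't shown in this log), then quiesce
# mail handling so the image doesn't capture half-delivered messages.
sudo systemctl disable --now exim4.service mailman.service

# An API equivalent of the web UI snapshot would look something like:
#   openstack server image create --name lists-kc-pre-upgrade lists.katacontainers.io
```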
clarkb | fungi: oh the other question I had about that was what client do you use to talk to the rax snapshot api? | 16:08 |
clarkb | does it work with a modern osc? | 16:08 |
clarkb | also thank you! | 16:08 |
fungi | i just used their webui | 16:08 |
clarkb | ah | 16:08 |
fungi | since i already had it up for the oob console stuff on the hung servers a few minutes ago | 16:08 |
fungi | it's currently still "saving" | 16:14 |
fungi | imaging is complete, putting services back in place now | 16:19 |
clarkb | another benefit to using that server for this is much quicker snapshotting | 16:19 |
fungi | yes | 16:19 |
fungi | and it's back out of the disable list again | 16:20 |
clarkb | I need to do a bunch of paperwork type stuff today, but hopefully monday we can boot that and test an upgrade | 16:20 |
fungi | i also double-checked that services were running on it after starting | 16:20 |
fungi | wfm | 16:20 |
fungi | time to see if my well-laid trap caught a paste server | 16:20 |
fungi | we got one | 16:21 |
fungi | this is going to get tricky, pastebinit hard-codes server names, and also verifies ssl certs | 16:27 |
fungi | i'm starting to wonder if it's the server rename or redirect to https confusing it | 16:30 |
clarkb | fungi: try it with your browser to see? | 16:31 |
clarkb | with etherpad we had to set up /etc/hosts because of the redirect | 16:31 |
fungi | the browser's fine, and yeah i'm doing it with /etc/hosts entries to work around it | 16:31 |
clarkb | maybe use curl instead of pastebinit? | 16:33 |
clarkb | then you can control cert verification | 16:33 |
fungi | yeah, but i'll need to work out what pastebinit is passing to the method | 16:33 |
fungi | yep, i think i've confirmed it's the redirects | 16:34 |
fungi | i was able to use pastebinit with the held server by making the vhost no longer redirect from http to https | 16:34 |
fungi | thing is, pastebinit has a list of allowed hostnames, one of which is paste.openstack.org | 16:35 |
fungi | trying to use it with the name paste.opendev.org throws an error | 16:35 |
fungi | oh, though i think it may be due to the way the redirect was constructed | 16:36 |
fungi | we didn't redirect to https://paste.opendev.org/$1 we're just redirecting to the root url | 16:36 |
fungi | i'll see if that works with a more thorough redirect | 16:37 |
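One quick way to check what the redirect actually does to a pasted path, roughly as described above (the path used here is just an example):

```bash
curl -sI http://paste.opendev.org/show/800000/ | grep -i '^location:'
# A response of "Location: https://paste.opendev.org/" (path dropped) is the
# behaviour being described; a path-preserving redirect would echo the
# original path back in the Location header.
```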
fungi | yeah, no luck getting the redirect to work with pastebinit, but if i get rid of the redirect it's fine. just "tested" on the production server by editing its apache vhost config and that got pastebinit working | 16:49 |
fungi | also since we don't allow search engines to index the content there, and we don't support "secretly" pasting to it really, there's no real need to redirect from http to https | 16:51 |
fungi | i'll propose a change | 16:51 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Stop redirecting for the paste site https://review.opendev.org/c/opendev/system-config/+/804539 | 17:01 |
fungi | Unit193: ianw: clarkb: ^ that seems to be the fix | 17:01 |
clarkb | fungi: does pastebinit work with https:// too? | 17:01 |
fungi | it would i think, but we'd need to update the site entry at https://phab.lubuntu.me/source/pastebinit/browse/master/pastebin.d/paste.openstack.org.conf | 17:02 |
fungi | regexp = http://paste.openstack.org | 17:02 |
fungi | right now trying it results in the following error: | 17:03 |
fungi | Unknown website, please post a bugreport to request this pastebin to be added (https://paste.openstack.org) | 17:03 |
opendevreview | Jeremy Stanley proposed opendev/lodgeit master: Properly handle paste exceptions https://review.opendev.org/c/opendev/lodgeit/+/804540 | 17:09 |
fungi | and that's ^ the other bug i discovered in digging into the problem | 17:09 |
fungi | lest upstream just starts smacking down every bug report from someone using a distro package | 17:20 |
fungi | (which happens in lots of projects) | 17:20 |
clarkb | wrong window? :) | 17:21 |
fungi | hah, yep | 17:21 |
clarkb | looks like refstack had a backup failure. I'm hoping that it like lists is a one off internet is weird situation | 17:21 |
Unit193 | fungi: https://github.com/lubuntu-team/pastebinit/issues/6 isn't reassuring about the state of things. | 17:25 |
fungi | Unit193: well, regardless we'll strive to keep backward compatibility with old pastebinit versions, so once 804539 merges and deployed things should hopefully stay working | 17:26 |
fungi | (and it's temporarily working now, since i directly applied that change to the apache config to make sure it's sane) | 17:27 |
fungi | but thanks for the pointer to that github issue, i didn't know about arch using a fork... if i get a moment i'll file a bug in debian to suggest switching to the same fork | 17:28 |
Unit193 | The maintainer in Debian is the Lubuntu team guy... I may go poking around to see what I can find. | 17:29 |
fungi | ahh, yeah. also that fork on gh arch is using doesn't seem to actually differ from the revision history in the lubuntu phabricator | 17:31 |
fungi | Unit193: please let me know what you find, and thanks again for alerting us to the issue before i ran into it myself! | 17:33 |
Unit193 | Hah, sure thing. | 17:33 |
Unit193 | And thanks for taking errors over IRC too. | 17:33 |
fungi | my preference, really ;) | 17:34 |
fungi | clarkb: ianw: looks like lance has e-mailed us asking if we've seen new issues with leaked/stuck images in osuosl | 17:42 |
clarkb | | 0000041150 | 0000000001 | osuosl-regionone | ubuntu-focal-arm64 | ubuntu-focal-arm64-1628601595 | 7e23243b-aee2-4100-b702-d7e05f456606 | deleting | 01:00:50:58 | that might be a leak | 17:44 |
clarkb | looking at a cloud side image list there may be a few leaks there too | 17:45 |
clarkb | debian-bullseye-arm64-1627056483 for example | 17:46 |
clarkb | created_at | 2021-07-23T16:08:06Z for that bullseye image but I don't see it in nodepool | 17:47 |
clarkb | I can compile a list and see what others think of it | 17:47 |
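A sketch of how such a list might be compiled; the cloud name is a placeholder for whatever entry bridge's clouds.yaml uses for osuosl:

```bash
# Glance's view of the region...
openstack --os-cloud <osuosl-cloud> image list

# ...versus nodepool's view of its uploads there; provider image names that
# appear in glance but not here are candidate leaks.
nodepool image-list | grep osuosl-regionone
```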
clarkb | ubuntu-focal-arm64-1628601595 cannot be deleted because it is in use through the backend store outside of glance | 17:49 |
clarkb | server list shows no results though | 17:49 |
fungi | we're not doing bfv for the mirror or builder are we? | 17:57 |
clarkb | we might be, but those should use images we don't build | 17:59 |
clarkb | the builder is in linaro not osuosl. The osuosl mirror is booted from Ubuntu 20.04 (7ffbb2e7-d2f4-467a-9512-313a1c6b6afd) | 18:00 |
clarkb | I've got an email just about ready to send to Lance | 18:00 |
clarkb | sent | 18:02 |
fungi | thanks! | 18:16 |
*** dviroel|ruck is now known as dviroel|out | 19:42 | |
clarkb | fungi: looks like a bunch of hosts had backup failures? | 19:58 |
clarkb | both servers report they have disk space | 19:59 |
clarkb | looking at kdc03 the main backup failed but then the stream succeeded | 20:00 |
clarkb | Connection closed by remote host. Is borg working on the server? was the error | 20:00 |
clarkb | I'm going to try rerunning in screen on kdc03 | 20:02 |
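Roughly what the manual rerun looks like; the wrapper script name and log path are guesses at the convention rather than the exact files on kdc03:

```bash
# Rerun the host's backup inside screen so a dropped ssh session doesn't
# interrupt a long run, then check its log for the error mentioned below.
screen -S borg-rerun
sudo /usr/local/bin/borg-backup-<backup-server>.sh   # wrapper name assumed
sudo tail -n 100 /var/log/borg-backup-*.log          # log path also assumed
```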
clarkb | looking at the log more closely they all started about 2 hours before they errored | 20:03 |
fungi | mm, yeah refstack, storyboard, kdc03, translate, review | 20:04 |
fungi | also gitea01 twice (i guess one was the usual db backup failure?) | 20:04 |
fungi | all of those except kdc03 have mysql databases | 20:04 |
clarkb | kdc03 does do a stream backup of something though | 20:05 |
clarkb | that said running it manually succeeded | 20:05 |
clarkb | I suspect there was a network blip of some sort | 20:05 |
clarkb | fungi: note all of those started around 17:12 then timed out after 2 hours and reported failure around 19:12 | 20:05 |
clarkb | I suspect this isn't a persistent issue given the kdc03 rerun succeeded | 20:06 |
fungi | yeah, makes sense | 20:06 |
clarkb | fungi: but you can check the log on kdc03 to see what it did. It was failure on normal backup, success on stream, then a bit later (nowish) I reran and you get success for both | 20:06 |
clarkb | I guess check it tomorrow and see if things persist | 20:08 |
fungi | yeah, missing one day is unlikely to be catastrophic | 20:09 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/804460 reviewing that one would be good before the memory of what the renames were like becomes too stale :) | 20:11 |
fungi | sure, i should be able to take a look now, thanks | 20:26 |
fungi | clarkb: left one question on it, otherwise lgtm | 20:30 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update our project rename docs https://review.opendev.org/c/opendev/system-config/+/804460 | 20:33 |
clarkb | nice catch that was indeed meant to be rooted | 20:34 |
fungi | debian bullseye releasing this weekend, probably | 20:34 |
clarkb | exciting | 20:35 |
fungi | yeah, scheduled for tomorrow | 20:36 |