clarkb | ++ | 00:01 |
ianw | clarkb: thanks, responded to the commenty bits; will fill all the other stuff now | 00:07 |
ianw | clarkb: oh, that was the other thing, i called it codesearch because it's pretty heavily configured to be our codesearch | 00:10 |
ianw | like the container starts and writes out the config pulled from project-config projects. so it's not really a generic hound container | 00:11 |
clarkb | ya I think we've done that with things like gitea too but still call it gitea? | 00:11 |
ianw | fair enough | 00:11 |
*** tosky has quit IRC | 00:14 | |
openstackgerrit | Merged openstack/project-config master: Revert "Disable limestone provider due to IPv4-less nodes" https://review.opendev.org/763254 | 00:25 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Migrate codesearch site to container https://review.opendev.org/762960 | 00:35 |
*** dmellado has quit IRC | 01:03 | |
*** dmellado has joined #opendev | 01:04 | |
*** hamalq has quit IRC | 02:09 | |
openstackgerrit | Merged opendev/system-config master: devel job: use ansible-core name https://review.opendev.org/763099 | 02:28 |
*** ysandeep|holiday is now known as ysandeep|off | 02:59 | |
*** d34dh0r53 has quit IRC | 03:24 | |
*** d34dh0r53 has joined #opendev | 03:27 | |
openstackgerrit | Ian Wienand proposed opendev/zone-opendev.org master: Add codesearch.opendev.org https://review.opendev.org/763297 | 03:32 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add codesearch.opendev.org server https://review.opendev.org/763298 | 03:34 |
ianw | ok, i've brought up a new codesearch server, and also added the acme-challenge cname in openstack.org to acme.opendev.org so its cert can cover that too | 03:38 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Build new gerrit images https://review.opendev.org/763299 | 04:02 |
*** raukadah is now known as chandankumar | 04:03 | |
*** ykarel has joined #opendev | 04:19 | |
*** jaicaa has quit IRC | 04:56 | |
*** jaicaa has joined #opendev | 04:56 | |
*** marios has joined #opendev | 06:05 | |
*** marios has quit IRC | 06:15 | |
*** marios has joined #opendev | 06:18 | |
*** hamalq has joined #opendev | 06:25 | |
*** jaicaa has quit IRC | 06:37 | |
*** jaicaa has joined #opendev | 06:39 | |
*** slaweq has joined #opendev | 07:00 | |
*** eolivare has joined #opendev | 07:09 | |
*** sboyron has joined #opendev | 07:20 | |
*** sboyron has quit IRC | 07:23 | |
*** sboyron has joined #opendev | 07:23 | |
*** marios is now known as marios|ruck | 07:37 | |
*** DSpider has joined #opendev | 07:39 | |
*** ralonsoh has joined #opendev | 07:43 | |
*** bhagyashris|off is now known as bhagyashris | 07:55 | |
*** hashar has joined #opendev | 08:00 | |
*** rpittau|afk is now known as rpittau | 08:03 | |
*** andrewbonney has joined #opendev | 08:10 | |
*** roman_g has joined #opendev | 08:15 | |
*** icey has joined #opendev | 08:17 | |
*** hamalq has quit IRC | 08:28 | |
*** tosky has joined #opendev | 08:44 | |
*** lpetrut has joined #opendev | 08:47 | |
*** mgoddard has joined #opendev | 08:58 | |
*** ykarel_ has joined #opendev | 09:04 | |
*** ykarel has quit IRC | 09:07 | |
*** mlavalle has quit IRC | 09:09 | |
*** mlavalle has joined #opendev | 09:12 | |
*** icey has quit IRC | 09:17 | |
*** icey has joined #opendev | 09:24 | |
*** hamalq has joined #opendev | 09:29 | |
*** hamalq has quit IRC | 09:34 | |
*** dtantsur|afk is now known as dtantsur | 09:41 | |
*** ykarel_ is now known as ykarel | 09:57 | |
*** hamalq has joined #opendev | 10:07 | |
*** hamalq has quit IRC | 10:11 | |
*** d34dh0r53 has quit IRC | 10:21 | |
*** hamalq has joined #opendev | 10:27 | |
*** hamalq has quit IRC | 10:32 | |
*** ykarel_ has joined #opendev | 10:33 | |
*** ykarel has quit IRC | 10:35 | |
*** icey has quit IRC | 10:36 | |
*** icey has joined #opendev | 10:47 | |
*** icey has quit IRC | 10:52 | |
*** icey has joined #opendev | 11:03 | |
*** hamalq has joined #opendev | 11:09 | |
*** hamalq has quit IRC | 11:14 | |
*** ykarel__ has joined #opendev | 11:25 | |
*** ykarel_ has quit IRC | 11:28 | |
*** tkajinam has quit IRC | 11:29 | |
*** tkajinam has joined #opendev | 11:30 | |
*** hamalq has joined #opendev | 11:30 | |
*** ykarel__ is now known as ykarel | 11:33 | |
*** hamalq has quit IRC | 11:35 | |
*** hamalq has joined #opendev | 11:51 | |
*** hamalq has quit IRC | 11:56 | |
openstackgerrit | Slawek Kaplonski proposed zuul/zuul-jobs master: [multi-node-bridge] Add script to configure connectivity https://review.opendev.org/762650 | 12:02 |
*** hamalq has joined #opendev | 12:12 | |
*** hamalq has quit IRC | 12:16 | |
*** hamalq has joined #opendev | 12:34 | |
*** kevinz has joined #opendev | 12:36 | |
*** hamalq has quit IRC | 12:39 | |
*** rpittau is now known as rpittau|brb | 13:16 | |
*** hamalq has joined #opendev | 13:40 | |
dtantsur | hey folks! is it only me or viewing logs on https://zuul.opendev.org/t/openstack/build/ has become really slow recently? | 13:45 |
*** hamalq has quit IRC | 13:45 | |
fungi | dtantsur: slowness in the logs tab, the summary tab or the console tab? | 14:00 |
fungi | ansi color rendering was recently added to the summary and console views, and we have evidence suggesting it causes order-of-magnitude or greater increases in display time, at least for those views | 14:01 |
fungi | apparently it gets much worse the larger the ansible json is | 14:01 |
*** ykarel_ has joined #opendev | 14:05 | |
dtantsur | fungi: pretty much everything is quite slow for me, the summary takes seconds to open, firefox shows "this tab slows down your browser" | 14:07 |
*** ykarel has quit IRC | 14:07 | |
*** lamt has joined #opendev | 14:08 | |
fungi | dtantsur: on all build results or just ones with lots of output? for example, this loads quickly for me: https://zuul.opendev.org/t/openstack/build/183b590240ab4527a2f6d5e3382d2a05 | 14:11 |
dtantsur | yep, this was pretty fast | 14:15 |
dtantsur | probably our dsvm jobs then | 14:15 |
fungi | we're talking in #zuul about reverting the ansi color rendering for now to continue working on it and get some better performance benchmarks before adding it back | 14:15 |
fungi | so if that's the cause, you'll probably know some time today | 14:15 |
*** auristor has quit IRC | 14:20 | |
*** auristor has joined #opendev | 14:20 | |
*** d34dh0r53 has joined #opendev | 14:25 | |
dtantsur | great! I'll check again | 14:31 |
*** mgoddard has quit IRC | 14:48 | |
*** rpittau|brb is now known as rpittau | 14:51 | |
dtantsur | also, could we remove the browser warning from the meetup, given that 1) firefox works okay nowadays, 2) chromium is currently broken? | 15:04 |
dtantsur | s/meetup/meetpad/ | 15:05 |
*** mgoddard has joined #opendev | 15:07 | |
fungi | chromium's broken? | 15:07 |
fungi | and yeah, we mainly added that warning because we were getting many firefox users saying they were unable to get jitsi to work for them; it cut down the number of requests for assistance | 15:08 |
fungi | also supposedly the webrtc renderer in firefox performed worse, no idea if they've worked on improving that in recent months | 15:09 |
dtantsur | fungi: there is a bug currently with chromium crashing on switching windows when webrtc is used | 15:12 |
dtantsur | I think firefox has improved, but I have no data to back this statement | 15:13 |
fungi | oh neat. i haven't witnessed that but i generally use chromium only for videoconferencing and keep it separate from my locked-down firefox with all the privacy usability extensions | 15:13 |
fungi | dtantsur: any specific chromium versions impacted? i'm using the 83.0.4103.116-3.1 build in debian/unstable currently | 15:14 |
dtantsur | fungi: https://bugzilla.redhat.com/show_bug.cgi?id=1895920 | 15:15 |
openstack | bugzilla.redhat.com bug 1895920 in chromium "Chromium 86 crashes on WebRTC videos when switching window" [Urgent,New] - Assigned to spotrh | 15:15 |
dtantsur | I haven't dived into that, just using firefox | 15:15 |
fungi | ahh, okay, so my chromium is fairly old apparently. that would probably explain why i haven't seen it | 15:16 |
*** tosky has quit IRC | 15:18 | |
*** mlavalle has quit IRC | 15:21 | |
fungi | anybody else happen to know if firefox is working better with jitsi recently? | 15:22 |
*** tosky has joined #opendev | 15:22 | |
*** tosky has quit IRC | 15:26 | |
*** elod has quit IRC | 15:27 | |
*** elod has joined #opendev | 15:27 | |
*** ykarel_ has quit IRC | 15:29 | |
*** tosky has joined #opendev | 15:50 | |
smcginnis | Just noticed today I am getting a publickey error when trying to "git review -d". | 15:57 |
smcginnis | Tried SSHing into the port and see the error - debug1: send_pubkey_test: no mutual signature algorithm | 15:58 |
smcginnis | Any recent changes that might be related to this? | 15:58 |
clarkb | smcginnis: did you upgrade to fedora33? | 15:58 |
smcginnis | Only recent change on my end, that I can think of, is upgrading to Fedora 33 and now having py39 as default.. yep | 15:59 |
clarkb | that's the change then. openssh has deprecated using sha1 for hostkey exchanges. fedora33 has taken it a step further and disabled that. Our gerrit 2.13 ssh server doesn't do sha2 and you get that failure | 15:59 |
clarkb | once we've upgraded that should go away. So hopefully next week it is a non issue. In the meantime you can do a host specific ssh config override to allow sha1 host key exchanges with gerrit | 16:00 |
smcginnis | Ah... any examples of that I can use? | 16:00 |
smcginnis | KexAlgorithms +diffie-hellman-group1-sha1 ? | 16:01 |
clarkb | https://unix.stackexchange.com/a/340853 | 16:02 |
smcginnis | Thanks clarkb! | 16:02 |
clarkb | I think | 16:02 |
clarkb | I haven't had to do it myself. | 16:02 |
smcginnis | I'll give it a shot. | 16:02 |
smcginnis | And report back. | 16:02 |
fungi | yeah, ought to just be able to add a review.opendev.org section to your ~/.ssh/config and set that for now | 16:08 |
smcginnis | Looks like maybe that is the sshd option. Client side, I had to add PubkeyAcceptedKeyTypes +ssh-rsa | 16:10 |
fungi | that sounds right | 16:11 |
smcginnis | I'll post something to the ML in case anyone else upgrades to f33 before the gerrit upgrade. | 16:11 |
clarkb | note the gerrit upgrade is supposed to start tomorrow :) | 16:11 |
smcginnis | Yeah, small window. | 16:11 |
smcginnis | Guess I can skip the ML. Not likely someone will decide to do it during the weekday. | 16:12 |
clarkb | fwiw I did test that this error goes away with upgraded gerrit | 16:12 |
clarkb | so pretty confident in that :) and not just hoping | 16:13 |
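For reference, the workaround being described is a per-host override in ~/.ssh/config roughly like the following (a sketch: the port is Gerrit's standard SSH port, the key path is just an example, and this should only be needed until the Gerrit upgrade lands):

```
# Temporary workaround for Fedora 33 / newer OpenSSH clients talking to Gerrit 2.13
Host review.opendev.org
    Port 29418
    # Newer OpenSSH disables RSA/SHA-1 signatures by default; Gerrit 2.13's SSH
    # server only offers SHA-1, so re-enable it for this host only.
    PubkeyAcceptedKeyTypes +ssh-rsa
    # IdentityFile ~/.ssh/id_rsa   # adjust to whichever key Gerrit knows about
```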
fungi | infra-root: not sure if you saw earlier, but dtantsur linked to a bug about chromium 86 builds being broken with jitsi and similar videoconferencing tools (though it looks like maybe it's fixed in chromium 87). worth keeping an eye out for if people report problems with meetpad | 16:15 |
clarkb | looks like it affects wayland and x11 | 16:16 |
clarkb | I'm up for doing a call with chrome later and seeing if it fails similarly | 16:16 |
fungi | yep, and apparently downgrading to 85 is a "bad idea" because of a serious security vulnerability in it | 16:16 |
fungi | supposedly people testing with chrome did not experience the same problems as with chromium | 16:17 |
clarkb | fungi: yes chrom* has patched a number of vulnerabilities that are being exploited in the wild | 16:17 |
clarkb | in like the last 2 weeks | 16:17 |
clarkb | ah ya I see where chrome is reported to not be affected | 16:18 |
fungi | related, he's requested we drop the warning about firefox not working with meetpad... apparently it does work at least to some degree (we knew that much) but i suppose it's worth revisiting whether the reduced support burden from people using firefox and seeing suboptimal behavior/performance justifies annoying the firefox users who have been able to make it work for them anyway | 16:19 |
clarkb | fungi: I'm willing to test firefox too and at least try to quickly reproduce our previous experiences | 16:20 |
clarkb | if we can't reproduce then we can drop the warning | 16:20 |
*** hamalq has joined #opendev | 16:20 | |
*** hamalq has quit IRC | 16:25 | |
*** lpetrut has quit IRC | 16:42 | |
*** zaro69 has joined #opendev | 16:51 | |
*** rpittau is now known as rpittau|afk | 16:52 | |
*** marios|ruck is now known as marios|out | 16:57 | |
*** marios|out has quit IRC | 17:00 | |
*** hamalq has joined #opendev | 17:01 | |
*** d34dh0r53 has quit IRC | 17:02 | |
*** zaro69 has quit IRC | 17:03 | |
*** zaro95 has joined #opendev | 17:05 | |
*** zaro95 has quit IRC | 17:06 | |
*** d34dh0r53 has joined #opendev | 17:06 | |
*** zaro48 has joined #opendev | 17:07 | |
*** eolivare has quit IRC | 17:09 | |
*** fressi has quit IRC | 17:22 | |
*** ralonsoh_ has joined #opendev | 17:27 | |
*** ralonsoh has quit IRC | 17:28 | |
*** roman_g has quit IRC | 17:43 | |
*** roman_g has joined #opendev | 17:44 | |
*** roman_g has quit IRC | 17:44 | |
*** hamalq has quit IRC | 17:45 | |
clarkb | fungi: going through my pre upgrade notes, we did update prod to update the refs/meta/config perms right? Otherwise I'm not really finding much other than get those images rebuilt and published then work through the backup initialization/prep/testing | 17:45 |
*** roman_g has joined #opendev | 17:45 | |
*** hamalq has joined #opendev | 17:45 | |
*** roman_g has quit IRC | 17:45 | |
clarkb | but please call out anything that we may have missed or should double check prior to tomorrow | 17:45 |
fungi | yes, we (at least i think i) did | 17:46 |
fungi | we can work on adding our backup volume next i suppose | 17:46 |
fungi | i'll get that created and attached shortly | 17:46 |
*** roman_g has joined #opendev | 17:46 | |
*** roman_g has quit IRC | 17:47 | |
clarkb | thanks! | 17:47 |
clarkb | oh we also need to do the maintenance html. I can work on that in a bit | 17:50 |
fungi | we may want to keep an eye out for anything like https://bugs.chromium.org/p/gerrit/issues/detail?id=13701 | 17:51 |
fungi | supposedly newer jetty is causing folks to need to add 'RequestHeader set "X-Forwarded-Proto" expr=%{REQUEST_SCHEME}' to their apache reverse proxy configs | 17:51 |
clarkb | that should be easy enough to add assuming that our version of apache supports that | 17:52 |
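If it does turn out to be needed, the upstream-suggested fix is a single directive in the reverse-proxy vhost; a sketch (the proxy target here is a placeholder, and the expr= form needs Apache 2.4.10+ with mod_headers enabled):

```
<VirtualHost *:443>
    # Let Gerrit's newer Jetty know the original request scheme so it
    # generates https redirects instead of falling back to http.
    RequestHeader set "X-Forwarded-Proto" expr=%{REQUEST_SCHEME}

    # Placeholder proxy target; the real vhost points at the local Gerrit.
    ProxyPass        / http://127.0.0.1:8081/ nocanon
    ProxyPassReverse / http://127.0.0.1:8081/
</VirtualHost>
```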
clarkb | should we go ahead and add that to review-test? | 17:53 |
*** zaro48 has quit IRC | 17:53 | |
fungi | well, maybe we need to double-check that it's a problem for us first | 17:54 |
clarkb | we don't enable the plugin manager so can't test easily with the given example | 17:54 |
fungi | i'm not a fan of cargo-culting stuff like that unless we're sure it's necessary | 17:54 |
clarkb | I guess we can just browse around for a bit and see if we trip it | 17:54 |
clarkb | ++ | 17:54 |
clarkb | browsing changes seems fine | 17:54 |
clarkb | now to try searching | 17:54 |
*** zaro has joined #opendev | 17:55 | |
clarkb | searching also seems fine | 17:55 |
*** hamalq has quit IRC | 17:56 | |
*** hamalq has joined #opendev | 17:56 | |
*** fressi has joined #opendev | 17:58 | |
*** mlavalle has joined #opendev | 18:00 | |
*** mgoddard has quit IRC | 18:05 | |
clarkb | if anyone is able to replicate that issue on review-test let us know but I haven't succeeded so far | 18:05 |
fungi | infra-root: looks like ns1 has been hung since late utc tuesday. i'm investigating | 18:12 |
fungi | also apparently ns2 isn't responding on its ipv4 address (not sure how long, we seem to be monitoring it via ipv6 which is working fine as far as i can tell) | 18:17 |
openstackgerrit | Merged opendev/system-config master: Build new gerrit images https://review.opendev.org/763299 | 18:18 |
fungi | the oob console for ns1 shows the usual hung task messages on its console, though i was able to get it to initiate a soft reboot it seems | 18:20 |
clarkb | fungi: I can reach ns1 now and it seems to be running nsd | 18:23 |
clarkb | I guess the next thing is to sort out why ipv4 to ns2 is sad? | 18:23 |
fungi | yeah, i'm inspecting the logs on it | 18:23 |
*** dtantsur is now known as dtantsur|afk | 18:23 | |
fungi | looks like it was still logging ansible connections at 06:22:30 yesterday | 18:24 |
fungi | but that's where syslog abruptly ends | 18:24 |
clarkb | did it fill its disk? | 18:24 |
fungi | and it doesn't seem to have logged the current boot messages in syslog either | 18:25 |
clarkb | also are you looking at ns1 or ns2? | 18:25 |
fungi | nope, the fs is mostly empty | 18:25 |
fungi | ns1 | 18:25 |
fungi | i haven't started investigating the network issue for ns2 yet | 18:25 |
clarkb | I think we've seen that before on rax. Where the disk just goes away | 18:25 |
clarkb | and then the server gets really sad | 18:25 |
fungi | yeah, but i wonder why after a reboot nothing's logging to syslog | 18:26 |
clarkb | journalctl shows logs | 18:26 |
fungi | /dev/xvda1 on / type ext4 (rw,noatime,nobarrier,errors=remount-ro,data=ordered) | 18:26 |
clarkb | is rsyslog not running to slurp into the file? | 18:26 |
fungi | the rootfs still seems to be writeable | 18:27 |
clarkb | ya i don't see an rsyslog running | 18:27 |
clarkb | which will cause that to happen | 18:27 |
fungi | journalctl continued logging when syslog ended | 18:29 |
clarkb | fungi: yes aiui syslog relies on rsyslog being installed and it will slurp from journald into /var/log/syslog | 18:29 |
fungi | the last thing logged to syslog was ansible doing something with rsyslog | 18:29 |
fungi | <smoking gun> | 18:30 |
clarkb | corvus: unrelated to ^ have you seen https://zuul.opendev.org/t/openstack/build/571d8b35ec5f4857b7391437d080f45c/logs before? it looks like docker hub had a proper error trying to tag the 3.1 image | 18:30 |
clarkb | corvus: I don't think that is catastrophic for us because we don't intend on running 3.1 so we'll be fine on the older image | 18:30 |
clarkb | corvus: but if we can get it promoted that would be great | 18:31 |
fungi | Nov 18 06:22:30 ns1 python3[16348]: ansible-apt Invoked with name=rsyslog state=absent purge=True package=['rsyslog'] ... | 18:31 |
fungi | so we've got ansible explicitly uninstalling the rsyslog package? | 18:31 |
clarkb | hrm I recall ianw doing syslog things but don't think it was to uninstall it | 18:31 |
fungi | #status log rebooted ns1.opendev.org after it became unresponsive | 18:32 |
openstackstatus | fungi: finished logging | 18:32 |
clarkb | fungi: system-config/playbooks/roles/base/server/tasks/Debian.yaml we remove the package then reinstall it | 18:32 |
fungi | huh, i guess it logged the removal but not the reinstallation | 18:33 |
fungi | maybe it failed to start properly afterward | 18:33 |
clarkb | coincidence that it happened right at the time | 18:33 |
clarkb | ya | 18:33 |
clarkb | I think we should land the prescribed cleanup in that block though. Maybe when ianw's day starts | 18:33 |
clarkb | the server should self-correct given ^ once our hourly jobs run I think | 18:33 |
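For context, the base-role block in question follows a purge-then-reinstall pattern roughly like this (a simplified sketch, not the exact tasks in system-config); a host going unreachable between the two tasks is left with no syslog daemon until a later run completes:

```yaml
- name: Remove rsyslog so its locally modified config gets purged
  apt:
    name: rsyslog
    state: absent
    purge: true

# If the play fails or the host becomes unreachable at this point,
# the node sits without rsyslog until the next periodic run.

- name: Reinstall rsyslog with distro-default config
  apt:
    name: rsyslog
    state: present
```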
fungi | well, no "right at the time" it's still unclear to me when the server broke | 18:33 |
clarkb | gotcha | 18:33 |
fungi | i'm trying to piece that together now looking for a gap in journalctl messages prior to the reboot | 18:34 |
fungi | journalctl was logging right up to the end | 18:35 |
fungi | seems like services were seeing connections coming in | 18:35 |
clarkb | corvus: actually looking at dockerhub all of them seem to have updated | 18:36 |
fungi | looking to see if maybe the rootfs got marked read-only at some point | 18:36 |
clarkb | I think my concern there is that maybe pabelanger's container work has broken something? it does look like the promote job for 3.1 is complaining about 2.14 for some reason | 18:36 |
clarkb | corvus: if you have a minute to sanity check those today that would be great. I'll see what I can find too | 18:37 |
fungi | ns2 is reachable for me over ipv4 now, not sure what was going on with it earlier | 18:40 |
fungi | it wasn't even responding via icmp echo when i was testing previously | 18:41 |
pabelanger | clarkb: was the job using build-container-image? | 18:41 |
clarkb | no its opendev-build-docker-image | 18:42 |
clarkb | but you're modifying roles too right? I just want to sanity check that something isn't broken and that our images are still building properly before we do the upgrade tomorrow | 18:43 |
clarkb | it is incredibly difficult to map what docker hub shows you to anything the docker tools show you to determine if you've got a 1:1 match | 18:44 |
*** andrewbonney has quit IRC | 18:45 | |
clarkb | pabelanger: the thing that has me concerned is the job to promote the 3.1 image is complaining about the 2.14 image | 18:46 |
clarkb | which has my paranoia thinking: did we mixup the tags somehow | 18:47 |
fungi | looks like wmf also reported the same redirect problem with their gerrit and eclipse's too: https://bugs.chromium.org/p/gerrit/issues/detail?id=13705 | 18:48 |
fungi | really surprised we're not seeing it on review-test | 18:48 |
clarkb | ok I think I see what is going on | 18:49 |
clarkb | the last step in those jobs is to list all the tags and then clean out the obsolete tags | 18:49 |
clarkb | I think the race is that the 2.14 job deleted the 2.14 tag while the 3.1 job was listing tags and dockerhub broke | 18:50 |
clarkb | the actual promotion side of things seems to have been fine | 18:50 |
corvus | clarkb: i was just digging into that and came to the same conclusion | 18:50 |
clarkb | we should be fine in that case, and that can be something we clean up on the job side later :) | 18:50 |
corvus | they were within seconds of each other; haven't confirmed the sequence yet | 18:51 |
*** ralonsoh_ has quit IRC | 18:53 | |
corvus | the order is opposite what i expect, but there's still enough overlap for it to be a race, especially if there's a lot of locking/cdn stuff going on on docker's side. so i think that's the hypothesis we should go with | 18:55 |
corvus | probably should just put that in a retry | 18:55 |
clarkb | or maybe make it less greedy? I don't know if that is possible given the state we have | 18:55 |
corvus | well, it was just the listing that failed | 18:56 |
fungi | so a retry might work there? | 18:56 |
clarkb | oh right | 18:56 |
fungi | oh, you suggested a retry | 18:56 |
fungi | yeah, makes sense | 18:56 |
corvus | yeah. mind you, if we get past the listing working, we could end up with the same issue then moved to the actual delete stage. <shrug> | 18:56 |
corvus | either way, it's not terribly important. we could also just fail_when:false | 18:57 |
ianw | fungi: catching up; was the syslog ok in the end? | 18:58 |
fungi | ianw: no, for some reason rsyslogd didn't start successfully when ansible removed and reinstalled it, and it also didn't start on reboot | 18:58 |
fungi | i haven't checked yet to see why | 18:58 |
clarkb | fungi: ianw maybe it didn't reinstall | 18:58 |
ianw | clarkb: yeah, i want to do that cleanup, but wanted to get https://review.opendev.org/756605 into production, that got blocked because codesearch job failed, which led me to containerising it :) | 18:58 |
fungi | clarkb: bingo, looks like it's currently not installed on ns1 | 18:59 |
fungi | the last action logged in /var/log/dpkg.log was the uninstallation of rsyslog too | 18:59 |
fungi | and journalctl has truncated now... where do i find old journals? | 19:00 |
ianw | clarkb: if you could loop back on https://review.opendev.org/#/c/762960/ i started the server for it yesterday too. i think i'd like to get the gate around this fixed up one way or the other today so that's not an issue | 19:01 |
ianw | fungi: i'll check on the bridge side what it thought it did | 19:01 |
fungi | oh, nevermind, user error | 19:01 |
fungi | unfortunately, journalctl doesn't seem to record the ansible activity the way rsyslog did | 19:01 |
clarkb | ianw: yup I'll rereview that. We have a couple of gerrit upgrade prep things to do at this point: write maintenance.html and get the backup volume mounted and the initial sync done | 19:02 |
clarkb | I'm about to find lunch, then sneak in a bike ride if I can since the sun decided to show up today | 19:02 |
clarkb | sounds like fungi will do the backup volume mounting, I'll work on the maintenance.html, then hopefully once things wind down I can rereview codesearch? | 19:02 |
fungi | ahh, okay i found in the journal where it logged the rsyslog package removal but it doesn't seem to have tried to reinstall it immediately | 19:05 |
ianw | fatal: [ns1.opendev.org]: UNREACHABLE! => { | 19:05 |
fungi | ianw: yeah it was unreachable for a while today | 19:06 |
fungi | not sure for how long | 19:06 |
*** sboyron has quit IRC | 19:07 | |
*** sboyron has joined #opendev | 19:07 | |
fungi | i see "Nov 19 06:24:24 ns1 python3[22116]: ansible-apt Invoked with state=present name=['at', 'git', 'logrotate', 'lvm2', 'openssh-server', 'parted', 'rsync', 'rsyslog', 'strace', 'tcpdump', 'wget'] ..." in the journal | 19:07 |
ianw | when i look in base.yaml.log -- it's almost like it's running twice. things are all mixed up | 19:08 |
clarkb | probably different hosts? | 19:08 |
fungi | looks like it removed rsyslog Nov 18 06:22:29 but didn't try to install it again until Nov 19 06:24:24 (and for whatever reason that didn't work either) | 19:09 |
ianw | "debconf: delaying package configuration, since apt-utils is not installed" | 19:10 |
ianw | i wonder if that's involved | 19:10 |
fungi | oh, it also said to install rsyslog Nov 18 06:18:05 | 19:11 |
fungi | could it have gotten the install and remove steps backwards somehow? | 19:11 |
fungi | basically it installed at 06:18:05 but it was already installed so that presumably did nothing, then it removed at 06:22:29 | 19:12 |
fungi | and then didn't try to install again until the next day | 19:12 |
ianw | ns1.opendev.org : ok=29 changed=3 unreachable=1 failed=0 skipped=3 rescued=0 ignored=0 | 19:12 |
ianw | ns2.opendev.org : ok=30 changed=4 unreachable=1 failed=0 skipped=3 rescued=0 ignored=0 | 19:12 |
ianw | so it was both ok, and unreachable ? | 19:12 |
clarkb | those are task counts | 19:12 |
fungi | ianw: when was that? we also saw ipv4 was broken on ns2 earlier today | 19:12 |
clarkb | so 30 tasks ok but one was unreachable then it skipped the rest aiui | 19:13 |
ianw | yeah, this is in the latest base run logs | 19:13 |
fungi | the removal and install attempts i see look like they probably happened from our daily periodic job | 19:13 |
fungi | given the times | 19:14 |
fungi | (and frequency) | 19:14 |
ianw | anyway, now this has run, we should stop it doing this. sorry, i planned for this to happen in the matter of a few hours -- but then the gate got broken by codesearch | 19:14 |
ianw | if i try and update the base yaml i think it will run the failing codesearch job too | 19:15 |
ianw | "this" being the reinstall | 19:15 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: added lgtm with basic docs https://review.opendev.org/763428 | 19:15 |
fungi | for whatever reason, rsyslog was able to be reinstalled successfully on ns2 | 19:15 |
fungi | though similarly, it seems it got uninstalled in the daily run on wednesday and then wasn't installed again until the daily run on thursday | 19:17 |
fungi | so ns2 had no rsyslog for ~24 hours | 19:17 |
fungi | i guess that's not what was intended | 19:17 |
ianw | no, looking at the base.yaml, it seems it does the purge but not the reinstall, thinking the host was unreachable | 19:18 |
ianw | fungi: what's your preference here? i don't think any of us have interest in updating the current puppet & codesearch server to their new releases so we can rule that out | 19:19 |
ianw | i can make the job non-voting, and propose a change to stop the purge here now the old config file is gone | 19:19 |
ianw | or we can merge https://review.opendev.org/763298 to remove the codesearch puppet | 19:20 |
ianw | and then merge a change to avoid the reinstall without gate changes | 19:20 |
fungi | not entirely sure i grok the interrelationship between these issues, but happy to prioritize the codesearch container reviews if that gets rsyslog working on ns1 (would it also be broken anywhere else?) | 19:21 |
ianw | it's just that the current testing of the codesearch job is broken, which will block the gate for anything that tries to run it, like base file updates | 19:21 |
ianw | fungi: as far as i can tell from the base logs, it was only ns1/2 that seemed to become unreachable for the reinstall step | 19:22 |
*** mgoddard has joined #opendev | 19:27 | |
*** sboyron has quit IRC | 19:32 | |
*** sboyron_ has joined #opendev | 19:33 | |
fungi | clarkb: do you think we need the backup volume to be ssd? i guess it might speed up the maintenance a little if we don't have to wait as long as for rsync to sata? | 19:35 |
clarkb | I don't think it is necessary but it may help? | 19:36 |
fungi | clarkb: and the idea is to just sync /home/gerrit2 into it, right? so i can get by with a 100gb volume as the data in there is less than that | 19:37 |
clarkb | well we'll sync two sets of gerrit2 homedirs into it | 19:37 |
clarkb | so it should be large enough for both copies | 19:38 |
clarkb | 2.13 and 2 | 19:38 |
clarkb | *2.13 and 2.16 | 19:38 |
fungi | got it, i'll make it 256gb then? | 19:38 |
clarkb | current is using less than half the existing 256 right? | 19:38 |
fungi | yep, 93gb | 19:39 |
clarkb | if so then ya 256 is probably a good size | 19:39 |
*** mgoddard has quit IRC | 19:39 | |
clarkb | notedb grows things a bit but we're snapshotting pre-notedb | 19:39 |
clarkb | it's about 15gb growth before we gc then 4gb after gc iirc | 19:39 |
fungi | do we want different logical volumes for the two copies or just separate paths? | 19:42 |
*** sboyron_ has quit IRC | 19:48 | |
fungi | assuming separate paths will work fine | 19:50 |
clarkb | just separate path is fine | 19:51 |
clarkb | also I think we only need to copy the review_site and the db backup | 19:51 |
clarkb | not all of the gerrit2 homedir. That may make it a bit quicker | 19:51 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: base: Remove rsyslogd reinstall https://review.opendev.org/763431 | 19:52 |
fungi | meh, it's already underway. but also the rest of the stuff is unlikely to change much during upgrade? | 19:52 |
clarkb | good point | 19:52 |
fungi | `sudo rsync -Sax --delete /home/gerrit2/ /mnt/2020-11-20_backups/2.13` is what's presently running | 19:53 |
*** sboyron has joined #opendev | 19:55 | |
*** hashar has quit IRC | 19:58 | |
clarkb | infra-root review.o.o:~clarkb/maintenance.html has a short blurb in it now | 20:06 |
clarkb | do we think we need to add anything more to that? | 20:06 |
fungi | your last closing paragraph tag is busted, otherwise lgtm | 20:07 |
clarkb | fixed | 20:08 |
fungi | yep, looks great | 20:14 |
clarkb | as far as timing goes for tomorrow I'm going to try and be at the keyboard by 14:30UTC | 20:14 |
clarkb | the schedule I've written doesn't have us doing anything until 1500 anyway so that should be plenty of time to wake up | 20:15 |
fungi | yeah, at most we'll send out a status notice an hour before or something to remind folks | 20:16 |
fungi | i can leave myself a reminder to send the reminder | 20:16 |
fungi | 2.13 rsync completed, i'm priming the copy for 2.16 now and will add these commands to the pad | 20:18 |
clarkb | fungi: thanks! | 20:18 |
clarkb | ianw: looking at https://review.opendev.org/#/c/762960/10..11/playbooks/roles/codesearch/templates/docker-compose.yaml.j2 the path in the docker-compose files is data not /data I think that means it is relative to the docker compose config dir? | 20:18 |
clarkb | ianw: I guess my question was why not do it as /var/hound/data or similar which we do with other containers | 20:18 |
clarkb | then when you bind mount it will be /var/run/data ? | 20:19 |
clarkb | ianw: also left one note about the jobs | 20:20 |
ianw | clarkb: i think that's saying "data volume at /var/run/data"? | 20:31 |
clarkb | oh a proper docker volume. I think we should avoid those | 20:31 |
clarkb | they get allocated out of a difficult to manage space and that makes it hard to supplement with lvm and cinder | 20:32 |
clarkb | I think its best for our current use cases to use regular bind mounts | 20:32 |
ianw | it seemed appropriate in this case because the data is not ephemeral, but also not required to be outside the container | 20:32 |
clarkb | ianw: ya I've been using them with my nextcloud deployment at home and think it was the biggest mistake in that deployment | 20:32 |
clarkb | mostly because you can't say "use disk space from this location" easily | 20:33 |
clarkb | and since we rely on cinder a lot I think that may be important? | 20:33 |
clarkb | also consistency is nice. but maybe others are fine with that | 20:33 |
ianw | hrm, i mean i don't think this is going to expand. i can, but the config file is deliberately generated in the container, to keep it self-contained | 20:34 |
clarkb | you can bind mount the dir and still generate the config file right? | 20:34 |
clarkb | I might be missing how that affects volumes vs bind mounts | 20:34 |
ianw | yeah, i can, it just seems a bit unnecessary to have it outside the container context | 20:35 |
ianw | i don't feel strongly. i can bind mount it in | 20:36 |
clarkb | I like the simplicity of bind mounts and we can move them and remount bigger fs's under them etc | 20:37 |
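The two options being weighed, roughly (an illustrative compose snippet, not the actual codesearch file; the image name and host path are placeholders):

```yaml
services:
  hound:
    image: example/hound:latest          # placeholder image name
    volumes:
      # Named volume: docker owns the storage under /var/lib/docker/volumes,
      # which is awkward to grow or relocate onto an LVM/cinder filesystem.
      - hound-data:/data
      # Bind mount alternative: data lives at a host path we pick, so it can
      # sit on (or later move to) whatever filesystem we attach.
      # - /var/lib/hound/data:/data
volumes:
  hound-data: {}
```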
openstackgerrit | Ian Wienand proposed opendev/system-config master: Migrate codesearch site to container https://review.opendev.org/762960 | 20:41 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add codesearch.opendev.org server https://review.opendev.org/763298 | 20:41 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: base: Remove rsyslogd reinstall https://review.opendev.org/763431 | 20:42 |
*** zaro has quit IRC | 20:44 | |
*** zaro has joined #opendev | 20:47 | |
clarkb | fungi: is 2.16 sync done? | 20:49 |
*** sboyron has quit IRC | 20:49 | |
fungi | yep, 22m36s elapsed time | 20:51 |
fungi | so in theory any update will be faster than that | 20:51 |
clarkb | ++ | 20:51 |
fungi | i'm timing a nearly no-op update of both now to get an approximate lower bound | 20:51 |
clarkb | one thing I just checked was that the 2.16 bugfix branch has the notedb conversion improvement changes on it and it does | 20:51 |
fungi | oh good | 20:52 |
clarkb | I'm drawing a blank for other things we can check without redoing things we have already done. So I think I'll sneak in that bike ride as soon as the current rain passes | 20:52 |
fungi | and given that stable-3.2 has no new commits since two weeks ago i'm guessing the only difference to its corresponding bugfix branch is the security fixes | 20:52 |
clarkb | ianw: your stack lgtm I didn't approve anything given I'll be out on the bike and also distracted by gerrit | 20:53 |
clarkb | fungi: ya I checked that last night | 20:53 |
clarkb | fungi: also re the jetty thing I wonder if that is only on java 11 | 20:54 |
ianw | clarkb: thanks for reviews. i can babysit after ci and get gate back to working | 20:54 |
clarkb | we're doing our own java 8 builds and not seeing that | 20:54 |
fungi | ahh, could be | 20:54 |
fungi | i did already approve the containerization change | 20:55 |
fungi | nothing under that topic should break the existing codesearch anyway, that will cut over when someone updates openstack.org dns | 20:55 |
fungi | ianw: oh! you probably want to add the acme cname to openstack.org dns in advance or that's going to break cert issue? | 20:56 |
ianw | fungi: already did that :) | 20:56 |
fungi | you're smarter than i | 20:56 |
fungi | i didn't even think to check it until just now | 20:56 |
ianw | not smarter, just have made the mistake before | 20:57 |
fungi | _acme-challenge.codesearch.openstack.org is an alias for acme.opendev.org. | 20:57 |
fungi | perfect! | 20:57 |
openstackgerrit | Slawek Kaplonski proposed zuul/zuul-jobs master: [multi-node-bridge] Add script to configure connectivity https://review.opendev.org/762650 | 21:01 |
fungi | okay, mostly null re-up of the two rsync copies took 2m7s and 1m46s so that probably puts our lower bound around 2 minutes | 21:06 |
fungi | i'm going to estimate around 5 minutes for the 2.13 backup refresh and 10 minutes for the 2.16 backup refresh in our maintenance notes, just to have a ballpark figure | 21:07 |
fungi | i suppose i can time a mysqldump too, seems we don't have an estimate for that yet | 21:08 |
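A rough way to get that timing (a sketch; the database name and credentials handling are placeholders for whatever the review server actually uses):

```
# Time a dump of the Gerrit database into the same backup volume
time mysqldump --opt --single-transaction reviewdb \
    | gzip > /mnt/2020-11-20_backups/reviewdb-pre-upgrade.sql.gz
```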
fungi | i've set myself reminders to do a status notice at 13:00 and again at 14:00. i'll go ahead and startmeeting in #opendev-maintenance at 13:00 as well so we can capture any last minute prep discussion | 21:15 |
clarkb | ++ thanks | 21:23 |
fungi | 9m16s, i'll put it down as 10m | 21:24 |
*** sboyron has joined #opendev | 21:27 | |
*** fressi has quit IRC | 21:40 | |
*** fressi has joined #opendev | 21:41 | |
openstackgerrit | Merged opendev/zone-opendev.org master: Add codesearch.opendev.org https://review.opendev.org/763297 | 22:01 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Revert "Build new gerrit images" https://review.opendev.org/763473 | 22:13 |
fungi | infra-root: ^ not sure if we want to do that at this point or not, but worth noting we can | 22:15 |
corvus | fungi: commit msg says 'most' which raises questions | 22:16 |
corvus | fungi: most means >= ones we care about? | 22:16 |
corvus | fungi: but not 2.14 and 2.15 which we are building? so i guess i'm confused | 22:17 |
fungi | corvus: well, we're never planning to expose 2.14 and 2.15 publicly, we're just running init in each of them temporarily | 22:19 |
corvus | fungi: so we don't really care which version of those we build and we're switching to the non-updated stable branches on those just to clean up the config along with the rest? | 22:20 |
fungi | i suppose i could have used more words in the commit message. i meant "most of the stable branches we're building (including 3.2 which we'll be exposing and 2.16 which we might roll back to)" | 22:21 |
fungi | it was more so we didn't have to revert twice nor wait to get back onto stable-3.2 until stable-2.14 eventually gets those commits (if ever) | 22:22 |
fungi | i could amend it to be a partial revert and continue using the bugfix branches for 2.14 and 2.15. we expect to rip out all the entries <3.2 after upgrading anyway | 22:22 |
corvus | hashar merged a change into 2.16 recently (after the notedb fix) which may be merged up through other branches soon | 22:24 |
fungi | and stable-2.15 has also updated now as we've been talking | 22:24 |
corvus | it's a doc change | 22:25 |
corvus | we may get varying builds based on that if they're in the process of merging up, but i think we don't care and can just ignore it. | 22:25 |
corvus | fungi: +2 and as you can tell, i actually double checked everything :) | 22:26 |
openstackgerrit | Merged opendev/system-config master: Migrate codesearch site to container https://review.opendev.org/762960 | 22:26 |
fungi | thanks, and yeah i checked that commit from hashar as well thinking at first it might impact the upgrade process, but nope | 22:26 |
*** sboyron has quit IRC | 22:31 | |
fungi | stable-2.14 has updated now too | 22:53 |
*** DSpider has quit IRC | 22:55 | |
clarkb | fungi: how important do we think that is? eg should we land it right now and use those images or should we stick to the image we tested then land that after the upgrade? | 23:21 |
clarkb | I'm kinda leaning towards leaving it as is before the upgrade then we can land that as part of the changes we need to land after? but if people feel strongly the other way let me know | 23:23 |
clarkb | corvus: ^ do you have a preference? | 23:25 |
*** zaro has quit IRC | 23:29 | |
ianw | Failed to download remote objects and refs: error: file write error: No space left on device | 23:35 |
ianw | i think our nodepool builders are unhappy | 23:35 |
clarkb | I guess we were closer to the edge with only 2 builders than I thought (we removed a lot of images but then I think we added a couple back in) | 23:36 |
clarkb | iirc f33 and centos-8-stream happened after we condensed to 2? | 23:36 |
ianw | this failed in letsencrypt because we keep acme.sh on /opt | 23:36 |
ianw | it's 01 & 02 ... i've brought the container down while i look | 23:38 |
clarkb | ianw: the other thing that happens is we leak the mounts and consume a bunch of tmp space iirc | 23:39 |
ianw | in this case, there doesn't seem to be any leaked mount | 23:39 |
clarkb | in the past what I've done is down the builders, disable the service, reboot, rm everything in dib_tmp, then enable the service and reboot | 23:39 |
clarkb | k | 23:39 |
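The recovery sequence being described, sketched out (compose file location and tmp path here are illustrative and should be checked against the actual builder deployment):

```
# Stop the nodepool-builder container so nothing is mid-build
docker-compose -f /etc/nodepool-builder-compose/docker-compose.yaml down
sudo reboot            # clears any stuck dib loop/bind mounts
# after the host is back up:
sudo rm -rf /opt/dib_tmp/*     # free the leaked image-build scratch space
docker-compose -f /etc/nodepool-builder-compose/docker-compose.yaml up -d
```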
clarkb | dib_tmp has a bunch of stuff in it fwiw. Running du on it now (this is on nb01) | 23:40 |
ianw | yeah, it's quicker to just rm it all and see what frees up :) | 23:40 |
clarkb | wfm if you want to do that | 23:41 |
clarkb | I've stopped the du | 23:41 |
openstackgerrit | Merged opendev/system-config master: Add codesearch.opendev.org server https://review.opendev.org/763298 | 23:42 |
ianw | whatever started it has already rotated out of the logs | 23:45 |
clarkb | also it may be worth removing the cache and rebuilding it depending on whether or not we think some of that old distro stuff is in there in large stale quantities | 23:46 |
ianw | that's freed up 43gb | 23:48 |
clarkb | there is a decent chance we've just got too many images for 1tb now :/ | 23:49 |
ianw | 43gb is actually pretty tight, given the various formats we convert to | 23:50 |
clarkb | ya | 23:51 |
ianw | the vhd thing is ridiculous and writes it out about 3 times in total i think | 23:52 |
clarkb | the other thing is in theory they are supposed to balance out, but maybe we aren't doing that | 23:52 |
clarkb | or it's just the new images filling our disks up | 23:52 |
*** hamalq has quit IRC | 23:53 | |
*** hamalq has joined #opendev | 23:54 | |
clarkb | thinking out loud here: what if we didn't keep the raw and vhd images on disk when building qcow2? we can convert the qcow2 to the others if need be | 23:56 |
clarkb | that would require nodepool changes I bet, but maybe that is a good optimization? | 23:56 |
clarkb | basically do all the uploads then trim | 23:56 |
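The conversions in question are cheap to redo on demand; for instance (illustrative filenames, and the builders' actual vhd conversion may use different tooling/options):

```
# Rebuild the other formats from the kept qcow2 if they're ever needed again
qemu-img convert -O raw ubuntu-focal.qcow2 ubuntu-focal.raw
qemu-img convert -O vpc -o subformat=dynamic ubuntu-focal.qcow2 ubuntu-focal.vhd
```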
fungi | clarkb: i'd be fine upgrading with the 3.2 we've tested running (note we haven't tested "upgrading" with it per se, nor with the other fixed intermediate images we built today), just wanting to make sure we can pull new commits shortly after we upgrade in case we run into any new bugs which get fixed upstream | 23:57 |
clarkb | fungi: ya I think we want to revert soon, but sticking with the images we've got at this point seems good until the upgrade is done | 23:58 |
fungi | wfm | 23:58 |
fungi | in theory the stable-3.2 branch is currently identical to what we built from, so it shouldn't make a difference barring problems arising on rebuild, so i'm fine waiting | 23:59 |