clarkb | ++ | 00:01 |
ianw | clarkb: thanks, responded to the commenty bits; will fill all the other stuff now | 00:07 |
ianw | clarkb: oh, that was the other thing, i called it codesearch because it's pretty heavily configured to be our codesearch | 00:10 |
ianw | like the container starts and writes out the config pulled from project-config projects. so it's not really a generic hound container | 00:11 |
clarkb | ya I think we've done that with things like gitea too but still call it gitea? | 00:11 |
ianw | fair enough | 00:11 |
*** tosky has quit IRC | 00:14 | |
openstackgerrit | Merged openstack/project-config master: Revert "Disable limestone provider due to IPv4-less nodes" https://review.opendev.org/763254 | 00:25 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Migrate codesearch site to container https://review.opendev.org/762960 | 00:35 |
*** dmellado has quit IRC | 01:03 | |
*** dmellado has joined #opendev | 01:04 | |
*** hamalq has quit IRC | 02:09 | |
openstackgerrit | Merged opendev/system-config master: devel job: use ansible-core name https://review.opendev.org/763099 | 02:28 |
*** ysandeep|holiday is now known as ysandeep|off | 02:59 | |
*** d34dh0r53 has quit IRC | 03:24 | |
*** d34dh0r53 has joined #opendev | 03:27 | |
openstackgerrit | Ian Wienand proposed opendev/zone-opendev.org master: Add codesearch.opendev.org https://review.opendev.org/763297 | 03:32 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add codesearch.opendev.org server https://review.opendev.org/763298 | 03:34 |
ianw | ok, i've brought up a new codesearch server, and also added the acme-challenge cname in openstack.org to acme.opendev.org so its cert can cover that too | 03:38 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Build new gerrit images https://review.opendev.org/763299 | 04:02 |
*** raukadah is now known as chandankumar | 04:03 | |
*** ykarel has joined #opendev | 04:19 | |
*** jaicaa has quit IRC | 04:56 | |
*** jaicaa has joined #opendev | 04:56 | |
*** marios has joined #opendev | 06:05 | |
*** marios has quit IRC | 06:15 | |
*** marios has joined #opendev | 06:18 | |
*** hamalq has joined #opendev | 06:25 | |
*** jaicaa has quit IRC | 06:37 | |
*** jaicaa has joined #opendev | 06:39 | |
*** slaweq has joined #opendev | 07:00 | |
*** eolivare has joined #opendev | 07:09 | |
*** sboyron has joined #opendev | 07:20 | |
*** sboyron has quit IRC | 07:23 | |
*** sboyron has joined #opendev | 07:23 | |
*** marios is now known as marios|ruck | 07:37 | |
*** DSpider has joined #opendev | 07:39 | |
*** ralonsoh has joined #opendev | 07:43 | |
*** bhagyashris|off is now known as bhagyashris | 07:55 | |
*** hashar has joined #opendev | 08:00 | |
*** rpittau|afk is now known as rpittau | 08:03 | |
*** andrewbonney has joined #opendev | 08:10 | |
*** roman_g has joined #opendev | 08:15 | |
*** icey has joined #opendev | 08:17 | |
*** hamalq has quit IRC | 08:28 | |
*** tosky has joined #opendev | 08:44 | |
*** lpetrut has joined #opendev | 08:47 | |
*** mgoddard has joined #opendev | 08:58 | |
*** ykarel_ has joined #opendev | 09:04 | |
*** ykarel has quit IRC | 09:07 | |
*** mlavalle has quit IRC | 09:09 | |
*** mlavalle has joined #opendev | 09:12 | |
*** icey has quit IRC | 09:17 | |
*** icey has joined #opendev | 09:24 | |
*** hamalq has joined #opendev | 09:29 | |
*** hamalq has quit IRC | 09:34 | |
*** dtantsur|afk is now known as dtantsur | 09:41 | |
*** ykarel_ is now known as ykarel | 09:57 | |
*** hamalq has joined #opendev | 10:07 | |
*** hamalq has quit IRC | 10:11 | |
*** d34dh0r53 has quit IRC | 10:21 | |
*** hamalq has joined #opendev | 10:27 | |
*** hamalq has quit IRC | 10:32 | |
*** ykarel_ has joined #opendev | 10:33 | |
*** ykarel has quit IRC | 10:35 | |
*** icey has quit IRC | 10:36 | |
*** icey has joined #opendev | 10:47 | |
*** icey has quit IRC | 10:52 | |
*** icey has joined #opendev | 11:03 | |
*** hamalq has joined #opendev | 11:09 | |
*** hamalq has quit IRC | 11:14 | |
*** ykarel__ has joined #opendev | 11:25 | |
*** ykarel_ has quit IRC | 11:28 | |
*** tkajinam has quit IRC | 11:29 | |
*** tkajinam has joined #opendev | 11:30 | |
*** hamalq has joined #opendev | 11:30 | |
*** ykarel__ is now known as ykarel | 11:33 | |
*** hamalq has quit IRC | 11:35 | |
*** hamalq has joined #opendev | 11:51 | |
*** hamalq has quit IRC | 11:56 | |
openstackgerrit | Slawek Kaplonski proposed zuul/zuul-jobs master: [multi-node-bridge] Add script to configure connectivity https://review.opendev.org/762650 | 12:02 |
*** hamalq has joined #opendev | 12:12 | |
*** hamalq has quit IRC | 12:16 | |
*** hamalq has joined #opendev | 12:34 | |
*** kevinz has joined #opendev | 12:36 | |
*** hamalq has quit IRC | 12:39 | |
*** rpittau is now known as rpittau|brb | 13:16 | |
*** hamalq has joined #opendev | 13:40 | |
dtantsur | hey folks! is it only me or viewing logs on https://zuul.opendev.org/t/openstack/build/ has become really slow recently? | 13:45 |
*** hamalq has quit IRC | 13:45 | |
fungi | dtantsur: slowness in the logs tab, the summary tab or the console tab? | 14:00 |
fungi | ansi color rendering was recently added to the summary and console views, and we have evidence suggesting it causes order-of-magnitude or greater increases in display time, at least for those views | 14:01 |
fungi | apparently it gets much worse the larger the ansible json is | 14:01 |
*** ykarel_ has joined #opendev | 14:05 | |
dtantsur | fungi: pretty much everything is quite slow for me, the summary takes seconds to open, firefox shows "this tab slows down your browser" | 14:07 |
*** ykarel has quit IRC | 14:07 | |
*** lamt has joined #opendev | 14:08 | |
fungi | dtantsur: on all build results or just ones with lots of output? for example, this loads quickly for me: https://zuul.opendev.org/t/openstack/build/183b590240ab4527a2f6d5e3382d2a05 | 14:11 |
dtantsur | yep, this was pretty fast | 14:15 |
dtantsur | probably our dsvm jobs then | 14:15 |
fungi | we're talking in #zuul about reverting the ansi color rendering for now to continue working on it and get some better performance benchmarks before adding it back | 14:15 |
fungi | so if that's the cause, you'll probably know some time today | 14:15 |
*** auristor has quit IRC | 14:20 | |
*** auristor has joined #opendev | 14:20 | |
*** d34dh0r53 has joined #opendev | 14:25 | |
dtantsur | great! I'll check again | 14:31 |
*** mgoddard has quit IRC | 14:48 | |
*** rpittau|brb is now known as rpittau | 14:51 | |
dtantsur | also, could we remove the browser warning from the meetup, given that 1) firefox works okay nowadays, 2) chromium is currently broken? | 15:04 |
dtantsur | s/meetup/meetpad/ | 15:05 |
*** mgoddard has joined #opendev | 15:07 | |
fungi | chromium's broken? | 15:07 |
fungi | and yeah, we mainly added that warning because we were getting many firefox users saying they were unable to get jitsi to work for them; it cut down the number of requests for assistance | 15:08 |
fungi | also supposedly the webrtc renderer in firefox performed worse, no idea if they've worked on improving that in recent months | 15:09 |
dtantsur | fungi: there is a bug currently with chromium crashing on switching windows when webrtc is used | 15:12 |
dtantsur | I think firefox has improved, but I have no data to back this statement | 15:13 |
fungi | oh neat. i haven't witnessed that but i generally use chromium only for videoconferencing and keep it separate from my locked-down firefox with all the privacy usability extensions | 15:13 |
fungi | dtantsur: any specific chromium versions impacted? i'm using the 83.0.4103.116-3.1 build in debian/unstable currently | 15:14 |
dtantsur | fungi: https://bugzilla.redhat.com/show_bug.cgi?id=1895920 | 15:15 |
openstack | bugzilla.redhat.com bug 1895920 in chromium "Chromium 86 crashes on WebRTC videos when switching window" [Urgent,New] - Assigned to spotrh | 15:15 |
dtantsur | I haven't dived into that, just using firefox | 15:15 |
fungi | ahh, okay, so my chromium is fairly old apparently. that would probably explain why i haven't seen it | 15:16 |
*** tosky has quit IRC | 15:18 | |
*** mlavalle has quit IRC | 15:21 | |
fungi | anybody else happen to know if firefox is working better with jitsi recently? | 15:22 |
*** tosky has joined #opendev | 15:22 | |
*** tosky has quit IRC | 15:26 | |
*** elod has quit IRC | 15:27 | |
*** elod has joined #opendev | 15:27 | |
*** ykarel_ has quit IRC | 15:29 | |
*** tosky has joined #opendev | 15:50 | |
smcginnis | Just noticed today I am getting a publickey error when trying to "git review -d". | 15:57 |
smcginnis | Tried SSHing into the port and see the error - debug1: send_pubkey_test: no mutual signature algorithm | 15:58 |
smcginnis | Any recent changes that might be related to this? | 15:58 |
clarkb | smcginnis: did you upgrade to fedora33? | 15:58 |
smcginnis | Only recent change on my end, that I can think of, is upgrading to Fedora 33 and now having py39 as default.. yep | 15:59 |
clarkb | that's the change then. openssh has deprecated using sha1 for hostkey exchanges. fedora33 has taken it a step further and disabled that. Our gerrit 2.13 ssh server doesn't do sha2 and you get that failure | 15:59 |
clarkb | once we've upgraded that should go away. So hopefully next week it is a non issue. In the meantime you can do a host specific ssh config override to allow sha1 host key exchanges with gerrit | 16:00 |
smcginnis | Ah... any examples of that I can use? | 16:00 |
smcginnis | KexAlgorithms +diffie-hellman-group1-sha1 ? | 16:01 |
clarkb | https://unix.stackexchange.com/a/340853 | 16:02 |
smcginnis | Thanks clarkb! | 16:02 |
clarkb | I think | 16:02 |
clarkb | I haven't had to do it myself. | 16:02 |
smcginnis | I'll give it a shot. | 16:02 |
smcginnis | And report back. | 16:02 |
fungi | yeah, ought to just be able to add a review.opendev.org section to your ~/.ssh/config and set that for now | 16:08 |
smcginnis | Looks like maybe that is the sshd option. Client side, I had to add PubkeyAcceptedKeyTypes +ssh-rsa | 16:10 |
fungi | that sounds right | 16:11 |
smcginnis | I'll post something to the ML in case anyone else upgrades to f33 before the gerrit upgrade. | 16:11 |
clarkb | note the gerrit upgrade is supposed to start tomorrow :) | 16:11 |
smcginnis | Yeah, small window. | 16:11 |
smcginnis | Guess I can skip the ML. Not likely someone will decide to do it during the weekday. | 16:12 |
clarkb | fwiw I did test that this error goes away with upgraded gerrit | 16:12 |
clarkb | so pretty confident in that :) and not just hoping | 16:13 |
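For reference, the workaround being described is a per-host override in ~/.ssh/config roughly like the following (a sketch: the port is Gerrit's standard SSH port, the key path is just an example, and this should only be needed until the Gerrit upgrade lands):

```
# Temporary workaround for Fedora 33 / newer OpenSSH clients talking to Gerrit 2.13
Host review.opendev.org
    Port 29418
    # Newer OpenSSH disables RSA/SHA-1 signatures by default; Gerrit 2.13's SSH
    # server only offers SHA-1, so re-enable it for this host only.
    PubkeyAcceptedKeyTypes +ssh-rsa
    # IdentityFile ~/.ssh/id_rsa   # adjust to whichever key Gerrit knows about
```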
fungi | infra-root: not sure if you saw earlier, but dtantsur linked to a bug about chromium 86 builds being broken with jitsi and similar videoconferencing tools (though it looks like maybe it's fixed in chromium 87). worth keeping an eye out for if people report problems with meetpad | 16:15 |
clarkb | looks like it affects wayland and x11 | 16:16 |
clarkb | I'm up for doing a call with chrome later and seeing if it fails similarly | 16:16 |
fungi | yep, and apparently downgrading to 85 is a "bad idea" because of a serious security vulnerability in it | 16:16 |
fungi | supposedly people testing with chrome did not experience the same problems as with chromium | 16:17 |
clarkb | fungi: yes chrom* has patched a number of vulnerabilities that are being exploited in the wild | 16:17 |
clarkb | in like the last 2 weeks | 16:17 |
clarkb | ah ya I see where chrome is reported to not be affected | 16:18 |
fungi | related, he's requested we drop the warning about firefox not working with meetpad... apparently it does work at least to some degree (we knew that much) but i suppose it's worth revisiting whether the reduced support burden from people using firefox and seeing suboptimal behavior/performance justifies annoying the firefox users who have been able to make it work for them anyway | 16:19 |
clarkb | fungi: I'm willing to test firefox too and at least try to quickly reproduce our previous experiences | 16:20 |
clarkb | if we can't reproduce then we can drop the warning | 16:20 |
*** hamalq has joined #opendev | 16:20 | |
*** hamalq has quit IRC | 16:25 | |
*** lpetrut has quit IRC | 16:42 | |
*** zaro69 has joined #opendev | 16:51 | |
*** rpittau is now known as rpittau|afk | 16:52 | |
*** marios|ruck is now known as marios|out | 16:57 | |
*** marios|out has quit IRC | 17:00 | |
*** hamalq has joined #opendev | 17:01 | |
*** d34dh0r53 has quit IRC | 17:02 | |
*** zaro69 has quit IRC | 17:03 | |
*** zaro95 has joined #opendev | 17:05 | |
*** zaro95 has quit IRC | 17:06 | |
*** d34dh0r53 has joined #opendev | 17:06 | |
*** zaro48 has joined #opendev | 17:07 | |
*** eolivare has quit IRC | 17:09 | |
*** fressi has quit IRC | 17:22 | |
*** ralonsoh_ has joined #opendev | 17:27 | |
*** ralonsoh has quit IRC | 17:28 | |
*** roman_g has quit IRC | 17:43 | |
*** roman_g has joined #opendev | 17:44 | |
*** roman_g has quit IRC | 17:44 | |
*** hamalq has quit IRC | 17:45 | |
clarkb | fungi: going through my pre upgrade notes, we did update prod to update the refs/meta/config perms right? Otherwise I'm not really finding much other than get those images rebuilt and published then work through the backup initialization/prep/testing | 17:45 |
*** roman_g has joined #opendev | 17:45 | |
*** hamalq has joined #opendev | 17:45 | |
*** roman_g has quit IRC | 17:45 | |
clarkb | but please call out anything that we may have missed or should double check prior to tomorrow | 17:45 |
fungi | yes, we (at least i think i) did | 17:46 |
fungi | we can work on adding our backup volume next i suppose | 17:46 |
fungi | i'll get that created and attached shortly | 17:46 |
*** roman_g has joined #opendev | 17:46 | |
*** roman_g has quit IRC | 17:47 | |
clarkb | thanks! | 17:47 |
clarkb | oh we also need to do the maintenance html. I can work on that in a bit | 17:50 |
fungi | we may want to keep an eye out for anything like https://bugs.chromium.org/p/gerrit/issues/detail?id=13701 | 17:51 |
fungi | supposedly newer jetty is causing folks to need to add 'RequestHeader set "X-Forwarded-Proto" expr=%{REQUEST_SCHEME}' to their apache reverse proxy configs | 17:51 |
clarkb | that should be easy enough to add assuming that our version of apache supports that | 17:52 |
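If it does turn out to be needed, the upstream-suggested fix is a single directive in the reverse-proxy vhost; a sketch (the proxy target here is a placeholder, and the expr= form needs Apache 2.4.10+ with mod_headers enabled):

```
<VirtualHost *:443>
    # Let Gerrit's newer Jetty know the original request scheme so it
    # generates https redirects instead of falling back to http.
    RequestHeader set "X-Forwarded-Proto" expr=%{REQUEST_SCHEME}

    # Placeholder proxy target; the real vhost points at the local Gerrit.
    ProxyPass        / http://127.0.0.1:8081/ nocanon
    ProxyPassReverse / http://127.0.0.1:8081/
</VirtualHost>
```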
clarkb | should we go ahead and add that to review-test? | 17:53 |
*** zaro48 has quit IRC | 17:53 | |
fungi | well, maybe we need to double-check that it's a problem for us first | 17:54 |
clarkb | we don't enable the plugin manager so can't test easily with the given example | 17:54 |
fungi | i'm not a fan of cargo-culting stuff like that unless we're sure it's necessary | 17:54 |
clarkb | I guess we can just browse around for a bit and see if we trip it | 17:54 |
clarkb | ++ | 17:54 |
clarkb | browsing changes seems fine | 17:54 |
clarkb | now to try searching | 17:54 |
*** zaro has joined #opendev | 17:55 | |
clarkb | searching also seems fine | 17:55 |
*** hamalq has quit IRC | 17:56 | |
*** hamalq has joined #opendev | 17:56 | |
*** fressi has joined #opendev | 17:58 | |
*** mlavalle has joined #opendev | 18:00 | |
*** mgoddard has quit IRC | 18:05 | |
clarkb | if anyone is able to replicate that issue on review-test let us know but I haven't succeeded so far | 18:05 |
fungi | infra-root: looks like ns1 has been hung since late utc tuesday. i'm investigating | 18:12 |
fungi | also apparently ns2 isn't responding on its ipv4 address (not sure how long, we seem to be monitoring it via ipv6 which is working fine as far as i can tell) | 18:17 |
openstackgerrit | Merged opendev/system-config master: Build new gerrit images https://review.opendev.org/763299 | 18:18 |
fungi | the oob console for ns1 shows the usual hung task messages on its console, though i was able to get it to initiate a soft reboot it seems | 18:20 |
clarkb | fungi: I can reach ns1 now and it seems to be running nsd | 18:23 |
clarkb | I guess the next thing is to sort out why ipv4 to ns2 is sad? | 18:23 |
fungi | yeah, i'm inspecting the logs on it | 18:23 |
*** dtantsur is now known as dtantsur|afk | 18:23 | |
fungi | looks like it was still logging ansible connections at 06:22:30 yesterday | 18:24 |
fungi | but that's where syslog abruptly ends | 18:24 |
clarkb | did it fill its disk? | 18:24 |
fungi | and it doesn't seem to have logged the current boot messages in syslog either | 18:25 |
clarkb | also are you looking at ns1 or ns2? | 18:25 |
fungi | nope, the fs is mostly empty | 18:25 |
fungi | ns1 | 18:25 |
fungi | i haven't started investigating the network issue for ns2 yet | 18:25 |
clarkb | I think we've seen that before on rax. Where the disk just goes away | 18:25 |
clarkb | and then the server gets really sad | 18:25 |
fungi | yeah, but i wonder why after a reboot nothing's logging to syslog | 18:26 |
clarkb | journalctl shows logs | 18:26 |
fungi | /dev/xvda1 on / type ext4 (rw,noatime,nobarrier,errors=remount-ro,data=ordered) | 18:26 |
clarkb | is rsyslog not running to slurp into the file? | 18:26 |
fungi | the rootfs still seems to be writeable | 18:27 |
clarkb | ya i don't see an rsyslog running | 18:27 |
clarkb | which will cause that to happen | 18:27 |
fungi | journalctl continued logging when syslog ended | 18:29 |
clarkb | fungi: yes aiui syslog relies on rsyslog being installed and it will slurp from journald into /var/log/syslog | 18:29 |
fungi | the last thing logged to syslog was ansible doing something with rsyslog | 18:29 |
fungi | <smoking gun> | 18:30 |
clarkb | corvus: unrelated to ^ have you seen https://zuul.opendev.org/t/openstack/build/571d8b35ec5f4857b7391437d080f45c/logs before? it looks like docker hub had a proper error trying to tag the 3.1 image | 18:30 |
clarkb | corvus: I don't think that is catastrophic for us because we don't intend on running 3.1 so we'll be fine on the older image | 18:30 |
clarkb | corvus: but if we can get it promoted that would be great | 18:31 |
fungi | Nov 18 06:22:30 ns1 python3[16348]: ansible-apt Invoked with name=rsyslog state=absent purge=True package=['rsyslog'] ... | 18:31 |
fungi | so we've got ansible explicitly uninstalling the rsyslog package? | 18:31 |
clarkb | hrm I recall ianw doing syslog things but don't think it was to uninstall it | 18:31 |
fungi | #status log rebooted ns1.opendev.org after it became unresponsive | 18:32 |
openstackstatus | fungi: finished logging | 18:32 |
clarkb | fungi: system-config/playbooks/roles/base/server/tasks/Debian.yaml we remove the package then reinstall it | 18:32 |
fungi | huh, i guess it logged the removal but not the reinstallation | 18:33 |
fungi | maybe it failed to start properly afterward | 18:33 |
clarkb | coincidence that it happened right at the time | 18:33 |
clarkb | ya | 18:33 |
clarkb | I think we should land the prescribed cleanup in that block though. Maybe when ianw's day starts | 18:33 |
clarkb | the server should self-correct given ^ once our hourly jobs run I think | 18:33 |
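For context, the base-role block in question follows a purge-then-reinstall pattern roughly like this (a simplified sketch, not the exact tasks in system-config); a host going unreachable between the two tasks is left with no syslog daemon until a later run completes:

```yaml
- name: Remove rsyslog so its locally modified config gets purged
  apt:
    name: rsyslog
    state: absent
    purge: true

# If the play fails or the host becomes unreachable at this point,
# the node sits without rsyslog until the next periodic run.

- name: Reinstall rsyslog with distro-default config
  apt:
    name: rsyslog
    state: present
```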
fungi | well, no "right at the time" it's still unclear to me when the server broke | 18:33 |
clarkb | gotcha | 18:33 |
fungi | i'm trying to piece that together now looking for a gap in journalctl messages prior to the reboot | 18:34 |
fungi | journalctl was logging right up to the end | 18:35 |
fungi | seems like services were seeing connections coming in | 18:35 |
clarkb | corvus: actually looking at dockerhub all of them seem to have updated | 18:36 |
fungi | looking to see if maybe the rootfs got marked read-only at some point | 18:36 |
clarkb | I think my concern there is that maybe pabelanger's container work has broken something? it does look like the promote job for 3.1 is complaining about 2.14 for some reason | 18:36 |
clarkb | corvus: if you have a minute to sanity check those today that would be great. I'll see what I can find too | 18:37 |
fungi | ns2 is reachable for me over ipv4 now, not sure what was going on with it earlier | 18:40 |
fungi | it wasn't even responding via icmp echo when i was testing previously | 18:41 |
pabelanger | clarkb: was the job using build-container-image? | 18:41 |
clarkb | no its opendev-build-docker-image | 18:42 |
clarkb | but you're modifying roles too right? I just want to sanity check that something isn't broken and that our images are still building properly before we do the upgrade tomorrow | 18:43 |
clarkb | it is incredibly difficult to map what docker hub shows you to anything the docker tools show you to determine if you've got a 1:1 match | 18:44 |
*** andrewbonney has quit IRC | 18:45 | |
clarkb | pabelanger: the thing that has me concerned is the job to promote the 3.1 image is complaining about the 2.14 image | 18:46 |
clarkb | which has my paranoia thinking: did we mixup the tags somehow | 18:47 |
fungi | looks like wmf also reported the same redirect problem with their gerrit and eclipse's too: https://bugs.chromium.org/p/gerrit/issues/detail?id=13705 | 18:48 |
fungi | really surprised we're not seeing it on review-test | 18:48 |
clarkb | ok I think I see what is going on | 18:49 |
clarkb | the last step in those jobs is to list all the tags and then clean out the obsolete tags | 18:49 |
clarkb | I think the race is that the 2.14 job deleted the 2.14 tag while the 3.1 job was listing tags and dockerhub broke | 18:50 |
clarkb | the actual promotion side of things seems to have been fine | 18:50 |
corvus | clarkb: i was just digging into that and came to the same conclusion | 18:50 |
clarkb | we should be fine in that case, and that can be something we clean up on the job side later :) | 18:50 |
corvus | they were within seconds of each other; haven't confirmed the sequence yet | 18:51 |
*** ralonsoh_ has quit IRC | 18:53 | |
corvus | the order is opposite what i expect, but there's still enough overlap for it to be a race, especially if there's a lot of locking/cdn stuff going on on docker's side. so i think that's the hypothesis we should go with | 18:55 |
corvus | probably should just put that in a retry | 18:55 |
clarkb | or maybe make it less greedy? I don't know if that is possible given the state we have | 18:55 |
corvus | well, it was just the listing that failed | 18:56 |
fungi | so a retry might work there? | 18:56 |
clarkb | oh right | 18:56 |
fungi | oh, you suggested a retry | 18:56 |
fungi | yeah, makes sense | 18:56 |
corvus | yeah. mind you, if we get past the listing working, we could end up with the same issue then moved to the actual delete stage. <shrug> | 18:56 |
corvus | either way, it's not terribly important. we could also just fail_when:false | 18:57 |
ianw | fungi: catching up; was the syslog ok in the end? | 18:58 |
fungi | ianw: no, for some reason rsyslogd didn't start successfully when ansible removed and reinstalled it, and it also didn't start on reboot | 18:58 |
fungi | i haven't checked yet to see why | 18:58 |
clarkb | fungi: ianw maybe it didn't reinstall | 18:58 |
ianw | clarkb: yeah, i want to do that cleanup, but wanted to get https://review.opendev.org/756605 into production, that got blocked because codesearch job failed, which led me to containerising it :) | 18:58 |
fungi | clarkb: bingo, looks like it's currently not installed on ns1 | 18:59 |
fungi | the last action logged in /var/log/dpkg.log was the uninstallation of rsyslog too | 18:59 |
fungi | and journalctl has truncated now... where do i find old journals? | 19:00 |
ianw | clarkb: if you could loop back on https://review.opendev.org/#/c/762960/ i started the server for it yesterday too. i think i'd like to get the gate around this fixed up one way or the other today so that's not an issue | 19:01 |
ianw | fungi: i'll check on the bridge side what it thought it did | 19:01 |
fungi | oh, nevermind, user error | 19:01 |
fungi | unfortunately, journalctl doesn't seem to record the ansible activity the way rsyslog did | 19:01 |
clarkb | ianw: yup I'll rereview that. We have a couple of gerrit upgrade prep things to do at this point: write maintenance.html and get the backup volume mounted and the initial sync done | 19:02 |
clarkb | I'm about to find lunch, then sneak in a bike ride if I can since the sun decided to show up today | 19:02 |
clarkb | sounds like fungi will do the backup volume mounting, I'll work on the maintenance.html, then hopefully once things wind down I can rereview codesearch? | 19:02 |
fungi | ahh, okay i found in the journal where it logged the rsyslog package removal but it doesn't seem to have tried to reinstall it immediately | 19:05 |
ianw | fatal: [ns1.opendev.org]: UNREACHABLE! => { | 19:05 |
fungi | ianw: yeah it was unreachable for a while today | 19:06 |
fungi | not sure for how long | 19:06 |
*** sboyron has quit IRC | 19:07 | |
*** sboyron has joined #opendev | 19:07 | |
fungi | i see "Nov 19 06:24:24 ns1 python3[22116]: ansible-apt Invoked with state=present name=['at', 'git', 'logrotate', 'lvm2', 'openssh-server', 'parted', 'rsync', 'rsyslog', 'strace', 'tcpdump', 'wget'] ..." in the journal | 19:07 |
ianw | when i look in base.yaml.log -- it's almost like it's running twice. things are all mixed up | 19:08 |
clarkb | probably different hosts? | 19:08 |
fungi | looks like it removed rsyslog Nov 18 06:22:29 but didn't try to install it again until Nov 19 06:24:24 (and for whatever reason that didn't work either) | 19:09 |
ianw | "debconf: delaying package configuration, since apt-utils is not installed" | 19:10 |
ianw | i wonder if that's involved | 19:10 |
fungi | oh, it also said to install rsyslog Nov 18 06:18:05 | 19:11 |
fungi | could it have gotten the install and remove steps backwards somehow? | 19:11 |
fungi | basically it installed at 06:18:05 but it was already installed so that presumably did nothing, then it removed at 06:22:29 | 19:12 |
fungi | and then didn't try to install again until the next day | 19:12 |
ianw | ns1.opendev.org : ok=29 changed=3 unreachable=1 failed=0 skipped=3 rescued=0 ignored=0 | 19:12 |
ianw | ns2.opendev.org : ok=30 changed=4 unreachable=1 failed=0 skipped=3 rescued=0 ignored=0 | 19:12 |
ianw | so it was both ok, and unreachable ? | 19:12 |
clarkb | those are task counts | 19:12 |
fungi | ianw: when was that? we also saw ipv4 was broken on ns2 earlier today | 19:12 |
clarkb | so 30 tasks ok but one was unreachable then it skipped the rest aiui | 19:13 |
ianw | yeah, this is in the latest base run logs | 19:13 |
fungi | the removal and install attempts i see look like they probably happened from our daily periodic job | 19:13 |
fungi | given the times | 19:14 |
fungi | (and frequency) | 19:14 |
ianw | anyway, now this has run, we should stop it doing this. sorry, i planned for this to happen in the matter of a few hours -- but then the gate got broken by codesearch | 19:14 |
ianw | if i try and update the base yaml i think it will run the failing codesearch job too | 19:15 |
ianw | "this" being the reinstall | 19:15 |
openstackgerrit | Mohammed Naser proposed zuul/zuul-jobs master: added lgtm with basic docs https://review.opendev.org/763428 | 19:15 |
fungi | for whatever reason, rsyslog was able to be reinstalled successfully on ns2 | 19:15 |
fungi | though similarly, it seems it got uninstalled in the daily run on wednesday and then wasn't installed again until the daily run on thursday | 19:17 |
fungi | so ns2 had no rsyslog for ~24 hours | 19:17 |
fungi | i guess that's not what was intended | 19:17 |
ianw | no, looking at the base.yaml, it seems it does the purge but not the reinstall, thinking the host was unreachable | 19:18 |
ianw | fungi: what's your preference here? i don't think any of us have interest in updating the current puppet & codesearch server to their new releases so we can rule that out | 19:19 |
ianw | i can make the job non-voting, and propose a change to stop the purge here now the old config file is gone | 19:19 |
ianw | or we can merge https://review.opendev.org/763298 to remove the codesearch puppet | 19:20 |
ianw | and then merge a change to avoid the reinstall without gate changes | 19:20 |
fungi | not entirely sure i grok the interrelationship between these issues, but happy to prioritize the codesearch container reviews if that gets rsyslog working on ns1 (would it also be broken anywhere else?) | 19:21 |
ianw | it's just that the current testing of the codesearch job is broken, which will block the gate for anything that tries to run it, like base file updates | 19:21 |
ianw | fungi: as far as i can tell from the base logs, it was only ns1/2 that seemed to become unreachable for the reinstall step | 19:22 |
*** mgoddard has joined #opendev | 19:27 | |
*** sboyron has quit IRC | 19:32 | |
*** sboyron_ has joined #opendev | 19:33 | |
fungi | clarkb: do you think we need the backup volume to be ssd? i guess it might speed up the maintenance a little if we don't have to wait as long as for rsync to sata? | 19:35 |
clarkb | I don't think it is necessary but it may help? | 19:36 |
fungi | clarkb: and the idea is to just sync /home/gerrit2 into it, right? so i can get by with a 100gb volume as the data in there is less than that | 19:37 |
clarkb | well we'll sync two sets of gerrit2 homedirs into it | 19:37 |
clarkb | so it should be large enough for both copies | 19:38 |
clarkb | 2.13 and 2 | 19:38 |
clarkb | *2.13 and 2.16 | 19:38 |
fungi | got it, i'll make it 256gb then? | 19:38 |
clarkb | current is using less than half the existing 256 right? | 19:38 |
fungi | yep, 93gb | 19:39 |
clarkb | if so then ya 256 is probably a good size | 19:39 |
*** mgoddard has quit IRC | 19:39 | |
clarkb | notedb grows things a bit but we're snapshotting pre-notedb | 19:39 |
clarkb | it's about 15gb growth before we gc then 4gb after gc iirc | 19:39 |
fungi | do we want different logical volumes for the two copies or just separate paths? | 19:42 |
*** sboyron_ has quit IRC | 19:48 | |
fungi | assuming separate paths will work fine | 19:50 |
clarkb | just separate path is fine | 19:51 |
clarkb | also I think we only need to copy the review_site and the db backup | 19:51 |
clarkb | not all of the gerrit2 homedir. That may make it a bit quicker | 19:51 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: base: Remove rsyslogd reinstall https://review.opendev.org/763431 | 19:52 |
fungi | meh, it's already underway. but also the rest of the stuff is unlikely to change much during upgrade? | 19:52 |
clarkb | good point | 19:52 |
fungi | `sudo rsync -Sax --delete /home/gerrit2/ /mnt/2020-11-20_backups/2.13` is what's presently running | 19:53 |
*** sboyron has joined #opendev | 19:55 | |
*** hashar has quit IRC | 19:58 | |
clarkb | infra-root review.o.o:~clarkb/maintenance.html has a short blurb in it now | 20:06 |
clarkb | do we think we need to add anything more to that? | 20:06 |
fungi | your last closing paragraph tag is busted, otherwise lgtm | 20:07 |
clarkb | fixed | 20:08 |
fungi | yep, looks great | 20:14 |
clarkb | as far as timing goes for tomorrow I'm going to try and be at the keyboard by 14:30UTC | 20:14 |
clarkb | the schedule I've written doesn't have us doing anything until 1500 anyway so that should be plenty of time to wake up | 20:15 |
fungi | yeah, at most we'll send out a status notice an hour before or something to remind folks | 20:16 |
fungi | i can leave myself a reminder to send the reminder | 20:16 |
fungi | 2.13 rsync completed, i'm priming the copy for 2.16 now and will add these commands to the pad | 20:18 |
clarkb | fungi: thanks! | 20:18 |
clarkb | ianw: looking at https://review.opendev.org/#/c/762960/10..11/playbooks/roles/codesearch/templates/docker-compose.yaml.j2 the path in the docker-compose files is data not /data I think that means it is relative to the docker compose config dir? | 20:18 |
clarkb | ianw: I guess my question was why not do it as /var/hound/data or similar which we do with other containers | 20:18 |
clarkb | then when you bind mount it will be /var/run/data ? | 20:19 |
clarkb | ianw: also left one note about the jobs | 20:20 |
ianw | clarkb: i think that's saying "data volume at /var/run/data"? | 20:31 |
clarkb | oh a proper docker volume. I think we should avoid those | 20:31 |
clarkb | they get allocated out of a difficult to manage space and that makes it hard to supplement with lvm and cinder | 20:32 |
clarkb | I think its best for our current use cases to use regular bind mounts | 20:32 |
ianw | it seemed appropriate in this case because the data is not ephemeral, but also not required to be outside the container | 20:32 |
clarkb | ianw: ya I've been using them with my nextcloud deployment at home and think it was the biggest mistake in that deployment | 20:32 |
clarkb | mostly because you can't say "use disk space from this location" easily | 20:33 |
clarkb | and since we rely on cinder a lot I think that may be important? | 20:33 |
clarkb | also consistency is nice. but maybe others are fine with that | 20:33 |
ianw | hrm, i mean i don't think this is going to expand. i can, but the config file is deliberately generated in the container, to keep it self-contained | 20:34 |
clarkb | you can bind mount the dir and still generate the config file right? | 20:34 |
clarkb | I might be missing how that affects volumes vs bind mounts | 20:34 |
ianw | yeah, i can, it just seems a bit unnecessary to have it outside the container context | 20:35 |
ianw | i don't feel strongly. i can bind mount it in | 20:36 |
clarkb | I like the simplicity of bind mounts and we can move them and remount bigger fs's under them etc | 20:37 |
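The two options being weighed, roughly (an illustrative compose snippet, not the actual codesearch file; the image name and host path are placeholders):

```yaml
services:
  hound:
    image: example/hound:latest          # placeholder image name
    volumes:
      # Named volume: docker owns the storage under /var/lib/docker/volumes,
      # which is awkward to grow or relocate onto an LVM/cinder filesystem.
      - hound-data:/data
      # Bind mount alternative: data lives at a host path we pick, so it can
      # sit on (or later move to) whatever filesystem we attach.
      # - /var/lib/hound/data:/data
volumes:
  hound-data: {}
```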
openstackgerrit | Ian Wienand proposed opendev/system-config master: Migrate codesearch site to container https://review.opendev.org/762960 | 20:41 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add codesearch.opendev.org server https://review.opendev.org/763298 | 20:41 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: base: Remove rsyslogd reinstall https://review.opendev.org/763431 | 20:42 |
*** zaro has quit IRC | 20:44 | |
*** zaro has joined #opendev | 20:47 | |
clarkb | fungi: is 2.16 sync done? | 20:49 |
*** sboyron has quit IRC | 20:49 | |
fungi | yep, 22m36s elapsed time | 20:51 |
fungi | so in theory any update will be faster than that | 20:51 |
clarkb | ++ | 20:51 |
fungi | i'm timing a nearly no-op update of both now to get an approximate lower bound | 20:51 |
clarkb | one thing I just checked was that the 2.16 bugfix branch has the notedb conversion improvement changes on it and it does | 20:51 |
fungi | oh good | 20:52 |
clarkb | I'm drawing a blank for other things we can check without redoing things we have already done. So I think I'll sneak in that bike ride as soon as the current rain passes | 20:52 |
fungi | and given that stable-3.2 has no new commits since two weeks ago i'm guessing the only difference to its corresponding bugfix branch is the security fixes | 20:52 |
clarkb | ianw: your stack lgtm I didn't approve anything given I'll be out on the bike and also distracted by gerrit | 20:53 |
clarkb | fungi: ya I checked that last night | 20:53 |
clarkb | fungi: also re the jetty thing I wonder if that is only on java 11 | 20:54 |
ianw | clarkb: thanks for reviews. i can babysit after ci and get gate back to working | 20:54 |
clarkb | we're doing our own java 8 builds and not seeing that | 20:54 |
fungi | ahh, could be | 20:54 |
fungi | i did already approve the containerization change | 20:55 |
fungi | nothing under that topic should break the existing codesearch anyway, that will cut over when someone updates openstack.org dns | 20:55 |
fungi | ianw: oh! you probably want to add the acme cname to openstack.org dns in advance or that's going to break cert issue? | 20:56 |
ianw | fungi: already did that :) | 20:56 |
fungi | you're smarter than i | 20:56 |
fungi | i didn't even think to check it until just now | 20:56 |
ianw | not smarter, just have made the mistake before | 20:57 |
fungi | _acme-challenge.codesearch.openstack.org is an alias for acme.opendev.org. | 20:57 |
fungi | perfect! | 20:57 |
openstackgerrit | Slawek Kaplonski proposed zuul/zuul-jobs master: [multi-node-bridge] Add script to configure connectivity https://review.opendev.org/762650 | 21:01 |
fungi | okay, mostly null re-up of the two rsync copies took 2m7s and 1m46s so that probably puts our lower bound around 2 minutes | 21:06 |
fungi | i'm going to estimate around 5 minutes for the 2.13 backup refresh and 10 minutes for the 2.16 backup refresh in our maintenance notes, just to have a ballpark figure | 21:07 |
fungi | i suppose i can time a mysqldump too, seems we don't have an estimate for that yet | 21:08 |
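A rough way to get that timing (a sketch; the database name and credentials handling are placeholders for whatever the review server actually uses):

```
# Time a dump of the Gerrit database into the same backup volume
time mysqldump --opt --single-transaction reviewdb \
    | gzip > /mnt/2020-11-20_backups/reviewdb-pre-upgrade.sql.gz
```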
fungi | i've set myself reminders to do a status notice at 13:00 and again at 14:00. i'll go ahead and startmeeting in #opendev-maintenance at 13:00 as well so we can capture any last minute prep discussion | 21:15 |
clarkb | ++ thanks | 21:23 |
fungi | 9m16s, i'll put it down as 10m | 21:24 |
*** sboyron has joined #opendev | 21:27 | |
*** fressi has quit IRC | 21:40 | |
*** fressi has joined #opendev | 21:41 | |
openstackgerrit | Merged opendev/zone-opendev.org master: Add codesearch.opendev.org https://review.opendev.org/763297 | 22:01 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Revert "Build new gerrit images" https://review.opendev.org/763473 | 22:13 |
fungi | infra-root: ^ not sure if we want to do that at this point or not, but worth noting we can | 22:15 |
corvus | fungi: commit msg says 'most' which raises questions | 22:16 |
corvus | fungi: most means >= ones we care about? | 22:16 |
corvus | fungi: but not 2.14 and 2.15 which we are building? so i guess i'm confused | 22:17 |
fungi | corvus: well, we're never planning to expose 2.14 and 2.15 publicly, we're just running init in each of them temporarily | 22:19 |
corvus | fungi: so we don't really care which version of those we build and we're switching to the non-updated stable branches on those just to clean up the config along with the rest? | 22:20 |
fungi | i suppose i could have used more words in the commit message. i meant "most of the stable branches we're building (including 3.2 which we'll be exposing and 2.16 which we might roll back to)" | 22:21 |
fungi | it was more so we didn't have to revert twice nor wait to get back onto stable-3.2 until stable-2.14 eventually gets those commits (if ever) | 22:22 |
fungi | i could amend it to be a partial revert and continue using the bugfix branches for 2.14 and 2.15. we expect to rip out all the entries <3.2 after upgrading anyway | 22:22 |
corvus | hashar merged a change into 2.16 recently (after the notedb fix) which may be merged up through other branches soon | 22:24 |
fungi | and stable-2.15 has also updated now as we've been talking | 22:24 |
corvus | it's a doc change | 22:25 |
corvus | we may get varying builds based on that if they're in the process of merging up, but i think we don't care and can just ignore it. | 22:25 |
corvus | fungi: +2 and as you can tell, i actually double checked everything :) | 22:26 |
openstackgerrit | Merged opendev/system-config master: Migrate codesearch site to container https://review.opendev.org/762960 | 22:26 |
fungi | thanks, and yeah i checked that commit from hashar as well thinking at first it might impact the upgrade process, but nope | 22:26 |
*** sboyron has quit IRC | 22:31 | |
fungi | stable-2.14 has updated now too | 22:53 |
*** DSpider has quit IRC | 22:55 | |
clarkb | fungi: how important do we think that is? eg should we land it right now and use those images or should we stick to the image we tested then land that after the upgrade? | 23:21 |
clarkb | I'm kinda leaning towards leaving it as is before the upgrade then we can land that as part of the changes we need to land after? but if people feel strongly the other way let me know | 23:23 |
clarkb | corvus: ^ do you have a preference? | 23:25 |
*** zaro has quit IRC | 23:29 | |
ianw | Failed to download remote objects and refs: error: file write error: No space left on device | 23:35 |
ianw | i think our nodepool builders are unhappy | 23:35 |
clarkb | I guess we were closer to the edge with only 2 builders than I thought (we removed a lot of images but then I think we added a couple back in) | 23:36 |
clarkb | iirc f33 and centos-8-stream happened after we condensed to 2? | 23:36 |
ianw | this failed in letsencrypt because we keep acme.sh on /opt | 23:36 |
ianw | it's 01 & 02 ... i've brought the container down while i look | 23:38 |
clarkb | ianw: the other thing that happens is we leak the mounts and consume a bunch of tmp space iirc | 23:39 |
ianw | in this case, there doesn't seem to be any leaked mount | 23:39 |
clarkb | in the past what I've done is down the builders, disable the service, reboot, rm everything in dib_tmp, then enable the service and reboot | 23:39 |
clarkb | k | 23:39 |
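The recovery sequence being described, sketched out (compose file location and tmp path here are illustrative and should be checked against the actual builder deployment):

```
# Stop the nodepool-builder container so nothing is mid-build
docker-compose -f /etc/nodepool-builder-compose/docker-compose.yaml down
sudo reboot            # clears any stuck dib loop/bind mounts
# after the host is back up:
sudo rm -rf /opt/dib_tmp/*     # free the leaked image-build scratch space
docker-compose -f /etc/nodepool-builder-compose/docker-compose.yaml up -d
```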
clarkb | dib_tmp has a bunch of stuff in it fwiw. Running du on it now (this is on nb01) | 23:40 |
ianw | yeah, it's quicker to just rm it all and see what frees up :) | 23:40 |
clarkb | wfm if you want to do that | 23:41 |
clarkb | I've stopped the du | 23:41 |
openstackgerrit | Merged opendev/system-config master: Add codesearch.opendev.org server https://review.opendev.org/763298 | 23:42 |
ianw | whatever started it has already rotated out of the logs | 23:45 |
clarkb | also it may be worth removing the cache and rebuilding it depending on whether or not we think some of that old distro stuff is in there in large stale quantities | 23:46 |
ianw | that's freed up 43gb | 23:48 |
clarkb | there is a decent chance we've just got too many images for 1tb now :/ | 23:49 |
ianw | 43gb is actually pretty tight, given the various formats we convert to | 23:50 |
clarkb | ya | 23:51 |
ianw | the vhd thing is ridiculous and writes it out about 3 times in total i think | 23:52 |
clarkb | the other thing is in theory they are supposed to balance out, but maybe we aren't doing that | 23:52 |
clarkb | or it's just the new images filling our disks up | 23:52 |
*** hamalq has quit IRC | 23:53 | |
*** hamalq has joined #opendev | 23:54 | |
clarkb | thinking out loud here: what if we didn't keep the raw and vhd images on disk when building qcow2? we can convert the qcow2 to the others if need be | 23:56 |
clarkb | that would require nodepool changes I bet, but maybe that is a good optimization? | 23:56 |
clarkb | basically do all the uploads then trim | 23:56 |
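The conversions in question are cheap to redo on demand; for instance (illustrative filenames, and the builders' actual vhd conversion may use different tooling/options):

```
# Rebuild the other formats from the kept qcow2 if they're ever needed again
qemu-img convert -O raw ubuntu-focal.qcow2 ubuntu-focal.raw
qemu-img convert -O vpc -o subformat=dynamic ubuntu-focal.qcow2 ubuntu-focal.vhd
```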
fungi | clarkb: i'd be fine upgrading with the 3.2 we've tested running (note we haven't tested "upgrading" with it per se, nor with the other fixed intermediate images we built today), just wanting to make sure we can pull new commits shortly after we upgrade in case we run into any new bugs which get fixed upstream | 23:57 |
clarkb | fungi: ya I think we want to revert soon, but sticking with the images we've got at this point seems good until the upgrade is done | 23:58 |
fungi | wfm | 23:58 |
fungi | in theory the stable-3.2 branch is currently identical to what we built from, so it shouldn't make a difference barring problems arising on rebuild, so i'm fine waiting | 23:59 |