Tuesday, 2021-09-14

jinyuanliu_	hi	02:45
jinyuanliu_	https://review.opendev.org/SignInFailure,SIGN_IN,Contact+site+administrator	02:46
jinyuanliu_	I have newly registered an account. This error is reported when I log in. Does anyone know how to deal with it	02:47
*** ysandeep|out is now known as ysandeep		05:11
*** odyssey4me is now known as Guest7196		05:47
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	07:11
*** jpena|off is now known as jpena		07:23
*** hjensas is now known as hjensas|afk		07:27
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	07:53
opendevreview	Oleg Bondarev proposed openstack/project-config master: Update grafana to reflect dvr-ha job is now voting https://review.opendev.org/c/openstack/project-config/+/805594	08:02
*** ykarel is now known as ykarel|lunch		08:05
*** ysandeep is now known as ysandeep|lunch		08:14
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	08:20
noonedeadpunk	hey there!	08:30
noonedeadpunk	It feels for me that git repos might be out of sync right now. https://paste.opendev.org/show/809294/	08:31
noonedeadpunk	and at the same time on other machine I get valid 768b8996ba4cb24eb2e5cd5dc149cd114186debd	08:31
*** ykarel|lunch is now known as ykarel		09:24
ianw	noonedeadpunk: is the other machine coming from another ip address?	09:25
noonedeadpunk	yep	09:26
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	09:29
ianw	d615a7da 15:30:25.611 [1904d04c] push ssh://git@gitea03.opendev.org:222/openstack/cinder.git	09:31
ianw	infra-root: ^ that to me looks like a stuck replication from gerrit, and matches the repo noonedeadpunk is cloning	09:32
ianw	there are a few others as well, some older, up to sep 9 is the oldest	09:32
ianw	i've killed that process, i don't really have time for a deep debug at this point	09:43
ianw	https://paste.opendev.org/show/809297/ <- i have killed those processes, which all seem stuck. push queues seem empty now	09:50
ianw	we should keep an eye to see if more are getting stuck	09:51
noonedeadpunk	git pull worked for my faulty machine	09:55
noonedeadpunk	thanks!	09:55
*** odyssey4me is now known as Guest7209		10:02
*** ysandeep|lunch is now known as ysandeep		10:08
iurygregory	opendev folks, we started to see ironic jobs failing because of reno 3.4.0... I'm wondering if anyone saw the same problem in other projects (e.g https://zuul.opendev.org/t/openstack/build/6f39928da6a04d2ab0a64258d7309bfa )	10:38
iurygregory	maybe an issue with pip?	11:01
*** dviroel|out is now known as dviroel		11:10
rosmaita	iurygregory: seeing that on a lot of different jobs, usually see that when the pypi mirror is outdated	11:13
*** jpena is now known as jpena|lunch		11:24
iurygregory	rosmaita, so we just wait to do some recheck?	11:36
*** ysandeep is now known as ysandeep|brb		11:46
*** ysandeep|brb is now known as ysandeep		11:54
rosmaita	iurygregory: other than reporting it here (not sure the infra team can do anything about the mirrors), not sure what else we can do	12:04
iurygregory	I asked in #openstack-infra also =)	12:19
*** jpena|lunch is now known as jpena		12:20
ykarel	Clark[m], fungi, ^ happened again	12:25
ykarel	some mirrors affected like pip install --index-url=https://mirror.mtl01.iweb.opendev.org/pypi/simple reno==3.4.0	12:25
ykarel	i ran PURGE as you suggested last time but doesn't seem to help, may be running wrongly	12:26
ykarel	ran curl -XPURGE https://mirror.mtl01.iweb.opendev.org/pypi/simple/reno	12:26
ykarel	seems to work now	12:27
ykarel	this time i ran without reno may be that worked?	12:27
ykarel	curl -v -XPURGE https://mirror.mtl01.iweb.opendev.org/pypi/simple	12:27
ykarel	iurygregory, both the failures you shared were on mirror.mtl01.iweb.opendev.org, which i see working now	12:30
ykarel	have you seen on other mirrors too?	12:30
ykarel	if not can try recheck and see if all good now	12:30
iurygregory	ykarel, I haven't checked other jobs, I will put a recheck and see how it goes	13:09
ykarel	ack	13:09
*** lbragstad_ is now known as lbragstad		13:15
fungi	ykarel: it's not our mirrors which need to be purged, it's pypi's cdn	13:19
fungi	iurygregory: ^	13:19
fungi	we're just proxying through whatever pypi's serving, and their cdn seems to sometimes serve obsolete content around montreal canada	13:19
ykarel	fungi, yes correct, i messed the words	13:19
fungi	no, i mean you're calling curl to purge our mirror which won't do anything, you need to purge that url from pypi's cdn	13:20
fungi	which should cause fastly to re-cache it from the correct content (hopefully)	13:20
ykarel	ohkk not aware about that	13:20
ykarel	i just ran above curl and it worked somehow	13:21
ykarel	may be it was just timing	13:21
fungi	rosmaita: also we have no pypi mirror to get outdated, we just proxy pypi	13:22
fungi	ykarel: yes, from what we've seen in the past it's intermittent and may recur	13:22
fungi	my suspicion is that pypi still maintains a fallback mirror of their primary site which they instruct fastly to pull from if it has network connectivity issues reaching the main site, and that fallback mirror may be stale, and for whatever reason the fastly cdn endpoints near montreal have frequent connectivity issues and wind up serving content from the fallback site	13:23
rosmaita	fungi: ok, good to know	13:23
ykarel	fungi, ack	13:25
fungi	it used to be the case that the mirroring method pypi used to populate their fallback site didn't create the necessary metadata to make python_requires work, so we'd see incorrect versions of packages selected. it seems like they solved that, but possible that fallback could be very behind in replication or something	13:26
fungi	so instead we're just ending up getting old indices sometimes	13:26
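[Editor's sketch, not part of the log] fungi's point above is that a purge has to target PyPI's own CDN rather than the OpenDev proxy. A minimal sketch of that request in Python follows, assuming the project's index page lives at https://pypi.org/simple/<project>/ and that the CDN accepts an unauthenticated PURGE for it. The function names are hypothetical, and nothing here is taken from the OpenDev tooling.

```python
import urllib.request

# Assumption: PyPI serves each project's simple index at this URL pattern and
# its CDN accepts an unauthenticated PURGE request for that page.
PYPI_SIMPLE = "https://pypi.org/simple/{project}/"


def build_purge_request(project: str) -> urllib.request.Request:
    """Build (but do not send) a PURGE request for one project's index page."""
    url = PYPI_SIMPLE.format(project=project.lower())
    return urllib.request.Request(url, method="PURGE")


def purge(project: str, timeout: float = 10.0) -> int:
    """Send the PURGE upstream and return the HTTP status code."""
    request = build_purge_request(project)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling purge("reno") would send the request upstream. As noted in the log, stale content may still come back intermittently afterwards.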
rosmaita	i'm getting an openstack-tox-docs failure during the pdf build ... sphinx-build-pdf.log tells me "Command for 'pdflatex' gave return code 1, Refer to 'doc-cinder.log' for details" ... but i can't find doc-cinder.log in https://zuul.opendev.org/t/openstack/build/f45e84debbae44449bf6d98cc0807d68/logs ... where should i be looking?	13:27
fungi	it's possible the job isn't configured to collect that log	13:27
rosmaita	oh	13:27
fungi	looking to see if i can tell where it would be written on the node	13:28
tosky	https://47bdf347398a802ddc78-6d4960ba8e43184f8d8ec59d1a3f8e83.ssl.cf2.rackcdn.com/760178/13/check/openstack-tox-docs/f45e84d/sphinx-build-pdf.log	13:28
tosky	rosmaita: ^^	13:28
fungi	ahh, it just has a different filename	13:28
rosmaita	is that the same log?	13:28
fungi	dunno, do pdf builds work locally for you? if they break similarly you can compare the log content	13:28
rosmaita	fungi: guess i will have to check	13:29
fungi	looks like it contains the same error referring you to the other log, so i think it's not	13:29
fungi	it probably wrote it at /home/zuul/src/opendev.org/openstack/cinder/doc/build/pdf/doc-cinder.log	13:30
iurygregory	fungi, ack	13:37
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	13:38
Clark[m]	fungi: rosmaita: https://zuul.opendev.org/t/openstack/build/f45e84debbae44449bf6d98cc0807d68/log/job-output.txt#2013 shows the file it redirected to which is the one tosky linked to	13:42
fungi	Clark[m]: yeah, check the end of that file, it refers you to the other file rosmaita is looking for	13:43
fungi	maybe it thinks it's referring to itself by a different name?	13:43
fungi	or maybe it contains all the same contents as the other file it references and is a superset?	13:44
Clark[m]	Well the redirected content is what the build outputs. I guess the build could write a separate log file the jobs don't know to collect. In the past the stdout and stderr have been enough to debug those though iirc	13:44
Clark[m]	fungi: should we trigger replication for the repos ianw had to clean up old replication tasks to be sure they are caught up?	13:46
fungi	Clark[m]: probably, i didn't see ianw mention having triggered a full replication	13:46
fungi	i'll do that in a sec	13:47
fungi	`gerrit show-queue` indicates it's still caught up at least, so no new hung tasks	13:48
Clark[m]	fungi: ykarel: our proxies should proxy the purge requests so I expect those curl commands work. But as fungi points out it creates confusion over where the issue lies. To be very very clear we are only giving the jobs what pypi.org has served to us.	13:49
Clark[m]	Ya those stale ones could be fallout from when we did the server migrations if those weren't as graceful as we expected	13:49
Clark[m]	I don't know if the timing lines up for that or not	13:50
fungi	roughly 1.8k tasks queued now	13:50
fungi	up to 7.8k now	13:51
fungi	10k...	13:51
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	13:52
Clark[m]	I think the total for a full replication is around 17k	13:52
Clark[m]	It should go quickly for repos that are up to date	13:53
fungi	topped out a little over 18k and now falling	13:53
fungi	we're down around 16k tasks now	14:29
fungi	i guess we're looking at 4-5 hours for it to finish at this pace	14:30
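[Editor's sketch, not part of the log] The 4-5 hour estimate can be reproduced from the two queue sizes quoted above. This is rough arithmetic only: "a little over 18k" and "around 16k" are rounded to 18000 and 16000, and the drain rate is assumed to stay constant.

```python
from datetime import datetime

# Figures quoted in the log; both task counts are rounded approximations.
peak_time, peak_tasks = datetime(2021, 9, 14, 13, 53), 18_000
now_time, now_tasks = datetime(2021, 9, 14, 14, 29), 16_000

# Tasks drained per minute between the two observations.
elapsed_min = (now_time - peak_time).total_seconds() / 60
rate_per_min = (peak_tasks - now_tasks) / elapsed_min

# Time to drain what is left, assuming the rate does not change.
eta_hours = now_tasks / rate_per_min / 60

print(f"{rate_per_min:.1f} tasks/min, about {eta_hours:.1f} hours remaining")
```

That works out to roughly 56 tasks per minute and a little under five hours, which matches the estimate given in channel.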
clarkb	I wonder if the bw between sjc1 and ymq is lower than it was between dfw and sjc1	14:34
clarkb	fungi: fwiw we can replicate specific repos which goes much quicker. In this case probably a good idea to do everything anyway	14:35
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	14:38
clarkb	fungi: thinking out loud here about lists.o.o reboots. I think step 0 is confirm that auto apt updates haven't replaced our existing boot tools with newer versions including the decompressed kernel. Reboot on what we've got and confirm that works consistently. Then we can replace the decompressed kernel with the compressed kernel and try rebooting again. If that works we are good and	14:40
clarkb	don't really need to do anything more. If that doesn't work we recover with the decompressed kernel and put package pins in place and plan server replacement	14:40
fungi	yep	14:42
fungi	that should suffice	14:42
clarkb	jinyuanliu_ isn't here anymore but chances are an existing account has the same email address as the one they just tried to login with and gerrit caught that as a conflict and prevented login from proceeding	14:46
clarkb	they should either use a different email address or login to the existing account	14:46
fungi	skimming last modified times in /boot on lists.o.o it looks like nothing has been touched since our edits	14:48
clarkb	cool. I still need to load ssh keys for the day. But I guess we can look at rebooting after meetings	14:52
fungi	sgtm	14:55
clarkb	fungi: I just double checked and /boot/grub/grub.cfg shows the 5.4.0-84 kernel as the first entry, that kernel file in /boot is the decompressed version (much larger than the others), and /boot/grub/menu.lst still lists the chain load option first	14:56
clarkb	I agree those files all seem to be as we had them on the previous boot	14:56
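[Editor's sketch, not part of the log] One way to confirm that a kernel file in /boot is the hand-decompressed image is to check whether it is a plain ELF file. This assumes the image was produced by the extract-vmlinux script mentioned later in the log, which emits an ELF vmlinux. The function name and the example path are hypothetical.

```python
ELF_MAGIC = b"\x7fELF"


def is_uncompressed_elf(path: str) -> bool:
    """Return True if the file starts with the ELF magic bytes.

    Assumption: a kernel decompressed with extract-vmlinux is a plain ELF
    image, while a stock compressed kernel image does not start this way.
    """
    with open(path, "rb") as image:
        return image.read(len(ELF_MAGIC)) == ELF_MAGIC
```

For example, is_uncompressed_elf("/boot/vmlinuz-5.4.0-84-generic") would be expected to return True on the server as described. That file name follows the usual Ubuntu naming and is not quoted in the log.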
*** ykarel is now known as ykarel|away		15:02
clarkb	fungi: should we status notice the delay in git replication?	15:03
clarkb	something like #status notice Gerrit is currently undergoing a full replication sync to gitea and you may notice gitea contents are stale for a few hours.	15:04
*** ysandeep is now known as ysandeep|dinner		15:11
fungi	we can, though it's probably not going to be that noticeable. we're already down to ~14k remaining	15:15
clarkb	fungi: https://www.mail-archive.com/grub-devel@gnu.org/msg30984.html I'm no longer hopeful the chainloaded bootloader is any better	15:17
clarkb	https://lists.gnu.org/archive/html/grub-devel/2020-04/msg00198.html too	15:18
clarkb	fungi: in that case maybe we should focus on some combo of pinning the kernel package, doing the apt post install hook thing from that forum suggestion, replacing the server	15:21
fungi	yeah, that seems like the best we can probably arrange	15:22
fungi	it's still unclear to me from those posts why chainloading the bootloader still doesn't provide lz4 decompression, yet somehow grub in pvhvm guests can decompress them just fine	15:23
clarkb	fungi: I suspect the chainloader may not hand off and execute grub2 instead it knows how to read grub2 configs and then negotiate running the kernel directly like pv wants	15:25
fungi	mmm, i see. yeah there's mention of the kernel file being handed off to the hypervisor, so i suppose it's being handed off compressed and it's up to the hypervisor whether it supports that compression	15:26
fungi	in which case i wonder why the chainloading was even needed	15:27
clarkb	fungi: the only thing I can think of is maybe I got the menu.lst wrong? But menu.lst isn't going to be reliably updated anymore is it?	15:28
fungi	right, i guess rackspace's external bootloader will want to parse menu.lst (though maybe it would parse grub.cfg if menu.lst wasn't there)	15:29
clarkb	based on what I've read I think we should do a reboot to double check it works reliably as is. Then either pin the kernel package or put in place the post install hack. Then start planning a replacement server?	15:32
clarkb	https://lists.xenproject.org/archives/html/xen-users/2020-05/msg00023.html "I think we lost most of them to KVM already anyway :(" Not going to lie it seems like maybe problems like this are a big reason for that	15:32
fungi	maybe this is a good time to try to combine the two ml servers into one, and/or a more urgent push for mm3 migration	15:35
fungi	worth discussing in today's meeting	15:36
clarkb	ya its on the agenda to discuss this stuff	15:36
mordred	"rackspace's external bootloader" ... do I even want to know?	15:37
clarkb	mordred: basically xen is weird and it can't reliably boot ubuntu focal anymore in pv? mode	15:38
fungi	mordred: it's how xen pv works, the bootloader is external to the server image	15:38
clarkb	mordred: unlike kvm xen isn't running the bootloader if we understand things correctly. Instead it's finding the kernel and running it directly	15:38
clarkb	kvm is like your laptop and runs the actual bootloader avoiding all of these issues	15:38
clarkb	fungi: we can also yolo go for it and see if it can do the compressed file though I'm fairly certain it can't at this point	15:39
fungi	i think xen pvhvm works like kvm though	15:39
clarkb	fungi: ya our pvhvm instances seem fine	15:39
fungi	clarkb: no, i agree with you after digging into those ml threads	15:39
mordred	wow. also - we have non-pvhvm instances? I thought we just used that for everything ... but in any case, just wow	15:41
clarkb	mordred: lists.openstack.org is our oldest instance as we have upgraded it in place to preserve ip reputation for smtp	15:42
clarkb	I think it predates rax offering pvhvm	15:42
mordred	ahhhhhh yes	15:42
clarkb	fungi: the extract tool and the postinstall script don't seem too terrible if we just want to put those in place for now as a CYA maneuver while we plan to replace the server?	15:42
clarkb	New suggestion, reboot on current state to ensure it is stable. Then install extract-vmlinux and the kernel postinstall.d script. Then start planning to replace the server	15:43
clarkb	fungi: ^ does that seem reasonable to you and if so do you think we need to try and ansible the extract-vmlinux and postinst.d stuff or just do it by hand?	15:44
fungi	wiki.o.o may be of a similar vintage to lists (though it's actually a rebuild from an image of the original so not technically an old instance, just an instance on an old flavor)	15:44
fungi	mordred: lists.o.o has been continuously in-place upgraded since it was created with ubuntu precise (12.04 lts) some time in 2013	15:45
fungi	clarkb: yes, that plan sounds solid	15:45
clarkb	cool in that case should we do a reboot nowish?	15:46
fungi	we can probably by-hand manage the kernel decompression and either put a hold on the current kernel package or just plan to fix it with a recovery boot if unattended-upgrades puts us on a newer kernel and then rackspace reboots the instance for some reason	15:47
clarkb	ya that is an option for us as well	15:47
clarkb	the script from linux is in lists.o.o:~clarkb/kernel-stuff if we want to extract it by hand	15:48
fungi	and yes, i'm able to do the reboot now if you're ready	15:48
fungi	and at the ready to do the recovery boot if needed	15:49
clarkb	fungi: ya I'm ready. My meeting is over	15:49
clarkb	fungi: will you push the button? or should I?	15:50
fungi	i will	15:51
fungi	sorry, had to switch rooms, pushing the button now	15:52
clarkb	ok	15:53
fungi	looks like it finished shutting down	15:54
clarkb	and now it isn't coming back :/	15:54
clarkb	I guess it is a good thing to know this isn't reliable. But would be really nice to know why	15:55
clarkb	oh wait it pings now	15:55
fungi	it's booting	15:55
clarkb	huh is it just REALLY slow?	15:55
fungi	fsck ran for a bit	15:55
clarkb	ah	15:55
fungi	also the chainloading may delay it if we didn't take out the keypress timeouts	15:56
fungi	15:56:05 up 1 min, 2 users, load average: 2.89, 0.81, 0.28	15:56
fungi	Linux lists 5.4.0-84-generic #94-Ubuntu SMP Thu Aug 26 20:27:37 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux	15:56
clarkb	and it is on the 5.4 kernel	15:56
clarkb	ok rebooting is reliable but slow.	15:56
fungi	ftr, it's not our slowest rebooting server	15:56
fungi	but yeah it did take a bit	15:57
fungi	so looking at mm3 options, we could either go the distro package route for something stable (focal has mm 3.2.2) or the semi-official docker container route (which has options for latest mm 3.3.4 or rolling snapshots of upstream revision control)	15:58
clarkb	I just updated the lists upgrade etherpad with reboot and potential future options. I guess we discuss what we want to do next in our meeting	15:59
fungi	the up side to distro packages, in theory, is we get backported security fixes but fairly stable featureset until the next distro release upgrade. with docker containers we get a version independent of the running distro but end up consuming new versions a lot more frequently if we want security fixes	15:59
clarkb	Do we want to remove lists.o.o from the emergency file before the meeting? The two things to consider there are we should probably remove the cached ansible facts file for that instance on bridge and it will run autoremove of packages which may remove our 4.x kernels but we seem to be able to boot 5.4 now	16:00
clarkb	fungi: one thing I have really enjoyed with things like gitea where we consume upstream with containers is that we can update frequently and keep the deltas as small as possible	16:00
clarkb	letting things sit for years results in scaryness	16:00
fungi	yeah, i think it's fine. do we want a manual ansible run or just let the deploy jobs do their thing on their own time?	16:01
clarkb	fungi: the deploy jobs already ran so we should probably remove it then manually run the playbook	16:01
clarkb	well deploy jobs ran for the fixup change I mean	16:01
fungi	do we need to remove cached facts before taking it out of the emergency list?	16:01
clarkb	fungi: yes I think we need to do that otherwise some of our option selections might select xenial options	16:01
clarkb	/var/cache/ansible/facts/lists.openstack.org appears to be the file?	16:02
clarkb	I'm going to find breakfast but can help with that stuff after but if you want to go ahead feel free	16:03
fungi	ahh, we can just delete that file i guess. i was looking through documentation to find out how to clear cached facts	16:04
fungi	hah, first hit on a web search was https://docs.openstack.org/openstack-ansible/12.2.6/install-guide/ops-troubleshooting-ansiblecachedfacts.html	16:04
fungi	that at least seems to confirm your suggested method	16:05
*** ysandeep|dinner is now known as ysandeep		16:08
fungi	clarkb: after a bit more looking around, i managed to convince myself deleting that file should do what we want, so removed it	16:11
opendevreview	Martin Kopec proposed openstack/project-config master: Move ansible-role-refstack-client from x/ to openinfra/ https://review.opendev.org/c/openstack/project-config/+/765787	16:12
fungi	and removed the lists.openstack.org entry from the emergency disable list	16:12
Clark[m]	Cool I guess next up is running the playbook? I should be back at the keyboard in 20-30 minutes	16:13
fungi	don't rush. i'll get things queued up in a root screen session on bridge.o.o	16:15
fungi	i've queued up a command to run the base playbook first, presumably that's what we want to start with	16:16
Clark[m]	++	16:18
Clark[m]	That is what does autoremove fwiw	16:19
*** jpena is now known as jpena|off		16:26
opendevreview	Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031	16:27
fungi	hard to say for sure since we just reset it with a reboot, but memory consumption on lists.o.o looks slightly lessened after the upgrade than it was prior	16:28
fungi	just based on comparing the memory usage graph in cacti to the same day last week	16:29
clarkb	fungi: I've joined the screen	16:35
clarkb	I guess I'm ready if you are	16:36
fungi	ready for me to flip the big red switch then?	16:36
*** artom_ is now known as artom		16:36
clarkb	ya I think we run base, then check what it did (if anything) to /boot contents	16:36
clarkb	then we run the service lists playbook and restart mailman and apache services to be sure they are happy with slightly updated contents?	16:37
fungi	starting now	16:38
clarkb	the exim conf task seems to have nooped as expected	16:40
fungi	yeah, so far so good	16:40
clarkb	the older kernels have remained and we didn't end up with a recompressed new kernel	16:41
clarkb	I think that is a happy base run	16:41
fungi	so the service-lists playbook next? is that the correct one?	16:41
clarkb	checking	16:41
clarkb	infra-prod-service-lists job runs service-lists.yaml	16:42
clarkb	yes I believe that command is correct	16:42
fungi	i mean, that's a file anyway, i was able to tab-complete it	16:42
fungi	okay, running now	16:42
*** ysandeep is now known as ysandeep|out		16:42
clarkb	it is updating the things I expect it to update and skipping creation of lists because they already exist if I read the log properly	16:44
fungi	yep, that's my take on the output	16:45
clarkb	the updates for airships init script, apache config and the global mm_cfg.py all look as expected to me. Will just have to restart services and ensure they function when ansible is done I think	16:45
fungi	yep, i'll do that via ssh to the lists server	16:46
fungi	i'm already logged in there	16:46
fungi	yay, completed without errors	16:46
clarkb	yup ansible looked good I think	16:46
fungi	so should i test the mailman service restarts then?	16:46
clarkb	fungi: maybe restart apache first and we check apache for each of the list domains then we can restart mailman-opendev and send an email to lists.opendev.org?	16:47
fungi	sure	16:47
clarkb	then if that is happy restart the other 4	16:47
fungi	apache is fully restarted now	16:48
clarkb	I can browse all 5 list domains via the web and the lists seem to line up with the domain	16:48
clarkb	if that looks good to you I think we are ready to restart mailman-opendev and send a quick test email to it (just a response to your existing thread on service-discuss?)	16:49
fungi	i browsed from the root page of each of the 5 sites all the way to specific archived messages, for some list on them, so lgtm	16:50
fungi	restarting mailman-opendev now	16:51
fungi	restarted. the list owned processes dropped from 45 to 36 and then went back up to 45 again	16:51
fungi	sending a reply now	16:52
clarkb	ya I see 9 new processes	16:52
fungi	sent	16:53
fungi	it's in the archive	16:54
fungi	and i received a copy	16:54
clarkb	I see it in my inbox too	16:54
fungi	i'll proceed with restarting the other four sets of mailman services now	16:54
clarkb	++	16:54
fungi	all of them have been cleanly restarted	16:56
clarkb	I see an appropriate number of new processes	16:56
fungi	i confirmed the 9 processes for each went away and returned	16:56
clarkb	gerrit tasks queue down to 9.8k	16:58
fungi	so nearly halfway done	16:59
fungi	i need to take a quick break, and then i'll try to spend a few minutes digging deeper on the current state of containerized mm3 deployments	17:01
clarkb	I think we're done with lists for now with next steps to be discussed in the meeting. I'm going to context switch to catching up on some zuul related changes I've been reviewing	17:01
clarkb	fungi: thanks for the help!	17:01
fungi	almost done with lists? we still have some test servers and autoholds we need to clean up, right?	17:02
fungi	clarkb: mind if i delete autohold 0000000332 "Clarkb checking ansible against focal lists.o.o"	17:03
clarkb	fungi: go ahead	17:03
fungi	done	17:04
clarkb	fungi: I think there is an older hold for lists too that can be cleaned up if it is still there	17:04
fungi	looks like the manually booted test servers are already cleaned up	17:05
clarkb	the one I had originally when checking the upgrade from xenial to focal	17:05
fungi	clarkb: nope, just gerrit revert and gitea upgrade	17:05
fungi	no other lists autohosts	17:05
fungi	autoholds	17:06
clarkb	I guess I cleaned that one up after the lists.kc.io upgrade since that was a better check	17:06
fungi	looks that way	17:06
fungi	on the rackspace side of things, i'll clean up the old image we made for lists.o.o in may, but keep the one from last weekend	17:07
clarkb	++	17:08
fungi	and done	17:09
clarkb	fungi: should we let westernkansasnews know they have likely been owned? Their WP install is what these phishing list owner emails link back to	17:15
clarkb	(otherwise why return to them?)	17:16
*** artom_ is now known as artom		17:22
fungi	i receive countless phishing messages to openstack-discuss-owner, not sure it's worth my time to follow up on every one. i just delete them	17:22
clarkb	ya I'm just noticing that a legit organization (they have a wikipedia entry so they are real!) seems to maybe be compromised.	17:24
fungi	skimming the mm3 docs, and the documentation for the container images, this doesn't look too onerous. three containers (for the core listserv+rest api, the web archiver/search index, and the administrative webui). the web components are uwsgi listening on a high-numbered port and expects an external webserver, so fits well with our usual deployment model. similarly the core container expects to	17:26
fungi	communicate with an mta on the system	17:26
fungi	they have postfix and exim4 config examples for the mta, only nginx example for the webserver but apache shouldn't be hard to adapt to it	17:26
opendevreview	Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638	17:27
opendevreview	Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638	17:32
opendevreview	Martin Kopec proposed openstack/project-config master: Move ansible-role-refstack-client from x/ to openinfra/ https://review.opendev.org/c/openstack/project-config/+/765787	18:07
clarkb	down to about 7k tasks now	18:09
opendevreview	Clark Boylan proposed opendev/infra-specs master: Spec to deploy Prometheus as a Cacti replacement https://review.opendev.org/c/opendev/infra-specs/+/804122	18:32
clarkb	infra-root ^ it seems everyone was largely happy with the last patchset of that change. I updated it based on some of the feedback I got. I expect this is mergeable in the near future if you can take a look to rereview it quickly	18:32
corvus	clarkb: i replied on a change there; i'm a little unclear if i should leave a -1 or +1. maybe you can read the comment and let me know :)	18:45
clarkb	corvus: hrm ya good point. I'm thinking maybe we evaluate both as we bootstrap and then commit to one before turning off cacti?	18:48
clarkb	corvus: I can update the text in there to be more explicit along those lines if that makes sense to you	18:48
corvus	ok. tbh, if we want to consider node_exporter, i'd do it now and not spend any wasted effort on snmp_exporter. like, it's a bit "why do it once when we can do it twice?"	18:49
fungi	we're down to ~4k tasks remaining in the gerrit queue now	18:49
clarkb	corvus: if we want to not bother with extra evaluation then I would probably say stick with snmp	18:51
fungi	is there a good summary on the benefits of node_exporter over snmp_exporter? or is it just a way to get rid of yet one more uncontainerized daemon on our servers?	18:51
clarkb	fungi: node_exporter is a bit more precanned for its gathering and graphing	18:51
clarkb	fungi: with snmp we'll have to define the mibs we want to grab and then define graphs for them	18:51
clarkb	the downside to node_exporter is you have to run an additional service and without docker doing that for node_exporter is likely to be difficult and we run some services without docker	18:52
fungi	i'll read up on it since i have no idea what precanned means in this sense. net-snmpd seems nicely precanned already	18:52
corvus	yeah, i think that's a fair summary of the trade off	18:52
clarkb	fungi: the service already knows the sorts of stuff you want to report because it is geared towards server performance metrics	18:52
clarkb	snmp is far more generic and you have to configure the snmp exporter to grab what you want	18:52
clarkb	I'll revert the bit about node_exporter since I think the safest course for us is snmp exporter	18:53
corvus	i think my concern is that if the plan is "get node_exporter running once to try it out" we've done 98% of what's needed for "get node_exporter running everywhere" and we should just do that instead of snmp. the advantage of snmp is we don't have to do any node_exporter work. :)	18:53
fungi	ahh, so the summary is that node_exporter is more opinionated and decides what you're likely to want rather than letting you choose?	18:53
corvus	so if we do both, we'll defeat the advantages. :)	18:53
clarkb	fungi: I think it lets you choose too but yes comes with good defaults	18:53
opendevreview	Clark Boylan proposed opendev/infra-specs master: Spec to deploy Prometheus as a Cacti replacement https://review.opendev.org/c/opendev/infra-specs/+/804122	18:53
clarkb	But for our environment I strongly suspect we need the flexibility of something like snmp	18:54
clarkb	since running docker everywhere is not always going to be good for us	18:54
corvus	node_exporter has a lot of collectors and i think it can be extended	18:54
fungi	oh, i think i get it. it's not that net-snmpd is the problem, it's that snmp_exporter requires fiddling?	18:54
clarkb	fungi: yes	18:54
clarkb	fungi: you have to tell the prometheus snmp collector what to gather and how often to gather it and where to store it. Then you have to tell grafana to pull that data out and render it	18:55
fungi	because prometheus is push-based as opposed to pull-based, so the configuration is on the pushing agent side not the polling server side	18:55
corvus	i thought docker-everywhere was a goal?	18:55
clarkb	corvus: I don't think it ever was? iirc we explicitly didn't use docker for things like the dns servers	18:55
fungi	i can't imagine every single daemon on every server will be in a container though, there's bound to be a line somewhere	18:55
clarkb	fungi: the constraint would be more along the lines of does every server run a dockerd that can have node_exporter running in it	18:56
fungi	for example we've so far considered it simple enough to rely on distro packages of apache rather than requiring an apache container on every webserver	18:56
corvus	running node_exporter everywhere doesn't seem like it should be too much of a challenge? like, there isn't a server where we can't run docker?	18:56
clarkb	corvus: currently there isn't a place where we can't run docker but we have chosen not to on some	18:56
clarkb	the afs infrastructure is without docker and the dns servers are the ones I can think of immediately (mailman too but we're talking about changing that above)	18:57
fungi	i don't really see a problem with deciding to deploy something in docker on every server now that we have orchestration for that though	18:57
corvus	fungi: for node_exporter: prometheus main server will poll node_exporter on leaf-node server. pretty similar to an snmp architecture. just the agent is node_exporter instead of netsnmpd	18:57
clarkb	I guess in today's meeting I'll ask people to specifically think about whether or not they think node exporter with docker everywhere is worthwhile	18:57
clarkb	and to leave comments on the spec with what they decide	18:58
corvus	fungi: for snmp_exporter: prometheus main server will poll snmp_exporter on main server which will poll snmpd on leaf-node servers.	18:58
fungi	is the concern with the additional overhead of dockerd on some of the servers?	18:58
clarkb	fungi: I think that concern is minimal. For me it was more that we had explicitly decided not to run docker as it wasn't necessary for certain services like dns and afs	18:58
fungi	oh, so prometheus does poll, like mrtg/cacti?	18:59
clarkb	and if we want to gather metrics from those services we would need to add docker everywhere	18:59
clarkb	in either case we have to modify firewall rules so that doesn't count against either option	18:59
fungi	and it just doesn't speak snmp, so needs a translating layer to turn its calls into snmp queries?	18:59
tristanC	you can also run the node exporter without docker, the project even publishes static binaries for many architectures as part of their releases	18:59
corvus	fungi: yes it polls	18:59
clarkb	tristanC: I don't think we would do it that way.	18:59
clarkb	tristanC: that would be worse than running it in docker imo	18:59
clarkb	(because now you need some additional system to keep it up to date etc)	19:00
mordred	and we're already pretty well set up to run docker in places	19:00
mordred	a positive about the polling is that it makes some of the firewall rules simpler - just allow from the prom server on all the endpoints (static data) rather than needing to open the firewall on the cacti server for each endpoint. I mean - we have that complexity implemented so it's not an issue, but it's a place where a change to a service also impacts the config management of the cacti so there's an overlap	19:03
corvus	mordred: cacti polls snmp on the leaf-node servers, so it's the same firewall situation	19:04
corvus	(we don't use snmp traps, which would require inbound connections to cacti)	19:04
mordred	corvus: oh - what am I thinking about where we need to open the firewall rules centrally for each leaf node?	19:04
mordred	am I just making up stuff in my head?	19:05
corvus	i think that may be it? :) we do have to add cacti graphs for every host...	19:05
fungi	so then is the difference that snmp_exporter would run on the prometheus server and connect to net-snmpd on every system while node_exporter would run on the individual systems in place of net-snmpd and prometheus would query it remotely via its custom protocol?	19:05
corvus	but i reckon we'd probably end up adding grafana dashboards for every host too	19:06
corvus	fungi: yes (note the custom protocol is http carrying plain text in "key:value" form)	19:07
fungi	got it, so node_exporter basically runs a private httpd on some specific port we would allow access to similar to how we control access to the snmpd service today	19:08
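For reference, a minimal sketch of what a poller receives from such an endpoint. Prometheus's text exposition format puts one sample per line as a series name (with optional `{labels}`), whitespace, then the value, with `#` comment lines for HELP and TYPE. The sample text below is illustrative rather than captured from a real host, `parse_metrics` is a hypothetical helper, and optional trailing timestamps are not handled.

```python
def parse_metrics(text):
    """Parse simple Prometheus text-format samples into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Series and value are separated by whitespace; split from the
        # right so spaces inside label values stay with the series name.
        # Assumption: no optional timestamp follows the value.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Illustrative sample of a scrape body (not real data from any server).
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_filesystem_avail_bytes{mountpoint="/"} 1.5e+09
"""

print(parse_metrics(SAMPLE))
```

In practice the Prometheus server does this parsing itself; the sketch only shows how little is involved in the HTTP payload being discussed.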
corvus	yeah; i assume we'd pick some value we could use consistently across all hosts. maybe TCP 1061 ;)	19:09
corvus	(that wouldn't interfere with service-level prometheus servers also running on the host)	19:10
fungi	oh, so it doesn't have a well-known port assigned	19:10
fungi	we just pick something	19:10
fungi	wfm	19:10
fungi	this new internet where ip has been replaced by http is still somehow foreign to me ;)	19:11
corvus	fungi: prometheus servers can relay data from other prometheus servers, so there's a prometheus network topology on top of the http layer too :)	19:13
fungi	wow	19:14
clarkb	fungi: also applications like gerrit, gitea, and zuul can expose a port from within themselves to report to prometheus polls	19:14
fungi	i see, and then prometheus knows the app-specific metrics endpoints to collect from as well as the system endpoint. that makes for rather a lot of sockets in some cases i'm betting	19:15
mnaser	corvus, fungi: i'm jumping into this but there is a 'registry' of exporter ports	19:49
mnaser	it ain't IANA but.. https://github.com/prometheus/prometheus/wiki/Default-port-allocations	19:50
fungi	mnaser: oh cool	19:50
mnaser	node_exporter is 9100, etc, that is also indirectly a nice 'list' of exporters you can look at :p	19:50
corvus	good, so as long as services aren't abusing 9100 that should be fine, and i agree we should try to follow that if we go that route	19:51
* fungi concurs wholeheartedly		19:51
mnaser	also, fungi, you mentioned some ipv6 reachability issues over the past .. while, we finally (i believe) got to the bottom of this, could you let me know if you're still seeing those issues	19:51
clarkb	mnaser: the last email we got about failed backups due to ipv6 issues was on the 10th	19:52
fungi	mnaser: i can test again in just a moment	19:52
clarkb	mnaser: seems it has been happier the last few days	19:52
fungi	but yeah, if we're no longer getting notified of backup failures, that's a good sign it's fixed	19:52
fungi	ianw: ^ good news!	19:52
mnaser	wee, awesome. backlogs work! :P	19:53
fungi	mnaser: do you feel like the fix probably took effect on friday or saturday? if so, that coincides with our resumption of backups	19:53
mnaser	fungi: the fix would have been applied by 12pm pt on saturday	19:54
fungi	mnaser: sounds like a correlation to me then	19:54
ianw	fungi / mnaser: debug1: connect to address 2604:e100:1:0:f816:3eff:fe83:a5e5 port 22: Network is unreachable	19:55
mnaser	aw darn	19:56
ianw	so it looks like it's falling back to ipv4, which is making it work	19:56
fungi	aha	19:56
ianw	which is better(?) than before when it just hung? :)	19:56
mnaser	ah dang, i think i have an idea here what's happening	19:56
mnaser	i bet it's because the bgp announcement is not being picked up	19:56
mnaser	since the asn that announces that route is the same as all of our regions	19:56
mnaser	and we don't usually install an ipv6 static route	19:56
mnaser	so it makes sense that it can't find a route, gr	19:57
clarkb	fungi: for the kernel pin we have to craft a special file and stick it in a dir under /etc right? The thing that I always get lost on is what goes in the special file (priorities etc)	19:58
clarkb	fungi: maybe if you do the pin I can take a look at the file afterward and ask questions about the semantics of the thing?	19:58
fungi	clarkb: nah, we can just echo the package name followed by the word "hold" and pipe that through dpkg --set-selections	19:59
clarkb	fungi: TIL	19:59
fungi	first step is getting the package name right	20:00
clarkb	fungi: in that case I guess just share the command you end up running and I'll feel filled in	20:00
ianw	mnaser: yeah, that was what i was going to say ... not! :) i mean of course just let us/me know if i can do anything to help!	20:00
fungi	clarkb: `dpkg -S /boot/vmlinuz-5.4.0-84-generic` reports the linux-image-5.4.0-84-generic package is what installs that file, so it's what we want to hold	20:01
clarkb	fungi: that makes sense	20:01
fungi	echo linux-image-5.4.0-84-generic hold | sudo dpkg --set-selections	20:01
fungi	i ran that just now	20:02
clarkb	and then if we dpkg -l it should show a hold attribute on the package listing?	20:02
fungi	if you `dpkg -l linux-image-5.4.0-84-generic` you'll see the desired state is reported as "hold" instead of "install"	20:02
fungi	now apt-get and other tools will refuse to replace that package version until the hold flag is manually removed	20:03
mnaser	ianw: thanks, that error helps :)	20:03
clarkb	fungi: will it still update the kernels and install them in grub?	20:03
fungi	if we wanted to revert it, we'd just say install instead of hold on the command i ran earlier	20:03
clarkb	fungi: if so that hold may not be sufficient because we're relying on the first grub entry aiui	20:03
clarkb	maybe we also hold linux-generic and linux-image-generic ?	20:04
fungi	clarkb: good point, we can also hold the virtual package	20:04
fungi	i've held both of those too now	20:05
fungi	not normally necessary but as you note, with kernel packages they make a new one for each version	20:05
clarkb	there is also linux-image-virtual but it just installs linux-image-generic so I think we're good	20:05
fungi	yes, the problem is with new package names, new versions of the existing package of the same name will be blocked	20:06
fungi	kernel packages are special in that regard	20:06
fungi	most packages don't have version-specific names	20:06
clarkb	right	20:06
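The hold steps above can be sketched as a small helper. This is an illustration rather than the commands actually run: `selections` and `apply_selections` are hypothetical names, and the package list is the three packages mentioned in the discussion. `dpkg --set-selections` reads lines of the form `<package> <state>` on stdin, so switching the state to `install` reverts the hold.

```python
import subprocess


def selections(packages, state="hold"):
    """Build the text that `dpkg --set-selections` reads on stdin."""
    return "".join(f"{pkg} {state}\n" for pkg in packages)


def apply_selections(packages, state="hold"):
    """Pipe the selections into dpkg (needs root; Debian/Ubuntu only)."""
    subprocess.run(
        ["sudo", "dpkg", "--set-selections"],
        input=selections(packages, state),
        text=True,
        check=True,
    )


# The version-specific kernel image plus the two metapackages that would
# otherwise pull in a newly named kernel package on upgrade.
KERNEL_PACKAGES = [
    "linux-image-5.4.0-84-generic",
    "linux-generic",
    "linux-image-generic",
]

# Show what would be sent to dpkg without changing anything.
print(selections(KERNEL_PACKAGES), end="")
```

Calling `apply_selections(KERNEL_PACKAGES)` would set the holds, and `apply_selections(KERNEL_PACKAGES, "install")` would release them.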
clarkb	ok time to eat lunch.	20:08
opendevreview	Marco Vaschetto proposed openstack/diskimage-builder master: Allowing use local image https://review.opendev.org/c/openstack/diskimage-builder/+/809009	20:31
clarkb	fungi: looking at https://opendev.org/opendev/system-config/src/branch/master/playbooks/gitea-rename-tasks.yaml I think what we want for updating the metadata is a task at the very end of that list of tasks that looks up the metadata from somewhere and applies it	20:32
clarkb	https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gitea-git-repos/library/gitea_create_repos.py#L141-L178 is how we do that in normal project creation	20:32
clarkb	and that is called if we set the always update flag. I suppose another option here is to just run the manage projects playbook with always update set?	20:33
clarkb	as a distinct step of the rename process rather than trying to collapse it all. The problem with that is it is likely to be much slower than collapsing it down to only the projects that are renamed	20:33
clarkb	https://www.brendangregg.com/blog/2014-05-09/xen-feature-detection.html might be interesting to others (ran into it when digging around to see if there is any documentation on converting from pv to pvhvm)	20:46
clarkb	I've used that to confirm we are PV mode on lists and HVM elsewhere	20:47
clarkb	Xen does support converting from pv to pvhvm. Not clear if rax/nova do	20:50
clarkb	Looks like the vm_mode metadata in openstack (which you can use to select pv or hvm) is an image attribute. This might be what makes it tricky	20:52
clarkb	looking at the image properties I don't see the vm_mode set though	20:54
opendevreview	Luciano Lo Giudice proposed openstack/project-config master: Add the cinder-netapp charm to Openstack charms https://review.opendev.org/c/openstack/project-config/+/809012	20:55
clarkb	I'm not finding anything definitive on the internet saying there is a preexisting process for openstack or rackspace. I guess we might have to file a ticket and ask?	20:57
clarkb	Looking at https://github.com/prometheus/node_exporter/blob/master/CHANGELOG.md to get a sense of how the bionic and focal node exporter versions would differ if we used distro packages, and they appear to report metrics under different names	21:16
clarkb	0.18.1 includes a number of systemd related performance updates too	21:17
clarkb	I suppose it is possible to use the distro packages but then when we upgrade servers metric names will change on us	21:17
clarkb	if we deploy latest release with docker we'd avoid that as we could have a consistent version and it seems they have avoided changing names like that on the 1.0 release series	21:18
clarkb	I think if the distro packages were at least 1.0 it wouldn't be as big of a deal	21:21
clarkb	there are ~5 tasks in the gerrit task queue from the great big replication that haven't completed	21:23
clarkb	I wonder if we should kill them and then attempt replication for those specific repos	21:24
clarkb	I guess give them another half hour and if they don't complete then stop them and reenqueue specifically for those repos	21:25
clarkb	makes me wonder if we're potentially having the ipv6 issues in the other direction now	21:26
clarkb	since the replication plugin should fail then retry with a new task entry iirc	21:27
clarkb	I'm actually going to go ahead and do the deb-liberasurecode task stop and restart it since that repo is not used anymore aiui	21:30
clarkb	ya it completed when I did that. I'll go through the others and just reenqueue them	21:32
*** dviroel is now known as dviroel\|out		21:36
clarkb	that is done. Seems like that worked to clear things out and then also ensure replication ended up running	21:36
fungi	interesting, so that suggests we're getting some hung tasks, maybe on the order of 0.03% of the time	21:43
fungi	rare, but not so rare that we wouldn't notice at our volume	21:44
clarkb	ya. Most were to gitea03 (3 total) and one each to gitea01 and gitea06	21:44
clarkb	I suspect it is an issue with creating a network connection similar to the gitea01 backups	21:47
clarkb	since we should retry and create new tasks if we fail, but that doesn't seem to have happened	21:48
opendevreview	Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638	22:35
opendevreview	Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638	22:36
ianw	clarkb: after killing the ones last night, i'm pretty sure i saw the replication restart	22:47
clarkb	ianw: huh I don't think I saw that here but maybe it did	22:48
clarkb	it could've retried quicker than I could relist	22:48
ianw	that was my thinking behind not setting off a full replication. but anyway it would be good to know why these seem stuck permanently	22:49
ianw	you'd think there'd be a timeout	22:49
clarkb	I wonder if the network level connection is just sitting there and not doing anything to return a failure but also not completing the SYN, SYN-ACK, ACK handshake successfully	22:50
clarkb	similar to how ansible will sometimes ssh forever	22:50
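As a generic illustration of the timeout idea raised above, the sketch below bounds how long a TCP connect may take, so a handshake that never completes becomes a failure instead of a hang. It says nothing about how the Gerrit replication plugin or ssh behave internally; `can_connect` is a hypothetical helper, and the host and port in the usage comment are taken from the stuck push URL quoted earlier in the log.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection completes within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, name
        # resolution failures and timeouts (all subclasses of OSError).
        return False


# Example usage (performs a real network connection, so left commented out):
# print(can_connect("gitea03.opendev.org", 222))
```

A check like this only covers connection setup; a connection that is established and then stalls mid-transfer would need a separate read or idle timeout.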
opendevreview	Merged openstack/diskimage-builder master: Fix debian-minimal security repos https://review.opendev.org/c/openstack/diskimage-builder/+/806188	23:16
