tristanC | mordred: yeah, and the error is not happening with sf's gerrit, so perhaps it is a network issue somewhere between the bot and opendev's gerrit. I've a change to enable keepalive which we can use if this happens again with the eavesdrop host. | 00:15 |
---|---|---|
mordred | I'm curious what it is. It's SO regular | 00:33 |
opendevreview | Merged opendev/system-config master: Replace callback_whitelist with callback_enabled https://review.opendev.org/c/opendev/system-config/+/803047 | 04:05 |
*** marios is now known as marios|ruck | 05:32 | |
*** ykarel_ is now known as ykarel | 05:43 | |
*** jpena|off is now known as jpena | 07:35 | |
*** rpittau|afk is now known as rpittau | 07:52 | |
*** ykarel is now known as ykarel|lunch | 08:32 | |
*** ykarel|lunch is now known as ykarel | 10:06 | |
opendevreview | Ananya proposed opendev/elastic-recheck master: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/802866 | 10:12 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 10:13 |
opendevreview | Ananya proposed opendev/elastic-recheck master: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803898 | 10:18 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 10:27 |
*** jpena is now known as jpena|lunch | 11:39 | |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 11:40 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 11:44 |
*** jpena|lunch is now known as jpena | 12:32 | |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 13:11 |
fungi | ugh, apparently the ipv6 address for the listserv has once again been listed on the spamhaus xbl | 13:16 |
fungi | seems to have started sometime around or maybe a bit before 2021-08-06 07:50 utc | 13:18 |
fungi | i'll get started on the delisting request again | 13:18 |
fungi | oh, looking at the entry for https://check.spamhaus.org/listed/?searchterm=2001:4800:780e:510:3bc3:d7f6:ff04:b736 it says "2001:4800:780e:510::/64 is making SMTP connections that leads us to believe it is misconfigured. [...] Please correct your HELO 'localhost.localdomain'" | 13:31 |
fungi | i think the problem is they're blanketing the entire /64 | 13:31 |
fungi | so it could be any server in that address range | 13:31 |
fungi | i last delisted it on 2021-08-03 at 15:51:05 so it took roughly 2.5 days to be re-added | 13:47 |
fungi | #status log Requested Spamhaus XBL delisting for the IPv6 /64 CIDR containing the address of the lists.openstack.org server (seems to have been re-listed around 2021-08-06 07:00 UTC, roughly 2.5 days after the previous de-listing) | 13:49 |
opendevstatus | fungi: finished logging | 13:49 |
fungi | i suppose we can open a ticket with rackspace requesting they talk to spamhaus about carrying individual /128 entries from that /64, but i have no idea whether they care. in the past, rackspace's guidance has been to smarthost through their mailservers instead (which also get added to blocklists on a fairly regular basis, from what i've seen) | 13:51 |
fungi | in the past we've talked about not moving or rebuilding that server in order to preserve its long-standing reputation, but that doesn't seem to be much help at the moment | 13:52 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803932 | 14:10 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803932 | 14:12 |
mhu1 | Hello, a while ago a spec revamping authentication on opendev services was discussed, where can I check the status for this? ie has anything been officially decided yet? Any action items on this? | 14:17 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803932 | 14:21 |
fungi | mhu1: the spec was approved and is published at https://docs.opendev.org/opendev/infra-specs/latest/specs/central-auth.html but we're seeking volunteers to work through implementing it | 14:32 |
*** ykarel is now known as ykarel|away | 14:48 | |
opendevreview | Sorin Sbârnea proposed opendev/elastic-recheck master: Minor CI fixes https://review.opendev.org/c/opendev/elastic-recheck/+/803934 | 14:48 |
clarkb | has anyone looked at dequeuing the hourly deploy buildset that is holding our deploy builds in queued states? | 14:49 |
clarkb | I think the issue is the 0000000 hash doesn't play nice with that pipeline? I can do it after some breakfast if others haven't looked at it yet | 14:49 |
fungi | i haven't looked yet, been extinguishing other fires (listserv spamhaus delisting, afs docs volume with failing vos release) | 14:50 |
clarkb | fungi: re spamhaus corvus did mention we can force exim to use ipv4 | 14:51 |
clarkb | alright let me find some breakfast then I can take a look at the zuul queues | 14:52 |
fungi | ahh, yeah i suppose we could just avoid ipv6 outbound delivery if we have to | 14:53 |
fungi | thanks for the reminder | 14:53 |
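As a possible follow-up to the IPv4-only delivery idea above, a minimal sketch, assuming the option would go into Exim's main configuration on the listserv (the exact file depends on how Exim is packaged and managed there). Note that this option turns off IPv6 in Exim entirely, including listening, which may be broader than strictly needed for outbound delivery:

```
# Hypothetical Exim main-configuration snippet; disables IPv6 so all
# SMTP delivery (and listening) happens over IPv4 only.
disable_ipv6 = true
```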
*** marios|ruck is now known as marios | 15:06 | |
*** marios is now known as marios|ruck | 15:08 | |
clarkb | we've got a number of various infra-prod related things queued up. I'm going to try and dequeue them in such a manner that no jobs suddenly start running, because I'm not sure what order is appropriate. Instead I'll attempt to remove all of them, then the next hourly run can start up at 16:00 UTC | 15:14 |
clarkb | there is one change queued up but all it wants to run is install-ansible and that is run by the hourly job | 15:14 |
fungi | yeah, so dequeue the newest first back to the oldest i guess | 15:15 |
fungi | seems like these are fallout from the reenqueue at the last restart | 15:16 |
fungi | possibly the result of reenqueuing previously reenqueued items even | 15:16 |
fungi | (there were several restarts, right?) | 15:16 |
clarkb | ya there were at least 2 restarts back to back | 15:17 |
clarkb | ok that's done. The hourly run in 40 minutes should do the ansible config update that landed in 803047 for us | 15:19 |
clarkb | I checked the change; it's a minor update to address a deprecation warning, so it won't be the end of the world if this somehow doesn't apply in 40 minutes | 15:20 |
clarkb | we can investigate further at that point | 15:20 |
fungi | sounds good, thanks for the cleanup! | 15:20 |
*** jpena is now known as jpena|off | 15:32 | |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Prepare gitea 1.15.0 upgrade https://review.opendev.org/c/opendev/system-config/+/803231 | 15:45 |
clarkb | we have a new RC to test | 15:46 |
*** marios|ruck is now known as marios | 16:11 | |
*** rpittau is now known as rpittau|afk | 16:26 | |
clarkb | infra-root now I'm thinking something else is up with the hourly deploy jobs. They have been queued up for half an hour without any progress | 16:30 |
clarkb | corvus: ^ fallout from the updates that were made? Maybe we broke semaphores somehow? | 16:30 |
clarkb | Semaphore /zuul/semaphores/openstack/infra-prod-playbook can not be released for 26339974ead441c68415f5ee9aa8ffff-infra-prod-install-ansible because the semaphore is not held | 16:35 |
clarkb | I think the issue is grabbing the semaphore | 16:35 |
clarkb | We should have periodic cleanup of those semaphores though right? | 16:35 |
clarkb | _semaphore_cleanup_interval = IntervalTrigger(minutes=60, jitter=60) | 16:36 |
corvus | clarkb: i think that message is a red herring (which we should probably find and clean up). i say that based on seeing it before. i don't want to exclude it as a cause, but not the first place i'd look. | 16:37 |
corvus | i'll start digging in | 16:37 |
clarkb | I did the removal of all the other buildsets about an hour and 15 minutes ago, which should be more than enough time for the semaphore cleanup to have run since | 16:37 |
clarkb | corvus: ok thanks | 16:37 |
corvus | clarkb: oh, i didn't see that we're waiting on a semaphore... that's different. i retract that. ;) | 16:38 |
corvus | so let's revise that to say: there are potentially a lot of red-herring log messages about releasing semaphores. but one of them may be relevant. :/ | 16:39 |
*** marios is now known as marios|out | 16:39 | |
clarkb | corvus: 260e702e93d645d9b627b2569a1258e1 is the event that triggered the buildset if that helps | 16:39 |
clarkb | for the most recent 16:00 UTC enqueue | 16:39 |
corvus | fd02b512c5134c409d460e3365a0f170-infra-prod-remote-puppet-else is the current holder | 16:40 |
corvus | that appears to have been active on aug 7 | 16:41 |
clarkb | and possibly went away in the restarts? | 16:42 |
corvus | queue item fd02b512c5134c409d460e3365a0f170 was last spotted at 2021-08-07 17:43:21,552 | 16:44 |
corvus | clarkb: so yes -- that's between my 2 restarts on saturday | 16:45 |
corvus | the auto cleanup should have gotten it | 16:45 |
clarkb | corvus: I guess the thing to look at is why semaphore cleanups aren't cleaning it up. ya | 16:45 |
clarkb | `grep 'semaphore cleanup' /var/log/zuul/debug.log` shows no results but looking at the scheduler it seems that we should log that string when starting semaphore cleanups and when they fail | 16:46 |
corvus | yes, i also don't see any apsched log lines -- we may need to adjust our log config to get them? | 16:47 |
corvus | nothing is holding the semaphore cleanup lock in zk, so it should be able to run | 16:49 |
corvus | clarkb: oh! what if the lack of reconfiguration on startup means the initial cleanup never runs? | 16:50 |
clarkb | corvus: it looks like we call startCleanup() from scheduler.start(). Would that be impacted by ^? | 16:50 |
corvus | clarkb: maybe we should move to #zuul | 16:51 |
clarkb | ++ | 16:51 |
corvus | clarkb: prime() is called after start() from the cli wrapper | 17:01 |
corvus | oh, sorry that was old; i didn't manage the channel switch well :) | 17:02 |
corvus | anyway -- as mentioned in #zuul, i think i can repair our current issue without a restart; i will do so now | 17:02 |
clarkb | thanks | 17:04 |
corvus | 2021-08-09 17:03:58,592 ERROR zuul.zk.SemaphoreHandler: Releasing leaked semaphore /zuul/semaphores/openstack/infra-prod-playbook held by fd02b512c5134c409d460e3365a0f170-infra-prod-remote-puppet-else | 17:05 |
corvus | looks like we're gtg | 17:05 |
clarkb | the job has started | 17:05 |
clarkb | fungi: if you have time today https://review.opendev.org/c/opendev/system-config/+/803367/ is the last of my gitea cleanup changes prior to the gitea 1.15.0 rc upgrade change. This one is probably the largest of the changes but converts us over to using only the rest api for gitea management | 17:56 |
clarkb | in the process it simplifies some stuff that was previously split out into multiple requests, which should in theory make it run quicker too | 17:56 |
clarkb | With that landed we should be good to upgrade to 1.15.0 whenever it releases | 17:57 |
fungi | clarkb: i approved it, but as commented we likely want to keep an eye on things when the next project creation changes go through, just in case our testing coverage isn't as good as we think | 18:34 |
clarkb | yup | 18:34 |
fungi | config-core: ^ heads up | 18:34 |
clarkb | and thanks | 18:34 |
clarkb | I've removed my WIP from https://review.opendev.org/c/opendev/system-config/+/803374 as it seems gerrit is happily running with the mariadb connector since we restarted it | 18:44 |
clarkb | That change does a bit more cleanup and is quite a bit more involved now, so it deserves careful review, but we shouldn't be limited by gerrit itself | 18:44 |
opendevreview | Merged opendev/system-config master: Update gitea project creation to only use the REST API https://review.opendev.org/c/opendev/system-config/+/803367 | 19:27 |
clarkb | ianw: I think you can approve https://review.opendev.org/c/openstack/project-config/+/803411 when ready | 19:35 |
*** dviroel is now known as dviroel|brb | 20:06 | |
thomasb06 | fungi: hey. at the moment, i'm trying to understand your patch: https://bugs.launchpad.net/neutron/+bug/1784259 | 20:39 |
thomasb06 | what was happening? | 20:39 |
fungi | thomasb06: i didn't have a patch on that bug, just comments about inapplicability for issuing a security advisory | 20:42 |
fungi | maybe better to discuss in either #openstack-neutron or #openstack-security depending on the nature of your question | 20:43 |
*** dviroel|brb is now known as dviroel | 20:50 | |
opendevreview | Clark Boylan proposed opendev/system-config master: Accomodate zuul's new key management system https://review.opendev.org/c/opendev/system-config/+/803992 | 21:35 |
clarkb | corvus: ^ that starts to capture the opendev updates necessary to deal with the new key stuff | 21:35 |
corvus | clarkb: lgtm, 1 quick thought | 21:41 |
clarkb | corvus: good idea, do you think that should go in a separate file from the typical export location? or just overwrite what we typically have? | 21:43 |
corvus | clarkb: i was thinking typical location | 21:44 |
clarkb | ok | 21:44 |
corvus | should be fine -- after all, the backups have older versions | 21:44 |
opendevreview | Clark Boylan proposed opendev/system-config master: Accomodate zuul's new key management system https://review.opendev.org/c/opendev/system-config/+/803992 | 21:46 |
*** dviroel is now known as dviroel|out | 21:47 | |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Move grubenv to EFI dir, add a symlink back https://review.opendev.org/c/openstack/diskimage-builder/+/804000 | 23:01 |
clarkb | corvus: I've started learning a bit about prometheus and how we might deploy it via a bunch of reading. So far I think what I've got is we deploy it with docker-compose like most other services and then bind mount in its config as well as the location the config says to store the tsdb contents. I want to say you were suggesting that we just use the built in tsdb implementation? | 23:03 |
clarkb | corvus: what exporter were you suggesting we use to collect system metrics similar to what cacti does? Is it node-exporter? | 23:04 |
clarkb | The other bit that isn't quite clear to me is if the expression evaluator web service thing is included in a default install or if that runs alongside as a separate thing, but I think that is less important at startup | 23:06 |
tristanC | clarkb: here is how i'm running it to collect gerritbot stats: https://paste.opendev.org/show/807966/ | 23:25 |
clarkb | tristanC: thanks. I see now the prom/prometheus image specifies a config path and a storage tsdb path on the command line so those need to be provided. Also it exposes port 9090, which I'm going to guess is the dashboard thing | 23:27 |
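A minimal docker-compose sketch along the lines being discussed (not the contents of the paste above); the host paths, retention period, and port publishing are assumptions rather than decided values:

```yaml
# Hypothetical docker-compose.yaml for the Prometheus service.
version: '3'
services:
  prometheus:
    image: docker.io/prom/prometheus:latest
    ports:
      - "9090:9090"                              # built-in web UI / expression browser
    volumes:
      - /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - /var/lib/prometheus:/prometheus          # TSDB storage on a large volume
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=365d       # assumed retention, not a decided value
    restart: always
```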
tristanC | clarkb: yes, then you can add it to grafana by using such datasource config: https://paste.opendev.org/show/807966/ | 23:28 |
clarkb | tristanC: that was the same paste again :) | 23:28 |
tristanC | arg, i meant the next one: https://paste.opendev.org/show/807967/ :-) | 23:29 |
clarkb | thanks that helps | 23:29 |
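For reference, the generic shape of a Grafana datasource provisioning file for Prometheus looks roughly like the following; this is a hedged sketch rather than the contents of the paste, and the URL is a placeholder:

```yaml
# Hypothetical Grafana datasource provisioning file, e.g. dropped into
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.example.org:9090   # placeholder address for the Prometheus host
    isDefault: false
```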
tristanC | clarkb: you're welcome, i'd be happy to help with the setup, is the spec already proposed? | 23:30 |
corvus | clarkb: if we want to use snmp, there's an snmp exporter | 23:30 |
corvus | clarkb: that can be a single instance, so we could re-use our existing snmp configuration without running a separate collector | 23:31 |
clarkb | tristanC: no I'm still putting things together into my head first. Then next step is writing the spec down. Hopefully tomorrow | 23:31 |
clarkb | corvus: oh nice we can run the snmp exporter adjacent to prometheus then we only need to update firewall rules | 23:31 |
corvus | clarkb: yep | 23:32 |
clarkb | tristanC: I'll ping you when I push it up | 23:32 |
clarkb | tristanC: but I think the rough idea right now is deploy a new instance with a large volume for the tsdb content. Run prometheus using the prom/prometheus image from Docker Hub using docker-compose. And now the bit about snmp has clicked, so we'll run the snmp exporter adjacent to that | 23:33 |
clarkb | and then we can also mix in specific application scraping like for gerritbot | 23:33 |
corvus | and zuul etc | 23:33 |
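A hedged sketch of what direct application scraping could look like in prometheus.yml; the gerritbot endpoint host and port are placeholders (whatever change 803125 ends up exposing):

```yaml
# Hypothetical scrape job for an application that exposes its own
# /metrics endpoint (e.g. gerritbot once 803125 lands).
scrape_configs:
  - job_name: gerritbot
    static_configs:
      - targets: ['eavesdrop.example.org:8001']   # placeholder host:port
```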
tristanC | to collect general host metrics, there is also the node_exporter, which you can run in a container too | 23:33 |
clarkb | tristanC: yup I asked about that one above. But that has to run on each host right? | 23:34 |
tristanC | yes | 23:34 |
clarkb | ya I think it will be easier to stick with snmp at least while we spin stuff up. Not everything has docker installed on it etc | 23:34 |
tristanC | it doesn't have to be all in one either; for example, once gerritbot has the endpoint (with 803125), i can run the monitoring from my setup | 23:36 |
clarkb | right, but I'm saying that for node metric collection, needing to deploy a container across all of our instances and modify firewall rules for that will be a fairly big undertaking. It will be much simpler to start with snmp, which can run centrally and do the polling, since we already have snmp configured | 23:37 |
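A sketch of that central-SNMP approach, roughly following the documented snmp_exporter scrape pattern; the target hostnames, module name, and exporter address are examples/assumptions:

```yaml
# Hypothetical prometheus.yml job that polls hosts over SNMP through a
# single snmp_exporter running next to Prometheus.
scrape_configs:
  - job_name: snmp
    metrics_path: /snmp
    params:
      module: [if_mib]                 # example module from the exporter's snmp.yml
    static_configs:
      - targets:
          - lists.openstack.org        # hosts to poll over SNMP (examples)
          - review.opendev.org
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # tell the exporter which host to query
      - source_labels: [__param_target]
        target_label: instance         # keep the polled host as the instance label
      - target_label: __address__
        replacement: 127.0.0.1:9116    # the snmp_exporter itself (its default port)
```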
tristanC | sure that sounds good. Another thing to consider is the alertmanager configuration: it would be nice to have custom route, where interested party could subscribe to some alerts. | 23:50 |
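On the alerting side, per-interest routing in Alertmanager could look something like the sketch below; the label used for matching, the receiver names, and all addresses are assumptions:

```yaml
# Hypothetical alertmanager.yml showing a custom route that a specific
# interested party could subscribe to, with everything else falling
# through to a default receiver.
global:
  smtp_smarthost: localhost:25
  smtp_from: alertmanager@example.org
route:
  receiver: infra-root
  routes:
    - matchers:
        - service="gerritbot"          # assumed alert label
      receiver: gerritbot-interested
receivers:
  - name: infra-root
    email_configs:
      - to: infra-root@example.org
  - name: gerritbot-interested
    email_configs:
      - to: gerritbot-alerts@example.org
```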
clarkb | Looking at things, it seems the graphing implementation in prometheus may be more like the one in graphite. If we want graphs like cacti produces, we'll need to set up our grafana to talk to prometheus and give it a bunch of config to render those graphs. Not a big deal but I want to understand the scope of the work here | 23:57 |