Monday, 2021-08-09

tristanCmordred: yeah, and the error is not happening with sf's gerrit, so perhaps it is a network issue somewhere between the bot and opendev's gerrit. I've a change to enable keepalive which we can use if this happens again with the eavesdrop host.00:15
mordredI'm curious what it is. It's SO regular00:33
opendevreviewMerged opendev/system-config master: Replace callback_whitelist with callback_enabled
*** marios is now known as marios|ruck05:32
*** ykarel_ is now known as ykarel05:43
*** jpena|off is now known as jpena07:35
*** rpittau|afk is now known as rpittau07:52
*** ykarel is now known as ykarel|lunch08:32
*** ykarel|lunch is now known as ykarel10:06
opendevreviewAnanya proposed opendev/elastic-recheck master: Run elastic-recheck in container
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container
opendevreviewAnanya proposed opendev/elastic-recheck master: Run elastic-recheck in container
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container
*** jpena is now known as jpena|lunch11:39
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container
*** jpena|lunch is now known as jpena12:32
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container
fungiugh, apparently the ipv6 address for the listserv has once again been listed on the spamhaus xbl13:16
fungiseems to have started sometime around or maybe a bit before 2021-08-06 07:50 utc13:18
fungii'll get started on the delisting request again13:18
fungioh, looking at the entry for it says "2001:4800:780e:510::/64 is making SMTP connections that leads us to believe it is misconfigured. [...] Please correct your HELO 'localhost.localdomain'"13:31
fungii think the problem is they're blanketing the entire /6413:31
fungiso it could be any server in that address range13:31
fungii last delisted it on 2021-08-03 at 15:51:05 so it took roughly 2.5 days to be re-added13:47
fungi#status log Requested Spamhaus XBL delisting for the IPv6 /64 CIDR containing the address of the server (seems to have been re-listed around 2021-08-06 07:00 UTC, roughly 2.5 days after the previous de-listing)13:49
opendevstatusfungi: finished logging13:49
fungii suppose we can open a ticket with rackspace requesting they talk to spamhaus about carrying individual /128 entries from that /64, but i have no idea whether they care. in the past, rackspace's guidance has been to smarthost through their mailservers instead (which also get added to blocklists on a fairly regular basis, from what i've seen)13:51
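Editorial sketch: the difference between the /64 listing and individual /128 entries discussed above can be checked with Python's standard `ipaddress` module. The network is the one quoted from the Spamhaus entry. The host addresses are hypothetical, since the log does not give the listserv's actual address.

```python
import ipaddress

# The /64 quoted from the Spamhaus entry in the log.
listed_block = ipaddress.ip_network("2001:4800:780e:510::/64")

# Hypothetical addresses inside that block (assumptions for illustration only).
example_host = ipaddress.ip_address("2001:4800:780e:510::25")
neighbour_host = ipaddress.ip_address("2001:4800:780e:510::26")

# A /128 entry covers exactly one address.
single_entry = ipaddress.ip_network(f"{example_host}/128")


def is_covered(address, network):
    """Return True if a listing for `network` also lists `address`."""
    return address in network


# Listing the whole /64 catches every host in the range, including neighbours.
print(is_covered(example_host, listed_block), is_covered(neighbour_host, listed_block))

# A /128 listing would only affect the one host.
print(is_covered(example_host, single_entry), is_covered(neighbour_host, single_entry))
```

With a /64 entry, any misconfigured server among the 2**64 addresses in the range is enough to get every other address in it listed as well.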
fungiin the past we've talked about not moving or rebuilding that server in order to preserve its long-standing reputation, but that doesn't seem to be much help at the moment13:52
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container
mhu1Hello, a while ago a spec revamping authentication on opendev services was discussed, where can I check the status for this? ie has anything been officially decided yet? Any action items on this?14:17
opendevreviewAnanya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container
fungimhu1: the spec was approved and is published at but we're seeking volunteers to work through implementing14:32
*** ykarel is now known as ykarel|away14:48
opendevreviewSorin Sbârnea proposed opendev/elastic-recheck master: Minor CI fixes
clarkbhas anyone looked at dequeuing the hourly deploy buildset that is holding our deploy builds in queueing states?14:49
clarkbI think the issue is the 0000000 hash doesn't play nice with that pipeline? I can do it after some breakfast if others haven't looked at it yet14:49
fungii haven't looked yet, been extinguishing other fires (listserv spamhaus delisting, afs docs volume with failing vos release)14:50
clarkbfungi: re spamhaus corvus did mention we can force exim to use ipv414:51
clarkbalright let me find some breakfast then I can take a look at the zuul queues14:52
fungiahh, yeah i suppose we could just avoid ipv6 outbound delivery if we have to14:53
fungithanks for the reminder14:53
*** marios|ruck is now known as marios15:06
*** marios is now known as marios|ruck15:08
clarkbwe've got a number of various infra-prod related things queued up. I'm going to try and dequeue them in such a manner that no jobs suddenly start running because I'm not sure what order is appropriate. Instead I'll attempt to remove all of them then the next hourly run can start up at 16:00 UTC15:14
clarkbthere is one change queued up but all it wants to run is install-ansible and that is run by the hourly job15:14
fungiyeah, so dequeue the newest first back to the oldest i guess15:15
fungiseems like these are fallout from the reenqueue at the last restart15:16
fungipossibly the result of reenqueuing previously reenqueued items even15:16
fungi(there were several restarts, right?)15:16
clarkbya there was at least 2 restarts back to back15:17
clarkbok thats done. The hourly run in 40 minutes should do the ansible config update that landed in 803047 for us15:19
clarkbI checked the change; it's a minor update to address a deprecation warning, so it won't be the end of the world if this somehow doesn't apply in 40 minutes15:20
clarkbwe can investigate further at that point15:20
fungisounds good, thanks for the cleanup!15:20
*** jpena is now known as jpena|off15:32
opendevreviewClark Boylan proposed opendev/system-config master: WIP Prepare gitea 1.15.0 upgrade
clarkbwe have a new RC to test15:46
*** marios|ruck is now known as marios16:11
*** rpittau is now known as rpittau|afk16:26
clarkbinfra-root now I'm thinking something else is up with the hourly deploy jobs. They have been queued up for half an hour without any progress16:30
clarkbcorvus: ^ fallout from the updates that were made? Maybe we broke semaphores somehow?16:30
clarkbSemaphore /zuul/semaphores/openstack/infra-prod-playbook can not be released for 26339974ead441c68415f5ee9aa8ffff-infra-prod-install-ansible because the semaphore is not held16:35
clarkbI think the issue is grabbing the semaphote16:35
clarkb*semaphore. We should have periodic cleanup of those semaphores though right?16:35
clarkb_semaphore_cleanup_interval = IntervalTrigger(minutes=60, jitter=60)16:36
corvusclarkb: i think that message is a red herring (which we should probably find and clean up).  i say that based on seeing it before.  i don't want to exclude it as a cause, but not the first place i'd look.16:37
corvusi'll start digging in16:37
clarkbI did the removal of all the other buildsets about an hour and 15 minutes ago which should be more than enough for the semaphore cleanup to have run since16:37
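Editorial sketch: the reasoning that 75 minutes "should be more than enough" can be spelled out with a little arithmetic. It takes the `minutes=60, jitter=60` values quoted above and assumes the jitter is in seconds and can shift each run by at most that much in either direction. The function names are hypothetical and this is not Zuul's code.

```python
from datetime import timedelta

CLEANUP_INTERVAL = timedelta(minutes=60)
# Assumption: jitter is expressed in seconds and may move a run earlier or later.
MAX_JITTER = timedelta(seconds=60)


def worst_case_gap(interval, jitter):
    """Longest possible time between two consecutive runs.

    The worst case is one run firing early by `jitter` and the next one
    firing late by `jitter`.
    """
    return interval + 2 * jitter


def cleanup_should_have_run(elapsed, interval=CLEANUP_INTERVAL, jitter=MAX_JITTER):
    """True if at least one scheduled cleanup must fall within `elapsed`."""
    return elapsed >= worst_case_gap(interval, jitter)


print(worst_case_gap(CLEANUP_INTERVAL, MAX_JITTER))
print(cleanup_should_have_run(timedelta(minutes=75)))
```

Under those assumptions a healthy periodic cleanup would have fired at least once in the 75-minute window, which is why the missing log lines pointed at the cleanup job not running at all.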
clarkbcorvus: ok thanks16:37
corvusclarkb: oh, i didn't see that we're waiting on a semaphore... that's different.  i retract that.  ;)16:38
corvusso let's revise that to say: there are potentially a lot of red-herring log messages about releasing semaphores.  but one of them may be relevant.  :/16:39
*** marios is now known as marios|out16:39
clarkbcorvus: 260e702e93d645d9b627b2569a1258e1 is the event that triggered the buildset if that helps16:39
clarkbfor the most recent 16:00 UTC enqueue16:39
corvusfd02b512c5134c409d460e3365a0f170-infra-prod-remote-puppet-else is the current holder16:40
corvusthat appears to have been active on aug 716:41
clarkband possibly went away in the restarts?16:42
corvusqueue item fd02b512c5134c409d460e3365a0f170 was last spotted at 2021-08-07 17:43:21,55216:44
corvusclarkb: so yes -- that's between my 2 restarts on saturday16:45
corvusthe auto cleanup should have gotten it16:45
clarkbcorvus: I guess the thing to look at is why semaphore cleanups aren't cleaning it up. ya16:45
clarkb`grep 'semaphore cleanup' /var/log/zuul/debug.log` shows no results but looking at the scheduler it seems that we should log that string when starting semaphore cleanups and when they fail16:46
corvusyes, i also don't see any apsched log lines -- we may need to adjust our log config to get them?16:47
corvusnothing is holding the semaphore cleanup lock in zk, so it should be able to run16:49
corvusclarkb: oh!  what if the lack of reconfiguration on startup means the initial cleanup never runs?16:50
clarkbcorvus: it looks like we call startCleanup() from scheduler.start(). Would that be impacted by ^?16:50
corvusclarkb: maybe we should move to #zuul16:51
corvusclarkb: prime() is called after start() from the cli wrapper17:01
corvusoh, sorry that was old; i didn't manage the channel switch well :)17:02
corvusanyway -- as mentioned in #zuul, i think i can repair our current issue without a restart; i will do so now17:02
corvus2021-08-09 17:03:58,592 ERROR zuul.zk.SemaphoreHandler: Releasing leaked semaphore /zuul/semaphores/openstack/infra-prod-playbook held by fd02b512c5134c409d460e3365a0f170-infra-prod-remote-puppet-else17:05
corvuslooks like we're gtg17:05
clarkbthe job has started17:05
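Editorial sketch: the "leaked semaphore" idea in the log line above can be illustrated as a comparison between semaphore holders and queue items that still exist. This is a simplified model, not Zuul's implementation. The `<item uuid>-<job name>` holder format is inferred from the identifiers in the log, and the set of live items is a hypothetical input.

```python
def find_leaked_holders(semaphore_holders, active_items):
    """Return {semaphore_path: [holder, ...]} for holders with no live queue item.

    `semaphore_holders` maps a semaphore path to a list of holder strings of
    the form "<item uuid>-<job name>". `active_items` is the set of queue item
    uuids that currently exist.
    """
    leaked = {}
    for path, holders in semaphore_holders.items():
        # The uuid is hex without hyphens, so everything before the first "-"
        # identifies the queue item.
        stale = [h for h in holders if h.split("-", 1)[0] not in active_items]
        if stale:
            leaked[path] = stale
    return leaked


# Holder taken from the log; the live item set is an assumption for illustration.
holders = {
    "/zuul/semaphores/openstack/infra-prod-playbook": [
        "fd02b512c5134c409d460e3365a0f170-infra-prod-remote-puppet-else",
    ],
}
live_items = {"26339974ead441c68415f5ee9aa8ffff"}

for path, stale in find_leaked_holders(holders, live_items).items():
    for holder in stale:
        print(f"Releasing leaked semaphore {path} held by {holder}")
```

In this model the holder from 2021-08-07 is reported because its queue item disappeared during the restarts, so nothing was left to release the semaphore normally.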
clarkbfungi: if you have time today is the last of my gitea cleanup changes prior to the gitea 1.15.0 rc upgrade change. This one is probably the largest of the changes but converts us over to using only the rest api for gitea management17:56
clarkbin the process it simplifies some stuff that was split out into multiple requests previously which should in theory mean it makes us run quicker too17:56
clarkbWith that landed we should be good to upgrade to 1.15.0 whenever it releases17:57
fungiclarkb: i approved it, but as commented we likely want to keep an eye on the next project creation changes as they go through just in case our testing coverage isn't as good as we think18:34
fungiconfig-core: ^ heads up18:34
clarkband thanks18:34
clarkbI've removed my WIP from as it seems gerrit is happily running with the mariadb connector since we restarted it18:44
clarkbThat change does a bit more cleanup and is quite a bit more involved now so deserves careful review, but we shouldn't be limited by gerrit itself18:44
opendevreviewMerged opendev/system-config master: Update gitea project creation to only use the REST API
clarkbianw: I think you can approve when ready19:35
*** dviroel is now known as dviroel|brb20:06
thomasb06fungi: hey. at the moment, i'm trying to understand your patch:
thomasb06what was happening?20:39
fungithomasb06: i didn't have a patch on that bug, just comments about inapplicability for issuing a security advisory20:42
fungimaybe better to discuss in either #openstack-neutron or #openstack-security depending on the nature of your question20:43
*** dviroel|brb is now known as dviroel20:50
opendevreviewClark Boylan proposed opendev/system-config master: Accomodate zuul's new key management system
clarkbcorvus: ^ that starts to capture the opendev updates necessary to deal with the new key stuff21:35
corvusclarkb: lgtm, 1 quick thought21:41
clarkbcorvus: good idea, do you think that should go in a separate file from the typical export location? or just overwrite what we typically have?21:43
corvusclarkb: i was thinking typical location21:44
corvusshould be fine -- after all, the backups have older versions21:44
opendevreviewClark Boylan proposed opendev/system-config master: Accomodate zuul's new key management system
*** dviroel is now known as dviroel|out21:47
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Move grubenv to EFI dir, add a symlink back
clarkbcorvus: I've started learning a bit about prometheus and how we might deploy it via a bunch of reading. So far I think what I've got is we deploy it with docker-compose like most other services and then bind mount in its config as well as the location the config says to store the tsdb contents. I want to say you were suggesting that we just use the built in tsdb implementation?23:03
clarkbcorvus: what exporter were you suggesting we use to collect system metrics similar to what cacti does? Is it node-exporter?23:04
clarkbThe other bit that isn't quite clear to me is if the expression evaluator web service thing is included in a default install or if that runs alongside as a separate thing, but I think that is less important at startup23:06
tristanCclarkb: here is how i'm running it to collect gerritbot stats:
clarkbtristanC: thanks. I see now the prom/prometheus image specifies a config path and a storage tsdb path on the command line so those need to be provided. Also it exposes port 9090 which I'm going to guess is the dashboard thing23:27
tristanCclarkb: yes, then you can add it to grafana by using such datasource config:
clarkbtristanC: that was the same paste again :)23:28
tristanCarg, i meant the next one: :-)23:29
clarkbthanks that helps23:29
tristanCclarkb: you're welcome, i'd be happy to help with the setup, is the spec already proposed?23:30
corvusclarkb: if we want to use snmp, there's an snmp exporter23:30
corvusclarkb: that can be a single instance, so we could re-use our existing snmp configuration without running a separate collector23:31
clarkbtristanC: no I'm still putting things together into my head first. Then next step is writing the spec down. Hopefully tomorrow23:31
clarkbcorvus: oh nice we can run the snmp exporter adjacent to prometheus then we only need to update firewall rules23:31
corvusclarkb: yep23:32
clarkbtristanC: I'll ping you when I push it up23:32
clarkbtristanC: but I think rough idea right now is deploy a new instance with a large volume for the tsdb content. Run prometheus using the prom/prometheus image from Docker Hub using docker-compose. And now the bit about snmp has clicked so we'll run the snmp exporter adjacent to that23:33
clarkband then we can also mix in specific application scraping like for gerritbot23:33
corvusand zuul etc23:33
tristanCto collect general host metric, there is also the node_exporter you can run in a container too23:33
clarkbtristanC: yup I asked about that one above. But that has to run on each host right?23:34
clarkbya I think it will be easier to stick with snmp at least while we spin stuff up. Not everything has docker installed on it etc23:34
tristanCit doesn't have to be all in one too, for example, once gerritbot has the endpoint (with 803125 ), i can run the monitoring from my setup23:36
clarkbright but I'm saying that for node metric collection needing to deploy a container across all of our instances and modify firewall rules for that will be a fairly big undertaking. It will be much simpler to start with snmp which can run centrally and do the polling since we already have snmp configured23:37
tristanCsure that sounds good. Another thing to consider is the alertmanager configuration: it would be nice to have custom route, where interested party could subscribe to some alerts.23:50
clarkbLooking at things it seems the graphing implementation in prometheus may be more like the one in graphite. If we want graphs like cacti produces we'll need to set up our grafana to talk to prometheus and give it a bunch of config to render those graphs. Not a big deal but want to understand the scope of the work here23:57
