tristanC | mordred: yeah, and the error is not happening with sf's gerrit, so perhaps it is a network issue somewhere between the bot and opendev's gerrit. I've a change to enable keepalive which we can use if this happens again with the eavesdrop host. | 00:15 |
---|---|---|
mordred | I'm curious what it is. It's SO regular | 00:33 |
opendevreview | Merged opendev/system-config master: Replace callback_whitelist with callback_enabled https://review.opendev.org/c/opendev/system-config/+/803047 | 04:05 |
*** marios is now known as marios|ruck | 05:32 | |
*** ykarel_ is now known as ykarel | 05:43 | |
*** jpena|off is now known as jpena | 07:35 | |
*** rpittau|afk is now known as rpittau | 07:52 | |
*** ykarel is now known as ykarel|lunch | 08:32 | |
*** ykarel|lunch is now known as ykarel | 10:06 | |
opendevreview | Ananya proposed opendev/elastic-recheck master: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/802866 | 10:12 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 10:13 |
opendevreview | Ananya proposed opendev/elastic-recheck master: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803898 | 10:18 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 10:27 |
*** jpena is now known as jpena|lunch | 11:39 | |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 11:40 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 11:44 |
*** jpena|lunch is now known as jpena | 12:32 | |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 13:11 |
fungi | ugh, apparently the ipv6 address for the listserv has once again been listed on the spamhaus xbl | 13:16 |
fungi | seems to have started sometime around or maybe a bit before 2021-08-06 07:50 utc | 13:18 |
fungi | i'll get started on the delisting request again | 13:18 |
fungi | oh, looking at the entry for https://check.spamhaus.org/listed/?searchterm=2001:4800:780e:510:3bc3:d7f6:ff04:b736 it says "2001:4800:780e:510::/64 is making SMTP connections that leads us to believe it is misconfigured. [...] Please correct your HELO 'localhost.localdomain'" | 13:31 |
fungi | i think the problem is they're blanketing the entire /64 | 13:31 |
fungi | so it could be any server in that address range | 13:31 |
fungi | i last delisted it on 2021-08-03 at 15:51:05 so it took roughly 2.5 days to be re-added | 13:47 |
fungi | #status log Requested Spamhaus XBL delisting for the IPv6 /64 CIDR containing the address of the lists.openstack.org server (seems to have been re-listed around 2021-08-06 07:00 UTC, roughly 2.5 days after the previous de-listing) | 13:49 |
opendevstatus | fungi: finished logging | 13:49 |
fungi | i suppose we can open a ticket with rackspace requesting they talk to spamhaus about carrying individual /128 entries from that /64, but i have no idea whether they care. in the past, rackspace's guidance has been to smarthost through their mailservers instead (which also get added to blocklists on a fairly regular basis, from what i've seen) | 13:51 |
fungi | in the past we've talked about not moving or rebuilding that server in order to preserve its long-standing reputation, but that doesn't seem to be much help at the moment | 13:52 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803932 | 14:10 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803932 | 14:12 |
mhu1 | Hello, a while ago a spec revamping authentication on opendev services was discussed, where can I check the status for this? ie has anything been officially decided yet? Any action items on this? | 14:17 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Run elastic-recheck in container https://review.opendev.org/c/opendev/elastic-recheck/+/803932 | 14:21 |
fungi | mhu1: the spec was approved and is published at https://docs.opendev.org/opendev/infra-specs/latest/specs/central-auth.html but we're seeking volunteers to work through implementing it | 14:32 |
*** ykarel is now known as ykarel|away | 14:48 | |
opendevreview | Sorin Sbârnea proposed opendev/elastic-recheck master: Minor CI fixes https://review.opendev.org/c/opendev/elastic-recheck/+/803934 | 14:48 |
clarkb | has anyone looked at dequeuing the hourly deploy buildset that is holding our deploy builds in queued states? | 14:49 |
clarkb | I think the issue is the 0000000 hash doesn't play nice with that pipeline? I can do it after some breakfast if others haven't looked at it yet | 14:49 |
fungi | i haven't looked yet, been extinguishing other fires (listserv spamhaus delisting, afs docs volume with failing vos release) | 14:50 |
clarkb | fungi: re spamhaus corvus did mention we can force exim to use ipv4 | 14:51 |
clarkb | alright let me find some breakfast then I can take a look at the zuul queues | 14:52 |
fungi | ahh, yeah i suppose we could just avoid ipv6 outbound delivery if we have to | 14:53 |
fungi | thanks for the reminder | 14:53 |
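As a possible follow-up to the IPv4-only delivery idea above, a minimal sketch, assuming the option would go into Exim's main configuration on the listserv (the exact file depends on how Exim is packaged and managed there). Note that this option turns off IPv6 in Exim entirely, including listening, which may be broader than strictly needed for outbound delivery:

```
# Hypothetical Exim main-configuration snippet; disables IPv6 so all
# SMTP delivery (and listening) happens over IPv4 only.
disable_ipv6 = true
```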
*** marios|ruck is now known as marios | 15:06 | |
*** marios is now known as marios|ruck | 15:08 | |
clarkb | we've got a number of various infra-prod related things queued up. I'm going to try and dequeue them in such a manner that no jobs suddenly start running, because I'm not sure what order is appropriate. Instead I'll attempt to remove all of them, then the next hourly run can start up at 16:00 UTC | 15:14 |
clarkb | there is one change queued up but all it wants to run is install-ansible and that is run by the hourly job | 15:14 |
fungi | yeah, so dequeue the newest first back to the oldest i guess | 15:15 |
fungi | seems like these are fallout from the reenqueue at the last restart | 15:16 |
fungi | possibly the result of reenqueuing previously reenqueued items even | 15:16 |
fungi | (there were several restarts, right?) | 15:16 |
clarkb | ya there were at least 2 restarts back to back | 15:17 |
clarkb | ok that's done. The hourly run in 40 minutes should do the ansible config update that landed in 803047 for us | 15:19 |
clarkb | I checked the change; it's a minor update to address a deprecation warning, so it won't be the end of the world if this somehow doesn't apply in 40 minutes | 15:20 |
clarkb | we can investigate further at that point | 15:20 |
fungi | sounds good, thanks for the cleanup! | 15:20 |
*** jpena is now known as jpena|off | 15:32 | |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Prepare gitea 1.15.0 upgrade https://review.opendev.org/c/opendev/system-config/+/803231 | 15:45 |
clarkb | we have a new RC to test | 15:46 |
*** marios|ruck is now known as marios | 16:11 | |
*** rpittau is now known as rpittau|afk | 16:26 | |
clarkb | infra-root now I'm thinking something else is up with the hourly deploy jobs. They have been queued up for half an hour without any progress | 16:30 |
clarkb | corvus: ^ fallout from the updates that were made? Maybe we broke semaphores somehow? | 16:30 |
clarkb | Semaphore /zuul/semaphores/openstack/infra-prod-playbook can not be released for 26339974ead441c68415f5ee9aa8ffff-infra-prod-install-ansible because the semaphore is not held | 16:35 |
clarkb | I think the issue is grabbing the semaphore | 16:35 |
clarkb | We should have periodic cleanup of those semaphores though right? | 16:35 |
clarkb | _semaphore_cleanup_interval = IntervalTrigger(minutes=60, jitter=60) | 16:36 |
corvus | clarkb: i think that message is a red herring (which we should probably find and clean up). i say that based on seeing it before. i don't want to exclude it as a cause, but not the first place i'd look. | 16:37 |
corvus | i'll start digging in | 16:37 |
clarkb | I did the removal of all the other buildsets about an hour and 15 minutes ago, which should be more than enough time for the semaphore cleanup to have run since | 16:37 |
clarkb | corvus: ok thanks | 16:37 |
corvus | clarkb: oh, i didn't see that we're waiting on a semaphore... that's different. i retract that. ;) | 16:38 |
corvus | so let's revise that to say: there are potentially a lot of red-herring log messages about releasing semaphores. but one of them may be relevant. :/ | 16:39 |
*** marios is now known as marios|out | 16:39 | |
clarkb | corvus: 260e702e93d645d9b627b2569a1258e1 is the event that triggered the buildset if that helps | 16:39 |
clarkb | for the most recent 16:00 UTC enqueue | 16:39 |
corvus | fd02b512c5134c409d460e3365a0f170-infra-prod-remote-puppet-else is the current holder | 16:40 |
corvus | that appears to have been active on aug 7 | 16:41 |
clarkb | and possibly went away in the restarts? | 16:42 |
corvus | queue item fd02b512c5134c409d460e3365a0f170 was last spotted at 2021-08-07 17:43:21,552 | 16:44 |
corvus | clarkb: so yes -- that's between my 2 restarts on saturday | 16:45 |
corvus | the auto cleanup should have gotten it | 16:45 |
clarkb | corvus: I guess the thing to look at is why semaphore cleanups aren't cleaning it up. ya | 16:45 |
clarkb | `grep 'semaphore cleanup' /var/log/zuul/debug.log` shows no results but looking at the scheduler it seems that we should log that string when starting semaphore cleanups and when they fail | 16:46 |
corvus | yes, i also don't see any apsched log lines -- we may need to adjust our log config to get them? | 16:47 |
corvus | nothing is holding the semaphore cleanup lock in zk, so it should be able to run | 16:49 |
corvus | clarkb: oh! what if the lack of reconfiguration on startup means the initial cleanup never runs? | 16:50 |
clarkb | corvus: it looks like we call startCleanup() from scheduler.start(). Would that be impacted by ^? | 16:50 |
corvus | clarkb: maybe we should move to #zuul | 16:51 |
clarkb | ++ | 16:51 |
corvus | clarkb: prime() is called after start() from the cli wrapper | 17:01 |
corvus | oh, sorry that was old; i didn't manage the channel switch well :) | 17:02 |
corvus | anyway -- as mentioned in #zuul, i think i can repair our current issue without a restart; i will do so now | 17:02 |
clarkb | thanks | 17:04 |
corvus | 2021-08-09 17:03:58,592 ERROR zuul.zk.SemaphoreHandler: Releasing leaked semaphore /zuul/semaphores/openstack/infra-prod-playbook held by fd02b512c5134c409d460e3365a0f170-infra-prod-remote-puppet-else | 17:05 |
corvus | looks like we're gtg | 17:05 |
clarkb | the job has started | 17:05 |
clarkb | fungi: if you have time today https://review.opendev.org/c/opendev/system-config/+/803367/ is the last of my gitea cleanup changes prior to the gitea 1.15.0 rc upgrade change. This one is probably the largest of the changes but converts us over to using only the rest api for gitea management | 17:56 |
clarkb | in the process it simplifies some stuff that was previously split out into multiple requests, which should in theory make it run quicker too | 17:56 |
clarkb | With that landed we should be good to upgrade to 1.15.0 whenever it releases | 17:57 |
fungi | clarkb: i approved it, but as commented we likely want to keep an eye on things when the next project creation changes go through, just in case our testing coverage isn't as good as we think | 18:34 |
clarkb | yup | 18:34 |
fungi | config-core: ^ heads up | 18:34 |
clarkb | and thanks | 18:34 |
clarkb | I've removed my WIP from https://review.opendev.org/c/opendev/system-config/+/803374 as it seems gerrit is happily running with the mariadb connector since we restarted it | 18:44 |
clarkb | That change does a bit more cleanup and is quite a bit more involved now, so it deserves careful review, but we shouldn't be limited by gerrit itself | 18:44 |
opendevreview | Merged opendev/system-config master: Update gitea project creation to only use the REST API https://review.opendev.org/c/opendev/system-config/+/803367 | 19:27 |
clarkb | ianw: I think you can approve https://review.opendev.org/c/openstack/project-config/+/803411 when ready | 19:35 |
*** dviroel is now known as dviroel|brb | 20:06 | |
thomasb06 | fungi: hey. at the moment, i'm trying to understand your patch: https://bugs.launchpad.net/neutron/+bug/1784259 | 20:39 |
thomasb06 | what was happening? | 20:39 |
fungi | thomasb06: i didn't have a patch on that bug, just comments about inapplicability for issuing a security advisory | 20:42 |
fungi | maybe better to discuss in either #openstack-neutron or #openstack-security depending on the nature of your question | 20:43 |
*** dviroel|brb is now known as dviroel | 20:50 | |
opendevreview | Clark Boylan proposed opendev/system-config master: Accomodate zuul's new key management system https://review.opendev.org/c/opendev/system-config/+/803992 | 21:35 |
clarkb | corvus: ^ that starts to capture the opendev updates necessary to deal with the new key stuff | 21:35 |
corvus | clarkb: lgtm, 1 quick thought | 21:41 |
clarkb | corvus: good idea, do you think that should go in a separate file from the typical export location? or just overwrite what we typically have? | 21:43 |
corvus | clarkb: i was thinking typical location | 21:44 |
clarkb | ok | 21:44 |
corvus | should be fine -- after all, the backups have older versions | 21:44 |
opendevreview | Clark Boylan proposed opendev/system-config master: Accomodate zuul's new key management system https://review.opendev.org/c/opendev/system-config/+/803992 | 21:46 |
*** dviroel is now known as dviroel|out | 21:47 | |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Move grubenv to EFI dir, add a symlink back https://review.opendev.org/c/openstack/diskimage-builder/+/804000 | 23:01 |
clarkb | corvus: I've started learning a bit about prometheus and how we might deploy it via a bunch of reading. So far I think what I've got is we deploy it with docker-compose like most other services and then bind mount in its config as well as the location the config says to store the tsdb contents. I want to say you were suggesting that we just use the built in tsdb implementation? | 23:03 |
clarkb | corvus: what exporter were you suggesting we use to collect system metrics similar to what cacti does? Is it node-exporter? | 23:04 |
clarkb | The other bit that isn't quite clear to me is if the expression evaluator web service thing is included in a default install or if that runs alongside as a separate thing, but I think that is less important at startup | 23:06 |
tristanC | clarkb: here is how i'm running it to collect gerritbot stats: https://paste.opendev.org/show/807966/ | 23:25 |
clarkb | tristanC: thanks. I see now the prom/prometheus image specifies a config path and a storage tsdb path on the command line so those need to be provided. Also it exposes port 9090, which I'm going to guess is the dashboard thing | 23:27 |
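A minimal docker-compose sketch along the lines being discussed (not the contents of the paste above); the host paths, retention period, and port publishing are assumptions rather than decided values:

```yaml
# Hypothetical docker-compose.yaml for the Prometheus service.
version: '3'
services:
  prometheus:
    image: docker.io/prom/prometheus:latest
    ports:
      - "9090:9090"                              # built-in web UI / expression browser
    volumes:
      - /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - /var/lib/prometheus:/prometheus          # TSDB storage on a large volume
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=365d       # assumed retention, not a decided value
    restart: always
```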
tristanC | clarkb: yes, then you can add it to grafana by using such datasource config: https://paste.opendev.org/show/807966/ | 23:28 |
clarkb | tristanC: that was the same paste again :) | 23:28 |
tristanC | arg, i meant the next one: https://paste.opendev.org/show/807967/ :-) | 23:29 |
clarkb | thanks that helps | 23:29 |
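For reference, the generic shape of a Grafana datasource provisioning file for Prometheus looks roughly like the following; this is a hedged sketch rather than the contents of the paste, and the URL is a placeholder:

```yaml
# Hypothetical Grafana datasource provisioning file, e.g. dropped into
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.example.org:9090   # placeholder address for the Prometheus host
    isDefault: false
```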
tristanC | clarkb: you're welcome, i'd be happy to help with the setup, is the spec already proposed? | 23:30 |
corvus | clarkb: if we want to use snmp, there's an snmp exporter | 23:30 |
corvus | clarkb: that can be a single instance, so we could re-use our existing snmp configuration without running a separate collector | 23:31 |
clarkb | tristanC: no I'm still putting things together into my head first. Then next step is writing the spec down. Hopefully tomorrow | 23:31 |
clarkb | corvus: oh nice we can run the snmp exporter adjacent to prometheus then we only need to update firewall rules | 23:31 |
corvus | clarkb: yep | 23:32 |
clarkb | tristanC: I'll ping you when I push it up | 23:32 |
clarkb | tristanC: but I think the rough idea right now is deploy a new instance with a large volume for the tsdb content. Run prometheus using the prom/prometheus image from Docker Hub using docker-compose. And now the bit about snmp has clicked, so we'll run the snmp exporter adjacent to that | 23:33 |
clarkb | and then we can also mix in specific application scraping like for gerritbot | 23:33 |
corvus | and zuul etc | 23:33 |
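A hedged sketch of what direct application scraping could look like in prometheus.yml; the gerritbot endpoint host and port are placeholders (whatever change 803125 ends up exposing):

```yaml
# Hypothetical scrape job for an application that exposes its own
# /metrics endpoint (e.g. gerritbot once 803125 lands).
scrape_configs:
  - job_name: gerritbot
    static_configs:
      - targets: ['eavesdrop.example.org:8001']   # placeholder host:port
```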
tristanC | to collect general host metrics, there is also the node_exporter, which you can run in a container too | 23:33 |
clarkb | tristanC: yup I asked about that one above. But that has to run on each host right? | 23:34 |
tristanC | yes | 23:34 |
clarkb | ya I think it will be easier to stick with snmp at least while we spin stuff up. Not everything has docker installed on it etc | 23:34 |
tristanC | it doesn't have to be all in one either; for example, once gerritbot has the endpoint (with 803125), i can run the monitoring from my setup | 23:36 |
clarkb | right, but I'm saying that for node metric collection, needing to deploy a container across all of our instances and modify firewall rules for that will be a fairly big undertaking. It will be much simpler to start with snmp, which can run centrally and do the polling, since we already have snmp configured | 23:37 |
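A sketch of that central-SNMP approach, roughly following the documented snmp_exporter scrape pattern; the target hostnames, module name, and exporter address are examples/assumptions:

```yaml
# Hypothetical prometheus.yml job that polls hosts over SNMP through a
# single snmp_exporter running next to Prometheus.
scrape_configs:
  - job_name: snmp
    metrics_path: /snmp
    params:
      module: [if_mib]                 # example module from the exporter's snmp.yml
    static_configs:
      - targets:
          - lists.openstack.org        # hosts to poll over SNMP (examples)
          - review.opendev.org
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target   # tell the exporter which host to query
      - source_labels: [__param_target]
        target_label: instance         # keep the polled host as the instance label
      - target_label: __address__
        replacement: 127.0.0.1:9116    # the snmp_exporter itself (its default port)
```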
tristanC | sure that sounds good. Another thing to consider is the alertmanager configuration: it would be nice to have custom route, where interested party could subscribe to some alerts. | 23:50 |
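On the alerting side, per-interest routing in Alertmanager could look something like the sketch below; the label used for matching, the receiver names, and all addresses are assumptions:

```yaml
# Hypothetical alertmanager.yml showing a custom route that a specific
# interested party could subscribe to, with everything else falling
# through to a default receiver.
global:
  smtp_smarthost: localhost:25
  smtp_from: alertmanager@example.org
route:
  receiver: infra-root
  routes:
    - matchers:
        - service="gerritbot"          # assumed alert label
      receiver: gerritbot-interested
receivers:
  - name: infra-root
    email_configs:
      - to: infra-root@example.org
  - name: gerritbot-interested
    email_configs:
      - to: gerritbot-alerts@example.org
```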
clarkb | Looking at things, it seems the graphing implementation in prometheus may be more like the one in graphite. If we want graphs like cacti produces, we'll need to set up our grafana to talk to prometheus and give it a bunch of config to render those graphs. Not a big deal but I want to understand the scope of the work here | 23:57 |