*** ykarel|away is now known as ykarel | 04:24 | |
*** sshnaidm is now known as sshnaidm|afk | 04:39 | |
*** ysandeep|away is now known as ysandeep | 05:10 | |
*** jpena|off is now known as jpena | 07:39 | |
*** ykarel is now known as ykarel|lunch | 09:24 | |
*** ykarel|lunch is now known as ykarel | 10:33 | |
*** sshnaidm|afk is now known as sshnaidm | 10:53 | |
*** jpena is now known as jpena|lunch | 11:27 | |
zbr | did we have any recent changes made to the logstash instance? it stopped working a couple of days ago. | 11:44 |
zbr | http://status.openstack.org/elastic-recheck/ -- apparently ALL_FAILS_QUERY may have become broken? | 11:45 |
zbr | interesting bit is that I see "There were no results because no indices were found that match your selected time span" for any timespan *shorter* than 7 days, like 2 days. | 11:49 |
*** jpena|lunch is now known as jpena | 12:27 | |
*** ministry is now known as __ministry | 13:10 | |
*** ysandeep is now known as ysandeep|afk | 13:52 | |
pmatulis | i got a merge for a doc project but the guide encountered a build problem. anyone here have visibility to that? | 14:10 |
pmatulis | https://review.opendev.org/c/openstack/charm-deployment-guide/+/798273 | 14:15 |
clarkb | zbr: we have made zero changes as far as I know; the whole thing is in deep hibernation because it needs major updates | 14:40 |
clarkb | zbr: if there are no results for data less than 7 days then the processing pipeline has probably crashed | 14:42 |
*** ysandeep|afk is now known as ysandeep | 14:43 | |
zbr | prob crashed somewhere between 2 and 7 days ago based on what I see. | 14:49 |
pmatulis | can we fire it up again? :) | 14:53 |
*** ysandeep is now known as ysandeep|away | 15:02 | |
*** ykarel is now known as ykarel|away | 15:08 | |
clarkb | pmatulis: I believe your issue is distinct from the one zbr has pointed out | 15:28 |
clarkb | fungi: ^ I suspect pmatulis is hitting a problem similar to the one you are debugging for cinder-specs | 15:30 |
clarkb | fungi: do we have a writeup on that we can point people to? or maybe prior art for the type of fix we want to use for that? I think all of that got finalized while I was out the other week | 15:30 |
clarkb | pmatulis: your issue is being debugged in #opendev in another context (cinder-specs) | 15:33 |
clarkb | but I believe they are the same underlying problem | 15:33 |
pmatulis | ok, looking over there | 15:33 |
zbr | clarkb: i am aware that the email thread related to logstash ended in limbo; as I really doubt the infra team will have time to address it, I wonder if I could take over and attempt a manual upgrade | 15:33 |
clarkb | zbr: we are happy for someone else to run such a system but we wouldn't keep running the resources for it if it isn't managed through our normal systems | 15:35 |
clarkb | (and even then its questionable if we should keep running the system at all due to its large resource consumption) | 15:35 |
fungi | it needs to be redesigned with more modern technologies and in a more efficient manner | 15:36 |
zbr | imho the large resource consumption is related to its outdated state; we could likely scale it down and run it more efficiently | 15:36 |
fungi | it accounts for half of all the ram consumed by our control plane servers today | 15:36 |
clarkb | zbr: I don't know that that is actually the case | 15:36 |
clarkb | we feed it upwards of a billion records a day. It's a huge database | 15:36 |
clarkb | upgrading elasticsearch will hopefully make that better resource-wise, but not majorly | 15:37 |
zbr | i have no desire to manage it manually long term, only to perform the upgrade manually and have playbooks to keep it running after. | 15:37 |
clarkb | But to answer your question, no, we wouldn't hand over the resources for manual management. If people want to work to update the systems through our configuration management we can discuss what that would look like | 15:37 |
zbr | as we said, the rdo team already has playbooks to deploy it, so i should be able to reuse some of them | 15:37 |
clarkb | but also anyone can run such a system if they choose, manually or not. The data is public | 15:38 |
clarkb | This is our preference since we want to avoid getting saddled with this same situation in a few years when it gets outdated again and there isn't sufficient aid for keeping it running | 15:38 |
fungi | similar to stackalytics, we tried (and failed) to get that moved into our infrastructure at one point, but ultimately there's nothing it needs privileged access to so any organization can run it for us on the public internet | 15:39 |
fungi | and in so doing, run it however they wish | 15:39 |
zbr | clarkb: how often did you successfully upgrade a distro using configuration management? | 15:41 |
clarkb | zbr: we've done it multiple times recently with other services. | 15:41 |
clarkb | review is in the process of getting similar treatment | 15:42 |
clarkb | (but I did all of zuul + nodepool + zookeeper as an example) | 15:42 |
zbr | my lucky guess is that re-deploying from zero using current versions is likely less effort than an attempted upgrade, clarkb WDYT? | 15:45 |
clarkb | zbr: when we say "upgrade" we typically mean redeploy on newer software and migrate state/data as necessary. In the case of elasticsearch I don't think we would bother migrating any data because it is already highly ephemeral. However, you cannot simply redeploy using the current software or configuration management, which is why we are in the position we are in | 15:47 |
clarkb | it needs to be redone | 15:47 |
zbr | I wonder if a slimmed-down version that runs on a single host would be ok. if i remember correctly that is what rdo has. | 15:49 |
clarkb | if there are people interested in figuring that out we are willing to discuss what it looks like and whether or not it is possible for us to keep hosting it with that work done. But it is hard to have that conversation without volunteers and a bit of info on resource sizing and likely future upgrade paths | 15:49 |
clarkb | zbr: if that can handle the data thrown at it I don't see why not. History says this is unlikely to be the case though | 15:49 |
zbr | maybe we should be very picky about how much data we allow to be fed into it, setting fixed limits per job. | 15:50 |
clarkb | again all of that is possible, but it requires someone to do the work. That isn't something that exists already (and is actually somewhat difficult to design in a fair way) | 15:51 |
zbr | instead of attempting to index as much as possible, trying to index only what we consider meaningful. | 15:51 |
clarkb | for the record we already discard all debug level logs | 15:51 |
clarkb | if we didn't our input size would be like 10x | 15:51 |
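(A minimal sketch, not the actual indexer code, of the kind of severity filter being described here; it assumes oslo.log-style lines where the level is the fourth whitespace-separated field, and the file name is purely illustrative:)

import re

# Hedged sketch: assumes oslo.log-style lines such as
# "2021-06-29 15:51:02.123 1234 DEBUG nova.compute.manager ..."
DEBUG_LINE = re.compile(r'^\S+ \S+ \S+ DEBUG ')

def drop_debug(lines):
    """Yield only the lines worth indexing, skipping DEBUG output."""
    for line in lines:
        if DEBUG_LINE.match(line):
            continue
        yield line

if __name__ == '__main__':
    # Illustrative usage on a local job log before submitting it for indexing.
    with open('screen-n-cpu.txt') as f:
        for line in drop_debug(f):
            print(line, end='')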
zbr | indeed, getting the design on this right is very hard; in fact I am more inclined to believe it is the kind of task that can easily become a "full time job" | 15:52 |
fungi | zbr: what was pitched to the board in the meeting a few hours ago is that it's at least two full-time jobs minimum, probably more | 15:53 |
clarkb | fungi: I had tried to stay awake for the board meeting but was falling asleep on the couch well before it started :/ | 15:54 |
clarkb | so I gave up and went to bed | 15:54 |
fungi | the tc made a plea to the openinfra foundation board of directors to find some people within their organizations who can be dedicated to running a replacement for this | 15:54 |
fungi | the board members are going to mull it over and hopefully try to find people | 15:54 |
clarkb | right, as far as headcount goes you probably need ~1 person just to keep the massive database happy. Then you also need another person feeding the machine that "scans" the database | 15:55 |
clarkb | you can share the responsibilities but when looked at from an effort perspective the combination (and one is not useful without the other) should be ~2 individuals | 15:55 |
clarkb | elasticsearch02 and elasticsearch06 were not running elasticsearch processes. I have restarted the processes there | 16:11 |
clarkb | we also ended up with tons of extra shards because elasticsearch02 is where we run the cleanup cron. I have now triggered the cleanup cron manually. I'll check on the indexers shortly | 16:11 |
fungi | with two cluster members out, did we end up corrupting the entire existing dataset? | 16:12 |
clarkb | fungi: I think we just stopped properly updating it at that time ya | 16:13 |
clarkb | (also looks like our indexing bug that gives timestamps for the far future is resulting in a number of new indexes we don't want) | 16:13 |
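(For anyone following along, a hedged way to check this kind of cluster state; it assumes an Elasticsearch node reachable at localhost:9200 and the usual logstash-YYYY.MM.DD index naming, neither of which is confirmed in the discussion above:)

import datetime
import json
from urllib.request import urlopen

ES = 'http://localhost:9200'  # assumption: a cluster member answers here

# Overall cluster state: green/yellow/red plus shard counts.
health = json.load(urlopen(ES + '/_cluster/health'))
print(health['status'], 'active shards:', health['active_shards'],
      'unassigned:', health['unassigned_shards'])

# Flag any index dated in the future (the timestamp bug mentioned above),
# assuming logstash-YYYY.MM.DD index names.
today = datetime.date.today()
for name in urlopen(ES + '/_cat/indices?h=index').read().decode().split():
    try:
        when = datetime.datetime.strptime(name, 'logstash-%Y.%m.%d').date()
    except ValueError:
        continue
    if when > today:
        print('future-dated index:', name)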
clarkb | geard was not running on logstash.o.o and has been restarted (it probably crashed once its queues got too large since nothing was reducing their size) | 16:14 |
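(A hedged sketch of watching those queues, assuming geard on logstash.o.o speaks the standard gearman plain-text admin protocol on port 4730, where 'status' returns one tab-separated row per registered function and a lone '.' terminator:)

import socket

def gearman_status(host='localhost', port=4730):
    # Assumption: the server answers the standard admin 'status' command.
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(b'status\n')
        data = b''
        while not data.endswith(b'.\n'):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    for row in data.decode().splitlines():
        if row == '.':
            continue
        name, queued, running, workers = row.split('\t')
        yield name, int(queued), int(running), int(workers)

for name, queued, running, workers in gearman_status():
    print(f'{name}: {queued} queued, {running} running, {workers} workers')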
clarkb | looks like some of the indexers actually stayed up but many did not. I'm getting those restarted as well | 16:16 |
clarkb | indexers that were not running have been started. That was the vast majority (probably 80%) | 16:29 |
*** jpena is now known as jpena|off | 16:49 | |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Correct targets for afsdocs_secret-deploy-guide https://review.opendev.org/c/openstack/project-config/+/798932 | 16:54 |
*** anbanerj|rover is now known as frenzy_friday | 17:07 | |
opendevreview | Jeremy Stanley proposed openstack/project-config master: Correct branch path in afsdocs_secret-releasenotes https://review.opendev.org/c/openstack/project-config/+/798938 | 17:45 |
opendevreview | Merged openstack/project-config master: Correct targets for afsdocs_secret-deploy-guide https://review.opendev.org/c/openstack/project-config/+/798932 | 18:01 |
*** gfidente is now known as gfidente|afk | 18:11 | |
clarkb | bike rides always produce great ideas: re resource usage, we can do an experiment and shut down half the processing pipeline to see if we still need it | 19:29 |
clarkb | for the elasticsearch servers themselves, cacti gives us an idea of how much free space we have if we want to scale those down; I looked not too long ago and we can't easily scale those down due to disk space requirements | 19:29 |
clarkb | for the indexer pipeline we can monitor the delay in indexing things (e-r reports this) | 19:30 |
clarkb | for elasticsearch servers, if others want to watch it via cacti, we have a total of 6TB of storage across the cluster, 1TB per server. Because we run with a single replica we need to keep 1TB of free headroom to accommodate losing any one member of the cluster | 19:30 |
clarkb | that means our real ceiling there is 5TB even though we have 6TB total | 19:31 |
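(The same headroom arithmetic as a tiny worked example; the node count and sizes are as stated above, while the current-usage figure is made up for illustration and would come from cacti in practice:)

# 6 nodes at 1 TB each; with a single replica we keep one node's worth of
# space free so the cluster can survive losing any one member.
nodes = 6
per_node_tb = 1.0
total_tb = nodes * per_node_tb        # 6.0 TB raw
usable_tb = total_tb - per_node_tb    # 5.0 TB effective ceiling

used_tb = 4.2  # hypothetical current usage, read from cacti in practice
print(f'ceiling {usable_tb:.1f} TB, used {used_tb:.1f} TB, '
      f'headroom {usable_tb - used_tb:.1f} TB')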
opendevreview | Merged openstack/project-config master: Correct branch path in afsdocs_secret-releasenotes https://review.opendev.org/c/openstack/project-config/+/798938 | 19:42 |
clarkb | fungi: zbr ^ does turning off half of the indexer pipeline and doing a binary search for what we need from there make sense to you? If so I can start that | 19:46 |
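(A hedged sketch of that binary search as a procedure; the total of 20 workers and the "kept up" test are purely illustrative, since the real signal would be the indexing delay that e-r reports:)

def next_worker_count(known_good, known_bad):
    """Bisect between a worker count that kept up and one that did not."""
    return (known_good + known_bad) // 2

# Illustrative run: assume 20 indexer workers, all currently running and
# known to keep up, and narrow in on the smallest count that still does.
good, bad = 20, 0
while good - bad > 1:
    trial = next_worker_count(good, bad)
    kept_up = trial >= 8  # stand-in for "indexing delay stayed acceptable"
    if kept_up:
        good = trial
    else:
        bad = trial
print('smallest worker count that kept up:', good)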
opendevreview | Kendall Nelson proposed openstack/ptgbot master: Add Container Image Build https://review.opendev.org/c/openstack/ptgbot/+/798025 | 20:35 |