hashar | fungi: I got another hack for `git-review` to let it shows some colors when pushing a change https://review.opendev.org/c/opendev/git-review/+/914000 :) | 06:47 |
hashar | the remote sends back some messages and cgit colors some of them, but that only happens when run interactively (stdin or stderr is a tty) | 06:48 |
opendevreview | Antoine Musso proposed opendev/git-review master: Show remote side colors when color is enabled https://review.opendev.org/c/opendev/git-review/+/914000 | 06:49 |
hashar | :) | 06:49 |
opendevreview | Tobias Rydberg proposed opendev/irc-meetings master: We have decided to adjust meeting time to 0700 during summer time. https://review.opendev.org/c/opendev/irc-meetings/+/914940 | 08:06 |
naman | hello | 11:19 |
naman | anyone online here ?? | 11:20 |
fungi | not sure what's going on, but in the middle of the openstack coordinated release we've been getting stalled by near-constant "management events" (i guess pipeline reconfigurations?) taking upwards of 40 minutes to clear | 13:27 |
fungi | maybe it's lots of job configuration changes merging, though that seems unlikely. i wonder if there's ongoing changes in some external repo we're watching that's triggering them | 13:28 |
fungi | i don't really know what to look for in the scheduler debug logs to isolate the cause | 13:28 |
fungi | there's not that many tenant reconfigurations in the log | 13:30 |
fungi | one at 09:59:42 because of an update to openstack/magnum and another at 11:16:44 from an update to nebulous/nebulous | 13:31 |
fungi | oh, though the latter was for the nebulous tenant | 13:31 |
fungi | so yeah, only one reconfiguration for the openstack tenant during the relevant time period | 13:32 |
fungi | i wonder what else those per-pipeline management events could signify | 13:32 |
fungi | because trigger and result events just seem to keep stacking up until the management event clears, which seems to take ages | 13:33 |
frickler | I'm also seeing semi regular errors in zuul debug logs due to the replication mentioned in #zuul https://opendev.org/opendev/sandbox/commit/1c3e7447e7046e2bb6d7ae364e4f4c83090c185a , not sure if that can affect even processing | 13:44 |
fungi | future improvement idea: move opendev/sandbox to a dedicated "sandbox" tenant in zuul | 13:47 |
Clark[m] | fungi: reconfigurations should be pretty clearly logged if that is what is doing it. I don't think that dependency in a supercedent pipeline should affect zuul that way, but you can dequeue it manually to check | 13:53 |
Clark[m] | You seem to have touched on this in the release channel a little but my hunch is it has to do with all the ref creation events for the tags. Each of those tags represents a zuul config I think | 13:54 |
Clark[m] | Though that shouldn't be any different than previous releases | 13:54 |
fungi | right now the release-post pipeline has a new management event, which seems to mean we wait half an hour for its status to update and anything else there to start running | 13:55 |
fungi | er, sorry, release pipeline this time | 13:56 |
fungi | this has been occurring over and over in different pipelines though | 13:56 |
fungi | okay, now it's up to 2 management events for the release pipeline | 13:56 |
Clark[m] | fungi: is there an event id in the logs associated with those events that you can trace to see where the time is lost? | 14:01 |
fungi | those did clear, but it's still not processing all the results (been sitting at 0 trigger, 0 management, 13 results for a while) | 14:01 |
Clark[m] | And have we evicted the sandbox change that may be problematic? | 14:02 |
fungi | i haven't looked at the sandbox change, not sure what that's all about | 14:02 |
frickler | Clark[m]: the changes were merged, you would need to merge the revert | 14:02 |
fungi | or whether it needs to be preserved in the pipeline for inspection | 14:03 |
frickler | would it make sense to take a stacktrace of zuul schedulers at this point in time? | 14:03 |
Clark[m] | frickler no I mean evict it from the zuul queues if it is still there | 14:03 |
fungi | okay, release pipeline is moving again now | 14:03 |
frickler | Clark[m]: yes, it is in post, so not a change but a ref. we could try to evict it unless corvus1 wants to do some looking at it first | 14:03 |
Clark[m] | fungi: someone was being "helpful" trying to reproduce a bug they found in zuul via the sandbox repo | 14:03 |
Clark[m] | It apparently got an event stuck in the queues and we could in theory manually dequeue it if we were concerned it is related | 14:04 |
Clark[m] | But also just tailing scheduler logs during the pauses should give hints as to where the time is lost. Might be noisy but you'd see a lot of git merge events for reconfigurations for example | 14:05 |
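(For context, a rough sketch of the kind of log tracing being suggested here, assuming the scheduler writes its debug log to /var/log/zuul/debug.log; the path, grep patterns, and event id below are illustrative placeholders, not taken from the actual incident.)

```sh
# Follow the scheduler debug log while a pipeline looks stalled; bursts of
# merge/reconfiguration activity should stand out.
tail -f /var/log/zuul/debug.log | grep -Ei 'reconfig|merg'

# If a specific event id is known, trace just that event through the log to
# see where the time between its receipt and processing goes.
grep 'example-event-id-1234' /var/log/zuul/debug.log
```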
frickler | it seems they also added the failing job to the periodic pipeline, so it would likely repeat anyway. I just don't want to break the running release jobs, they still do seem to be progressing | 14:05 |
frickler | fwiw I'm still getting no results on the builds/buildsets pages unless I add some filter like for a job or project | 14:40 |
frickler | that also only affects the openstack tenant | 14:41 |
fungi | it does seem to spin for a while like it's retrieving data though | 14:42 |
frickler | oh, there's an error in the zuul web log: 2024-04-03 14:41:41,066 ERROR cherrypy.error.139740297355920: sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (3024, 'Query execution was interrupted, maximum statement execution time exceeded') | 14:42 |
fungi | fun! | 14:42 |
frickler | seems zuul show show an error in the UI for that | 14:42 |
frickler | *should show | 14:43 |
frickler | MAX_EXECUTION_TIME(30000) - I think that will be 30s? | 14:44 |
Clark[m] | Ya js uses milliseconds iirc | 14:45 |
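(For reference, MAX_EXECUTION_TIME here is a MySQL optimizer hint and its argument is in milliseconds, so 30000 does come out to 30 seconds. A minimal sketch of how such a hint is attached to a query; the table and columns are placeholders rather than Zuul's actual schema.)

```sh
# Cap a SELECT at 30 seconds; MySQL aborts it with error 3024 ("maximum
# statement execution time exceeded") if it runs longer than that.
mysql zuul -e "SELECT /*+ MAX_EXECUTION_TIME(30000) */ id, job_name, result
               FROM example_build ORDER BY id DESC LIMIT 100;"
```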
frickler | maybe we need to somehow archive old builds? or just add some proper indices? | 14:46 |
clarkb | frickler: was that with a specific query? | 14:49 |
fungi | clarkb: just go to https://zuul.opendev.org/t/openstack/builds with no query params | 14:49 |
clarkb | ok so no query parameters result in that. I wonder what changed | 14:50 |
clarkb | fwiw the order of magnitude of log entries for sandbox-publish-docs is low enough that I don't think we're in a tight loop adding failed builds to the db due to that | 14:52 |
frickler | clarkb: the timeout was without a query | 14:53 |
fungi | i'm wondering if zuul could be getting flooded with internal events or something to cause its queue and tenant event processing to just get drowned out resource-wise | 14:58 |
clarkb | we do appear to be getting crawled possibly in a manner similar to what we've seen with gitea | 14:59 |
fungi | oh, like zuul-web could be bombarding the db with queries and we're hanging on writes? | 15:00 |
fungi | though i wouldn't expect that to block at the db level | 15:00 |
*** blarnath is now known as d34dh0r53 | 15:00 | |
clarkb | no, but they are querying things like change logs which may also do db lookups? | 15:00 |
fungi | load average on the schedulers is very low too | 15:00 |
fungi | oh, but maybe the trove instance is overloaded | 15:01 |
clarkb | ya its a db load issue I think | 15:01 |
clarkb | web is just a simple passthrough to fetch that data | 15:01 |
clarkb | I don't think we're running the apache filter of user agents on zuul web's apache instances but I think we should | 15:02 |
clarkb | but that may still be an orthogonal issue | 15:02 |
clarkb | if you look at apache logs there is a python requests user agent hitting the 500 error for the timeout consistently for some time | 15:06 |
fungi | that could certainly be related to the query timeouts we're seeing too | 15:06 |
fungi | either caused by, or causing | 15:06 |
clarkb | I think that is our culprit and won't be impacted by our existing user agent filter (and we probably don't want it to be until we understand better) | 15:06 |
clarkb | seems to have started sometime yesterday | 15:06 |
clarkb | it's in a tight loop failing about twice a minute (which aligns with our 30 second timeout) | 15:07 |
clarkb | no, never mind, it's older than that | 15:08 |
clarkb | (my greps are getting better) | 15:08 |
clarkb | it started with the 500 errors on Sunday (March 31) | 15:10 |
clarkb | it seems before that it was iterating through 100 builds at a time (arg why do you need to rescan over and over again) | 15:10 |
clarkb | but now it gets the 500 and retries the first request over and over again in a loop | 15:10 |
fungi | could it be ci-log-processor or whatever it's called? | 15:10 |
fungi | but yeah, it's probably some phd student doing "research" like we've seen knock gerrit offline in the past | 15:11 |
clarkb | [2024-03-30 07:18:01.736] ok I think this is when it starts based on the first instance of a 500 error in the apache log on zuul02 | 15:12 |
clarkb | and I think zuul-web was restarted on march 30 at 07:13 | 15:12 |
frickler | just for reference it seems we have about 20M builds in 4M buildsets. that sounds to me like something mysql should still be able to handle reasonably | 15:12 |
clarkb | I'm beginning to suspect a regression in zuul after our automatic weekend upgrade | 15:13 |
fungi | yeah, unless someone's pounding it with queries maybe | 15:13 |
clarkb | corvus1: ^ fyi | 15:13 |
fungi | but yes, the timing is conspicuous | 15:13 |
clarkb | `zgrep '/api/tenant/openstack/builds?' zuul-ssl-access.log.4 | grep 'HTTP/1.1" 500' | head` to find the first occurence that I see | 15:13 |
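(A sketch of one way to extend that grep to see which user agents are hitting the failing endpoint, assuming the standard Apache combined log format where the user agent is the last quoted field.)

```sh
# Count user agents on 500 responses to the builds API, most frequent first.
zgrep '/api/tenant/openstack/builds?' zuul-ssl-access.log.4 \
  | grep 'HTTP/1.1" 500' \
  | awk -F'"' '{print $(NF-1)}' \
  | sort | uniq -c | sort -rn | head
```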
clarkb | I wonder if that may be related to the slow event processing too (a bug affecting zuul in both ways I mean) | 15:14 |
fungi | without more data, it seems at least possible that they're linked | 15:14 |
frickler | checking git log, https://review.opendev.org/c/zuul/zuul/+/913875 springs to my eye | 15:14 |
clarkb | I think there are ~3 things going on that we should consider addressing: 1) good bots that respect robots.txt are crawling data that probably shouldn't be crawled, so we should write a robots.txt to refine what, if anything, gets crawled; 2) bad bots that use weird user agents, which we should block by applying our existing apache ruleset on zuul's apache servers; and 3) a likely zuul regression | 15:20 |
clarkb | after the weekly upgrade leading to timeouts trying to load builds from the db and possibly other related slowness | 15:20 |
clarkb | 1) and 2) are straightforward and may even help the situation with 3) by reducing total queries | 15:21 |
frickler | I'm no mysql expert but could that query block the table for the 30s while it is running and thus cause executor updates like event processing to be blocked? | 15:21 |
clarkb | frickler: generally I would expect read locks on rows to not interfere with write locks, but maybe we don't have those locks properly set up in the orm and it is being conservative and safe? | 15:22 |
clarkb | I think we're in no worse a spot than we were since the weekend so I'll take a break here for morning things like food and shower. I can work on 1) and/or 2) when I get back | 15:23 |
clarkb | frickler: I would agree that change you identified is suspicious | 15:26 |
corvus1 | i'll take a look at that query | 15:58 |
*** gthiemon1e is now known as gthiemonge | 15:59 | |
clarkb | unrelated to everything else rpm's man page doesn't have ascii -'s it has ‐'s instead. This is very disappointing | 16:17 |
frickler | ah, wasn't there a big discussion about what's typographically correct recently? | 16:28 |
fungi | not sure about recently, but it's been coming up in *roff related discussions for at least 3 decades i can remember, likely closer to 5 | 16:30 |
opendevreview | Clark Boylan proposed opendev/system-config master: Apply apache us filter to Zuul https://review.opendev.org/c/opendev/system-config/+/914985 | 16:34 |
clarkb | ya it came up somewhere maybe hn | 16:35 |
clarkb | I would've expected it to be a solved problem in man pages for popular utilities like rpm though | 16:35 |
clarkb | 914985 is 2) from above | 16:35 |
clarkb | 1) I realized is a bit more tricky because we have to decide what if anything we want to be crawlable. I'm tempted to say nothing should be? | 16:35 |
clarkb | but at the very least crawling our log files seems like unnecessary bw for our cloud providers | 16:36 |
fungi | there's no reference material on zuul.o.o, so i'm inclined to say there's nothing worth indexing by search engines at least | 16:37 |
fungi | also the ability of unvetted actors to use it to publish content/urls in some locations probably means search engine indexing creates an attractive nuisance for spammers and scammers | 16:38 |
clarkb | ack I'll make a blanket please don't index rule | 16:39 |
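(A blanket "don't index anything" robots.txt is only a couple of lines; a minimal sketch of the rule being described, with an illustrative file path — not necessarily how the actual change deploys it.)

```sh
# Ask well-behaved crawlers not to index anything on this site.
sudo tee /var/www/robots.txt <<'EOF'
User-agent: *
Disallow: /
EOF
```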
*** cid__ is now known as cid | 16:55 | |
*** jroll00 is now known as jroll0 | 16:55 | |
*** Reverbverbverb_ is now known as Reverbverbverb | 16:55 | |
*** gmann_ is now known as gmann | 16:55 | |
clarkb | I think we lost gerritbot | 16:58 |
clarkb | remote: https://review.opendev.org/c/opendev/system-config/+/914989 Add robots.txt to Zuul web [NEW] | 16:58 |
clarkb | That's 1) and 2) proposed. reviews welcome | 17:00 |
corvus1 | regarding the database issue: i think i have two half-solutions that add up to a whole one: first, we can add some index hints to the query, that improves the runtime for queries locally for me with mysql 8.x. they do not improve the runtime in all cases on the opendev production database. also, they might actually produce the wrong data on opendev's production mysql. so i think we should also move to a mysql 8.x trove instance. we can do | 17:11 |
corvus1 | that relatively quickly, though it will require a small downtime (maybe 20 mins?) for the final copy step. | 17:11 |
clarkb | we did have a small outage previously where the db wasn't available and zuul mostly kept working but things weren't recorded properly obviously | 17:13 |
clarkb | I guess we can decide if we want to shut down zuul or go through that half-working state again? | 17:13 |
corvus1 | true, we could actually just decide to lose a few minutes of data | 17:13 |
clarkb | corvus1: also I guess we want to land the index updates first? Any idea what caused the change? is it the change frickler identified? | 17:14 |
corvus1 | we would just end up with missing builds. | 17:14 |
clarkb | and if it is that change I guess my next thought is "is that change valid if it has such a big impact" | 17:14 |
clarkb | re downtime or missing builds I'll defer to fungi and frickler as they are more involved in the openstack release process today. I suspect we can tolerate that elsewhere | 17:14 |
corvus1 | clarkb: that change gave us correct results at the expense of speed; so a revert is a reasonable solution for opendev, but not a good long term solution | 17:15 |
clarkb | I see | 17:15 |
corvus1 | adding the index hints would give us potentially incorrect results in opendev, and not necessarily speed up all the queries we need to | 17:16 |
clarkb | but with a more modern mysql it would both be correct and relatively quick (at least compared to human wall clock scale?) | 17:16 |
corvus1 | i think the incorrect results from adding the index hints on old mysql would be more visible than anything we've seen yet -- the builds page would show the first 100 builds instead of the last 100. despite the very explicit "ORDER BY ... DESC" in the query. | 17:17 |
corvus1 | so i think in opendev, we want mysql 8 first, then the index hints. | 17:18 |
clarkb | ++ that does seem like it would be problematic for the common use case of that search page | 17:18 |
corvus1 | and yes, once we have both, at least in my local testing, the builds and buildset queries should both be nearly instantaneous. | 17:18 |
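(A hedged illustration of the kind of index hint under discussion; the table, index, and join here are placeholders rather than Zuul's real schema or the actual change at 914990. As noted above, on the old MySQL the hinted plan could walk the index in the wrong direction and return the first 100 rows despite the ORDER BY ... DESC.)

```sh
# Hypothetical shape of a builds listing query with an explicit index hint.
mysql zuul -e "SELECT b.id, b.uuid, b.result
               FROM example_build b FORCE INDEX (PRIMARY)
               JOIN example_buildset bs ON b.buildset_id = bs.id
               ORDER BY b.id DESC LIMIT 100;"
```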
clarkb | so ya no objections from me. I think the main thing is to coordinate with fungi and frickler to ensure we minimize impact to the openstack release process (not sure if anything is left for that) | 17:19 |
corvus1 | agree; i'll spin up a trove mysql 8 to prep, but await more input before proceeding past that | 17:20 |
fungi | sorry, had to step away for a moment but catching back up... | 17:22 |
fungi | openstack release activities have wound down now, so i'm not concerned about a few missing build results if we go that route | 17:23 |
fungi | also, unless we stop gerrit at the same time, having zuul offline would mean missing events, which isn't necessarily any better | 17:25 |
fungi | i *think* rackspace lets you do in-place mysql upgrades, but a cut-over to an already running and populated second instance might be faster | 17:26 |
fungi | what's the order of operations for swapping out the trove instance in that case? | 17:26 |
fungi | is zuul going to reread the configuration for the dburi or is there still some sort of restart needed? | 17:27 |
clarkb | fungi: I think its prevent access from zuul to existing trove instance, dump db, restore db into new instance, update zuul to point at new instance | 17:27 |
corvus1 | start mysql 8; import data; restart zuul-web with new dburi | 17:27 |
clarkb | that's a good question, I think zuul may need a restart to pick up the ini config | 17:27 |
corvus1 | if we're not worried about losing data, we don't have to prevent writes | 17:28 |
corvus1 | we will end up with incomplete data, but not inconsistent data in the new db, so that should be fine | 17:28 |
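(Pulling together the order of operations clarkb and corvus sketch above, roughly; the hostnames, credentials, and restart command below are placeholders and depend on how the services are actually deployed.)

```sh
# 1. Dump the existing database.
mysqldump -h old-trove.example.org -u zuul -p --single-transaction zuul > zuul.sql

# 2. Import into the new MySQL 8 instance (the slow step).
mysql -h new-trove.example.org -u zuul -p zuul < zuul.sql

# 3. Point zuul.conf's dburi at the new instance and restart zuul-web (and the
#    schedulers) so the updated ini configuration is picked up.
sudo sed -i 's/old-trove.example.org/new-trove.example.org/g' /etc/zuul/zuul.conf
sudo systemctl restart zuul-web   # or the equivalent container restart
```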
corvus1 | remote: https://review.opendev.org/c/zuul/zuul/+/914990 Restore mysql index hint [NEW] <-- the index hints change | 17:29 |
clarkb | ah ok I wasn't sure if it would be broken/inconsistent | 17:29 |
clarkb | and I don't think we're too worried about losing data. It has happened before and impact was minimal (almost no one noticed even) | 17:30 |
corvus1 | the relations are like a tree, we'll just be missing leaf nodes, and updates to buildsets to say they are complete | 17:30 |
fungi | yeah, that's exactly what we observed in the past | 17:30 |
corvus1 | (technically, there is a cycle in the relations graph, but there is still a definite order they are created in, so it doesn't affect this issue) | 17:30 |
fungi | someone was looking at an affected build result and confused that zuul thought it had never completed | 17:31 |
fungi | which was what led me to discover the db outage | 17:31 |
corvus1 | heh our db is called "zuul-v3" maybe i should call the new one "zuul-v10" | 17:34 |
clarkb | nice | 17:34 |
fungi | we coulda had a v8 | 17:34 |
clarkb | zuul the hot rod or zuul the tomato juice? | 17:35 |
fungi | why can't it be both? | 17:35 |
clarkb | hot rod powered tomato blender | 17:37 |
corvus1 | do we want more ram? currently 4g | 17:37 |
clarkb | performance has been fine I think. But maybe not super fast at times | 17:38 |
clarkb | more ram can't hurt I guess | 17:38 |
corvus1 | their internal monitoring thing just has the last 6 hours and shows all memory used | 17:38 |
clarkb | probably a bit of an abnormal state in the last 6 hours but also good enough indication to me we should bump it | 17:39 |
corvus1 | there's no separate cache page metric, so it could be fine | 17:39 |
corvus1 | but <shrug> i'm also inclined to maybe bump it up to 8 | 17:39 |
clarkb | ++ | 17:39 |
fungi | i think it will always show all memory used | 17:41 |
corvus1 | when we move to self-hosted, that will be a nice benefit | 17:41 |
fungi | it basically tells mysqld to use up to the amount available, and that will quickly end up as query cache | 17:41 |
corvus1 | fun fact: no query cache in mysql 8 (but there are other kinds of index caches -- some of which persist on disk!) | 17:42 |
fungi | interesting | 17:43 |
corvus1 | oh i just noticed in the web ui we can enable the root user (but rax may not support us if we do). just filing that bit of info away | 17:44 |
fungi | i suppose people who want an actual query cache just inline a separate proxy daemon these days | 17:44 |
corvus1 | the zuul-v3 server has some server parameters set, but i think the defaults in mysql 8 are equivalent or better, so i don't think we need a custom config | 17:56 |
corvus1 | (zuul-v3 has character_set_server=utf8 (which is an alias for utf8mb3) and collation_server=utf8_bin and wait_timeout=28800) | 17:57 |
corvus1 | the defaults in mysql 8 are utf8mb4 and that same wait_timeout value | 17:57 |
corvus1 | so i think our forward-compatible config has done its job :) | 17:57 |
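(A quick way to double-check those server parameters on both instances; nothing here is Trove-specific and the hostnames are placeholders.)

```sh
# Compare character set, collation and wait_timeout on the old and new servers.
for host in old-trove.example.org new-trove.example.org; do
  echo "== $host =="
  mysql -h "$host" -u zuul -p -e \
    "SHOW VARIABLES WHERE Variable_name IN
       ('character_set_server', 'collation_server', 'wait_timeout');"
done
```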
fungi | sounds good | 17:59 |
fungi | i guess utf8mb4 could make some tables larger, but i doubt we're short on disk | 17:59 |
clarkb | ++ to utf8mb4 | 18:02 |
corvus1 | i timed a mysqldump: 16m | 19:32 |
corvus1 | when i finish these potstickers, i'll time a restore | 19:33 |
clarkb | thanks! | 19:34 |
*** carloss_ is now known as carloss | 19:46 | |
fungi | clarkb: looks like https://review.opendev.org/914989 isn't quite working. maybe we need a carve-out in the proxy? | 20:26 |
fungi | i'm heading out to run some errands and grab dinner but will be back in an hour or so | 20:27 |
clarkb | ya I wasn't sure if the mod rewrite proxying would work with the location directive like that | 20:27 |
clarkb | I guess not | 20:27 |
clarkb | I see that etherpad works differently I'll update | 20:27 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add robots.txt to Zuul web https://review.opendev.org/c/opendev/system-config/+/914989 | 20:31 |
clarkb | something like that maybe? | 20:31 |
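(For the curious, one common carve-out pattern for serving a local robots.txt from a vhost whose mod_rewrite rules otherwise proxy everything to zuul-web; this is a sketch of the general approach, not necessarily what 914989 ended up doing, and the snippet path is illustrative.)

```sh
# Map /robots.txt to a local file and end rewrite processing for that path so
# the catch-all proxy rule never sees it.  These directives need to live inside
# the proxying vhost (e.g. via an Include of this snippet).
sudo tee /etc/apache2/snippets/zuul-robots.conf <<'EOF'
Alias /robots.txt /var/www/robots.txt
RewriteRule ^/robots\.txt$ - [L]
EOF
sudo apachectl configtest && sudo systemctl reload apache2
```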
corvus1 | the mysqldump is going to create the tables with utf8mb3 instead of mb4; i'm thinking we should go ahead and sed the dump file to change that. | 20:36 |
clarkb | corvus1: thinking out loud here: that should be safe because we're using mysqldump which isn't a binary backup | 20:50 |
clarkb | and we can probably test that process ahead of time? | 20:50 |
corvus1 | yeah, i'm currently restoring an unaltered dump; can repeat with the mb4 version | 20:51 |
corvus1 | the restore is not fast | 20:51 |
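(A sketch of the kind of munging being discussed, assuming the dump declares tables with utf8mb3 character sets and utf8_bin/utf8mb3_bin collations as described above; a blanket sed will also touch any row data that happens to contain these literal strings, so diffing a schema-only dump before and after is a cheap sanity check.)

```sh
# Rewrite character set / collation names in the logical dump so the tables
# are created as utf8mb4 on the new server.
sed -e 's/utf8mb3_bin/utf8mb4_bin/g' \
    -e 's/utf8_bin/utf8mb4_bin/g' \
    -e 's/utf8mb3/utf8mb4/g' \
    zuul.sql > zuul-mb4.sql
```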
fungi | i'm not actually here at the moment, but sounds good to me | 20:52 |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: chore: refactor setting buildset_registry fact https://review.opendev.org/c/zuul/zuul-jobs/+/915025 | 20:57 |
corvus1 | i think the restore is going to be like 90 minutes; the dump was about 15, throw in a few more mins for munging the dump file and we're probably looking at just under 2 hours for the whole operation. | 21:03 |
clarkb | sorry stepped away for a bit. I wonder if that is long enough that we should schedule it or try to do it over a weekend? | 21:45 |
clarkb | if we do do it over a weekend we may want to stop the auto upgrade playbook from running? Just throwing thoughts out there | 21:45 |
fungi | or i can stick around friday afternoon your time | 22:20 |
*** dmitriis is now known as Guest15 | 22:38 | |
corvus1 | yeah, if we can limp along for a bit, then perhaps a non-weekday would be good. i can plan on doing it saturday. | 22:42 |
clarkb | I can likely be around on saturday | 22:46 |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: chore: refactor setting buildset_registry fact https://review.opendev.org/c/zuul/zuul-jobs/+/915025 | 22:47 |
fungi | same | 22:48 |
fungi | also related, topic:zuul-ua-filter lgtm now | 22:49 |
fungi | landing that might take some pressure off | 22:50 |
corvus1 | what's the intent there? would it block folks running requests-based tools? (also zuul-client? :) | 23:05 |
corvus1 | or is it just blocking known bad uas? | 23:06 |
fungi | the idea is that we've got a corpus of known made-up/nonsensical user agent strings often used by botfarms that completely ignore robots.txt | 23:06 |
corvus1 | got it | 23:07 |
fungi | the same ones are presently being filtered for gitea and gerrit i think | 23:07 |
fungi | because they flood us with repeated/duplicate requests for computationally-intensive queries, bringing those services to their knees at times | 23:08 |
fungi | i think the bulk of that set was taken from a filter mwf uses to keep the load on wikipedia in check | 23:08 |
fungi | s/mwf/wmf/ | 23:09 |
fungi | and when clarkb was going through the apache logs on the zuul-web servers, he saw quite a few of them trying to scrape builds and buildsets query urls | 23:12 |
fungi | which were probably contributing to the overall db query load | 23:14 |
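(For reference, the general shape of that kind of user-agent filter; the strings below are made-up examples of the pattern, not the actual list from system-config, and the snippet path is illustrative.)

```sh
# Reject requests whose User-Agent matches a list of known-bad strings before
# they reach the proxied backend; the patterns here are purely illustrative.
sudo tee /etc/apache2/snippets/ua-filter.conf <<'EOF'
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "ExampleFakeBrowser|AnotherBogusAgent" [NC]
RewriteRule .* - [F,L]
EOF
```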