hashar | fungi: I got another hack for `git-review` to let it shows some colors when pushing a change https://review.opendev.org/c/opendev/git-review/+/914000 :) | 06:47 |
hashar | the remote sends back some messages and cgit colors some of them, but that only happens when run interactively (stdin or stderr is a tty) | 06:48 |
opendevreview | Antoine Musso proposed opendev/git-review master: Show remote side colors when color is enabled https://review.opendev.org/c/opendev/git-review/+/914000 | 06:49 |
hashar | :) | 06:49 |
opendevreview | Tobias Rydberg proposed opendev/irc-meetings master: We have decided to adjust meeting time to 0700 during summer time. https://review.opendev.org/c/opendev/irc-meetings/+/914940 | 08:06 |
naman | hello | 11:19 |
naman | anyone online here ?? | 11:20 |
fungi | not sure what's going on, but in the middle of the openstack coordinated release we've been getting stalled by near-constant "management events" (i guess pipeline reconfigurations?) taking upwards of 40 minutes to clear | 13:27 |
fungi | maybe it's lots of job configuration changes merging, though that seems unlikely. i wonder if there's ongoing changes in some external repo we're watching that's triggering them | 13:28 |
fungi | i don't really know what to look for in the scheduler debug logs to isolate the cause | 13:28 |
fungi | there's not that many tenant reconfigurations in the log | 13:30 |
fungi | one at 09:59:42 because of an update to openstack/magnum and another at 11:16:44 from an update to nebulous/nebulous | 13:31 |
fungi | oh, though the latter was for the nebulous tenant | 13:31 |
fungi | so yeah, only one reconfiguration for the openstack tenant during the relevant time period | 13:32 |
fungi | i wonder what else those per-pipeline management events could signify | 13:32 |
fungi | because trigger and result events just seem to keep stacking up until the management event clears, which seems to take ages | 13:33 |
frickler | I'm also seeing semi regular errors in zuul debug logs due to the replication mentioned in #zuul https://opendev.org/opendev/sandbox/commit/1c3e7447e7046e2bb6d7ae364e4f4c83090c185a , not sure if that can affect even processing | 13:44 |
fungi | future improvement idea: move opendev/sandbox to a dedicated "sandbox" tenant in zuul | 13:47 |
Clark[m] | fungi: reconfigurations should be pretty clearly logged if that is what is doing it. I don't think that dependency in a supercedent pipeline should affect zuul that way, but you can dequeue it manually to check | 13:53 |
Clark[m] | You seem to have touched on this in the release channel a little but my hunch is it has to do with all the ref creation events for the tags. Each of those tags represents a zuul config I think | 13:54 |
Clark[m] | Though that shouldn't be any different than previous releases | 13:54 |
fungi | right now the release-post pipeline has a new management event, which seems to mean we wait half an hour for its status to update and anything else there to start running | 13:55 |
fungi | er, sorry, release pipeline this time | 13:56 |
fungi | this has been occurring over and over in different pipelines though | 13:56 |
fungi | okay, now it's up to 2 management events for the release pipeline | 13:56 |
Clark[m] | fungi: is there an event id in the logs associated with those events that you can trace to see where the time is lost? | 14:01 |
fungi | those did clear, but it's still not processing all the results (been sitting at 0 trigger, 0 management, 13 results for a while) | 14:01 |
Clark[m] | And have we evicted the sandbox change that may be problematic? | 14:02 |
fungi | i haven't looked at the sandbox change, not sure what that's all about | 14:02 |
frickler | Clark[m]: the changes were merged, you would need to merge the revert | 14:02 |
fungi | or whether it needs to be preserved in the pipeline for inspection | 14:03 |
frickler | would it make sense to take a stacktrace of zuul schedulers at this point in time? | 14:03 |
Clark[m] | frickler no I mean evict it from the zuul queues if it is still there | 14:03 |
fungi | okay, release pipeline is moving again now | 14:03 |
frickler | Clark[m]: yes, it is in post, so not a change but a ref. we could try to evict it unless corvus1 wants to do some looking at it first | 14:03 |
Clark[m] | fungi: someone was being "helpful" trying to reproduce a bug they found in zuul via the sandbox repo | 14:03 |
Clark[m] | It apparently got an event stuck in the queues and we could in theory manually dequeue it if we were concerned it is related | 14:04 |
Clark[m] | But also just tailing scheduler logs during the pauses should give hints as to where the time is lost. Might be noisy but you'd see a lot of git merge events for reconfigurations for example | 14:05 |
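(For context, a rough sketch of the kind of log tracing being suggested here, assuming the scheduler writes its debug log to /var/log/zuul/debug.log; the path, grep patterns, and event id below are illustrative placeholders, not taken from the actual incident.)

```sh
# Follow the scheduler debug log while a pipeline looks stalled; bursts of
# merge/reconfiguration activity should stand out.
tail -f /var/log/zuul/debug.log | grep -Ei 'reconfig|merg'

# If a specific event id is known, trace just that event through the log to
# see where the time between its receipt and processing goes.
grep 'example-event-id-1234' /var/log/zuul/debug.log
```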
frickler | it seems they also added the failing job to the periodic pipeline, so it would likely repeat anyway. I just don't want to break the running release jobs, they still do seem to be progressing | 14:05 |
frickler | fwiw I'm still getting no results on the builds/buildsets pages unless I add some filter like for a job or project | 14:40 |
frickler | that also only affects the openstack tenant | 14:41 |
fungi | it does seem to spin for a while like it's retrieving data though | 14:42 |
frickler | oh, there's an error in the zuul web log: 2024-04-03 14:41:41,066 ERROR cherrypy.error.139740297355920: sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (3024, 'Query execution was interrupted, maximum statement execution time exceeded') | 14:42 |
fungi | fun! | 14:42 |
frickler | seems zuul show show an error in the UI for that | 14:42 |
frickler | *should show | 14:43 |
frickler | MAX_EXECUTION_TIME(30000) - I think that will be 30s? | 14:44 |
Clark[m] | Ya js uses milliseconds iirc | 14:45 |
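(For reference, MAX_EXECUTION_TIME here is a MySQL optimizer hint and its argument is in milliseconds, so 30000 does come out to 30 seconds. A minimal sketch of how such a hint is attached to a query; the table and columns are placeholders rather than Zuul's actual schema.)

```sh
# Cap a SELECT at 30 seconds; MySQL aborts it with error 3024 ("maximum
# statement execution time exceeded") if it runs longer than that.
mysql zuul -e "SELECT /*+ MAX_EXECUTION_TIME(30000) */ id, job_name, result
               FROM example_build ORDER BY id DESC LIMIT 100;"
```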
frickler | maybe we need to somehow archive old builds? or just add some proper indices? | 14:46 |
clarkb | frickler: was that with a specific query? | 14:49 |
fungi | clarkb: just go to https://zuul.opendev.org/t/openstack/builds with no query params | 14:49 |
clarkb | ok so no query parameters result in that. I wonder what changed | 14:50 |
clarkb | fwiw the order of magnitude of log entries for sandbox-publish-docs is low enough that I don't think we're in a tight loop adding failed builds to the db due to that | 14:52 |
frickler | clarkb: the timeout was without a query | 14:53 |
fungi | i'm wondering if zuul could be getting flooded with internal events or something to cause its queue and tenant event processing to just get drowned out resource-wise | 14:58 |
clarkb | we do appear to be getting crawled possibly in a manner similar to what we've seen with gitea | 14:59 |
fungi | oh, like zuul-web could be bombarding the db with queries and we're hanging on writes? | 15:00 |
fungi | though i wouldn't expect that to block at the db level | 15:00 |
*** blarnath is now known as d34dh0r53 | 15:00 | |
clarkb | no, but they are querying things like change logs which may also do db lookups? | 15:00 |
fungi | load average on the schedulers is very low too | 15:00 |
fungi | oh, but maybe the trove instance is overloaded | 15:01 |
clarkb | ya its a db load issue I think | 15:01 |
clarkb | web is just a simple passthrough to fetch that data | 15:01 |
clarkb | I don't think we're running the apache filter of user agents on zuul web's apache instances but I think we should | 15:02 |
clarkb | but that may still be an orthogonal issue | 15:02 |
clarkb | if you look at apache logs there is a python requests user agent hitting the 500 error for the timeout consistently for some time | 15:06 |
fungi | that could certainly be related to the query timeouts we're seeing too | 15:06 |
fungi | either caused by, or causing | 15:06 |
clarkb | I think that is our culprit and won't be impacted by our existing user agent filter (and we probably don't want it to be until we understand better) | 15:06 |
clarkb | seems to have started sometime yesterday | 15:06 |
clarkb | it's in a tight loop failing about twice a minute (which aligns with our 30 second timeout) | 15:07 |
clarkb | no, never mind, it's older than that | 15:08 |
clarkb | (my greps are getting better) | 15:08 |
clarkb | it started with the 500 errors on Sunday (March 31) | 15:10 |
clarkb | it seems before that it was iterating through 100 builds at a time (arg why do you need to rescan over and over again) | 15:10 |
clarkb | but now it gets the 500 and retries the first request over and over again in a loop | 15:10 |
fungi | could it be ci-log-processor or whatever it's called? | 15:10 |
fungi | but yeah, it's probably some phd student doing "research" like we've seen knock gerrit offline in the past | 15:11 |
clarkb | [2024-03-30 07:18:01.736] ok I think this is when it starts based on the first instance of a 500 error in the apache log on zuul02 | 15:12 |
clarkb | and I think zuul-web was restarted on march 30 at 07:13 | 15:12 |
frickler | just for reference it seems we have about 20M builds in 4M buildsets. that sounds to me like something mysql should still be able to handle reasonably | 15:12 |
clarkb | I'm beginning to suspect a regression in zuul after our automatic weekend upgrade | 15:13 |
fungi | yeah, unless someone's pounding it with queries maybe | 15:13 |
clarkb | corvus1: ^ fyi | 15:13 |
fungi | but yes, the timing is conspicuous | 15:13 |
clarkb | `zgrep '/api/tenant/openstack/builds?' zuul-ssl-access.log.4 | grep 'HTTP/1.1" 500' | head` to find the first occurence that I see | 15:13 |
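(A sketch of one way to extend that grep to see which user agents are hitting the failing endpoint, assuming the standard Apache combined log format where the user agent is the last quoted field.)

```sh
# Count user agents on 500 responses to the builds API, most frequent first.
zgrep '/api/tenant/openstack/builds?' zuul-ssl-access.log.4 \
  | grep 'HTTP/1.1" 500' \
  | awk -F'"' '{print $(NF-1)}' \
  | sort | uniq -c | sort -rn | head
```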
clarkb | I wonder if that may be related to the slow event processing too (a bug affecting zuul in both ways I mean) | 15:14 |
fungi | without more data, it seems at least possible that they're linked | 15:14 |
frickler | checking git log, https://review.opendev.org/c/zuul/zuul/+/913875 springs to my eye | 15:14 |
clarkb | I think there are ~3 things going on that we should consider addressing: 1) good bots that respect robots.txt are crawling data that probably shouldn't be crawled, so we should write a robots.txt to refine what, if anything, gets crawled; 2) bad bots that use weird user agents, which we should block by applying our existing apache ruleset on zuul's apache servers; and 3) a likely zuul regression | 15:20 |
clarkb | after the weekly upgrade leading to timeouts trying to load builds from the db and possibly other related slowness | 15:20 |
clarkb | 1) and 2) are straightforward and may even help the situation with 3) by reducing total queries | 15:21 |
frickler | I'm no mysql expert but could that query block the table for the 30s while it is running and thus cause executor updates like event processing to be blocked? | 15:21 |
clarkb | frickler: generally I would expect read locks on rows to not interfere with write locks, but maybe we don't have those locks properly set up in the orm and it is being conservative and safe? | 15:22 |
clarkb | I think we're in no worse a spot than we were since the weekend so I'll take a break here for morning things like food and shower. I can work on 1) and/or 2) when I get back | 15:23 |
clarkb | frickler: I would agree that change you identified is suspicious | 15:26 |
corvus1 | i'll take a look at that query | 15:58 |
*** gthiemon1e is now known as gthiemonge | 15:59 | |
clarkb | unrelated to everything else rpm's man page doesn't have ascii -'s it has ‐'s instead. This is very disappointing | 16:17 |
frickler | ah, wasn't there a big discussion about what's typographically correct recently? | 16:28 |
fungi | not sure about recently, but it's been coming up in *roff related discussions for at least 3 decades i can remember, likely closer to 5 | 16:30 |
opendevreview | Clark Boylan proposed opendev/system-config master: Apply apache us filter to Zuul https://review.opendev.org/c/opendev/system-config/+/914985 | 16:34 |
clarkb | ya it came up somewhere maybe hn | 16:35 |
clarkb | I would've expected it to be a solved problem in man pages for popular utilities like rpm though | 16:35 |
clarkb | 914985 is 2) from above | 16:35 |
clarkb | 1) I realized is a bit more tricky because we have to decide what if anything we want to be crawlable. I'm tempted to say nothing should be? | 16:35 |
clarkb | but at the very least crawling our log files seems like unnecessary bw for our cloud providers | 16:36 |
fungi | there's no reference material on zuul.o.o, so i'm inclined to say there's nothing worth indexing by search engines at least | 16:37 |
fungi | also the ability of unvetted actors to use it to publish content/urls in some locations probably means search engine indexing creates an attractive nuisance for spammers and scammers | 16:38 |
clarkb | ack I'll make a blanket please don't index rule | 16:39 |
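(A blanket "don't index anything" robots.txt is only a couple of lines; a minimal sketch of the rule being described, with an illustrative file path — not necessarily how the actual change deploys it.)

```sh
# Ask well-behaved crawlers not to index anything on this site.
sudo tee /var/www/robots.txt <<'EOF'
User-agent: *
Disallow: /
EOF
```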
*** cid__ is now known as cid | 16:55 | |
*** jroll00 is now known as jroll0 | 16:55 | |
*** Reverbverbverb_ is now known as Reverbverbverb | 16:55 | |
*** gmann_ is now known as gmann | 16:55 | |
clarkb | I think we lost gerritbot | 16:58 |
clarkb | remote: https://review.opendev.org/c/opendev/system-config/+/914989 Add robots.txt to Zuul web [NEW] | 16:58 |
clarkb | That's 1) and 2) proposed. reviews welcome | 17:00 |
corvus1 | regarding the database issue: i think i have two half-solutions that add up to a whole one: first, we can add some index hints to the query, that improves the runtime for queries locally for me with mysql 8.x. they do not improve the runtime in all cases on the opendev production database. also, they might actually produce the wrong data on opendev's production mysql. so i think we should also move to a mysql 8.x trove instance. we can do | 17:11 |
corvus1 | that relatively quickly, though it will require a small downtime (maybe 20 mins?) for the final copy step. | 17:11 |
clarkb | we did have a small outage previously where the db wasn't available and zuul mostly kept working but things weren't recorded properly obviously | 17:13 |
clarkb | I guess we can decide if we want to shut down zuul or go through that half-working state again? | 17:13 |
corvus1 | true, we could actually just decide to lose a few minutes of data | 17:13 |
clarkb | corvus1: also I guess we want to land the index updates first? Any idea what caused the change? is it the change frickler identified? | 17:14 |
corvus1 | we would just end up with missing builds. | 17:14 |
clarkb | and if it is that change I guess my next thought is "is that change valid if it has such a big impact" | 17:14 |
clarkb | re downtime or missing builds I'll defer to fungi and frickler as they are more involved in the openstack release process today. I suspect we can tolerate that elsewhere | 17:14 |
corvus1 | clarkb: that change gave us correct results at the expense of speed; so a revert is a reasonable solution for opendev, but not a good long term solution | 17:15 |
clarkb | I see | 17:15 |
corvus1 | adding the index hints would give us potentially incorrect results in opendev, and not necessarily speed up all the queries we need to | 17:16 |
clarkb | but with a more modern mysql it would both be correct and relatively quick (at least compared to human wall clock scale?) | 17:16 |
corvus1 | i think the incorrect results from adding the index hints on old mysql would be more visible than anything we've seen yet -- the builds page would show the first 100 builds instead of the last 100. despite the very explicit "ORDER BY ... DESC" in the query. | 17:17 |
corvus1 | so i think in opendev, we want mysql 8 first, then the index hints. | 17:18 |
clarkb | ++ that does seem like it would be problematic for the common use case of that search page | 17:18 |
corvus1 | and yes, once we have both, at least in my local testing, the builds and buildset queries should both be nearly instantaneous. | 17:18 |
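(A hedged illustration of the kind of index hint under discussion; the table, index, and join here are placeholders rather than Zuul's real schema or the actual change at 914990. As noted above, on the old MySQL the hinted plan could walk the index in the wrong direction and return the first 100 rows despite the ORDER BY ... DESC.)

```sh
# Hypothetical shape of a builds listing query with an explicit index hint.
mysql zuul -e "SELECT b.id, b.uuid, b.result
               FROM example_build b FORCE INDEX (PRIMARY)
               JOIN example_buildset bs ON b.buildset_id = bs.id
               ORDER BY b.id DESC LIMIT 100;"
```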
clarkb | so ya no objections from me. I think the main thing is to coordinate with fungi and frickler to ensure we minimize impact to the openstack release process (not sure if anything is left for that) | 17:19 |
corvus1 | agree; i'll spin up a trove mysql 8 to prep, but await more input before proceeding past that | 17:20 |
fungi | sorry, had to step away for a moment but catching back up... | 17:22 |
fungi | openstack release activities have wound down now, so i'm not concerned about a few missing build results if we go that route | 17:23 |
fungi | also, unless we stop gerrit at the same time, having zuul offline would mean missing events, which isn't necessarily any better | 17:25 |
fungi | i *think* rackspace lets you do in-place mysql upgrades, but a cut-over to an already running and populated second instance might be faster | 17:26 |
fungi | what's the order of operations for swapping out the trove instance in that case? | 17:26 |
fungi | is zuul going to reread the configuration for the dburi or is there still some sort of restart needed? | 17:27 |
clarkb | fungi: I think its prevent access from zuul to existing trove instance, dump db, restore db into new instance, update zuul to point at new instance | 17:27 |
corvus1 | start mysql 8; import data; restart zuul-web with new dburi | 17:27 |
clarkb | that's a good question, I think zuul may need a restart to pick up the ini config | 17:27 |
corvus1 | if we're not worried about losing data, we don't have to prevent writes | 17:28 |
corvus1 | we will end up with incomplete data, but not inconsistent data in the new db, so that should be fine | 17:28 |
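(Pulling together the order of operations clarkb and corvus sketch above, roughly; the hostnames, credentials, and restart command below are placeholders and depend on how the services are actually deployed.)

```sh
# 1. Dump the existing database.
mysqldump -h old-trove.example.org -u zuul -p --single-transaction zuul > zuul.sql

# 2. Import into the new MySQL 8 instance (the slow step).
mysql -h new-trove.example.org -u zuul -p zuul < zuul.sql

# 3. Point zuul.conf's dburi at the new instance and restart zuul-web (and the
#    schedulers) so the updated ini configuration is picked up.
sudo sed -i 's/old-trove.example.org/new-trove.example.org/g' /etc/zuul/zuul.conf
sudo systemctl restart zuul-web   # or the equivalent container restart
```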
corvus1 | remote: https://review.opendev.org/c/zuul/zuul/+/914990 Restore mysql index hint [NEW] <-- the index hints change | 17:29 |
clarkb | ah ok I wasn't sure if it would be broken/inconsistent | 17:29 |
clarkb | and I don't think we're too worried about losing data. It has happened before and impact was minimal (almost no one noticed even) | 17:30 |
corvus1 | the relations are like a tree, we'll just be missing leaf nodes, and updates to buildsets to say they are complete | 17:30 |
fungi | yeah, that's exactly what we observed in the past | 17:30 |
corvus1 | (technically, there is a cycle in the relations graph, but there is still a definite order they are created in, so it doesn't affect this issue) | 17:30 |
fungi | someone was looking at an affected build result and confused that zuul thought it had never completed | 17:31 |
fungi | which was what led me to discover the db outage | 17:31 |
corvus1 | heh our db is called "zuul-v3" maybe i should call the new one "zuul-v10" | 17:34 |
clarkb | nice | 17:34 |
fungi | we coulda had a v8 | 17:34 |
clarkb | zuul the hot rod or zuul the tomato juice? | 17:35 |
fungi | why can't it be both? | 17:35 |
clarkb | hot rod powered tomato blender | 17:37 |
corvus1 | do we want more ram? currently 4g | 17:37 |
clarkb | performance has been fine I think. But maybe not super fast at times | 17:38 |
clarkb | more ram can't hurt I guess | 17:38 |
corvus1 | their internal monitoring thing just has the last 6 hours and shows all memory used | 17:38 |
clarkb | probably a bit of an abnormal state in the last 6 hours but also good enough indication to me we should bump it | 17:39 |
corvus1 | there's no separate cache page metric, so it could be fine | 17:39 |
corvus1 | but <shrug> i'm also inclined to maybe bump it up to 8 | 17:39 |
clarkb | ++ | 17:39 |
fungi | i think it will always show all memory used | 17:41 |
corvus1 | when we move to self-hosted, that will be a nice benefit | 17:41 |
fungi | it basically tells mysqld to use up to the amount available, and that will quickly end up as query cache | 17:41 |
corvus1 | fun fact: no query cache in mysql 8 (but there are other kinds of index caches -- some of which persist on disk!) | 17:42 |
fungi | interesting | 17:43 |
corvus1 | oh i just noticed in the web ui we can enable the root user (but rax may not support us if we do). just filing that bit of info away | 17:44 |
fungi | i suppose people who want an actual query cache just inline a separate proxy daemon these days | 17:44 |
corvus1 | the zuul-v3 server has some server parameters set, but i think the defaults in mysql 8 are equivalent or better, so i don't think we need a custom config | 17:56 |
corvus1 | (zuul-v3 has character_set_server=utf8 (which is an alias for utf8mb3) and collation_server=utf8_bin and wait_timeout=28800) | 17:57 |
corvus1 | the defaults in mysql 8 are utf8mb4 and that same wait_timeout value | 17:57 |
corvus1 | so i think our forward-compatible config has done its job :) | 17:57 |
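(A quick way to double-check those server parameters on both instances; nothing here is Trove-specific and the hostnames are placeholders.)

```sh
# Compare character set, collation and wait_timeout on the old and new servers.
for host in old-trove.example.org new-trove.example.org; do
  echo "== $host =="
  mysql -h "$host" -u zuul -p -e \
    "SHOW VARIABLES WHERE Variable_name IN
       ('character_set_server', 'collation_server', 'wait_timeout');"
done
```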
fungi | sounds good | 17:59 |
fungi | i guess utf8mb4 could make some tables larger, but i doubt we're short on disk | 17:59 |
clarkb | ++ to utf8mb4 | 18:02 |
corvus1 | i timed a mysqldump: 16m | 19:32 |
corvus1 | when i finish these potstickers, i'll time a restore | 19:33 |
clarkb | thanks! | 19:34 |
*** carloss_ is now known as carloss | 19:46 | |
fungi | clarkb: looks like https://review.opendev.org/914989 isn't quite working. maybe we need a carve-out in the proxy? | 20:26 |
fungi | i'm heading out to run some errands and grab dinner but will be back in an hour or so | 20:27 |
clarkb | ya I wasn't sure if the mod rewrite proxying would work with the location directive like that | 20:27 |
clarkb | I guess not | 20:27 |
clarkb | I see that etherpad works differently I'll update | 20:27 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add robots.txt to Zuul web https://review.opendev.org/c/opendev/system-config/+/914989 | 20:31 |
clarkb | something like that maybe? | 20:31 |
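(For the curious, one common carve-out pattern for serving a local robots.txt from a vhost whose mod_rewrite rules otherwise proxy everything to zuul-web; this is a sketch of the general approach, not necessarily what 914989 ended up doing, and the snippet path is illustrative.)

```sh
# Map /robots.txt to a local file and end rewrite processing for that path so
# the catch-all proxy rule never sees it.  These directives need to live inside
# the proxying vhost (e.g. via an Include of this snippet).
sudo tee /etc/apache2/snippets/zuul-robots.conf <<'EOF'
Alias /robots.txt /var/www/robots.txt
RewriteRule ^/robots\.txt$ - [L]
EOF
sudo apachectl configtest && sudo systemctl reload apache2
```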
corvus1 | the mysqldump is going to create the tables with utf8mb3 instead of mb4; i'm thinking we should go ahead and sed the dump file to change that. | 20:36 |
clarkb | corvus1: thinking out loud here: that should be safe because we're using mysqldump which isn't a binary backup | 20:50 |
clarkb | and we can probably test that process ahead of time? | 20:50 |
corvus1 | yeah, i'm currently restoring an unaltered dump; can repeat with the mb4 version | 20:51 |
corvus1 | the restore is not fast | 20:51 |
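(A sketch of the kind of munging being discussed, assuming the dump declares tables with utf8mb3 character sets and utf8_bin/utf8mb3_bin collations as described above; a blanket sed will also touch any row data that happens to contain these literal strings, so diffing a schema-only dump before and after is a cheap sanity check.)

```sh
# Rewrite character set / collation names in the logical dump so the tables
# are created as utf8mb4 on the new server.
sed -e 's/utf8mb3_bin/utf8mb4_bin/g' \
    -e 's/utf8_bin/utf8mb4_bin/g' \
    -e 's/utf8mb3/utf8mb4/g' \
    zuul.sql > zuul-mb4.sql
```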
fungi | i'm not actually here at the moment, but sounds good to me | 20:52 |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: chore: refactor setting buildset_registry fact https://review.opendev.org/c/zuul/zuul-jobs/+/915025 | 20:57 |
corvus1 | i think the restore is going to be like 90 minutes; the dump was about 15, throw in a few more mins for munging the dump file and we're probably looking at just under 2 hours for the whole operation. | 21:03 |
clarkb | sorry stepped away for a bit. I wonder if that is long enough that we should schedule it or try to do it over a weekend? | 21:45 |
clarkb | if we do do it over a weekend we may want to stop the auto upgrade playbook from running? Just throwing thoughts out there | 21:45 |
fungi | or i can stick around friday afternoon your time | 22:20 |
*** dmitriis is now known as Guest15 | 22:38 | |
corvus1 | yeah, if we can limp along for a bit, then perhaps a non-weekday would be good. i can plan on doing it saturday. | 22:42 |
clarkb | I can likely be around on saturday | 22:46 |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: chore: refactor setting buildset_registry fact https://review.opendev.org/c/zuul/zuul-jobs/+/915025 | 22:47 |
fungi | same | 22:48 |
fungi | also related, topic:zuul-ua-filter lgtm now | 22:49 |
fungi | landing that might take some pressure off | 22:50 |
corvus1 | what's the intent there? would it block folks running requests-based tools? (also zuul-client? :) | 23:05 |
corvus1 | or is it just blocking known bad uas? | 23:06 |
fungi | the idea is that we've got a corpus of known made-up/nonsensical user agent strings often used by botfarms that completely ignore robots.txt | 23:06 |
corvus1 | got it | 23:07 |
fungi | the same ones are presently being filtered for gitea and gerrit i think | 23:07 |
fungi | because they flood us with repeated/duplicate requests for computationally-intensive queries, bringing those services to their knees at times | 23:08 |
fungi | i think the bulk of that set was taken from a filter mwf uses to keep the load on wikipedia in check | 23:08 |
fungi | s/mwf/wmf/ | 23:09 |
fungi | and when clarkb was going through the apache logs on the zuul-web servers, he saw quite a few of them trying to scrape builds and buildsets query urls | 23:12 |
fungi | which were probably contributing to the overall db query load | 23:14 |
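(For reference, the general shape of that kind of user-agent filter; the strings below are made-up examples of the pattern, not the actual list from system-config, and the snippet path is illustrative.)

```sh
# Reject requests whose User-Agent matches a list of known-bad strings before
# they reach the proxied backend; the patterns here are purely illustrative.
sudo tee /etc/apache2/snippets/ua-filter.conf <<'EOF'
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} "ExampleFakeBrowser|AnotherBogusAgent" [NC]
RewriteRule .* - [F,L]
EOF
```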