openstackgerrit | Merged opendev/system-config master: Add nodepool node key https://review.opendev.org/723049 | 00:24 |
openstackgerrit | Merged opendev/system-config master: Remove an extra backslash https://review.opendev.org/723052 | 00:24 |
clarkb | 723046 collected the zuul.conf and it looks good but the job failed | 00:26 |
clarkb | its trying to run zuul executor under python2 | 00:27 |
clarkb | mordred: ^ fyi I'll see if it is obvious why after dinner | 00:28 |
*** DSpider has quit IRC | 00:42 | |
*** mlavalle has quit IRC | 01:03 | |
*** elod has quit IRC | 01:06 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: The package is libjemalloc1 https://review.opendev.org/723023 | 01:15 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Fix zuul.conf jinja2 template https://review.opendev.org/723046 | 01:15 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Use pip3 to install zuul on executors https://review.opendev.org/723058 | 01:15 |
clarkb | mordred: ^ I think that will fix it | 01:15 |
clarkb | infra-root we are still seeing merge failures, but these appear to be due to git problems | 01:45 |
clarkb | http://paste.openstack.org/show/792699/ | 01:46 |
clarkb | also the http gerrit driver retries after receiving HTTP 409 errors when submitting votes. I believe that is happening because we were trying to vote on merged changes; we can probably avoid retrying when http 409 is the return code | 01:46 |
clarkb | I don't understand the git problems. permissions look fine. Is that related to the thing corvus said we are now doing to curate those repos possibly? | 01:47 |
clarkb | hrm the thing tobiash did for gc'ing only sets git gc options | 01:51 |
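The gc options that change sets end up in each cached repo's local git config, so they can be inspected directly. A minimal sketch, assuming the usual executor cache layout; the repo path is just one example:

```
# list whatever gc.* settings have been written into a cached repo's config
git -C /var/lib/zuul/executor-git/opendev.org/openstack/openstack-helm-images \
    config --get-regexp '^gc\.'
```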
fungi | rootfs on ze01 is close to full, but not quite so enospc probably isn't it | 01:53 |
fungi | oh, this is saying SHA could not be resolved, git returned: '' | 01:54 |
clarkb | fungi: I think ze03 was doing this before we did any updates to zuul | 01:55 |
clarkb | fungi: so its possible this existed previously and now we are just noticing because merge failures are a red flag right now | 01:55 |
clarkb | however it seems to definitely be doing it more in the fallout of the work | 01:55 |
clarkb | and it doesn't seem like we've done it in the last half hour or so? | 01:56 |
clarkb | the stack ending at https://review.opendev.org/#/c/723023/ is green now after my fix | 01:56 |
clarkb | (which is why I returned to the computer, so maybe I should call it a day for now since the merge issues seem to have settled out?) | 01:56 |
clarkb | er https://review.opendev.org/#/c/723046/ caps the stack | 01:57 |
fungi | where does zuul keep original copies of git repos to clone into workspaces? /var/lib/zuul/git seems empty | 01:57 |
clarkb | fungi: /var/lib/zuul/executor-git I think | 01:57 |
fungi | oh yep it's /var/lib/zuul/executor-git/ | 01:57 |
fungi | too bad that error doesn't say what sha it wanted to reset its state to | 01:58 |
clarkb | fungi: its possible that is logged earlier if you grep on the build or event id | 01:58 |
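One way to pull that context out of the executor debug log, assuming the usual opendev log location; the window size is arbitrary:

```
# show the log lines leading up to the traceback; the build/event id in those
# lines can then be used to grep for the exact ref being reset
sudo grep -B 20 'SHA could not be resolved' /var/log/zuul/executor-debug.log | less
```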
fungi | yeah, likely | 01:59 |
clarkb | fungi: what might be interesting is to recheck that change | 02:00 |
clarkb | https://review.opendev.org/#/c/722540 to see if that is a persistent issue with that repo | 02:00 |
clarkb | but I need to pop out again. likely until tomorrow | 02:00 |
*** rchurch has quit IRC | 03:13 | |
AJaeger | infra-root, 722540 still shows merger-failures, I just rechecked ;( | 05:09 |
*** sgw has quit IRC | 06:44 | |
*** DSpider has joined #opendev | 06:47 | |
AJaeger | we have new config-errors for openstack-tenant - how could those get introduced: https://zuul.opendev.org/t/openstack/config-errors "bad escape \l at position 3" | 07:27 |
AJaeger | seems like a new Zuul check - I'm fixing... | 07:54 |
AJaeger | infra-root, the daily periodic job has failures https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_81d/periodic/opendev.org/opendev/system-config/master/infra-prod-service-codesearch/81d1995/ and | 08:21 |
AJaeger | https://5ee8a394869b896f2ffd-b765094214961b8b34fe718f08477de4.ssl.cf1.rackcdn.com/periodic/opendev.org/opendev/system-config/master/infra-prod-service-eavesdrop/c8e62cc/ | 08:21 |
AJaeger | and accessbot gets skipped | 08:21 |
AJaeger | do we need these jobs daily and hourly? | 08:21 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Disable ovn pypi jobs temporarily https://review.opendev.org/723072 | 09:08 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Revert "Disable ovn pypi jobs temporarily" https://review.opendev.org/723073 | 09:10 |
*** tosky has joined #opendev | 09:36 | |
*** dpawlik has joined #opendev | 09:59 | |
*** avass has quit IRC | 10:16 | |
*** elod has joined #opendev | 10:28 | |
*** dpawlik has quit IRC | 10:36 | |
AJaeger | config-errors are fixed, horizon merged the changes quickly... | 12:54 |
AJaeger | infra-root, now zuul tenant has new errors: https://zuul.opendev.org/t/zuul/config-errors | 12:57 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Fix some config errors in zuul-tenant https://review.opendev.org/723082 | 13:01 |
AJaeger | proposed fix ^ | 13:01 |
mnaser | I’m getting some non deterministic behaviour with zuul saying changes have merge conflicts | 13:15 |
mnaser | example: https://review.opendev.org/#/c/723087/ | 13:16 |
frickler | mnaser: there was an issue with that yesterday evening, but I thought that this should be fixed now | 13:16 |
mnaser | and another example on the first patch set: https://review.opendev.org/#/c/723083/2 | 13:16 |
mnaser | frickler: yeah, this is happening now | 13:16 |
frickler | humm, that looks fresh :( infra-root ^^ | 13:16 |
mnaser | and the parent of those patches are very much tip of the branch | 13:17 |
mnaser | so it wouldn’t make sense (especially for the second patch, revision 1) to say that... | 13:17 |
mnaser | oh I just saw a MERGER_FAIL in zuul status for VEXXHOST tenant | 13:18 |
AJaeger | frickler: 722540 is also recent, see back scroll | 13:18 |
mnaser | I’m also seeing a bunch of retries. Is it possible that a specific executor is unhappy? | 13:18 |
AJaeger | mnaser, so, probably fallout from yesterday and needs more work - best to wait for others to fix | 13:18 |
AJaeger | mnaser: I guess ;) | 13:19 |
mnaser | cool. just wanted to make sure there’s eyes on it | 13:19 |
mnaser | I’ve been running into this issue in my own installation | 13:19 |
mnaser | I wonder if it is a recent zuul bug | 13:19 |
AJaeger | mnaser: might be - I think Zuul was restarted as well | 13:21 |
AJaeger | time for some cake - bbl | 13:22 |
mnaser | yeah it hadn’t been restarted for a while before this and I’ve been dealing with this issue too. | 13:22 |
mnaser | AJaeger: yum! Have fun! | 13:22 |
fungi | (see scrollback) clarkb spotted one of those in the log last night when he was signing off: http://paste.openstack.org/show/792699/ | 13:24 |
fungi | ValueError: SHA could not be resolved, git returned: b'' | 13:25 |
fungi | i can't easily troubleshoot at the moment, though i looked at the executor last night and didn't see anything wrong. i was also having trouble working out from the log what ref it was trying to reset to | 13:26 |
fungi | at the time we thought it could just be a broken repo, as some of the executors had been logging this before the switch to containers/ansiblification | 13:27 |
fungi | but if similar errors are being raised for other repos then maybe it's more widespread | 13:30 |
fungi | mnaser: do you have corresponding tracebacks in your repo? maybe this is a recent regression which came in with our restarts | 13:31 |
fungi | which would make sense... clarkb said he saw ze03 logging problems like this before yesterday's maintenance... and ze03 was rebooted unexpectedly by the provider a few days ago because of a hypervisor host problem | 13:32 |
frickler | this would be the latest change to the merger code that might have an issue https://review.opendev.org/701531 | 13:37 |
frickler | at least it heavily touches the code in clarkb's backtrace | 13:37 |
fungi | merged 2020-04-15 12:46:09 | 13:49 |
fungi | possible ze03 was the first merger restarted after that | 13:50 |
fungi | mnaser: if that's what you're also finding in logs for your deployment, how does your timeline match up? | 13:51 |
mnaser | fungi: let me check logs! | 13:53 |
mnaser | fungi: i dont seem to have logs going back that far but in terms of behaviour its certainly similar to what i was seeing | 13:59 |
mnaser | fungi: perhaps maybe we should issue a notice -- looks like its happening (often) enough: https://zuul.opendev.org/t/openstack/builds?result=MERGER_FAILURE | 14:07 |
AJaeger | like: status notice Zuul is currently failing some jobs with MERGER_FAILURE, this needs investigation by OpenDev team. Please refrain from rechecking until we give the all-clear. | 14:14 |
AJaeger | infra-root ^ | 14:15 |
mordred | AJaeger: ++ | 14:15 |
AJaeger | #status notice Zuul is currently failing some jobs with MERGER_FAILURE, this needs investigation by OpenDev team. Please refrain from rechecking until we give the all-clear. | 14:16 |
openstackstatus | AJaeger: sending notice | 14:16 |
AJaeger | morning, mordred ! | 14:16 |
-openstackstatus- NOTICE: Zuul is currently failing some jobs with MERGER_FAILURE, this needs investigation by OpenDev team. Please refrain from rechecking until we give the all-clear. | 14:16 | |
mordred | mnaser, frickler: tobiash has been running that merger patch for a few months I think - but it also seems reasonable to imagine it might be involved | 14:16 |
mnaser | infra-root: oh also it seems to be hitting some repos harder than others, it may be useful to grab a backup of an affected repo from a merger so we have something to inspect, in order to be able to build tests off of it | 14:16 |
mordred | morning AJaeger ! | 14:16 |
AJaeger | mordred, fungi, could either of you review https://review.opendev.org/723082 and https://review.opendev.org/723072 | 14:16 |
mnaser | just in case it's somehow transient and it disappears and comes back later | 14:17 |
mordred | do we know if we've seen it from both mergers and executors or only from executors? asking because different python versions are involved | 14:17 |
corvus | oh look an alarm | 14:18 |
mnaser | don't think anyone dug that deep yet mordred -- but the only captured traceback earlier was from ze01 | 14:18 |
mordred | nod | 14:18 |
mordred | corvus: morning! | 14:18 |
openstackstatus | AJaeger: finished sending notice | 14:19 |
AJaeger | corvus: sorry, didn't want to wake you up ;) | 14:21 |
AJaeger | morning, corvus ! | 14:21 |
frickler | mnaser: do you have some list of affected repos? my current guess is it affects only repos that only have a master branch | 14:22 |
mnaser | frickler: vexxhost/openstack-operator is one of mine that's being hit, openstack/openstack-helm-images was in the traceback that fungi provided earlier from zm01 -- the link i posted earlier with jobs that hit MERGER_FAILURE lists the bunch | 14:23 |
corvus | https://zuul.opendev.org/t/openstack/builds?result=MERGER_FAILURE | 14:24 |
mordred | kayobe has branches | 14:24 |
corvus | and some of those failures are on non-master branches | 14:24 |
AJaeger | keystone has branches | 14:24 |
openstackgerrit | Merged openstack/project-config master: Fix some config errors in zuul-tenant https://review.opendev.org/723082 | 14:25 |
mnaser | fwiw i think this is a zuul issue so maybe we can discuss in #zuul -- looks like there was a revert proposal so perhaps it was observed in another environment too | 14:26 |
openstackgerrit | Merged openstack/project-config master: Disable ovn pypi jobs temporarily https://review.opendev.org/723072 | 14:27 |
corvus | well, we need to decide what opendev should do | 14:27 |
corvus | should we attempt more interactive debugging, or restart with a revert of the gc change (then, if it still happens, revert the other merger change)? | 14:28 |
mnaser | corvus: i suggested capturing a copy of the git repo on-disk from some of the mergers that are posting tracebacks so we can actually use those later to possibly create tests and simplify investigation | 14:29 |
mnaser | and i guess, depends on how much work people want to do on a saturday :) | 14:29 |
corvus | has anyone looked into this enough to know if we have enough debugging info to reconstruct the series of git operations that got to this point? | 14:30 |
mordred | corvus: restarting with the revert is fairly easy to do on a saturday morning - although - if that patch was the cause of the issue it could have done something to the repo so that we'll still see the issue on repos after a restart - but might not see more repos be impacted | 14:30 |
corvus | i only see a traceback pasted, no merger debug log lines | 14:30 |
mordred | corvus: I do not believe we have gotten that far, no | 14:30 |
mnaser | ^ | 14:30 |
corvus | is this only happening on executors or mergers as well? | 14:31 |
mnaser | unknown info as far as reported here, mordred asked the same earlier too | 14:31 |
mordred | I do think capturing a copy of couple of repos showing the issue wouldn't be a bad idea | 14:32 |
tristanC | it would be good to know what the merger logged before the traceback, there should be `Reset` and `Delete` message that could help understand the issue | 14:32 |
mordred | I have grabbed a copy of openstack-helm-images on ze01 (Which is the one referenced in the traceback) - just in case | 14:33 |
corvus | tristanC: yeah, i just grabbed that http://paste.openstack.org/show/792708/ | 14:33 |
tristanC | corvus: argh, it seems like it is deleting empty dirs too aggressively | 14:35 |
mordred | we are running git 2.7.4 on the executor | 14:38 |
mordred | which is < 2.13 - so would cause the "cleanup empty dirs" logic to run | 14:38 |
mordred | I don't see any exceptions in debug log on zm01 | 14:38 |
mordred | mergers have git 2.20.1 | 14:39 |
mordred | in case those data points are useful | 14:39 |
mnaser | (we seem to be on something from those logs in #zuul :) | 14:39 |
corvus | mordred: yes, all helpful | 14:39 |
mordred | corvus, tristanC: I checked - no Exception in any merger logs | 14:43 |
mordred | (which seems to jibe with the #zuul analysis) | 14:44 |
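A quick way to run that check across all the mergers from bridge; the host pattern and log path are assumptions about the usual layout:

```
# count tracebacks in each merger's debug log; zero everywhere matches the
# finding above
ansible 'zm*' -b -m shell -a 'grep -c Traceback /var/log/zuul/merger-debug.log || true'
```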
corvus | mordred: looking at the code, i think this is unlikely to trip on the mergers because they generally don't see remote refs deleted | 14:44 |
corvus | i also think that means that the executor git repo cache is probably fine | 14:45 |
AJaeger | mordred: infra-prod-remote-puppet-else failed, see https://zuul.opendev.org/t/openstack/build/d14ec247821c4a19b30cde6f4b5fa7cf | 14:45 |
corvus | these errors are probably only happening on the transient build repo | 14:45 |
mordred | corvus: that's good news | 14:49 |
mordred | AJaeger: thanks - will investigate | 14:49 |
mnaser | so given zuul is in emergency, we'll probably have to manually rollout this across mergers once this lands? | 14:50 |
corvus | yeah, though i think i just approved the outstanding changes. but i guess they could run into this error | 14:53 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use python3 for ansible https://review.opendev.org/723105 | 14:54 |
corvus | if the zuul change lands first, we can use ansible to run "git pull --ff-only" and "pip3 install ." | 14:54 |
mordred | once the zuul patch lands, we can manually run the service and then do a restart of the executors | 14:54 |
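corvus's ad-hoc option would look roughly like this from bridge; the host pattern and checkout path are assumptions (the executors run zuul from a git checkout rather than containers):

```
# fast-forward the zuul checkout on every executor and reinstall it with pip3
ansible 'ze*' -b -m shell -a 'cd /opt/zuul && git pull --ff-only && pip3 install .'
```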
mordred | corvus: fwiw - I noticed the warnings while doing ad-hoc ansible to the mergers to check for exceptions in the logs ... made that ^^ | 14:55 |
mordred | mnaser: while I've got you here, could I get you to +A https://review.opendev.org/#/c/721099/ ? we landed the corresponding system-config change yesterday | 14:57 |
AJaeger | mordred: want to merge https://review.opendev.org/721627 and https://review.opendev.org/720719 as well? in case mnaser wants to review more.... | 14:59 |
mnaser | mordred: i am not fully understanding how that job authenticates against eavesdrop | 14:59 |
mordred | AJaeger: yeah - both of those would be good as well - although I'm most concerned about 721099 getting in so we're not in an inconsistent state | 14:59 |
mnaser | there's no secret, the job has no parent | 15:00 |
mordred | mnaser: currently there is a "jenkins" user on eavesdrop and we connect to it with the private ssh key - in the new system, we're using the zuul per-project ssh key - we've added the corresponding public key from project-config to a zuul user on eavesdrop | 15:00 |
mordred | mnaser: here, https://review.opendev.org/#/c/721098/30/playbooks/service-eavesdrop.yaml - zuul-user role, and here: https://review.opendev.org/#/c/721098/30/playbooks/roles/zuul-user/defaults/main.yaml | 15:01 |
mnaser | mordred: maybe this is new for me but zuul has an ssh key per project? :p | 15:02 |
mnaser | if so, really cool and TIL | 15:02 |
mnaser | oh it does | 15:02 |
mordred | and here: https://review.opendev.org/#/c/721098/30/playbooks/group_vars/eavesdrop.yaml | 15:02 |
mordred | mnaser: yes it does! | 15:03 |
mnaser | that's really super cool | 15:03 |
mnaser | TIL | 15:03 |
mnaser | ok so this explains a lot then! | 15:03 |
mordred | mnaser: so you do an add_host in a playbook, and zuul will try to ssh to those hosts by default with a per-project key | 15:03 |
mordred | mnaser: yah - we're using it for the new zuul-driven-cd we're doing | 15:03 |
mnaser | mordred: cool, i have a few uses for that too then :) | 15:04 |
mordred | we do add_host with bridge.openstack.org (our bastion host) - but on bridge we've put in the public keys for the system-config and project-config repos - so only jobs triggered by those repos can ssh in | 15:04 |
mordred | it's pretty amazeballs | 15:04 |
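For reference, zuul-web serves those per-project public keys; a sketch assuming the standard endpoint, with the tenant and project as examples:

```
# fetch the SSH public key zuul generated for a project; this is the key that
# gets added to the target host's authorized_keys
curl -s https://zuul.opendev.org/api/tenant/openstack/project-ssh-key/opendev/system-config.pub
```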
*** ildikov has quit IRC | 15:06 | |
*** jroll has quit IRC | 15:06 | |
*** ildikov has joined #opendev | 15:06 | |
*** jroll has joined #opendev | 15:07 | |
mnaser | ++ | 15:11 |
corvus | mnaser: https://zuul-ci.org/docs/zuul/reference/jobs.html#project-key | 15:12 |
mnaser | oh and only added in post-review | 15:12 |
mnaser | even nicer. | 15:12 |
corvus | there's a tenant ssh key in the design, but we haven't implemented it yet | 15:13 |
mordred | AJaeger: the puppet error you reported is because we replaced a dir with a symlink and rsync is unhappy about that | 15:15 |
openstackgerrit | Merged openstack/project-config master: Use zuul deployment keys for yaml2ical https://review.opendev.org/721099 | 15:16 |
mordred | corvus, fungi: I think to fix that I need to run an ad-hoc ansible to remove the old dir and then re-run the puppet | 15:17 |
mordred | infra-root: http://paste.openstack.org/show/792711/ I'd like to run that | 15:19 |
mordred | then the next normal runs should be able to rsync state appropriately | 15:20 |
mordred | (this is also the error in service-nodepool atm fwiw) | 15:21 |
openstackgerrit | Merged opendev/system-config master: Use pip3 to install zuul on executors https://review.opendev.org/723058 | 15:24 |
openstackgerrit | Merged opendev/system-config master: The package is libjemalloc1 https://review.opendev.org/723023 | 15:28 |
openstackgerrit | Merged opendev/system-config master: Fix zuul.conf jinja2 template https://review.opendev.org/723046 | 15:28 |
mordred | corvus: ^^ I think that's all of the things | 15:30 |
fungi | okay, back from the wilds of my yard-jungle for a breather, before i have to head back into battle | 15:30 |
mordred | corvus: want me to remove from emergency and re-run the zuul service playbook? | 15:31 |
fungi | caught up on scrollback, and sounds like you do think it's a recent zuul regression? | 15:31 |
mordred | fungi: if you have a sec, I'd like to run http://paste.openstack.org/show/792711/ to fix an issue with a dir changing to a symlink and making rsync sad | 15:32 |
corvus | mordred: sure | 15:32 |
mordred | fungi: and yes - we've approved a presumptive fix, but are still trying to make a reproducing test | 15:32 |
openstackgerrit | Merged openstack/project-config master: Run accessbot script on accessbot channels update https://review.opendev.org/721627 | 15:32 |
fungi | mordred: yep, i concur, that sounds sensible for the rsync symlink problem | 15:32 |
openstackgerrit | Merged openstack/project-config master: Start building focal images https://review.opendev.org/720719 | 15:32 |
fungi | and 723104 looks like the probable fix we'll try for the executors? | 15:32 |
corvus | yep | 15:32 |
mordred | corvus: the commits are on bridge | 15:33 |
fungi | and the outstanding fixes from yesterday's unplanned maintenance all just merged | 15:33 |
corvus | mordred can run the playbook now before it lands and we can confirm everything we fixed yesterday is still fixed | 15:33 |
corvus | mordred: what commits? | 15:33 |
mordred | I have removed zuul from the emergency | 15:33 |
mordred | the ones we just landed | 15:33 |
mordred | the system-config ones | 15:33 |
corvus | mordred: ah you mean they have propagated, great thx | 15:33 |
fungi | 723058, 723023 and 723046 for system-config | 15:33 |
mordred | running service-zuul | 15:34 |
fungi | not from the screen session i guess | 15:35 |
mordred | oh - no, that would have been more smarter | 15:35 |
fungi | no worries, i'll just use my imagination ;) | 15:36 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Cron module wants strings https://review.opendev.org/723106 | 15:38 |
mordred | fungi, corvus : everything has run successfully | 15:38 |
fungi | and that cron patch is something you just noticed when running i guess? | 15:38 |
mordred | zuul.conf on zm01 looks correct | 15:38 |
mordred | yup | 15:39 |
mordred | as is its parent | 15:39 |
mordred | so - non-urgent - just responding to nagging from ansible | 15:39 |
mnaser | fyi zuul api is replying with 500s | 15:39 |
mordred | that's non-awesome | 15:39 |
mnaser | https://zuul.opendev.org/api/tenant/zuul/status | 15:40 |
mnaser | we'll probably have a trace in zuul-web somewhere | 15:40 |
mordred | gear.GearmanError: Unable to submit job to any connected servers | 15:40 |
mnaser | ok so either gearman has pooped out or maybe ca/crt/keys are not properly setup/added into the zuul-web container? | 15:41 |
corvus | the scheduler is also not connected to gearman | 15:42 |
corvus | zuul 22260 1.1 0.0 0 0 ? Z Apr24 11:50 [zuul-scheduler] <defunct> | 15:42 |
corvus | it's probably that process | 15:42 |
fungi | was just about to mention that | 15:42 |
fungi | yeah, probably the geard fork | 15:42 |
corvus | this is likely an unrecoverable error and we will need a full restart and to restore from a queue backup | 15:43 |
corvus | i'm really curious what could have caused it | 15:43 |
mordred | yeah. me too | 15:43 |
fungi | /var/log/zuul/gearman-server.log is empty | 15:43 |
fungi | the rotated log has some tracebacks related to statsd though | 15:44 |
mordred | corvus: we have a "reload zuul scheduler" handler in the ansible that fires when main.yaml changes | 15:44 |
fungi | AttributeError: 'Server' object has no attribute 'statsd' | 15:44 |
yoctozepto | I see you already know what I was about to say :-) | 15:44 |
mordred | corvus: it runs "docker-compose kill -s HUP scheduler" | 15:44 |
fungi | but those stop at 2020-04-24 22:42:48,486 which i think was when we restarted | 15:45 |
mordred | corvus: perhaps that is not doing the right thing? | 15:45 |
corvus | mordred: yeah, for example, if that sends a HUP to all processes in the container, that's bad | 15:45 |
mordred | I betcha it does | 15:45 |
fungi | it's a good bet | 15:45 |
mordred | we probably want to do an exec scheduler kill | 15:45 |
corvus | the old version of this only HUP'd the actual scheduler process (via the pidfile) | 15:45 |
mordred | to run the kill inside the container | 15:46 |
mordred | yeah | 15:46 |
corvus | how about | 15:46 |
corvus | we run zuul-scheduler smart-reconfigure | 15:46 |
mordred | ooh - that's a great idea | 15:46 |
corvus | or whatever reconfigure we intend to run :) | 15:46 |
fungi | far safer than process signals anyway | 15:46 |
corvus | i think smart-reconfigure is probably what we want | 15:46 |
mordred | smart-reconfigure seems like a great choice | 15:46 |
mordred | yeah | 15:46 |
mordred | patch coming | 15:47 |
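What the new handler should boil down to on the scheduler host, roughly; the compose directory and service name are assumptions based on the merger setup discussed later in this log:

```
# ask the running scheduler to reconfigure via its command socket instead of
# signalling the whole container
cd /etc/zuul-scheduler && docker-compose exec scheduler zuul-scheduler smart-reconfigure
```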
mnaser | i guess for now we'll have to backup/restore to restart scheduler | 15:47 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run smart-reconfigure instead of HUP https://review.opendev.org/723107 | 15:47 |
fungi | we should have queue backups, unless the cronjob fix wasn't already in place | 15:47 |
corvus | it was definitely in place :) | 15:47 |
mordred | corvus: I should probably put the scheduler back in emergency until that lands yes? | 15:47 |
corvus | whether it works i dunno | 15:48 |
corvus | mordred: would anything else trigger it? i think we can probably just proceed assuming that's the next thing that will land and therefore be okay | 15:48 |
mordred | -rw-r--r-- 1 root root 825 Apr 25 15:48 openstack_status_1587829681.json | 15:48 |
mordred | corvus: kk | 15:48 |
mordred | looks like we have a backup | 15:48 |
fungi | 15:48 is probably too new? | 15:48 |
corvus | mordred: on second thought, i guess any old project-config change could trigger this, so maybe emergency is a good idea | 15:48 |
mordred | nope | 15:48 |
mordred | looks like we have a 500 | 15:49 |
mordred | corvus: kk. doing | 15:49 |
mnaser | file size will probably be different | 15:49 |
corvus | mordred: 107 needs a fix | 15:49 |
mordred | -rw-r--r-- 1 root root 376412 Apr 25 15:37 openstack_status_1587829021.json | 15:49 |
mordred | looks better | 15:49 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run smart-reconfigure instead of HUP https://review.opendev.org/723107 | 15:50 |
corvus | i'm not sure how we're sposed to use those json files now | 15:51 |
corvus | cause the changes script does api calls | 15:51 |
corvus | i think we'd have to modify the script | 15:51 |
mordred | corvus: they're mounted in the container in case that's helpful | 15:52 |
corvus | ? | 15:52 |
corvus | i'm just saying if we try to run python /opt/zuul/tools/zuul-changes.py file:///var/lib/zuul/backup/openstack_status_1587829021.json | 15:52 |
corvus | we get urllib2.URLError: <urlopen error [Errno 20] Not a directory: '/var/lib/zuul/backup/openstack_status_1587829021.json/api/info'> | 15:53 |
corvus | so there's hacking to be done if we want to try to re-enqueue from those files | 15:53 |
mordred | oh - duh. I get what you're saying | 15:53 |
corvus | so is that worth doing, or should we restart with data loss? | 15:53 |
corvus | oh there's a ~root/zuul-changes-static.py script | 15:54 |
corvus | but that has to be run with every tenant+pipeline combo | 15:54 |
corvus | how about i do that for openstack check and gate and stop there | 15:55 |
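The zuul-changes helpers ultimately emit one enqueue command per change, so the restore is a series of commands along these lines; the change number and patchset are placeholders, and the exact client options may differ:

```
# re-enqueue a single change into the openstack tenant's check pipeline,
# run from the scheduler's compose directory
docker-compose exec scheduler zuul enqueue --tenant openstack --trigger gerrit \
    --pipeline check --project opendev/system-config --change 723046,2
```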
corvus | i'll restart the scheduler now | 15:55 |
mordred | corvus: ++ | 15:56 |
corvus | erm, the statsd errors are showing up on a scheduler start | 15:56 |
corvus | oh, no, maybe i'm wrong about that and those are zombie docker error messages | 15:57 |
mnaser | thats reassuring :) | 15:57 |
mordred | the debug log says statsd enabled | 15:57 |
corvus | yeah, i dunno where those came from, but i don't see them now | 15:58 |
fungi | there were definitely some in the gearman-server.log from before yesterday's last restart | 16:00 |
corvus | fungi: yeah, i think they are side effects of broken connections | 16:01 |
corvus | re-enqueueing | 16:01 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Support multi-arch image builds with docker buildx https://review.opendev.org/722339 | 16:54 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Use failed_when: false instead of ignore_errors: true https://review.opendev.org/723109 | 16:54 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use python3 for ansible https://review.opendev.org/723105 | 16:59 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Cron module wants strings https://review.opendev.org/723106 | 16:59 |
mnaser | https://review.opendev.org/#/c/723104/ has landed, do we want to restart executors with it? | 17:01 |
corvus | mnaser, mordred, tristanC: yeah -- looks like that's installed on the executors now, so i will run the restart mergers and executors playbook | 17:24 |
mnaser | corvus: ok cool, i have a patch that somehow always manages to fail miserably | 17:24 |
corvus | mnaser: excellent! :) | 17:25 |
mnaser | ftr https://review.opendev.org/#/c/723083/2 is the really sad patch | 17:27 |
corvus | grr, the stop playbook doesn't wait for them to stop ( | 17:28 |
corvus | i should not have used those | 17:28 |
corvus | just waiting for 2 stragglers to stop | 17:33 |
mnaser | coolz | 17:36 |
corvus | okay, all stopped. starting now. | 17:37 |
corvus | mnaser: they're up if you want to recheck | 17:37 |
mnaser | corvus: ok cool let me try it out | 17:37 |
mnaser | just rechecked 723083 and will watch status | 17:38 |
mnaser | ah dang | 17:38 |
mnaser | opendev-buildset-registry (2. attempt) | 17:38 |
mnaser | opendev-buildset-registry (3. attempt) | 17:39 |
mnaser | only 2 jobs managed to start | 17:39 |
mnaser | sorry 3, 2 are in merger_failure and 1 still waiting | 17:39 |
corvus | ok i'll check the logs on opendev-buildset-registry for that and make sure it's the same error | 17:40 |
corvus | ze01 | 17:41 |
corvus | http://paste.openstack.org/show/792715/ | 17:43 |
mnaser | wonder if ze01 is the culprit, that's two traces from it (and eventually things seem to work) | 17:43 |
corvus | let me grep for "Unable to merge" across the executors | 17:44 |
mnaser | oh hey that's another error tho, that's useful! | 17:44 |
corvus | yeah, i don't know if that helps us triangulate the error, or if we've popped one problem off the stack and this is the next one :| | 17:44 |
mnaser | yeah. this seems a lot more useful of an error message anyhow | 17:45 |
corvus | mnaser: the only time i see 'not something we can merge' is for vexxhost/openstack-operator | 17:46 |
mnaser | corvus: could it be possible that because im the first recheck to be done since that fix we implemented | 17:46 |
corvus | mnaser: did we confirm your earlier error on openstack-operator was the same one that was pasted earlier? | 17:46 |
corvus | mnaser: could be | 17:46 |
corvus | but i'm also wondering if that repo just happens to have something else wrong with it | 17:46 |
mnaser | corvus: we did not confirm my error to be the same as the helm one which was the traceback provided | 17:46 |
corvus | mnaser: ok. maybe we should recheck that helm change | 17:47 |
corvus | i do see the 'not something we can merge' error across many executors | 17:47 |
mnaser | it does seem to be a pattern of hitting specific repos -- https://zuul.opendev.org/t/openstack/builds?result=MERGER_FAILURE | 17:47 |
mnaser | i think it was openstack-helm-images so i see a change there with merger failures so ill recheck that | 17:47 |
mnaser | https://review.opendev.org/#/c/722540/ specifically | 17:47 |
corvus | mnaser: great, thanks; let me know if/when it hits and i'll search again; meanwhile, i'll continue digging | 17:48 |
mnaser | corvus: also, something to provide signal -- 722769,7 has a lot of retries in the check pipeline, an openstack/openstack-helm change | 17:48 |
mnaser | it could be a bad job though, because it's at 4 attempts, so i wonder if they override retries and are just failing a lot in pre :( | 17:49 |
mnaser | openstack-helm-images jobs have all started except for 1 which is still queued, but signaling a much better overall situation | 17:52 |
corvus | okay, i've turned on keep on ze01 and will direct-enqueue 723083 and hope a job lands there | 17:52 |
corvus | got build b2f966145388493ab63aaa9187ba1317 on ze01 with that error | 17:57 |
corvus | oops, not that one, that's too early | 17:58 |
mnaser | i wonder if the on-disk copy of the repo has somehow become unclean | 17:59 |
mnaser | we got a few more chances of landing there | 17:59 |
corvus | git fsck on a copy of the cached repo is clean | 18:01 |
mnaser | corvus: we lose :( all jobs queued up unless the build/functional jobs end up failing but that'll be in a little bit till the registry goes up /images built / etc | 18:02 |
corvus | i'll dequeue and re-enqueue | 18:02 |
corvus | on the bright side, i haven't seen "SHA could not be" since before 1700 utc | 18:05 |
mnaser | thats good! | 18:05 |
mnaser | we got a merger_failure! hopefully ze1? | 18:05 |
corvus | doesn't look like it; i may need to increase the odds | 18:05 |
corvus | i'll wait for that linters job, then i'll set keep everywhere if it doesn't hit ze01 | 18:07 |
corvus | okay let's try 6011e23339bf4938b1d5f12d0334087a on ze07 | 18:12 |
mnaser | neat | 18:16 |
corvus | the git object in FETCH_HEAD definitely isn't there; if i manually fetch it again it works | 18:17 |
mnaser | corvus: this is beyond my git and zuul internal but what puts the object there in the first place? | 18:20 |
mnaser | i assume once a change is queued, zuul clones from review.opendev.org (or fetches that head) and once its sure that its there, it plops it down in /var/lib/zuul/builds/foo/ | 18:21 |
corvus | narrowing down on a reproducer -- i've switched back to a copy of the cached git repo (ignoring the build repo for now), i get this: http://paste.openstack.org/show/792716/ | 18:21 |
mnaser | sounds like gc could be involved | 18:22 |
corvus | i'm somewhat inclined to just delete this repo from all the executors and see if it happens again | 18:25 |
mnaser | corvus: i wouldn't be opposed. it seems like the cached repo may have kinda hit some weird stage | 18:26 |
corvus | i have low confidence that will fix it, but since it's the only one so far, i'd like to eliminate that as a possibility | 18:26 |
corvus | aaaand there's another one: | 18:26 |
corvus | stderr: 'fatal: not something we can merge in .git/FETCH_HEAD: a57bca5811d4bb4b0bf7d4b8e7f46cd922eb4430'refs/changes/31/723031/3' of ssh://review.opendev.org:29418/openstack/kolla-ansible' | 18:26 |
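That error names the exact fetch that failed, so the manual check corvus mentioned earlier is just the same two steps in a scratch clone; anonymous https is fine for poking at it by hand:

```
# fetch the change ref from the error, then merge whatever FETCH_HEAD points at;
# the merger failure above is the second step not finding the fetched object
git fetch https://review.opendev.org/openstack/kolla-ansible refs/changes/31/723031/3
git merge --no-edit FETCH_HEAD
```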
corvus | so on second thought, i think we should revert the gc change, then remove all the cached repos. because they're all now configured with gc settings that aren't going to be reverted with the revert of the change. | 18:27 |
mnaser | corvus: i | 18:28 |
mnaser | i'm in agreement | 18:28 |
mnaser | i think the gc is def involved in this | 18:28 |
corvus | i've got a copy of that repo on ze07 and ze01 for us to experiment with | 18:29 |
mnaser | corvus: i suspect as more repos gc on fetches, we'll probably see this start popping up more, the relative quietness of the weekend may have masked it | 18:30 |
corvus | yep | 18:30 |
AJaeger | shall I propose reverting tobiash's gc change? https://review.opendev.org/#/c/722272/ | 18:32 |
AJaeger | corvus: I see your suggestion above, let me propose | 18:32 |
corvus | already did | 18:32 |
corvus | AJaeger: https://review.opendev.org/723110 | 18:32 |
AJaeger | corvus: ok, I'll abandon... | 18:33 |
mnaser | corvus: i dont know how much time you have right now :) but maybe we'll have to figure out if we're going to wipe/rotate executor-by-executor to avoid a big thundering herd at gerrit | 18:35 |
mnaser | i'm not sure if opendev has done this much in the past | 18:35 |
corvus | mnaser: it's not that bad even when we're busy; | 18:36 |
mnaser | oh right because specifically for the executors we clone ad-hoc | 18:36 |
corvus | (it's basically we just have to clone nova and neutron 12 times :) | 18:36 |
corvus | yeah, that helps a lot | 18:36 |
mnaser | cool | 18:37 |
* mnaser is going to go find something to distract for next 40-50 minutes as that change lands | 18:37 | |
corvus | infra-root: i am about to start a multi-hour gap; if anyone else is able to ensure that https://review.opendev.org/723110 lands, all executors and mergers are stopped, all repos in /var/lib/zuul/executor-git and wherever they are on the mergers are deleted, and then executors and mergers started, that would be great :) | 18:38 |
AJaeger | looking at http://grafana.openstack.org/d/T6vSHcSik/zuul-status?orgId=1 we've lost a couple mergers and executors and got them back after the restart | 18:38 |
corvus | i probably won't be able to return until after midnight utc | 18:38 |
mnaser | corvus: thanks for everything and the summary, ill try and keep eyes on 723110 :) | 18:39 |
mordred | corvus: I'm in and out over the next bit - but I do have coverage before midnight - so i'll keep an eye on it too | 18:45 |
fungi | okay, i'm in for the day, got as much yardwork in as i could before the weather changed. catching back up and can help if stuff's still wacky | 19:31 |
fungi | and looks like we're presently trying to merge 723110 | 19:34 |
fungi | and it's about done in the gate | 19:35 |
fungi | and it's merged | 19:38 |
fungi | now for the deploy | 19:39 |
fungi | oh, promote goes first | 19:39 |
fungi | though i suppose our executors don't need the images | 19:40 |
mnaser | fungi: yep, executors dont use containers right now, and promote is done so we should be ready for the follow up steps afaik | 19:41 |
fungi | well, the zuul source code still needs to end up on the executors and get installed | 19:42 |
fungi | at which point we can stop, clean and start them all | 19:42 |
mnaser | fungi: ah, right! but i suspect that might not happen because zuul is in emergency | 19:51 |
mnaser | i *think* | 19:51 |
mnaser | so we might need to git pull it in there | 19:51 |
fungi | it's not, i double-checked | 19:53 |
fungi | so ansible will in theory update it on its usual schedule | 19:54 |
* AJaeger is a bit puzzled - we have not seen any further MERGER_FAILURE for some time, have we? | 19:55 |
AJaeger | https://zuul.opendev.org/t/openstack/builds?result=MERGER_FAILURE | 19:55 |
fungi | the last infra-prod-service-zuul started at 2020-04-25T19:10:35 | 19:57 |
fungi | so the next one should be pretty soon | 19:57 |
* AJaeger had expected more problems after the restart and the debugging earlier, so I'm not completely convinced yet that the gc patch is the problem. But corvus' debugging was conclusive... | 19:58 |
fungi | https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-service-zuul | 19:58 |
AJaeger | But it's late here, I might be confused | 19:58 |
fungi | so probably in another ~10 minutes from now | 19:59 |
AJaeger | great, fungi! | 19:59 |
* AJaeger signs off again... | 19:59 | |
fungi | have a good night, AJaeger! | 19:59 |
AJaeger | have a nice weekend | 19:59 |
AJaeger | thanks, fungi - you as well in a few hours ;) | 20:00 |
fungi | looks like it's just started running the build | 20:14 |
fungi | the revert commit is starting to show up in the local clone of zuul on the executors now | 20:15 |
fungi | and done, double-checking it got installed | 20:19 |
fungi | zuul==3.18.1.dev104 # git sha 9b300bc | 20:20 |
fungi | that's the right commit | 20:20 |
fungi | okay, stopping the executors | 20:20 |
fungi | waiting for them to finish shutting down | 20:24 |
* tristanC crosses finger | 20:25 | |
fungi | okay, no remaining zuul-owned processes on any of the twelve other than leaked ssh-agent daemons | 20:32 |
fungi | time to start deleting | 20:32 |
fungi | oh, need to stop all mergers too, i almost missed that | 20:32 |
fungi | this should be faster | 20:32 |
tristanC | fungi: i was about to ask... perhaps this can be done in a second step? | 20:33 |
tristanC | fungi: e.g. remove executor-git cache, restart executor, then repeat for merger | 20:33 |
fungi | maybe, though i don't know that it makes any difference | 20:35 |
fungi | i just need to take a moment to refresh my memory on how to use docker-compose to stop these since they're in containers | 20:35 |
fungi | looks like it should be `sudo docker-compose down` | 20:36 |
fungi | except i think it wants a very specific pwd | 20:37 |
fungi | cd /etc/zuul-merger/ && sudo docker-compose down | 20:41 |
fungi | that seems to do the trick | 20:41 |
fungi | and they're all stopped now | 20:42 |
fungi | on to git repo deletions | 20:42 |
fungi | looks like it's still /var/lib/zuul/git/* on the mergers | 20:43 |
fungi | doing those first | 20:43 |
fungi | okay, i have those and also /var/lib/zuul/executor-git/* on the executors all recursively deleting in parallel | 20:50 |
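One way to run those deletions in parallel from bridge; the host patterns are assumptions, the paths are the ones named above:

```
# wipe the merger caches and the (much larger) executor caches
ansible 'zm*' -b -m shell -a 'rm -rf /var/lib/zuul/git/*'
ansible 'ze*' -b -m shell -a 'rm -rf /var/lib/zuul/executor-git/*'
```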
fungi | mergers have all finished, some executors are still deleting | 20:52 |
fungi | going ahead and bringing the mergers back up | 20:53 |
fungi | huh... does docker-compose up not close its copies of parent fds? | 20:54 |
fungi | seems remotely running it via an ssh command line hangs indefinitely after "Creating zuul-merger_merger_1 ... done" | 20:54 |
fungi | i guess i'll just do them in interactive sessions | 20:55 |
fungi | huh, even running docker-compose up from an interactive shell never disassociates | 20:58 |
fungi | and sigint stops the container again | 20:58 |
fungi | this is not good | 20:58 |
tristanC | fungi: there may be a cli arg to make docker-compose detach | 20:58 |
fungi | yeah, i'm doing some reading | 20:58 |
fungi | manpage says: -d Detached mode. Run container in the background, print new container name. | 21:00 |
fungi | and it's `docker-compose up -d` not `docker-compose -d up` in case anyone was wondering. the latter just prints usage instructions and not even an informative error | 21:02 |
fungi | okay, looks like all the mergers are started again | 21:05 |
fungi | the executors also finished their deletions whilst i was fumbling around with docker-compose so starting them back up as well | 21:06 |
fungi | and they're all running again too now | 21:08 |
fungi | #status log restarted all zuul executors and mergers on 9b300bc | 21:09 |
openstackstatus | fungi: finished logging | 21:09 |
clarkb | fungi: on the mergers note that you may need to docker-compose pull before doing the up in order to run the new thing. I expect ansible did that for you | 21:43 |
clarkb | fungi: I think you can do a docker ps -a and cross check the image hash against what is on dockerhub to double check though | 21:44 |
clarkb | or docker exec /usr/bin/python3 /usr/local/bin/pbr freeze | grep zuul ? | 21:44 |
mordred | clarkb: the ansible should have done a pull - but it's good to double-check | 22:13 |
mnaser | moment of truth | 22:13 |
mnaser | im running a recheck on the super broken patch of mine | 22:13 |
fungi | Error: No such container: /usr/bin/python3 | 22:13 |
fungi | i'll need to read up on how that works | 22:13 |
mordred | docker-compose exec executor pbr freeze | grep zuul | 22:14 |
mordred | or docker-compose exec merger pbr freeze | grep zuul | 22:14 |
fungi | zuul-merger_merger_1 | 22:14 |
fungi | ? | 22:14 |
mordred | if you use docker-compose to do it it'll do it for you | 22:14 |
fungi | ahh, yeah | 22:15 |
mnaser | speaking of which | 22:15 |
mordred | root@zm01:/etc/zuul-merger# docker-compose exec merger pbr freeze | grep zuul | 22:15 |
mordred | ftr | 22:15 |
mnaser | did we merge the smart-reconfigure patch | 22:15 |
mnaser | before we nuke zuul-scheduler again | 22:15 |
mordred | no - we should | 22:15 |
mordred | https://review.opendev.org/#/c/723107/ | 22:15 |
mordred | scheduler is in the emergency file | 22:16 |
fungi | zuul==3.18.1.dev104 # git sha 9b300bc8 | 22:16 |
fungi | so basically what the executors say | 22:16 |
mordred | woot | 22:16 |
fungi | except for some reason this pbr gives an 8-character abbreviation of the sha instead of 7 | 22:16 |
mnaser | failed with: No container found for scheduler_1 | 22:17 |
fungi | maybe pbr changed that recently | 22:17 |
mordred | fungi: different git versions | 22:17 |
mordred | fungi: mergers are running debian buster userspace - git 2.20 or something | 22:17 |
mordred | executors are on xenial - git 2.7 or something old | 22:17 |
fungi | ahh | 22:17 |
mordred | (once we're back to being solid - I want to rolling-replace all the executors with focal) | 22:17 |
mordred | but - you know - not today :) | 22:18 |
mnaser | weird | 22:18 |
mnaser | i dont know why your patch failed mordred -- seems like zuul01, which i believe is the scheduler, doesnt have a scheduler container? | 22:18 |
mnaser | "No container found for scheduler_1" | 22:18 |
mordred | hrm | 22:19 |
mnaser | mordred: oh, you somehow dont run docker-compose up after the pull | 22:19 |
mnaser | after the zuul-scheduler image pull, it goes right into zuul-web | 22:19 |
mordred | oh - we need to tell it to start the scheduler in the gate | 22:19 |
mordred | yeah- one sec, I can push up a fix | 22:19 |
mnaser | as an aside, my patches are happily going through check/gate without merger failures, so the git gc change was indeed the culprit | 22:20 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run smart-reconfigure instead of HUP https://review.opendev.org/723107 | 22:23 |
mordred | mnaser, clarkb, fungi ^^ | 22:23 |
mordred | we need that before we can take the scheduler out of emergency | 22:23 |
mordred | mnaser: woot! | 22:23 |
mordred | did we delete the git caches on the executors and mergers? | 22:23 |
mordred | yup - I see fungi said that in scrollback | 22:24 |
mordred | cool | 22:24 |
mordred | I mean - not awesome, since we still can't reproduce it in test | 22:24 |
mordred | and sadly it seems to be an issue that's only on xenial - so it's going to be a bunch of work to fix it for a platform that soon we won't be running on :( | 22:24 |
mordred | (I mean, obviously we currently support xenial, so it's important - but you know - still annoying) | 22:25 |
mnaser | mordred: but also i think corvus left a copy of the repos on the executors that we could inspect | 22:25 |
mordred | cool | 22:26 |
mordred | maybe tomorrow we will not wake up to any fire drills | 22:26 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Use failed_when: false and ignore_errors: true https://review.opendev.org/723109 | 22:29 |
mordred | fungi, clarkb: if you have any more interest pellets - https://review.opendev.org/#/c/723105/ and https://review.opendev.org/#/c/723106/ are cleanup patches from earlier ansible runs | 22:30 |
fungi | mnaser: well, we assume the revert solved it, but it could also technically have been the repository removal (though what got the repos into a sad state if not the gc tuning patch, i don't know) | 22:42 |
mordred | ++ | 22:47 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Support multi-arch image builds with docker buildx https://review.opendev.org/722339 | 23:02 |
* corvus reads scrollback | 23:03 | |
corvus | mordred: yeah, i don't think the gc thing is tied to xenial -- lemme summarize | 23:06 |
corvus | (there were 2 merger related patches) | 23:06 |
corvus | the "fix leaked refs" patch and the "gc" patch. the fix-leaked-refs patch had a flaw that would delete .git/refs, but it only activates on old git. that's the one that's tied to xenial; we fixed that and rolled forward, and then we stopped seeing widespread failures. | 23:08 |
corvus | the "gc" patch applies to all git versions. we saw some errors on a very small number of repos which seem very much like they were caused by aggressive gc. the small number of affected repos could be only because those were the first to hit some gc edge condition, and perhaps others would join them over time. or they could just be in a weird state. | 23:10 |
corvus | since we reverted and deleted the repos on disk, i think either of those are still a possibility. | 23:11 |
corvus | for the record, i just investigated the possibility that the second error was caused by the first -- however i don't think that's the case. on ze12 the first errors all happened to repos other than openstack-operator, and the second errors were all for openstack-operator | 23:22 |
*** tosky has quit IRC | 23:32 | |
fungi | makes sense | 23:35 |
*** DSpider has quit IRC | 23:41 | |
clarkb | newer git is happier when dirs in .git are missing I guess? | 23:44 |