openstackgerrit | Merged opendev/system-config master: Add nodepool node key https://review.opendev.org/723049 | 00:24 |
openstackgerrit | Merged opendev/system-config master: Remove an extra backslash https://review.opendev.org/723052 | 00:24 |
clarkb | 723046 collected the zuul.conf and it looks good but the job failed | 00:26 |
clarkb | its trying to run zuul executor under python2 | 00:27 |
clarkb | mordred: ^ fyi I'll see if it is obvious why after dinner | 00:28 |
*** DSpider has quit IRC | 00:42 | |
*** mlavalle has quit IRC | 01:03 | |
*** elod has quit IRC | 01:06 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: The package is libjemalloc1 https://review.opendev.org/723023 | 01:15 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Fix zuul.conf jinja2 template https://review.opendev.org/723046 | 01:15 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Use pip3 to install zuul on executors https://review.opendev.org/723058 | 01:15 |
clarkb | mordred: ^ I think that will fix it | 01:15 |
clarkb | infra-root we are still seeing merge failures, but these appear to be due to git problems | 01:45 |
clarkb | http://paste.openstack.org/show/792699/ | 01:46 |
clarkb | also the http gerrit driver retries after receiving HTTP 409 errors when submitting votes. I believe that is happening because we were trying to vote on merged changes; we can probably avoid retrying when http 409 is the return code | 01:46 |
clarkb | I don't understand the git problems. permissions look fine. Is that related to the thing corvus said we are now doing to curate those repos possibly? | 01:47 |
clarkb | hrm the thing tobiash did for gc'ing only sets git gc options | 01:51 |
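The gc options that change sets end up in each cached repo's local git config, so they can be inspected directly. A minimal sketch, assuming the usual executor cache layout; the repo path is just one example:

```
# list whatever gc.* settings have been written into a cached repo's config
git -C /var/lib/zuul/executor-git/opendev.org/openstack/openstack-helm-images \
    config --get-regexp '^gc\.'
```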
fungi | rootfs on ze01 is close to full, but not quite so enospc probably isn't it | 01:53 |
fungi | oh, this is saying SHA could not be resolved, git returned: '' | 01:54 |
clarkb | fungi: I think ze03 was doing this before we did any updates to zuul | 01:55 |
clarkb | fungi: so its possible this existed previously and now we are just noticing because merge failures are a red flag right now | 01:55 |
clarkb | however it seems to definitely be doing it more in the fallout of the work | 01:55 |
clarkb | and it doesn't seem like we've done it in the last half hour or so? | 01:56 |
clarkb | the stack ending at https://review.opendev.org/#/c/723023/ is green now after my fix | 01:56 |
clarkb | (which is why I returned to the computer, so maybe I should call it a day for now since the merge issues seem to have settled out?) | 01:56 |
clarkb | er https://review.opendev.org/#/c/723046/ caps the stack | 01:57 |
fungi | where does zuul keep original copies of git repos to clone into workspaces? /var/lib/zuul/git seems empty | 01:57 |
clarkb | fungi: /var/lib/zuul/executor-git I think | 01:57 |
fungi | oh yep it's /var/lib/zuul/executor-git/ | 01:57 |
fungi | too bad that error doesn't say what sha it wanted to reset its state to | 01:58 |
clarkb | fungi: its possible that is logged earlier if you grep on the build or event id | 01:58 |
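One way to pull that context out of the executor debug log, assuming the usual opendev log location; the window size is arbitrary:

```
# show the log lines leading up to the traceback; the build/event id in those
# lines can then be used to grep for the exact ref being reset
sudo grep -B 20 'SHA could not be resolved' /var/log/zuul/executor-debug.log | less
```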
fungi | yeah, likely | 01:59 |
clarkb | fungi: what might be interesting is to recheck that change | 02:00 |
clarkb | https://review.opendev.org/#/c/722540 to see if that is a persistent issue with that repo | 02:00 |
clarkb | but I need to pop out again. likely until tomorrow | 02:00 |
*** rchurch has quit IRC | 03:13 | |
AJaeger | infra-root, 722540 still shows merger-failures, I just rechecked ;( | 05:09 |
*** sgw has quit IRC | 06:44 | |
*** DSpider has joined #opendev | 06:47 | |
AJaeger | we have new config-errors for openstack-tenant - how could those get introduced: https://zuul.opendev.org/t/openstack/config-errors "bad escape \l at position 3" | 07:27 |
AJaeger | seems like a new Zuul check - I'm fixing... | 07:54 |
AJaeger | infra-root, the daily periodic job has failures https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_81d/periodic/opendev.org/opendev/system-config/master/infra-prod-service-codesearch/81d1995/ and | 08:21 |
AJaeger | https://5ee8a394869b896f2ffd-b765094214961b8b34fe718f08477de4.ssl.cf1.rackcdn.com/periodic/opendev.org/opendev/system-config/master/infra-prod-service-eavesdrop/c8e62cc/ | 08:21 |
AJaeger | and accessbot gets skipped | 08:21 |
AJaeger | do we need these jobs daily and hourly? | 08:21 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Disable ovn pypi jobs temporarily https://review.opendev.org/723072 | 09:08 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Revert "Disable ovn pypi jobs temporarily" https://review.opendev.org/723073 | 09:10 |
*** tosky has joined #opendev | 09:36 | |
*** dpawlik has joined #opendev | 09:59 | |
*** avass has quit IRC | 10:16 | |
*** elod has joined #opendev | 10:28 | |
*** dpawlik has quit IRC | 10:36 | |
AJaeger | config-errors are fixed, horizon merged the changes quickly... | 12:54 |
AJaeger | infra-root, now zuul tenant has new errors: https://zuul.opendev.org/t/zuul/config-errors | 12:57 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Fix some config errors in zuul-tenant https://review.opendev.org/723082 | 13:01 |
AJaeger | proposed fix ^ | 13:01 |
mnaser | I’m getting some non deterministic behaviour with zuul saying changes have merge conflicts | 13:15 |
mnaser | example: https://review.opendev.org/#/c/723087/ | 13:16 |
frickler | mnaser: there was an issue with that yesterday evening, but I thought that this should be fixed now | 13:16 |
mnaser | and another example on the first patch set: https://review.opendev.org/#/c/723083/2 | 13:16 |
mnaser | frickler: yeah, this is happening now | 13:16 |
frickler | humm, that looks fresh :( infra-root ^^ | 13:16 |
mnaser | and the parent of those patches are very much tip of the branch | 13:17 |
mnaser | so it wouldn’t make sense (especially for the second patch, revision 1) to say that... | 13:17 |
mnaser | oh I just saw a MERGER_FAIL in zuul status for VEXXHOST tenant | 13:18 |
AJaeger | frickler: 722540 is also recent, see back scroll | 13:18 |
mnaser | I’m also seeing a bunch of retries. Is it possible that a specific executor is unhappy? | 13:18 |
AJaeger | mnaser, so, probably fallout from yesterday and needs more work - best to wait for others to fix | 13:18 |
AJaeger | mnaser: I guess ;) | 13:19 |
mnaser | cool. just wanted to make sure there’s eyes on it | 13:19 |
mnaser | I’ve been running into this issue in my own installation | 13:19 |
mnaser | I wonder if it is a recent zuul bug | 13:19 |
AJaeger | mnaser: might be - I think Zuul was restarted as well | 13:21 |
AJaeger | time for some cake - bbl | 13:22 |
mnaser | yeah it hadn’t been restarted for a while before this and I’ve been dealing with this issue too. | 13:22 |
mnaser | AJaeger: yum! Have fun! | 13:22 |
fungi | (see scrollback) clarkb spotted one of those in the log last night when he was signing off: http://paste.openstack.org/show/792699/ | 13:24 |
fungi | ValueError: SHA could not be resolved, git returned: b'' | 13:25 |
fungi | i can't easily troubleshoot at the moment, though i looked at the executor last night and didn't see anything wrong. i was also having trouble working out from the log what ref it was trying to reset to | 13:26 |
fungi | at the time we thought it could just be a broken repo, as some of the executors had been logging this before the switch to containers/ansiblification | 13:27 |
fungi | but if similar errors are being raised for other repos then maybe it's more widespread | 13:30 |
fungi | mnaser: do you have corresponding tracebacks in your repo? maybe this is a recent regression which came in with our restarts | 13:31 |
fungi | which would make sense... clarkb said he saw ze03 logging problems like this before yesterday's maintenance... and ze03 was rebooted unexpectedly by the provider a few days ago because of a hypervisor host problem | 13:32 |
frickler | this would be the latest change to the merger code that might have an issue https://review.opendev.org/701531 | 13:37 |
frickler | at least it heavily touches the code in clarkb's backtrace | 13:37 |
fungi | merged 2020-04-15 12:46:09 | 13:49 |
fungi | possible ze03 was the first merger restarted after that | 13:50 |
fungi | mnaser: if that's what you're also finding in logs for your deployment, how does your timeline match up? | 13:51 |
mnaser | fungi: let me check logs! | 13:53 |
mnaser | fungi: i dont seem to have logs going back that far but in terms of behaviour its certainly similar to what i was seeing | 13:59 |
mnaser | fungi: perhaps maybe we should issue a notice -- looks like its happening (often) enough: https://zuul.opendev.org/t/openstack/builds?result=MERGER_FAILURE | 14:07 |
AJaeger | like: status notice Zuul is currently failing some jobs with MERGER_FAILURE, this needs investigation by OpenDev team. Please refrain from rechecking until we give the all-clear. | 14:14 |
AJaeger | infra-root ^ | 14:15 |
mordred | AJaeger: ++ | 14:15 |
AJaeger | #status notice Zuul is currently failing some jobs with MERGER_FAILURE, this needs investigation by OpenDev team. Please refrain from rechecking until we give the all-clear. | 14:16 |
openstackstatus | AJaeger: sending notice | 14:16 |
AJaeger | morning, mordred ! | 14:16 |
-openstackstatus- NOTICE: Zuul is currently failing some jobs with MERGER_FAILURE, this needs investigation by OpenDev team. Please refrain from rechecking until we give the all-clear. | 14:16 | |
mordred | mnaser, frickler: tobiash has been running that merger patch for a few months I think - but it also seems reasonable to imagine it might be involved | 14:16 |
mnaser | infra-root: oh also it seems to be hitting some repos harder than others, it may be useful to grab a backup of an affected repo from a merger so we have something to inspect, in order to be able to build tests off of it | 14:16 |
mordred | morning AJaeger ! | 14:16 |
AJaeger | mordred, fungi, could either of you review https://review.opendev.org/723082 and https://review.opendev.org/723072 | 14:16 |
mnaser | just in case it's somehow transient and it disappears and comes back later | 14:17 |
mordred | do we know if we've seen it from both mergers and executors or only from executors? asking because different python versions are involved | 14:17 |
corvus | oh look an alarm | 14:18 |
mnaser | don't think anyone dug that deep yet mordred -- but the only captured traceback earlier was from ze01 | 14:18 |
mordred | nod | 14:18 |
mordred | corvus: morning! | 14:18 |
openstackstatus | AJaeger: finished sending notice | 14:19 |
AJaeger | corvus: sorry, didn't want to wake you up ;) | 14:21 |
AJaeger | morning, corvus ! | 14:21 |
frickler | mnaser: do you have some list of affected repos? my current guess is it affects only repos that only have a master branch | 14:22 |
mnaser | frickler: vexxhost/openstack-operator is one of mine that's being hit, openstack/openstack-helm-images was in the traceback that fungi provided earlier from zm01 -- the link i posted earlier with jobs that hit MERGER_FAILURE lists the bunch | 14:23 |
corvus | https://zuul.opendev.org/t/openstack/builds?result=MERGER_FAILURE | 14:24 |
mordred | kayobe has branches | 14:24 |
corvus | and some of those failures are on non-master branches | 14:24 |
AJaeger | keystone has branches | 14:24 |
openstackgerrit | Merged openstack/project-config master: Fix some config errors in zuul-tenant https://review.opendev.org/723082 | 14:25 |
mnaser | fwiw i think this is a zuul issue so maybe we can discuss in #zuul -- looks like there was a revert proposal so perhaps it was observed in another environment too | 14:26 |
openstackgerrit | Merged openstack/project-config master: Disable ovn pypi jobs temporarily https://review.opendev.org/723072 | 14:27 |
corvus | well, we need to decide what opendev should do | 14:27 |
corvus | should we attempt more interactive debugging, or restart with a revert of the gc change (then, if it still happens, revert the other merger change)? | 14:28 |
mnaser | corvus: i suggested capturing a copy of the git repo on-disk from some of the mergers that are posting tracebacks so we can actually use those later to possibly create tests and simplify investigation | 14:29 |
mnaser | and i guess, depends on how much work people want to do on a saturday :) | 14:29 |
corvus | has anyone looked into this enough to know if we have enough debugging info to reconstruct the series of git operations that got to this point? | 14:30 |
mordred | corvus: restarting with the revert is fairly easy to do on a saturday morning - although - if that patch was the cause of the issue it could have done something to the repo so that we'll still see the issue on repos after a restart - but might not see more repos be impacted | 14:30 |
corvus | i only see a traceback pasted, no merger debug log lines | 14:30 |
mordred | corvus: I do not believe we have gotten that far, no | 14:30 |
mnaser | ^ | 14:30 |
corvus | is this only happening on executors or mergers as well? | 14:31 |
mnaser | unknown info as far as reported here, mordred asked the same earlier too | 14:31 |
mordred | I do think capturing a copy of couple of repos showing the issue wouldn't be a bad idea | 14:32 |
tristanC | it would be good to know what the merger logged before the traceback, there should be `Reset` and `Delete` message that could help understand the issue | 14:32 |
mordred | I have grabbed a copy of openstack-helm-images on ze01 (Which is the one referenced in the traceback) - just in case | 14:33 |
corvus | tristanC: yeah, i just grabbed that http://paste.openstack.org/show/792708/ | 14:33 |
tristanC | corvus: argh, it seems like it is deleting empty dirs too aggressively | 14:35 |
mordred | we are running git 2.7.4 on the executor | 14:38 |
mordred | which is < 2.13 - so would cause the "cleanup empty dirs" logic to run | 14:38 |
mordred | I don't see any exceptions in debug log on zm01 | 14:38 |
mordred | mergers have git 2.20.1 | 14:39 |
mordred | in case those data points are useful | 14:39 |
mnaser | (we seem to be on something from those logs in #zuul :) | 14:39 |
corvus | mordred: yes, all helpful | 14:39 |
mordred | corvus, tristanC: I checked - no Exception in any merger logs | 14:43 |
mordred | (which seems to jibe with the #zuul analysis) | 14:44 |
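A quick way to run that check across all the mergers from bridge; the host pattern and log path are assumptions about the usual layout:

```
# count tracebacks in each merger's debug log; zero everywhere matches the
# finding above
ansible 'zm*' -b -m shell -a 'grep -c Traceback /var/log/zuul/merger-debug.log || true'
```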
corvus | mordred: looking at the code, i think this is unlikely to trip on the mergers because they generally don't see remote refs deleted | 14:44 |
corvus | i also think that means that the executor git repo cache is probably fine | 14:45 |
AJaeger | mordred: infra-prod-remote-puppet-else failed, see https://zuul.opendev.org/t/openstack/build/d14ec247821c4a19b30cde6f4b5fa7cf | 14:45 |
corvus | these errors are probably only happening on the transient build repo | 14:45 |
mordred | corvus: that's good news | 14:49 |
mordred | AJaeger: thanks - will investigate | 14:49 |
mnaser | so given zuul is in emergency, we'll probably have to manually rollout this across mergers once this lands? | 14:50 |
corvus | yeah, though i think i just approved the outstanding changes. but i guess they could run into this error | 14:53 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use python3 for ansible https://review.opendev.org/723105 | 14:54 |
corvus | if the zuul change lands first, we can use ansible to run "git pull --ff-only" and "pip3 install ." | 14:54 |
mordred | once the zuul patch lands, we can manually run the service and then do a restart of the executors | 14:54 |
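corvus's ad-hoc option would look roughly like this from bridge; the host pattern and checkout path are assumptions (the executors run zuul from a git checkout rather than containers):

```
# fast-forward the zuul checkout on every executor and reinstall it with pip3
ansible 'ze*' -b -m shell -a 'cd /opt/zuul && git pull --ff-only && pip3 install .'
```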
mordred | corvus: fwiw - I noticed the warnings while doing ad-hoc ansible to the mergers to check for exceptions in the logs ... made that ^^ | 14:55 |
mordred | mnaser: while I've got you here, could I get you to +A https://review.opendev.org/#/c/721099/ ? we landed the corresponding system-config change yesterday | 14:57 |
AJaeger | mordred: want to merge https://review.opendev.org/721627 and https://review.opendev.org/720719 as well? in case mnaser wants to review more.... | 14:59 |
mnaser | mordred: i am not fully understanding how that job authenticates against eavesdrop | 14:59 |
mordred | AJaeger: yeah - both of those would be good as well - although I'm most concerned about 721099 getting in so we're not in an inconsistent state | 14:59 |
mnaser | there's no secret, the job has no parent | 15:00 |
mordred | mnaser: currently there is a "jenkins" user on eavesdrop and we connect to it with the private ssh key - in the new system, we're using the zuul per-project ssh key - we've added the corresponding public key from project-config to a zuul user on eavesdrop | 15:00 |
mordred | mnaser: here, https://review.opendev.org/#/c/721098/30/playbooks/service-eavesdrop.yaml - zuul-user role, and here: https://review.opendev.org/#/c/721098/30/playbooks/roles/zuul-user/defaults/main.yaml | 15:01 |
mnaser | mordred: maybe this is new for me but zuul has an ssh key per project? :p | 15:02 |
mnaser | if so, really cool and TIL | 15:02 |
mnaser | oh it does | 15:02 |
mordred | and here: https://review.opendev.org/#/c/721098/30/playbooks/group_vars/eavesdrop.yaml | 15:02 |
mordred | mnaser: yes it does! | 15:03 |
mnaser | that's really super cool | 15:03 |
mnaser | TIL | 15:03 |
mnaser | ok so this explains a lot then! | 15:03 |
mordred | mnaser: so you do an add_host in a playbook, and zuul will try to ssh to those hosts by default with a per-project key | 15:03 |
mordred | mnaser: yah - we're using it for the new zuul-driven-cd we're doing | 15:03 |
mnaser | mordred: cool, i have a few uses for that too then :) | 15:04 |
mordred | we do add_host with bridge.openstack.org (our bastion host) - but on bridge we've put in the public keys for the system-config and project-config repos - so only jobs triggered by those repos can ssh in | 15:04 |
mordred | it's pretty amazeballs | 15:04 |
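For reference, zuul-web serves those per-project public keys; a sketch assuming the standard endpoint, with the tenant and project as examples:

```
# fetch the SSH public key zuul generated for a project; this is the key that
# gets added to the target host's authorized_keys
curl -s https://zuul.opendev.org/api/tenant/openstack/project-ssh-key/opendev/system-config.pub
```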
*** ildikov has quit IRC | 15:06 | |
*** jroll has quit IRC | 15:06 | |
*** ildikov has joined #opendev | 15:06 | |
*** jroll has joined #opendev | 15:07 | |
mnaser | ++ | 15:11 |
corvus | mnaser: https://zuul-ci.org/docs/zuul/reference/jobs.html#project-key | 15:12 |
mnaser | oh and only added in post-review | 15:12 |
mnaser | even nicer. | 15:12 |
corvus | there's a tenant ssh key in the design, but we haven't implemented it yet | 15:13 |
mordred | AJaeger: the puppet error you reported is because we replaced a dir with a symlink and rsync is unhappy about that | 15:15 |
openstackgerrit | Merged openstack/project-config master: Use zuul deployment keys for yaml2ical https://review.opendev.org/721099 | 15:16 |
mordred | corvus, fungi: I think to fix that I need to run an ad-hoc ansible to remove the old dir and then re-run the puppet | 15:17 |
mordred | infra-root: http://paste.openstack.org/show/792711/ I'd like to run that | 15:19 |
mordred | then the next normal runs should be able to rsync state appropriately | 15:20 |
mordred | (this is also the error in service-nodepool atm fwiw) | 15:21 |
openstackgerrit | Merged opendev/system-config master: Use pip3 to install zuul on executors https://review.opendev.org/723058 | 15:24 |
openstackgerrit | Merged opendev/system-config master: The package is libjemalloc1 https://review.opendev.org/723023 | 15:28 |
openstackgerrit | Merged opendev/system-config master: Fix zuul.conf jinja2 template https://review.opendev.org/723046 | 15:28 |
mordred | corvus: ^^ I think that's all of the things | 15:30 |
fungi | okay, back from the wilds of my yard-jungle for a breather, before i have to head back into battle | 15:30 |
mordred | corvus: want me to remove from emergency and re-run the zuul service playbook? | 15:31 |
fungi | caught up on scrollback, and sounds like you do think it's a recent zuul regression? | 15:31 |
mordred | fungi: if you have a sec, I'd like to run http://paste.openstack.org/show/792711/ to fix an issue with a dir changing to a symlink and making rsync sad | 15:32 |
corvus | mordred: sure | 15:32 |
mordred | fungi: and yes - we've approved a presumptive fix, but are still trying to make a reproducing test | 15:32 |
openstackgerrit | Merged openstack/project-config master: Run accessbot script on accessbot channels update https://review.opendev.org/721627 | 15:32 |
fungi | mordred: yep, i concur, that sounds sensible for the rsync symlink problem | 15:32 |
openstackgerrit | Merged openstack/project-config master: Start building focal images https://review.opendev.org/720719 | 15:32 |
fungi | and 723104 looks like the probable fix we'll try for the executors? | 15:32 |
corvus | yep | 15:32 |
mordred | corvus: the commits are on bridge | 15:33 |
fungi | and the outstanding fixes from yesterday's unplanned maintenance all just merged | 15:33 |
corvus | mordred can run the playbook now before it lands and we can confirm everything we fixed yesterday is still fixed | 15:33 |
corvus | mordred: what commits? | 15:33 |
mordred | I have removed zuul from the emergency | 15:33 |
mordred | the ones we just landed | 15:33 |
mordred | the system-config ones | 15:33 |
corvus | mordred: ah you mean they have propagated, great thx | 15:33 |
fungi | 723058, 723023 and 723046 for system-config | 15:33 |
mordred | running service-zuul | 15:34 |
fungi | not from the screen session i guess | 15:35 |
mordred | oh - no, that would have been more smarter | 15:35 |
fungi | no worries, i'll just use my imagination ;) | 15:36 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Cron module wants strings https://review.opendev.org/723106 | 15:38 |
mordred | fungi, corvus : everything has run successfully | 15:38 |
fungi | and that cron patch is something you just noticed when running i guess? | 15:38 |
mordred | zuul.conf on zm01 looks correct | 15:38 |
mordred | yup | 15:39 |
mordred | as is its parent | 15:39 |
mordred | so - non-urgent - just responding to nagging from ansible | 15:39 |
mnaser | fyi zuul api is replying with 500s | 15:39 |
mordred | that's non-awesome | 15:39 |
mnaser | https://zuul.opendev.org/api/tenant/zuul/status | 15:40 |
mnaser | we'll probably have a trace in zuul-web somewhere | 15:40 |
mordred | gear.GearmanError: Unable to submit job to any connected servers | 15:40 |
mnaser | ok so either gearman has pooped out or maybe ca/crt/keys are not properly setup/added into the zuul-web container? | 15:41 |
corvus | the scheduler is also not connected to gearman | 15:42 |
corvus | zuul 22260 1.1 0.0 0 0 ? Z Apr24 11:50 [zuul-scheduler] <defunct> | 15:42 |
corvus | it's probably that process | 15:42 |
fungi | was just about to mention that | 15:42 |
fungi | yeah, probably the geard fork | 15:42 |
corvus | this is likely an unrecoverable error and we will need a full restart and to restore from a queue backup | 15:43 |
corvus | i'm really curious what could have caused it | 15:43 |
mordred | yeah. me too | 15:43 |
fungi | /var/log/zuul/gearman-server.log is empty | 15:43 |
fungi | the rotated log has some tracebacks related to statsd though | 15:44 |
mordred | corvus: we have a "reload zuul scheduler" handler in the ansible that fires when main.yaml changes | 15:44 |
fungi | AttributeError: 'Server' object has no attribute 'statsd' | 15:44 |
yoctozepto | I see you already know what I was about to say :-) | 15:44 |
mordred | corvus: it runs "docker-compose kill -s HUP scheduler" | 15:44 |
fungi | but those stop at 2020-04-24 22:42:48,486 which i think was when we restarted | 15:45 |
mordred | corvus: perhaps that is not doing the right thing? | 15:45 |
corvus | mordred: yeah, for example, if that sends a HUP to all processes in the container, that's bad | 15:45 |
mordred | I betcha it does | 15:45 |
fungi | it's a good bet | 15:45 |
mordred | we probably want to do an exec scheduler kill | 15:45 |
corvus | the old version of this only HUP'd the actual scheduler process (via the pidfile) | 15:45 |
mordred | to run the kill inside the container | 15:46 |
mordred | yeah | 15:46 |
corvus | how about | 15:46 |
corvus | we run zuul-scheduler smart-reconfigure | 15:46 |
mordred | ooh - that's a great idea | 15:46 |
corvus | or whatever reconfigure we intend to run :) | 15:46 |
fungi | far safer than process signals anyway | 15:46 |
corvus | i think smart-reconfigure is probably what we want | 15:46 |
mordred | smart-reconfigure seems like a great choice | 15:46 |
mordred | yeah | 15:46 |
mordred | patch coming | 15:47 |
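What the new handler should boil down to on the scheduler host, roughly; the compose directory and service name are assumptions based on the merger setup discussed later in this log:

```
# ask the running scheduler to reconfigure via its command socket instead of
# signalling the whole container
cd /etc/zuul-scheduler && docker-compose exec scheduler zuul-scheduler smart-reconfigure
```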
mnaser | i guess for now we'll have to backup/restore to restart scheduler | 15:47 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run smart-reconfigure instead of HUP https://review.opendev.org/723107 | 15:47 |
fungi | we should have queue backups, unless the cronjob fix wasn't already in place | 15:47 |
corvus | it was definitely in place :) | 15:47 |
mordred | corvus: I should probably put the scheduler back in emergency until that lands yes? | 15:47 |
corvus | whether it works i dunno | 15:48 |
corvus | mordred: would anything else trigger it? i think we can probably just proceed assuming that's the next thing that will land and therefore be okay | 15:48 |
mordred | -rw-r--r-- 1 root root 825 Apr 25 15:48 openstack_status_1587829681.json | 15:48 |
mordred | corvus: kk | 15:48 |
mordred | looks like we have a backup | 15:48 |
fungi | 15:48 is probably too new? | 15:48 |
corvus | mordred: on second thought, i guess any old project-config change could trigger this, so maybe emergency is a good idea | 15:48 |
mordred | nope | 15:48 |
mordred | looks like we have a 500 | 15:49 |
mordred | corvus: kk. doing | 15:49 |
mnaser | file size will probably be different | 15:49 |
corvus | mordred: 107 needs a fix | 15:49 |
mordred | -rw-r--r-- 1 root root 376412 Apr 25 15:37 openstack_status_1587829021.json | 15:49 |
mordred | looks better | 15:49 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run smart-reconfigure instead of HUP https://review.opendev.org/723107 | 15:50 |
corvus | i'm not sure how we're sposed to use those json files now | 15:51 |
corvus | cause the changes script does api calls | 15:51 |
corvus | i think we'd have to modify the script | 15:51 |
mordred | corvus: they're mounted in the container in case that's helpful | 15:52 |
corvus | ? | 15:52 |
corvus | i'm just saying if we try to run python /opt/zuul/tools/zuul-changes.py file:///var/lib/zuul/backup/openstack_status_1587829021.json | 15:52 |
corvus | we get urllib2.URLError: <urlopen error [Errno 20] Not a directory: '/var/lib/zuul/backup/openstack_status_1587829021.json/api/info'> | 15:53 |
corvus | so there's hacking to be done if we want to try to re-enqueue from those files | 15:53 |
mordred | oh - duh. I get what you're saying | 15:53 |
corvus | so is that worth doing, or should we restart with data loss? | 15:53 |
corvus | oh there's a ~root/zuul-changes-static.py script | 15:54 |
corvus | but that has to be run with every tenant+pipeline combo | 15:54 |
corvus | how about i do that for openstack check and gate and stop there | 15:55 |
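The zuul-changes helpers ultimately emit one enqueue command per change, so the restore is a series of commands along these lines; the change number and patchset are placeholders, and the exact client options may differ:

```
# re-enqueue a single change into the openstack tenant's check pipeline,
# run from the scheduler's compose directory
docker-compose exec scheduler zuul enqueue --tenant openstack --trigger gerrit \
    --pipeline check --project opendev/system-config --change 723046,2
```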
corvus | i'll restart the scheduler now | 15:55 |
mordred | corvus: ++ | 15:56 |
corvus | erm, the statsd errors are showing up on a scheduler start | 15:56 |
corvus | oh, no, maybe i'm wrong about that and those are zombie docker error messages | 15:57 |
mnaser | thats reassuring :) | 15:57 |
mordred | the debug log says statsd enabled | 15:57 |
corvus | yeah, i dunno where those came from, but i don't see them now | 15:58 |
fungi | there were definitely some in the gearman-server.log from before yesterday's last restart | 16:00 |
corvus | fungi: yeah, i think they are side effects of broken connections | 16:01 |
corvus | re-enqueueing | 16:01 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Support multi-arch image builds with docker buildx https://review.opendev.org/722339 | 16:54 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Use failed_when: false instead of ignore_errors: true https://review.opendev.org/723109 | 16:54 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use python3 for ansible https://review.opendev.org/723105 | 16:59 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Cron module wants strings https://review.opendev.org/723106 | 16:59 |
mnaser | https://review.opendev.org/#/c/723104/ has landed, do we want to restart executors with it? | 17:01 |
corvus | mnaser, mordred, tristanC: yeah -- looks like that's installed on the executors now, so i will run the restart mergers and executors playbook | 17:24 |
mnaser | corvus: ok cool, i have a patch that somehow always manages to fail miserably | 17:24 |
corvus | mnaser: excellent! :) | 17:25 |
mnaser | ftr https://review.opendev.org/#/c/723083/2 is the really sad patch | 17:27 |
corvus | grr, the stop playbook doesn't wait for them to stop ( | 17:28 |
corvus | i should not have used those | 17:28 |
corvus | just waiting for 2 stragglers to stop | 17:33 |
mnaser | coolz | 17:36 |
corvus | okay, all stopped. starting now. | 17:37 |
corvus | mnaser: they're up if you want to recheck | 17:37 |
mnaser | corvus: ok cool let me try it out | 17:37 |
mnaser | just rechecked 723083 and will watch status | 17:38 |
mnaser | ah dang | 17:38 |
mnaser | opendev-buildset-registry (2. attempt) | 17:38 |
mnaser | opendev-buildset-registry (3. attempt) | 17:39 |
mnaser | only 2 jobs managed to start | 17:39 |
mnaser | sorry 3, 2 are in merger_failure and 1 still waiting | 17:39 |
corvus | ok i'll check the logs on opendev-buildset-registry for that and make sure it's the same error | 17:40 |
corvus | ze01 | 17:41 |
corvus | http://paste.openstack.org/show/792715/ | 17:43 |
mnaser | wonder if ze01 is the culprit, that's two traces from it (and eventually things seem to work) | 17:43 |
corvus | let me grep for "Unable to merge" across the executors | 17:44 |
mnaser | oh hey that's another error tho, that's useful! | 17:44 |
corvus | yeah, i don't know if that helps us triangulate the error, or if we've popped one problem off the stack and this is the next one :| | 17:44 |
mnaser | yeah. this seems a lot more useful of an error message anyhow | 17:45 |
corvus | mnaser: the only time i see 'not something we can merge' is for vexxhost/openstack-operator | 17:46 |
mnaser | corvus: could it be possible that because im the first recheck to be done since that fix we implemented | 17:46 |
corvus | mnaser: did we confirm your earlier error on openstack-operator was the same one that was pasted earlier? | 17:46 |
corvus | mnaser: could be | 17:46 |
corvus | but i'm also wondering if that repo just happens to have something else wrong with it | 17:46 |
mnaser | corvus: we did not confirm my error to be the same as the helm one which was the traceback provided | 17:46 |
corvus | mnaser: ok. maybe we should recheck that helm change | 17:47 |
corvus | i do see the 'not something we can merge' error across many executors | 17:47 |
mnaser | it does seem to be a pattern of hitting specific repos -- https://zuul.opendev.org/t/openstack/builds?result=MERGER_FAILURE | 17:47 |
mnaser | i think it was openstack-helm-images so i see a change there with merger failures so ill recheck that | 17:47 |
mnaser | https://review.opendev.org/#/c/722540/ specifically | 17:47 |
corvus | mnaser: great, thanks; let me know if/when it hits and i'll search again; meanwhile, i'll continue digging | 17:48 |
mnaser | corvus: also, something to provide signal -- 722769,7 has a lot of retries in the check pipeline, an openstack/openstack-helm change | 17:48 |
mnaser | it could be a bad job though, because it's at 4 attempts, so i wonder if they override retries and are just failing a lot in pre :( | 17:49 |
mnaser | openstack-helm-images jobs have all started except for 1 which is still queued, but signaling a much better overall situation | 17:52 |
corvus | okay, i've turned on keep on ze01 and will direct-enqueue 723083 and hope a job lands there | 17:52 |
corvus | got build b2f966145388493ab63aaa9187ba1317 on ze01 with that error | 17:57 |
corvus | oops, not that one, that's too early | 17:58 |
mnaser | i wonder if the on-disk copy of the repo has somehow become unclean | 17:59 |
mnaser | we got a few more chances of landing there | 17:59 |
corvus | git fsck on a copy of the cached repo is clean | 18:01 |
mnaser | corvus: we lose :( all jobs queued up unless the build/functional jobs end up failing but that'll be in a little bit till the registry goes up /images built / etc | 18:02 |
corvus | i'll dequeue and re-enqueue | 18:02 |
corvus | on the bright side, i haven't seen "SHA could not be" since before 1700 utc | 18:05 |
mnaser | thats good! | 18:05 |
mnaser | we got a merger_failure! hopefully ze1? | 18:05 |
corvus | doesn't look like it; i may need to increase the odds | 18:05 |
corvus | i'll wait for that linters job, then i'll set keep everywhere if it doesn't hit ze01 | 18:07 |
corvus | okay let's try 6011e23339bf4938b1d5f12d0334087a on ze07 | 18:12 |
mnaser | neat | 18:16 |
corvus | the git object in FETCH_HEAD definitely isn't there; if i manually fetch it again it works | 18:17 |
mnaser | corvus: this is beyond my git and zuul internal but what puts the object there in the first place? | 18:20 |
mnaser | i assume once a change is queued, zuul clones from review.opendev.org (or fetches that head) and once its sure that its there, it plops it down in /var/lib/zuul/builds/foo/ | 18:21 |
corvus | narrowing down on a reproducer -- i've switched back to a copy of the cached git repo (ignoring the build repo for now), i get this: http://paste.openstack.org/show/792716/ | 18:21 |
mnaser | sounds like gc could be involved | 18:22 |
corvus | i'm somewhat inclined to just delete this repo from all the executors and see if it happens again | 18:25 |
mnaser | corvus: i wouldn't be opposed. it seems like the cached repo may have kinda hit some weird stage | 18:26 |
corvus | i have low confidence that will fix it, but since it's the only one so far, i'd like to eliminate that as a possibility | 18:26 |
corvus | aaaand there's another one: | 18:26 |
corvus | stderr: 'fatal: not something we can merge in .git/FETCH_HEAD: a57bca5811d4bb4b0bf7d4b8e7f46cd922eb4430'refs/changes/31/723031/3' of ssh://review.opendev.org:29418/openstack/kolla-ansible' | 18:26 |
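That error names the exact fetch that failed, so the manual check corvus mentioned earlier is just the same two steps in a scratch clone; anonymous https is fine for poking at it by hand:

```
# fetch the change ref from the error, then merge whatever FETCH_HEAD points at;
# the merger failure above is the second step not finding the fetched object
git fetch https://review.opendev.org/openstack/kolla-ansible refs/changes/31/723031/3
git merge --no-edit FETCH_HEAD
```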
corvus | so on second thought, i think we should revert the gc change, then remove all the cached repos. because they're all now configured with gc settings that aren't going to be reverted with the revert of the change. | 18:27 |
mnaser | corvus: i | 18:28 |
mnaser | i'm in agreement | 18:28 |
mnaser | i think the gc is def involved in this | 18:28 |
corvus | i've got a copy of that repo on ze07 and ze01 for us to experiment with | 18:29 |
mnaser | corvus: i suspect as more repos gc on fetches, we'll probably see this start popping up more, the relative quietness of the weekend may have masked it | 18:30 |
corvus | yep | 18:30 |
AJaeger | shall I propose reverting tobiash's gc change? https://review.opendev.org/#/c/722272/ | 18:32 |
AJaeger | corvus: I see your suggestion above, let me propose | 18:32 |
corvus | already did | 18:32 |
corvus | AJaeger: https://review.opendev.org/723110 | 18:32 |
AJaeger | corvus: ok, I'll abandon... | 18:33 |
mnaser | corvus: i dont know how much time you have right now :) but maybe we'll have to figure out if we're going to wipe/rotate executor-by-executor to avoid a big thundering herd at gerrit | 18:35 |
mnaser | i'm not sure if opendev has done this much in the past | 18:35 |
corvus | mnaser: it's not that bad even when we're busy; | 18:36 |
mnaser | oh right because specifically for the executors we clone ad-hoc | 18:36 |
corvus | (it's basically we just have to clone nova and neutron 12 times :) | 18:36 |
corvus | yeah, that helps a lot | 18:36 |
mnaser | cool | 18:37 |
* mnaser is going to go find something to distract for next 40-50 minutes as that change lands | 18:37 | |
corvus | infra-root: i am about to start a multi-hour gap; if anyone else is able to ensure that https://review.opendev.org/723110 lands, all executors and mergers are stopped, all repos in /var/lib/zuul/executor-git and wherever they are on the mergers are deleted, and then executors and mergers started, that would be great :) | 18:38 |
AJaeger | looking at http://grafana.openstack.org/d/T6vSHcSik/zuul-status?orgId=1 we've lost a couple mergers and executors and got them back after the restart | 18:38 |
corvus | i probably won't be able to return until after midnight utc | 18:38 |
mnaser | corvus: thanks for everything and the summary, ill try and keep eyes on 723110 :) | 18:39 |
mordred | corvus: I'm in and out over the next bit - but I do have coverage before midnight - so i'll keep an eye on it too | 18:45 |
fungi | okay, i'm in for the day, got as much yardwork in as i could before the weather changed. catching back up and can help if stuff's still wacky | 19:31 |
fungi | and looks like we're presently trying to merge 723110 | 19:34 |
fungi | and it's about done in the gate | 19:35 |
fungi | and it's merged | 19:38 |
fungi | now for the deploy | 19:39 |
fungi | oh, promote goes first | 19:39 |
fungi | though i suppose our executors don't need the images | 19:40 |
mnaser | fungi: yep, executors dont use containers right now, and promote is done so we should be ready for the follow up steps afaik | 19:41 |
fungi | well, the zuul source code still needs to end up on the executors and get installed | 19:42 |
fungi | at which point we can stop, clean and start them all | 19:42 |
mnaser | fungi: ah, right! but i suspect that might not happen because zuul is in emergency | 19:51 |
mnaser | i *think* | 19:51 |
mnaser | so we might need to git pull it in there | 19:51 |
fungi | it's not, i double-checked | 19:53 |
fungi | so ansible will in theory update it on its usual schedule | 19:54 |
* AJaeger is a bit puzzled - we have not seen any further MERGER_FAILURE for some time, have we? | 19:55 |
AJaeger | https://zuul.opendev.org/t/openstack/builds?result=MERGER_FAILURE | 19:55 |
fungi | the last infra-prod-service-zuul started at 2020-04-25T19:10:35 | 19:57 |
fungi | so the next one should be pretty soon | 19:57 |
* AJaeger had expected more problems after the restart and the debugging earlier, so I'm not completely convinced yet that the gc patch is the problem. But corvus' debugging was conclusive... | 19:58 |
fungi | https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-service-zuul | 19:58 |
AJaeger | But it's late here, I might be confused | 19:58 |
fungi | so probably in another ~10 minutes from now | 19:59 |
AJaeger | great, fungi! | 19:59 |
* AJaeger signs off again... | 19:59 | |
fungi | have a good night, AJaeger! | 19:59 |
AJaeger | have a nice weekend | 19:59 |
AJaeger | thanks, fungi - you as well in a few hours ;) | 20:00 |
fungi | looks like it's just started running the build | 20:14 |
fungi | the revert commit is starting to show up in the local clone of zuul on the executors now | 20:15 |
fungi | and done, double-checking it got installed | 20:19 |
fungi | zuul==3.18.1.dev104 # git sha 9b300bc | 20:20 |
fungi | that's the right commit | 20:20 |
fungi | okay, stopping the executors | 20:20 |
fungi | waiting for them to finish shutting down | 20:24 |
* tristanC crosses finger | 20:25 | |
fungi | okay, no remaining zuul-owned processes on any of the twelve other than leaked ssh-agent daemons | 20:32 |
fungi | time to start deleting | 20:32 |
fungi | oh, need to stop all mergers too, i almost missed that | 20:32 |
fungi | this should be faster | 20:32 |
tristanC | fungi: i was about to ask... perhaps this can be done in a second step? | 20:33 |
tristanC | fungi: e.g. remove executor-git cache, restart executor, then repeat for merger | 20:33 |
fungi | maybe, though i don't know that it makes any difference | 20:35 |
fungi | i just need to take a moment to refresh my memory on how to use docker-compose to stop these since they're in containers | 20:35 |
fungi | looks like it should be `sudo docker-compose down` | 20:36 |
fungi | except i think it wants a very specific pwd | 20:37 |
fungi | cd /etc/zuul-merger/ && sudo docker-compose down | 20:41 |
fungi | that seems to do the trick | 20:41 |
fungi | and they're all stopped now | 20:42 |
fungi | on to git repo deletions | 20:42 |
fungi | looks like it's still /var/lib/zuul/git/* on the mergers | 20:43 |
fungi | doing those first | 20:43 |
fungi | okay, i have those and also /var/lib/zuul/executor-git/* on the executors all recursively deleting in parallel | 20:50 |
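One way to run those deletions in parallel from bridge; the host patterns are assumptions, the paths are the ones named above:

```
# wipe the merger caches and the (much larger) executor caches
ansible 'zm*' -b -m shell -a 'rm -rf /var/lib/zuul/git/*'
ansible 'ze*' -b -m shell -a 'rm -rf /var/lib/zuul/executor-git/*'
```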
fungi | mergers have all finished, some executors are still deleting | 20:52 |
fungi | going ahead and bringing the mergers back up | 20:53 |
fungi | huh... does docker-compose up not close its copies of parent fds? | 20:54 |
fungi | seems remotely running it via an ssh command line hangs indefinitely after "Creating zuul-merger_merger_1 ... done" | 20:54 |
fungi | i guess i'll just do them in interactive sessions | 20:55 |
fungi | huh, even running docker-compose up from an interactive shell never disassociates | 20:58 |
fungi | and sigint stops the container again | 20:58 |
fungi | this is not good | 20:58 |
tristanC | fungi: there may be a cli arg to make docker-compose detach | 20:58 |
fungi | yeah, i'm doing some reading | 20:58 |
fungi | manpage says: -d Detached mode. Run container in the background, print new container name. | 21:00 |
fungi | and it's `docker-compose up -d` not `docker-compose -d up` in case anyone was wondering. the latter just prints usage instructions and not even an informative error | 21:02 |
fungi | okay, looks like all the mergers are started again | 21:05 |
fungi | the executors also finished their deletions whilst i was fumbling around with docker-compose so starting them back up as well | 21:06 |
fungi | and they're all running again too now | 21:08 |
fungi | #status log restarted all zuul executors and mergers on 9b300bc | 21:09 |
openstackstatus | fungi: finished logging | 21:09 |
clarkb | fungi: on the mergers note that you may need to docker-compose pull before doing the up in order to run the new thing. I expect ansible did that for you | 21:43 |
clarkb | fungi: I think you can do a docker ps -a and cross check the image hash against what is on dockerhub to double check though | 21:44 |
clarkb | or docker exec /usr/bin/python3 /usr/local/bin/pbr freeze | grep zuul ? | 21:44 |
mordred | clarkb: the ansible should have done a pull - but it's good to double-check | 22:13 |
mnaser | moment of truth | 22:13 |
mnaser | im running a recheck on the super broken patch of mine | 22:13 |
fungi | Error: No such container: /usr/bin/python3 | 22:13 |
fungi | i'll need to read up on how that works | 22:13 |
mordred | docker-compose exec executor pbr freeze | grep zuul | 22:14 |
mordred | or docker-compose exec merger pbr freeze | grep zuul | 22:14 |
fungi | zuul-merger_merger_1 | 22:14 |
fungi | ? | 22:14 |
mordred | if you use docker-compose to do it it'll do it for you | 22:14 |
fungi | ahh, yeah | 22:15 |
mnaser | speaking of which | 22:15 |
mordred | root@zm01:/etc/zuul-merger# docker-compose exec merger pbr freeze | grep zuul | 22:15 |
mordred | ftr | 22:15 |
mnaser | did we merge the smart-reconfigure patch | 22:15 |
mnaser | before we nuke zuul-scheduler again | 22:15 |
mordred | no - we should | 22:15 |
mordred | https://review.opendev.org/#/c/723107/ | 22:15 |
mordred | scheduler is in the emergency file | 22:16 |
fungi | zuul==3.18.1.dev104 # git sha 9b300bc8 | 22:16 |
fungi | so basically what the executors say | 22:16 |
mordred | woot | 22:16 |
fungi | except for some reason this pbr gives an 8-character abbreviation of the sha instead of 7 | 22:16 |
mnaser | failed with: No container found for scheduler_1 | 22:17 |
fungi | maybe pbr changed that recently | 22:17 |
mordred | fungi: different git versions | 22:17 |
mordred | fungi: mergers are running debian buster userspace - git 2.20 or something | 22:17 |
mordred | executors are on xenial - git 2.7 or something old | 22:17 |
fungi | ahh | 22:17 |
mordred | (once we're back to being solid - I want to rolling-replace all the executors with focal) | 22:17 |
mordred | but - you know - not today :) | 22:18 |
mnaser | weird | 22:18 |
mnaser | i dont know why your patch failed mordred -- seems like zuul01, which i believe is the scheduler, doesnt have a scheduler container? | 22:18 |
mnaser | "No container found for scheduler_1" | 22:18 |
mordred | hrm | 22:19 |
mnaser | mordred: oh, you somehow dont run docker-compose up after the pull | 22:19 |
mnaser | after the zuul-scheduler image pull, it goes right into zuul-web | 22:19 |
mordred | oh - we need to tell it to start the scheduler in the gate | 22:19 |
mordred | yeah- one sec, I can push up a fix | 22:19 |
mnaser | as an aside, my patches are happily going through check/gate without merger failures, so the git gc change was indeed the culprit | 22:20 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run smart-reconfigure instead of HUP https://review.opendev.org/723107 | 22:23 |
mordred | mnaser, clarkb, fungi ^^ | 22:23 |
mordred | we need that before we can take the scheduler out of emergency | 22:23 |
mordred | mnaser: woot! | 22:23 |
mordred | did we delete the git caches on the executors and mergers? | 22:23 |
mordred | yup - I see fungi said that in scrollback | 22:24 |
mordred | cool | 22:24 |
mordred | I mean - not awesome, since we still can't reproduce it in test | 22:24 |
mordred | and sadly it seems to be an issue that's only on xenial - so it's going to be a bunch of work to fix it for a platform that soon we won't be running on :( | 22:24 |
mordred | (I mean, obviously we currently support xenial, so it's important - but you know - still annoying) | 22:25 |
mnaser | mordred: but also i think corvus left a copy of the repos on the executors that we could inspect | 22:25 |
mordred | cool | 22:26 |
mordred | maybe tomorrow we will not wake up to any fire drills | 22:26 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Use failed_when: false and ignore_errors: true https://review.opendev.org/723109 | 22:29 |
mordred | fungi, clarkb: if you have any more interest pellets - https://review.opendev.org/#/c/723105/ and https://review.opendev.org/#/c/723106/ are cleanup patches from earlier ansible runs | 22:30 |
fungi | mnaser: well, we assume the revert solved it, but it could also technically have been the repository removal (though what got the repos into a sad state if not the gc tuning patch, i don't know) | 22:42 |
mordred | ++ | 22:47 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Support multi-arch image builds with docker buildx https://review.opendev.org/722339 | 23:02 |
* corvus reads scrollback | 23:03 | |
corvus | mordred: yeah, i don't think the gc thing is tied to xenial -- lemme summarize | 23:06 |
corvus | (there were 2 merger related patches) | 23:06 |
corvus | the "fix leaked refs" patch and the "gc" patch. the fix-leaked-refs patch had a flaw that would delete .git/refs, but it only activates on old git. that's the one that's tied to xenial; we fixed that and rolled forward, and then we stopped seeing widespread failures. | 23:08 |
corvus | the "gc" patch applies to all git versions. we saw some errors on a very small number of repos which seem very much like they were caused by aggressive gc. the small number of affected repos could be only because those were the first to hit some gc edge condition, and perhaps others would join them over time. or they could just be in a weird state. | 23:10 |
corvus | since we reverted and deleted the repos on disk, i think either of those are still a possibility. | 23:11 |
corvus | for the record, i just investigated the possibility that the second error was caused by the first -- however i don't think that's the case. on ze12 the first errors all happened to repos other than openstack-operator, and the second errors were all for openstack-operator | 23:22 |
*** tosky has quit IRC | 23:32 | |
fungi | makes sense | 23:35 |
*** DSpider has quit IRC | 23:41 | |
clarkb | newer git is happier when dirs in .git are missing I guess? | 23:44 |