mordred | BUT- if it starts growing its own set of dependencies - then it might want to have its own container image | 00:00 |
---|---|---|
mordred | ircbot and statusbot are using the same libs right? | 00:00 |
clarkb | no, one is limnoria, the other is python irclib or whatever it is called | 00:01 |
clarkb | there may be some overlap in other libs like ssl things | 00:01 |
ianw | mordred: i mean not really; ircbot is really a limnoria container. we'd *like* statusbot to be a limnoria plugin but it isn't | 00:01 |
mordred | ah. I mean - from my pov - that sounds like two different images. but- if it's not a problem to co-install right now, then shrug - wait until it's a problem, yeah? | 00:02 |
ianw | i mean we do have to override the entrypoint, which is a bit ugly | 00:02 |
opendevreview | Merged opendev/system-config master: Remove special x/ handling patch in gerrit https://review.opendev.org/c/opendev/system-config/+/791995 | 00:03 |
mordred | any reason to avoid making a new image? just the overhead of the dockerfiles and the jobs and whatnot? | 00:03 |
*** odyssey4me has quit IRC | 00:03 | |
ianw | yeah, it's mostly me wanting to be done with this post-haste. but i think i will just make the separate image to avoid confusion | 00:04 |
mordred | I hear you :) | 00:06 |
*** odyssey4me has joined #opendev | 00:12 | |
clarkb | ianw: fungi re 791995 we should be covering that in testing with https://opendev.org/opendev/system-config/src/branch/master/testinfra/test_gerrit.py#L29-L33 | 00:13 |
clarkb | and https://gerrit.googlesource.com/gerrit/+/b1f4115304a3820be434a6201da57e4508862f82 is the upstream commit to fix things | 00:14 |
ianw | ++ | 00:15 |
clarkb | I had to go and check tripleo to convince myself it is probably ok | 00:15 |
ianw | i'll probably look at pulling/restarting in ~4 hours? | 00:16 |
clarkb | wfm, but I'll not be around :) I'm also happy to try and get it done tomorrow if you end up busy | 00:16 |
fungi | i may be around then, but also may not be very useful if so | 00:18 |
clarkb | ianw: I would take note of the image that is currently running so that a rollback is straightforward if necessary | 00:18 |
clarkb | I checked via our config mgmt stuff to see if we clean up images like we do with zuul and we don't appear to do that with gerrit so we should be able to easily rollback if necessary by overriding the image in the docker-compose file then reverting the change above and updating our image promotion | 00:19 |
ianw | ++ | 00:24 |
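What clarkb describes boils down to something like the following (a sketch only; the container and service names are assumptions, not taken from the log):

```shell
# Record the digest of the image the running container was started from,
# so a rollback target is known (container name assumed):
docker inspect --format '{{.Image}}' gerrit-compose_gerrit_1

# To roll back, pin that digest in the docker-compose file instead of the
# floating 3.2 tag, e.g.
#   image: docker.io/opendevorg/gerrit@sha256:<recorded digest>
# then pull and restart the service (service name assumed):
docker-compose pull gerrit
docker-compose up -d gerrit
```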
opendevreview | Ian Wienand proposed opendev/statusbot master: Add container image build https://review.opendev.org/c/opendev/statusbot/+/795428 | 00:37 |
*** ysandeep has joined #opendev | 00:41 | |
*** ysandeep is now known as ysandeep|ruck | 00:41 | |
ianw | mnaser: ^ we know f34 containerfile boots but after that ... i'm sure you'll let us know of any issues :) note that to use containerfile you'll want to make sure /var/lib/containers is mounted; see https://opendev.org/zuul/nodepool/src/branch/master/playbooks/nodepool-functional-container-openstack/templates/docker-compose.yaml.j2 | 00:42 |
ianw | it's all extremely new so documentation and bug reports all welcome | 00:42 |
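In rough terms, the mount ianw points at looks like this; the linked nodepool docker-compose template is the real reference, and the image name and extra flags here are assumptions:

```shell
# The containerfile element drives podman inside the builder, so the builder
# container needs /var/lib/containers persisted from the host; roughly:
docker run -d --privileged \
  -v /var/lib/containers:/var/lib/containers \
  docker.io/zuul/nodepool-builder:latest
```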
opendevreview | Ian Wienand proposed opendev/statusbot master: Add container image build https://review.opendev.org/c/opendev/statusbot/+/795428 | 00:57 |
ianw | Temporary failure resolving 'debian.map.fastlydns.net' Could not connect to deb.debian.org:80 (151.101.250.132), connection timed out | 01:25 |
ianw | i wonder if there's still issues | 01:25 |
opendevreview | Merged opendev/statusbot master: Add container image build https://review.opendev.org/c/opendev/statusbot/+/795428 | 01:51 |
opendevreview | Ian Wienand proposed opendev/system-config master: Run statusbot from eavesdrop01.opendev.org https://review.opendev.org/c/opendev/system-config/+/795213 | 02:19 |
*** ysandeep|ruck is now known as ysandeep|away | 02:53 | |
*** ykarel|away has joined #opendev | 03:30 | |
opendevreview | Ian Wienand proposed opendev/system-config master: Run statusbot from eavesdrop01.opendev.org https://review.opendev.org/c/opendev/system-config/+/795213 | 03:36 |
ianw | ok the current container on review is "Image": "sha256:57df55aec1eb7835bf80fa6990459ed4f3399ee57f65b07f56cabb09f1b5e455", | 03:48 |
ianw | the latest docker.io/opendevorg/gerrit:3.2 is https://hub.docker.com/layers/opendevorg/gerrit/3.2/images/sha256-9cbb7c83155b41659cd93cf769275644470ce2519a158ff1369e8b0eebe47671?context=explore that was pushed a few hours ago, that lines up | 03:50 |
ianw | i've pulled the latest image now | 03:54 |
ianw | reporting itself as 3.2.10-21-g6ce7d261e1-dirty | 03:58 |
ianw | opendevorg/gerrit 3.2 sha256:9cbb7c83155b41659cd93cf769275644470ce2519a158ff1369e8b0eebe47671 7a921f417d3c 4 hours ago 811MB | 03:59 |
ianw | "Image": "sha256:7a921f417d3cf7f9e1aa602e934fb22e8c0064017d3bf4f5694aafd3ed8d163c", | 04:00 |
ianw | ergo the container is running the image that has the same digest as the upstream tag. i.e. we're done :) | 04:00 |
ianw | #status restarted gerrit to pick up changes for https://review.opendev.org/c/opendev/system-config/+/791995 | 04:01 |
opendevstatus | ianw: unknown command | 04:01 |
ianw | #status log restarted gerrit to pick up changes for https://review.opendev.org/c/opendev/system-config/+/791995 | 04:01 |
opendevstatus | ianw: finished logging | 04:01 |
*** ysandeep|away has quit IRC | 04:18 | |
*** ysandeep|away has joined #opendev | 04:18 | |
opendevreview | Ian Wienand proposed opendev/statusbot master: Dockerfile: correct config command line https://review.opendev.org/c/opendev/statusbot/+/795473 | 04:29 |
*** odyssey4me has quit IRC | 04:31 | |
*** ysandeep|away is now known as ysandeep|ruck | 04:36 | |
opendevreview | Merged opendev/statusbot master: Dockerfile: correct config command line https://review.opendev.org/c/opendev/statusbot/+/795473 | 04:47 |
opendevreview | Ian Wienand proposed opendev/system-config master: Run statusbot from eavesdrop01.opendev.org https://review.opendev.org/c/opendev/system-config/+/795213 | 05:06 |
*** marios has joined #opendev | 05:10 | |
opendevreview | Merged openstack/project-config master: Move devstack-gate jobs list to in-tree https://review.opendev.org/c/openstack/project-config/+/795379 | 05:30 |
*** ykarel|away has quit IRC | 05:41 | |
*** ykarel|away has joined #opendev | 05:41 | |
*** ykarel|away is now known as ykarel | 05:41 | |
opendevreview | Merged opendev/system-config master: Cleanup ask.openstack.org https://review.opendev.org/c/opendev/system-config/+/795207 | 05:42 |
*** ykarel is now known as ykarel|mtg | 06:07 | |
*** whoami-rajat has joined #opendev | 06:12 | |
*** ralonsoh has joined #opendev | 06:29 | |
*** osmanlic- has joined #opendev | 06:45 | |
*** osmanlicilegi has quit IRC | 06:45 | |
*** odyssey4me has joined #opendev | 06:55 | |
*** hashar has joined #opendev | 06:58 | |
*** dklyle has quit IRC | 06:58 | |
*** osmanlic- has quit IRC | 07:01 | |
*** osmanlicilegi has joined #opendev | 07:05 | |
*** amoralej|off is now known as amoralej | 07:13 | |
*** andrewbonney has joined #opendev | 07:16 | |
*** jpena|off is now known as jpena | 07:31 | |
*** rav has joined #opendev | 07:49 | |
rav | How do i rebase master branch to another branch in my repo? | 07:50 |
*** tosky has joined #opendev | 07:53 | |
frickler | rav: what would be the use case for that? you usually rebase other branches onto master (likely before merging them into master), but rebasing master seems very unusual? also, is this a general git question or somehow specific to opendev? | 07:57 |
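In plain git terms, the usual direction frickler describes is to rebase the feature branch onto master, not the other way around; for example:

```shell
# "my-feature" is a placeholder branch name
git fetch origin
git checkout my-feature
git rebase origin/master   # replay my-feature's commits on top of master
```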
*** ysandeep|ruck is now known as ysandeep|lunch | 08:05 | |
rav | So I did a clone, then I did git checkout "stable/wallaby", then I made some changes, but when I pushed them the changes went to master instead of stable/wallaby. In general git I know how to merge to a branch, but opendev seems to not work the same way for me | 08:08 |
*** lucasagomes has joined #opendev | 08:10 | |
tosky | when you say "push", do you mean "git review"? And when you did git checkout stable/wallaby, did git really create a local stable/wallaby branch which points to the remote stable/wallaby branch? | 08:21 |
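A quick way to answer tosky's second question (assuming the remote is the usual "origin"):

```shell
# Show local branches and what they track; stable/wallaby should track
# origin/stable/wallaby rather than being a stray branch off master:
git branch -vv

# If no such local branch exists yet, create one that tracks the remote:
git checkout -b stable/wallaby origin/stable/wallaby
```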
*** ykarel|mtg is now known as ykarel | 08:30 | |
rav | Yes its git review | 08:39 |
rav | I think its creating a local branch | 08:41 |
*** ysandeep|lunch is now known as ysandeep | 08:50 | |
*** ysandeep is now known as ysandeep|ruck | 08:50 | |
*** boistordu_ex has joined #opendev | 08:55 | |
ysandeep|ruck | #opendev is this the right channel to report if some jobs are stuck on zuul? | 09:21 |
*** rpittau|afk is now known as rpittau | 09:22 | |
ysandeep|ruck | the top jobs in the tripleo queue are stuck even though they are not running | 09:24 |
ysandeep|ruck | i think i should try my luck on #zuul | 09:25 |
rav | How to delete a branch in opendev?? | 09:28 |
tobiash[m] | ysandeep|ruck: looks like something might be stuck in the queue processing there. I guess an infra-root needs to look at that. (however most are located in US timezones) | 09:29 |
ysandeep|ruck | tobiash[m]: thank you so much for sharing that, I will reping in a few hours then.. | 09:31 |
*** swest has joined #opendev | 09:32 | |
*** hjensas has joined #opendev | 09:35 | |
frickler | tobiash[m]: ysandeep|ruck: we did a gerrit restart earlier, maybe that broke something in zuul processing | 09:48 |
rav | Can someone tell me how to delete a branch in OpenDev.org? TIA | 09:51 |
*** hashar has quit IRC | 09:52 | |
tobiash[m] | ysandeep|ruck: if you need to unblock urgently you could abandon/restore the top change, but I guess it would be helpful for the analysis to leave it until someone had a chance to look at that if it's not urgent | 09:52 |
ysandeep|ruck | tobiash[m], frickler thanks! we can wait for a few hours, so that infra-root can check and fix the issue permanently. | 09:54 |
frickler | rav: likely an admin would have to do that, which branch in particular are you referring to? | 09:54 |
rav | stable/wallaby | 09:55 |
rav | I have the admin rights | 09:55 |
rav | in my own repo | 09:55 |
frickler | rav: sorry, I should have been more specific: which repo? | 09:55 |
rav | https://opendev.org/x/networking-infoblox/src/branch/master | 09:55 |
rav | frickler: this is the branch | 09:58 |
frickler | rav: hmm, o.k., I'm actually not sure what our procedure is for dealing with this type of issue in the x/ tree, so I'll have to defer to some other infra-root, should be around in a couple of hours | 10:02 |
rav | ok | 10:02 |
frickler | infra-root: checking zuul logs I found these errors, I see no direct relation to the tripleo stuckness, but likely worth looking at anyhow http://paste.openstack.org/show/806483/ | 10:06 |
frickler | these look even more suspicious: Exception while removing nodeset from build <Build 60b0ee70b6f34543a85719b673929635 of n | 10:07 |
frickler | ova-tox-functional-py38 ... No job nodeset for nova-tox-functional-py38 | 10:08 |
tobiash[m] | frickler: can you paste the stack trace? | 10:08 |
frickler | tobiash[m]: http://paste.openstack.org/show/806484/ I hope I selected the proper context | 10:11 |
frickler | to me it looks like zuul is in some weird broken state in general | 10:11 |
tobiash[m] | do you see some recurring exceptions in the live log of the scheduler? | 10:12 |
frickler | I've also earlier wondered why https://review.opendev.org/c/openstack/devstack-gate/+/795383 doesn't go into gate | 10:12 |
frickler | tobiash[m]: that seems to repeat every minute or so for multiple jobs, yes | 10:12 |
tobiash[m] | hrm, that exception is cancel related, are there other recurring exceptions as well other than the one in removeJobNodeSet? | 10:13 |
tobiash[m] | maybe something mentioning the run handler | 10:14 |
frickler | tobiash[m]: well there are those about semaphores I posted before that http://paste.openstack.org/show/806483/ | 10:14 |
tobiash[m] | is there anything suspicious when grepping the log for 4ef4a09fbb50478f8c7f6bfee2fb3926 (which is the event id of the stuck item in tripleo) | 10:17 |
frickler | tobiash[m]: not in today's log, checking yesterday now | 10:20 |
*** ysandeep|ruck is now known as ysandeep|afk | 10:33 | |
frickler | so the semaphore errors have been present for some time and thus are likely unrelated. found nothing else yet, will check back later | 10:41 |
*** ysandeep|afk is now known as ysandeep|ruck | 11:06 | |
*** jpena is now known as jpena|lunch | 11:44 | |
fungi | rav: make sure the .gitreview file on that branch mentions the correct branch as its defaultbranch, or alternatively use the first command-line argument with git review to tell it what branch you're proposing for | 12:03 |
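Concretely, either of the two options fungi mentions looks roughly like this (the .gitreview keys other than defaultbranch are elided):

```shell
# Option 1: .gitreview on the stable/wallaby branch should carry something like
#   [gerrit]
#   ...
#   defaultbranch=stable/wallaby
#
# Option 2: name the target branch explicitly when proposing the change:
git review stable/wallaby
```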
fungi | rav: branch deletion is a permission which can be delegated in the project acl, but generally yes what frickler said, you want to be careful with branch deletions, and should probably tag them before deleting if they had anything useful on them | 12:04 |
fungi | ysandeep|ruck: frickler: tobiash[m]: the recent common cause i've seen for stuck items in pipelines is that one of the nodepool launchers has managed to not unlock the node request after thinking it's either satisfied or declined it | 12:05 |
fungi | if you can find which launcher last tried to service those node requests, then restarting its container frees up the locks | 12:06 |
*** ysandeep|ruck is now known as ysandeep|mtg | 12:07 | |
fungi | i have yet to be able to identify the cause of that behavior in nodepool, but it seems to happen most when there are lots of boot failures, so likely some sort of race around that | 12:07 |
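A sketch of that recovery procedure; the log path, compose directory, and service name are assumptions about the usual opendev launcher deployment, not taken from the log:

```shell
# On each launcher, check whether it last handled the stuck node request
# (substitute the real request id from the scheduler log for the placeholder):
grep '<node-request-id>' /var/log/nodepool/launcher-debug.log

# On the launcher holding the stale lock, bounce its container so the
# zookeeper lock is released:
cd /etc/nodepool-launcher   # wherever the compose file lives (assumption)
docker-compose restart nodepool-launcher
```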
fungi | though this case looks different, i see the stuck builds aren't waiting for nodes | 12:18 |
*** amoralej is now known as amoralej|lunch | 12:19 | |
fungi | the "Exception: No job nodeset for" occurrences, while vast in number, are likely unrelated since they stretch back at least several days in the scheduler's logs at similar levels... 338913 so far today, 530287 last thursday | 12:26 |
*** artom has quit IRC | 12:28 | |
fungi | whatever's happening seems to stall after result processing... looking at the in progress tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates build for 795302,3 currently at the top of the check pipeline for the openstack tenant, this is the last thing the scheduler logged about it: | 12:30 |
fungi | 2021-06-09 01:10:26,804 DEBUG zuul.Scheduler: Processing result event <zuul.model.BuildStatusEvent object at 0x7f267ce9f940> for build fa45a5d15d404267b912be4bc14b3734 | 12:30 |
fungi | so that build seems to have completed over 11 hours ago but hasn't been fully marked as such and so the buildset hasn't reported | 12:31 |
fungi | interestingly, a different build (7fbf525fe42b4f11849933cb606212a3) for that same nodeset completed a minute later and seems to have gone on to register a "Build complete, result ABORTED" in the log | 12:37 |
fungi | er, for that same buildset i mean | 12:37 |
fungi | tobiash[m]: as for your speculation in #zuul about lost result events, this does seem like it could be a case of it | 12:38 |
*** weshay|ruck has joined #opendev | 12:44 | |
*** dviroel has joined #opendev | 12:47 | |
opendevreview | Benjamin Schanzel proposed zuul/zuul-jobs master: Add a meta log upload role with a failover mechanism https://review.opendev.org/c/zuul/zuul-jobs/+/795336 | 12:48 |
*** jpena|lunch is now known as jpena | 12:49 | |
*** dviroel is now known as dviroel|brb | 13:02 | |
*** arxcruz|rover is now known as arxcruz | 13:03 | |
*** amoralej|lunch is now known as amoralej | 13:07 | |
*** ykarel has quit IRC | 13:19 | |
fungi | i could try to manually dequeue the affected items, or worst case perform a scheduler restart, but would prefer to wait for corvus and clarkb to be around in case the running state provides us some clues as to the cause which we could otherwise lose | 13:23 |
fungi | also hoping tobiash[m] has input, since he seemed to have suspicions as to the cause already (the affected build was seen as generating a result event but the scheduler never logged processing it, and the results queue is showing 0, suggesting the result was indeed lost somewhere along the way) | 13:25 |
*** rav has quit IRC | 13:26 | |
tobiash[m] | fungi: dequeue should be sufficient | 13:26 |
fungi | er, rather, the scheduler logged that it was processing it, but never continued to log what the result actually was | 13:26 |
tobiash[m] | fungi: afaik the result events now go through zk so maybe there was a zk session loss during that time? | 13:27 |
fungi | ahh, good point, i'll check the logs for the zk cluster members | 13:27 |
tobiash[m] | fungi: I'd start with grepping the scheduler logs for 'ZooKeeper connection' | 13:30 |
fungi | interestingly we seem to mount /var/zookeeper/logs as /logs in the zk containers, but the directory is completely empty | 13:30 |
fungi | tobiash[m]: yeah, not finding any signs of a zk connection problem in the scheduler debug log, at least | 13:33 |
tobiash[m] | ok, that's good, so there is maybe some edge case during result event processing | 13:33 |
fungi | i'm going to need to step away soon for a bit to take care of some errands, but hopefully shouldn't be gone more than an hour-ish | 13:35 |
*** ysandeep|mtg is now known as ysandeep | 13:43 | |
corvus | o/ | 13:43 |
fungi | looks like zk is just logging to stdout/stderr, so `docker-compose logs` has it, though quite verbose | 13:43 |
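The two log checks being discussed amount to roughly this (paths and the compose service name are assumptions about the usual deployment):

```shell
# On the zuul scheduler: look for session/connection state changes around
# the incident window (log path assumed):
grep 'ZooKeeper connection' /var/log/zuul/debug.log

# On a zk cluster member: the server logs only to stdout/stderr, so read it
# through compose and filter for disconnects and exceptions:
docker-compose logs --timestamps zk | grep -iE 'exception|connection reset'
```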
fungi | mornin' corvus! | 13:44 |
fungi | we have what looks like a lost result for build fa45a5d15d404267b912be4bc14b3734 and possibly others | 13:44 |
fungi | trying to work out what could have precipitated it losing track of that | 13:45 |
fungi | it's basically stuck in a permanent running state because the scheduler didn't evaluate the result reported back in gearman | 13:46 |
fungi | looking in the zk cluster member logs, there was some sort of connection event around 01:28 which all of them logged, though that was 18 minutes after the build returned a result. i guess if the scheduler had a backlog in the results queue at that point due to a reconfiguration event or something then maybe the writes to zk were lost "somehow" ? (i'm a bit lost really) | 13:57 |
*** ykarel has joined #opendev | 14:00 | |
fungi | okay, taking a quick break now to run errands, but should return soon | 14:05 |
*** tcervi has joined #opendev | 14:13 | |
*** tosky has quit IRC | 14:16 | |
*** tosky has joined #opendev | 14:18 | |
*** tcervi has quit IRC | 14:42 | |
ysandeep | #opendev are you still investigating the stuck tripleo gate? tobiash[m] earlier mentioned that if we abandon/restore the top patch in the gate queue it would clear the queue. Should we go that route? | 14:52 |
ysandeep | weshay|ruck, fyi.. ^^ | 14:53 |
weshay|ruck | happy to abandon all the patches in gate if that will help infra | 14:53 |
tobiash[m] | you'd just need to abandon/restore the top item, that should cause a gate reset and re-run all the other items | 14:54 |
*** hashar has joined #opendev | 14:57 | |
weshay|ruck | tobiash[m], ysandeep just confirming it was just the top patch that needs a reset? https://review.opendev.org/c/openstack/tripleo-heat-templates/+/794634/ | 15:00 |
weshay|ruck | just FYI.. if I had known that earlier.. I would have done that happily :) | 15:00 |
*** dviroel|brb is now known as dviroel | 15:01 | |
ysandeep | weshay|ruck, tobiash[m] suggested we wait for infra-root to debug first.. let's see what infra-root says | 15:01 |
tobiash[m] | weshay|ruck: yes, that worked, feel free to restore it again | 15:02 |
ysandeep | tobiash[m], thanks! | 15:02 |
frickler | corvus: ^^ since IIUC clarkb is familying, it would be up to you to decide whether you want to debug further or let folks try to get their queue unstuck | 15:03 |
*** dklyle has joined #opendev | 15:04 | |
clarkb | I'm sort of here but ya slow start today | 15:05 |
clarkb | looks like they may have gone ahead and done the abandon and restore? the top of the tripleo queue appears to be moving at least | 15:10 |
ysandeep | clarkb, yes weshay|ruck abandoned/restored the top one .. the gate queue was getting really long | 15:14 |
fungi | ysandeep: weshay|ruck: sorry, i had to take a break from investigating to run some errands, and was hoping we could keep it in that state to make it easier for others to possibly identify the cause so we can hopefully keep it from happening again in the future | 15:15 |
weshay|ruck | ah k | 15:15 |
fungi | but maybe the logs we have will still lend a clue | 15:16 |
weshay|ruck | roger that.. hope I didn't mess you up | 15:16 |
weshay|ruck | that was not clear to me | 15:16 |
fungi | it's fine, wouldn't expect you to read dozens of lines of scrollback | 15:16 |
weshay|ruck | ya.. my friggin bouncer is down :( | 15:16 |
weshay|ruck | hardware :( | 15:16 |
weshay|ruck | fungi, so if it gets stuck again.. do you want us to hold it there? | 15:17 |
weshay|ruck | for investigation? | 15:17 |
fungi | weshay|ruck: at least give us a heads up again before resetting it, yeah | 15:18 |
weshay|ruck | k.. I'll make sure we're communicating that.. thanks for the help / support :) | 15:19 |
fungi | yw | 15:21 |
fungi | and thanks for bringing it to our attention earlier | 15:21 |
clarkb | fungi: does the zk blip line up with the gerrit 500s reported in -infra time wise? | 15:22 |
clarkb | I wonder if there was a disturbance in the force | 15:22 |
fungi | it's possible there are more examples still stuck, i'll check for them once i get done eating | 15:22 |
clarkb | a network flap in that region could explain a number of these issues people are reporting possibly | 15:22 |
clarkb | and even if the times don't line up maybe there was some ongoing or rolling set of network updates | 15:23 |
frickler | clarkb: fungi: another example to look at may be https://review.opendev.org/c/openstack/devstack-gate/+/795383 which doesn't start gating even after I rebased and re-workflowed it. different error at first, but possibly related | 15:25 |
fungi | clarkb: no, the lost result came into the results queue at 01:10 utc, the zk logs i was looking at were from 01:28 utc, the reports of gerrit and gitea problems were around 10:10-10:15 utc | 15:25 |
frickler | and with that I'm mostly off for today | 15:25 |
fungi | thanks frickler! | 15:25 |
fungi | and yeah, i started to look at that d-g change, the last thing zuul logged was that it could not merge | 15:25 |
fungi | so possibly unrelated | 15:25 |
clarkb | fungi: also I think the zk connection status goes into syslog if you need an easy way to check that in the future | 15:26 |
clarkb | s/status/state changes/ | 15:27 |
fungi | it didn't seem to be in syslog that i could spot at least | 15:27 |
fungi | but docker-compose was capturing it all | 15:27 |
clarkb | the +A on the d-g change is not from when tripleo had a sad either, looks like | 15:28 |
clarkb | fungi: frickler I reapproved 795383 and it is enqueued now | 15:38 |
*** ysandeep is now known as ysandeep|away | 15:50 | |
opendevreview | Hervé Beraud proposed openstack/project-config master: Temporarly reverting project template to finalize train-em https://review.opendev.org/c/openstack/project-config/+/795583 | 15:57 |
clarkb | fungi: frickler: is it possible zuul also lost connectivity to the gerrit event stream and missed that approval event? | 16:01 |
tobiash[m] | fungi, clarkb, corvus : this is another (still) live item that looks stuck as well and doesn't block a gate pipeline: https://zuul.opendev.org/t/openstack/status/change/795302,3 | 16:01 |
tobiash[m] | so if you need to live debug I guess this can be used as well | 16:01 |
tobiash[m] | same goes for the first few items in check | 16:01 |
clarkb | tobiash[m]: ya looks like everything that is older than ~14 hours in check? | 16:02 |
tobiash[m] | so to me it looks like there was some kind of event 14-16h ago that led to a handful of stuck items | 16:02 |
clarkb | tobiash[m]: we have evidence of network issues between gerrit and its database at a different time (about 6 hours ago) | 16:02 |
tobiash[m] | clarkb: yes | 16:02 |
opendevreview | Hervé Beraud proposed openstack/project-config master: Dropping revert https://review.opendev.org/c/openstack/project-config/+/795586 | 16:03 |
clarkb | I'm starting to think that there may have been widespread/rolling network problems and this is all fallout from that | 16:03 |
tobiash[m] | maybe fallout from the fastly cdn issue | 16:03 |
clarkb | apparently zoom is having trouble too | 16:04 |
tobiash[m] | at least network issues would be the best case since that would mean that we didn't introduce a regression in event handling | 16:04 |
clarkb | I'm also seeing incredibly slow downloads from my suse mirror for system updates | 16:06 |
corvus | i'm back from breakfast and will continue looking into this now | 16:06 |
clarkb | corvus: thanks, note the changes in openstack's check pipeline if you want to see some in action | 16:06 |
*** ysandeep|away has quit IRC | 16:06 | |
*** hjensas is now known as hjensas|afk | 16:06 | |
clarkb | I'll hold off on trying to reset those until you give a go ahead | 16:06 |
clarkb | (the ones older than 14 hours are likely to be hit by this) | 16:06 |
corvus | yes, thanks; especially when something is "stuck" it's helpful to have it stay stuck to try to figure out why | 16:07 |
*** ykarel has quit IRC | 16:07 | |
corvus | i'm going to ignore the previous gate issue for now and focus on 795302 instead | 16:07 |
fungi | the current zk logs on zk04 logged a rather lengthy java.io.IOException backtrace at 2021-06-09 01:28:33,550 | 16:11 |
fungi | seems to be a "Connection reset by peer" event | 16:11 |
fungi | the uptime of all 3 zk servers is a couple months though, so none of them rebooted | 16:11 |
clarkb | fungi: zk will reset connections if there is an internal ping timeout | 16:12 |
clarkb | and those internal ping timeouts could be caused by network instability | 16:13 |
fungi | similar one on zk05 at 01:28:31,083 | 16:13 |
fungi | (though it also has a prior one in its log from 2021-06-04 13:33:34,950) | 16:13 |
*** lucasagomes has quit IRC | 16:14 | |
*** marios is now known as marios|out | 16:15 | |
fungi | none in the logs on zk06 however | 16:15 |
fungi | so yes, more anecdotal evidence of "network instability" but no smoking gun | 16:15 |
*** rpittau is now known as rpittau|afk | 16:16 | |
clarkb | fungi: if you look at the client side (a lot more work since we have a few) we might see evidence that those talking to 04 and 05 disconnected but those connected to 06 were ok? | 16:17 |
fungi | got it, so this could be client disconnects not intra-cluster | 16:20 |
corvus | looking at the zk watches graph, i don't see any shift happening around 13:30 | 16:21 |
corvus | if there were client disconnects i would expect that graph to rebalance | 16:21 |
corvus | oh i'm looking at the wrong time; 13:30 is from a long time ago | 16:22 |
corvus | 1:30 is the relevant time | 16:22 |
fungi | yep, seems it could be clients... immediately after the disconnects i see nb01.opendev.org and nb03.opendev.org | 16:23 |
*** marios|out has quit IRC | 16:23 | |
fungi | reauthenticating | 16:23 |
fungi | also nb02.opendev.org | 16:24 |
fungi | so all the builders reconnected within seconds after the 3 disconnects. before the disruption one was connected to zk04 and two to zk05, but on reconnecting one of them moved from 05 to 06 | 16:25 |
fungi | at least that's what i infer from the zk logs anyway | 16:25 |
fungi | nice that we have recognizable cn fields in the client auth certs | 16:26 |
corvus | we haven't seen any indication the scheduler had a problem connecting with zk though, right? | 16:27 |
fungi | i found none, no | 16:28 |
fungi | nothing in the scheduler logs to indicate it, and i found only 3 connection reset exceptions in the zk server logs all of which are accounted for by the nb servers reconnecting | 16:29 |
clarkb | the build result events are recorded by executors now though right? | 16:29 |
clarkb | maybe the executor for those jobs failed to write to zk properly and the scheduler never saw things update? | 16:29 |
corvus | yeah, i'm heading over to the executor to look now | 16:30 |
corvus | fa45a5d15d404267b912be4bc14b3734 on ze02 | 16:31 |
corvus | 16:32:02 up 14:26, 1 user, load average: 5.25, 4.70, 4.88 | 16:32 |
corvus | that's suspicious | 16:32 |
corvus | possible executor hard reboot around the time | 16:32 |
corvus | we may be looking at a case of an improperly closed tcp connection, so gearman didn't see the disconnect? | 16:33 |
tobiash[m] | I thought I had implemented a two way keep alive years ago | 16:34 |
corvus | 2021-06-09 01:28:42,748 is the last log entry on ze02 before: | 16:35 |
corvus | 2021-06-09 02:06:17,890 DEBUG zuul.Executor: Configured logging: 4.4.1.dev9 | 16:35 |
corvus | (which is a startup log message) | 16:35 |
corvus | also there's some binary garbage between the two | 16:35 |
corvus | tobiash: me too | 16:36 |
tobiash[m] | did it reboot or got stuck? | 16:37 |
clarkb | the uptime indicates a reboot | 16:37 |
clarkb | 16:32:02 up 14:26 | 16:37 |
tobiash[m] | I think I've occasionally seen stuck executors that kept the gearman connection but didn't do anything, but that was long ago | 16:37 |
*** jpena is now known as jpena|off | 16:38 | |
corvus | definitely a reboot; and no syslogs preceding it, so i'm assuming a forced reboot from the cloud | 16:38 |
corvus | like crash+boot | 16:38 |
corvus | i don't see any extra TCP connections between the scheduler and ze02 | 16:39 |
corvus | the gearman server log is empty | 16:40 |
corvus | geard should have detected that and sent a work_fail packet to the scheduler | 16:45 |
corvus | i don't know why it didn't, but the fact that we have zero geard logs will make that very hard to debug | 16:45 |
clarkb | corvus: did the geard logs possibly end up in the debug log? | 16:46 |
clarkb | corvus: fungi: I finally managed to check up on emails and both ze02 and the gerrit mysql had host problems | 16:48 |
clarkb | I guess that means not widespread networking issues but hosting problems. Good to have that info to point at though showing it didn't happen in a vacuum | 16:49 |
fungi | clarkb: ooh, good thinking, i hadn't checked that inbox yet today | 16:49 |
fungi | but in the "widespread network disruption bucket" this failure just got mentioned in #openstack-release https://zuul.opendev.org/t/openstack/build/8f1a900189f547c688259da9fcafa712 | 16:49 |
corvus | clarkb: i don't see any geard logs there, but i did find that the loop that's supposed to detect and clean up this condition is broken: | 16:49 |
corvus | http://paste.openstack.org/show/806500/ | 16:50 |
corvus | that is repeated X times in the log... I'm guessing X == the number of stuck jobs right now | 16:50 |
corvus | 11 times in the logs | 16:50 |
corvus | i count 9 stuck jobs | 16:51 |
fungi | seems like about the right order of magnitude, yeah | 16:51 |
fungi | one we know was reset by abandon/restore | 16:52 |
fungi | maybe the other got a new patchset in the meantime | 16:52 |
corvus | good point; i don't know if the abandoned job is still in that list; could be | 16:52 |
*** amoralej is now known as amoralej|off | 16:54 | |
corvus | oh i see | 16:55 |
corvus | we're right between two v5 ZK changes: we have result events in zk, but build requests in gearman | 16:55 |
corvus | a client disconnect is a gearman result event | 16:56 |
corvus | and they're effectively ignored | 16:56 |
fungi | oh neat! | 16:56 |
corvus | so it's very likely that geard did send the WORK_FAIL packet; and the scheduler ignored it since "real" results come from zk now | 16:56 |
corvus | tobiash: ^ | 16:57 |
fungi | yeah, that seems like a pretty good explanation of exactly what we saw | 16:57 |
corvus | the cleanup in http://paste.openstack.org/show/806500/ should have caught it even so, but it has a fatal flaw | 16:59 |
corvus | let me see if i can patch that real quick | 16:59 |
opendevreview | Michael Johnson proposed opendev/system-config master: Removing openstack-state-management from statusbot https://review.opendev.org/c/opendev/system-config/+/795596 | 16:59 |
corvus | patch in #zuul | 17:03 |
fungi | thanks! | 17:04 |
*** artom has joined #opendev | 17:04 | |
opendevreview | Michael Johnson proposed openstack/project-config master: Removing openstack-state-management from gerritbot https://review.opendev.org/c/openstack/project-config/+/795599 | 17:04 |
*** andrewbonney has quit IRC | 17:05 | |
johnsom | Hi opendev neighbors. I have posted the follow up patches to retire the openstack-state-management channel in favor of using openstack-oslo as discussed on the discuss mailing list. I don't have op privilege on the openstack-state-management channel to update the topic. | 17:06 |
johnsom | Can someone update the topic to something similar to "The channel is retired. Please join us in #openstack-oslo" ? | 17:07 |
fungi | we can take care of it, sure. i'll do that in a moment | 17:07 |
johnsom | Thank you! | 17:07 |
fungi | johnsom: can you also take a look at https://docs.opendev.org/opendev/system-config/latest/irc.html#renaming-an-irc-channel and comment it out in the accessbot config? | 17:09 |
fungi | johnsom: i've adjusted the topic now | 17:09 |
johnsom | fungi Ok, I wasn't sure on that part. Updating... | 17:10 |
fungi | it's new, capabilities on oftc differ from freenode so we had to adjust our channel renaming process, and decided to start aging out our channel registrations | 17:10 |
opendevreview | Michael Johnson proposed openstack/project-config master: Removing openstack-state-management from the bots https://review.opendev.org/c/openstack/project-config/+/795599 | 17:11 |
johnsom | Updated | 17:11 |
fungi | appreciated! | 17:12 |
opendevreview | Mohammed Naser proposed opendev/system-config master: Add Fedora 34 mirrors https://review.opendev.org/c/opendev/system-config/+/795602 | 17:27 |
clarkb | mnaser: question on ^ | 17:29 |
mnaser | yes, yes we do clarkb :) | 17:29 |
opendevreview | Mohammed Naser proposed opendev/system-config master: Add Fedora 34 mirrors https://review.opendev.org/c/opendev/system-config/+/795602 | 17:30 |
mnaser | i will update nodepool to build f34 | 17:30 |
mnaser | and then move f32 to f34 | 17:31 |
*** ralonsoh has quit IRC | 17:34 | |
opendevreview | Mohammed Naser proposed openstack/project-config master: Build images for Fedora 34 https://review.opendev.org/c/openstack/project-config/+/795604 | 17:35 |
*** amoralej|off has quit IRC | 17:42 | |
*** whoami-rajat has quit IRC | 18:13 | |
opendevreview | Mohammed Naser proposed zuul/zuul-jobs master: Switch jobs to use fedora-34 nodes https://review.opendev.org/c/zuul/zuul-jobs/+/795636 | 18:15 |
opendevreview | Mohammed Naser proposed opendev/base-jobs master: Switch fedora-latest to use fedora-34 https://review.opendev.org/c/opendev/base-jobs/+/795639 | 18:18 |
opendevreview | Mohammed Naser proposed openstack/project-config master: Stop launch fedora-32 nodes nodepool https://review.opendev.org/c/openstack/project-config/+/795643 | 18:30 |
opendevreview | Mohammed Naser proposed openstack/project-config master: Remove fedora-32 disk image config https://review.opendev.org/c/openstack/project-config/+/795644 | 18:30 |
mnaser | infra-core: i just pushed up everything needed to get rid of f32 | 18:30 |
mnaser | might need some rechecks here and there because of the dependencies | 18:30 |
mnaser | it's all covered here - https://review.opendev.org/q/hashtag:%22fedora-34%22+(status:open%20OR%20status:merged) | 18:31 |
clarkb | mnaser: thanks, I'll try to review them | 18:32 |
*** amoralej|off has joined #opendev | 18:54 | |
*** timburke_ is now known as timburke | 18:58 | |
*** amoralej|off has quit IRC | 19:11 | |
*** hashar has quit IRC | 19:39 | |
*** amoralej|off has joined #opendev | 19:50 | |
*** dviroel is now known as dviroel|brb | 20:23 | |
*** amoralej|off has quit IRC | 20:53 | |
*** dviroel|brb is now known as dviroel | 21:09 | |
ianw | mnaser: thanks! | 22:11 |
ianw | one thing is that i think we'll need to update our builders to mount /var/lib/containers | 22:11 |
ianw | to use the containerfile element | 22:11 |
clarkb | ianw: do we also need to mount in cgroup stuff from sysfs? | 22:12 |
clarkb | or is that automagic since the host container depends on it? | 22:12 |
ianw | yeah i've not had to do that in local testing, and we don't do that in the gate test and it works | 22:13 |
ianw | i should probably qualify that with "for now" :) | 22:13 |
mnaser | ianw, clarkb: i've not had to do any cgroup stuff either and once i mounted /var/lib/containers -- i got a clean build that boots | 22:13 |
ianw | fungi/clarkb: not sure if you saw https://review.opendev.org/c/opendev/system-config/+/795213 but i went ahead and made a statusbot container and deploy it with that. it's really just a mechanical patch now deploying the config file | 22:15 |
clarkb | ianw: I haven't yet | 22:16 |
ianw | if you don't have any objections, i can maybe try moving meetbot to limnoria again this afternoon when it's quiet | 22:17 |
clarkb | no objections from me. fungi ^ anything to consider when doing that? maybe double checking the meeting schedule to ensure it doesn't overlap with a meeting? | 22:18 |
ianw | beforehand i'll sort out syncs of the logs | 22:18 |
ianw | from e.openstack.org -> e.opendev.org | 22:18 |
fungi | ianw: go for it. sorry i'm not much help, internet outage here, i'm limping along on a wireless modem for the moment | 22:19 |
ianw | no worries :) i'm surprised i still have power/internet, it's crazy here | 22:21 |
ianw | https://www.abc.net.au/news/2021-06-10/wild-weather-batters-victoria/100203532 | 22:22 |
fungi | yikes | 22:22 |
fungi | stay safe! | 22:22 |
ianw | settling down now but last night you could have thought you were on the Pequod at times :) | 22:22 |
opendevreview | Ian Wienand proposed opendev/system-config master: nodepool-builder: add volume for /var/lib/containers https://review.opendev.org/c/opendev/system-config/+/795707 | 22:41 |
ianw | mnaser / infra-root: ^ that mirrors what we do in nodepool gate for production | 22:41 |
*** tosky has quit IRC | 22:47 | |
ianw | -09 22:42:25.839331 | LOOP [build-docker-image : Copy sibling source directories] | 22:50 |
ianw | 2021-06-09 22:42:26.548311 | ubuntu-focal | cp: cannot stat 'opendev.org/opendev/meetbot': No such file or directory | 22:50 |
ianw | i wonder why the ircbot build would fail in the gate like that :/ | 22:50 |
ianw | https://zuul.opendev.org/t/openstack/build/1529ed5e1b0242e39eabc2bc3c86b79a/log/job-output.txt#279 is the prepare-workspace from the check job | 22:55 |
ianw | the gate job only cloned system-config | 22:55 |
clarkb | missing required project? | 22:55 |
ianw | maybe but why would it work in check? | 22:56 |
ianw | https://1aa2de7fd6bc4a6a901d-a3c1233a0305e644b60ccc0279f1954b.ssl.cf1.rackcdn.com/793704/24/gate/system-config-upload-image-ircbot/ec9c181/job-output.txt is the failed job | 22:56 |
ianw | ohhh, i guess it's the "upload-image" job ... not "build-image" | 22:57 |
ianw | they should be sharing required jobs via yaml tags, but something must have gone wrong | 22:57 |
ianw | required projects i mean | 22:57 |
*** whoami-rajat has joined #opendev | 22:58 | |
opendevreview | Ian Wienand proposed opendev/system-config master: Create ircbot container https://review.opendev.org/c/opendev/system-config/+/793704 | 23:04 |
opendevreview | Ian Wienand proposed opendev/system-config master: limnoria/meetbot setup on eavesdrop01.opendev.org https://review.opendev.org/c/opendev/system-config/+/793719 | 23:04 |
opendevreview | Ian Wienand proposed opendev/system-config master: Move meetbot config to eavesdrop01.opendev.org https://review.opendev.org/c/opendev/system-config/+/795007 | 23:04 |
opendevreview | Ian Wienand proposed opendev/system-config master: Cleanup eavesdrop puppet references https://review.opendev.org/c/opendev/system-config/+/795014 | 23:04 |
opendevreview | Ian Wienand proposed opendev/system-config master: Run statusbot from eavesdrop01.opendev.org https://review.opendev.org/c/opendev/system-config/+/795213 | 23:04 |
ianw | https://zuul.opendev.org/t/openstack/build/6804e9a6b18d44cd947f03280b9921be failed with POST_FAILURE -- WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! | 23:30 |
clarkb | we've seen that if there is an arp fight for an IP | 23:31 |
clarkb | usually it will get bad for a day or two and then go away as the cloud notices and cleans those up ? | 23:32 |
clarkb | er :/ | 23:32 |
ianw | rax-iad in that one | 23:32 |