ianw | thanks, yeah will need to debug | 00:00 |
---|---|---|
*** ysandeep|out is now known as ysandeep | 04:30 | |
*** ykarel|away is now known as ykarel | 05:02 | |
*** iurygregory_ is now known as iurygregory | 06:42 | |
*** rpittau|afk is now known as rpittau | 07:22 | |
*** jpena|off is now known as jpena | 07:34 | |
*** mgoddard- is now known as mgoddard | 08:20 | |
*** ysandeep is now known as ysandeep|lunch | 08:26 | |
opendevreview | chzhang8 proposed openstack/project-config master: bring tricircle under x namespaces https://review.opendev.org/c/openstack/project-config/+/804969 | 09:00 |
opendevreview | chzhang8 proposed openstack/project-config master: bring tricircle under x namespaces https://review.opendev.org/c/openstack/project-config/+/804970 | 09:04 |
*** ykarel is now known as ykarel|lunch | 09:04 | |
opendevreview | chzhang8 proposed openstack/project-config master: bring tricircle under x namespaces https://review.opendev.org/c/openstack/project-config/+/804972 | 09:20 |
noonedeadpunk | folks, I started seeing weird issues when trying to pull changes from gerrit https://paste.opendev.org/show/808165/ | 09:40 |
noonedeadpunk | really no idea what's wrong here... | 09:40 |
jssfr | noonedeadpunk, it works for me | 09:44 |
jssfr | the error sounds as if your filesystem may be corrupt | 09:44 |
noonedeadpunk | hm... might be... | 09:45 |
jssfr | did git or your machine crash recently? | 09:45 |
noonedeadpunk | well, X crashed several days ago, but dunno, worth running fsck indeed. | 09:45 |
noonedeadpunk | thanks anyway for checking that | 09:45 |
jssfr | (fwiw, I ran `git init foobar && cd foobar && git fetch "https://review.opendev.org/openstack/openstack-ansible" refs/changes/68/804868/1` and that passed without error) | 09:47 |
*** ysandeep|lunch is now known as ysandeep | 09:47 | |
opendevreview | chzhang8 proposed openstack/project-config master: bring tricircle under x namespaces https://review.opendev.org/c/openstack/project-config/+/804977 | 09:51 |
*** odyssey4me is now known as Guest4722 | 10:37 | |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 10:38 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 10:52 |
*** ykarel|lunch is now known as ykarel | 10:58 | |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 11:00 |
*** dviroel|ruck|out is now known as dviroel|ruck | 11:12 | |
*** jpena is now known as jpena|lunch | 11:34 | |
*** sshnaidm|pto is now known as sshnaidm | 12:25 | |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 12:29 |
*** jpena|lunch is now known as jpena | 12:32 | |
*** ykarel is now known as ykarel|away | 14:42 | |
clarkb | noonedeadpunk: jssfr I too can clone the repo and fetch that ref. | 15:15 |
clarkb | noonedeadpunk: I would run fsck on the repo and on the filesystem. But also you may want to check SMART for your disks and memtest your memory | 15:15 |
fungi | corvus: on the stale cache for the moved config, it looks like zuul did actually record processing the change-merged event: | 15:19 |
fungi | 2021-07-21 10:38:28,954 DEBUG zuul.Scheduler: [e: 9f2d4a1e191b4ebd86e908bb8c30cbe1] Processing trigger event <GerritTriggerEvent change-merged opendev.org/x/devstack-plugin-tobiko master 801436,4> | 15:19 |
fungi | that's currently in /var/log/zuul/debug.log.28.gz on the scheduler | 15:19 |
fungi | so not the same situation as what we found yesterday | 15:19 |
clarkb | Today was the listed day for the gitea release (if their milestones are accurate in github), but no release yet | 15:30 |
clarkb | fungi: might be worth making a copy of that file as it will rotate out in a couple of days | 15:35 |
clarkb | just in case debugging this takes longer | 15:35 |
fungi | yep | 15:35 |
*** ysandeep is now known as ysandeep|away | 15:38 | |
*** jpena is now known as jpena|off | 15:42 | |
*** rpittau is now known as rpittau|afk | 16:06 | |
clarkb | we restarted to actually use the cache after that point in time | 16:11 |
clarkb | Is it possible there was some half implemented cache state we were running which would've populated the cache but not cleared it when the merge event occurred? | 16:11 |
clarkb | generally removals of content seem to be working since the removal for the cloud launcher job in our hourly deploy pipeline has been reflected | 16:12 |
clarkb | though that wasn't the removal of a file | 16:13 |
clarkb | fungi: maybe we should try and reproduce with the sandbox repo? | 16:13 |
clarkb | create a new job in a new file. Merge that with pipeline use. Then remove it and see if zuul complains would be a pretty minimal reproducer if this is a consistent issue | 16:14 |
clarkb | fungi: do you think you can review the stack at https://review.opendev.org/c/opendev/system-config/+/804925 to further improve the hourly run time? also I've just approved https://review.opendev.org/c/opendev/system-config/+/758594 to ensure we don't forget that for the next renames | 16:19 |
clarkb | iurygregory: out of curiosity what is the location for the ironic midcycle? Will you be using meetpad? If so would be good to hear if you find any oddness between jitsi meet and etherpad since we upgraded etherpad last week (we did some testing and it seems fine) | 16:23 |
iurygregory | clarkb, hey! I think we will use the meetpad - https://meetpad.opendev.org/ironic | 16:24 |
iurygregory | I haven't checked with Julia since she is on PTO | 16:24 |
clarkb | iurygregory: cool let us know if you see any weirdness but like I said we expect it will be fine based on tested we did | 16:25 |
clarkb | s/tested/testing/ | 16:25 |
iurygregory | clarkb, sure! | 16:26 |
fungi | if anything, it seems to be working better recently than it did around the time of the last ptg | 16:28 |
fungi | as far as handling of the "shared document" etherpad embedding | 16:29 |
clarkb | fungi: also let me know if you want me to look at logs or zk db for that cache thing. I'll be bootstrapping a lot of that from scratch as I didn't review that stack but happy to look if the extra eyes will be helpful | 16:32 |
clarkb | fungi: I half expect that we may need to look at the zk db to see what we have cached then figure out why it didn't go away | 16:32 |
fungi | clarkb: yeah, i liked your idea of testing a file move in the sandbox repo. i'm working my way back around to this problem and will give that a shot | 16:34 |
fungi | also good theory on the "maybe we populated the cache via wip cache management which wasn't entirely clean, but were not actually reading from it until the restart yesterday" | 16:35 |
corvus | clarkb, fungi: i'm around now | 16:41 |
clarkb | corvus: sounds like fungi may try to reproduce with the sandbox repo and I threw out a theory that maybe we had a half complete cache implementation that wouldn't properly purge things back on July 21 but does now (at least in simpler deletions of content that seems to work today) | 16:44 |
corvus | kk. i'll inspect the zk contents | 16:45 |
fungi | i also didn't follow the cache implementation closely. did it start out by writing a cache but not actually reading from it at restart? and then the restart yesterday was the first one where it read its config state in from the cache? | 16:45 |
corvus | fungi: it's... complicated. but we started fully relying on it last week. | 16:47 |
fungi | okay, so doesn't necessarily explain why the tobiko changes would have just started breaking on stale configuration after the most recent restart | 16:47 |
corvus | if it becomes important, i can go narrow down when each thing happened and in what order -- not trying to be evasive, just don't have that info handy right now | 16:48 |
corvus | fungi: i agree, that's the bit of data that doesn't make sense to me. there should be no difference between today's performance and 2 days ago. | 16:48 |
clarkb | fungi: did the deletion happen before the most recent restart? | 16:49 |
clarkb | oh ya on the 21st duh | 16:49 |
corvus | let's start an etherpad for notes: https://etherpad.opendev.org/p/34eHDRUw0OH3IXn3grT4 | 16:49 |
fungi | thanks | 16:52 |
fungi | i'll copy some examples in there | 16:52 |
clarkb | do we know if tobiko was active between july 21 and ~now? | 16:52 |
clarkb | that could explain not noticing the issue if it was present all along | 16:53 |
fungi | yes, there's a change brought to our attention in #openstack-infra where it was working yesterday | 16:53 |
fungi | and then a minor edit to the change this morning couldn't be tested | 16:53 |
fungi | i'll get that in there | 16:54 |
clarkb | ok we can rule that out then | 16:54 |
corvus | we've had a lot of restarts since july 21; even if we assume that zuul worked correctly when it got the event, but is now using old cached data, it seems surprising that it worked yesterday. | 16:55 |
clarkb | in the new startup process is there detection of stale cache and if so does merging happen again? Is it possible that something caused zuul to think an older repo state was current which would invalidate the cache and then cause it to update with the old version in the cache? | 17:00 |
clarkb | like maybe a merger failed to get the HEAD of the remote repo so it treated its local HEAD as current? | 17:00 |
corvus | clarkb: there is no detection of a stale cache on startup; we're operating under the assumption for now that merge events don't happen when zuul isn't watching | 17:03 |
corvus | (if they do, press the reset button) | 17:04 |
fungi | i've added the details from the two observed symptoms to the pad | 17:04 |
fungi | for the first symptom, we have a rough ~16 hour time window where it seems to have started, and the latest scheduler restart falls in that window | 17:05 |
fungi | for the second symptom, i don't think we have a history for config-errors so hard to know if it was complaining before the restart | 17:05 |
fungi | with some additional research in open changes for tobiko and/or zuul logs we could probably narrow the window | 17:06 |
opendevreview | Merged opendev/system-config master: Add additional post project rename reindexing https://review.opendev.org/c/opendev/system-config/+/758594 | 17:20 |
*** dviroel|ruck is now known as dviroel|ruck|out | 19:07 | |
Clark[m] | I'm starting to page in some Gerrit 3.3 upgrade stuff. Does anyone understand what is meant by step 2 of the downgrade process at https://www.gerritcodereview.com/3.3.html#downgrade | 19:14 |
Clark[m] | Do we need to hash an object whose content is 183 and then update the ref to that value? I guess they don't use a proper dag on that ref allowing us to revert a commit? | 19:14 |
Clark[m] | Other than some confusion over that process I think this upgrade is straightforward. From a user noticeability standpoint the only comments toggle seems to go away so we need to see what the new behavior is from that. We also need to decide if we want to enable attention sets | 19:16 |
fungi | yes, looks like it's saying to update the refs/meta/version to point at a hash of 183 | 19:16 |
fungi | but again, that's a manual step when downgrading | 19:16 |
fungi | i guess that's their equivalent of a schema serial | 19:17 |
fungi | Clark[m]: questions on two of the topic:hourly-run-optimizations changes | 19:17 |
Clark[m] | corvus: ^ totally not urgent but I think if we do enable attention sets that gertty may want to set the state on comment responses more intentionally so that the attention is toggled properly | 19:17 |
Clark[m] | fungi: thanks will take a look | 19:17 |
corvus | Clark: ack thx | 19:18 |
fungi | Clark[m]: have a (link to a) summary of "attention sets?" | 19:19 |
clarkb | fungi: http://gerrit-documentation.storage.googleapis.com/Documentation/3.3.5/user-attention-set.html | 19:19 |
clarkb | I'm thinking I may try to put together a change that tests a 3.2 to 3.3 upgrade on our test jobs. Then hold that and test the revert process on that test setup | 19:20 |
clarkb | If that isn't a terrible process we might consider doing this upgrade during a period of time we'd otherwise avoid since there is a revert | 19:21 |
clarkb | but I'm still looking at the change list. I note that hashar upgraded wikimedia to 3.3 recently and they updated their jgit.conf to include a setting we already set. | 19:21 |
clarkb | hashar: ^ you may have other input on upgrading to 3.3? | 19:21 |
fungi | at first i thought maybe attention sets could alleviate the need some folks saw in adding the reviewers plugin, but on reading that it looks like it might increase interest in adding it | 19:21 |
clarkb | fungi: responded to your review comments | 19:24 |
fungi | yep, saw the notifications, thanks | 19:24 |
clarkb | fungi: ya I think the reviewers plugin may be complementary to attention sets. I suspect that some people may find attention sets to be annoying but I personally think they may be worth trying after interacting with them with my upstream gerrit changes | 19:25 |
fungi | right, i foresee some considering the reviewers plugin as a way to make attention sets less annoying | 19:25 |
fungi | or at least more useful | 19:26 |
opendevreview | Merged opendev/system-config master: Run infra-prod-service-zuul-preview daily instaed of hourly https://review.opendev.org/c/opendev/system-config/+/804925 | 19:33 |
opendevreview | Merged opendev/system-config master: Run remote-puppet-else daily instead of hourly https://review.opendev.org/c/opendev/system-config/+/804926 | 19:37 |
opendevreview | Merged opendev/system-config master: Stop requiring puppet things for afs, eavesdrop, and nodepool https://review.opendev.org/c/opendev/system-config/+/804927 | 19:37 |
clarkb | I think we also need to put all of the bot accounts in Non-Interactive Users (which becomes Service Users in 3.3) to prevent them from confusing attention set | 19:50 |
fungi | which also means making sure we don't delegate any special privileges to that group | 19:54 |
clarkb | yup | 19:55 |
clarkb | https://etherpad.opendev.org/p/gerrit-3.3-upgrade-prep I'm going to start putting notes in there | 19:55 |
mordred | in theory the idea of attention sets makes me happy. like - Important Changes was an attempt to answer "what should I be looking at" | 20:28 |
mordred | so I'll be interested to see how it is | 20:28 |
*** timburke_ is now known as timburke | 21:00 | |
clarkb | fungi there is a netflix documentary called Fantastic Fungi | 21:03 |
clarkb | it comes with a health disclaimer. | 21:03 |
corvus | if you've ever been drinking with fungi, you'll know he should come with a health disclaimer too | 21:28 |
Unit193 | fungi: I was able to get in contact with the pastebinit dev, he's at least read what I said and ACK'd it. Hopefully we'll see changes soon™. Last time I backported pastebinit to buster, if it's fixed I'll likely do it again for bullseye. | 21:45 |
corvus | i'd like to restart zuul to pick up a bugfix | 22:33 |
clarkb | the release queues look empty | 22:33 |
corvus | i'm assuming that tobiko will be stable after the full-reconfig, since they merged a valid config change | 22:35 |
corvus | i'm running the docker pull now; it's doing work, so will be a minute | 22:36 |
clarkb | corvus: is it still going? | 22:51 |
corvus | done; sorry task switched momentarily | 22:51 |
corvus | restarting now | 22:52 |
clarkb | no problem, wanted to make sure there wasn't an issue with the images | 22:52 |
corvus | the light's a little orange here today | 22:54 |
clarkb | does the air remind you of a campfire? | 22:55 |
corvus | not yet -- so far i think it's all upper atmosphere | 22:55 |
corvus | apparently that may start to change soon | 22:56 |
corvus | it's sort of weird seeing the tenants come on-line one-by-one now | 22:57 |
corvus | it's up; i'm going to run full-reconfigure now | 23:02 |
corvus | cat jobs are being dispatched | 23:03 |
corvus | i think i have spotted a case where the fix isn't completely thorough -- i don't think it's wrong, but it may be only 99% complete -- i'm going to look into that real quick | 23:13 |
clarkb | corvus: we reenqueue after the full reconfigure? | 23:13 |
corvus | clarkb: that's my plan; i felt that would produce fewer errors | 23:14 |
clarkb | makes sense | 23:14 |
fungi | clarkb: yep, i've witnessed that documentary, however i did neither participate in its creation nor suggest its fantastic name | 23:19 |
fungi | Unit193: thanks! ianw and i are debating in https://review.opendev.org/804539 over the most reasonable compromise to still have a redirect but not break pastebinit users | 23:20 |
fungi | corvus: thanks for the fix and restart. in theory the x/devstack-plugin-tobiko entries in openstack tenant config-errors should disappear | 23:21 |
corvus | okay, good news and bad news! | 23:23 |
corvus | good news 1: the files we wanted to be deleted from the cache have been! | 23:23 |
corvus | bad news 1: the extra debug line i added has indicated that we have a latent bug in that code, which could have caused a cache corruption problem once the cache has more than one user: | 23:24 |
corvus | 2021-08-18 23:20:05,346 DEBUG zuul.TenantParser: Removing file from cache for project gerrit.googlesource.com/zuul/jobs @master: zuul.d/devstack-tobiko.yaml | 23:24 |
corvus | the project name is wrong there -- it's not used in the cache cleanup, so i'm confident that part is correct. but it is used to lock the cache, so the wrong part of the cache is being locked | 23:25 |
clarkb | that is an interesting mixup | 23:26 |
corvus | bad news 2: due to the way the merger returns files, the fix is incomplete and doesn't cover the case where someone deletes a zuul.yaml file (ie, zuul.yaml -> .zuul.d/foo.yaml). | 23:26 |
corvus | essentially, the merger always returns specifically requested file paths, whether they exist or not | 23:26 |
corvus | i'm hoping the value is None or similar; will check on that in a bit | 23:27 |
corvus | we're still waiting on cat jobs | 23:27 |
Unit193 | fungi: FWIW, I passed him https://paste.opendev.org/show/bFVbXF44VrHbyS1Fxpd0 | 23:27 |
clarkb | corvus: hrm seems like cat jobs took about 12 minutes to complete before we switched to the cache? | 23:28 |
clarkb | maybe that was different than a full reconfigure though | 23:28 |
fungi | Unit193: thanks! that will eventually help as the old versions age out, but we also want to make sure we accommodate users of old versions in various distros for however many years that takes to happen | 23:29 |
Unit193 | Indeed. | 23:29 |
ianw | sorry yes i will get back to that in a minute, just got my head deep in a treeview | 23:32 |
clarkb | I see jobs have started and queue lengths went to zero | 23:33 |
clarkb | I think that means the full reconfigure is done? | 23:33 |
corvus | yes, re-enqueueing | 23:33 |
corvus | it looks like there were tobiko config errors but they are gone now | 23:34 |
fungi | that's a good sign at least | 23:34 |
corvus | so i think the fix worked (modulo the above caveat) | 23:34 |
corvus | #status log restarted all of zuul on 598db8a78ba8fef9a29c35b9f86c9a62cf144f0c to correct tobiko config error | 23:43 |
opendevstatus | corvus: finished logging | 23:43 |
clarkb | 23:54 | |
clarkb | oops | 23:55 |
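
The cache discussion above (stale entries for a deleted file, and a merger that returns every requested path whether or not it exists) can be illustrated with a small sketch. This is not Zuul's code: the function names and the plain-dict cache are hypothetical, and it assumes that a path missing from the repo comes back with a value of `None`, which corvus said still needed checking.

```python
from typing import Dict, Optional, Set


def stale_paths(cached: Set[str], merger_result: Dict[str, Optional[str]]) -> Set[str]:
    """Return cached paths the merger no longer reports as existing.

    Assumption: a requested path that does not exist in the repo is
    returned with the value None rather than being omitted.
    """
    present = {path for path, content in merger_result.items() if content is not None}
    return set(cached) - present


def refresh_cache(cache: Dict[str, str], merger_result: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Drop stale entries, then store the content of every file that exists."""
    for path in stale_paths(set(cache), merger_result):
        del cache[path]
    for path, content in merger_result.items():
        if content is not None:
            cache[path] = content
    return cache


# Hypothetical example: zuul.yaml was deleted and replaced by .zuul.d/foo.yaml.
cache = {"zuul.yaml": "old config", "zuul.d/devstack-tobiko.yaml": "old job"}
result = {"zuul.yaml": None, ".zuul.d/foo.yaml": "new config"}
refresh_cache(cache, result)
```

Treating `None` as "absent" is what lets the sketch cover the `zuul.yaml -> .zuul.d/foo.yaml` case; a check that only looked at whether the key was present would keep the deleted file cached.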
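
On the Gerrit downgrade question (pointing `refs/meta/version` at "a hash of 183"): git names a blob by the SHA-1 of a short header followed by the content, which is what `git hash-object` computes. The sketch below reproduces that calculation. Whether Gerrit expects the content to be exactly `183`, with or without a trailing newline, is an assumption here and should be checked against the downgrade notes before use.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the object ID git assigns to a blob with this content.

    Git hashes the header "blob <size>\\0" followed by the raw bytes.
    """
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


# Assumed content for the version blob; confirm the exact bytes first.
version_blob_id = git_blob_id(b"183")
```

The ID alone is not enough for the ref update: the blob also has to exist in the repository's object store, which `git hash-object -w` takes care of on a real checkout.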