opendevreview | Ian Wienand proposed opendev/lodgeit master: Add mariadb connector to container https://review.opendev.org/c/opendev/lodgeit/+/798411 | 00:33 |
*** odyssey4me is now known as Guest1231 | 01:12 | |
opendevreview | Ian Wienand proposed opendev/lodgeit master: Add mariadb connector to container https://review.opendev.org/c/opendev/lodgeit/+/798411 | 01:16 |
*** ysandeep|out is now known as ysandeep | 01:48 | |
*** ysandeep is now known as ysandeep|afk | 02:11 | |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [wip] test centos8-stream with ro /sys https://review.opendev.org/c/openstack/diskimage-builder/+/799126 | 03:22 |
*** ysandeep|afk is now known as ysandeep | 03:59 | |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [wip] test centos8-stream with ro /sys https://review.opendev.org/c/openstack/diskimage-builder/+/799126 | 04:19 |
*** ykarel|away is now known as ykarel | 05:34 | |
kopecmartin | ianw: oh, I thought you'd already done it, because I noticed yesterday that the server stopped downloading guidelines and was throwing errors so I merged the interop change - https://review.opendev.org/c/osf/interop/+/796413 | 06:22 |
kopecmartin | everything is working now which is very weird if the container hasn't been pulled yet | 06:23 |
*** gthiemon1e is now known as gthiemonge | 06:32 | |
*** jpena|off is now known as jpena | 06:52 | |
*** amoralej|off is now known as amoralej | 06:56 | |
*** ysandeep is now known as ysandeep|lunch | 08:30 | |
*** ykarel is now known as ykarel|lunch | 08:31 | |
*** ysandeep|lunch is now known as ysandeep | 09:34 | |
*** ykarel|lunch is now known as ykarel | 09:52 | |
ricolin | ianw, fungi clarkb found this error in https://nb03.opendev.org/debian-bullseye-arm64-0000029245.log | 10:11 |
ricolin | Exit code: 1 | 10:11 |
ricolin | "/usr/local/lib/python3.7/site-packages/diskimage_builder/lib/disk-image-create: line 145: cannot create temp file for here-document: No space left on device" | 10:13 |
ricolin | currently all debian-bullseye-arm64 jobs are queued for days | 10:14 |
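The "No space left on device" failure ricolin pasted comes from diskimage-builder running out of room in its temp area. A minimal stdlib-only preflight check like the following sketch (not part of diskimage-builder; the path and the 20 GiB threshold are assumptions) would surface the condition before a build even starts:

```python
import shutil

def enough_space(path="/opt/dib_tmp", need_bytes=20 * 1024**3):
    """Return True if `path` has at least `need_bytes` free (assumed threshold)."""
    return shutil.disk_usage(path).free >= need_bytes
```

A builder wrapper could refuse to start a build when this returns False, failing fast instead of dying mid-build on a here-document write.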
*** frenzy_friday is now known as frenzyfriday|afk | 11:00 | |
*** ysandeep is now known as ysandeep|afk | 11:01 | |
*** dviroel|out is now known as dviroel | 11:34 | |
*** jpena is now known as jpena|lunch | 11:36 | |
*** bhagyashris_ is now known as bhagyashris|ruck | 12:15 | |
*** ysandeep|afk is now known as ysandeep | 12:26 | |
*** ysandeep is now known as ysandeep|mtg | 12:29 | |
*** jpena|lunch is now known as jpena | 12:36 | |
*** ysandeep|mtg is now known as ysandeep | 12:38 | |
*** amoralej is now known as amoralej|lunch | 12:45 | |
*** ysandeep is now known as ysandeep|mtg | 13:00 | |
*** amoralej|lunch is now known as amoralej | 13:41 | |
fungi | ricolin: thanks for the heads up, i wonder if we're having growroot issues on those specific images | 13:46 |
fungi | ricolin: oh! that's in the build log, so we've likely filled up the disk on that builder, i'll check it | 13:47 |
fungi | we may need to shut down the builder container on it and clean up the disk | 13:47 |
fungi | /dev/mapper/main-main 787G 787G 0 100% /opt | 13:47 |
fungi | bingo | 13:47 |
fungi | we basically haven't been building any new arm64 images | 13:48 |
fungi | ricolin: the backlog may be unrelated, i know we were also waiting on the linaro-us cloud to fix an expired ssl cert, i need to see if it's been replaced yet | 13:49 |
fungi | the ssl cert for the api endpoint expired some days ago | 13:49 |
fungi | the full disk on nb03 might actually be related to that if it's been struggling and failing to upload new images there | 13:50 |
fungi | i've downed the nodepool-builder container on nb03.opendev.org now | 13:50 |
corvus | i'd like to restart zuul to see how the zk executor api changes perform | 14:04 |
fungi | corvus: seems like a good day for it. also we'll get the zuul vars back in the inventory.yaml file after that | 14:15 |
*** ysandeep|mtg is now known as ysandeep | 14:17 | |
corvus | ya | 14:17 |
corvus | restarting now | 14:21 |
corvus | #status log restarted all of zuul on commit cc3ab7ee3512421d7b2a6c78745ca618aa79fc52 (includes zk executor api and zuul vars changes) | 14:22 |
opendevstatus | corvus: finished logging | 14:22 |
fungi | i let the openstack release team know, they were about to start approving some patches in their meeting | 14:28 |
corvus | oh sorry, i thought they were typically idle on friday; i will re-evaluate my assumptions | 14:29 |
corvus | it's up again, and jobs are running | 14:29 |
fungi | no worries, i told them i would give them a heads up when we were starting, but no harm done | 14:29 |
corvus | re-enqueue in progress | 14:29 |
fungi | thanks! | 14:30 |
corvus | jobs seem to be running, so that's a good sign | 14:30 |
corvus | there are significantly more ephemeral nodes in zk | 14:32 |
corvus | also significantly less data (probably compression) | 14:33 |
corvus | we've added about 2k nodes (for a total of 39k) but dropped from 21.5mb to 14.9mb | 14:34 |
corvus | oh, interesting, the data size has gone back up; i guess that metric lagged a bit? | 14:35 |
tobiash[m] | has the scheduler startup time been impacted (due to mergers via zk)? | 14:36 |
corvus | tobiash: it didn't seem significant; let me see if i can get a number | 14:36 |
corvus | tobiash: almost exactly average. our mean of 4 reconfigurations in the last month was 378 seconds (range from 357-403), today's was 375 | 14:40 |
tobiash[m] | great | 14:41 |
fungi | that's great news | 14:42 |
corvus | the executors seem to have reached their nominal capacity for builds fairly quickly | 14:43 |
corvus | i wonder if we need a stats adjustment for the executors and executor queue though; those graphs appear to have flatlined | 14:43 |
fungi | okay, release team meeting has wrapped up and i'm back to looking at nb03 to see what we need to clean up | 14:45 |
tobiash[m] | the queued jobs stat still counts the gearman queue | 14:46 |
fungi | i expect the contents of /opt/dib_tmp are all leaked trash at this point | 14:46 |
tobiash[m] | as it looks like | 14:46 |
*** ysandeep is now known as ysandeep|dinner | 14:46 | |
corvus | i think for the executors graph we need to add "unzoned" | 14:46 |
fungi | ooh, yep, none of our executors are zoned | 14:47 |
fungi | so if it's treed by zone now that would make sense we'd have to adjust the stat we're polling | 14:47 |
tobiash[m] | corvus: I wonder why the running jobs graph still works given that the stat seems to still count the gearman queue: https://opendev.org/zuul/zuul/src/branch/master/zuul/scheduler.py#L363 | 14:49 |
corvus | i'm confused, i think the plain zuul.executors.accepting stat should work; we shouldn't need to switch to unzoned yet | 14:50 |
tobiash[m] | the accepting should work | 14:51 |
*** dviroel is now known as dviroel|lunch | 14:51 | |
corvus | yet it doesn't; and the running should not work, yet it does | 14:52 |
tobiash[m] | that's weird | 14:52 |
tobiash[m] | the running might be taken from the per executor metric | 14:54 |
tobiash[m] | which should not have changed | 14:54 |
tobiash[m] | ah I think I got it, the "Executor Queue" graph is taken from the queue metrics from the scheduler, which are broken now and flatlined | 14:56 |
tobiash[m] | the "Running Builds" graph uses the executor stats and works | 14:56 |
corvus | ah yep, that's it | 14:57 |
tobiash[m] | which leaves the "Executors" graph to be checked | 14:57 |
tobiash[m] | which I think should continue to work | 14:57 |
corvus | though, we still have the mystery of why zuul.executors.accepting isn't working but zuul.executors.unzoned.accepting is | 14:59 |
fungi | apparently nl02 got caught up in a hypervisor host problem earlier in the week and was rebooted, per a ticket from rackspace | 15:01 |
fungi | but looks like it's running okay currently | 15:01 |
tobiash[m] | corvus: there is a bug: https://opendev.org/zuul/zuul/src/branch/master/zuul/scheduler.py#L329 | 15:04 |
tobiash[m] | that's not taking the accepting into account | 15:04 |
corvus | aha | 15:05 |
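The bug tobiash linked is that the scheduler's gauge computation doesn't take the accepting flag into account when emitting the top-level executor stats. As a hedged illustration only (the `Executor` record, helper, and aggregation here are a sketch, not Zuul's actual code), the top-level `zuul.executors.*` gauges should aggregate over all executors while the `unzoned.*` variants filter to executors without a zone:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Executor:
    zone: Optional[str]  # None means the executor is unzoned
    accepting: bool      # whether it is currently accepting builds

def executor_gauges(executors):
    """Compute the gauge values; names mirror the stats discussed above."""
    unzoned = [e for e in executors if e.zone is None]
    return {
        "zuul.executors.online": len(executors),
        "zuul.executors.accepting": sum(e.accepting for e in executors),
        "zuul.executors.unzoned.online": len(unzoned),
        "zuul.executors.unzoned.accepting": sum(e.accepting for e in unzoned),
    }
```

With this shape, a deployment whose executors are all unzoned (like the one discussed here) gets identical values for the zoned and unzoned gauges, so the top-level graph keeps working without switching the dashboard to `unzoned`.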
*** ysandeep|dinner is now known as ysandeep | 15:10 | |
*** ysandeep is now known as ysandeep|away | 15:24 | |
*** ykarel is now known as ykarel|away | 15:26 | |
* clarkb is catching up | 15:35 | |
clarkb | Sounds like things are working other than stats reporting? not bad considering | 15:37 |
clarkb | fungi: re nb03 the builders all do that. I suspect it is partially related to us updating the docker container images forcefully. But ianw thought that the issue on the dib side that let that happen had been addressed | 15:38 |
clarkb | fungi: one thought I had after the last cleanup was that we could run a simple find in cron to clean those up based on what the current build is (basically find a way to ignore the current build) | 15:38 |
fungi | i suppose we could hold a lock on a file in the tempdir and then check that known filename for open handles before removing the containing directory? | 15:46 |
clarkb | fungi: ya that should probably do it. I think you can also find the random string in the current build log or in the process tree (I suppose your idea is to look it up from the process tree) | 15:47 |
fungi | nah, i mean actually stat the known filename inside each tempdir and then if it has no open file handles we know it's been leaked... but that assumes the process grows or is wrapped in a script with the feature to hold that lock until the process terminates | 15:50 |
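fungi's lock-file idea can be sketched with flock: the build wrapper holds an exclusive lock on a sentinel file inside its tempdir for the life of the build, and a cleanup cron job treats any tempdir whose sentinel lock it can grab as leaked. This is a hypothetical sketch under the assumption fungi states (the builder grows or is wrapped to hold the lock until it exits); the sentinel filename and directory layout are invented:

```python
import fcntl
import os
import shutil

SENTINEL = ".in-use.lock"  # assumed sentinel filename

def hold_lock(tempdir):
    """Called by the build wrapper; keep the returned fd open until exit."""
    fd = os.open(os.path.join(tempdir, SENTINEL), os.O_CREAT | os.O_RDWR)
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd

def is_leaked(tempdir):
    """True only if a sentinel exists and no live process holds its lock."""
    path = os.path.join(tempdir, SENTINEL)
    if not os.path.exists(path):
        return False  # no sentinel: can't tell, leave the dir alone
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        return False  # still locked by a running build
    finally:
        os.close(fd)
    return True

def clean_leaked(root="/opt/dib_tmp"):
    """Cron entry point: remove tempdirs whose builds are gone."""
    for entry in os.scandir(root):
        if entry.is_dir() and is_leaked(entry.path):
            shutil.rmtree(entry.path, ignore_errors=True)
```

Because flock locks are released automatically when the holding process dies, a crashed build's tempdir becomes reclaimable without any bookkeeping, while a running build's tempdir is never touched.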
clarkb | fungi: also I started thinking about the gerrit account cleanup and realized that the last set of data was generated long enough ago that if I disabled accounts today that suddenly started being active again in the last 2 months that would be sadness. I don't expect a large delta but I think I should regenerate all the outputs of our scripts around this (redo the config check in | 15:52 |
clarkb | gerrit, feed that into the audit, compare the audit from nowish to a couple of months ago) before retiring accounts | 15:52 |
clarkb | I suspect we'll get zero delta and we can proceed without much extra checking beyond that, but if there is a delta it should be small and we can accommodate it | 15:53 |
*** jpena is now known as jpena|off | 15:53 | |
*** amoralej is now known as amoralej|off | 15:55 | |
fungi | yeah, that's a great point | 15:58 |
fungi | trying to run du over /opt/dib_tmp on nb03 is taking a very long time to return | 16:00 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update gerrit image to v3.2.11 https://review.opendev.org/c/opendev/system-config/+/799225 | 16:01 |
clarkb | melwitt: fungi: ^ re gerrit update | 16:01 |
*** dviroel|lunch is now known as dviroel | 16:16 | |
fungi | clarkb: i'm beginning to think du is never going to finish counting the contents of /opt/dib_tmp, is it safe just to empty that while the builder is stopped? | 16:25 |
clarkb | fungi: yes all of the data there is temporary. One suggestion though is that you down the builder container, then reboot to clear out any stale mounts that may exist for those entries (hopefully would only be for the running build that dies due to the stop), then cleanup and start the process again | 16:26 |
fungi | clarkb: there is nothing mounted currently anyway, at least not according to df/mount commands | 16:27 |
fungi | just the normal system mounts and a /run/user mount for my session | 16:28 |
clarkb | in that case should be totally fine without a reboot | 16:28 |
fungi | okay, wiping out everything inside /opt/dib_tmp in that case | 16:29 |
fungi | it's been an hour of deleting and freed ~120GiB so far, but i have a feeling it's still going to be deleting for a while | 17:26 |
JayF | I'm pretty reliably getting 400 errors from storyboard trying to submit a new story. error is a red box popup saying "400: POST /api/v1/stories/2009026: Invalid input for field/attribute story. Value: '2009026'. unable to convert to Story" | 17:26 |
clarkb | JayF: https://storyboard.openstack.org/#!/story/2009026 I think that is because it was already created | 17:28 |
JayF | Got a new browser window and it... oh | 17:28 |
clarkb | I suspect you had a non-fatal error on the initial creation then subsequent attempts result in that error you posted above | 17:28 |
JayF | Well, it worked in a new browser window. Now it's obvious as to why. | 17:28 |
clarkb | The timestamps for creation are from about 7 minutes ago | 17:29 |
JayF | yeah, it matches | 17:29 |
JayF | weird but glad it's all fine, I'll clean up my dupe | 17:29 |
fungi | JayF: also it can do that if you try to add two initial tasks in the story creation dialog, known bug | 17:40 |
JayF | That is /exactly/ what I did. | 17:40 |
fungi | the task creations seem to try to happen in overlapping transactions | 17:41 |
JayF | Thanks for closing the loop on it, that matches b/c I got a different error the first time (but didn't recall it) and then got this one every other step | 17:41 |
fungi | and the api call to add the second task fails on a lock | 17:41 |
fungi | 2.5 hours into cleanup we've deleted 300GiB from /opt/dib_tmp on nb03 | 18:58 |
clarkb | infra-root I'm going to run a gerrit config consistency check now to get an up to date list of conflicts that I can use to rerun an audit with. Though at this rate I probably won't get to that today as I think the zuul stuff is going to take priority | 19:16 |
clarkb | consistency check hasn't changed since we last ran it (good that is expected). Now I need to run the audit to see if user interactions have changed | 19:31 |
opendevreview | Goutham Pacha Ravi proposed openstack/project-config master: Add feature branch notifications to openstack-sdks https://review.opendev.org/c/openstack/project-config/+/799323 | 19:32 |
clarkb | I have a user audit running now | 19:45 |
fungi | /opt/dib_tmp on nb03 is finally empty. 356GiB available now. is there anything else i should clean up before starting the nodepool-builder container there again? | 19:46 |
fungi | it has 22 base images plus their variants and checksums in /opt/nodepool_dib which is probably reasonable | 19:47 |
clarkb | fungi: you can check if we have leaked images in /opt/nodepool_dib, but you can clean those up safely while the process is running | 19:47 |
clarkb | fungi: on the x86 builders you occasionally see the intermediate vhd file get lost | 19:48 |
clarkb | but we don't build vhds for arm64 so that shouldn't happen; I suspect that cleanup is fairly complete | 19:48 |
fungi | yeah, no vhd files in there | 19:48 |
fungi | starting the container again in that case | 19:48 |
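The leaked-image check clarkb described (stray intermediate files like a lost vhd in /opt/nodepool_dib) could be automated with a small helper. This is a hypothetical sketch; the extension list and directory are assumptions based on the conversation, not nodepool's actual cleanup logic:

```python
import os

# assumed extensions of intermediate artifacts that should not linger
INTERMEDIATE_EXTS = {".vhd", ".tmp"}

def leaked_images(root="/opt/nodepool_dib"):
    """Return paths of files that look like leaked intermediate images."""
    leaked = []
    for entry in os.scandir(root):
        ext = os.path.splitext(entry.name)[1]
        if entry.is_file() and ext in INTERMEDIATE_EXTS:
            leaked.append(entry.path)
    return leaked
```

On an arm64 builder like nb03, which never produces vhd output, anything this returns is almost certainly safe to delete even while the builder is running, matching clarkb's note above.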
*** dviroel is now known as dviroel|brb | 19:50 | |
clarkb | I'm glad I decided to rerun the audit. There is at least one account that had gone from inactive for the last three years to active (not sure it was on the cleanup list yet, but there was certainly enough churn to make double checking a good idea) | 20:13 |
fungi | yep | 20:14 |
clarkb | doesn't look like it was on the chopping block (good means that my methods are not completely terrible) | 20:15 |
clarkb | but now I have a pretty good indication I can put the other account related to this user on the chopping block. However I was going to save those for when we got to the ~80 I think we have remaining and reach out to people about it first | 20:16 |
*** dviroel|brb is now known as dviroel | 20:34 | |
clarkb | fungi: fwiw I'm going through the existing proposals for my own peace of mind. I'm flagging any that seem more dangerous than others and I may ask you to take a look at those and double check them. If we want we can trim them out or if they look safe we can proceed with them. | 20:36 |
clarkb | Once I've gotten through this I'll push up files like we already have on review but with newer timestamps | 20:37 |
fungi | thanks | 20:55 |
*** dviroel is now known as dviroel|out | 21:03 | |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Enable ZooKeeper 4 letter words https://review.opendev.org/c/zuul/zuul-jobs/+/799334 | 21:24 |
opendevreview | Merged zuul/zuul-jobs master: Enable ZooKeeper 4 letter words https://review.opendev.org/c/zuul/zuul-jobs/+/799334 | 21:45 |
clarkb | fungi: there are three files in review:~clarkb/gerrit_user_cleanups/notes/ audit-results.yaml.20210702 is the output of the audit, which you can refer to to see what data was used to make decisions. proposed-cleanups.20210702 is the list of accounts that we will retire, then later the email associated with the external id conflicts that will be deleted on the retired accounts. | 22:05 |
clarkb | And finally doublecheck.20210702 a subset of those in the previous file which I have identified as riskier because the other side of the conflict was somewhat recently used | 22:05 |
clarkb | if you can take a look at those files and doublecheck the double check list I think we're just about ready to retire the accounts identified in the proposed-cleanups.20210702 file | 22:06 |
fungi | lookin' | 22:06 |
clarkb | I probably won't do that today because the way that script is set up it takes a long time and I have to acknowledge use of my ssh key (though I could temporarily turn that off). But I should definitely be able to run that tuesday | 22:06 |
fungi | so 36 high-risk | 22:08 |
clarkb | ya and even then I think those are relatively low risk because for each of them its pretty clear which is used more recently | 22:08 |
clarkb | but if we are going to run into problems I suspect it would be with that set. Maybe they are using the second account in some way that is harder to measure for example | 22:09 |
fungi | low-high-risk ;) | 22:09 |
fungi | huh... i only just noticed that the poetry readme uses oslo.utils as its example of a challenging dep solver problem: https://pypi.org/project/poetry/ | 22:55 |
opendevreview | arkady kanevsky proposed opendev/irc-meetings master: Changed Interop WG meeting time for the summer 2 hours earlier. https://review.opendev.org/c/opendev/irc-meetings/+/799337 | 23:44 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!