*** tosky has quit IRC | 00:00 | |
ianw | wheels aren't releasing due to "Could not lock the VLDB entry for the volume 536871142." | 00:54 |
---|---|---|
ianw | i feel like i already fixed that at some point ... | 00:54 |
ianw | http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2021-03-29.log.html#t2021-03-29T04:08:37 | 00:57 |
ianw | indeed | 00:57 |
ianw | VLDB entries for all servers which are locked: | 00:59 |
ianw | Total entries: 0 | 00:59 |
*** brinzhang has joined #opendev | 01:02 | |
*** iurygregory has quit IRC | 01:05 | |
ianw | ok, this is a red herring | 01:21 |
ianw | the real problem is | 01:21 |
ianw | https://zuul.openstack.org/build/6dd9f20e9b3a41d7a6e6dec8347b2a3d | 01:21 |
ianw | which is openafs failing to install on centos7 which means it can't publish | 01:24 |
ianw | for a long time i've been meaning to reorganise these jobs to use the executor's afs client to copy the data ... but anyway | 01:25 |
ianw | i feel like the last time this happened, it was because we hadn't updated centos nodes and the kernel had changed, and we couldn't get the headers | 01:26 |
ianw | ok, 785675,1 is stuck waiting for arm64 nodes | 01:27 |
ianw | i'm not sure why it hasn't timed out | 01:27 |
ianw | nl03 isn't responding for me and could explain this | 01:30 |
ianw | ... ignore that. helps if you try nl01.openDEV.org (not stack) | 01:32 |
ianw | kevinz: hrm, i think i see in scrollback you'd identified some bogus nodes right? | 01:35 |
kevinz | ianw: morning! There are 3 instances that are deleted but still have remaining metadata.. | 01:36 |
kevinz | I'm working on removing it from DB | 01:36 |
ianw | ok, cool | 01:36 |
ianw | it almost looks to me like the launcher has somehow forgotten about the nodes being requested by zuul | 01:37 |
ianw | it doesn't appear to be trying to satisfy any requests | 01:37 |
ianw | 2021-04-12 01:37:34,918 DEBUG nodepool.PoolWorker.linaro-us-main: Active requests: [] | 01:37 |
ianw | but system-config-zuul-role-integration-bionic-arm64 has been queued for 54 hours | 01:38 |
kevinz | ianw: You mean that linaro-us doesn't respond to any requests from Zuul? | 01:39 |
kevinz | I saw that just 6 instances are currently running on the cluster | 01:39 |
ianw | kevinz: no i don't think that's it. it seems like nodepool has somehow lost a bunch of requests; it is not trying to satisfy them | 01:40 |
ianw | i think linaro is responding ok | 01:40 |
kevinz | ianw: well, OK, what can I do to help? The first thing I think is to remove the existing "disappeared" vm instances from our cluster first | 01:41 |
ianw | kevinz: yeah, i don't know, this is a weird one | 01:44 |
kevinz | ianw: OK, I will fix this first to see if things will be better | 01:45 |
ianw | i feel like the node requests are not in zookeeper, so nodepool will never try to satisfy them. but zuul clearly thinks they are | 01:46 |
ianw | something happened about 58 hr 21 min ago | 01:46 |
ianw | 0a526f11-b784-416b-bd89-c5de47a9ba4c \| debian-buster-arm64-linaro-us-0023946093 \| BUILD \| \| debian-buster-arm64-1618117653 \| os.large \| | 01:47 |
ianw | kevinz: ^ can you see anything interesting relating to that | 01:48 |
ianw | 2021-04-11 08:22:04,016 INFO nodepool.NodeLauncher: [e: 788535e8d4bc49919afbc414a1fcaa45] [node_request: 300-0013647916] [node: 0023946093] Node is ready | 01:49 |
ianw | it seems to say the node is ready, but it's still showing "BUILDING"? | 01:49 |
ianw | but then "2021-04-11 09:45:19,783 INFO nodepool.NodeDeleter: Deleting ZK node id=0023946093, state=deleting, external_id=f5ee1b0f-107d-4965-a7b5-2375be42a30" | 01:50 |
ianw | this is all from yesterday | 01:50 |
ianw | ok, it's not correct that the requests aren't in zookeeper | 01:53 |
ianw | here is a log of a request in zookeeper and the NL related logs | 01:55 |
ianw | http://paste.openstack.org/show/804376/ | 01:55 |
ianw | this failed at | 01:55 |
ianw | launcher-debug.log.2021-04-09_15:2021-04-09 15:28:34,633 ERROR nodepool.NodeLauncher: [node_request: 300-0013634813] [node: 0023929201] Launch failed for node centos-8-stream-arm64-linaro-us-0023 | 01:56 |
ianw | after 3 attempts | 01:56 |
ianw | right, nodepool request-list shows this too | 02:01 |
ianw | i'm restarting nl03 container, i'm not sure what else to do | 02:04 |
ianw | kevinz: i think there is a problem | 02:06 |
ianw | i'm seeing a very helpful (not) message of | 02:06 |
ianw | openstack.exceptions.SDKException: Error in creating the server (no further information available) | 02:06 |
ianw | 2021-04-12 02:06:19,346 ERROR nodepool.NodeLauncher: [node_request: 300-0013634833] [node: 0023948459] Detailed node error: No valid host was found. There are not enough hosts available. | 02:07 |
ianw | kevinz: ^ it might actually be that | 02:07 |
ianw | kevinz: yeah, things are just going into ERROR state | 02:08 |
ianw | you can probably see that, nodepool is going crazy trying to create the nodes again :) | 02:09 |
kevinz | ianw: Yes I saw, quite a lot of instances are coming. Several instances are building and others have failed due to no valid host | 02:11 |
kevinz | ianw: I think we can stop some UT test since it is quite overloaded to the nodepool | 02:11 |
kevinz | btw, the 3 "disappeared" instances have been removed already | 02:12 |
ianw | yeah i would say this is trying to build too many nodes | 02:13 |
ianw | we've got max-servers: 40 ; i guess this is within limit | 02:15 |
ianw | kevinz: is this thundering herd of starting instances killing the cloud? | 02:15 |
kevinz | ianw: yes, the limit is 40. I will try to find one more node to join the cluster to relieve the overload | 02:17 |
kevinz | ianw: Yes the cloud is receiving a lot of creation requests, so it is slow now :-) | 02:18 |
ianw | ok, i can turn that down if it's gotten too high | 02:18 |
ianw | i'm having a few authentication issues, but hopefully we'll have 15 nodes from OSU OSL coming online soon | 02:19 |
kevinz | ianw: That's fine actually, I see some instance creations have finished | 02:20 |
kevinz | cool, you mean the 15 nodes are 15 vms or bare metal machines? | 02:20 |
kevinz | ianw: it looks like the OSU OSL machines are newer and maybe have better performance :-) | 02:21 |
ianw | 15 vms :) | 02:26 |
kevinz | OK, nice | 02:27 |
ianw | i'm going to grab some lunch and hopefully things will start moving now | 02:27 |
kevinz | ianw: OK, np | 02:43 |
*** cloudnull8 has quit IRC | 03:02 | |
*** cloudnull8 has joined #opendev | 03:02 | |
ianw | hrm, something is still up | 03:08 |
ianw | we've got like 6 active nodes and nothing trying to build, but the queue is huge | 03:08 |
ianw | kevinz: it still seems to go straight into error node | 03:11 |
ianw | s/node/mode | 03:11 |
ianw | 151a8028-569b-4178-b09f-8c8411cf6aa5 for example, can you see what happened with that? | 03:12 |
kevinz | ianw: I'm adding one new compute node to this cluster, and it is under operation now. This instance happened to be scheduled to this new node | 03:14 |
ianw | kevinz: oh, ok np. lmk when things are stable | 03:14 |
kevinz | ianw: I saw https://zuul.openstack.org/stream/52a25bc0333c40febdebf9319c994321?logfile=console.log is running | 03:27 |
ianw | kevinz: if you check https://zuul.openstack.org/status there's lots of things waiting for nodes | 03:28 |
kevinz | ianw: yes I see, | 03:28 |
ianw | i've turned the max servers down to 10 for a little while you're working on it | 03:29 |
ianw | as you say, it does seem some nodes are building now | 03:29 |
kevinz | ianw: how long does zuul wait for an instance creation? | 03:29 |
ianw | though that said, a bunch are in error | 03:29 |
ianw | e.g. 98b36e90-52ec-47fa-a413-5b246e1705af just errored | 03:30 |
ianw | kevinz: several days :) that's the problem ... | 03:30 |
kevinz | OK, will check | 03:30 |
ianw | kevinz: here's a big list http://paste.openstack.org/show/804377/ | 03:30 |
kevinz | I mean, is there a timeout for waiting on an instance launch, so it retries after the timeout? | 03:30 |
ianw | yeah, i would have expected all these to fail with timeouts, but they haven't. i think that's perhaps a separate, but related issue | 03:31 |
kevinz | ianw: OK, ack | 03:31 |
ianw | something about the way things are failing isn't making zuul/nodepool give up | 03:31 |
kevinz | ianw: 98b36e90-52ec-47fa-a413-5b246e1705af : No valid host was found | 03:33 |
kevinz | I think there are some scheduler issues maybe, which always make the cloud report no valid host... | 03:34 |
kevinz | Will fix the new host adding first anyway | 03:34 |
ianw | ok | 03:38 |
ianw | i've found at least one issue, that leaked nodes are put in a DELETING state but with no other details, and this confuses the quota calculator | 03:45 |
*** brinzhang_ has joined #opendev | 04:00 | |
*** brinzhang has quit IRC | 04:03 | |
*** mkowalski_ has joined #opendev | 04:12 | |
*** tristanC_ has joined #opendev | 04:15 | |
*** jrosser has quit IRC | 04:20 | |
*** tristanC has quit IRC | 04:20 | |
*** mkowalski has quit IRC | 04:20 | |
*** Alex_Gaynor has left #opendev | 04:21 | |
kevinz | ianw: adding one more 44-core machine to the cluster | 04:33 |
kevinz | adding finished and I've tested the instance creation | 04:34 |
kevinz | ianw: yes I always saw that the DELETING state blocked.. | 04:34 |
*** jrosser has joined #opendev | 04:34 | |
ianw | kevinz: ok, cool, quota back up to 40 nodes in linaro. i think the xxxlarge instances though will keep the number of running nodes more limited (hitting memory quota) | 04:35 |
kevinz | ianw: Yes. That is another problem | 04:36 |
*** ysandeep|away is now known as ysandeep | 04:53 | |
*** ykarel has joined #opendev | 04:56 | |
*** marios has joined #opendev | 05:08 | |
*** whoami-rajat_ has joined #opendev | 05:36 | |
*** sboyron has joined #opendev | 05:52 | |
*** ralonsoh has joined #opendev | 06:02 | |
*** slaweq has joined #opendev | 06:09 | |
*** eolivare has joined #opendev | 06:23 | |
*** dmsimard has quit IRC | 06:48 | |
openstackgerrit | Merged openstack/project-config master: Bump node version for publish-openstack-stackviz-element https://review.opendev.org/c/openstack/project-config/+/785768 | 06:49 |
*** dmsimard has joined #opendev | 06:51 | |
*** amoralej|off is now known as amoralej | 06:52 | |
openstackgerrit | Merged openstack/project-config master: nodepool elements: create suse boot rc directory https://review.opendev.org/c/openstack/project-config/+/781002 | 07:02 |
*** fressi has joined #opendev | 07:05 | |
*** eolivare has quit IRC | 07:07 | |
*** andrewbonney has joined #opendev | 07:08 | |
*** eolivare has joined #opendev | 07:09 | |
ianw | fungi: ^ i think that wheel building is held up because openafs fails to install on centos7. i think that's because our images have an out of date kernel, and the headers are not on the mirror any more. and i think that's because it's stuck behind suse. and that's what ^ fixes :) | 07:14 |
ianw | just a typical day in dependency land! | 07:14 |
*** dmsimard has quit IRC | 07:32 | |
*** dmsimard has joined #opendev | 07:33 | |
*** tosky has joined #opendev | 07:35 | |
*** jpena|off is now known as jpena | 07:54 | |
*** rpittau|afk is now known as rpittau | 08:04 | |
*** ysandeep is now known as ysandeep|lunch | 08:11 | |
*** gnuoy` has joined #opendev | 08:22 | |
*** gnuoy has quit IRC | 08:26 | |
*** brinzhang_ is now known as brinzhang | 08:55 | |
*** dtantsur|afk is now known as dtantsur | 08:56 | |
*** ysandeep|lunch is now known as ysandeep | 08:56 | |
hrw | morning | 09:45 |
hrw | I see that check-arm64 queue cleaned up | 09:45 |
*** whoami-rajat_ is now known as whoami-rajat | 10:33 | |
*** brinzhang_ has joined #opendev | 11:05 | |
*** brinzhang has quit IRC | 11:08 | |
*** iurygregory has joined #opendev | 11:11 | |
*** artom has joined #opendev | 11:19 | |
openstackgerrit | Guillaume Chauvel proposed opendev/system-config master: Increase autogenerated comment width to avoid line wrap https://review.opendev.org/c/opendev/system-config/+/771445 | 11:31 |
openstackgerrit | Guillaume Chauvel proposed opendev/system-config master: [DNM] test comment width: review without autogenerated tag https://review.opendev.org/c/opendev/system-config/+/771798 | 11:31 |
*** jpena is now known as jpena|lunch | 11:32 | |
*** dhellmann_ has joined #opendev | 11:44 | |
*** dhellmann has quit IRC | 11:45 | |
*** dhellmann_ is now known as dhellmann | 11:45 | |
fungi | ianw: it also sounds like you might have run into the same stuck node requests i've been trying to track down the cause of for a few weeks now | 11:52 |
fungi | ianw: http://eavesdrop.openstack.org/irclogs/%23zuul/%23zuul.2021-04-03.log.html#t2021-04-03T16:39:47 | 11:53 |
fungi | not sure if that sounds like some of what you saw too | 11:53 |
*** jpena|lunch is now known as jpena | 12:30 | |
*** cloudnull8 is now known as cloudnull | 12:48 | |
*** stephenfin has quit IRC | 12:49 | |
*** amoralej is now known as amoralej|lunch | 12:52 | |
*** stephenfin has joined #opendev | 13:08 | |
openstackgerrit | Guillaume Chauvel proposed opendev/system-config master: Increase autogenerated comment width to avoid line wrap https://review.opendev.org/c/opendev/system-config/+/771445 | 13:15 |
openstackgerrit | Guillaume Chauvel proposed opendev/system-config master: [DNM] test comment width: review without autogenerated tag https://review.opendev.org/c/opendev/system-config/+/771798 | 13:15 |
openstackgerrit | Merged openstack/diskimage-builder master: Add Debian Bullseye Zuul job https://review.opendev.org/c/openstack/diskimage-builder/+/783790 | 13:16 |
zigo | Hi there! | 13:20 |
zigo | I was wondering, would there be a way to get, in gerrit, a direct link to a plain patch file? | 13:20 |
zigo | I mean, no zip, tar.xz or base64... | 13:20 |
zigo | It'd be really helpful for me. | 13:20 |
hrw | zigo: press 'DOWNLOAD' link | 13:22 |
hrw | ah. you were there already | 13:22 |
hrw | zigo: curl patchlink \| base64 --decode? | 13:22 |
zigo | hrw: Yeah, it has diff.base64, diff.zip, tgz, tar, tbz2, txz ... | 13:22 |
zigo | hrw: Yeah, I know, I can do that... :) | 13:23 |
zigo | I'd prefer if I didn't have to. | 13:23 |
hrw | zigo: I went that way in CI job | 13:24 |
hrw | as it was easiest way to fetch patches without having gerrit account | 13:24 |
zigo | hrw: It's not about CI or automation, it's that I very often pick-up patches by hand, and that's always one more step to do ... | 13:25 |
*** amoralej|lunch is now known as amoralej | 13:25 | |
hrw | zigo: make an alias? | 13:26 |
*** fressi has left #opendev | 13:49 | |
fungi | https://review.opendev.org/Documentation/rest-api-changes.html#get-patch explains the rest api call that download link represents | 13:57 |
fungi | i expect the reason for base64 encoding is that the diff could be of a binary file, and so trying to display that in a web browser would get weird | 13:58 |
*** sboyron has quit IRC | 14:06 | |
*** sboyron has joined #opendev | 14:07 | |
*** ykarel has quit IRC | 14:21 | |
*** ykarel has joined #opendev | 14:24 | |
clarkb | another approach could be to use git fetch | 14:44 |
clarkb | git fetch && git show FETCH_HEAD > foo.patch | 14:44 |
*** snapdeal has joined #opendev | 14:48 | |
fungi | yep, you can even fetch those refs from the opendev.org gitea server farm | 14:49 |
fungi | unfortunately gitea doesn't have a way to call named refs in its webui that i've been able to figure out (something i sorely miss from cgit and gitweb) | 14:50 |
clarkb | fungi re https://review.opendev.org/c/opendev/git-review/+/785723 I'll get that installed after breakfast then try and remember to use it once or twice to push some actual code | 14:50 |
fungi | you could use the gerrit-provided gitweb to do it, i think, but you'd need to be authenticated first because of the way it's hooked up | 14:50 |
fungi | clarkb: thanks! | 14:50 |
*** marios is now known as marios|call | 14:53 | |
*** dpawlik has quit IRC | 14:58 | |
*** marios|call is now known as marios | 15:01 | |
*** ykarel is now known as ykarel|away | 15:19 | |
clarkb | I've received notice that git.airshipit.org and survey.openstack.org's ssl certs have 30 days of validity remaining. these are not LE certs. Do we want to bother renewing them? | 15:19 |
fungi | for survey.openstack.org i expect we can just let it expire | 15:20 |
fungi | i was going to propose we take that out of service anyway | 15:21 |
fungi | for git.airshipit.org it's probably a one-liner to add it to the other git redirect domains we already generate certs for | 15:21 |
fungi | just need a corresponding cname for the acme stuff | 15:21 |
fungi | clarkb: good news, the git.airshipit.org cert we're deploying is already generated with lets encrypt, so that can be ignored | 15:23 |
fungi | Subject: CN = git.airshipit.org | 15:23 |
fungi | Issuer: C = US, O = Let's Encrypt, CN = R3 | 15:23 |
clarkb | oh even better | 15:23 |
fungi | Not After : May 18 05:31:57 2021 GMT | 15:23 |
clarkb | and ya I agree re survey | 15:23 |
clarkb | I bet that got updated when the stuff serving files moved servers a while back | 15:24 |
fungi | yup, was pretty sure we had done them all, which is why i double-checked | 15:25 |
clarkb | fungi: do you have a sense for where the openstack release process is re final RCs? I'm starting to try and page the zk cluster rolling replacements back in and wonder if we need to be careful for their release still | 15:27 |
fungi | clarkb: wednesday around 10:00 utc i think is when the final release versions will all be tagged | 15:30 |
clarkb | cool in that case probably waiting for at least wednesday is fine | 15:30 |
clarkb | I can find other items to occupy my time between now and then | 15:30 |
clarkb | do the gerrit 3.2.8 upgrade later this week too likely | 15:30 |
*** ykarel|away has quit IRC | 15:33 | |
*** mlavalle has joined #opendev | 15:42 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Add note about python -u to external id cleanup script https://review.opendev.org/c/opendev/system-config/+/785907 | 15:45 |
clarkb | fungi: ^ that was pushed with `git review -v --no-thin` the -v helped me verify the command (and --no-thin was present as expected) and --no-thin appears to have functioned fine | 15:46 |
clarkb | given that seems to work and the test didn't cause any problems I suspect we can land that | 15:47 |
fungi | very cool! yeah, looking | 15:47 |
*** roman_g has joined #opendev | 15:47 | |
*** amoralej is now known as amoralej|off | 15:49 | |
clarkb | Looking at the gerrit user account conflicts I see there are a small number of CI accounts that we can likely pretty safely untangle | 15:57 |
clarkb | basically remove the human identifying conflict from the CI accounts | 15:57 |
clarkb | and let the human account be the owner of that external id without conflict | 15:58 |
clarkb | in some cases it is two different CI accounts conflicting with each other. In those cases I think we simply disable the one that least recently commented and clean it up | 15:58 |
clarkb | but I do need to review things because I think in some of these cases we don't actually want to retire any accounts as both are being used. We just want the CI system to have a CI system email addr and a human to have a human email addr without conflict | 16:00 |
fungi | also i expect there are cases where multiple ci systems were created with the same e-mail address | 16:04 |
clarkb | I'm also seeing a non zero set of accounts with ssh keys set but no username | 16:06 |
clarkb | I think the only way that would really make sense is if those accounts had been merged previously? | 16:06 |
fungi | yes, i think so. we often didn't remove old ssh keys from accounts we merged into other accounts | 16:08 |
clarkb | ya looking at more recently used timestamps and other attributes it seems that this is likely the case | 16:09 |
*** dtroyer has joined #opendev | 16:24 | |
clarkb | There is one CI account that conflicts with human accounts for four other people. I suspect in a case like that we don't retire anything, but simply remove the conflicts from the CI account, but I need to look at the external ids for that account more closely | 16:31 |
clarkb | 3 of those conflicts are simple mailtos and can be cleaned up. The fourth conflict is between emails on openids between what may be a human account and the CI account | 16:35 |
clarkb | the human account hasn't been used since 2015 though, but the ci account has been used this year. I guess in that case we can "sacrifice" the human account? | 16:35 |
fungi | yeah, i would | 16:40 |
fungi | we can always help them get a new account set up later if they come to us | 16:40 |
clarkb | It is interesting to see how different some of these accounts are from each other in terms of how they conflict | 16:47 |
clarkb | I'm going through and trying to understand each one a little better | 16:47 |
*** hamalq has joined #opendev | 16:47 | |
*** hamalq has quit IRC | 16:47 | |
*** hamalq has joined #opendev | 16:48 | |
fungi | it seems like it provides a window into the history of our infrastructure and how users have interacted with it in the past | 16:51 |
*** dtroyer has quit IRC | 16:53 | |
*** rpittau is now known as rpittau|afk | 16:59 | |
*** ysandeep is now known as ysandeep|holiday | 17:01 | |
*** marios is now known as marios|out | 17:03 | |
*** dtantsur is now known as dtantsur|afk | 17:06 | |
*** marios|out has quit IRC | 17:06 | |
corvus | infra-root: the squiggly lines on cacti and grafana look good to me. in particular, the memory line on cacti is not at all squiggly and is in fact horizontal. i think we're probably good to cut a release of zuul (which will be a good checkpoint release we can roll back to if needed for the next bit of v5 work). concurrences? dissents? | 17:14 |
fungi | corvus: i agree, the memory leak looks very much solved now. this would make for a good version to release | 17:15 |
*** sboyron has quit IRC | 17:21 | |
*** sboyron has joined #opendev | 17:21 | |
clarkb | corvus: sounds good to me | 17:23 |
clarkb | fungi: can you look at review:~clarkb/gerrit_user_cleanups/next-cleanups-20210412.txt ? That is what I worked through based on the conversation above. If all looks well to you I'll be trying to run the retire step against the ones listed as retireable and then in a few days can do the external id cleanups | 17:24 |
*** jpena is now known as jpena|off | 17:48 | |
*** ralonsoh has quit IRC | 17:52 | |
*** eolivare has quit IRC | 18:00 | |
openstackgerrit | Clark Boylan proposed opendev/git-review master: Add option for disabling thin pushes https://review.opendev.org/c/opendev/git-review/+/785723 | 18:00 |
clarkb | fungi: ^ manpage updated | 18:01 |
fungi | clarkb: were you adding a release note too (see earlier review comment)? | 18:03 |
*** roman_g has quit IRC | 18:08 | |
*** roman_g has joined #opendev | 18:08 | |
*** roman_g has quit IRC | 18:09 | |
*** roman_g has joined #opendev | 18:09 | |
*** roman_g has quit IRC | 18:09 | |
*** roman_g has joined #opendev | 18:10 | |
*** roman_g has quit IRC | 18:10 | |
*** roman_g has joined #opendev | 18:11 | |
*** roman_g has joined #opendev | 18:12 | |
*** roman_g has quit IRC | 18:12 | |
*** roman_g has joined #opendev | 18:12 | |
*** roman_g has quit IRC | 18:13 | |
*** roman_g has joined #opendev | 18:13 | |
*** roman_g has joined #opendev | 18:14 | |
*** roman_g has quit IRC | 18:14 | |
openstackgerrit | Merged opendev/system-config master: Add note about python -u to external id cleanup script https://review.opendev.org/c/opendev/system-config/+/785907 | 18:15 |
fungi | clarkb: on the "skipping" comments in your new cleanup list, those apply to the line immediately following them? | 18:16 |
fungi | i guess you're not feeding the list directly to a script so it's fine not to comment them out | 18:16 |
fungi | also the midokura account being dormant is a good call, pretty sure they're no longer involved, the midonet neutron driver has been retired due to lack of maintenance | 18:19 |
fungi | the plan in the latter half of next-cleanups-20210412.txt looks good, spot checks of the various account classes reflect the states i would expect | 18:21 |
clarkb | fungi: yup to the line below | 18:29 |
clarkb | fungi: oh I missed the release note comment. I'll address that | 18:29 |
openstackgerrit | Clark Boylan proposed opendev/git-review master: Add option for disabling thin pushes https://review.opendev.org/c/opendev/git-review/+/785723 | 18:38 |
clarkb | fungi: how's that? | 18:38 |
clarkb | fungi: and ya that's not the direct input to retire accounts. It takes a bit of massaging (I have to take the account ids and prefix them with refs/users/XY/ for account id ABXY) | 18:40 |
*** sboyron has quit IRC | 18:41 | |
*** andrewbonney has quit IRC | 18:47 | |
clarkb | alright I'm going to start retiring the 56 accounts in that list | 18:54 |
fungi | sounds good, thanks! | 18:54 |
clarkb | that is done now and logs are in the normal location on review | 19:25 |
clarkb | I'm going to look at what is necessary to do the manual surgery that I proposed for the subset of CI accounts that we cannot just turn off now | 19:25 |
*** mailingsam has joined #opendev | 19:41 | |
*** whoami-rajat has quit IRC | 19:55 | |
clarkb | I have sent email to the third party account contacts for the accounts that need manual surgery. I gave them a week to respond otherwise I would proceed with the corrective action I described in the email | 20:15 |
clarkb | fungi was cc'd as well if questions come up and I am not around | 20:15 |
fungi | yep, i'll keep an eye out for replies | 20:18 |
clarkb | continues to feel like decent progress on the account conflict front | 20:19 |
fungi | very much so | 20:21 |
*** slaweq has quit IRC | 20:38 | |
*** slaweq has joined #opendev | 20:41 | |
*** snapdeal has quit IRC | 20:46 | |
*** hamalq has quit IRC | 21:11 | |
*** hamalq has joined #opendev | 21:11 | |
*** dmellado has quit IRC | 21:18 | |
*** dmellado has joined #opendev | 21:21 | |
*** jralbert has joined #opendev | 21:25 | |
ianw | clarkb/fungi: could you take a look at the ipv6 config @ https://review.opendev.org/c/opendev/system-config/+/785556 and make sure it's what we're thinking | 21:36 |
ianw | for review02 | 21:36 |
ianw | i would say linaro is not looing happy again :( | 21:38 |
ianw | looking even | 21:38 |
ianw | kevinz: ^ we don't seem to have any active nodes | 21:38 |
fungi | ianw: not sure if you saw my comments earlier today, but were you possibly seeing stuck node requests? | 21:43 |
ianw | fungi: yesterday, i would say yes. there were node requests in zookeeper and nodepool did not appear to be trying to satisfy them | 21:44 |
*** sshnaidm|pto has quit IRC | 21:44 | |
ianw | right now, linaro has three requests | 21:47 |
ianw | 2021-04-12 21:44:51,362 DEBUG nodepool.PoolWorker.linaro-us-main: Active requests: ['300-0013664209', '300-0013664234', '300-0013664235'] | 21:47 |
ianw | and three servers in BUILD status | 21:47 |
ianw | and no active servers | 21:48 |
fungi | ianw: 785556 looks fine other than the mac i think? where did you find that one? | 21:49 |
ianw | ohhhhh, i may have copy-pasted that wrong | 21:49 |
fungi | also do we do something similar for the mirror server, or was that set by hand? | 21:49 |
ianw | it seems the mirror might have been setup by hand | 21:50 |
fungi | got it, so if this works w/ netplan and ansible, maybe we can copy the same mechanism at least | 21:50 |
ianw | ok, three linaro nodes have now gone active. but the queue is large and we should have plenty of capacity for more | 21:51 |
fungi | server list reports only the active three as well | 21:51 |
*** slaweq has quit IRC | 21:53 | |
fungi | File "/usr/local/lib/python3.7/site-packages/nodepool/driver/utils.py", line 280, in estimatedNodepoolQuotaUsed | 21:54 |
fungi | if node.type[0] not in provider_pool.labels: | 21:54 |
fungi | IndexError: list index out of range | 21:54 |
fungi | not sure what's causing that in the debug log | 21:54 |
ianw | fungi: yeah, i debugged that yesterday -> https://review.opendev.org/c/zuul/nodepool/+/785821 | 21:54 |
fungi | oh, immediately above it... ERROR nodepool.driver.openstack.OpenStackProvider: Couldn't consider invalid node | 21:54 |
ianw | in short, openstack driver puts in a dummy entry for leaked nodes that doesn't have a "type"; the quota calculator however sees the dummy node and tries to look up its config to see what flavor it is | 21:55 |
fungi | i wonder if the launcher thinks we're at quota in linaro-us even though we're nowhere near | 21:56 |
ianw | 2021-04-12 21:56:38,670 DEBUG nodepool.PoolWorker.linaro-us-main: Active requests: [] | 21:56 |
ianw | now it sees no active requests, despite the queue being large | 21:57 |
ianw | this is more or less what i saw yesterday. an examination of zk would probably show a lot of node requests. also restarting the launcher will probably pick them up | 21:57 |
fungi | luckily restarting the launcher isn't especially disruptive, at worst it discards some building nodes... but yeah i'm not sure why it sometimes doesn't signal that it's satisfied or rejected a node request after considering it | 21:58 |
fungi | after *locking* and considering it, that is to say | 21:59 |
corvus | why do you think the queue is large? | 22:00 |
corvus | grafana says it's 3, and nodepool request-list says 4 | 22:00 |
ianw | corvus: the check-arm64 queue has a lot of things waiting in the ui | 22:00 |
corvus | but not queued? | 22:01 |
corvus | which tenant has a lot of items in check-arm64? | 22:02 |
corvus | i see 0 in openstack | 22:02 |
ianw | sigh, so do i after reloading the status page | 22:03 |
ianw | it looks like auto-refresh stopped after some point when i left it | 22:03 |
ianw | i agree sorry | 22:03 |
corvus | maybe the restart on friday | 22:03 |
fungi | okay, so probably things are going okay in that provider for now | 22:03 |
corvus | ianw: no worries, a problem easily solved! :) | 22:04 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Add zuul keystore password https://review.opendev.org/c/opendev/system-config/+/785980 | 22:05 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: review02: pin ipv6 configuration https://review.opendev.org/c/opendev/system-config/+/785556 | 22:06 |
ianw | on a related note, we've sorted out all the credentials for the OSU OSL arm64 resources. i'll work on incorporating today | 22:07 |
ianw | kevinz: ^ please ignore my ping too :) it was -ESHOULDGETCUPOFTEABEFORECHECKINGTHINGS | 22:09 |
ianw | it looks like the wheel volumes got released, which i am assuming was cleared up by fresh centos images | 22:18 |
fungi | seemed that way | 22:34 |
ianw | btw i liked the suggestion of replacing openstack/openstack-planet with the OPML file in https://review.opendev.org/c/opendev/system-config/+/784191 so will do that | 22:35 |
fungi | yeah, not a bad idea to at least keep the last state of that list | 22:36 |
ianw | if no objections i'll make planet.openstack.org point to static and set that up just to redirect to the gitea page, where we can have a readme and the opml file | 22:36 |
fungi | yep, similar to how openstack handles docs url redirects for retired deliverables | 22:38 |
openstackgerrit | Merged opendev/system-config master: Fix up openafs-client job matching https://review.opendev.org/c/opendev/system-config/+/778353 | 22:43 |
clarkb | ianw: yes that netplan config lgtm, fungi already approved it but I put a +2 as well. When we did the mirror I did a reboot to ensure the old state was cleared out too iirc | 22:44 |
ianw | clarkb: yep, will do | 22:44 |
ianw | just changing locations here, back in about 20 min | 22:44 |
clarkb | as a side note we can safely modify the cloud init file because we should be removing that package after first boot | 22:52 |
*** gothicserpent has quit IRC | 22:54 | |
*** artom has quit IRC | 22:54 | |
*** artom has joined #opendev | 22:54 | |
*** gothicserpent has joined #opendev | 22:57 | |
clarkb | ianw: I said I would double check with you before dropping the tarballs ord replica meeting agenda item. Should that be dropped now or would you like to keep it up? | 23:05 |
ianw | clarkb: i think drop; we know it's taking a long time but i guess it doesn't really matter | 23:08 |
ianw | it's being "vos release" | 23:08 |
clarkb | k | 23:08 |
ianw | i had some terse notes here on using "vos dump" | ssh cat that were working, for a recovery case | 23:09 |
ianw | oh and ajaeger confirmed that docs-old can go. i'll work on removing that too | 23:12 |
clarkb | nice | 23:12 |
clarkb | ianw: not sure if you saw but I'm starting to run into where communicating to users is going to be necessary re gerrit accounts. No responses yet, though the first couple of sets are in china I think so I probably won't hear back until tomorrow morning at the earliest | 23:13 |
clarkb | but I did find another batch of ~56 that could be cleaned up (they had ssh keys but not user names and often invalid openids or no openids which implied to me they are likely accounts that had been previously merged) | 23:14 |
clarkb | and the agenda has been sent out | 23:14 |
ianw | excellent, thanks for keeping on the user cleanup! | 23:15 |
clarkb | tomorrow I'm going to start putting some ptg stuff together as I suspect the later half of the week will be consumed with zk stuff | 23:16 |
clarkb | if anyone has ideas or wants to get started on that before me feel free :) | 23:17 |
openstackgerrit | Merged opendev/system-config master: haproxy: write to container log files https://review.opendev.org/c/opendev/system-config/+/783120 | 23:17 |
openstackgerrit | Merged opendev/system-config master: Add OSUOSL cloud https://review.opendev.org/c/opendev/system-config/+/785813 | 23:18 |
openstackgerrit | Merged opendev/system-config master: review02: pin ipv6 configuration https://review.opendev.org/c/opendev/system-config/+/785556 | 23:18 |
clarkb | ianw: if you have a moment https://review.opendev.org/c/opendev/git-review/+/785723 is what should be a straightforward git-review update to allow for --no-thin pushes (to workaround an annoying but infrequent pack file disagreement between jgit and c git) | 23:18 |
clarkb | will be easier for us to have people do git review --no-thin than git push --no-thin gerrit HEAD:refs/for/master | 23:19 |
ianw | ok, it sounds like i need a trip to the git man page :) | 23:20 |
ianw | if i'm not back in 30 minutes, send a search party | 23:20 |
*** tosky has quit IRC | 23:24 | |
clarkb | ha, there is a link to an old lp bug in the commit message too which may help | 23:26 |
ianw | the man page doesn't, because no-thin isn't documented | 23:27 |
clarkb | ianw: grep for thin in git push | 23:27 |
clarkb | it is formatted as --[no-]thin | 23:27 |
ianw | ahhh | 23:28 |
*** artom has quit IRC | 23:29 | |
ianw | lgtm too i guess :) i'm not sure how a normal person would figure this out ... is it something repeatable with certain trees? | 23:32 |
clarkb | ianw: it has been repeatable for certain users until they do something like rebase or do the --no-thin manually. And yes I don't think people will necessarily figure it out themselves, but we can point them to the flag if it comes up | 23:33 |
clarkb | since instructing someone to do git review --no-thin is easier than detailing how to push to gerrit without git review | 23:34 |
ianw | sure. i guess the rebase bit might be the key there. it sort of makes sense i guess if we've repacked everything as we do and the client hasn't updated in a long time ... sort of. i mean the idea should be that any object is reachable all the time you'd think | 23:35 |
clarkb | yup, I think the issue is when you do a thin transaction each side takes the list of refs they know about and the list of refs they want one side to know about and then build an assumption about what objects need to be present in the tree | 23:36 |
clarkb | but it seems that sometimes jgit and c git disagree | 23:36 |
clarkb | and that is how the problem bubbles up | 23:36 |
clarkb | also it seems that protocol v2 vs v1 doesn't change the behavior | 23:37 |
clarkb | we managed to get the last person to hit this to check for us | 23:37 |
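Editor's note: the manual workaround discussed above can be sketched as a small helper that builds the push command. This is only an illustration, not code from git-review; the function name `gerrit_push_command` and its defaults (a remote called `gerrit`, the `master` branch) are assumptions taken from the command quoted in the conversation.

```python
from typing import List


def gerrit_push_command(
    remote: str = "gerrit",   # assumed remote name, as in the quoted command
    branch: str = "master",   # assumed target branch
    no_thin: bool = False,
) -> List[str]:
    """Build the argv for pushing HEAD to Gerrit's review ref for `branch`.

    Hypothetical helper for illustration; it only assembles the command
    and does not run git.
    """
    cmd = ["git", "push"]
    if no_thin:
        # A thin pack may omit delta base objects the sender believes the
        # receiver already has. --no-thin disables that, so the pack is
        # self-contained and the two sides need not agree on those objects.
        cmd.append("--no-thin")
    cmd += [remote, f"HEAD:refs/for/{branch}"]
    return cmd


if __name__ == "__main__":
    print(" ".join(gerrit_push_command(no_thin=True)))
```

With the git-review change mentioned above, the same effect would be requested as `git review --no-thin` instead of spelling out the push by hand.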
ianw | it looks like arm64 isn't happy in some way | 23:38 |
ianw | https://review.opendev.org/c/opendev/system-config/+/785675 all reported retry_limit | 23:38 |
ianw | Failed to fetch https://mirror.regionone.linaro-us.opendev.org/ubuntu-ports/dists/bionic/universe/binary-arm64/Packages 403 Forbidden | 23:39 |
clarkb | looks like mirror problems, ya | 23:39 |
ianw | indeed, 403 is an odd one | 23:39 |
clarkb | I'm having a hard time getting anything to load | 23:40 |
clarkb | the mirror root even | 23:40 |
ianw | ls: cannot open directory '/afs/openstack.org/': Connection timed out | 23:40 |
clarkb | ah that would do it | 23:40 |
ianw | ping: afs01.dfw.openstack.org: Temporary failure in name resolution | 23:41 |
ianw | so ... yeah | 23:41 |
ianw | it doesn't appear to have an ipv4 address | 23:42 |
clarkb | dhcp should be serving those, did our lease run out and we just never got a new one? | 23:45 |
clarkb | or is the other side of the NAT fine and our 1:1 to the fip broke? | 23:45 |
ianw | like "ip addr" doesn't show an ipv4 | 23:46 |
ianw | i'm going to reboot it before i dig too much further | 23:52 |
ianw | kevinz: ^ it still doesn't have an ipv4. my current theory is that dhcp is not responding? | 23:59 |
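Editor's note: the "does this host have an ipv4 address" check above can be sketched by parsing the one-line-per-address output of `ip -o -4 addr show`. This is a minimal sketch and not a tool used in the conversation; the function names are hypothetical, and it assumes the usual one-line layout in which the interface name is the second field and the address follows the `inet` token.

```python
from typing import Dict, List


def ipv4_addresses(ip_output: str) -> Dict[str, List[str]]:
    """Map interface name -> IPv4 addresses found in `ip -o -4 addr show` output.

    Assumes each line looks like: "<index>: <ifname>    inet <addr>/<prefix> ...".
    """
    result: Dict[str, List[str]] = {}
    for line in ip_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or "inet" not in fields:
            continue  # blank line or not an IPv4 address entry
        idx = fields.index("inet")
        if idx + 1 >= len(fields):
            continue  # malformed line: nothing after "inet"
        result.setdefault(fields[1], []).append(fields[idx + 1])
    return result


def has_non_loopback_ipv4(ip_output: str) -> bool:
    """True if any interface other than `lo` carries an IPv4 address."""
    return any(name != "lo" for name in ipv4_addresses(ip_output))
```

If this returns False for a host that should get its address over DHCP, that is consistent with the theory above that the DHCP server is not responding, although it does not by itself rule out other causes.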