fungi | clarkb: see the beginning of the "Creating a Volume" section at https://opendev.org/opendev/system-config/raw/branch/master/doc/source/afs.rst and compare to the rendered version at https://docs.opendev.org/opendev/system-config/latest/afs.html#creating-a-volume | 00:00 |
clarkb | thanks | 00:11 |
clarkb | I've got something working except that dot doesn't order nodes and it keeps putting the fourth column in the third position | 00:12 |
clarkb | and everything I try to do to correct that results in the graph going wild | 00:12 |
corvus | there's ordering based on "rank"; https://graphviz.org/docs/attrs/constraint/ may help | 00:18 |
clarkb | ya I've been playing with constraint=false to try and understand how it impacts the behavior and it seems to be a noop here | 00:27 |
clarkb | I'm sure it isn't but it doesn't seem to meaningfully change the graph output | 00:27 |
clarkb | oh got it. The important thing seemed to be having invisible edges so that order of edge priority is preserved? | 00:30 |
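[A minimal sketch of the invisible-edge ordering trick being described; the node and label names are made up, not taken from the actual change:]

    digraph seq {
        // pin the column-header nodes to one row, in declaration order
        { rank=same; scheduler; executor; merger; }
        // invisible edges between neighbouring columns fix the left-to-right order
        scheduler -> executor -> merger [style=invis];
        // real message edges between lifeline nodes go below as usual
        scheduler_1 -> executor_1 [label="start job"];
    }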
opendevreview | Tristan Cacqueray proposed zuul/zuul-jobs master: Update ensure-ghc to use ghcup https://review.opendev.org/c/zuul/zuul-jobs/+/847111 | 00:31 |
opendevreview | Tristan Cacqueray proposed zuul/zuul-jobs master: Update ensure-ghc to use ghcup https://review.opendev.org/c/zuul/zuul-jobs/+/847111 | 00:40 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Replace blockdiag and seqdiag with graphviz in docs https://review.opendev.org/c/opendev/base-jobs/+/930082 | 01:17 |
clarkb | corvus: I think ^ that works for the most part. The only thing that is missing is the little warning box from the second sequence diagram | 01:18 |
clarkb | JayF: cid ^ fyi thats my hacked up graphviz replacement for sequence diagrams adapted from a stack overflow example | 01:18 |
clarkb | for the warning box I'm thinking rather than do that in the graph we could do it in text below the graph? Or just drop it entirely? Open to feedback and ideas on that | 01:19 |
clarkb | looks like xlabel may do what we want too? | 01:22 |
clarkb | getting that to render nicely is not straightforward; the docs just say "somewhere near" and in my example it puts it in an awkward spot | 01:25 |
clarkb | oh wait maybe I managed it | 01:26 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Replace blockdiag and seqdiag with graphviz in docs https://review.opendev.org/c/opendev/base-jobs/+/930082 | 01:28 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Replace blockdiag and seqdiag with graphviz in docs https://review.opendev.org/c/opendev/base-jobs/+/930082 | 01:30 |
clarkb | https://2dc96497c2ccf70a128b-4f9d8814b5757ae609c9c9a4c385ce18.ssl.cf1.rackcdn.com/930082/4/check/opendev-tox-docs/2f734e6/docs/docker-image.html | 01:34 |
clarkb | hrm I'm noticing that things aren't quite vertical in the first diagram and it seems to be worse in the zuul rendered png compared to locally | 01:38 |
clarkb | it's fairly minor though so maybe we don't care (also that may be a side effect of adding the _5 nodes to try and make the solid vs dashed lines correct) | 01:38 |
clarkb | ya seems less pronounced if I remove those extra nodes. But again I think its probably fine as is? | 01:41 |
JayF | eh, I mean, good enough | 02:16 |
JayF | it works as a diagram | 02:16 |
JayF | if someone wants to hold a ruler up to their screen, patches accepted? :D | 02:16 |
opendevreview | Tony Breeds proposed opendev/zone-opendev.org master: Remove stray whitespace https://review.opendev.org/c/opendev/zone-opendev.org/+/926690 | 03:33 |
opendevreview | Merged opendev/zone-opendev.org master: Remove stray whitespace https://review.opendev.org/c/opendev/zone-opendev.org/+/926690 | 05:58 |
opendevreview | Jonathan Rosser proposed opendev/system-config master: reprepro: mirror Ubuntu UCA Dalmation for Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/930294 | 12:11 |
opendevreview | Stephen Finucane proposed openstack/project-config master: Remove CI jobs from trio2o https://review.opendev.org/c/openstack/project-config/+/930302 | 12:37 |
opendevreview | Jonathan Rosser proposed opendev/system-config master: reprepro: mirror Ubuntu UCA Dalmation for Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/930294 | 12:39 |
opendevreview | Jonathan Rosser proposed opendev/system-config master: reprepro: mirror Ubuntu UCA Dalmatian for Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/930294 | 12:47 |
opendevreview | Stephen Finucane proposed openstack/project-config master: Remove jobs for openinfra/groups https://review.opendev.org/c/openstack/project-config/+/930305 | 12:50 |
opendevreview | Stephen Finucane proposed openstack/project-config master: Remove jobs for x/kingbird https://review.opendev.org/c/openstack/project-config/+/930306 | 12:52 |
opendevreview | Stephen Finucane proposed openstack/project-config master: Remove jobs for x/omni https://review.opendev.org/c/openstack/project-config/+/930307 | 12:53 |
opendevreview | Stephen Finucane proposed openstack/project-config master: Remove jobs for dead projects https://review.opendev.org/c/openstack/project-config/+/930305 | 13:06 |
opendevreview | Stephen Finucane proposed openstack/project-config master: Remove references to legacy-sandbox-tag job https://review.opendev.org/c/openstack/project-config/+/930319 | 13:26 |
opendevreview | Tobias Rydberg proposed opendev/irc-meetings master: Change to odd weeks for irc meetings for publiccloud-sig. https://review.opendev.org/c/opendev/irc-meetings/+/930334 | 13:56 |
opendevreview | James E. Blair proposed opendev/base-jobs master: Fine-tune graphviz sequence diagrams https://review.opendev.org/c/opendev/base-jobs/+/930358 | 16:18 |
corvus | clarkb: ^ that tweaks a few things; i think the framework you came up with is great and is easy to understand! :) | 16:19 |
clarkb | oh interesting using an xlabel you were able to avoid the edge line intersecting with the description text? | 16:24 |
clarkb | for the change around line 109 | 16:24 |
clarkb | and makes sense that being better about horizontal space would make straighter vertical lines possible. Thanks for the update | 16:25 |
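[Roughly how xlabel keeps the text off the line; the edge and label here are invented for illustration:]

    digraph seq {
        // xlabel places the text beside the edge instead of centered on top of it,
        // so the arrow no longer runs through the description
        executor_2 -> merger_2 [xlabel="fetch refs"];
    }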
clarkb | corvus: and are those the same graphs in zuul-jobs / zuul etc? Should make porting easy if we decide to do that | 16:28 |
corvus | yes, i think they're identical | 16:30 |
fungi | i seem to recall they were basically just forks | 16:42 |
fungi | copies | 16:42 |
clarkb | ya so if we prefer graphviz as a system dep over unmaintained python deps we can convert all the things | 16:44 |
clarkb | made it back from that errand with plenty of time to spare | 18:24 |
fungi | awesome | 18:24 |
clarkb | fungi: for the mailman change the updated packages will perform necessary upgrade steps on startup? eg we don't need to manually perform any steps? | 20:02 |
clarkb | a quick skim doesn't show any manual steps so I'm assuming this is the case and I think it was the case for the last upgrade | 20:02 |
fungi | correct, necessary upgrades being database migrations, yes | 20:02 |
clarkb | thanks for confirming | 20:02 |
fungi | there is actually an explicit step in the container startup for it | 20:03 |
fungi | https://opendev.org/opendev/system-config/src/branch/master/docker/mailman/web/docker-entrypoint.sh#L124-L126 | 20:06 |
fungi | clarkb: ^ | 20:06 |
clarkb | ah ok so not necessarily part of mailman itself but the containers are dealing with it | 21:01 |
clarkb | or not automatically part of mailman but the migrate tooling is | 21:01 |
clarkb | fungi: smtp secure mode would be used if we were using a relay/bouncer/proxy (whatever its called with email) ? | 21:26 |
clarkb | but since we're emailing directly we expect the remote smtp servers to accept our connections without auth right? | 21:27 |
clarkb | though maybe this is more about doing smtp over tls? but again we can't assume the remote would support that so we're fine as is? | 21:27 |
corvus | clarkb: "smarthost" is one of the terms you're looking for | 21:34 |
corvus | =relay | 21:34 |
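[For context, a typical smarthost/relay setup in postfix terms; the relay host is hypothetical and this is not what the opendev servers actually run:]

    # /etc/postfix/main.cf -- hand all outbound mail to an authenticated relay
    relayhost = [smtp.relay.example.org]:587
    smtp_tls_security_level = encrypt   # require TLS to the relay
    smtp_sasl_auth_enable = yes         # authenticate to the relay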
noonedeadpunk | hey there! I _think_ there might be something off with one of the gitea backends. Have you seen anything weird lately? | 21:34 |
noonedeadpunk | I wasn't able to catch which exactly is misbehaving, as I just realized it's likely not my internet issue (which I find very unreliable lately) | 21:35 |
clarkb | noonedeadpunk: no I haven't noticed but I last used gitea intensely this morning. Can you be more specific about what happened? | 21:36 |
noonedeadpunk | as also seeing quite some issues reaching https://releases.openstack.org/constraints/upper/939e4eea5738ce51571ca85fd97aa2b02474e92a from CI | 21:36 |
noonedeadpunk | which fails with `Connection failure: The read operation timed out` | 21:36 |
clarkb | that url redirects to https://opendev.org/openstack/requirements/raw/commit/939e4eea5738ce51571ca85fd97aa2b02474e92a/upper-constraints.txt I can reach that url from all 6 gitea backends at the moment | 21:37 |
noonedeadpunk | yeah. right now me too, though during the day I was only able to open gitea after a second or third reload. was throwing a TLS connection issue or smth like that | 21:38 |
noonedeadpunk | sorry, I didn't gather enough details... | 21:38 |
noonedeadpunk | but also - I've spotted some weird scheduling issues in Zuul - like for ` 930377` the job has been queued for an hour, while zuul keeps accepting/spawning new workers | 21:39 |
noonedeadpunk | seems to be the same for 930383 | 21:39 |
clarkb | noonedeadpunk: so, a couple of things: first, zuul scheduling jobs has nothing to do with gitea, so the issues should be entirely decoupled. Second, there are a lot of criteria/inputs that go into scheduling zuul jobs and that isn't necessarily abnormal | 21:41 |
clarkb | for example if the jobs are multinode jobs all nodes must be provided by the same provider and it is harder to pack more nodes into any one provider so there may be a delay while we wait for quota to clear out | 21:42 |
clarkb | this can also happen if there are semaphores or job order/dependency situations | 21:42 |
noonedeadpunk | I mean - that's `openstack-tox-py39` job | 21:42 |
clarkb | we also try to boot nodes three times in each provider before moving on with a timeout for each attempt. If a provider is struggling we could be delaying due to that | 21:42 |
noonedeadpunk | but yeah, I know these are likely different | 21:43 |
noonedeadpunk | fwiw, example of failed job with u-c fetch failure: https://zuul.opendev.org/t/openstack/build/218aa77846244c31857a394f73a94ffc/log/job-output.txt#16487 | 21:45 |
noonedeadpunk | I think I've rechecked around 4-5 patches today due to such an error. | 21:45 |
noonedeadpunk | but yeah, anyway, I will come back once I have something more conclusive rather than FUD :D | 21:47 |
clarkb | so I know I've said this before but it's worth pointing out I guess. Zuul jobs can and do require the requirements project, and then they can refer to the constraints from the zuul caches | 21:47 |
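[Schematically, that pattern looks something like this; the job name is invented and the variable/path are from memory, so verify against the real openstack-zuul-jobs definitions:]

    # .zuul.yaml (sketch)
    - job:
        name: my-tox-job
        parent: tox
        required-projects:
          - openstack/requirements
        vars:
          # point tox at the constraints file in the zuul-prepared repo on disk
          # instead of fetching it from releases.openstack.org / opendev.org
          tox_constraints_file: "{{ ansible_user_dir }}/src/opendev.org/openstack/requirements/upper-constraints.txt"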
noonedeadpunk | oh, yes, we do use cached version except 1 usecase - where we test upgrades | 21:48 |
clarkb | that example would be connecting from rackspace in dallas fort worth texas to vexxhost in san jose california. It could be a problem with the servers (maybe those jobs DOS'd us even) or it could be internet problems | 21:48 |
noonedeadpunk | as then depends-on won't be respected, so it's taken from zuul prepared repo on N but not N-1 | 21:49 |
clarkb | you can always checkout N-1 | 21:49 |
noonedeadpunk | and we actually pull u-c just once per job, and cache it locally after | 21:49 |
clarkb | I think that connection would've been attempted over ipv6 since both rax dfw and opendev.org have ipv6 support | 21:49 |
noonedeadpunk | but then the checked-out position would need to be stored somewhere as well | 21:49 |
noonedeadpunk | and preserved... | 21:50 |
clarkb | but its hard to tell from that log entry as it doesn't indicate the address it failed to connect to only the domain | 21:50 |
clarkb | you can also checkout a sha | 21:50 |
clarkb | or tag or branch name | 21:50 |
noonedeadpunk | well, if there was ara to decode tasks a bit more... | 21:50 |
noonedeadpunk | yeah, but how to check out back to smth where zuul was, together with the patches pulled in by depends-on? | 21:51 |
clarkb | that's failing in your nested ansible; you can include ara if you like (we do it for system-config jobs that run nested ansible) | 21:51 |
clarkb | noonedeadpunk: you just run a task to checkout what you need. Grenade is doing it | 21:51 |
noonedeadpunk | I need to respect depends-on? | 21:51 |
noonedeadpunk | as then - inside CI I already don't know what I need kinda? | 21:52 |
clarkb | grenade respects depends-on by checking out the correct branch name that zuul has prepared. But you don't have to do it that way. It isn't clear to me which way you're trying to say it should be | 21:52 |
noonedeadpunk | or well. I guess I'm just not sure how to do that | 21:52 |
clarkb | but you could do a relative checkout for example and continue to respect depends on | 21:52 |
clarkb | git checkout stable/foo~1 or is it git checkout stable/foo^ ? I'd have to go read the ref docs | 21:53 |
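[Both spellings clarkb is weighing are equivalent for the first parent; a quick illustration with a hypothetical branch name:]

    # stable/foo^ and stable/foo~1 both mean "the first parent of the branch tip"
    git checkout stable/2024.1~1   # one commit behind the zuul-prepared tip
    git checkout stable/2024.1^    # same commit, different spelling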
clarkb | that queued up python39 job has its node request currently assigned to raxflex and at least one previous boot attempt did fail | 21:53 |
noonedeadpunk | yeah, would need to think one more time about that.... | 21:53 |
noonedeadpunk | about ara - well, we do run and collect it, but pretty much nowhere to upload results. | 21:54 |
noonedeadpunk | and HTML consumes too much of Swift storage, as it's just thousands of files with each job | 21:54 |
noonedeadpunk | we disabled it after causing some issue to vexxhost object storage I guess | 21:55 |
jrosser | we were hitting job timeout with the log upload i think, and the ara report was just a huge number of files | 21:55 |
noonedeadpunk | but if you have an ara server to which we can push data.... | 21:55 |
clarkb | we don't have an ara server | 21:58 |
clarkb | you could tar them up and then they would be viewable locally? | 21:58 |
noonedeadpunk | we kinda upload sqlite db, which with some hackery could be used locally... | 21:59 |
noonedeadpunk | But I even forgot how to spawn it locally in the way to consume that sqlite | 21:59 |
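[If memory serves, something like the following brings up a local ara web UI against a downloaded sqlite file; treat the exact variable and command names as an assumption to check against the ara docs:]

    # point the ara API server at the downloaded database (path is illustrative)
    export ARA_DATABASE_NAME=/tmp/ara-report/ansible.sqlite
    # ara-manage wraps django's manage.py when the server extras are installed
    ara-manage runserver   # then browse http://127.0.0.1:8000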
noonedeadpunk | anyway - not that serious issue I guess | 22:00 |
clarkb | corvus: digging into why node request 300-0025281959 is slow and it almost looks like we're waiting an hour before deciding that the launch creation attempt should timeout | 22:00 |
clarkb | corvus: I'll get a paste together momentarily | 22:00 |
noonedeadpunk | node request: 300-0025282092 for "same" job waiting as well | 22:00 |
noonedeadpunk | same project, same job, different patchset | 22:01 |
clarkb | corvus: https://paste.opendev.org/show/b5juJyPa8GFYXoDnJSHr/ | 22:02 |
clarkb | I thought that timeout was much lower like 15 minutes max but we seem to pretty clearly wait an hour there | 22:03 |
clarkb | then appear to have started the next attempt which is not proceeding quickly either | 22:03 |
clarkb | side note I wonder if that would make a good swift feature. upload a tarball and have it expand within swift | 22:05 |
clarkb | the issue is the huge number of operations required to upload many files iirc not the total file count or disk consumption itself | 22:06 |
clarkb | corvus: also the server uuid recorded by nodepool's db for 0038608258 does not match the one that server list against raxflex shows | 22:08 |
clarkb | corvus: I wonder if we're not updating that record when we boot multiple attempts | 22:08 |
clarkb | and whether or not that impacts our ability to check the status of the server over time? | 22:08 |
clarkb | corvus: https://paste.opendev.org/show/bTRJKIzN73b8uwqijVkO/ I've tried to capture those details in this paste | 22:10 |
clarkb | fwiw the server is in a build state on the cloud side so its not like we're ignoring a ready status node | 22:10 |
clarkb | 0038608563 in the same cloud does have matching uuids | 22:12 |
clarkb | and the logs for 0038608563 don't indicate any retries so my best hunch is that data gets mismatched as part of a retry | 22:13 |
clarkb | all of the nodes stuck in a build state are focal nodes | 22:14 |
clarkb | that image is almost 5 days old in raxflex so unlikely to be a newly uploaded image that was either corrupted or a short copy | 22:15 |
clarkb | also unlikely that the cloud would need to be converting it from one format to another at this point (that should be cached if necessary) | 22:15 |
clarkb | so ya focal nodes are in a raxflex purgatory at the moment taking ~3 attempts * 1 hour per attempt to error out and go to the next cloud | 22:16 |
clarkb | it isn't clear why we're waiting up to an hour for that (I thought the timeout was much lower) and it isn't clear why focal seems to be the only image currently affected. And as an aside it looks like nodepool may not update uuids in its db records after reattempts? | 22:16 |
clarkb | I need to finish this review I was doing before I lose all context but then I'll look at the timeout thing first I guess as that is most likely the most straightforward thing and also likely to have a quick impact if we can change it | 22:17 |
clarkb | fungi: I have posted a review on the mm3 upgrade change | 22:20 |
corvus | clarkb: noonedeadpunk the origin remote points to the previous state: https://zuul-ci.org/docs/zuul/latest/job-content.html#git-repositories | 22:22 |
clarkb | zuul's default launch timeout is indeed one hour. I'll get a change up to shorten it | 22:22 |
clarkb | s/zuul/nodepool | 22:22 |
corvus | clarkb: noonedeadpunk so "git diff origin/master..master" in a zuul-prepared repo means "tell me the commits between (the queue item ahead or the current master branch) and this queue item" | 22:23 |
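[In other words, inside a zuul-prepared repo something like this works without losing the depends-on commits; master is just an example branch:]

    # the origin/* refs point at the previous (pre-change) state of each branch
    git log --oneline origin/master..master   # commits zuul merged for this queue item
    git checkout origin/master                # detach at the state before this change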
corvus | clarkb: i think we would update the id if it succeeded, but i think i see the issue where it won't update it if it fails on the second attempt | 22:27 |
opendevreview | Clark Boylan proposed openstack/project-config master: Set launch-timeout on nodepool providers https://review.opendev.org/c/openstack/project-config/+/930388 | 22:28 |
clarkb | corvus: ok cool I wasn't imagining that | 22:28 |
clarkb | infra-root ^ 930388 should mitigate slowness in assigning nodes to jobs until we figure out why that is happening | 22:29 |
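[What 930388 amounts to, schematically; the provider name and chosen value here are placeholders rather than the actual patch:]

    # nodepool.yaml (sketch): cap how long a single server boot attempt may take
    providers:
      - name: raxflex-sjc3          # hypothetical provider name
        driver: openstack
        launch-timeout: 600         # seconds; nodepool's default is 3600 (one hour)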
clarkb | and honestly that might be a bit of a "see if it gets better from the cloud" situation, and if it's not better in 24 hours ask them about it | 22:29 |
corvus | yeah... worth noting that at one point... likely when we only had one provider... the default was a reasonable timeout value | 22:30 |
clarkb | to followup on my last summary we now know why we're waiting an hour for each launch attempt 930388 will address that. corvus thinks he sees a bug where we don't record new uuids for subsequent attempts properly, and we still don't know why focal nodes appear to be the only affected nodes in raxflex | 22:32 |
clarkb | but I think we can see if that resolves itself if we mitigate with 930388 and if it persists we escalate to the cloud to see if they have any logs on their end pointing at the slowness (server show doesn't have any detail along those lines just says the server is in a build state) | 22:33 |
corvus | yep. also i'm working on a bugfix locally | 22:34 |
corvus | (for the minor issue of not recording the right uuid; fixing that won't help the actual problem) | 22:34 |
clarkb | noonedeadpunk: going back to the upper constraints gitea serving thing: I think there are two pieces of information that would be helpful in further debugging. First is the backend you are hitting (that info is available in the ssl cert altnames list) and second is whether or not you're connecting via ipv4 or ipv6 to the load balancer (so that we can try and do some reproduction with | 22:35 |
clarkb | our servers on different networks with different routes to see if the internet is maybe sad) | 22:35 |
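[One way to gather both pieces of information; the hostnames are the real service, the rest is just standard openssl/grep:]

    # the serving backend (e.g. gitea09.opendev.org) appears in the cert's SAN list;
    # add -4 or -6 to s_client to test a specific address family
    echo | openssl s_client -connect opendev.org:443 -servername opendev.org 2>/dev/null \
      | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'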
clarkb | corvus: could that affect checks on whether or not the server has gone ready? or do we have a handle for that from openstacksdk that we keep around and don't need the db for? | 22:35 |
clarkb | in this case I manually checked the server isn't ready so agree it wouldn't fix this particular issue | 22:35 |
corvus | clarkb: i believe the in-memory external_id inside the state machine is correct; it just doesn't update zk | 22:36 |
corvus | so if any attempt succeeded, it should still complete and it would then actually update the uuid too | 22:36 |
clarkb | thanks | 22:37 |
corvus | only time we should see this is while the second and later attempts are still (really) failing | 22:37 |
clarkb | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1 looking at these graphs there are two events that stand out to me. Starting at about 2200 gitea09 transfers a lot of data (I wonder if that coincides with hound updating its data). This doesn't correlate to the log noonedeadpunk shared | 22:42 |
clarkb | the other is that the concurrent sessions graph has a big dip starting about 8 minutes after the log error noonedeadpunk pointed out | 22:42 |
clarkb | so maybe we're logging delayed evidence of whatever it was? | 22:42 |
clarkb | the connection rate doesn't go down though which is odd | 22:44 |
clarkb | or at least not in the same degree | 22:44 |
clarkb | and the frontend error rate is 0 the entire time | 22:45 |
clarkb | gitea14 has a small blip of errors but I wouldn't expect a single backend to cause a drop of ~70% in frontend activity | 22:45 |
clarkb | the lack of frontend and backend errors almost makes me wonder if we simply lost our connectivity to the world | 22:46 |
clarkb | if anyone is wondering PBR gets ~26.5 million downloads from pypi each month | 23:03 |
clarkb | I wonder how much of that is zuul ... | 23:03 |
corvus | like zuul the project or jobs that happen to run in opendev's zuul? | 23:04 |
clarkb | corvus: jobs that happen to run in opendev's zuul | 23:04 |
clarkb | to put that in comparison the second most popular package (urllib3) gets about that many downloads every day | 23:04 |
clarkb | and the top package (boto3) gets double that every day | 23:05 |
corvus | last i looked, i think our build database had that many builds in it, and that's many many many years of builds | 23:05 |
clarkb | corvus: neat so probably not all opendev ci then :) | 23:05 |
clarkb | hatch does 3.25 million per month and setuptools-scm does 37.8 million per month | 23:12 |
clarkb | and poetry is 44 million | 23:12 |
timburke | i wonder how much of it is things like https://opendev.org/openstack/python-openstackclient/src/branch/master/requirements.txt#L5 where we have what should be a build-time dep listed as a runtime dep | 23:17 |
timburke | well, "should be" -- i suppose it *is* imported and used, so the runtime dep makes sense enough. but IMHO it's being used poorly; importlib.metadata would be a better fit for most use-cases and provided by stdlib | 23:19 |
clarkb | timburke: ya I'm sure a good chunk of it is various openstack things depending on it for various reasons. But I was trying to make sense of whether or not our CI system is the vast majority of that | 23:21 |
clarkb | we cache pypi packages but no longer truly mirror them (due to the explosive growth of the size of pypi) | 23:21 |
clarkb | so in theory we're reducing the total number of downloads by some large portion | 23:21 |
clarkb | timburke: and ya pbr's VersionInfo object is just doing a lookup with importlib.metadata or pkg_resources depending on availability; you could do the same sort of thing directly | 23:23 |
clarkb | PBR's version objects also do some pep440 stuff and can be compared. Probably rare for projects to use that though and they just want to find and report their version | 23:25 |
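[The stdlib/packaging equivalents of those two pbr uses, for comparison; the package name is only an example:]

    from importlib.metadata import version   # Python 3.8+ stdlib
    from packaging.version import Version    # pep440-aware comparisons

    v = version("python-openstackclient")    # same string pbr's VersionInfo looks up
    print(Version(v) >= Version("6.0.0"))    # comparable, roughly what pbr's version objects offer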