Tuesday, 2024-09-24

fungiclarkb: see the beginning of the "Creating a Volume" section at https://opendev.org/opendev/system-config/raw/branch/master/doc/source/afs.rst and compare to the rendered version at https://docs.opendev.org/opendev/system-config/latest/afs.html#creating-a-volume 00:00
clarkbthanks00:11
clarkbI've got something working except that dot doesn't order nodes and it keeps putting the fourth column in the third position00:12
clarkband everything I try to do to correct that results in the graph going wild00:12
corvusthere's ordering based on "rank";  https://graphviz.org/docs/attrs/constraint/ may help00:18
clarkbya I've been playing with constraint=false to try and understand how it impacts the behavior and it seems to be a noop here00:27
clarkbI'm sure it isn't but it doesn't seem to meaningfully change the graph output00:27
clarkboh got it. The important thing seemed to be having invisible edges so that order of edge priority is preserved?00:30
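
For anyone following along, the trick clarkb is describing can be sketched in plain dot roughly like this; the node names and messages are illustrative, not the actual base-jobs source:

    digraph ordering_example {
        rankdir=TB;
        // Put the "column" nodes on one rank and pin their left-to-right
        // order with an invisible edge chain.
        { rank=same; builder; executor; registry; worker; }
        builder -> executor -> registry -> worker [style=invis];
        // Real message edges get constraint=false so they no longer
        // influence the ordering computed from the chain above.
        builder -> registry [label="push image", constraint=false];
        registry -> worker [label="pull image", constraint=false];
    }
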
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Update ensure-ghc to use ghcup  https://review.opendev.org/c/zuul/zuul-jobs/+/847111 00:31
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Update ensure-ghc to use ghcup  https://review.opendev.org/c/zuul/zuul-jobs/+/847111 00:40
opendevreviewClark Boylan proposed opendev/base-jobs master: Replace blockdiag and seqdiag with graphviz in docs  https://review.opendev.org/c/opendev/base-jobs/+/930082 01:17
clarkbcorvus: I think ^ that works for the most part. The only thing that is missing is the little warning box from the second sequence diagram01:18
clarkbJayF: cid ^ fyi thats my hacked up graphviz replacement for sequence diagrams adapted from a stack overflow example01:18
clarkbfor the warning box I'm thinking rather than do that in the graph we could do it in text below the graph? Or just drop it entirely? Open to feedback and ideas on that01:19
clarkblooks like xlabel may do what we want too?01:22
clarkbgetting that to render nicely is not straightforward; the docs just say "somewhere near" and in my example it puts it in an awkward spot01:25
clarkboh wait maybe I managed it01:26
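
xlabel places an "external" label near an edge or node without letting it affect the layout; a minimal made-up example:

    digraph g {
        a -> b [xlabel="warning: image may be stale"];
    }
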
opendevreviewClark Boylan proposed opendev/base-jobs master: Replace blockdiag and seqdiag with graphviz in docs  https://review.opendev.org/c/opendev/base-jobs/+/930082 01:28
opendevreviewClark Boylan proposed opendev/base-jobs master: Replace blockdiag and seqdiag with graphviz in docs  https://review.opendev.org/c/opendev/base-jobs/+/930082 01:30
clarkbhttps://2dc96497c2ccf70a128b-4f9d8814b5757ae609c9c9a4c385ce18.ssl.cf1.rackcdn.com/930082/4/check/opendev-tox-docs/2f734e6/docs/docker-image.html 01:34
clarkbhrm I'm noticing that things aren't quite vertical in the first diagram and it seems to be worse in the zuul rendered png compared to locally01:38
clarkbit's fairly minor though so maybe we don't care (also that may be a side effect of adding the _5 nodes to try and make the solid vs dashed lines correct)01:38
clarkbya it seems less pronounced if I remove those extra nodes. But again I think it's probably fine as is?01:41
JayFeh, I mean, good enough02:16
JayFit works as a diagram02:16
JayFif someone wants to hold a ruler up to their screen, patches accepted? :D 02:16
opendevreviewTony Breeds proposed opendev/zone-opendev.org master: Remove stray whitespace  https://review.opendev.org/c/opendev/zone-opendev.org/+/926690 03:33
opendevreviewMerged opendev/zone-opendev.org master: Remove stray whitespace  https://review.opendev.org/c/opendev/zone-opendev.org/+/926690 05:58
opendevreviewJonathan Rosser proposed opendev/system-config master: reprepro: mirror Ubuntu UCA Dalmation for Ubuntu Noble  https://review.opendev.org/c/opendev/system-config/+/930294 12:11
opendevreviewStephen Finucane proposed openstack/project-config master: Remove CI jobs from trio2o  https://review.opendev.org/c/openstack/project-config/+/930302 12:37
opendevreviewJonathan Rosser proposed opendev/system-config master: reprepro: mirror Ubuntu UCA Dalmation for Ubuntu Noble  https://review.opendev.org/c/opendev/system-config/+/930294 12:39
opendevreviewJonathan Rosser proposed opendev/system-config master: reprepro: mirror Ubuntu UCA Dalmatian for Ubuntu Noble  https://review.opendev.org/c/opendev/system-config/+/930294 12:47
opendevreviewStephen Finucane proposed openstack/project-config master: Remove jobs for openinfra/groups  https://review.opendev.org/c/openstack/project-config/+/930305 12:50
opendevreviewStephen Finucane proposed openstack/project-config master: Remove jobs for x/kingbird  https://review.opendev.org/c/openstack/project-config/+/930306 12:52
opendevreviewStephen Finucane proposed openstack/project-config master: Remove jobs for x/omni  https://review.opendev.org/c/openstack/project-config/+/930307 12:53
opendevreviewStephen Finucane proposed openstack/project-config master: Remove jobs for dead projects  https://review.opendev.org/c/openstack/project-config/+/930305 13:06
opendevreviewStephen Finucane proposed openstack/project-config master: Remove references to legacy-sandbox-tag job  https://review.opendev.org/c/openstack/project-config/+/930319 13:26
opendevreviewTobias Rydberg proposed opendev/irc-meetings master: Change to odd weeks for irc meetings for publiccloud-sig.  https://review.opendev.org/c/opendev/irc-meetings/+/930334 13:56
opendevreviewJames E. Blair proposed opendev/base-jobs master: Fine-tune graphviz sequence diagrams  https://review.opendev.org/c/opendev/base-jobs/+/930358 16:18
corvusclarkb: ^ that tweaks a few things; i think the framework you came up with is great and is easy to understand!  :)16:19
clarkboh interesting using an xlabel you were able to avoid the edge line intersecting with the description text?16:24
clarkbfor the change around line 10916:24
clarkband makes sense that being better about horizontal space would make straighter vertical lines possible. Thanks for the update16:25
clarkbcorvus: and are those the same graphs in zuul-jobs / zuul etc? Should make porting easy if we decide to do that16:28
corvusyes, i think they're identical16:30
fungii seem to recall they were basically just forks16:42
fungicopies16:42
clarkbya so if we prefer graphviz as a system dep over unmaintained python deps we can convert all the things16:44
clarkbmade it back from that errand with plenty of time to spare18:24
fungiawesome18:24
clarkbfungi: for the mailman change the updated packages will perform necessary upgrade steps on startup? eg we don't need to manually perform any steps?20:02
clarkba quick skim doesn't show any manual steps so I'm assuming this is the case and I think it was the case for the last upgrade20:02
fungicorrect, necessary upgrades being database migrations, yes20:02
clarkbthanks for confirming20:02
fungithere is actually an explicit step in the container startup for it20:03
fungihttps://opendev.org/opendev/system-config/src/branch/master/docker/mailman/web/docker-entrypoint.sh#L124-L126 20:06
fungiclarkb: ^20:06
clarkbah ok so not necessarily part of mailman itself but the containers are dealing with it21:01
clarkbor not automatically part of mailman but the migrate tooling is21:01
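
For reference, the entrypoint step fungi links above boils down to running Django's database migrations before the web service starts; paraphrasing from memory rather than quoting the exact lines:

    # in docker-entrypoint.sh, before uwsgi starts
    python3 manage.py migrate --noinput
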
clarkbfungi: smtp secure mode would be used if we were using a relay/bouncer/proxy (whatever it's called with email) ?21:26
clarkbbut since we're emailing directly we expect the remote smtp servers to accept our connections without auth right?21:27
clarkbthough maybe this is more about doing smtp over tls? but again we can't assume the remote would support that so we're fine as is?21:27
corvusclarkb: "smarthost" is one of the terms you're looking for21:34
corvus=relay21:34
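
If we ever did want to relay through a smarthost, the exim side is a router roughly like this (host name purely illustrative; our servers deliver directly instead):

    # exim: route all non-local mail via a smarthost
    smarthost:
      driver = manualroute
      domains = ! +local_domains
      transport = remote_smtp
      route_list = * smtp.example.org
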
noonedeadpunkhey there! I _think_ there might be something off with one of the gitea backends. Have you seen anything weird lately?21:34
noonedeadpunkI wasn't able to catch which one exactly is misbehaving, as I just realized it's likely not my internet (which I've found very unreliable lately)21:35
clarkbnoonedeadpunk: no I haven't noticed but I last used gitea intensely this morning. Can you be more specific about what happened?21:36
noonedeadpunkas I'm also seeing quite a few issues reaching https://releases.openstack.org/constraints/upper/939e4eea5738ce51571ca85fd97aa2b02474e92a from CI21:36
noonedeadpunkwhich fails with `Connection failure: The read operation timed out`21:36
clarkbthat url redirects to https://opendev.org/openstack/requirements/raw/commit/939e4eea5738ce51571ca85fd97aa2b02474e92a/upper-constraints.txt I can reach that url from all 6 gitea backends at the moment21:37
noonedeadpunkyeah. right now me too, though during the day I was only able to open gitea on the second or third reload. It was throwing a TLS connection error or something like that21:38
noonedeadpunksorry, I didn't gather enough details...21:38
noonedeadpunkbut also - I've spotted some weird scheduling issues in Zuul - like for `930377` a job has been queued for an hour, while zuul keeps accepting/spawning new workers21:39
noonedeadpunkseems to be the same for 930383 21:39
clarkbnoonedeadpunk: so a couple of things: the first is that zuul scheduling jobs has nothing to do with gitea, so the issues should be entirely decoupled. Second, there are a lot of criteria/inputs that go into scheduling zuul jobs and that isn't necessarily abnormal21:41
clarkbfor example if the jobs are multinode jobs all nodes must be provided by the same provider and it is harder to pack more nodes into any one provider so there may be a delay while we wait for quota to clear out21:42
clarkbthis can also happen if there are semaphores or job order/dependency situations21:42
noonedeadpunkI mean - that's `openstack-tox-py39` job21:42
clarkbwe also try to boot nodes three times in each provider before moving on with a timeout for each attempt. If a provider is struggling we could be delaying due to that21:42
noonedeadpunkbut yeah, I know these are likely different21:43
noonedeadpunkfwiw, example of failed job with u-c fetch failure: https://zuul.opendev.org/t/openstack/build/218aa77846244c31857a394f73a94ffc/log/job-output.txt#16487 21:45
noonedeadpunkI think I've rechecked around 4-5 patches today due to such an error.21:45
noonedeadpunkbut yeah, anyway, I will come back once I have something more conclusive rather than FUD :D21:47
clarkbso I know I've said this before but it's worth pointing out I guess. Zuul jobs can and do require the requirements project; then they can refer to the constraints from the zuul caches21:47
noonedeadpunkoh, yes, we do use the cached version except for one use case - where we test upgrades21:48
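
A hedged sketch of clarkb's suggestion in job form (the job name and variable are made up; the src path is the standard layout for zuul required-projects):

    - job:
        name: my-upgrade-job
        required-projects:
          - openstack/requirements
        vars:
          # read constraints from the zuul-prepared repo instead of
          # fetching them from releases.openstack.org at run time
          upper_constraints_file: "{{ ansible_user_dir }}/src/opendev.org/openstack/requirements/upper-constraints.txt"
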
clarkbthat example would be connecting from rackspace in dallas fort worth texas to vexxhost in san jose california. It could be a problem with the servers (maybe those jobs DOS'd us even) or it could be internet problems21:48
noonedeadpunkas then depends-on won't be respected, so it's taken from zuul prepared repo on N but not N-121:49
clarkbyou can always checkout N-121:49
noonedeadpunkand we actually pull u-c just once per job, and cache it locally after21:49
clarkbI think that connection would've been attempted over ipv6 since both rax dfw and opendev.org have ipv6 support21:49
noonedeadpunkbut then the checked-out position would need to be stored somewhere as well21:49
noonedeadpunkand preserved...21:50
clarkbbut its hard to tell from that log entry as it doesn't indicate the address it failed to connect to only the domain21:50
clarkbyou can also checkout a sha21:50
clarkbor tag or branch name21:50
noonedeadpunkwell, if there was ara to decode tasks a bit more...21:50
noonedeadpunkyeah, but how do you check out back to the state zuul prepared, together with the patches pulled in by depends-on?21:51
clarkbthat's failing in your nested ansible; you can include ara if you like (we do it for system-config jobs that run nested ansible)21:51
clarkbnoonedeadpunk: you just run a task to checkout what you need. Grenade is doing it21:51
noonedeadpunkI need to respect depends-on?21:51
noonedeadpunkas then - inside CI I already don't know what I need kinda?21:52
clarkbgrenade respects depends on by checking out the correct branch name that zuul has prepared. But you don't have to do it that way. It isn't clear to me which way you're trying to say it should be21:52
noonedeadpunkor well. I guess I'm just not sure how to do that21:52
clarkbbut you could do a relative checkout for example and continue to respect depends on21:52
clarkbgit checkout stable/foo~1 or is it git checkout stable/foo^ ? I'd have to go read the ref docs21:53
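
For the record both spellings work and are equivalent one step back: ^ means "first parent" and ~1 means "first ancestor", so they only differ further back (e.g. ^2 picks a merge's second parent):

    git checkout stable/foo^     # first parent of the branch tip
    git checkout stable/foo~1    # same commit as above
    git checkout stable/foo~2    # two commits back along first parents
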
clarkbthat queued up python39 job has its node request currently assigned to raxflex and at least one previous boot attempt did fail21:53
noonedeadpunkyeah, would need to think one more time about that....21:53
noonedeadpunkabout ara - well, we do run and collect it, but pretty much nowhere to upload results.21:54
noonedeadpunkand HTML consumes too much of Swift storage, as it's just thousands of files with each job21:54
noonedeadpunkwe disabled it after causing some issue to vexxhost object storage I guess21:55
jrosserwe were hitting job timeout with the log upload i think, and the ara report was just a huge number of files21:55
noonedeadpunkbut if you have an ara server to which we can push data....21:55
clarkbwe don't have an ara server21:58
clarkbyou could tar them up and then they would be viewable locally?21:58
noonedeadpunkwe kinda upload sqlite db, which with some hackery could be used locally...21:59
noonedeadpunkBut I've even forgotten how to spin it up locally so it consumes that sqlite21:59
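
From memory of the ara 1.x server docs (so treat the exact variable name as an assumption), pointing a local instance at a downloaded sqlite file is roughly:

    pip install "ara[server]"
    export ARA_DATABASE_NAME=/path/to/downloaded/ansible.sqlite
    ara-manage runserver    # then browse http://127.0.0.1:8000
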
noonedeadpunkanyway - not that serious issue I guess22:00
clarkbcorvus: digging into why node request 300-0025281959 is slow and it almost looks like we're waiting an hour before deciding that the launch creation attempt should timeout22:00
clarkbcorvus: I'll get a paste together momentarily22:00
noonedeadpunknode request: 300-0025282092 for "same" job waiting as well22:00
noonedeadpunksame project, same job, different patchset22:01
clarkbcorvus: https://paste.opendev.org/show/b5juJyPa8GFYXoDnJSHr/ 22:02
clarkbI thought that timeout was much lower like 15 minutes max but we seem to pretty clearly wait an hour there22:03
clarkbthen appear to have started the next attempt which is not proceeding quickly either22:03
clarkbside note I wonder if that would make a good swift feature. upload a tarball and have it expand within swift22:05
clarkbthe issue is the huge number of operations required to upload many files iirc not the total file count or disk consumption itself22:06
clarkbcorvus: also the server uuid recorded by nodepool's db for 0038608258 does not match the one that server list against raxflex shows22:08
clarkbcorvus: I wonder if we're not updating that record when we boot multiple attempts22:08
clarkband whether or not that impacts our ability to check the status of the server over time?22:08
clarkbcorvus: https://paste.opendev.org/show/bTRJKIzN73b8uwqijVkO/ I've tried to capture those details in this paste22:10
clarkbfwiw the server is in a build state on the cloud side so it's not like we're ignoring a ready status node22:10
clarkb0038608563 in the same cloud does have matching uuids22:12
clarkband the logs for 0038608563 don't indicate any retries so my best hunch is that data gets mismatched as part of a retry22:13
clarkball of the nodes stuck in a build state are focal nodes22:14
clarkbthat image is almost 5 days old in raxflex so unlikely to be a newly uploaded image that was either corrupted or a short copy22:15
clarkbalso unlikely that the cloud would need to be converting it from one format to another at this point (that should be cached if necessary)22:15
clarkbso ya focal nodes are in a raxflex purgatory at the moment taking ~3 attempts * 1 hour per attempt to error out and go to the next cloud22:16
clarkbit isn't clear why we're waiting up to an hour for that (I thought the timeout was much lower) and it isn't clear why focal seems to be the only image currently affected. And as an aside it looks like nodepool may not update uuids in its db records after reattempts?22:16
clarkbI need to finish this review I was doing before I lose all context but then I'll look at the timeout thing first I guess as that is most likely the most straightforward thing and also likely to have a quick impact if we can change it22:17
clarkbfungi: I have posted a review on the mm3 upgrade change22:20
corvusclarkb: noonedeadpunk the origin remote points to the previous state: https://zuul-ci.org/docs/zuul/latest/job-content.html#git-repositories 22:22
clarkbzuul's default launch timeout is indeed one hour. I'll get a change up to shorten it22:22
clarkbs/zuul/nodepool22:22
corvusclarkb: noonedeadpunk so "git diff origin/master .. master" in a zuul-prepared repo means "tell me the commits between (the queue item ahead or the current master branch) to this queue item"22:23
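
Put differently, zuul leaves origin/* at the previous state (the branch tip, or the queue item ahead) while the checked-out branch carries this item's changes, so the usual range syntax shows what the item adds:

    git log --oneline origin/master..master   # commits this queue item adds
    git diff origin/master..master            # the combined diff of those commits
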
corvusclarkb: i think we would update the id if it succeeded, but i think i see the issue where it won't update it if it fails on the second attempt22:27
opendevreviewClark Boylan proposed openstack/project-config master: Set launch-timeout on nodepool providers  https://review.opendev.org/c/openstack/project-config/+/930388 22:28
clarkbcorvus: ok cool I wasn't imagining that22:28
clarkbinfra-root ^ 930388 should mitigate slowness in assigning nodes to jobs until we figure out why that is happening22:29
clarkband honestly that might be a bit of a "see if it gets better on the cloud side" situation, and if it's not better in 24 hours ask them about it22:29
corvusyeah... worth noting that at one point... likely when we only had one provider... the default was a reasonable timeout value22:30
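
For context, launch-timeout is a per-provider option in nodepool's openstack driver and currently defaults to 3600 seconds, which matches the hour-long waits above; the provider name and value here are illustrative, not the actual 930388 content:

    providers:
      - name: raxflex-example
        driver: openstack
        # fail a server create after 10 minutes instead of the default
        # 3600 seconds, so retries and other clouds get tried sooner
        launch-timeout: 600
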
clarkbto followup on my last summary we now know why we're waiting an hour for each launch attempt 930388 will address that. corvus thinks he sees a bug where we don't record new uuids for subsequent attempts properly, and we still don't know why focal nodes appear to be the only affected nodes in raxflex22:32
clarkbbut I think we can see if that resolves itself if we mitigate with 930388 and if it persists we escalate to the cloud to see if they have any logs on their end pointing at the slowness (server show doesn't have any detail along those lines just says the server is in a build state)22:33
corvusyep.  also i'm working on a bugfix locally22:34
corvus(for the minor issue of not recording the right uuid; fixing that won't help the actual problem)22:34
clarkbnoonedeadpunk: going back to the upper constraints gitea serving thing: I think there are two pieces of information that would be helpful in further debugging: first is the backend you are hitting (that info is available in the ssl cert altnames list) and second is whether or not you're connecting via ipv4 or ipv6 to the load balancer (so that we can try and do some reproduction with22:35
clarkbour servers on different networks with different routes to see if the internet is maybe sad)22:35
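
A quick way to collect both data points from the client side (standard openssl/curl invocations; per the note above, the backend's hostname shows up in the served certificate's altnames):

    # which backend: inspect the subjectAltName list on the cert you were served
    openssl s_client -connect opendev.org:443 -servername opendev.org </dev/null 2>/dev/null \
      | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
    # v4 vs v6: force each address family and note which one misbehaves
    curl -4 -so /dev/null -w '%{remote_ip}\n' https://opendev.org/
    curl -6 -so /dev/null -w '%{remote_ip}\n' https://opendev.org/
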
clarkbcorvus: could that affect checks on whether or not the server has gone ready? or do we have a handle for that from openstacksdk that we keep around and don't need the db for?22:35
clarkbin this case I manually checked that the server isn't ready so I agree it wouldn't fix this particular issue22:35
corvusclarkb: i believe the in-memory external_id inside the state machine is correct; it just doesn't update zk22:36
corvusso if any attempt succeeded, it should still complete and it would then actually update the uuid too22:36
clarkbthanks22:37
corvusonly time we should see this is while the second and later attempts are still (really) failing22:37
clarkbhttps://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1 looking at these graphs there are two events that stand out to me. Starting at about 2200 gitea09 transfers a lot of data (I wonder if that coincides with hound updating its data). This doesn't correlate to the log noonedeadpunk shared22:42
clarkbthe other is the concurrent sessions graphs has a big dip starting about 8 minutes after the log error noonedeadpunk pointed out22:42
clarkbso maybe we're logging delayed evidence of whatever it was?22:42
clarkbthe connection rate doesn't go down though which is odd22:44
clarkbor at least not in the same degree22:44
clarkband the frontend error rate is 0 the entire time22:45
clarkbgitea14 has a small blip of errors but I wouldn't expect a single backend to cause a drop of ~70% in frontend activity22:45
clarkbthe lack of frontend and backend errors almost makes me wonder if we simply lost our connectivity to the world22:46
clarkbif anyone is wondering PBR gets ~26.5 million downloads from pypi each month23:03
clarkbI wonder how much of that is zuul ...23:03
corvuslike zuul the project or jobs that happen to run in opendev's zuul?23:04
clarkbcorvus: jobs that happen to run in opendev's zuul23:04
clarkbto put that in comparison the second most popular package (urllib3) gets about that many downloads every day23:04
clarkband the top package (boto3) gets double that every day23:05
corvuslast i looked, i think our build database had that many builds in it, and that's many many many years of builds23:05
clarkbcorvus: neat so probably not all opendev ci then :)23:05
clarkbhatch does 3.25 million per month and setuptools-scm does 37.8 million per month23:12
clarkband poetry is 44 million23:12
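
Numbers like these can be pulled from the public PyPI download stats, for example with the pypistats CLI:

    pip install pypistats
    pypistats recent pbr     # last day / week / month download counts
    pypistats overall pbr    # split by with/without mirror traffic
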
timburkei wonder how much of it is things like https://opendev.org/openstack/python-openstackclient/src/branch/master/requirements.txt#L5 where we have what should be a build-time dep listed as a runtime dep23:17
timburkewell, "should be" -- i suppose it *is* imported and used, so the runtime dep makes sense enough. but IMHO it's being used poorly; importlib.metadata would be a better fit for most use-cases and provided by stdlib23:19
clarkbtimburke: ya I'm sure a good chunk of it is various openstack things depending on it for various reasons. But I was trying to make sense of whether or not our CI system is the vast majority of that23:21
clarkbwe cache pypi packages but no longer truly mirror them (due to the explosive growth of the size of pypi)23:21
clarkbso in theory we're reducing the total number of downloads by some large portion23:21
clarkbtimburke: and ya pbr's VersionInfo object is just doing a lookup with importlib.metadata or pkg_resources depending on availability you could do the same sort of thing directly23:23
clarkbPBR's version objects also do some pep440 stuff and can be compared. Probably rare for projects to use that though; they just want to find and report their version23:25
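
A small illustration of the equivalence timburke and clarkb describe (the package name is just an example):

    # Python: both print the installed distribution's version string;
    # the first needs pbr at runtime, the second is stdlib (3.8+).
    from importlib.metadata import version
    print(version("python-openstackclient"))

    from pbr.version import VersionInfo
    print(VersionInfo("python-openstackclient").version_string())
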
