fungi | clarkb: see the beginning of the "Creating a Volume" section at https://opendev.org/opendev/system-config/raw/branch/master/doc/source/afs.rst and compare to the rendered version at https://docs.opendev.org/opendev/system-config/latest/afs.html#creating-a-volume | 00:00 |
clarkb | thanks | 00:11 |
clarkb | I've got something working except that dot doesn't order nodes and it keeps putting the fourth column in the third position | 00:12 |
clarkb | and everything I try to do to correct that results in the graph going wild | 00:12 |
corvus | there's ordering based on "rank"; https://graphviz.org/docs/attrs/constraint/ may help | 00:18 |
clarkb | ya I've been playing with constraint=false to try and understand how it impacts the behavior and it seems to be a noop here | 00:27 |
clarkb | I'm sure it isn't but it doesn't seem to meaningfully change the graph output | 00:27 |
clarkb | oh got it. The important thing seemed to be having invisible edges so that order of edge priority is preserved? | 00:30 |
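[A minimal sketch of the invisible-edge ordering trick being described; the node and label names are made up, not taken from the actual change:]

    digraph seq {
        // pin the column-header nodes to one row, in declaration order
        { rank=same; scheduler; executor; merger; }
        // invisible edges between neighbouring columns fix the left-to-right order
        scheduler -> executor -> merger [style=invis];
        // real message edges between lifeline nodes go below as usual
        scheduler_1 -> executor_1 [label="start job"];
    }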
opendevreview | Tristan Cacqueray proposed zuul/zuul-jobs master: Update ensure-ghc to use ghcup https://review.opendev.org/c/zuul/zuul-jobs/+/847111 | 00:31 |
opendevreview | Tristan Cacqueray proposed zuul/zuul-jobs master: Update ensure-ghc to use ghcup https://review.opendev.org/c/zuul/zuul-jobs/+/847111 | 00:40 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Replace blockdiag and seqdiag with graphviz in docs https://review.opendev.org/c/opendev/base-jobs/+/930082 | 01:17 |
clarkb | corvus: I think ^ that works for the most part. The only thing that is missing is the little warning box from the second sequence diagram | 01:18 |
clarkb | JayF: cid ^ fyi thats my hacked up graphviz replacement for sequence diagrams adapted from a stack overflow example | 01:18 |
clarkb | for the warning box I'm thinking rather than do that in the graph we could do it in text below the graph? Or just drop it entirely? Open to feedback and ideas on that | 01:19 |
clarkb | looks like xlabel may do what we want too? | 01:22 |
clarkb | getting that to render nicely is not straightforward; the docs just say "somewhere near" and in my example it puts it in an awkward spot | 01:25 |
clarkb | oh wait maybe I managed it | 01:26 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Replace blockdiag and seqdiag with graphviz in docs https://review.opendev.org/c/opendev/base-jobs/+/930082 | 01:28 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Replace blockdiag and seqdiag with graphviz in docs https://review.opendev.org/c/opendev/base-jobs/+/930082 | 01:30 |
clarkb | https://2dc96497c2ccf70a128b-4f9d8814b5757ae609c9c9a4c385ce18.ssl.cf1.rackcdn.com/930082/4/check/opendev-tox-docs/2f734e6/docs/docker-image.html | 01:34 |
clarkb | hrm I'm noticing that things aren't quite vertical in the first diagram and it seems to be worse in the zuul rendered png compared to locally | 01:38 |
clarkb | it's fairly minor though so maybe we don't care (also that may be a side effect of adding the _5 nodes to try and make the solid vs dashed lines correct) | 01:38 |
clarkb | ya seems less pronounced if I remove those extra nodes. But again I think its probably fine as is? | 01:41 |
JayF | eh, I mean, good enough | 02:16 |
JayF | it works as a diagram | 02:16 |
JayF | if someone wants to hold a ruler up to their screen, patches accepted? :D | 02:16 |
opendevreview | Tony Breeds proposed opendev/zone-opendev.org master: Remove stray whitespace https://review.opendev.org/c/opendev/zone-opendev.org/+/926690 | 03:33 |
opendevreview | Merged opendev/zone-opendev.org master: Remove stray whitespace https://review.opendev.org/c/opendev/zone-opendev.org/+/926690 | 05:58 |
opendevreview | Jonathan Rosser proposed opendev/system-config master: reprepro: mirror Ubuntu UCA Dalmation for Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/930294 | 12:11 |
opendevreview | Stephen Finucane proposed openstack/project-config master: Remove CI jobs from trio2o https://review.opendev.org/c/openstack/project-config/+/930302 | 12:37 |
opendevreview | Jonathan Rosser proposed opendev/system-config master: reprepro: mirror Ubuntu UCA Dalmation for Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/930294 | 12:39 |
opendevreview | Jonathan Rosser proposed opendev/system-config master: reprepro: mirror Ubuntu UCA Dalmatian for Ubuntu Noble https://review.opendev.org/c/opendev/system-config/+/930294 | 12:47 |
opendevreview | Stephen Finucane proposed openstack/project-config master: Remove jobs for openinfra/groups https://review.opendev.org/c/openstack/project-config/+/930305 | 12:50 |
opendevreview | Stephen Finucane proposed openstack/project-config master: Remove jobs for x/kingbird https://review.opendev.org/c/openstack/project-config/+/930306 | 12:52 |
opendevreview | Stephen Finucane proposed openstack/project-config master: Remove jobs for x/omni https://review.opendev.org/c/openstack/project-config/+/930307 | 12:53 |
opendevreview | Stephen Finucane proposed openstack/project-config master: Remove jobs for dead projects https://review.opendev.org/c/openstack/project-config/+/930305 | 13:06 |
opendevreview | Stephen Finucane proposed openstack/project-config master: Remove references to legacy-sandbox-tag job https://review.opendev.org/c/openstack/project-config/+/930319 | 13:26 |
opendevreview | Tobias Rydberg proposed opendev/irc-meetings master: Change to odd weeks for irc meetings for publiccloud-sig. https://review.opendev.org/c/opendev/irc-meetings/+/930334 | 13:56 |
opendevreview | James E. Blair proposed opendev/base-jobs master: Fine-tune graphviz sequence diagrams https://review.opendev.org/c/opendev/base-jobs/+/930358 | 16:18 |
corvus | clarkb: ^ that tweaks a few things; i think the framework you came up with is great and is easy to understand! :) | 16:19 |
clarkb | oh interesting using an xlabel you were able to avoid the edge line intersecting with the description text? | 16:24 |
clarkb | for the change around line 109 | 16:24 |
clarkb | and makes sense that being better about horizontal space would make straighter vertical lines possible. Thanks for the update | 16:25 |
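[Roughly how xlabel keeps the text off the line; the edge and label here are invented for illustration:]

    digraph seq {
        // xlabel places the text beside the edge instead of centered on top of it,
        // so the arrow no longer runs through the description
        executor_2 -> merger_2 [xlabel="fetch refs"];
    }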
clarkb | corvus: and are those the same graphs in zuul-jobs / zuul etc? Should make porting easy if we decide to do that | 16:28 |
corvus | yes, i think they're identical | 16:30 |
fungi | i seem to recall they were basically just forks | 16:42 |
fungi | copies | 16:42 |
clarkb | ya so if we prefer graphviz as a system dep over unmaintained python deps we can convert all the things | 16:44 |
clarkb | made it back from that errand with plenty of time to spare | 18:24 |
fungi | awesome | 18:24 |
clarkb | fungi: for the mailman change the updated packages will perform necessary upgrade steps on startup? eg we don't need to manually perform any steps? | 20:02 |
clarkb | a quick skim doesn't show any manual steps so I'm assuming this is the case and I think it was the case for the last upgrade | 20:02 |
fungi | correct, necessary upgrades being database migrations, yes | 20:02 |
clarkb | thanks for confirming | 20:02 |
fungi | there is actually an explicit step in the container startup for it | 20:03 |
fungi | https://opendev.org/opendev/system-config/src/branch/master/docker/mailman/web/docker-entrypoint.sh#L124-L126 | 20:06 |
fungi | clarkb: ^ | 20:06 |
clarkb | ah ok so not necessarily part of mailman itself but the containers are dealing with it | 21:01 |
clarkb | or not automatically part of mailman but the migrate tooling is | 21:01 |
clarkb | fungi: smtp secure mode would be used if we were using a relay/bouncer/proxy (whatever its called with email) ? | 21:26 |
clarkb | but since we're emailing directly we expect the remote smtp servers to accept our connections without auth right? | 21:27 |
clarkb | though maybe this is more about doing smtp over tls? but again we can't assume the remote would support that so we're fine as is? | 21:27 |
corvus | clarkb: "smarthost" is one of the terms you're looking for | 21:34 |
corvus | =relay | 21:34 |
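[For context, a typical smarthost/relay setup in postfix terms; the relay host is hypothetical and this is not what the opendev servers actually run:]

    # /etc/postfix/main.cf -- hand all outbound mail to an authenticated relay
    relayhost = [smtp.relay.example.org]:587
    smtp_tls_security_level = encrypt   # require TLS to the relay
    smtp_sasl_auth_enable = yes         # authenticate to the relay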
noonedeadpunk | hey there! I _think_ there might be something off with one of the gitea backends. Have you seen anything weird lately? | 21:34 |
noonedeadpunk | I wasn't able to catch which exactly is misbehaving, as I just realized it's likely not my internet issue (which I find very unreliable lately) | 21:35 |
clarkb | noonedeadpunk: no I haven't noticed but I last used gitea intensely this morning. Can you be more specific about what happened? | 21:36 |
noonedeadpunk | as also seeing quite some issues reaching https://releases.openstack.org/constraints/upper/939e4eea5738ce51571ca85fd97aa2b02474e92a from CI | 21:36 |
noonedeadpunk | which fails with `Connection failure: The read operation timed out` | 21:36 |
clarkb | that url redirects to https://opendev.org/openstack/requirements/raw/commit/939e4eea5738ce51571ca85fd97aa2b02474e92a/upper-constraints.txt I can reach that url from all 6 gitea backends at the moment | 21:37 |
noonedeadpunk | yeah. right now me too, though during the day I was only able to open gitea after a second or third reload. was throwing a TLS connection issue or smth like that | 21:38 |
noonedeadpunk | sorry, I didn't gather enough details... | 21:38 |
noonedeadpunk | but also - I've spotted some weird scheduling issues in Zuul - like for ` 930377` the job has been queued for an hour, while zuul keeps accepting/spawning new workers | 21:39 |
noonedeadpunk | seems to be the same for 930383 | 21:39 |
clarkb | noonedeadpunk: so, a couple of things: first, zuul scheduling jobs has nothing to do with gitea, so the issues should be entirely decoupled. Second, there are a lot of criteria/inputs that go into scheduling zuul jobs and that isn't necessarily abnormal | 21:41 |
clarkb | for example if the jobs are multinode jobs all nodes must be provided by the same provider and it is harder to pack more nodes into any one provider so there may be a delay while we wait for quota to clear out | 21:42 |
clarkb | this can also happen if there are semaphores or job order/dependency situations | 21:42 |
noonedeadpunk | I mean - that's `openstack-tox-py39` job | 21:42 |
clarkb | we also try to boot nodes three times in each provider before moving on with a timeout for each attempt. If a provider is struggling we could be delaying due to that | 21:42 |
noonedeadpunk | but yeah, I know these are likely different | 21:43 |
noonedeadpunk | fwiw, example of failed job with u-c fetch failure: https://zuul.opendev.org/t/openstack/build/218aa77846244c31857a394f73a94ffc/log/job-output.txt#16487 | 21:45 |
noonedeadpunk | I think I've rechecked around 4-5 patches today due to such an error. | 21:45 |
noonedeadpunk | but yeah, anyway, I will come back once I have something more conclusive rather than FUD :D | 21:47 |
clarkb | so I know I've said this before but it's worth pointing out I guess. Zuul jobs can and do require the requirements project, and then they can refer to the constraints from the zuul caches | 21:47 |
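[Schematically, that pattern looks something like this; the job name is invented and the variable/path are from memory, so verify against the real openstack-zuul-jobs definitions:]

    # .zuul.yaml (sketch)
    - job:
        name: my-tox-job
        parent: tox
        required-projects:
          - openstack/requirements
        vars:
          # point tox at the constraints file in the zuul-prepared repo on disk
          # instead of fetching it from releases.openstack.org / opendev.org
          tox_constraints_file: "{{ ansible_user_dir }}/src/opendev.org/openstack/requirements/upper-constraints.txt"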
noonedeadpunk | oh, yes, we do use cached version except 1 usecase - where we test upgrades | 21:48 |
clarkb | that example would be connecting from rackspace in dallas fort worth texas to vexxhost in san jose california. It could be a problem with the servers (maybe those jobs DOS'd us even) or it could be internet problems | 21:48 |
noonedeadpunk | as then depends-on won't be respected, so it's taken from zuul prepared repo on N but not N-1 | 21:49 |
clarkb | you can always checkout N-1 | 21:49 |
noonedeadpunk | and we actually pull u-c just once per job, and cache it locally after | 21:49 |
clarkb | I think that connection would've been attempted over ipv6 since both rax dfw and opendev.org have ipv6 support | 21:49 |
noonedeadpunk | but then the checked-out position would need to be stored somewhere as well | 21:49 |
noonedeadpunk | and preserved... | 21:50 |
clarkb | but its hard to tell from that log entry as it doesn't indicate the address it failed to connect to only the domain | 21:50 |
clarkb | you can also checkout a sha | 21:50 |
clarkb | or tag or branch name | 21:50 |
noonedeadpunk | well, if there was ara to decode tasks a bit more... | 21:50 |
noonedeadpunk | yeah, but how to check out back to smth where zuul was, together with the patches pulled in by depends-on? | 21:51 |
clarkb | that's failing in your nested ansible; you can include ara if you like (we do it for system-config jobs that run nested ansible) | 21:51 |
clarkb | noonedeadpunk: you just run a task to checkout what you need. Grenade is doing it | 21:51 |
noonedeadpunk | I need to respect depends-on? | 21:51 |
noonedeadpunk | as then - inside CI I already don't know what I need kinda? | 21:52 |
clarkb | grenade respects depends-on by checking out the correct branch name that zuul has prepared. But you don't have to do it that way. It isn't clear to me which way you're trying to say it should be | 21:52 |
noonedeadpunk | or well. I guess I'm just not sure how to do that | 21:52 |
clarkb | but you could do a relative checkout for example and continue to respect depends on | 21:52 |
clarkb | git checkout stable/foo~1 or is it git checkout stable/foo^ ? I'd have to go read the ref docs | 21:53 |
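[Both spellings clarkb is weighing are equivalent for the first parent; a quick illustration with a hypothetical branch name:]

    # stable/foo^ and stable/foo~1 both mean "the first parent of the branch tip"
    git checkout stable/2024.1~1   # one commit behind the zuul-prepared tip
    git checkout stable/2024.1^    # same commit, different spelling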
clarkb | that queued up python39 job has its node request currently assigned to raxflex and at least one previous boot attempt did fail | 21:53 |
noonedeadpunk | yeah, would need to think one more time about that.... | 21:53 |
noonedeadpunk | about ara - well, we do run and collect it, but pretty much nowhere to upload results. | 21:54 |
noonedeadpunk | and HTML consumes too much of Swift storage, as it's just thousands of files with each job | 21:54 |
noonedeadpunk | we disabled it after causing some issue to vexxhost object storage I guess | 21:55 |
jrosser | we were hitting job timeout with the log upload i think, and the ara report was just a huge number of files | 21:55 |
noonedeadpunk | but if you have an ara server to which we can push data.... | 21:55 |
clarkb | we don't have an ara server | 21:58 |
clarkb | you could tar them up and then they would be viewable locally? | 21:58 |
noonedeadpunk | we kinda upload sqlite db, which with some hackery could be used locally... | 21:59 |
noonedeadpunk | But I even forgot how to spawn it locally in the way to consume that sqlite | 21:59 |
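[If memory serves, something like the following brings up a local ara web UI against a downloaded sqlite file; treat the exact variable and command names as an assumption to check against the ara docs:]

    # point the ara API server at the downloaded database (path is illustrative)
    export ARA_DATABASE_NAME=/tmp/ara-report/ansible.sqlite
    # ara-manage wraps django's manage.py when the server extras are installed
    ara-manage runserver   # then browse http://127.0.0.1:8000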
noonedeadpunk | anyway - not that serious issue I guess | 22:00 |
clarkb | corvus: digging into why node request 300-0025281959 is slow and it almost looks like we're waiting an hour before deciding that the launch creation attempt should timeout | 22:00 |
clarkb | corvus: I'll get a paste together momentarily | 22:00 |
noonedeadpunk | node request: 300-0025282092 for "same" job waiting as well | 22:00 |
noonedeadpunk | same project, same job, different patchset | 22:01 |
clarkb | corvus: https://paste.opendev.org/show/b5juJyPa8GFYXoDnJSHr/ | 22:02 |
clarkb | I thought that timeout was much lower like 15 minutes max but we seem to pretty clearly wait an hour there | 22:03 |
clarkb | then appear to have started the next attempt which is not proceeding quickly either | 22:03 |
clarkb | side note I wonder if that would make a good swift feature. upload a tarball and have it expand within swift | 22:05 |
clarkb | the issue is the huge number of operations required to upload many files iirc not the total file count or disk consumption itself | 22:06 |
clarkb | corvus: also the server uuid recorded by nodepool's db for 0038608258 does not match the one that server list against raxflex shows | 22:08 |
clarkb | corvus: I wonder if we're not updating that record when we boot multiple attempts | 22:08 |
clarkb | and whether or not that impacts our ability to check the status of the server over time? | 22:08 |
clarkb | corvus: https://paste.opendev.org/show/bTRJKIzN73b8uwqijVkO/ I've tried to capture those details in this paste | 22:10 |
clarkb | fwiw the server is in a build state on the cloud side so its not like we're ignoring a ready status node | 22:10 |
clarkb | 0038608563 in the same cloud does have matching uuids | 22:12 |
clarkb | and the logs for 0038608563 don't indicate any retries so my best hunch is that data gets mismatched as part of a retry | 22:13 |
clarkb | all of the nodes stuck in a build state are focal nodes | 22:14 |
clarkb | that image is almost 5 days old in raxflex so unlikely to be a newly uploaded image that was either corrupted or a short copy | 22:15 |
clarkb | also unlikely that the cloud would need to be converting it from one format to another at this point (that should be cached if necessary) | 22:15 |
clarkb | so ya focal nodes are in a raxflex purgatory at the moment taking ~3 attempts * 1 hour per attempt to error out and go to the next cloud | 22:16 |
clarkb | it isn't clear why we're waiting up to an hour for that (I thought the timeout was much lower) and it isn't clear why focal seems to be the only image currently affected. And as an aside it looks like nodepool may not update uuids in its db records after reattempts? | 22:16 |
clarkb | I need to finish this review I was doing before I lose all context but then I'll look at the timeout thing first I guess as that is most likely the most straightforward thing and also likely to have a quick impact if we can change it | 22:17 |
clarkb | fungi: I have posted a review on the mm3 upgrade change | 22:20 |
corvus | clarkb: noonedeadpunk the origin remote points to the previous state: https://zuul-ci.org/docs/zuul/latest/job-content.html#git-repositories | 22:22 |
clarkb | zuul's default launch timeout is indeed one hour. I'll get a change up to shorten it | 22:22 |
clarkb | s/zuul/nodepool | 22:22 |
corvus | clarkb: noonedeadpunk so "git diff origin/master..master" in a zuul-prepared repo means "tell me the commits between (the queue item ahead or the current master branch) and this queue item" | 22:23 |
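[In other words, inside a zuul-prepared repo something like this works without losing the depends-on commits; master is just an example branch:]

    # the origin/* refs point at the previous (pre-change) state of each branch
    git log --oneline origin/master..master   # commits zuul merged for this queue item
    git checkout origin/master                # detach at the state before this change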
corvus | clarkb: i think we would update the id if it succeeded, but i think i see the issue where it won't update it if it fails on the second attempt | 22:27 |
opendevreview | Clark Boylan proposed openstack/project-config master: Set launch-timeout on nodepool providers https://review.opendev.org/c/openstack/project-config/+/930388 | 22:28 |
clarkb | corvus: ok cool I wasn't imagining that | 22:28 |
clarkb | infra-root ^ 930388 should mitigate slowness in assigning nodes to jobs until we figure out why that is happening | 22:29 |
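[What 930388 amounts to, schematically; the provider name and chosen value here are placeholders rather than the actual patch:]

    # nodepool.yaml (sketch): cap how long a single server boot attempt may take
    providers:
      - name: raxflex-sjc3          # hypothetical provider name
        driver: openstack
        launch-timeout: 600         # seconds; nodepool's default is 3600 (one hour)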
clarkb | and honestly that might be a bit of a "see if it gets better from the cloud" situation, and if it's not better in 24 hours ask them about it | 22:29 |
corvus | yeah... worth noting that at one point... likely when we only had one provider... the default was a reasonable timeout value | 22:30 |
clarkb | to followup on my last summary we now know why we're waiting an hour for each launch attempt 930388 will address that. corvus thinks he sees a bug where we don't record new uuids for subsequent attempts properly, and we still don't know why focal nodes appear to be the only affected nodes in raxflex | 22:32 |
clarkb | but I think we can see if that resolves itself if we mitigate with 930388 and if it persists we escalate to the cloud to see if they have any logs on their end pointing at the slowness (server show doesn't have any detail along those lines just says the server is in a build state) | 22:33 |
corvus | yep. also i'm working on a bugfix locally | 22:34 |
corvus | (for the minor issue of not recording the right uuid; fixing that won't help the actual problem) | 22:34 |
clarkb | noonedeadpunk: going back to the upper constraints gitea serving thing: I think there are two pieces of information that would be helpful in further debugging. First is the backend you are hitting (that info is available in the ssl cert altnames list) and second is whether or not you're connecting via ipv4 or ipv6 to the load balancer (so that we can try and do some reproduction with | 22:35 |
clarkb | our servers on different networks with different routes to see if the internet is maybe sad) | 22:35 |
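[One way to gather both pieces of information; the hostnames are the real service, the rest is just standard openssl/grep:]

    # the serving backend (e.g. gitea09.opendev.org) appears in the cert's SAN list;
    # add -4 or -6 to s_client to test a specific address family
    echo | openssl s_client -connect opendev.org:443 -servername opendev.org 2>/dev/null \
      | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'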
clarkb | corvus: could that affect checks on whether or not the server has gone ready? or do we have a handle for that from openstacksdk that we keep around and don't need the db for? | 22:35 |
clarkb | in this case I manually checked the server isn't ready so agree it wouldn't fix this particular issue | 22:35 |
corvus | clarkb: i believe the in-memory external_id inside the state machine is correct; it just doesn't update zk | 22:36 |
corvus | so if any attempt succeeded, it should still complete and it would then actually update the uuid too | 22:36 |
clarkb | thanks | 22:37 |
corvus | only time we should see this is while the second and later attempts are still (really) failing | 22:37 |
clarkb | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1 looking at these graphs there are two events that stand out to me. Starting at about 2200 gitea09 transfers a lot of data (I wonder if that coincides with hound updating its data). This doesn't correlate to the log noonedeadpunk shared | 22:42 |
clarkb | the other is that the concurrent sessions graph has a big dip starting about 8 minutes after the log error noonedeadpunk pointed out | 22:42 |
clarkb | so maybe we're logging delayed evidence of whatever it was? | 22:42 |
clarkb | the connection rate doesn't go down though which is odd | 22:44 |
clarkb | or at least not in the same degree | 22:44 |
clarkb | and the frontend error rate is 0 the entire time | 22:45 |
clarkb | gitea14 has a small blip of errors but I wouldn't expect a single backend to cause a drop of ~70% in frontend activity | 22:45 |
clarkb | the lack of frontend and backend errors almost makes me wonder if we simply lost our connectivity to the world | 22:46 |
clarkb | if anyone is wondering PBR gets ~26.5 million downloads from pypi each month | 23:03 |
clarkb | I wonder how much of that is zuul ... | 23:03 |
corvus | like zuul the project or jobs that happen to run in opendev's zuul? | 23:04 |
clarkb | corvus: jobs that happen to run in opendev's zuul | 23:04 |
clarkb | to put that in comparison the second most popular package (urllib3) gets about that many downloads every day | 23:04 |
clarkb | and the top package (boto3) gets double that every day | 23:05 |
corvus | last i looked, i think our build database had that many builds in it, and that's many many many years of builds | 23:05 |
clarkb | corvus: neat so probably not all opendev ci then :) | 23:05 |
clarkb | hatch does 3.25 million per month and setuptools-scm does 37.8 million per month | 23:12 |
clarkb | and poetry is 44 million | 23:12 |
timburke | i wonder how much of it is things like https://opendev.org/openstack/python-openstackclient/src/branch/master/requirements.txt#L5 where we have what should be a build-time dep listed as a runtime dep | 23:17 |
timburke | well, "should be" -- i suppose it *is* imported and used, so the runtime dep makes sense enough. but IMHO it's being used poorly; importlib.metadata would be a better fit for most use-cases and provided by stdlib | 23:19 |
clarkb | timburke: ya I'm sure a good chunk of it is various openstack things depending on it for various reasons. But I was trying to make sense of whether or not our CI system is the vast majority of that | 23:21 |
clarkb | we cache pypi packages but no longer truly mirror them (due to the explosive growth of the size of pypi) | 23:21 |
clarkb | so in theory we're reducing the total number of downloads by some large portion | 23:21 |
clarkb | timburke: and ya pbr's VersionInfo object is just doing a lookup with importlib.metadata or pkg_resources depending on availability; you could do the same sort of thing directly | 23:23 |
clarkb | PBR's version objects also do some pep440 stuff and can be compared. Probably rare for projects to use that though and they just want to find and report their version | 23:25 |
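[The stdlib/packaging equivalents of those two pbr uses, for comparison; the package name is only an example:]

    from importlib.metadata import version   # Python 3.8+ stdlib
    from packaging.version import Version    # pep440-aware comparisons

    v = version("python-openstackclient")    # same string pbr's VersionInfo looks up
    print(Version(v) >= Version("6.0.0"))    # comparable, roughly what pbr's version objects offer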