*** rlandy is now known as rlandy|out | 00:50 | |
*** icey_ is now known as icey | 06:59 | |
*** jpena|off is now known as jpena | 07:37 | |
*** ysandeep is now known as ysandeep|lunch | 08:28 | |
*** ysandeep|lunch is now known as ysandeep | 09:00 | |
*** rlandy|out is now known as rlandy | 10:27 | |
*** whoami-rajat__ is now known as whoami-rajat | 12:01 | |
noonedeadpunk | hey there! We noticed a change in zuul behaviour regarding queues, which invalidates check results and kind of wastes CI resources for us | 13:07 |
noonedeadpunk | https://review.opendev.org/c/openstack/openstack-ansible/+/836378 as example | 13:07 |
noonedeadpunk | I tried to check for options and how to set up queues, but I'm not sure I understand the consequences. | 13:07 |
noonedeadpunk | Doesn't having the same change queue for all repos (projects) we manage mean that all patches, even if they don't have a Depends-On, would not run in parallel since they are queued one after another? | 13:09 |
fungi | noonedeadpunk: yes, shared queues in dependent pipelines take all changes ahead of the current change into account when testing. they're still tested in parallel, but their testing has to be reset if there's a failure for a change ahead of the current change so it can be removed from the checkout | 13:24 |
fungi | that's not a change in behavior, it's how zuul has basically always worked for the past 10 years since it was first conceived | 13:24 |
fungi | you use shared queues when your projects have fairly tightly-coupled interrelationships, such that you're concerned a change in one project could break functionality in another | 13:25 |
jrosser | this is the thing i mentioned yesterday, where if someone +2+W before a depends-on has merged, zuul drops a -2 | 13:26 |
jrosser | this seems to be new behaviour in the last ~week | 13:26 |
fungi | okay, the verified -2 for a depends-on when approved out of order is new behavior, yes, and is currently being discussed by the zuul maintainers | 13:27 |
jrosser | oh ok i'd missed that - where would I keep across that? | 13:27 |
fungi | i don't know what keep across means | 13:27 |
fungi | if you're asking where to find the zuul maintainers, they're in the #zuul:opendev.org matrix channel | 13:28 |
jrosser | oh sorry - i didn't know it was discussed beyond me mentioning it in #opendev yesterday | 13:28 |
fungi | pretty sure it got brought up in the zuul matrix channel, but i'm still catching up on discussions this morning | 13:28 |
fungi | just to be clear, if two changes in projects which don't share a change queue have a depends-on relationship and the depending change is approved before the dependency merges, then the depending change gets a -2 verified vote. previously, zuul simply ignored the approval on the depending change, requiring you to reapprove it, correct? | 13:30 |
noonedeadpunk | fungi: so basically we'd need to define a queue and then reference its name for each project, am I right? | 13:31 |
noonedeadpunk | yep, it is correct | 13:31 |
fungi | noonedeadpunk: yes, that's how you indicate a particular project belongs to a named queue, you can look at the integrated or tripleo queues for examples | 13:31 |
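(For reference, defining a named queue and attaching a project to it is only a few lines of zuul configuration. A minimal sketch with a made-up queue name and placeholder job, not the actual OSA or tripleo files:)

    - queue:
        name: openstack-ansible

    - project:
        queue: openstack-ansible
        check:
          jobs:
            - osa-example-job    # placeholder job name
        gate:
          jobs:
            - osa-example-job    # placeholder job name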
noonedeadpunk | for some short period, zuul was even re-visiting a previously ignored +2 and starting gate jobs without the need for a manual push | 13:32 |
noonedeadpunk | but it was for quite a short time - a month or so | 13:32 |
fungi | and here's where the resource waste comes in: because the openstack tenant is configured to require a verified +1 from zuul before a change can be enqueued into the gate pipeline, another pass through check is required to clear the resulting verified -2 | 13:32 |
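(The "verified +1 before gate" rule fungi mentions lives in the gate pipeline definition. A rough sketch from memory, not the exact openstack project-config contents:)

    - pipeline:
        name: gate
        manager: dependent
        require:
          gerrit:
            open: True
            current-patchset: True
            approval:
              # "clean check": only changes zuul has already voted Verified +1/+2 on may enter the gate
              - username: zuul
                Verified: [1, 2]
              - Workflow: 1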
noonedeadpunk | yep | 13:34 |
noonedeadpunk | I see tripleo uses a queue for gate only, but not for check | 13:35 |
jrosser | that's it - and the out-of-order approval is actually very handy for us with limited reviewers | 13:35 |
*** dasm|off is now known as dasm | 13:59 | |
*** dasm is now known as dasm|ruck | 14:00 | |
clarkb | noonedeadpunk: queue is a project level setting not a pipeline level setting | 14:55 |
clarkb | put another way, tripleo is setting their queue for all pipelines, not just gate. It is just that gate, being a dependent pipeline, has the most visible impact of that | 14:55 |
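(In other words, the older syntax attached the queue name to a pipeline section inside the project stanza, while current zuul treats it as a project-level attribute that applies to all pipelines. A rough sketch:)

    # older, per-pipeline placement (what the tripleo config shows)
    - project:
        gate:
          queue: tripleo
          jobs:
            - example-gate-job    # placeholder job name

    # current, project-level placement
    - project:
        queue: tripleo
        gate:
          jobs:
            - example-gate-job    # placeholder job name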
*** ysandeep is now known as ysandeep|out | 16:03 | |
*** jpena is now known as jpena|off | 16:31 | |
clarkb | noonedeadpunk: jrosser: to follow up on this from the zuul matrix room, I think it would be good to set up queues appropriately and then reevaluate if zuul is still causing problems. It sounds like we would need a new zuul reporter type to report status messages back without a negative vote, and I'm not sure everyone is convinced yet that that is correct | 16:48 |
noonedeadpunk | well, right now it's kind of a choice between wasting CI resources by setting +W in the wrong order vs wasting CI resources by invalidating all changes in the queue in case of a single accidental gate failure? | 16:50 |
noonedeadpunk | as in our use case setting up queues will lead to more issues, I guess... | 16:51 |
clarkb | well, the second thing isn't a waste, that's how gating is expected to work and it prevents landing conflicting changes simultaneously | 16:51 |
clarkb | that was zuul's first feature | 16:51 |
noonedeadpunk | but it works this way even for changes that don't explicitly depend on each other? It's enough to just be in the same queue? | 16:52 |
clarkb | basically what zuul is saying is you are currently subverting zuul's expectations so the result you get is less than ideal. The ask is that we not subvert zuul and try it the way zuul is meant to be used and see if it is still a problem | 16:52 |
clarkb | noonedeadpunk: when you have projects that share a queue then they enter shared queues with dependent pipeline managers. In our case that is the gate queue | 16:53 |
clarkb | this means they are tested together to avoid two conflicting changes in different projects from landing at the same time | 16:53 |
clarkb | this also enables upgrade testing and other neat functionality with testing things together without needing to explicitly depends-on everything | 16:54 |
jrosser | does that work outside mono-repos though? | 16:55 |
noonedeadpunk | Hm, I think we might indeed have an issue here in terms of how we test things... We have used integrated testing from the beginning | 16:55 |
clarkb | not sure I understand. We don't host any mono repos as far as I know | 16:55 |
jrosser | where a new feature for us might be one patch to 5 repos then a pretty empty patch to openstack-ansible which depends-on them all | 16:55 |
jrosser | unless i mis-understand "testing things together without needing to explicitly depends-on everything" | 16:56 |
clarkb | jrosser: that is what the queue setting is for | 16:56 |
noonedeadpunk | but for that they all must be in the gate at the same time... | 16:56 |
clarkb | for example nova, cinder, glance, swift, neutron, etc share a queue | 16:56 |
clarkb | noonedeadpunk: yes that is literally the point it was zuul's reason for existing :) | 16:56 |
noonedeadpunk | clarkb: the thing is that we would also kind of need to share a queue with _all_ the projects we deploy | 16:57 |
clarkb | but this ensures that a change to nova cannot be approved and race a conflicting change to neutron that is approved at the same time | 16:57 |
clarkb | one will be tested before the other and testing will ensure that only one merges | 16:57 |
clarkb | noonedeadpunk: well, that is the question, right? I'm basically saying set up the queue for OSA repos and then let's see if this problem persists? | 16:57 |
clarkb | I don't know your review patterns well enough to be able to predict that, but gathering data should be straightforward | 16:58 |
noonedeadpunk | clarkb: so the concern we have with using queues is that we quite often have failures unrelated to our code | 16:58 |
clarkb | anyway fixing this problem in zuul is not straightforward and requires adding entirely new features. This is why the ask is we use zuul as it is intended to be used and then evaluate if this problem persists | 16:59 |
noonedeadpunk | clarkb: and then a failure of a single patch would invalidate everything that is currently running? | 16:59 |
clarkb | noonedeadpunk: that sounds like something that should be addressed? | 16:59 |
noonedeadpunk | clarkb: how should we address an ansible galaxy outage, for instance? | 16:59 |
clarkb | yes, flaky testing is bad in the gate. Typically we indicate that people should try to remove the flakiness, since flakiness is bad for other reasons too | 16:59 |
noonedeadpunk | sorry need to run away now | 17:00 |
clarkb | noonedeadpunk: one approach (taken by tripleo I think) is to have zuul hook up to the git repos for ansible roles so that zuul caches them. Another could potentially be to proxy cache galaxy (this is what we do for docker images) | 17:00 |
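(A rough sketch of the first approach, assuming the role repos are already known to the zuul tenant; the job and repository names below are placeholders, not tripleo's or OSA's actual configuration:)

    - job:
        name: osa-deploy-example          # placeholder job name
        required-projects:
          # zuul clones and caches these onto the test node, so the job can
          # consume the on-disk checkouts instead of downloading roles from galaxy
          - github.com/example-org/ansible-role-foo
          - opendev.org/example/ansible-role-bar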
clarkb | Optimizing the gate so that test failures are acceptable is also sort of contradictory to zuul's expectations | 17:00 |
clarkb | we do have tools to make external resource access less problematic, and depending on the specific situation we take different approaches | 17:01 |
jrosser | i've previously asked for rabbitmq repos to be mirrored | 17:01 |
jrosser | but anyway | 17:01 |
*** dasm|ruck is now known as dasm|ruck|mtg | 17:01 | |
clarkb | jrosser: and I've said we'd be happy to proxy cache them iirc | 17:01 |
clarkb | but no one has written a change for that as far as I know | 17:02 |
jrosser | the trouble is they break them | 17:02 |
clarkb | oh right, this is the case where the upstream doesn't know how to run a repo at all | 17:02 |
jrosser | invalid apt repos pretty much every time they release new stuff | 17:02 |
jrosser | and then there are galaxy roles which are malformed so can't be installed locally | 17:02 |
clarkb | I think my suggestion for that was to start by working with the upstream to address that | 17:02 |
jrosser | it's not through lack of trying with any of this, really | 17:03 |
clarkb | it isn't difficult to do right if you understand the problem exists. The problem is many people don't realize that deb repos work that way and so don't see it as a problem | 17:03 |
jrosser | we got a ton of pushback on properly versioning some of the core ansible collections | 17:03 |
clarkb | when you say a galaxy role is malformed so it can't be installed locally, how do they work at all? I thought all galaxy did was take a tarball or similar and put it on disk. Not all that different from checking out a git repo? | 17:03 |
jrosser | because whatever $process pushes them into the galaxy backend inserts the relevant metadata | 17:04 |
jrosser | it's otherwise missing in the git repo | 17:04 |
clarkb | I see. I'd be inclined to not use those dependencies myself if they cannot be built from source | 17:04 |
jrosser | and that's hit or miss depending on which collection | 17:04 |
jrosser | ansible.netcommon ansible.utils for a start :( | 17:05 |
clarkb | proxy caching galaxy is likely also doable. I think tripleo went with caching git repos because they were all on github so that was straightforward | 17:05 |
clarkb | I don't know much about the galaxy protocol though and if they subvert caches | 17:06 |
clarkb | docker does this, which makes caching docker hub difficult but still possible with the right options | 17:06 |
jrosser | i think i'm still failing to understand how a single queue can understand the relationship between patches without depends-on | 17:07 |
clarkb | jrosser: it's an implied relationship that builds the queue based on reviewer activity | 17:07 |
clarkb | jrosser: if you approve change A and then change B, they get enqueued in that order | 17:07 |
clarkb | what sharing a queue does is build that queue for changes A and B in the same order when they come from different projects | 17:08 |
clarkb | (within a single project this is always the case due to how git works) | 17:08 |
clarkb | additionally zuul knows that it should check all the related projects (as set by queue) for actionable state if a parent or child is approved | 17:08 |
jrosser | and what happens for check rather than gate? | 17:09 |
jrosser | ^ where there is no approval, i mean | 17:09 |
clarkb | check is basically the same as before since check is pre review. It tests changes merged to their target branch | 17:10 |
jrosser | so i'd still need the depends-on there? | 17:10 |
clarkb | if there is a strict dependency then yes | 17:10 |
jrosser | i think that we have a very high proportion of our changes being like that | 17:10 |
clarkb | depends-on handles strict dependencies. queue: and cogating handles related projects and the implied relationship preventing things from landing in an improper sequence | 17:11 |
clarkb | they solve two different problems and we seem to be conflating them here which isn't very helpful | 17:11 |
jrosser | no indeed | 17:11 |
clarkb | to go back to the openstack integrated gate example: you use depends-on when nova wants to use a new feature in neutron when creating instances. The nova change depends on the neutron change that adds the new feature. They share an integrated gate to ensure that refactoring the "give me a network" api call continues to work with nova's changes to booting an instance | 17:12 |
clarkb | depends on are explicit expressions of relationships. The queue is an indication that changes to related projects may interfere with each other so we check them together | 17:13 |
clarkb | and that is why zuul uses queue to determine the list of related projects which it checks for actions when children or parents have actionable events | 17:14 |
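(The explicit relationship is just a footer in the commit message of the depending change; the change URL below is a placeholder:)

    Depends-On: https://review.opendev.org/c/openstack/neutron/+/<change-number>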
jrosser | the thing i am concerned about is when we need to make completely unrelated changes like this one https://review.opendev.org/c/openstack/openstack-ansible-os_cinder/+/835702 | 17:18 |
jrosser | they need to be wholly unrelated (because they are) and implying anything from the approval order will almost certainly be counterproductive | 17:18 |
fungi | a piece of what makes the recent behavior change "waste" additional resources is the openstack gate pipeline's "clean check" rule, which is considered a bit of an antipattern. other tenants allow enqueuing a change with a negative verified vote (even verified -2) directly into the gate pipeline without first going through check again | 17:19 |
jrosser | as those sorts of changes tend to weed out latent brokenness in some of our less active repos | 17:19 |
clarkb | there are always changes like that; for example, updating docs in nova isn't going to break neutron api use | 17:19 |
clarkb | but we accept that, and generally they don't seem to be a major problem because they are simple changes that test quickly and accurately | 17:19 |
jrosser | that example i gave is anything but quick to run, hence my concern about implying a relationship that doesn't exist | 17:20 |
clarkb | ok, I'm asking that we try it before dismissing it, since that is why zuul exists in the first place and other projects use it successfully | 17:21 |
fungi | most (all?) of the non-openstack tenants in our zuul deployment even conserve resources by cancelling any running check jobs if the change gets approved and enqueued into the gate | 17:21 |
clarkb | otherwise I think the OSA team should bring this up with zuul and maybe help write the new functionality necessary to address this | 17:21 |
clarkb | (I'm willing to help do that myself but only if we've exhausted the "intended usage pattern doesn't function" first) | 17:21 |
fungi | this also might be a reason for openstack to revisit the "clean check" rule for its gate pipeline | 17:22 |
clarkb | fungi: well, they are talking about blind rechecks in the nova room right now and it sounds like they are a major problem? clean check was implemented due to blind rechecks landing broken code | 17:23 |
clarkb | I guess the risk there is that if blind rechecks are super common, people will get more stuff landed without any inspection as to why it failed in the first place. But it may also be worthy of an experiment | 17:23 |
fungi | sort of. it was implemented because of core reviewers approving untested or failing changes | 17:23 |
fungi | but yes, repeatedly rechecking changes to clear failures without bothering to look into those failures is closely related | 17:24 |
clarkb | fungi: well, what happens is you get a +1 in check, a reviewer +A's after the zuul +1, then the developer can recheck as many times in a row as it takes to land the code | 17:24 |
clarkb | but trying it and seeing if we end up root causing a bunch of random changes that were rechecked into oblivion when stuff breaks is doable and would be good data gathering | 17:25 |
clarkb | jrosser: re the rabbitmq thing, does upstream not acknowledge the problem exists, or do they not want help to fix it? In general, if you put package files in place before removing indexes and remove old files after some delay, you end up with a repo that doesn't error. Typically that isn't difficult to achieve. They build new packages, upload packages to the repo, update the index, | 17:33 |
clarkb | sleep $TIME, remove old packages. I guess I'm wondering if the problem is that they think the problem doesn't exist at all, so updating the order of operations is refused? | 17:33 |
clarkb | hrm, looks like they use packagecloud.io? | 17:35 |
clarkb | I wonder if this is a problem with the third party service | 17:35 |
jrosser | we used to get them from the rabbitmq upstream repo | 17:36 |
jrosser | and when they broke i'd tweet them and ~24hours later it would be fixed | 17:36 |
clarkb | Looks like cloudsmith.io also hosts packages. I wonder if one has problems and the other doesn't, or maybe they are using the same repo content and then synced from a broken state | 17:37 |
jrosser | but it got so bad we now get them from cloudsmith instead | 17:37 |
clarkb | jrosser: looking at their release notes for recent releases they seem to only have cloudsmith and packagecloud listed. | 17:37 |
jrosser | i do wonder if they just found maintaining the repo too much trouble | 17:38 |
clarkb | at the very least it seems they don't want people using some other repo for recent releases | 17:39 |
clarkb | https://packagecloud.io/rabbitmq/rabbitmq-server/install they seem to host something that is theoretically proxyable as a repo | 17:41 |
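(If memory serves, packagecloud exposes it as a normal apt source, something along these lines, which is the sort of URL a caching proxy can sit in front of; the release name here is just an example:)

    deb https://packagecloud.io/rabbitmq/rabbitmq-server/ubuntu/ focal main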
jrosser | iirc this stuff was bad as well https://www.erlang-solutions.com/downloads/ | 17:41 |
jrosser | and we get all of it from cloudsmith now | 17:41 |
clarkb | I'm hoping that a tool named "packagecloud" is building proper deb repos | 17:41 |
clarkb | oh, it's erlang, not rabbit? | 17:41 |
jrosser | well you need both | 17:42 |
clarkb | got it | 17:42 |
jrosser | there's a compatibility matrix that relates the two | 17:42 |
clarkb | https://packagecloud.io/rabbitmq/erlang yup they provide both | 17:42 |
clarkb | looks like cloudsmith hosts deb files but not a repo. packagecloud does expose things as a repo with a gpg key you can trust and source.list entries you can add | 17:43 |
clarkb | and the packages on packagecloud go back ~6 years | 17:44 |
clarkb | I think if those packages continue to be unreliable for network access reasons instead of repo consistency problems, then setting up a proxy cache to packagecloud is reasonable. Then anything else hosted there would be able to be retrieved through the same proxy | 17:45 |
clarkb | looks like they also host npm and maven and other stuff too so potentially useful beyond distro packages as well | 17:45 |
jrosser | i think cloudsmith does have a repo, they just make it somehow not browsable https://dl.cloudsmith.io/public/rabbitmq/rabbitmq-erlang/deb/ubuntu | 17:48 |
clarkb | ah | 17:49 |
jrosser | eg `deb https://dl.cloudsmith.io/public/rabbitmq/rabbitmq-erlang/deb/ubuntu bionic main` | 17:49 |
clarkb | ya so could proxy cache either or both. Likely a matter of preference due to reliability more than anything else | 17:50 |
*** dasm|ruck|mtg is now known as dasm|ruck | 17:57 | |
*** rlandy is now known as rlandy|mtg | 18:33 | |
*** rlandy|mtg is now known as rlandy | 19:02 | |
*** rlandy is now known as rlandy|mtg | 20:26 | |
*** dviroel is now known as dviroel|out | 20:36 | |
timburke | is this a good place to mention there seems to be a problem with the fedora-35 mirrors? seeing py310 failures like https://zuul.openstack.org/build/37ab457a35f74e8eaab81af2fea63916/log/job-output.txt#341 | 20:57 |
fungi | timburke: thanks for the heads up, i haven't seen anyone else mention it yet | 21:05 |
fungi | we currently mirror from rsync://pubmirror2.math.uh.edu/fedora-buffet/fedora/linux according to this: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/fedora-mirror-update#L36 | 21:07 |
fungi | the timestamp file at the root of our fedora mirror tree says we last updated at 2022-04-05T21:02:04,946404602+00:00 | 21:08 |
fungi | so only a few minutes ago, i guess | 21:09 |
fungi | though the indexes which returned a 404 in the linked job are still nonexistent on our mirrors | 21:10 |
fungi | strangely, http://pubmirror2.math.uh.edu/fedora-buffet/fedora/linux/updates/35/Everything/x86_64/repodata/ seems to have newer files than what we're serving | 21:12 |
fungi | including the files the job is looking for | 21:12 |
fungi | i'll check the rsync log | 21:12 |
fungi | https://static.opendev.org/mirror/logs/rsync-mirrors/fedora.log | 21:14 |
fungi | looks like there's some massive upheaval for f35 today | 21:14 |
fungi | i don't see any indication that rsync has picked up the missing indices yet | 21:15 |
fungi | also that log ends 2 hours ago, so is for the prior refresh. i bet we don't flush the output from the latest run to the volume before we release it | 21:17 |
fungi | yeah, the latest log is still in /var on the mirror-update.o.o server | 21:18 |
fungi | hard to say, but i think we've caught the uh.edu mirror in the middle of a large fedora 35 update | 21:20 |
fungi | we could try switching to pull from a different mirror which has already stabilized, or try to ride it out a bit longer | 21:22 |
timburke | 👍 thanks for the analysis! i'm content to wait it out -- nothing critical for me | 21:23 |
fungi | if it's still broken in 2-4 hours, then we might want to consider picking a different mirror to pull from | 21:24 |
fungi | unrelated, looks like pypi had a bunch of not-fun earlier: https://status.python.org/incidents/mxgkk3xxr9v7 | 21:24 |
clarkb | that must've caused the issues we observed with package installs | 22:37 |
clarkb | I think this is the first time they've noticed that sort of problem when we did. I guess in this case it was because it was more catastrophic | 22:37 |
*** dasm|ruck is now known as dasm|off | 22:43 | |
opendevreview | Ghanshyam proposed openstack/project-config master: Remove tempest-lib from infra https://review.opendev.org/c/openstack/project-config/+/836703 | 22:45 |
*** rlandy|mtg is now known as rlandy | 23:16 | |
opendevreview | Ghanshyam proposed openstack/project-config master: Retire openstack-health project: end project gating https://review.opendev.org/c/openstack/project-config/+/836707 | 23:43 |
*** rlandy is now known as rlandy|out | 23:55 |