*** rlandy is now known as rlandy|out | 00:50 | |
*** icey_ is now known as icey | 06:59 | |
*** jpena|off is now known as jpena | 07:37 | |
*** ysandeep is now known as ysandeep|lunch | 08:28 | |
*** ysandeep|lunch is now known as ysandeep | 09:00 | |
*** rlandy|out is now known as rlandy | 10:27 | |
*** whoami-rajat__ is now known as whoami-rajat | 12:01 | |
noonedeadpunk | hey there! We noticed a change in zuul behaviour regarding queues, which invalidates check results and kind of wastes CI resources for us | 13:07 |
noonedeadpunk | https://review.opendev.org/c/openstack/openstack-ansible/+/836378 as example | 13:07 |
noonedeadpunk | I tried to check for options and how to set up queues, but I'm not sure I understand the consequences. | 13:07 |
noonedeadpunk | Doesn't having the same change queue for all repos (projects) we manage mean that all patches, even if they don't have a Depends-On, would not run in parallel since they are queued one after another? | 13:09 |
fungi | noonedeadpunk: yes, shared queues in dependent pipelines take all changes ahead of the current change into account when testing. they're still tested in parallel, but their testing has to be reset if there's a failure for a change ahead of the current change so it can be removed from the checkout | 13:24 |
fungi | that's not a change in behavior, it's how zuul has basically always worked for the past 10 years since it was first conceived | 13:24 |
fungi | you use shared queues when your projects have fairly tightly-coupled interrelationships, such that you're concerned a change in one project could break functionality in another | 13:25 |
jrosser | this is the thing i mentioned yesterday, where if someone +2+W before a depends-on has merged, zuul drops a -2 | 13:26 |
jrosser | this seems to be new behaviour in the last ~week | 13:26 |
fungi | okay, the verified -2 for a depends-on when approved out of order is new behavior, yes, and is currently being discussed by the zuul maintainers | 13:27 |
jrosser | oh ok i'd missed that - where would I keep across that? | 13:27 |
fungi | i don't know what keep across means | 13:27 |
fungi | if you're asking where to find the zuul maintainers, they're in the #zuul:opendev.org matrix channel | 13:28 |
jrosser | oh sorry - i didn't know it was discussed beyond me mentioning it in #opendev yesterday | 13:28 |
fungi | pretty sure it got brought up in the zuul matrix channel, but i'm still catching up on discussions this morning | 13:28 |
fungi | just to be clear, if two changes in projects which don't share a change queue have a depends-on relationship and the depending change is approved before the dependency merges, then the depending change gets a -2 verified vote. previously, zuul simply ignored the approval on the depending change, requiring you to reapprove it, correct? | 13:30 |
noonedeadpunk | fungi: so basically we'd need to define a queue and then reference its name for each project, am I right? | 13:31 |
noonedeadpunk | yep, it is correct | 13:31 |
fungi | noonedeadpunk: yes, that's how you indicate a particular project belongs to a named queue, you can look at the integrated or tripleo queues for examples | 13:31 |
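(For reference, defining a named queue and attaching a project to it is only a few lines of zuul configuration. A minimal sketch with a made-up queue name and placeholder job, not the actual OSA or tripleo files:)

    - queue:
        name: openstack-ansible

    - project:
        queue: openstack-ansible
        check:
          jobs:
            - osa-example-job    # placeholder job name
        gate:
          jobs:
            - osa-example-job    # placeholder job name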
noonedeadpunk | for some short period, zuul was even re-visiting a previously ignored +2 and starting gate jobs without the need for a manual push | 13:32 |
noonedeadpunk | but it was for quite a short time - a month or so | 13:32 |
fungi | and here's where the resource waste comes in: because the openstack tenant is configured to require a verified +1 from zuul before a change can be enqueued into the gate pipeline, another pass through check is required to clear the resulting verified -2 | 13:32 |
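(The "verified +1 before gate" rule fungi mentions lives in the gate pipeline definition. A rough sketch from memory, not the exact openstack project-config contents:)

    - pipeline:
        name: gate
        manager: dependent
        require:
          gerrit:
            open: True
            current-patchset: True
            approval:
              # "clean check": only changes zuul has already voted Verified +1/+2 on may enter the gate
              - username: zuul
                Verified: [1, 2]
              - Workflow: 1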
noonedeadpunk | yep | 13:34 |
noonedeadpunk | I see tripleo uses a queue for gate only, but not for check | 13:35 |
jrosser | that's it - and the out-of-order approval is actually very handy for us with limited reviewers | 13:35 |
*** dasm|off is now known as dasm | 13:59 | |
*** dasm is now known as dasm|ruck | 14:00 | |
clarkb | noonedeadpunk: queue is a project level setting not a pipeline level setting | 14:55 |
clarkb | put another way, tripleo is setting their queue for all pipelines, not just gate. It is just that gate, being a dependent pipeline, has the most visible impact of that | 14:55 |
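(In other words, the older syntax attached the queue name to a pipeline section inside the project stanza, while current zuul treats it as a project-level attribute that applies to all pipelines. A rough sketch:)

    # older, per-pipeline placement (what the tripleo config shows)
    - project:
        gate:
          queue: tripleo
          jobs:
            - example-gate-job    # placeholder job name

    # current, project-level placement
    - project:
        queue: tripleo
        gate:
          jobs:
            - example-gate-job    # placeholder job name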
*** ysandeep is now known as ysandeep|out | 16:03 | |
*** jpena is now known as jpena|off | 16:31 | |
clarkb | noonedeadpunk: jrosser: to follow up on this from the zuul matrix room, I think it would be good to set up queues appropriately and then reevaluate if zuul is still causing problems. It sounds like we would need a new zuul reporter type to report status messages back without a negative vote, and I'm not sure everyone is convinced yet that that is correct | 16:48 |
noonedeadpunk | well, right now it's kind of a choice between wasting CI resources by setting +W in the wrong order vs wasting CI resources by invalidating all changes in the queue in case of a single accidental gate failure? | 16:50 |
noonedeadpunk | as in our use case setting up queues will lead to more issues, I guess... | 16:51 |
clarkb | well, the second thing isn't a waste, that's how gating is expected to work and it prevents landing conflicting changes simultaneously | 16:51 |
clarkb | that was zuul's first feature | 16:51 |
noonedeadpunk | but it works this way even for changes that don't explicitly depend on each other? It's enough to just be in the same queue? | 16:52 |
clarkb | basically what zuul is saying is you are currently subverting zuul's expectations so the result you get is less than ideal. The ask is that we not subvert zuul and try it the way zuul is meant to be used and see if it is still a problem | 16:52 |
clarkb | noonedeadpunk: when you have projects that share a queue then they enter shared queues with dependent pipeline managers. In our case that is the gate queue | 16:53 |
clarkb | this means they are tested together to avoid two conflicting changes in different projects from landing at the same time | 16:53 |
clarkb | this also enables upgrade testing and other neat functionality with testing things together without needing to explicitly depends-on everything | 16:54 |
jrosser | does that work outside mono-repos though? | 16:55 |
noonedeadpunk | Hm, I think we might indeed have an issue here in terms of how we test things... We have used integrated testing from the beginning | 16:55 |
clarkb | not sure I understand. We don't host any mono repos as far as I know | 16:55 |
jrosser | where a new feature for us might be one patch to 5 repos then a pretty empty patch to openstack-ansible which depends-on them all | 16:55 |
jrosser | unless i mis-understand "testing things together without needing to explicitly depends-on everything" | 16:56 |
clarkb | jrosser: that is what the queue setting is for | 16:56 |
noonedeadpunk | but for that they all must be in the gate at the same time... | 16:56 |
clarkb | for example nova, cinder, glance, swift, neutron, etc share a queue | 16:56 |
clarkb | noonedeadpunk: yes that is literally the point it was zuul's reason for existing :) | 16:56 |
noonedeadpunk | clarkb: the thing is that we would also kind of need to share a queue with _all_ the projects we deploy | 16:57 |
clarkb | but this ensures that a change to nova cannot be approved and race a conflicting change to neutron that is approved at the same time | 16:57 |
clarkb | one will be tested before the other and testing will ensure that only one merges | 16:57 |
clarkb | noonedeadpunk: well, that is the question, right? I'm basically saying set up the queue for OSA repos and then let's see if this problem persists? | 16:57 |
clarkb | I don't know your review patterns well enough to be able to predict that, but gathering data should be straightforward | 16:58 |
noonedeadpunk | clarkb: so the concern we have with using queues is that we quite often have failures unrelated to our code | 16:58 |
clarkb | anyway fixing this problem in zuul is not straightforward and requires adding entirely new features. This is why the ask is we use zuul as it is intended to be used and then evaluate if this problem persists | 16:59 |
noonedeadpunk | clarkb: and then a failure of a single patch would invalidate everything that is currently running? | 16:59 |
clarkb | noonedeadpunk: that sounds like something that should be addressed? | 16:59 |
noonedeadpunk | clarkb: how should we address an ansible galaxy outage, for instance? | 16:59 |
clarkb | yes, flaky testing is bad in the gate. Typically we indicate that people should try to remove the flakiness, since flakiness is bad for other reasons too | 16:59 |
noonedeadpunk | sorry need to run away now | 17:00 |
clarkb | noonedeadpunk: one approach (taken by tripleo I think) is to have zuul hook up to the git repos for ansible roles so that zuul caches them. Another could potentially be to proxy cache galaxy (this is what we do for docker images) | 17:00 |
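(A rough sketch of the first approach, assuming the role repos are already known to the zuul tenant; the job and repository names below are placeholders, not tripleo's or OSA's actual configuration:)

    - job:
        name: osa-deploy-example          # placeholder job name
        required-projects:
          # zuul clones and caches these onto the test node, so the job can
          # consume the on-disk checkouts instead of downloading roles from galaxy
          - github.com/example-org/ansible-role-foo
          - opendev.org/example/ansible-role-bar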
clarkb | Optimizing the gate so that test failures are acceptable is also sort of contradictory to zuul's expectations | 17:00 |
clarkb | we do have tools to make external resource access less problematic, and depending on the specific situation we take different approaches | 17:01 |
jrosser | i've previously asked for rabbitmq repos to be mirrored | 17:01 |
jrosser | but anyway | 17:01 |
*** dasm|ruck is now known as dasm|ruck|mtg | 17:01 | |
clarkb | jrosser: and I've said we'd be happy to proxy cache them iirc | 17:01 |
clarkb | but no one has written a change for that as far as I know | 17:02 |
jrosser | the trouble is they break them | 17:02 |
clarkb | oh right, this is the case where the upstream doesn't know how to run a repo at all | 17:02 |
jrosser | invalid apt repos pretty much every time they release new stuff | 17:02 |
jrosser | and then there are galaxy roles which are malformed so can't be installed locally | 17:02 |
clarkb | I think my suggestion for that was to start by working with the upstream to address that | 17:02 |
jrosser | it's not through lack of trying with any of this, really | 17:03 |
clarkb | it isn't difficult to do right if you understand the problem exists. The problem is many people don't realize that deb repos work that way and so don't see it as a problem | 17:03 |
jrosser | we got a ton of pushback on properly versioning some of the core ansible collections | 17:03 |
clarkb | when you say a galaxy role is malformed so it can't be installed locally, how do they work at all? I thought all galaxy did was take a tarball or similar and put it on disk. Not all that different from checking out a git repo? | 17:03 |
jrosser | because whatever $process pushes them into the galaxy backend inserts the relevant metadata | 17:04 |
jrosser | it's otherwise missing in the git repo | 17:04 |
clarkb | I see. I'd be inclined to not use those dependencies myself if they cannot be built from source | 17:04 |
jrosser | and that's hit or miss depending on which collection | 17:04 |
jrosser | ansible.netcommon ansible.utils for a start :( | 17:05 |
clarkb | proxy caching galaxy is likely also doable. I think tripleo went with caching git repos because they were all on github so that was straightforward | 17:05 |
clarkb | I don't know much about the galaxy protocol though and if they subvert caches | 17:06 |
clarkb | docker does this, which makes caching docker hub difficult but still possible with the right options | 17:06 |
jrosser | i think i'm still failing to understand how a single queue can understand the relationship between patches without depends-on | 17:07 |
clarkb | jrosser: it's an implied relationship that builds the queue based on reviewer activity | 17:07 |
clarkb | jrosser: if you approve change A and then change B, they get enqueued in that order | 17:07 |
clarkb | what sharing a queue does is build that queue for changes A and B in the same order when they come from different projects | 17:08 |
clarkb | (within a single project this is always the case due to how git works) | 17:08 |
clarkb | additionally zuul knows that it should check all the related projects (as set by queue) for actionable state if a parent or child is approved | 17:08 |
jrosser | and what happens for check rather than gate? | 17:09 |
jrosser | ^ where there is no approval, i mean | 17:09 |
clarkb | check is basically the same as before since check is pre review. It tests changes merged to their target branch | 17:10 |
jrosser | so i'd still need the depends-on there? | 17:10 |
clarkb | if there is a strict dependency then yes | 17:10 |
jrosser | i think that we have a very high proportion of our changes being like that | 17:10 |
clarkb | depends-on handles strict dependencies. queue: and cogating handles related projects and the implied relationship preventing things from landing in an improper sequence | 17:11 |
clarkb | they solve two different problems and we seem to be conflating them here which isn't very helpful | 17:11 |
jrosser | no indeed | 17:11 |
clarkb | to go back to the openstack integrated gate example: you use depends-on when nova wants to use a new feature in neutron when creating instances. The nova change depends on the neutron change that adds the new feature. They share an integrated gate to ensure that refactoring the "give me a network" api call continues to work with nova's changes to booting an instance | 17:12 |
clarkb | depends on are explicit expressions of relationships. The queue is an indication that changes to related projects may interfere with each other so we check them together | 17:13 |
clarkb | and that is why zuul uses queue to determine the list of related projects which it checks for actions when children or parents have actionable events | 17:14 |
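(The explicit relationship is just a footer in the commit message of the depending change; the change URL below is a placeholder:)

    Depends-On: https://review.opendev.org/c/openstack/neutron/+/<change-number>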
jrosser | the thing i am concerned about is when we need to make completely unrelated changes like this one https://review.opendev.org/c/openstack/openstack-ansible-os_cinder/+/835702 | 17:18 |
jrosser | they need to be wholly unrelated (because they are) and implying anything from the approval order will almost certainly be counterproductive | 17:18 |
fungi | a piece of what makes the recent behavior change "waste" additional resources is the openstack gate pipeline's "clean check" rule, which is considered a bit of an antipattern. other tenants allow enqueuing a change with a negative verified vote (even verified -2) directly into the gate pipeline without first going through check again | 17:19 |
jrosser | as those sorts of changes tend to weed out latent brokenness in some of our less active repos | 17:19 |
clarkb | there are always changes like that; for example, updating docs in nova isn't going to break neutron api use | 17:19 |
clarkb | but we accept that, and generally they don't seem to be a major problem because they are simple changes that test quickly and accurately | 17:19 |
jrosser | that example i gave is anything but quick to run, hence my concern about implying a relationship that doesn't exist | 17:20 |
clarkb | ok, I'm asking that we try it before dismissing it, since that is why zuul exists in the first place and other projects use it successfully | 17:21 |
fungi | most (all?) of the non-openstack tenants in our zuul deployment even conserve resources by cancelling any running check jobs if the change gets approved and enqueued into the gate | 17:21 |
clarkb | otherwise I think the OSA team should bring this up with zuul and maybe help write the new functionality necessary to address this | 17:21 |
clarkb | (I'm willing to help do that myself but only if we've exhausted the "intended usage pattern doesn't function" first) | 17:21 |
fungi | this also might be a reason for openstack to revisit the "clean check" rule for its gate pipeline | 17:22 |
clarkb | fungi: well, they are talking about blind rechecks in the nova room right now and it sounds like they are a major problem? clean check was implemented due to blind rechecks landing broken code | 17:23 |
clarkb | I guess the risk there is that if blind rechecks are super common, people will get more stuff landed without any inspection as to why it failed in the first place. But it may also be worthy of an experiment | 17:23 |
fungi | sort of. it was implemented because of core reviewers approving untested or failing changes | 17:23 |
fungi | but yes, repeatedly rechecking changes to clear failures without bothering to look into those failures is closely related | 17:24 |
clarkb | fungi: well, what happens is you get a +1 in check, a reviewer +A's after the zuul +1, then the developer can recheck as many times in a row as it takes to land the code | 17:24 |
clarkb | but trying it and seeing if we end up root causing a bunch of random changes that were rechecked into oblivion when stuff breaks is doable and would be good data gathering | 17:25 |
clarkb | jrosser: re the rabbitmq thing, does upstream not acknowledge the problem exists, or do they not want help to fix it? In general, if you put package files in place before removing indexes and remove old files after some delay, you end up with a repo that doesn't error. Typically that isn't difficult to achieve. They build new packages, upload packages to the repo, update the index, | 17:33 |
clarkb | sleep $TIME, remove old packages. I guess I'm wondering if the problem is that they think the problem doesn't exist at all, so updating the order of operations is refused? | 17:33 |
clarkb | hrm, looks like they use packagecloud.io? | 17:35 |
clarkb | I wonder if this is a problem with the third party service | 17:35 |
jrosser | we used to get them from the rabbitmq upstream repo | 17:36 |
jrosser | and when they broke i'd tweet them and ~24hours later it would be fixed | 17:36 |
clarkb | Looks like cloudsmith.io also hosts packages. I wonder if one has problems and the other doesn't, or maybe they are using the same repo content and then synced from a broken state | 17:37 |
jrosser | but it got so bad we now get them from cloudsmith instead | 17:37 |
clarkb | jrosser: looking at their release notes for recent releases they seem to only have cloudsmith and packagecloud listed. | 17:37 |
jrosser | i do wonder if they just found maintaining the repo too much trouble | 17:38 |
clarkb | at the very least it seems they don't want people using some other repo for recent releases | 17:39 |
clarkb | https://packagecloud.io/rabbitmq/rabbitmq-server/install they seem to host something that is theoretically proxyable as a repo | 17:41 |
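(If memory serves, packagecloud exposes it as a normal apt source, something along these lines, which is the sort of URL a caching proxy can sit in front of; the release name here is just an example:)

    deb https://packagecloud.io/rabbitmq/rabbitmq-server/ubuntu/ focal main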
jrosser | iirc this stuff was bad as well https://www.erlang-solutions.com/downloads/ | 17:41 |
jrosser | and we get all of it from cloudsmith now | 17:41 |
clarkb | I'm hoping that a tool named "packagecloud" is building proper deb repos | 17:41 |
clarkb | oh, it's erlang, not rabbit? | 17:41 |
jrosser | well you need both | 17:42 |
clarkb | got it | 17:42 |
jrosser | there's a compatibility matrix that relates the two | 17:42 |
clarkb | https://packagecloud.io/rabbitmq/erlang yup they provide both | 17:42 |
clarkb | looks like cloudsmith hosts deb files but not a repo. packagecloud does expose things as a repo with a gpg key you can trust and source.list entries you can add | 17:43 |
clarkb | and the packages on packagecloud go back ~6 years | 17:44 |
clarkb | I think if those packages continue to be unreliable for network access reasons instead of repo consistency problems, then setting up a proxy cache to packagecloud is reasonable. Then anything else hosted there would be able to be retrieved through the same proxy | 17:45 |
clarkb | looks like they also host npm and maven and other stuff too so potentially useful beyond distro packages as well | 17:45 |
jrosser | i think cloudsmith does have a repo, they just make it somehow not browsable https://dl.cloudsmith.io/public/rabbitmq/rabbitmq-erlang/deb/ubuntu | 17:48 |
clarkb | ah | 17:49 |
jrosser | eg `deb https://dl.cloudsmith.io/public/rabbitmq/rabbitmq-erlang/deb/ubuntu bionic main` | 17:49 |
clarkb | ya so could proxy cache either or both. Likely a matter of preference due to reliability more than anything else | 17:50 |
*** dasm|ruck|mtg is now known as dasm|ruck | 17:57 | |
*** rlandy is now known as rlandy|mtg | 18:33 | |
*** rlandy|mtg is now known as rlandy | 19:02 | |
*** rlandy is now known as rlandy|mtg | 20:26 | |
*** dviroel is now known as dviroel|out | 20:36 | |
timburke | is this a good place to mention there seems to be a problem with the fedora-35 mirrors? seeing py310 failures like https://zuul.openstack.org/build/37ab457a35f74e8eaab81af2fea63916/log/job-output.txt#341 | 20:57 |
fungi | timburke: thanks for the heads up, i haven't seen anyone else mention it yet | 21:05 |
fungi | we currently mirror from rsync://pubmirror2.math.uh.edu/fedora-buffet/fedora/linux according to this: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror-update/files/fedora-mirror-update#L36 | 21:07 |
fungi | the timestamp file at the root of our fedora mirror tree says we last updated at 2022-04-05T21:02:04,946404602+00:00 | 21:08 |
fungi | so only a few minutes ago, i guess | 21:09 |
fungi | though the indexes which returned a 404 in the linked job are still nonexistent on our mirrors | 21:10 |
fungi | strangely, http://pubmirror2.math.uh.edu/fedora-buffet/fedora/linux/updates/35/Everything/x86_64/repodata/ seems to have newer files than what we're serving | 21:12 |
fungi | including the files the job is looking for | 21:12 |
fungi | i'll check the rsync log | 21:12 |
fungi | https://static.opendev.org/mirror/logs/rsync-mirrors/fedora.log | 21:14 |
fungi | looks like there's some massive upheaval for f35 today | 21:14 |
fungi | i don't see any indication that rsync has picked up the missing indices yet | 21:15 |
fungi | also that log ends 2 hours ago, so is for the prior refresh. i bet we don't flush the output from the latest run to the volume before we release it | 21:17 |
fungi | yeah, the latest log is still in /var on the mirror-update.o.o server | 21:18 |
fungi | hard to say, but i think we've caught the uh.edu mirror in the middle of a large fedora 35 update | 21:20 |
fungi | we could try switching to pull from a different mirror which has already stabilized, or try to ride it out a bit longer | 21:22 |
timburke | 👍 thanks for the analysis! i'm content to wait it out -- nothing critical for me | 21:23 |
fungi | if it's still broken in 2-4 hours, then we might want to consider picking a different mirror to pull from | 21:24 |
fungi | unrelated, looks like pypi had a bunch of not-fun earlier: https://status.python.org/incidents/mxgkk3xxr9v7 | 21:24 |
clarkb | that must've caused the issues we observed with package installs | 22:37 |
clarkb | I think this is the first time they've noticed that sort of problem when we did. I guess in this case it was because it was more catastrophic | 22:37 |
*** dasm|ruck is now known as dasm|off | 22:43 | |
opendevreview | Ghanshyam proposed openstack/project-config master: Remove tempest-lib from infra https://review.opendev.org/c/openstack/project-config/+/836703 | 22:45 |
*** rlandy|mtg is now known as rlandy | 23:16 | |
opendevreview | Ghanshyam proposed openstack/project-config master: Retire openstack-health project: end project gating https://review.opendev.org/c/openstack/project-config/+/836707 | 23:43 |
*** rlandy is now known as rlandy|out | 23:55 |