fungi | oh, good point, the packaging doesn't know to remove the per-site initscripts | 00:08 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Move grubenv to EFI dir, add a symlink back https://review.opendev.org/c/openstack/diskimage-builder/+/804000 | 05:36 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Support grubby and the Bootloader Spec https://review.opendev.org/c/openstack/diskimage-builder/+/804002 | 05:36 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: RHEL/Centos 9 does not have package grub2-efi-x64-modules https://review.opendev.org/c/openstack/diskimage-builder/+/804816 | 05:36 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add policycoreutils package mappings for RHEL/Centos 9 https://review.opendev.org/c/openstack/diskimage-builder/+/804817 | 05:37 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add reinstall flag to install-packages, use it in bootloader https://review.opendev.org/c/openstack/diskimage-builder/+/804818 | 05:37 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Add DIB_YUM_REPO_PACKAGE as an alternative to DIB_YUM_REPO_CONF https://review.opendev.org/c/openstack/diskimage-builder/+/804819 | 05:37 |
*** ysandeep|away is now known as ysandeep | 06:32 | |
*** jpena|off is now known as jpena | 07:32 | |
*** rpittau|afk is now known as rpittau | 07:54 | |
*** ykarel is now known as ykarel|lunch | 09:27 | |
*** diablo_rojo is now known as Guest4602 | 10:34 | |
*** ykarel|lunch is now known as ykarel | 10:38 | |
*** dviroel|out is now known as dviroel|ruck | 11:26 | |
*** jpena is now known as jpena|lunch | 11:39 | |
*** jpena|lunch is now known as jpena | 12:42 | |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 12:46 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Make elastic recheck compatible with rdo elasticsearch https://review.opendev.org/c/opendev/elastic-recheck/+/803897 | 12:53 |
*** ykarel is now known as ykarel|away | 14:41 | |
*** jpena is now known as jpena|off | 15:26 | |
*** jpena|off is now known as jpena | 15:27 | |
*** jpena is now known as jpena|off | 15:48 | |
*** ysandeep is now known as ysandeep|dinner | 16:02 | |
clarkb | fungi: can you think of other testing we might want to do on that server? should we manually run newlist maybe and see if your mail server receives it? (I expect mine won't) | 16:05 |
clarkb | s/that server/the test lists.kc.io server/ | 16:05 |
fungi | we'll need to turn mailman and exim on, but sure can give that a shot. should be safe | 16:05 |
fungi | i'm heating up some lunch right now but can test stuff shortly | 16:06 |
clarkb | no rush, just trying to think of additional checks we can safely do | 16:06 |
fungi | still need to dig up logs for the suspected zuul configuration cache bug reported in #openstack-infra today as well | 16:06 |
clarkb | fungi: did you also check with /etc/hosts override that the archives are visible? | 16:07 |
fungi | yes, seemed fine | 16:07 |
clarkb | oh yes I meant to read up on that more closely now that meetings are done | 16:07 |
*** marios is now known as marios|out | 16:08 | |
*** ysandeep|dinner is now known as ysandeep|out | 16:21 | |
*** rpittau is now known as rpittau|afk | 16:31 | |
clarkb | corvus: fungi: I think there may be another issue with running parallel CD jobs on bridge and that is the system-config repo start? | 16:48 |
clarkb | s/start/state/ every job is going to try and update the repo | 16:48 |
clarkb | In theory they will all be trying to update to the same thing but I could see git lock failures? | 16:49 |
fungi | maybe we could do a lbyl on it? | 16:49 |
fungi | though i suppose there's still an initial race | 16:49 |
clarkb | lbyl? | 16:50 |
fungi | look before you leap | 16:50 |
fungi | if two builds start at the same time and both see system-config is behind and want to update it | 16:50 |
clarkb | ah yup. | 16:50 |
fungi | then they would still potentially collide | 16:50 |
clarkb | I'm not sure what a good approach would be yet. Was just thinking through other points of conflict and realized this is likely one | 16:50 |
fungi | the lbyl idea was check whether system-config is up to date, and then only try to update it if not | 16:50 |
fungi | we could also add a wait lock around updates to that checkout, to deal with the race | 16:51 |
clarkb | another approach would be to have the locking job at the start of the buildset update git and then pause | 16:51 |
clarkb | then the subsequent jobs do not touch git | 16:51 |
fungi | quickly getting to be complicated coordination though | 16:51 |
fungi | yeah, maybe if all the other builds need to depend on the base playbook anyway, then we just update there and not any other time. though it makes testing those playbooks outside the deploy scope more complex since we always need a prerequisite system-config update before running | 16:52 |
clarkb | fungi: we already have it split out of the service playbooks | 16:53 |
clarkb | but it runs in every job currently | 16:53 |
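A minimal sketch of the wait-lock plus look-before-you-leap idea floated above, assuming an illustrative checkout path and lock file on bridge (neither path is taken from this discussion):

```python
import fcntl
import subprocess

# Both paths are assumptions for illustration, not the real bridge layout.
REPO = "/opt/system-config"
LOCK_FILE = "/var/run/system-config-update.lock"


def git(*args):
    """Run a git command in the checkout and return its stdout."""
    return subprocess.run(
        ["git", "-C", REPO, *args],
        check=True, capture_output=True, text=True,
    ).stdout.strip()


def update_system_config():
    # Wait lock: concurrent deploy jobs serialize here instead of racing
    # on git's own lock files.
    with open(LOCK_FILE, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until the holder is done
        git("fetch", "origin")
        # Look before you leap: only touch the working tree if we are
        # actually behind origin/master.
        if git("rev-parse", "HEAD") != git("rev-parse", "origin/master"):
            git("reset", "--hard", "origin/master")


if __name__ == "__main__":
    update_system_config()
```

Whichever build takes the lock first does the update; the others block briefly and then find the checkout already current, which would cover both the collision and the initial race mentioned above.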
clarkb | in good news we did cut out almost 20 minutes per hourly deploy by moving the cloud launcher | 16:54 |
clarkb | are we actively using zuul preview? I'm wondering if that is another potential optimisation point | 16:55 |
fungi | what's the context? i mean, zuul-preview is used by jobs | 17:03 |
clarkb | fungi: we run its playbook hourly along with nodepool, zuul, and the docker registry jobs | 17:04 |
clarkb | but I thought we had to turn it off | 17:04 |
clarkb | I'm wondering if there is any value to running the job currently as a result | 17:04 |
fungi | the server is running and responding | 17:07 |
fungi | http://zuul-preview.opendev.org/ | 17:07 |
clarkb | yes I think the server is up but we stopped returning it in zuul artifacts | 17:07 |
fungi | oh, got it | 17:07 |
clarkb | and I thought the apache might have been stopped? | 17:07 |
fungi | nah, apache is running and listening on 80/443 (though we only allow 80 through iptables) | 17:08 |
clarkb | I don't recall details but I want to say it shouldn't be | 17:08 |
fungi | but yeah, codesearch says this is the only master branch use of it currently: https://opendev.org/inaugust/inaugust.com/src/branch/master/.zuul.yaml#L19 | 17:10 |
mordred | yeah - I use it :) | 17:24 |
mordred | oh - but even that is wrong | 17:25 |
mordred | because success-url is not the right way to do that anymore | 17:25 |
clarkb | yes | 17:27 |
fungi | i found where it was originally disabled | 17:29 |
fungi | https://meetings.opendev.org/irclogs/%23openstack-infra/%23openstack-infra.2019-08-08.log.html#t2019-08-08T15:02:28 | 17:29 |
fungi | the suggestion was there were signs of use as an open proxy in its logs | 17:29 |
clarkb | I think faeda1ab850f11da0ed7df4fac985ff9e96454b3 may have fixed that in zuul-preview | 17:31 |
clarkb | now the question goes back to "do we need hourly deploys of this service if nothing is (correctly) using it right now?" | 17:32 |
fungi | looks like https://review.opendev.org/717870 was the expected solution | 17:32 |
fungi | configuration for it doesn't need to update that frequently, i expect | 17:32 |
fungi | maybe the hourly was to catch new zuul-preview container images more quickly, but that also seems like daily would be plenty | 17:33 |
clarkb | fungi: yup hourly jobs in most of these cases are to update container images in a reasonable amount of time | 17:33 |
clarkb | the cloud launcher and the puppet jobs are the exceptions to that rule. | 17:33 |
fungi | also, as some reassurance the open proxy situation was resolved, i've analyzed the past month worth of logs and don't see any successful requests through it | 17:34 |
fungi | (also proof that it's not really being used at all though) | 17:34 |
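For reference, a rough sketch of the kind of check fungi describes, assuming Apache combined-format access logs under an illustrative path; it simply counts 2xx responses served through the proxy:

```python
import glob
import gzip
import re

# Assumed log location; the real path on the server may differ.
LOG_GLOB = "/var/log/apache2/zuul-preview*access*log*"

# Combined log format: ... "METHOD URI PROTO" STATUS ...
REQUEST_RE = re.compile(r'"\S+ \S+ \S+" (\d{3}) ')

successful = 0
for path in glob.glob(LOG_GLOB):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", errors="replace") as f:
        for line in f:
            m = REQUEST_RE.search(line)
            if m and m.group(1).startswith("2"):
                successful += 1

print(f"successful (2xx) requests proxied: {successful}")
```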
clarkb | fungi: checking on prod lists.kc.io ua status still shows it is enrolled | 17:44 |
clarkb | I suspect they may check things like IP addrs and uuids and such | 17:44 |
clarkb | which is good because it means things are happy on our end :) | 17:44 |
clarkb | Looking at remote-puppet-else more closely I think we can probably improve the files list on that one a bit and run it in deploy and daily only | 17:52 |
clarkb | We're really not expecting puppet side effects unless we update the puppet directly. The reason for this is puppet things largely only deal with packages and not containers | 17:52 |
clarkb | Then we should be able to trim the hourly stuff to be nodepool and zuul which are things that tend to be under active development and have external updates we don't have good triggers for | 17:53 |
fungi | also the puppeted things are, on the whole, not getting a lot of updates these days | 17:54 |
clarkb | ya I think the reason it is in there is we may update the various puppet modules then want those updates to hit production | 17:56 |
clarkb | but reality is we don't make a lot of updates to the external puppet modules anymore | 17:56 |
clarkb | and if we had a one off we can always wait for the daily or manually run the playbook | 17:56 |
clarkb | the thing that sets zuul apart is it might get many updates every day all week | 17:56 |
clarkb | Pure brainstorm: we could also drop the hourly pipeline entirely and if we want things quicker than daily either land a change (could be a noop change) to force jobs to run or manually run playbooks | 17:59 |
fungi | corvus: clarkb: so on the stale project pipeline config for the openstack/glance_store stable/victoria branch, i can't find evidence that zuul saw the change-merged event for that (i see plenty of other change-merged events logged, but none for 804606,2 when it merged). i'm going to check the gerrit log for any errors around that time | 18:27 |
clarkb | fungi: you might also look for ssh connection errors in the zuul scheduler log since those events are ingested via the ssh event stream | 18:28 |
fungi | thanks, good idea | 18:29 |
fungi | corvus: clarkb: so given the timing, i think this exception could be related to the missed change-merged event, but there's not enough context in the log to be certain (it occurs roughly 19 seconds after the submit action at 17:08:26,569) | 18:47 |
fungi | http://paste.openstack.org/show/808156/ | 18:47 |
clarkb | interesting. maybe we failed to process the merge event | 18:49 |
clarkb | and that is the exception related to that? | 18:50 |
clarkb | we only get the event stream events over ssh, then everything else is http | 18:50 |
fungi | i can't find any smoking gun in the gerrit error log either | 18:51 |
fungi | anyway, i'm inclined to say this doesn't point to a problem with config caching, but seems to be related to a lost stream event | 18:51 |
fungi | worth noting, there are 17 "Exception moving Gerrit event" lines in yesterday's scheduler log | 18:52 |
clarkb | though the behavior indicates the cache is affected | 18:52 |
clarkb | because pushing a new change without the history of the old change didn't run the job that was removed | 18:53 |
fungi | and 16 so far in today's log | 18:53 |
clarkb | it does seem like the cache should try and be resilient to these issues if possible? | 18:53 |
fungi | turns out it wasn't "pushing a new change without the history of the old change" so much as "pushing a change which edits the zuul config" | 18:54 |
clarkb | ah | 18:54 |
fungi | which makes sense, in that case zuul creates a speculative layout based on the change | 18:54 |
clarkb | because that forces a cache refresh | 18:54 |
clarkb | yup | 18:54 |
fungi | but it doesn't actually refresh the cache, seems like, because changes which don't alter the zuul config remain brokenly running the removed job | 18:55 |
clarkb | ya it will only apply to the change's speculative state until it merges | 18:55 |
fungi | so seems more like the cache is avoided when a speculative layout is created | 18:55 |
clarkb | I think it still caches it | 18:55 |
clarkb | for the next time that change runs jobs | 18:55 |
fungi | er, or it's caching separately right | 18:55 |
fungi | but not updating the cache of the branch state config | 18:56 |
corvus | i'm not really here yet -- but if we missed an event, it's not a cache issue -- that's why i suggested that's the place to look | 18:56 |
corvus | it's just that zuul is running with an outdated config; this would have happened in zuul v3 too | 18:57 |
clarkb | corvus: I don't know that we missed the event as much as threw an exception processing the event based on what fungi pasted | 18:57 |
clarkb | functionally it is like missing an event | 18:57 |
corvus | okay, i'll dig in more later | 18:57 |
corvus | well, okay, i seem to be toggling back and forth on understanding what you're saying... | 18:58 |
fungi | as i said, i don't think we have sufficient context in the log to know that the exception there was related to the lost change-merged event, it's 19 seconds after the submit, which makes me suspect it might not be | 18:58 |
corvus | question #1 for debugging this is: "did the zuul scheduler's main loop process a change-merged event and therefore correctly determine that it should update its config?" | 18:59 |
fungi | but that in conjunction with me not finding any log of seeing the change-merged event for it makes it enough of a possibility i'm not ruling it out | 18:59 |
clarkb | 2021-08-16 17:08:45,780 ERROR zuul.GerritEventConnector: Exception moving Gerrit event: <- that is what happened in fungi's paste | 18:59 |
corvus | if the answer to that is "yes" then we look at the new zk config cache stuff. if the answer is "no" (for whatever reason -- TCP error, exception moving event, etc) then we don't look at the config cache, we look at the event processing. | 19:00 |
corvus | anyway, sorry, i haven't caught up yet, and i'm still working on ansible stuff... i was just trying to help avoid dead ends. like i said, i can pitch in after lunch. | 19:00 |
fungi | yes, i think it's event processing. the config cache states are simply explaining the behavior we saw, i don't see any reason to suspect they're part of the problem | 19:01 |
corvus | merging another config change to that file or performing a full reconfig should fix the immediate issue | 19:02 |
corvus | (in case it's becoming an operational issue) | 19:02 |
fungi | it's not, i don't think, but yes i already assumed those were the workarounds. thanks for confirming! | 19:03 |
yoctozepto | fungi: as a side note to matrix discussion: I will be advocating for more flexibility - otherwise we will have a hard time migrating ever ;/ | 19:19 |
fungi | yes, i think it's good to acknowledge that people will communicate however is convenient for them to do so, and it's better that we not pretend they're active and available somewhere they aren't | 19:20 |
yoctozepto | ++ | 19:20 |
yoctozepto | though is mainland China happy with matrix? it might be seen as overly encrypted, no? | 19:21 |
fungi | element/matrix.org homeservers are apparently not generally accessible from behind the great firewall, no | 19:22 |
yoctozepto | meh | 19:22 |
fungi | however there was mention of the existence of a matrix wechat bridge, someone still needs to look into viability | 19:22 |
fungi | i think that wouldn't work for getting to the zuul matrix channel from wechat, but might make some wechat channels available via matrix | 19:23 |
clarkb | also the beijing linux user group has matrix instructions | 19:24 |
clarkb | we might be able to reach out to groups like that to get an informed opinion on usability | 19:24 |
yoctozepto | ++ on that | 19:24 |
yoctozepto | would help openstack tc as well | 19:24 |
yoctozepto | aaand is the zuul matrix-native channel on matrix.org or elsewhere? | 19:25 |
clarkb | it's on a small opendev homeserver that we don't intend to use for accounts, just rooms | 19:25 |
yoctozepto | ack | 19:26 |
opendevreview | Ian Wienand proposed opendev/system-config master: borg-backup: randomise time on a per-server basis https://review.opendev.org/c/opendev/system-config/+/804916 | 19:26 |
fungi | yoctozepto: also we're not running the homeserver ourselves, the foundation is paying element to host one for us | 19:26 |
yoctozepto | so, there could be a chance that this one can be reached via the great firewall | 19:26 |
fungi | but we could host it on our own servers later if we decide that's warranted | 19:26 |
yoctozepto | ack | 19:26 |
yoctozepto | thanks for the insights | 19:26 |
fungi | i think with element hosting the homeserver for us, it's unlikely to work directly from mainland china | 19:27 |
yoctozepto | duh | 19:27 |
ianw | sorry, i got distracted on the UA redirect | 20:01 |
fungi | ianw: oh, don't apologize, i was meaning to work on it too. i've commented on the change with the ip address of the held server, feel free to fiddle with the vhost config on it if you like | 20:02 |
corvus | fungi, clarkb: looking at the traceback, it does seem likely that the event was not processed due to the connection error. i think a retry loop in queryChange should help (and would apply equally to http and ssh query methods). | 20:04 |
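A rough illustration of the retry-loop idea corvus describes; the function and exception names here are hypothetical stand-ins rather than Zuul's actual queryChange code:

```python
import logging
import time

log = logging.getLogger("gerrit.connection")


def query_change_with_retry(query_change, change_key, attempts=3, delay=5):
    """Retry a Gerrit change query a few times before giving up.

    ``query_change`` stands in for whatever callable performs the HTTP or
    SSH query; transient connection failures are retried with a short pause.
    """
    for attempt in range(1, attempts + 1):
        try:
            return query_change(change_key)
        except (ConnectionError, TimeoutError) as e:
            log.warning("query for %s failed (attempt %d/%d): %s",
                        change_key, attempt, attempts, e)
            if attempt == attempts:
                raise
            time.sleep(delay)
```

Since the wrapper only cares that the query callable can raise a transient error, the same approach would apply to either transport, as noted above.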
opendevreview | Ian Wienand proposed opendev/system-config master: [wip] redirect pastebinit https://review.opendev.org/c/opendev/system-config/+/804918 | 20:23 |
marc-vorwerk | /msg NickServ REGISTER <password> <e-mail> | 20:38 |
*** dviroel|ruck is now known as dviroel|ruck|out | 21:15 | |
corvus | infra-root: i'd like to restart zuul now to pick up some bugfixes | 21:56 |
fungi | corvus: fine by me, status looks calm and i don't see any openstack release changes in-flight | 21:58 |
fungi | i'll give #openstack-release a heads up | 21:58 |
corvus | re-enqueing now | 22:13 |
fungi | thanks! | 22:13 |
corvus | complete | 22:20 |
corvus | #status log restarted all of zuul on commit 6eb84eb4bd475e09498f1a32a49e92b814218942 | 22:20 |
opendevstatus | corvus: finished logging | 22:20 |
opendevreview | Paul Belanger proposed opendev/system-config master: Add mitogen support to ansible https://review.opendev.org/c/opendev/system-config/+/804922 | 22:40 |
pabelanger | o/ | 22:46 |
pabelanger | I pushed up a change to add mitogen support to ansible deploys | 22:47 |
pabelanger | should make things a little faster | 22:47 |
pabelanger | will check back soon to debug the change failure if any | 22:47 |
fungi | thanks pabelanger! | 22:48 |
clarkb | thanks. It will be interesting to see how that performs compared to the default | 22:51 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run infra-prod-service-zuul-preview daily instaed of hourly https://review.opendev.org/c/opendev/system-config/+/804925 | 22:58 |
opendevreview | Clark Boylan proposed opendev/system-config master: Run remote-puppet-else daily instead of hourly https://review.opendev.org/c/opendev/system-config/+/804926 | 22:58 |
opendevreview | Clark Boylan proposed opendev/system-config master: Stop requiring puppet things for afs, eavesdrop, and nodepool https://review.opendev.org/c/opendev/system-config/+/804927 | 22:58 |
clarkb | infra-root ^ more attempts at cleaning up the current deploy job situation | 22:59 |
pabelanger | clarkb: ah, you are on ansible-core 2.12. Sadly, mitogen doesn't support it yet | 23:10 |
pabelanger | I think they only support 2.10 | 23:10 |
pabelanger | I'm still on 2.9 | 23:10 |
clarkb | pabelanger: ok I was worried about that. You can set the ansible back to 2.9 or 2.10 in the change too just for comparison | 23:10 |
clarkb | I don't know that we will downgrade in production but having the data would still be useful I think | 23:10 |
fungi | is that actually what we've got in production? | 23:11 |
clarkb | fungi: ya we upgraded a couple months ago iirc | 23:11 |
pabelanger | k, let me do it for the test job | 23:11 |
fungi | indeed, --version says ansible [core 2.11.1] | 23:12 |
fungi | so >2.10 anyway | 23:12 |
opendevreview | Paul Belanger proposed opendev/system-config master: WIP: Add mitogen support to ansible https://review.opendev.org/c/opendev/system-config/+/804922 | 23:13 |
pabelanger | how come https://zuul.opendev.org/t/openstack/status/change/804922,2 is in the openstack tenant and not opendev tenant? | 23:14 |
fungi | long story | 23:15 |
pabelanger | guessing legacy reasons? | 23:15 |
fungi | some | 23:15 |
fungi | also our main zuul config is in openstack/project-config still | 23:16 |
corvus | Short version: No one has moved everything that needs to be moved | 23:17 |
clarkb | it is getting easier and easier to move as we reduce dependencies on a bunch of repos | 23:19 |
clarkb | the ansible work is consolidating a lot of stuff into system-config which makes this simpler | 23:19 |
opendevreview | Paul Belanger proposed opendev/system-config master: WIP: Add mitogen support to ansible https://review.opendev.org/c/opendev/system-config/+/804922 | 23:30 |
clarkb | ianw: left a note on https://review.opendev.org/c/opendev/system-config/+/804918 if that didn't need a new patchset after testing I'd probably leave it as is, but since a new patchset is necessary anyway... | 23:59 |