clarkb | mordred: ok I think that is really close but some of the puppet stuff still needs updating, comments on the change | 00:09 |
---|---|---|
mordred | clarkb: responded | 00:11 |
mordred | clarkb: and no - those are remote paths | 00:11 |
clarkb | oh I'm going to need to melt my brain again I guess | 00:12 |
mordred | clarkb: (I had to check myself) | 00:12 |
mordred | clarkb: I actually think we should completely rework the puppet tests to be based on remote_puppet_else | 00:12 |
clarkb | mordred: mgmt_ is bridge? and not mgmt_ is remote? | 00:12 |
mordred | yup | 00:13 |
clarkb | mordred: ok so the way this would work is we just copy from /home/zuul/etc into /opt/system-config/production on the remote and nothing else changes? | 00:13 |
clarkb | I guess that simplifies things for making changes on bridge | 00:13 |
mordred | like - I think it would be nice to get rid of the current puppet jobs completely - make per-service jobs that are essentially "run remote-puppet-else but with only host X" - then we'll be set for each service we transition | 00:13 |
mordred | clarkb: yah | 00:13 |
clarkb | mordred: ++ on the job idea | 00:13 |
mordred | clarkb: because also we need those legacy puppet jobs to die anyway | 00:14 |
mordred | clarkb: I mean - really - we could start making service-foo playbooks for everything too - just with roles: - puppet in them | 00:15 |
mordred | and completely get rid of else | 00:15 |
mordred | corvus: if you have a sec for a re-review of the first patch in the stack: https://review.opendev.org/#/c/719186 - I can land those when I'm watching in the morning | 00:17 |
clarkb | similarly if someone is willing to review those docker-compose upgrade changes I'm happy to babysit those tomorrow as they go in (assuming I get a second +2) | 00:18 |
mordred | infra-root: that's https://review.opendev.org/#/c/720030/ and https://review.opendev.org/#/c/719589/ ^^ | 00:28 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Document output variables https://review.opendev.org/719704 | 00:48 |
openstackgerrit | Merged zuul/zuul-jobs master: ensure-pip: Add role https://review.opendev.org/717639 | 01:09 |
openstackgerrit | Merged opendev/system-config master: Write out db config for root user https://review.opendev.org/719192 | 01:11 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: Python roles: misc doc updates https://review.opendev.org/720111 | 01:27 |
openstackgerrit | Merged openstack/project-config master: Move suse builds to nb04, drop pip-and-virtualenv https://review.opendev.org/718299 | 01:45 |
*** ysandeep|away is now known as ysandeep|rover | 02:11 | |
openstackgerrit | Ian Wienand proposed openstack/project-config master: AFS Grafana : add mirror release timers https://review.opendev.org/720122 | 03:11 |
openstackgerrit | Ian Wienand proposed openstack/project-config master: AFS Grafana : add mirror release timers https://review.opendev.org/720122 | 03:45 |
ianw | dirk / clarkb: so suse has built on nb04 now | 03:48 |
ianw | i'd like to, and will be available to, push on anything needed to get things working without pip-and-virtualenv. as i've said, i think the ensure-pip stack is ready | 03:49 |
*** DSpider has joined #opendev | 03:51 | |
ianw | cmurphy: ^ might also affect as i saw some things fly by about certs | 03:53 |
cmurphy | ianw: ooh good to know, a new image might help me avoid needing https://review.opendev.org/720053 | 03:58 |
ianw | ahh yeah that was what i was thinking of. i'm not going to make a prediction, but maybe? :) | 03:59 |
openstackgerrit | Merged openstack/project-config master: AFS Grafana : add mirror release timers https://review.opendev.org/720122 | 04:03 |
*** ysandeep|rover is now known as ysandeep|BRB | 04:08 | |
*** ysandeep|BRB is now known as ysandeep|rover | 04:23 | |
*** ykarel|away is now known as ykarel | 04:25 | |
openstackgerrit | Merged openstack/project-config master: Revert "Revert "Introduce job for granular GitHub mirroring"" https://review.opendev.org/719047 | 05:30 |
AJaeger | ianw: reviewed the stack and gave my +2s, I did not approve - wanted you to do the honours yourself when you're around. Thanks! | 05:41 |
*** roman_g has joined #opendev | 05:42 | |
ianw | AJaeger: ok, thanks, i'll do that in the morning then to avoid pushing anything before i disappear :) | 05:43 |
AJaeger | ianw: enjoy your evening ;) | 05:46 |
openstackgerrit | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/720126 | 06:03 |
*** roman_g has quit IRC | 06:15 | |
prometheanfire | should glean be updated (tox wise) to py36/38? | 06:19 |
*** hashar has joined #opendev | 06:26 | |
*** hashar has quit IRC | 06:42 | |
*** dpawlik has joined #opendev | 06:49 | |
AJaeger | ianw, cmurphy, dirk, keystone is now failing openSUSE tests, see https://review.opendev.org/715688 | 06:58 |
AJaeger | RETRY_LIMIT - and no log files ;( | 06:59 |
openstackgerrit | Matthew Thode proposed opendev/glean master: write one resolv config https://review.opendev.org/717339 | 07:00 |
*** roman_g has joined #opendev | 07:00 | |
prometheanfire | ok, that passes tests locally ^ | 07:00 |
*** lpetrut has joined #opendev | 07:07 | |
*** hashar has joined #opendev | 07:07 | |
openstackgerrit | Merged openstack/project-config master: Normalize projects.yaml https://review.opendev.org/720126 | 07:10 |
*** ralonsoh has joined #opendev | 07:14 | |
ianw | AJaeger: ok ... hrm that seems before anything i'd even expect to have changed wrt pip-and-virtualenv | 07:14 |
ianw | AJaeger, dirk, cmurphy: this seems to be the relevant bit -> http://paste.openstack.org/show/792138/ | 07:18 |
ianw | "msg": "Data could not be sent to remote host \\"149.202.187.58\\". Make sure this host can be reached over ssh: Permission denied | 07:18 |
ianw | # cat /etc/dib-builddate.txt | 07:21 |
ianw | 2020-04-15 04:38 | 07:21 |
ianw | i'm logged into an opensuse host that was built today though ... | 07:22 |
ianw | opensuse-15-rax-dfw-0015944531 | 07:22 |
AJaeger | so, you can login but Zuul cannot? | 07:23 |
ianw | hrm, maybe? | 07:23 |
*** tosky has joined #opendev | 07:23 | |
ianw | zuul@opensuse-15-inap-mtl01-0015944709:~> cat .ssh/authorized_keys | 07:25 |
ianw | /var/lib/nodepool/.ssh/id_rsa.pub | 07:25 |
ianw | that .. does not look right? like it's a file and not the actual public key? | 07:26 |
ianw | 2020-04-15 02:02:16.244 | + /opt/dib_tmp/dib_build.ujqmkwxc/hooks/extra-data.d/60-zuul-user:main:16 : echo /var/lib/nodepool/.ssh/id_rsa.pub | 07:26 |
*** rpittau|afk is now known as rpittau | 07:28 | |
AJaeger | shouldn't that be cat? | 07:28 |
ianw | i think so, but ... | 07:30 |
openstackgerrit | Ian Wienand proposed openstack/project-config master: Add ZUUL_USER_SSH_PUBLIC_KEY to opensuse-15 image https://review.opendev.org/720136 | 07:30 |
ianw | AJaeger: ^ that should fix it. i'll have to think about the echo/cat thing | 07:30 |
ianw | if we want to merge that, i can come back and kick off a build soon, or maybe frickler could babysit it if around? | 07:31 |
ianw | this is *exactly* why i did the abstract job/inheritance thing in nodepool config, so wouldn't forget stuff like this. still have to get back to convert the file | 07:32 |
AJaeger | ianw: thanks, approved | 07:32 |
ianw | i feel like opendev-prod-hourly might be stuck | 07:43 |
*** ysandeep|rover is now known as ysandeep|lunch | 07:48 | |
openstackgerrit | Merged openstack/project-config master: Add ZUUL_USER_SSH_PUBLIC_KEY to opensuse-15 image https://review.opendev.org/720136 | 07:52 |
ianw | AJaeger: ok, i pulled that manually and triggered a build | 07:54 |
*** ykarel is now known as ykarel|lunch | 07:54 | |
ianw | https://nb04.opendev.org/opensuse-15-0000086893.log <- this one | 07:55 |
*** hashar has quit IRC | 08:02 | |
openstackgerrit | Sorin Sbarnea proposed opendev/gerritlib master: Switch to ensure-docker role https://review.opendev.org/720145 | 08:14 |
*** ykarel|lunch is now known as ykarel | 08:37 | |
*** ysandeep|lunch is now known as ysandeep|rover | 08:40 | |
openstackgerrit | Sorin Sbarnea proposed zuul/zuul-jobs master: Improve 404 error message on download-logs.sh https://review.opendev.org/720035 | 08:52 |
openstackgerrit | Merged opendev/irc-meetings master: Update OpenDev meeting location and name https://review.opendev.org/720060 | 08:58 |
openstackgerrit | Roman Gorshunov proposed openstack/project-config master: Retire airship-in-a-bottle https://review.opendev.org/720160 | 09:03 |
openstackgerrit | Roman Gorshunov proposed openstack/project-config master: Retire airship-in-a-bottle https://review.opendev.org/720160 | 09:04 |
*** hashar has joined #opendev | 09:14 | |
*** roman_g has quit IRC | 09:25 | |
openstackgerrit | Marcin Juszkiewicz proposed openstack/project-config master: Add CentOS 8 AArch64 nodes https://review.opendev.org/720167 | 09:52 |
openstackgerrit | Merged zuul/zuul-jobs master: Support ssh-enabled windows hosts in add-build-sshkey https://review.opendev.org/653712 | 10:00 |
*** rpittau is now known as rpittau|bbl | 10:23 | |
ttx | Test GitHub replication on release-test repository: http://zuul.openstack.org/build/96b02fef3f6345ed89f2f44283d49022/log/job-output.txt | 10:33 |
AJaeger | \o/ | 10:36 |
ttx | fungi, corvus, mnaser: please review ^ -- I'm wondering about all those deleted references and created branches | 10:36 |
ttx | I mean those branches definitely correspond to the opendev repo... just wondering why they weren't already up | 10:37 |
ttx | (maybe it's just a log artifact) | 10:37 |
ttx | Like... That list of deleted refs could be quite long in a more active repo | 10:38 |
* ttx is tempted to queue a second test | 10:39 | |
AJaeger | yeah, interesting to see | 10:40 |
ttx | ok, sending a new one in | 10:40 |
openstackgerrit | Merged zuul/zuul-jobs master: Improve 404 error message on download-logs.sh https://review.opendev.org/720035 | 10:46 |
*** ysandeep|rover is now known as ysandeep|coffee | 11:00 | |
ttx | http://zuul.openstack.org/build/8915e9ae33494257b1fb4928c16ec215/log/job-output.txt only has the additional change mentioned | 11:03 |
ttx | so yeah I fear that for large projects we may end up deleting thousands of references, which might or might not be costly | 11:05 |
*** ysandeep|coffee is now known as ysandeep|rover | 11:17 | |
openstackgerrit | Merged openstack/project-config master: Add devstack-plugin-ceph notifications to manila channel https://review.opendev.org/720097 | 11:54 |
AJaeger | ttx, are those changes all on github? Did you double check? | 12:03 |
ttx | They are, but then since Gerrit-wide replication was not turned off, that does not mean much | 12:05 |
ttx | AJaeger: oh, you mean the refs? | 12:05 |
AJaeger | yes | 12:05 |
ttx | let me do a recent clone | 12:05 |
ttx | AJaeger: on a fresh clone there aren't any refs on GitHub other than refs/remotes/origin/HEAD and refs/heads/master (+ branches) | 12:10 |
ttx | no refs/changes | 12:10 |
*** factor has joined #opendev | 12:16 | |
Eighth_Doctor | hey, is it normal that a repo like openstack/nova would have 182116 refs? | 12:18 |
Eighth_Doctor | there's this refs/changes thing and refs/users thing... | 12:19 |
ttx | Eighth_Doctor: where did you clone from? | 12:29 |
Eighth_Doctor | https://opendev.org/openstack/nova | 12:30 |
Eighth_Doctor | I did a `git clone --mirror` | 12:30 |
ttx | from opendev or github? | 12:30 |
ttx | (or both) | 12:30 |
Eighth_Doctor | opendev | 12:31 |
Eighth_Doctor | I only pulled from opendev | 12:35 |
ttx | I suspect opendev has a full mirror of the Gerrit repo, which keeps all the refs/changes | 12:38 |
ttx | while the new job pushes a mirror of a clone, so it does get rid of refs/changes in the process | 12:39 |
Eighth_Doctor | ttx, well, it certainly exposed some interesting things about doing a full mirror from there to stg.pagure.io | 12:41 |
Eighth_Doctor | also, wow, `git reflog` does not like this repo on my computer :/ | 12:41 |
Eighth_Doctor | I was taking a look at it due to a convo I had with mordred, clarkb, and fungi about using pagure as the source code browser frontend for opendev.org instead of gitea | 12:44 |
Eighth_Doctor | processing all those refs at once was a bit painful on the machine that stg.pagure.io runs on... | 12:48 |
Eighth_Doctor | but at least now it's there: https://stg.pagure.io/openstack/nova | 12:48 |
Eighth_Doctor | this is probably going to turn into a good test case, actually, since I hadn't encountered a repo like this before | 12:48 |
*** rpittau|bbl is now known as rpittau | 12:49 | |
*** roman_g has joined #opendev | 12:50 | |
mnaser | ttx: i wonder if the reason why it does this is because we don't do a deep mirror clone by zuul into the executor | 12:53 |
mnaser | ttx: and so because we have a shallow clone that doesn't include all the refs (because that would take a long time and probably not needed) | 12:54 |
ttx | yep | 12:54 |
ttx | that's what I meant by "pushes a mirror of a clone" | 12:54 |
*** ykarel is now known as ykarel|afk | 12:55 | |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Update update_constraints for Py3.8 https://review.opendev.org/720197 | 12:58 |
Eighth_Doctor | ttx: well, it took four days to push all those refs | 13:08 |
Eighth_Doctor | and most of the git command line tools seem to be rather unhappy with the repo on my machine because of all the refs | 13:09 |
Eighth_Doctor | but it's a nice test case, so it's not all bad | 13:09 |
ttx | lol... Yeah I expect it will also take days to delete them if we end up mirroring nova with the new per-repo system | 13:10 |
ttx | hence my question up there | 13:10 |
Eighth_Doctor | ttx: if gerrit+zuul was directly managing the pagure git repository, I don't think this would be a problem | 13:22 |
Eighth_Doctor | otherwise, probably should be somehow not sending those refs when pushing, because damn they're expensive | 13:22 |
mordred | ttx, mnaser we _do_ have a full mirror on the executor - however, the refs/changes thing might be a smidge interesting | 13:25 |
mordred | because I'm not sure each executor is always going to fetch refs/changes it doesn't happen to work with - so in any given push we may not get the full story of the refs/changes/* | 13:25 |
mordred | although maybe it's fine that they're not there | 13:26 |
Eighth_Doctor | mordred: my theory at least is that this would only be painful once | 13:26 |
mordred | I'm a little concerned about that origin/stable/train -> origin/stable/train and friends | 13:26 |
mordred | Eighth_Doctor: well for the gitea/pagure case it's a little different - we use those also so that people can browse proposed changes too - so we need all of the refs/changes to be in that system | 13:27 |
mordred | for github mirroring - meh, I don't think it's actually important | 13:27 |
Eighth_Doctor | though gitea looks like it's not happy with me doing a git fetch right now | 13:27 |
frickler | ianw: AJaeger: cmurphy: new opensuse image seems to work better, but now fails with "virtualenv: command not found" | 13:27 |
frickler | https://a454580e587cac547c7e-cfcb5348d0a5bd4d7cf82711ec310965.ssl.cf1.rackcdn.com/715688/6/check/keystone-dsvm-py3-functional-federation-opensuse15/152dd76/ | 13:28 |
Eighth_Doctor | mordred: at least with gitea, refs/changes are not visible | 13:28 |
mordred | ttx: I think the created branches are a logic bug | 13:28 |
Eighth_Doctor | it wouldn't be hard to extend pagure to show you the refs/changes stuff, but I'm not sure how useful it would be given that the refs have no context | 13:28 |
Eighth_Doctor | I'm not even sure what the numbering scheme is here | 13:28 |
mordred | Eighth_Doctor: yeah - they're hidden refs - those are how gerrit stores proposed changes | 13:29 |
Eighth_Doctor | yeah | 13:29 |
ttx | mordred: not very concerned with the branches really. Just don't want the script to block executors for one day deleting 182,116 refs every time Nova is synced | 13:30 |
Eighth_Doctor | pagure PRs work similarly, except they're stored in an adjacent repo for pull requests | 13:30 |
ttx | (refs/changes) | 13:30 |
mordred | so - https://review.opendev.org/#/c/719186/9 is going to be in refs/changes/86/719186/9 | 13:30 |
Eighth_Doctor | where does `86` come from? | 13:31 |
Eighth_Doctor | is it just the last two digits? | 13:31 |
mordred | the last 2 digits | 13:31 |
mordred | it's a dir hashing scheme | 13:31 |
Eighth_Doctor | okay | 13:31 |
mordred | but that ref can be seen in gitea: https://opendev.org/opendev/system-config/commit/c117c1106df8ff30aee7b8a118811bf239f3dcf8 | 13:31 |
mordred | so we push them there, but since they aren't branches they don't show up in the branches list | 13:32 |
*** ysandeep|rover is now known as ysandeep|away | 13:32 | |
Eighth_Doctor | right | 13:33 |
Eighth_Doctor | that should work the same way with pagure, I think | 13:33 |
mordred | ttx: yeah - I think we might want to come up with a $something to do in git config to control refs/ interactions | 13:33 |
mordred | ttx: or - we could do an offline script to push up refs/changes deletions for all of them | 13:34 |
mordred | ttx: so that we just stop caring about those refs on github completely | 13:34 |
mordred | they're not exactly browseable anyway | 13:34 |
ttx | yeah, it's just tricky to do without freezing mirroring for a bit | 13:35 |
ttx | Like 1/ disable Gerrit-wide replication, 2/ run refs/changes deletion script over a thousand repos and 700,000 refs, 3/ enable per-repo mirroring | 13:36 |
ttx | I have no idea how long 2 will take :) | 13:37 |
Eighth_Doctor | I wonder if we could be clever here in pagure, and make it so that when those refs/changes things show up, they make a link to Gerrit? | 13:37 |
Eighth_Doctor | ttx: four days at least on nova :) | 13:37 |
ttx | Eighth_Doctor: it was to create them, hopefully deleting is faster :) | 13:37 |
Eighth_Doctor | actually, would the Change-Id be a better thing to process and hyperlink than the refs? | 13:38 |
ttx | Damn it's more than 700,000, it's one per patchset | 13:38 |
Eighth_Doctor | ttx: yeah, it's a _lot_ | 13:39 |
Eighth_Doctor | Change-Ids are unique to Gerrit and are the way it tracks those things, is there a way to use that to link to the change review? | 13:39 |
ttx | It deleted 80 in 50ms in the script | 13:39 |
ttx | so about 14 hours for a million patchsets | 13:41 |
ttx | assuming 3 revs per change (average from fungi), would take about a day | 13:42 |
ttx | napkin math | 13:43 |
Eighth_Doctor | mordred: so the way that pagure renders commits doesn't seem to make the refs thing useful :( | 13:44 |
Eighth_Doctor | https://stg.pagure.io/openstack/nova/c/9dcc0941f1371c6e6852ad53bc6e6b04e0677d4d | 13:44 |
Eighth_Doctor | that might be worth fixing, not sure | 13:44 |
Eighth_Doctor | the commits list typically has these things, so it might be worth extending that view to support it | 13:44 |
Eighth_Doctor | mordred: what do you think would be more useful? a link via change-id (assuming that's possible) or a population in the commits view of refs/changes/* that link to gerrit? | 13:46 |
mordred | Eighth_Doctor: I think a view of a given refs/change is really only useful as something you might look at if you follow the link _from_ gerrit | 13:48 |
mordred | that said - I do think a link from change-id back to gerrit could be useful for people browsing normal commits | 13:48 |
mordred | Eighth_Doctor: https://review.opendev.org/#/q/I0d3b92506fab8f973bffe082cbfb2ab29cb0b8d0 is how you go to a change via change-id | 13:49 |
Eighth_Doctor | okay, that's neat | 13:49 |
Eighth_Doctor | I'm going to log that as an RFE and take a look at adding a feature for supporting that in pagure | 13:50 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Upgrade to gitea 1.11.4 https://review.opendev.org/720202 | 13:55 |
Eighth_Doctor | mordred: https://pagure.io/pagure/issue/4812 | 14:01 |
mordred | cool | 14:03 |
mordred | infra-root: I'm landing the patches to run zuul prod patches from zuul checkout - I'll be watching to make sure it all happens properly | 14:03 |
*** ykarel|afk is now known as ykarel | 14:03 | |
corvus | ttx: i agree with your analysis; we may be able to reconfigure gerrit not to replicate refs/changes, so if we did that, we could modify your process to: reconfigure gerrit to not replicate refs/changes; delete refs/changes asynchronously; enable zuul replication; disable gerrit. that would avoid a replication outage. | 14:06 |
mordred | ++ | 14:07 |
corvus | mordred: #zuul -> we're about to need to make a moderately complex change to the zuul deployment in order to support zk tls | 14:10 |
fungi | Eighth_Doctor: still catching up, but the most effective way to link back to gerrit reviews from the git repository is via the git "notes" it stores | 14:10 |
fungi | they used to be displayed by default by cgit, i think we need to configure gitea to do it (they didn't support alternative notes trees until somewhat recently and we haven't had time to revisit it since upgrading) | 14:11 |
corvus | mordred: but we don't have a solution for running the executor in docker yet, so i don't think we can convert everything to docker; should we do the new work in ansible instead of puppet? should we use windmill? | 14:11 |
mordred | corvus: re: gerrit - setting 'push' to +refs/heads/*:refs/heads/* should do the trick | 14:11 |
corvus | ttx: ^ | 14:11 |
mordred | corvus: I mean - I've got all the config bits converted - so I think it would be easier to just change the executor to pip install in that instead of trying to use windmill | 14:12 |
fungi | the way i had imagined that replication job was that it would just push the current head or tag when triggered, not try to push a full mirror every time it's invoked | 14:12 |
mordred | also - we'd have to add zk tls support to windmill and I don't know what paul's status for stuff like that is atm | 14:12 |
fungi | so i'm surprised it was deleting anything | 14:12 |
corvus | mordred: what do you mean you've got all the config bits converted? isn't zuul.conf still written by puppet? | 14:13 |
mordred | corvus: (we could also change all of it to run via pip instead of docker) | 14:13 |
mordred | corvus: my zuul patch ... one sec | 14:13 |
fungi | assuming it's in sync already, the job should be triggered for any update to a branch anyway (and a tag once we add it to the right pipelines) | 14:13 |
corvus | mordred: re windmill -- someone needs to, right? doesn't the ansible zuul run via windmill? | 14:13 |
mordred | corvus: https://review.opendev.org/#/c/717620/ | 14:13 |
openstackgerrit | Merged opendev/system-config master: Update install-ansible away from /opt/system-config https://review.opendev.org/719186 | 14:13 |
openstackgerrit | Merged opendev/system-config master: Run playbooks out of zuul checkout https://review.opendev.org/719190 | 14:13 |
mordred | corvus: that's a stab at converting our puppet use to using ansible instead - although it's obviously not going to work for the executor because of docker - but it's a start. but we could also use windmill - I didn't do that in this case because it seemed like a harder step | 14:14 |
corvus | mordred: oh! you sure did write that patch. :) | 14:15 |
corvus | mordred: i agree, updating that to s/docker/pip/ is probably easy and gets us to a place to use zk tls quickest | 14:15 |
mordred | corvus: do you think I should do that everywhere? or just on ze? | 14:15 |
corvus | mordred: well, it seems nodepool -> docker is already in progress, so a mixed env is a given; therefore, maybe just do that on ze | 14:17 |
mordred | corvus: kk. I'll update the patch | 14:25 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 14:28 |
mordred | that's just a rebase | 14:28 |
*** ysandeep|away is now known as ysandeep | 14:31 | |
Eighth_Doctor | fungi: git notes? | 14:33 |
fungi | Eighth_Doctor: notes refs | 14:33 |
fungi | https://git-scm.com/docs/git-notes | 14:33 |
fungi | that stuff | 14:33 |
Eighth_Doctor | hmm | 14:34 |
fungi | though gerrit doesn't use the default refs/notes/commits tree in case you're already using that for other purposes | 14:34 |
fungi | it uses a refs/notes/review tree | 14:35 |
fungi | but it stores the numeric vote values/dates/users, review link and related data in there | 14:36 |
ttx | mordred, corvus: would limiting the push not result in refs deletion ? (like what happens during the replication process for refs/changes already on GitHub ?) | 14:36 |
Eighth_Doctor | fungi: interesting | 14:36 |
ttx | or is it just additive | 14:36 |
fungi | ttx: i'm surprised we actually built that job to mirror all refs in the first place, i had thought it was just going to push refs for the branch or tag which triggered the build | 14:38 |
ttx | also where is that setting 'push' to +refs/heads/*:refs/heads/* happening ? replication.config? | 14:40 |
corvus | ttx: oh, that's a good point, it may well do that. | 14:40 |
ttx | answering my last question :yes | 14:40 |
fungi | ttx: yeah, it'll be in the replication config | 14:40 |
ttx | push = +refs/heads/*:refs/heads/* | 14:41 |
ttx | push = +refs/tags/*:refs/tags/* | 14:41 |
ttx | probably both | 14:41 |
corvus | ttx: we still might want to consider that though; i think we dedicate one thread to github replication; we could increase that to two, which would mean replication is slow, but would allow other work to happen while nova was 'stuck' | 14:41 |
ttx | ok will give it some extra thought | 14:42 |
mordred | I'm not sure it would push deletes | 14:42 |
mordred | I think with the mirror script it's mirroring all refs, so that means it's going to try to mirror in the refs/changes namespace, meaning pushing deletes | 14:42 |
ttx | on mirroring it definitely deletes remote extra refs | 14:43 |
mordred | if we limit the ref namespaces gerrit is pushing | 14:43 |
mordred | then I don't think it would push empty refs/changes to delete things | 14:43 |
mordred | that would be pushing ref information for a namespace we told it not to push | 14:43 |
corvus | yeah, i don't know for sure. that sounds plausible. | 14:43 |
mordred | we can try this out on review-dev | 14:43 |
corvus | ++ | 14:44 |
mordred | but I'm gonna put my money on it being safe to configure gerrit to just stop replicating them | 14:44 |
mordred | and then being able to run a cleanup script | 14:44 |
*** mlavalle has joined #opendev | 14:47 | |
fungi | it's still not clear to me, why have the job replicate everything each time it runs and not just the branch or tag for which the build was triggered? | 14:47 |
fungi | branch and tag updates won't happen outside gerrit typically anyway, so zuul will receive events for those and then run the job | 14:48 |
corvus | fungi: that's a fair question. perhaps to catch up after previous errors? maybe that's low-risk though? | 14:49 |
corvus | or maybe it could be configured not to delete | 14:49 |
fungi | the only one i'd worry about is missed tags, but maybe if the job is triggered by a tag then push all tags, but branches will eventually get new commits | 14:49 |
fungi | just seems unnecessary to have it try to replicate the entire repository when the triggering event was a new commit merging to a single branch, and the job gets run each time that happens | 14:50 |
fungi | (to be honest, i had it in my head that was the design, and didn't realize until now that wasn't how it was working) | 14:51 |
corvus | speaking of replication... i think we may have a gitea backend out of sync; i'm seeing different data pulling zuul updates | 14:51 |
cmurphy | frickler: ianw dirk "virtualenv: command not found" the pip-and-virtualenv element was removed from the image build https://review.opendev.org/718299 ??? | 14:52 |
cmurphy | can we put it back? keystone needs this | 14:52 |
fungi | cmurphy: the idea was that the tox parent job would start installing virtualenv, i think | 14:53 |
AJaeger | cmurphy: ianw is working on this stack: https://review.opendev.org/#/c/718224/ | 14:54 |
fungi | (and no, a big part of the delay in the suse image updates was so that we didn't have to work out installing pip, virtualenv and tox into the system context, since we're going that direction for the other distros as well, fedora is already like that apparently) | 14:54 |
AJaeger | once that's merged all should be green again | 14:54 |
AJaeger | cmurphy: best discuss with ianw once he's awake. We expected that what we had would work already as is. | 14:55 |
cmurphy | AJaeger: great! can we merge that asap? | 14:55 |
cmurphy | :( ianw won't be awake till the end of my day | 14:55 |
AJaeger | cmurphy: ianw wants to merge once he's around - but corvus just left a -1 on https://review.opendev.org/#/c/717663/24 | 14:56 |
corvus | i thought the "plain" image was being used to work through this? | 14:57 |
corvus | i didn't think anything was removed from the main images yet | 14:57 |
AJaeger | corvus: https://review.opendev.org/718299 - we had problems building the opensuse images as well | 14:58 |
AJaeger | So, between a rock and a hard place ;( | 14:58 |
corvus | perhaps we should revert that as cmurphy says? because if we override my -1 we're going to break other zuul installs | 14:59 |
corvus | i haven't looked into how long it would take to fix my -1, it's probably not too hard, but i'm certainly not up to speed | 15:00 |
AJaeger | corvus: 718299 was needed to fix image builds that have been broken for ages ;( | 15:00 |
fungi | we can roll back to months-old images maybe, if we still have them hanging around | 15:02 |
cmurphy | the months-old image was semi-working for me with workarounds | 15:03 |
mordred | corvus: if you have a sec - could you look at https://zuul.opendev.org/t/openstack/build/da3cec0713204f22982e65d5ac420a8c/log/job-output.txt#78 ? | 15:03 |
mordred | corvus: that's trying to use mirror-workspace-git-repos when talking to bridge - it seems to be having a sad but I'm not 100% sure what the issue would be - did I use the wrong role here? | 15:04 |
mordred | corvus: I'm starting to think maybe I was supposed to use prepare-workspace instead | 15:04 |
mordred | corvus: no - I guess prepare-workspace does the synchronize | 15:05 |
corvus | mordred: in a bit... | 15:05 |
*** lpetrut has quit IRC | 15:05 | |
corvus | what should we do about the keystone situation? | 15:05 |
corvus | revert and rollback? or are we going to say "sorry it's broken for a day or two"? | 15:06 |
corvus | are there any other options? | 15:06 |
mordred | I think revert and rollback | 15:07 |
mordred | and then the re-revert needs to take this issue into account | 15:08 |
mordred | because I think it was the assumption that this wouldn't break things | 15:08 |
corvus | are we talking about the opensuse-15 image? | 15:08 |
mordred | I believe so? | 15:08 |
yoctozepto | morning folks | 15:09 |
yoctozepto | it's etherpad again :-( | 15:09 |
yoctozepto | https://etherpad.opendev.org/p/KollaWhiteBoard | 15:09 |
yoctozepto | no worky | 15:09 |
yoctozepto | only loady | 15:09 |
corvus | it looks like we have deleted all of the months-old opensuse-15 images | 15:09 |
corvus | oh wait | 15:10 |
yoctozepto | An error occurred | 15:10 |
yoctozepto | The error was reported with the following id: 'LxotxdY5BrhtpIZtbDud' | 15:10 |
corvus | we still have one that's 69 days old on nb02 | 15:10 |
mordred | corvus: well, that one won't have the pip-and-virtualenv element removed - maybe that's the most recent? | 15:10 |
corvus | yeah, i think if we revert back to nb02, it'll upload those | 15:11 |
fungi | priteau just mentioned in #openstack-infra that etherpad.opendev.org is unresponsive. i'll take a look | 15:12 |
mordred | fungi: while you're looking, see issue from yoctozepto above about that etherpad too | 15:12 |
openstackgerrit | James E. Blair proposed openstack/project-config master: Revert "Move suse builds to nb04, drop pip-and-virtualenv" https://review.opendev.org/720223 | 15:12 |
corvus | mordred: i think the local apache mirrors for gerrit may be out of date | 15:13 |
corvus | mordred: maybe we missed a bind mount? | 15:13 |
yoctozepto | mordred, fungi: oh well, I asked priteau if it worked for him :-) | 15:13 |
mordred | corvus: looking | 15:13 |
corvus | AJaeger, fungi, mordred, cmurphy: see 720223 ^ | 15:13 |
yoctozepto | did not think he would crossreport :-) | 15:13 |
fungi | yoctozepto: oh, well thanks, i missed your mention of it in here, sorry, there's been a lot of discussion going on | 15:14 |
yoctozepto | fungi: sure, no problem | 15:14 |
AJaeger | thanks, corvus | 15:15 |
mordred | corvus: yes. | 15:15 |
fungi | the server itself is definitely up and reachable over ssh | 15:15 |
cmurphy | thanks corvus | 15:15 |
fungi | and "node node_modules/ep_etherpad-lite/node/server.js" is running since some time on monday | 15:15 |
fungi | and it's suddenly responding to me again via browser, i didn't change anything | 15:16 |
fungi | load average is low | 15:16 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Add /opt/lib/git to the volume mounts https://review.opendev.org/720225 | 15:17 |
fungi | nothing going haywire with the kernel per dmesg | 15:17 |
fungi | cacti doesn't show anything particularly anomalous either | 15:19 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use prepare-workspace-git in production playbook https://review.opendev.org/720227 | 15:19 |
mordred | corvus: ^^ I believe the first of those will fix the gerrit local git replica issue | 15:20 |
mordred | corvus: and the second should fix my git repo replication issue | 15:21 |
mordred | actually ... let me change that | 15:21 |
fungi | yoctozepto: so far i'm not finding anything on the server to explain the temporary outage. https://etherpad.opendev.org/p/KollaWhiteBoard is loading now too | 15:22 |
fungi | might have been network-related, but i'm going to dig deeper in logs | 15:22 |
yoctozepto | fungi: yes, thanks; I can only offer this id LxotxdY5BrhtpIZtbDud | 15:22 |
yoctozepto | maybe it's greppable or something :D | 15:23 |
fungi | i'm checking | 15:23 |
yoctozepto | duck, I got another failure | 15:24 |
yoctozepto | hPmlyLrRaK3KZKl6i4OY | 15:24 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Just use synchronize to sync the repos https://review.opendev.org/720227 | 15:24 |
corvus | those are both "Uncaught TypeError: Cannot read property 'setStateIdle' of null" | 15:26 |
mordred | corvus: ^^ I think that's a better approach for our use on bridge | 15:26 |
fungi | i find a couple of recent proxy errors apache logged at 15:06:22z | 15:27 |
fungi | "AH01102: error reading status line from remote server localhost:9001" and "AH00898: Error reading from remote server returned by /socket.io/" | 15:27 |
fungi | those may be unrelated though | 15:27 |
yoctozepto | this must be nodejs looking at that message | 15:28 |
mordred | yup | 15:28 |
fungi | do we use docker-compose to view etherpad's service logs now? are those written to disk in the chroot or spewed on stdout/stderr? | 15:28 |
mordred | fungi: spewed | 15:28 |
mordred | fungi: cd /etc/etherpad-docker ; docker-compose logs | 15:28 |
mordred | fungi: will get you the spew | 15:28 |
mordred | (-f will tail) | 15:28 |
fungi | appreciated | 15:29 |
openstackgerrit | Marcin Juszkiewicz proposed openstack/project-config master: Add CentOS 8 AArch64 nodes https://review.opendev.org/720167 | 15:29 |
corvus | fungi: i ran: "docker logs etherpaddocker_etherpad_1|grep -C 8 LxotxdY5BrhtpIZtbDud" to get the error message above | 15:29 |
*** ykarel is now known as ykarel|away | 15:29 | |
corvus | https://github.com/ether/etherpad-lite/issues/3405 is relevant | 15:29 |
mordred | corvus, fungi: I know there's a ton of things going on - but could I get a quick +A on 720227? we're dead in the water on bridge without it, and that means we can't land the nodepool revert that we need for the keystone issue | 15:30 |
mordred | (when it rains it pours) | 15:30 |
corvus | mordred: oh :( well i wanted to really look into that | 15:31 |
corvus | but if we need to just merge it to put out fires sure | 15:31 |
corvus | how did we get into that situation though? | 15:31 |
clarkb | mordred: is there a tl;dr of nodepool issue? | 15:31 |
corvus | oh i guess this only runs in prod | 15:31 |
yoctozepto | 17:30:39 <mordred> (when it rains it pours) | 15:31 |
yoctozepto | ++ | 15:31 |
mordred | corvus: we merged the "run from git" patch - and it failed being unable to sync the git repos to bridge | 15:31 |
mordred | yeah | 15:31 |
*** redrobot has joined #opendev | 15:32 | |
clarkb | and nodepool change I assume is related to the opensuse things? | 15:32 |
clarkb | note I think opensuse like fedora3X was not actually building on the old setups | 15:32 |
mordred | corvus: yeah - it's unfortunate timing - when I clicked +A it was quiet | 15:32 |
clarkb | so a revert is unlikely to fix anything | 15:32 |
corvus | clarkb: revert+rollback is the proposal | 15:32 |
mordred | clarkb: there is a 69 day old opensuse image | 15:32 |
clarkb | ah ok if we still have old image then we are good | 15:32 |
mordred | yeah | 15:32 |
mordred | clarkb: but we need https://review.opendev.org/#/c/720227/ to be able to land the revert | 15:33 |
clarkb | in the case of opensuse it isn't building due to python2 changes | 15:33 |
mordred | clarkb: so if you have a quick morning second | 15:33 |
clarkb | so its directly related to the change made to the image build, not to anything in the builder itself | 15:33 |
clarkb | basically you can't have a working opensuse with python2 now or something | 15:33 |
corvus | the main difference i see between our etherpad config and https://github.com/ether/etherpad-lite/issues/2318#issuecomment-63548542 is we don't have a timeout setting | 15:34 |
mordred | corvus: I agree re: timeout | 15:35 |
corvus | mordred: wait i don't understand your comment about "delete: false" | 15:35 |
corvus | mordred: that just means that synchronize won't delete files (which could cause errors) | 15:36 |
clarkb | corvus: oh I was just getting to that :) | 15:36 |
corvus | i mean, i'm still okay with +2 meaning -1 just to try to dig out of this hole | 15:37 |
clarkb | I think in the context of a git repo that's not a good thing to have set to false | 15:37 |
corvus | yeah. it will probably work okay for the next couple of changes we land | 15:37 |
mordred | corvus: oh - want me to put that back in? I was mostly just thinking we don't want to delete and repush project-config over and over | 15:37 |
clarkb | mordred: well I think the best way to handle it is to use git push | 15:37 |
corvus | i don't know why that would "delete and repush" | 15:37 |
clarkb | not rsync | 15:37 |
mordred | corvus: not all jobs have project-config in their required-projects | 15:38 |
corvus | that just means "delete files on the remote side that aren't on this side" | 15:38 |
corvus | oh | 15:38 |
corvus | let's just merge this and replace it with the right role | 15:38 |
mordred | kk | 15:38 |
mordred | yeah - this should work until we can breathe and dig in better | 15:38 |
clarkb | ok I've approved it | 15:38 |
mordred | thx | 15:38 |
clarkb | but ya I think we want a role that does a git push | 15:38 |
clarkb | (and it can skip pushing if the source doesn't exist) | 15:39 |
corvus | do we need to make sure that 720223 lands after that? | 15:39 |
mordred | corvus: yes | 15:39 |
yoctozepto | eh, etherpad does not like me :/ | 15:39 |
openstackgerrit | James E. Blair proposed openstack/project-config master: Revert "Move suse builds to nb04, drop pip-and-virtualenv" https://review.opendev.org/720223 | 15:39 |
mordred | corvus: +A | 15:40 |
mordred | corvus: thanks - and sorry for timing there | 15:40 |
corvus | mordred: np; what was wrong with prepare-workspace-git? | 15:40 |
mordred | corvus: it might be the right choice - we were using mirror-workspace-git | 15:40 |
corvus | mordred: here's a handy cheat sheet: http://lists.zuul-ci.org/pipermail/zuul-discuss/2020-April/001216.html | 15:41 |
mordred | corvus: I wasn't 100% sure prepare-workspace-git was the right thing to use and figured the simple rsync would _definitely_ work in this case | 15:41 |
corvus | mordred: prepare-workspace-git calls mirror-workspace-git | 15:41 |
corvus | so what went wrong with mirror-workspace-git? | 15:41 |
mordred | corvus: there were no git repos on the remote side to push to | 15:42 |
corvus | mordred: ah, then prepare-workspace-git may well work | 15:42 |
mordred | it tried to git config them and got an error "you can't do that without a git repo" | 15:42 |
corvus | because prepare-workspace-git does the "use a cache if it's there, otherwise git init" step i believe | 15:42 |
mordred | corvus: yeah. I believe that's accurate | 15:42 |
mordred | ++ | 15:43 |
corvus | mordred: okay, want to push up a prepare-workspace-git change, and we can merge it after the nodepool change lands? | 15:43 |
mordred | fwiw - we could handle the "delete and re-push project-config over and over again" by having the bridge playbook maintain an /opt/git cache of both | 15:43 |
mordred | corvus: ++ | 15:43 |
corvus | mordred: i think using this role will avoid the issue; it's not going to delete any repos already in the workspace | 15:44 |
mordred | it will - it'll call mirror-workspace-git at the end which will do the rsync --delete | 15:44 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: dhall-diff: add new job https://review.opendev.org/718694 | 15:44 |
corvus | mordred: mirror-workspace-git-repos doesn't use rsync | 15:45 |
AJaeger | config-core, could you review https://review.opendev.org/720197 - needed for release, please | 15:45 |
mordred | corvus: sigh. so you are right :) | 15:46 |
mordred | corvus: so yay | 15:46 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Switch to prepare-workspace-git https://review.opendev.org/720231 | 15:47 |
mordred | clarkb corvus :^^ | 15:47 |
clarkb | AJaeger: I guess we are ok with dealing with cases where 3.6 doesn't imply 3.8 when they come up? | 15:48 |
fungi | tailing etherpad logs, i see quite a few errors related to that KollaWhiteBoard pad | 15:49 |
yoctozepto | hZUCejCWzEctlM9uquL5 - another token of despise from etherpad | 15:49 |
yoctozepto | fungi: it's probably me | 15:49 |
yoctozepto | it fails for me and other folks | 15:49 |
fungi | not just "Uncaught TypeError: Cannot read property \'setStateIdle\' of null" but also some others | 15:49 |
yoctozepto | we are having a kolla meeting, must be the reason | 15:49 |
AJaeger | clarkb: for now yes | 15:49 |
clarkb | fungi: yoctozepto the other day we were theorizing with subline that it is the client | 15:49 |
yoctozepto | chrome 81? :D | 15:50 |
yoctozepto | too fast? too slow? | 15:50 |
mordred | clarkb: that some browsers are doing a bad thing? | 15:50 |
yoctozepto | too awesome? | 15:50 |
clarkb | yoctozepto: not specific versions of browsers but state in your browser | 15:50 |
yoctozepto | lemme try another | 15:50 |
fungi | "Error: Can't apply USER_CHANGES, because Trying to submit changes as another author in changeset ..." | 15:50 |
clarkb | so try another or try private browsing mode | 15:50 |
yoctozepto | yeah, I tried incognito already | 15:50 |
clarkb | fungi found a bug from etherpad that showed etherpad is super sensitive to client activity too :/ | 15:51 |
prometheanfire | ian: fungi: have time to look at https://review.opendev.org/717339 ? (glean systemd-resolved thing) | 15:51 |
fungi | "[ERROR] console - (node:1) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'colorId' of null" | 15:51 |
clarkb | and the bug fix proposed was client side too I think | 15:51 |
clarkb | rather than making the server robust | 15:51 |
fungi | "[WARN] client - TypeError: pad.collabClient is null" | 15:52 |
yoctozepto | hmm, it works in firefox, but seems sluggish | 15:53 |
corvus | fwiw, the kolla pad has been working for me without issue in ff for a while, but i haven't been writing | 15:53 |
yoctozepto | (to load) | 15:53 |
clarkb | AJaeger: also that sed seems to do the same replacement? | 15:53 |
clarkb | AJaeger: it replaces python_version==3.8 to python_version==3.8 can you double check that? | 15:54 |
AJaeger | clarkb: that's correct - it should update versions. Let me double check... | 15:55 |
AJaeger | clarkb: we need to use '$VERSION' - that's the difference | 15:56 |
clarkb | AJaeger: got it | 15:56 |
AJaeger | thx | 15:58 |
openstackgerrit | Merged zuul/zuul-jobs master: Adds roles to install and run hashicorp packer https://review.opendev.org/709292 | 16:01 |
fungi | saw a similar setStateIdle warning pop up for an unrelaetd pad | 16:02 |
fungi | unrelated | 16:02 |
fungi | i wonder if we're just running into tuning errors and today is the first day we've got the new deployment under typical load | 16:03 |
yoctozepto | yikes, it finally loaded | 16:04 |
fungi | i picked another pad i saw scroll by in the logs and am getting indefinite "loading" from it | 16:04 |
yoctozepto | fungi could be right | 16:04 |
fungi | okay, the one i was trying to load finally loaded | 16:05 |
openstackgerrit | Brian Rosmaita proposed openstack/project-config master: Change gerrit ACLs for cinder-tempest-plugin https://review.opendev.org/720235 | 16:05 |
AJaeger | fungi, could you review https://review.opendev.org/720197 , please? | 16:05 |
openstackgerrit | Brian Rosmaita proposed openstack/project-config master: Change gerrit ACLs for cinder-tempest-plugin https://review.opendev.org/720235 | 16:09 |
fungi | okay, spotted a "[WARN] client - TypeError: r.dropdowns is undefined" for https://etherpad.opendev.org/p/octavia-priority-reviews which is likely related to https://github.com/ether/etherpad-lite/issues/3464 and the later https://github.com/ether/etherpad-lite/issues/3861 | 16:14 |
fungi | checking to see if that pad is broken | 16:14 |
openstackgerrit | Merged openstack/project-config master: Update update_constraints for Py3.8 https://review.opendev.org/720197 | 16:19 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 16:22 |
mordred | corvus: ^^ ok - that's now updated to use pip to install the executor | 16:23 |
mordred | YAY the system-config fix patch failed on a puppet module remote repo being unreachable | 16:24 |
corvus | i'll re-enqueue it | 16:25 |
mordred | ok - cool | 16:25 |
corvus | mordred: it doesn't make sense to land 231 right after 227 | 16:27 |
corvus | mordred: which do you want? | 16:28 |
mordred | corvus: why don't we just do 231 | 16:28 |
corvus | i'm starting to think we should just put all of nodepool in the emergency file and manually fix it | 16:28 |
mordred | corvus: yeah | 16:29 |
corvus | because we're now at 1 hour past our decision to rollback and have made no progress on actually doing it | 16:29 |
mordred | corvus: i'll add nodepool to emergency | 16:29 |
corvus | i'll start logging into the builders | 16:30 |
mordred | corvus: we just need the builders right? | 16:31 |
corvus | mordred: yeah | 16:32 |
mordred | k. done | 16:32 |
*** rpittau is now known as rpittau|afk | 16:33 | |
fungi | to follow up on my earlier speculation, https://etherpad.opendev.org/p/octavia-priority-reviews doesn't seem permanently broken (it loaded for me at least) so that "r.dropdowns is undefined" warning is apparently not always accompanied by a broken pad | 16:34 |
johnsom | fungi FYI, we just noticed that pad won't open for some of us anymore. It times out. | 16:35 |
johnsom | Rough time to lose our priority planning etherpad I have to say. | 16:35 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Switch to prepare-workspace-git https://review.opendev.org/720231 | 16:35 |
fungi | johnsom: yeah, i'm starting to suspect tuning issues. we just switched our deployment to use containers so may be hitting different performance limitations, but we also upgraded to a newer etherpad release so could just be hitting new regressions in the software | 16:36 |
corvus | this is nice -- it's easy to revert nb04 since its nodepool.yaml is a git checkout | 16:37 |
corvus | but i have to copy the file on nb01 and nb02 | 16:37 |
mordred | corvus: yay for things being nicer in the future | 16:38 |
mordred | corvus: actually - I think the not-yet-landed project-config would make it back in to files - but I think "I want to easily revert a change in an emergency" is a good use case, so I'll make sure we retain the nb04 behavior when we roll that out | 16:39 |
mordred | corvus: nope - nevermind - it'll stay a git repo | 16:40 |
johnsom | https://www.irccloud.com/pastebin/afzjDqgX/ | 16:42 |
johnsom | fungi FYI ^^^ | 16:43 |
fungi | johnsom: yep, that's another of the "[WARN] client - Uncaught TypeError: Cannot read property 'setStateIdle' of null" events | 16:48 |
clarkb | fungi: does the apache status page show us filling up on connections? | 16:49 |
openstackgerrit | Merged opendev/system-config master: Just use synchronize to sync the repos https://review.opendev.org/720227 | 16:50 |
corvus | clarkb, fungi, mordred: okay i think the configs are reverted on nb01, nb02, and nb04 | 16:50 |
fungi | clarkb: i was just working on trying to connect to it, we do seem to be flat-lining at 500 established per cacti, lower than typical before the switch as you can see at http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=116&rra_id=all | 16:50 |
corvus | i think i should now delete the opensuse-15-0000086893 and opensuse-15-0000086892 dib images? | 16:51 |
corvus | that should then prompt nb02 to upload the opensuse-15-0000053000 image? | 16:51 |
mordred | corvus: yes - and nb should upload the old one | 16:51 |
mordred | yeah | 16:51 |
clarkb | fungi: I wonder if our old tuning isn't applying because bionic apache uses a new mpm worker system compared to xenial | 16:52 |
clarkb | fungi: but ya that seems like a good thread to pull on | 16:52 |
fungi | i've confirmed that the new deployment at least hasn't broken reaching https://etherpad.opendev.org/server-status from a local shell on the server (and also hasn't inadvertently exposed it to the public) | 16:54 |
fungi | firefox always likes to pick the worst possible times to tell me it needs to restart for an upgrade | 16:56 |
fungi | the scoreboard still has quite a few open slots | 16:58 |
corvus | okay, we have no opensuse-15 images now; i don't see an upload happening yet | 16:58 |
fungi | 148 requests currently being processed, 2 idle workers | 16:59 |
fungi | claims we're still using the "event" mpm | 16:59 |
corvus | i wonder if it's because there's still a deleted image in vexxhost for it | 16:59 |
mordred | corvus: I didn't think we blocked on that - but maybe I'm wrong? | 17:01 |
corvus | instance d2d73e84-d988-4605-a596-b0ddef9b2b23 in vexxhost has been deleting for 18 days | 17:01 |
mordred | that didn't block it from uploading the new image last night | 17:01 |
clarkb | corvus: that's "normal" because openstack | 17:01 |
corvus | mordred: right but this is an *old* image | 17:01 |
mordred | good point | 17:01 |
corvus | it's an image that already has existing zk records because it "exists" because it's "deleting" | 17:02 |
corvus | can anyone try deleting that instance while i try to figure out what nodepool should do in this case? | 17:02 |
clarkb | fungi: taking a quick look at the server we have /var/log/apache2 logs for gerrit vhost (I think that must be copy paste error taking gerrit ansible and adapting it for etherpad) we should clean that up | 17:02 |
mordred | corvus: I'll take a stab at it | 17:03 |
fungi | clarkb: yeah, i saw that too | 17:03 |
clarkb | fungi: I'll take a look at that now while I'm thinking about it | 17:03 |
fungi | interestingly we're logging traffic in those | 17:04 |
clarkb | mordred: corvus ime there are two states. One is where volume is attached to a server that does not exist. That we can clean up by removing the attachment and deleting the volume. The other is server refuses to delete which keeps the whole resource chain alive. That requires cloud intervention | 17:04 |
fungi | oh, but they're etherpad access requests | 17:05 |
clarkb | also I do not think that would affect nodepool's ability to make new images | 17:05 |
corvus | we don't want it to make a new image | 17:05 |
corvus | we want it to upload an old image | 17:05 |
openstackgerrit | Merged openstack/project-config master: Revert "Move suse builds to nb04, drop pip-and-virtualenv" https://review.opendev.org/720223 | 17:06 |
mordred | corvus: I don't see that instance in vexxhost - by instance you mean server here right? | 17:07 |
corvus | mordred: yes | 17:08 |
corvus | \| 0014437332 \| vexxhost-sjc1 \| opensuse-15 \| d2d73e84-d988-4605-a596-b0ddef9b2b23 \| 38.108.68.90 \| 2604:e100:3:0:f816:3eff:fe52:b724 \| deleting \| 18:02:57:09 \| locked \| | 17:08 |
mordred | corvus: thanks | 17:08 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Fix etherpad port 80 logging https://review.opendev.org/720245 | 17:08 |
clarkb | fungi: ^ | 17:08 |
corvus | mordred, clarkb, fungi, AJaeger, cmurphy: i figured out why nodepool isn't uploading the old image | 17:08 |
mordred | yeah? | 17:09 |
corvus | it is no longer on the filesystem of the nodepool image builder | 17:09 |
mordred | oh | 17:09 |
corvus | so we've lost it | 17:09 |
AJaeger | oops ;( | 17:09 |
fungi | ahh, right, we "fixed" that in nodepool | 17:09 |
corvus | fungi: we what? | 17:09 |
clarkb | corvus: fungi the change was once all images were in a deleting state we could delete the image from disk | 17:09 |
fungi | because before, it would pile up local copies of images on the builders until they could be completely deleted from all the providers | 17:09 |
corvus | it's not in a deleting state, it's ready. | 17:09 |
clarkb | rather than waiting for the image to delete from the cloud because what was happening is vexxhost was failing to delete many images and then our disks filled up | 17:09 |
corvus | this was deleted out from under nodepool. | 17:09 |
fungi | oh, then that's different | 17:10 |
corvus | though, also, that's a really unfortunate nodepool behavior | 17:10 |
fungi | what got fixed was to have the builder delete the local copy once it told all providers to delete their copies, whether or not the delete command was successful/completed | 17:10 |
clarkb | \| 0000053000 \| 0000000002 \| vexxhost-sjc1 \| opensuse-15 \| opensuse-15-1580919581 \| c5b3b55a-4c74-4d41-998c-265342ab3afc \| deleting \| 33:14:40:58 \| | 17:10 |
clarkb | is it that image? because it is deleting | 17:10 |
clarkb | fungi: right because otherwise we'd need like 10TB of disk | 17:11 |
corvus | that's the *upload* not the diskimage | 17:11 |
corvus | \| opensuse-15-0000053000 \| opensuse-15 \| nb02 \| qcow2,raw,vhd \| ready \| 70:01:24:58 \| | 17:11 |
corvus | that is the image that nodepool told us is ready to be uploaded | 17:11 |
fungi | because nodepool has no control over whether providers actually follow through on image delete requests, so we were filling up the hard drives when providers failed to be able to process a delete for various reasons | 17:11 |
corvus | yeah, i get it | 17:11 |
corvus | so 1 of 2 things happened here: either one of us deleted the image from disk behind nodepool's back to free up space | 17:11 |
corvus | or, somehow this new behavior change we made to nodepool did apply to this case, in which case, we seem to have programmed our software to lie to us | 17:12 |
clarkb | corvus: I think that may be the case because we can't remove the image record until all uploads are done due to the zk fs hierarchy? and maybe that's a bug where we need to update the state on the dib build as a result? | 17:12 |
corvus | either way, we just blew 2 hours of work | 17:12 |
corvus | because it said "ready" when it wasn't | 17:12 |
corvus | clarkb: if nodepool deleted the diskimage, then there is no excuse for it saying "ready". we have "deleting" for that. | 17:13 |
clarkb | basically the record can't go away until all the uploads go away so we need to update that record state and it may be a bug that we don't (I haven't checked that in the code) | 17:13 |
clarkb | corvus: I get that, but code has bugs | 17:13 |
clarkb | it's clearly not intentional if that is the case | 17:13 |
corvus | mordred: can you check the openstack state of the image with id c5b3b55a-4c74-4d41-998c-265342ab3afc ? | 17:14 |
mordred | corvus: it shows active | 17:14 |
mordred | corvus: is that the image we want? | 17:14 |
corvus | clarkb: yes, we agree that if that is the case, then it's a bug | 17:15 |
clarkb | mordred: yes that that the copy of the image in vexxhost sjc1 | 17:15 |
mordred | corvus: we can download it from openstack | 17:15 |
clarkb | *that is | 17:15 |
corvus | mordred: yes. so we may be able to convince nodepool to continue to use that | 17:15 |
mordred | corvus: why don't we download it as well, just to be on the safe side | 17:15 |
corvus | mordred: first thing: were you successful in deleting that instance? | 17:15 |
clarkb | I wasn't around when all of this was originally debugged. Did we decide we can't roll forward for some reason? (thinking about options here) | 17:15 |
fungi | clarkb: there are issues raised with the next steps job config changes | 17:16 |
corvus | clarkb: see my -1 on https://review.opendev.org/717663 | 17:16 |
clarkb | fungi: right but one that is easily fixable | 17:16 |
mordred | corvus: no - I do not see it when I look for it - which is very strange to me | 17:16 |
corvus | mordred: neat, at least we're in stasis | 17:16 |
corvus | mordred: then yes, let's start by downloading that | 17:17 |
mordred | corvus: ok. I'm going to do that now | 17:17 |
fungi | clarkb: seemed mostly a decision as to whether it would be more work/faster/better guaranteed to return to a known state | 17:17 |
fungi | though i think it was also assumed at the time that rolling back to the older image would be relatively easy | 17:18 |
corvus | yep, and that is SOP in situations like this | 17:18 |
clarkb | ya, I think the thing that makes this odd is we haven't been able to build that image for months (very similar to the fedora situation) | 17:19 |
clarkb | normally I would agree | 17:19 |
clarkb | and probably would have this morning. Just wanting to make sure the other options were considered too (and if so what counted against them) | 17:19 |
mordred | corvus: I am downloading the image to /opt/nodepool_dib/opensuse-image.save.raw on nb02 | 17:21 |
mordred | ~/osc/bin/openstack --os-cloud=vexxhost --os-region-name=sjc1 image save c5b3b55a-4c74-4d41-998c-265342ab3afc --file=/opt/nodepool_dib/opensuse-image.save.raw | 17:21 |
mordred | fwiw | 17:21 |
corvus | mordred: cool, i think when that finishes we probably want to make md5 and sha256 files, then copy that to opensuse-15-0000053000.raw | 17:23 |
corvus | nodepool also expects qcow2 and vhd | 17:24 |
corvus | maybe we can just let it fail those uploads? | 17:24 |
corvus | or maybe we can edit the zk record | 17:24 |
mordred | we could convert them | 17:24 |
corvus | or that | 17:24 |
mordred | we have the conversion tools on the host after all | 17:24 |
clarkb | note that nodepool may try to delete them again if that image does end up deleting (periodic cleanup by provider maybe) | 17:24 |
fungi | checking the etherpad apache server-status periodically, we have 5 of the currently 11 running workers perpetually in "stopping" state due to being on an old config generation, so not accepting connections. though i don't think that's currently causing issues because there are as many open slots for more worker processes too | 17:27 |
fungi | i take that to mean we've updated the apache config since the parent started, and those workers are in a graceful shutdown but still have existing clients who haven't closed out (or where the line has gone dead and apache doesn't know they're never coming back) | 17:29 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use project-config from zuul instead of direct clones https://review.opendev.org/719343 | 17:29 |
corvus | mordred: not the speediest process is it? | 17:31 |
mordred | corvus: nope | 17:32 |
mordred | corvus, clarkb : ^^ I had to rebase that patch due to merge conflict | 17:33 |
corvus | mordred: do you know how to do those conversions? | 17:34 |
fungi | i spot-checked one of the "Cannot read property 'setStateIdle' of null" hits in the log just now and found it correlated to a request which started for the old domain (determined through correlation with /var/log/apache2/etherpad.openstack.org_access.log since we're logging that redirect vhost separately). will try to see if that is consistent | 17:34 |
mordred | corvus: I can pull it out of the dib source | 17:35 |
corvus | mordred: cool, if we want to do that, now's probably a good time to get that ready | 17:35 |
corvus | mordred: is the d/l finished? | 17:36 |
mordred | corvus: yes - it just finished | 17:36 |
mordred | cp $TMP_IMAGE_PATH $1-intermediate | 17:36 |
mordred | vhd-util convert -s 0 -t 1 -i $1-intermediate -o $1-intermediate | 17:36 |
mordred | vhd-util convert -s 1 -t 2 -i $1-intermediate -o $1-new | 17:36 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Check if pip is preinstalled before installing it https://review.opendev.org/720254 | 17:36 |
corvus | mordred: cool, i'll do the work for the raw image if you want to convert | 17:36 |
mordred | that's the "convert to vhd" steps | 17:36 |
corvus | mordred: or i can do those too | 17:36 |
mordred | I can do the converts | 17:36 |
corvus | ok | 17:36 |
mordred | you want me to wait until you've renamed? | 17:36 |
mordred | or should we cp and keep this as-is just in case? | 17:37 |
corvus | mordred: i was going to copy, in order to avoid nodepool deleting it :) | 17:37 |
mordred | yeah | 17:37 |
clarkb | I think https://review.opendev.org/720254 addresses corvus and tristanC's concern with ensure-pip | 17:37 |
clarkb | I'm hoping that existing tests will help point me in the right direction as far as testing goes (because so many of our images come preinstalled with pip) | 17:37 |
corvus | i was told there was a 'plain' image to verify this stuff | 17:38 |
clarkb | corvus: oh right that one should be "clean" lets see if it runs tests yet | 17:38 |
mordred | clarkb: we should make sure that the 3pci shows it doesn't try to reinstall | 17:38 |
mordred | corvus: wow. even just copying the image is slow | 17:39 |
corvus | mordred: i'm several minutes into an md5sum | 17:39 |
clarkb | mordred: ya if it clears our initial tests I can rebase into ianw's stack and that should have 3pci run it | 17:39 |
tristanC | clarkb: commented | 17:39 |
clarkb | tristanC: because `pip` should always be present regardless of python version | 17:40 |
clarkb | then we check version specifics based on what is enabled | 17:40 |
tristanC | clarkb: i meant the change assigns shell variables using a jinja `if` statement, and then it evaluates content based on a shell `if` statement. couldn't the type command be selected by the jinja `if` statement? | 17:41 |
clarkb | tristanC: it could but I thought it was easier to set flags (basically translate yaml truthyness to bash truthyness) then evaluate the results in a bash context | 17:42 |
clarkb | this way you don't have to parse jinjayamlbash | 17:42 |
clarkb | and instead its jinjayaml then bash | 17:43 |
tristanC | clarkb: hmm ok | 17:44 |
tristanC | iirc, a tox user who wants python2 needs to set both `tox_prefer_python2: true` and `ensure_pip_from_packages_with_python2` ? | 17:47 |
clarkb | tristanC: or ensure_pip_from_upstream and ensure_pip_from_upstream_interpreters has python2 in it | 17:48 |
*** dpawlik has quit IRC | 17:48 | |
clarkb | though maybe what you mean is if you want python packages you need that? since pip_from_upstream doesn't imply python packages | 17:49 |
smcginnis | Hopefully quick and easy question - do we have any nodes with py38 available yet? | 17:49 |
clarkb | smcginnis: I believe the bionic nodes can do that with special packages (the tox-py38 enables them) | 17:50 |
smcginnis | Perfect, thanks clarkb. | 17:50 |
mordred | corvus: qcow2 image convert should be: qemu-img convert -f raw -O qcow2 opensuse-15-0000053000.raw opensuse-15-0000053000.qcow2 | 17:51 |
fungi | smcginnis: yeah, if you just use the py38 jobs they should work magically | 17:52 |
mordred | corvus: I am currently doing the second stage of the vhd convert | 17:52 |
smcginnis | fungi, clarkb: Would that include openstack-tox-functional jobs? | 17:52 |
clarkb | smcginnis: I don't know. You may have to add the package installs that tox-py38 does to tox-functional if it isn't already there | 17:53 |
fungi | smcginnis: yeah, no clue, i've only seen folks using the tox-py38 job so far | 17:53 |
smcginnis | Guess we'll find out. | 17:54 |
fungi | but that installs the python3.8 package on the default image | 17:54 |
fungi | er, default node type | 17:54 |
*** ralonsoh has quit IRC | 17:55 | |
corvus | mordred: nodepool is attempting to upload images now (but failing since not all files are in place) | 17:56 |
corvus | so either the md5sum file or the "vhd-new" file is enough for it to think there's an image there | 17:56 |
corvus | anyway, i think that's good, harmless, but chatty :) | 17:56 |
corvus | okay, all 3 raw pieces are in place | 17:57 |
mordred | corvus: cool | 17:58 |
corvus | and it looks like we're really uploading to vexxhost now | 17:59 |
mordred | corvus: I'll do the qcow2 conversion as soon as the vhd conversion is done | 17:59 |
corvus | mordred: cool -- want me to do the checksums and rename for vhd, or you? | 17:59 |
mordred | corvus: if you could do the checksums that would be neat | 18:00 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Check if pip is preinstalled before installing it https://review.opendev.org/720254 | 18:00 |
corvus | will do; lemme know when it's ready | 18:00 |
mordred | will do | 18:00 |
clarkb | re image conversions for qcow2 you might need to set the compatibility flag. I'm not sure if we ever managed to decide if that was or wasn't needed anymore | 18:01 |
clarkb | the -o compat=0.10 or something similar flag | 18:01 |
mordred | clarkb: I believe we stopped doing it | 18:01 |
mordred | corvus: done | 18:01 |
mordred | clarkb: we were only doing that for hp cloud anyway | 18:01 |
clarkb | mordred: ya at this point it would surprise me if there were any qemu-imgs in the wild old enough to trip over that | 18:02 |
mordred | clarkb: I also don't see us setting QEMU_IMG_OPTIONS | 18:02 |
mordred | corvus: I am now doing the qcow conversion | 18:03 |
corvus | ack | 18:03 |
clarkb | mordred: well if some clouds are unhappy with it without the flag we'll learn something :) | 18:03 |
clarkb | (and probably be able to suggest strongly that people upgrade qemu) | 18:03 |
mordred | clarkb: re-review https://review.opendev.org/#/c/719343/ ? | 18:11 |
* mordred would like to get that done since there's a manual transition step and we're sort of in the awkward half-rolled-out stage :) | 18:12 | |
clarkb | mordred: is that gonna need a new rebase when the git role changes ? | 18:12 |
clarkb | maybe we should decide on an order there with some depends on? | 18:12 |
mordred | clarkb: or else the git role change will need a rebase | 18:13 |
mordred | clarkb: I'd like to get the zuul one landed first (deleting the extra ansible.cfg is important) - then I'll rebase the other one | 18:13 |
mordred | clarkb: (turns out that ansible.cfg in the root of the repo was a bad idea) | 18:14 |
mordred | corvus: qcow2 is done | 18:14 |
corvus | neat, still waiting on the sha256 from vhd :) | 18:15 |
mordred | cool | 18:15 |
mordred | corvus: have brainspace for a rebase re-review of https://review.opendev.org/#/c/719343/ while we wait? (ok if not) | 18:15 |
openstackgerrit | Merged opendev/system-config master: Fix etherpad port 80 logging https://review.opendev.org/720245 | 18:16 |
mordred | hrm. that patch was unhappy in deploy ... why | 18:17 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Remove infra-prod-update-system-config from etherpad https://review.opendev.org/720261 | 18:18 |
mordred | fungi, clarkb : ^^ | 18:18 |
corvus | vhd summed and moved into place | 18:19 |
mordred | corvus: wot | 18:19 |
mordred | woot | 18:19 |
corvus | mordred: did you create the qcow2 as the final name? | 18:20 |
mordred | corvus: oh - I did. cause I'm dumb | 18:20 |
corvus | mordred: i think it's been uploading the qcow2 for 15 minutes, but it only finished converting 5 min ago | 18:20 |
corvus | i don't know what that's going to do | 18:20 |
corvus | that's kna1, mtl01, limestone, openedge, ovh | 18:21 |
corvus | maybe it'll just work? | 18:21 |
mordred | corvus: maybe? maybe it'll just be reading from a file that's being appended to | 18:22 |
corvus | i don't know if it does anything with sizes or checksums beforehand though | 18:22 |
clarkb | corvus: it does, but I don't think it checks any of that except for on rax | 18:23 |
clarkb | and there its just checking the checksum for reuploading purposes? | 18:23 |
mordred | yah | 18:23 |
corvus | okay, checksum files for qcow2 are in place | 18:25 |
corvus | it looks like we're now really uploading everywhere | 18:25 |
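The exchange above (an image written directly under its final name got picked up for upload before the conversion finished) suggests the usual "write under a temporary name, checksum, then rename" pattern. A minimal Python sketch of that idea follows. The function names, the `.md5`/`.sha256` suffixes, and the checksum line format are assumptions for illustration, not a description of what nodepool itself does.

```python
import hashlib
import os


def compute_digests(path, chunk_size=1024 * 1024):
    """Read the file once in chunks and return its md5 and sha256 hex digests."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}


def publish_image(tmp_path, final_path):
    """Checksum a finished image, then expose it under its final name.

    The image is only renamed after it is fully written, so a watcher
    looking for final_path never sees a partial file. Checksum file
    names (final_path + ".md5" / ".sha256") are an assumed convention.
    """
    digests = compute_digests(tmp_path)
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(tmp_path, final_path)
    for algo, digest in digests.items():
        with open(f"{final_path}.{algo}", "w") as out:
            out.write(f"{digest}  {os.path.basename(final_path)}\n")
    return digests
```

With this shape, a conversion tool would write to something like `image.qcow2.tmp` (a hypothetical name) and call `publish_image` only once the conversion has exited successfully.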
mordred | corvus: woot | 18:27 |
mordred | clarkb: heh. your zuul-jobs fix for opensuse failed on there being no opensuse images | 18:34 |
clarkb | mordred: yup, it also failed on -plain and centos ps1 but ps2 looks good | 18:35 |
clarkb | I think that implies our testing has reasonable coverage | 18:36 |
mordred | yeah. I agree | 18:36 |
AJaeger | team, I'm puzzled https://docs.openstack.org/python-cinderclient/latest/ gives me a 404 - but https://docs.openstack.org/python-cinderclient/ussuri/ exists | 18:36 |
AJaeger | looking at the last promote job via https://review.opendev.org/#/c/719080/ - everything looks fine. | 18:38 |
AJaeger | can we run the promote job again? | 18:40 |
openstackgerrit | Donny Davis proposed openstack/project-config master: Adding custom label to OE for airship support https://review.opendev.org/720263 | 18:41 |
AJaeger | or has anybody an idea why after the upload there's no content? | 18:42 |
clarkb | AJaeger: I think the job log records what it rsyncs? /me is looking | 18:42 |
clarkb | https://zuul.opendev.org/t/openstack/build/01ec599f1d4b4aa5a8e1297d20f24e3a/log/job-output.txt#137 heh I guess not | 18:43 |
clarkb | AJaeger: is ^ that the job that needs to be rerun? | 18:43 |
AJaeger | yes - and rsync output is https://zuul.opendev.org/t/openstack/build/01ec599f1d4b4aa5a8e1297d20f24e3a/console#1/0/23/localhost | 18:44 |
AJaeger | https://zuul.opendev.org/t/openstack/build/01ec599f1d4b4aa5a8e1297d20f24e3a/console#1/0/23/localhost | 18:44 |
clarkb | AJaeger: hrm that seems to show files being copied to the correct place. We need an index.html for your url to work right? | 18:46 |
clarkb | (and there is an index.html copied) | 18:47 |
clarkb | AJaeger: if you look in afs the files are there | 18:48 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Use TOX_CONSTRAINTS_FILE in release script https://review.opendev.org/720265 | 18:48 |
AJaeger | clarkb: they are in afs - but not displayed on docs.o.o? | 18:49 |
AJaeger | clarkb: do you also get a 404 on https://docs.openstack.org/python-cinderclient/latest/ ? | 18:49 |
clarkb | AJaeger: I do | 18:49 |
clarkb | if I try to navigate to /afs/openstack.org/docs/python-cinderclient/latest on static01.o.o that fails | 18:50 |
clarkb | so I think this is an afs issue | 18:50 |
AJaeger | ;( | 18:50 |
clarkb | perhaps related to [Wed Apr 15 00:08:56 2020] afs: Waiting for busy volume 536871090 () in cell openstack.org | 18:50 |
clarkb | I'm going to try and invalidate the cache things for that path | 18:51 |
clarkb | I just have to remember how to do that | 18:51 |
fungi | oh, maybe the vos release hasn't completed? | 18:54 |
clarkb | maybe? fwiw `fs flush` on that path fails because it thinks it doesn't exist | 18:55 |
clarkb | flushing the parent dir didn't help | 18:55 |
fungi | /afs/openstack.org/docs/python-cinderclient/latest wouldn't exist, would it? i thought that was a redirect | 18:55 |
clarkb | fungi: its perfectly navigable on hosts that are not static01.opendev.org | 18:56 |
fungi | oh, i see | 18:56 |
clarkb | fungi: and the log AJaeger shared shows we copy directly into it | 18:56 |
fungi | and also /afs/.openstack.org/docs/python-cinderclient/latest is navigable from static | 18:56 |
clarkb | I don't think it is a redirect | 18:56 |
clarkb | listvldb shows there isn't a release in progress | 18:56 |
fungi | oh, right, we redirect *to* latest | 18:57 |
fungi | yeah, so this does seem like a cache problem if other clients see it | 18:57 |
fungi | yesterday's kernel upgrade seems to have broken my local openafs lkm | 18:57 |
clarkb | we could try restarting openafs services on static or rebooting it | 18:57 |
clarkb | we could also try a flushvolume | 18:58 |
clarkb | which is the more heavy handed version of flush that applies to the volume entirely | 18:58 |
clarkb | should I try fs flushvolume first? that seems like maybe the least heavy handed thing we can do next | 18:58 |
clarkb | didn't hel | 18:59 |
clarkb | *help | 18:59 |
corvus | AJaeger, clarkb, fungi, mordred, cmurphy: some uploads of the old image have completed, so i think we should be back to where we were yesterday | 19:00 |
fungi | thanks corvus, mordred! | 19:00 |
clarkb | corvus: I can recheck my ensure-pip change to check | 19:00 |
AJaeger | thanks, corvus and mordred ! | 19:00 |
clarkb | also looks like systemctl stop openafs-client ; systemctl start openafs-client might be the next thing to try on static? | 19:01 |
clarkb | that will blip everything though | 19:01 |
fungi | less of a blip than a reboot at least | 19:01 |
fungi | but yeah, that's where i'd go next unless corvus has suggestions | 19:01 |
cmurphy | thanks corvus | 19:02 |
fungi | clarkb: is the kernel logging anything | 19:02 |
fungi | ahh, just the "Waiting for busy volume" | 19:02 |
clarkb | fungi: kern.log just shows those waiting for busy volume | 19:02 |
clarkb | ya | 19:02 |
fungi | clarkb: more than one volume though | 19:02 |
clarkb | let me see what volume that id belongs to | 19:02 |
fungi | looks like they were all for volume 536870992 in previous weeks, but 536871090 is the one from earlier today | 19:03 |
*** hashar has quit IRC | 19:04 | |
clarkb | project.airship maybe? its got 536871091 and 536871092 now | 19:04 |
fungi | that's for the https://docs.airshipit.org/ site | 19:06 |
clarkb | ya | 19:07 |
clarkb | which isn't where python-cinderclient docs are stored so could be those warnings are just noise? | 19:07 |
fungi | i'm suspecting they may be unrelated, yes | 19:07 |
fungi | especially since they're occurring infrequently | 19:08 |
fungi | there's only one entry in dmesg from today, and it was around 08:00z if memory serves | 19:08 |
AJaeger | http://zuul.opendev.org/t/zuul/stream/3d52a5bad3f643528d1ab115d12756bc?logfile=console.log is an opensuse-15 log ;) | 19:11 |
clarkb | I'm not coming up with anything better than stop starting openafs-client. Except for maybe use the rw volume for now | 19:11 |
clarkb | (and that will let us debug further) | 19:11 |
AJaeger | and clarkb's change passed now | 19:13 |
*** factor has quit IRC | 19:14 | |
*** factor has joined #opendev | 19:14 | |
corvus | oy, there's another fire? /me catches up on afs stuff | 19:14 |
clarkb | fwiw I checked lsof against that path and its parent and it says nothing has parent open and child doesn't exist | 19:15 |
clarkb | (just in case there would be clues in the kernel file tables) | 19:15 |
openstackgerrit | Merged opendev/system-config master: Use project-config from zuul instead of direct clones https://review.opendev.org/719343 | 19:16 |
openstackgerrit | Merged opendev/system-config master: Remove infra-prod-update-system-config from etherpad https://review.opendev.org/720261 | 19:16 |
clarkb | mordred: ^ fyi | 19:17 |
mordred | clarkb: woot | 19:17 |
mordred | I have renamed the zuulcd user and moved the home dir - so that _should_ run without issue | 19:18 |
clarkb | I need to find lunch. On static.o.o's /afs/openstack.org/docs/python-cinderclient/latest issue my only current input is that maybe we need to restart openafs-client there. I can't find anything in logs or vos output saying that it is unhappy. But it definitely doesn't seem to stat | 19:20 |
clarkb | the dir does stat and is navigable on other hosts | 19:20 |
corvus | clarkb: i have run some flush commands as root and they made it better | 19:20 |
clarkb | corvus: hrm I ran fs flush on the cinderclient/ and cinderclient/latest paths as well as flushvolume on cinderclient/ and cinderclient/latest | 19:21 |
corvus | clarkb: as root? | 19:21 |
clarkb | corvus: I ran those from static01. did you do differently? | 19:21 |
clarkb | yes | 19:21 |
corvus | huh. then maybe the 'fs checkvolumes' command helped | 19:21 |
corvus | i ran that as non-root, but initially didn't think it did anything, but i may have been mistaken | 19:22 |
AJaeger | https://docs.openstack.org/python-cinderclient/latest/ is working now - thanks! | 19:22 |
corvus | at any rate, some combination of those 3 commands run as some combination of non-root and root seem to have helped | 19:22 |
clarkb | corvus: http://paste.openstack.org/show/792181/ that is what I ran | 19:22 |
corvus | if it happens again, maybe we can narrow it down more | 19:22 |
AJaeger | let me spider again ;) | 19:23 |
AJaeger | (openstack-manuals merge does some sanity check for indices) | 19:23 |
corvus | clarkb: me too, though i did it from the python-cinderclient directory against '.' | 19:23 |
corvus | mordred: i think we can remove nb from the emergency file now, yeah? | 19:24 |
clarkb | so ya maybe checkvolumes was what we needed. I'll keep that in mind for testing if this comes up again | 19:24 |
clarkb | (basically try that first then test paths I guess) | 19:24 |
mordred | corvus: yes - I agree - I'll do that in just a bit | 19:28 |
mordred | corvus, clarkb : the project-config change did not work - we hit retry limit on it in deploy pipeline - I'm looking on the zuul scheduler to try to figure out why | 19:29 |
clarkb | mordred: k, I'm making a burger but can help after lunch | 19:29 |
dirk | corvus: ajaeger: cmurphy: the original issue is fixed if we'd get a new dib release | 19:32 |
dirk | There is a fix in there that would make pip-and-virtualenv element work again and then we have time to figure out things | 19:33 |
mordred | clarkb: I may need it - I'm not sure what I'm looking for :( | 19:33 |
clarkb | mordred: usually if you grep the job name you find the jobs that ran. They'll have an event id in the logs then you grep that id and do a trace | 19:33 |
clarkb | at least thats been how I've debugged similar in the past. Also you can look in logstash if we are caught up | 19:34 |
clarkb | but it will only have info if there were logs published | 19:34 |
openstackgerrit | Merged openstack/project-config master: Adding custom label to OE for airship support https://review.opendev.org/720263 | 19:35 |
mordred | clarkb: I'm dumb | 19:36 |
mordred | clarkb: I missed a rename | 19:36 |
mordred | clarkb: turns out - when you rename a user in /etc/passwd - you ALSO need to rename the user in /etc/shadow :) | 19:36 |
*** factor has quit IRC | 19:36 | |
mordred | clarkb: I want enqueue-ref for re-triggering the deploy pipeline right? | 19:38 |
mordred | corvus: ^^ ? | 19:39 |
mordred | does zuul enqueue-ref --pipeline deploy --ref refs/changes/43/719343/19 --trigger gerrit --tenant openstack --project opendev/system-config look reasonable? | 19:41 |
mordred | or I need newrev and oldrev don't I? | 19:42 |
*** factor has joined #opendev | 19:43 | |
mordred | it's a change-merged trigger - so I think I don't | 19:44 |
corvus | for change merged you want 'enqueue' | 19:44 |
mordred | ah - cool | 19:45 |
corvus | should be just like a check/gate enqueue | 19:45 |
*** osmanlicilegi has quit IRC | 19:45 | |
mordred | corvus: zuul enqueue --pipeline deploy --change719343 --trigger gerrit --tenant openstack --project opendev/system-config | 19:45 |
mordred | so that look ... sigh. with a = | 19:45 |
mordred | zuul enqueue --pipeline deploy --change 719343 --trigger gerrit --tenant openstack --project opendev/system-config | 19:46 |
mordred | that look more sane? | 19:46 |
corvus | --change 719343,19 | 19:46 |
mordred | k. hopefully it'll work more better this time | 19:47 |
mordred | thanks | 19:47 |
mordred | corvus: it is at least running - so yay! | 19:48 |
mordred | corvus: I have removed nodepool from emergency - so we should get a nodepool ansible run this time too | 19:50 |
mordred | corvus, clarkb : infra-prod-install-ansible has run successfully from /home/zuul | 19:51 |
mordred | \o/ | 19:51 |
corvus | mordred: woot! | 19:51 |
mordred | corvus, clarkb : we should be able to land https://review.opendev.org/#/c/720231/ now | 19:52 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Add /opt/lib/git to the volume mounts https://review.opendev.org/720225 | 19:53 |
mordred | corvus, clarkb also that ^^ which should fix the local mirror issue | 19:55 |
*** jkt has quit IRC | 20:09 | |
AJaeger | config-core, please review https://review.opendev.org/720265 - small cleanup for release | 20:10 |
*** jkt has joined #opendev | 20:10 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 20:17 |
corvus | mordred: i'm about to start looking into your question on ^, ok? | 20:17 |
corvus | mordred: i suspect the answer is "no it's not still accurate, and we are using the normal python3 gear install that we get in the ansible envs" | 20:19 |
corvus | mordred: yep, pretty sure that's the case | 20:22 |
mordred | corvus: cool! | 20:23 |
mordred | corvus: that excites me | 20:23 |
mordred | corvus: I'll remove that from the comment in the next iteration then :) | 20:23 |
corvus | mordred, clarkb, fungi: so... i can't remember if i asked this question or not -- i do remember i was typing it into irc right as all the fires exploded. for the zk tls work, we can either (a) run zk-ca.sh manually on bridge and copy the resulting keys into private hostvars (like we did for the gear certs). or (b) we could (all in ansible) run zk-ca.sh on bridge and slurp up the keys to put them on the | 20:25 |
corvus | zuul hosts. | 20:25 |
corvus | i kind of like (b) -- the keys aren't precious, it just means that if we lose bridge, we will end up rekeying zuul. that doesn't sound like a big deal. | 20:26 |
mordred | corvus: because zk-ca.sh won't make new certs if we already have old certs, right? | 20:27 |
corvus | yep, it's idempotent | 20:27 |
mordred | yeah. so I like b | 20:27 |
mordred | I don't see any point in managing them in private hostvars if we don't have to | 20:27 |
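The property that makes option (b) attractive is idempotence: re-running the generation step leaves existing keys alone. A minimal sketch of that behavior in Python, assuming a hypothetical `generate` callback standing in for whatever actually produces the key material (this is not how zk-ca.sh is implemented, only an illustration of the "create only if missing" idea):

```python
import os


def ensure_key(path, generate):
    """Create the file at path from generate() only if it does not exist yet.

    Returns True if a new file was written, False if an existing one was kept.
    `generate` is a hypothetical callable returning the key material as bytes.
    """
    if os.path.exists(path):
        return False
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(generate())
    # Rename into place so a half-written key is never visible at `path`.
    os.replace(tmp, path)
    return True
```

Under this model, losing the host that holds the files simply means the next run generates fresh keys, which matches the "we will end up rekeying" trade-off described above.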
corvus | cool, i'll start down that path then | 20:28 |
corvus | (i will make followup changes to 717620) | 20:28 |
mordred | cool | 20:28 |
mordred | fwiw - manage-projects did not run well this trigger | 20:29 |
mordred | I am now investigating | 20:29 |
mordred | it failed on synchronize with no logs because no_log | 20:31 |
mordred | I'm going to say "shrug" | 20:31 |
corvus | mordred: was that on the superseded patch? | 20:31 |
corvus | where we replaced synchronize with the git role? | 20:31 |
clarkb | corvus: how does the slurping in b) operate? is it different than putting things in private vars? | 20:32 |
mordred | no - the superseded patch landed - I just re-approved the patch to replace it with the git role | 20:32 |
corvus | mordred: i mean, what 'synchronize' operation failed? | 20:33 |
mordred | the synchronize that we're replacing with the git role | 20:33 |
corvus | ok, that was my question, sorry for being unclear. i agree that shrug is the right answer | 20:33 |
fungi | clarkb: i was assuming copying files | 20:33 |
mordred | yeah - if the other thing fails, I'll debug _that_ | 20:33 |
corvus | clarkb: yeah, it would mean a task to copy the file from bridge to the remote zuul/nodepool node | 20:33 |
fungi | corvus: plan b sounds safe, and less hands-on | 20:34 |
clarkb | corvus: gotcha so major difference is not tracking it in git history | 20:34 |
clarkb | ya I think that is fine for this use case | 20:34 |
fungi | not as fantastical as plan 9, but then what is? | 20:34 |
corvus | oh, we're going to need all of nodepool out of puppet for this too | 20:34 |
mordred | plan 💩 | 20:34 |
corvus | is there anything preventing rolling nb01/02/03 into containers now? | 20:35 |
corvus | (afaik nb04 is good, with no outstanding issues) | 20:35 |
mordred | corvus: I don't think so - I think ianw was going to start rolling each of them out | 20:35 |
fungi | corvus: yeah, i think that was ianw's plan next, once the pip-and-virtualenv bits are settled | 20:35 |
mordred | corvus: I think we need to do all of zk too, yeah? | 20:36 |
corvus | cool, i'll go ahead and write the skeleton of this change, but clearly we won't be able to land it until that happens | 20:36 |
corvus | mordred: yeah | 20:36 |
mordred | corvus: cool. are you doing that bit in your change? or want me to start working on a change for that. also - for nodepool-launcher | 20:36 |
corvus | mordred: i'll focus on the CA aspects for now, and deploying to zuul; if you want to start on zk and nodepool-launcher, that'd be great; i can pitch in on that when this is done | 20:37 |
corvus | then if all that's done, we can help ianw with the nb rollout :) | 20:37 |
corvus | mordred: oh, i just did a bunch of docker testing for zk, let me grab my docker-compose file | 20:38 |
clarkb | before we start deploying more services with docker compose it might be a good idea to land https://review.opendev.org/#/c/719589/ and its child | 20:38 |
corvus | clarkb: the names are changing? | 20:38 |
mordred | corvus: yeah - isn't that swell? | 20:39 |
clarkb | corvus: yes docker-compose was chomping the - in dir names but now it doesn't | 20:39 |
corvus | nice | 20:39 |
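To make the rename concrete: docker-compose derives a project name from the directory name and uses it as a prefix for container names. The sketch below approximates the old and new normalization rules as described in the conversation (older versions dropping `-`, newer ones keeping it). The regular expressions and the example directory name are assumptions for illustration, not copied from docker-compose.

```python
import re


def legacy_project_name(directory):
    """Approximate the older behavior: keep only lowercase letters and digits."""
    return re.sub(r"[^a-z0-9]", "", directory.lower())


def current_project_name(directory):
    """Approximate the newer behavior: also keep '-' and '_'."""
    return re.sub(r"[^-_a-z0-9]", "", directory.lower())


def container_name(project, service, index=1):
    """Compose-style container name: <project>_<service>_<index>."""
    return f"{project}_{service}_{index}"


# Hypothetical directory and service names, for illustration only.
old = container_name(legacy_project_name("gerrit-compose"), "gerrit")
new = container_name(current_project_name("gerrit-compose"), "gerrit")
print(old)
print(new)
```

Anything that matches containers by their full name (test assertions, monitoring, manual `docker` commands) would need updating when the prefix changes.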
corvus | clarkb: what happens with the upgrade? | 20:39 |
clarkb | https://review.opendev.org/#/c/719682/ is my attempt at testing that upgrade path | 20:39 |
clarkb | corvus: ^ seems to show everything works even with the name change, but reviewing this upgrade change is probably worthwhile too | 20:39 |
clarkb | I was also hoping I could spend a bit more time trying to formalize what that change does into a generic upgrade testing job/tool | 20:40 |
corvus | what does "work" mean? does it restart/recreate containers or does it just recognize old names as its own containers still? | 20:40 |
clarkb | corvus: based on testing it stopped the old containers and started the new containers with no problems despite the name change. When I didn't do the updated test sed's in that change we failed testinfra tests because those old containers did not exist anymore | 20:41 |
clarkb | corvus: that implies to me that its stopping old name properly, then starting new name properly | 20:41 |
clarkb | (the job runs everything with old version, upgrades docker-compose, runs docker-compose up --force-recreate, then reruns testinfra) | 20:42 |
corvus | clarkb: does --force-recreate cause the restart? | 20:42 |
corvus | we don't normally run that, right? | 20:42 |
clarkb | corvus: ya its the flag that says stop and start even if container images haven't changed | 20:42 |
corvus | i'm just trying to figure out what happens to gerrit when we land https://review.opendev.org/719589 | 20:42 |
clarkb | correct we normally rely on images to have changed in order to trigger the restarts | 20:42 |
corvus | so if that's omitted, and we upgrade docker-compose, do we know what happens? | 20:43 |
clarkb | corvus: I see you're thinking that maybe new docker-compose will restart even without the force | 20:43 |
clarkb | we can test that :) one moment I'll get a patchset up for that case | 20:43 |
corvus | yeah, it might (a) do nothing (yay) (b) restart without any prompting (meh) (c) run a second copy (boo) | 20:43 |
corvus | my guess based on your test so far is (a), but would be good to confirm that, because (c) would be bad. | 20:44 |
mordred | four legs good, c bad | 20:45 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: DNM Test docker-compose upgrade https://review.opendev.org/719682 | 20:46 |
clarkb | corvus: mordred ^ that runs docker-compose up -d which is what we normally run. Then it runs testinfra against the old names (we should expect this to pass), then it updates testinfra to check for new names and runs testinfra. This last testinfra run should fail | 20:46 |
clarkb | if the last testinfra run passes it implies we are running both sets of containers | 20:47 |
clarkb | and if the second to last fails it implies a restart happens even though we don't force it to | 20:47 |
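The reasoning in the last few messages maps two test results onto the three outcomes corvus listed: (a) nothing happens, (b) containers restart unprompted, (c) a second copy runs. A small Python sketch of that decision table, with hypothetical function and label names:

```python
def interpret_upgrade_test(old_names_pass, new_names_pass):
    """Map the two testinfra runs onto the outcomes discussed above.

    old_names_pass: testinfra found containers under the pre-upgrade names.
    new_names_pass: testinfra found containers under the post-upgrade names.
    """
    if old_names_pass and new_names_pass:
        # Both sets exist at once: a second copy was started.
        return "c: both old and new containers are running"
    if old_names_pass:
        # Only the old names exist: a plain `up -d` left things alone.
        return "a: nothing changed"
    if new_names_pass:
        # Old names are gone and new ones exist: an unprompted recreate.
        return "b: containers were recreated under new names"
    return "inconclusive: no containers found under either name"


print(interpret_upgrade_test(True, False))
```

The expected-good result for the test change is the first call shown: old names pass, new names fail.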
corvus | mordred: http://paste.openstack.org/show/792186/ zk docker compose and config file from my testing -- for the first pass of containerization, we should drop all the tls stuff obviously | 20:48 |
corvus | mordred: that's based on the upstream documentation for using the container images with docker-compose, so it's shiny and new | 20:48 |
corvus | clarkb: ack sounds good, thx | 20:48 |
corvus | mordred: and we have actual real different hosts, so we don't need to worry about the ports and docker-based hostnames and stuff | 20:49 |
clarkb | also actual different hosts are important for taking advantage of reliability there | 20:49 |
corvus | and we probably have some tuning in our current config we should make sure not to lose | 20:50 |
clarkb | (though I guess we can't guarantee they are on different hypervisors) | 20:50 |
corvus | so all in all, maybe a few lines of that paste will be useful, but it's a good reference :) | 20:50 |
clarkb | corvus: ya we force it to rotate the journal and bump up the write to disk time | 20:50 |
openstackgerrit | Merged opendev/system-config master: Switch to prepare-workspace-git https://review.opendev.org/720231 | 20:58 |
mordred | corvus: ++ | 20:58 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 21:05 |
clarkb | apparently my new yaml in that test change isn't valid for jinja? | 21:12 |
clarkb | its the ' unbalancing again | 21:13 |
clarkb | I should just start typing without any 's and the issue will go away | 21:13 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: DNM Test docker-compose upgrade https://review.opendev.org/719682 | 21:13 |
clarkb | trying again | 21:13 |
corvus | clarkb: be like data; no contractions | 21:13 |
*** DSpider has quit IRC | 21:14 | |
corvus | mordred: there seems to be a chunk of puppet in the ansible for the zuul-scheduler role :) | 21:15 |
clarkb | corvus: what about compression? | 21:17 |
mordred | corvus: you're just imagining that | 21:17 |
corvus | clarkb: i believe his upper spinal support is a poly-alloy, designed to withstand extreme stress | 21:18 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 21:18 |
corvus | mordred: left a second comment on that too | 21:19 |
mordred | agree | 21:19 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 21:20 |
openstackgerrit | Merged opendev/system-config master: Add /opt/lib/git to the volume mounts https://review.opendev.org/720225 | 21:26 |
mordred | clarkb, corvus : we're going to need to restart the gerrit container to pick that up | 21:31 |
mordred | maybe we should wait until the compose change lands too | 21:31 |
mordred | so that we just do one restart | 21:32 |
corvus | mordred: i'm inclined to restart asap -- i have clones from those urls and they are several days out of date. maybe i'm the only one, but if not, then it's a service impact | 21:33 |
corvus | (also, we're going to need to trigger a full replication of everything to that after restarting) | 21:33 |
clarkb | ya I think the only reason we need to wait is if we are worried about not restarting gerrit gracefully | 21:34 |
clarkb | because the docker-compose stack also addresses ^ | 21:34 |
corvus | mordred: and our container is going to take up a lot of extra space -- so maybe we should --recreate it? | 21:34 |
corvus | clarkb: is there a way to gracefully shut it down now? with a plain docker command maybe? | 21:35 |
mordred | we could do a docker-compose exec to send the hup | 21:35 |
mordred | corvus: and yes - let's do recreate for sure | 21:35 |
corvus | and we don't have 'restart: always'? | 21:35 |
clarkb | corvus: ya I'm not sure. What we want to do is hup it then wait long enough for it to stop on its own | 21:36 |
clarkb | which is less than a minute with our version of gerrit iirc | 21:36 |
corvus | right i'm just wondering if we do that will docker restart it | 21:37 |
clarkb | oh mordred ^ | 21:37 |
mordred | we do not have restart: always | 21:37 |
mordred | so I think it will not | 21:37 |
corvus | ok | 21:37 |
corvus | i guess we're waiting for that to land on disk | 21:38 |
mordred | oh fun | 21:38 |
mordred | https://zuul.opendev.org/t/openstack/build/90c6d2ddf8204800aab15a26e05952e8 | 21:38 |
corvus | how does that not have a default value | 21:40 |
corvus | mordred: i guess just add that to the role invocation? | 21:40 |
clarkb | corvus: mordred maybe because we add host the server | 21:40 |
clarkb | I bet you can set it when you add host? | 21:41 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Add port and user_dir to add_host in prod playbook https://review.opendev.org/720293 | 21:41 |
corvus | ohhh | 21:41 |
corvus | clarkb wins | 21:41 |
mordred | yeah - I think that should do it | 21:41 |
mordred | I added ansible_user_dir - just from looking at the role for other things it wants | 21:41 |
clarkb | I feel like I'm learning a lot about ansible :) | 21:41 |
mordred | clarkb: me too! | 21:41 |
mordred | (we could also add a default(22) to the role there) | 21:42 |
clarkb | mordred: should system-config-run-nodepool have a parent of system-config-run-containers? or does it not matter because it is consuming images from the zuul tenant? | 21:43 |
mordred | clarkb: that's right - that base job is only for jobs where we're depending on containers we're building | 21:43 |
mordred | (that's right- it doesn't matter) | 21:44 |
corvus | mordred: we should run our zuul containers as non-root users | 21:44 |
corvus | 10001 is set up as the zuul user in the container | 21:47 |
corvus | er in the image | 21:47 |
mordred | corvus: ++ | 21:47 |
mordred | we should run them as that | 21:47 |
corvus | and likewise, same number is the nodepool user in the np images | 21:47 |
mordred | corvus: the images set USER already ... so don't these start as that user absent other intervention? | 21:48 |
corvus | do they? | 21:48 |
corvus | i didn't see that they did | 21:48 |
mordred | oh - I guess not | 21:48 |
clarkb | what does the USER directive do in that case? | 21:49 |
mordred | we don't do one | 21:49 |
corvus | clarkb: the user directive says what user to run as | 21:49 |
corvus | in the image | 21:49 |
mordred | yeah- so if we DID do a USER, it would run as that - but we don't, so we need to set it in the compose | 21:49 |
corvus | the current state of the nodepool/zuul images is that they have a unix user created in the filesystem of the image, but they run as root by default. but we can tell docker to run as that user. | 21:49 |
clarkb | corvus: oh its for build time | 21:50 |
corvus | clarkb: USER affects build and run | 21:50 |
corvus | (you can use it during build to switch users for build activities; and the last USER line also says what it will run as by default) | 21:51 |
clarkb | got it | 21:51 |
corvus | which makes a weird sort of sense when you think of building and running images as the same thing, which docker does | 21:51 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 21:52 |
mordred | user: zuul added | 21:52 |
mordred | good catch | 21:52 |
corvus | we should be able to run as 10001 everywhere except zuul-fingergw, which still probably wants to be run as root since we run in host networking; that way it can grab the port and drop | 21:52 |
mordred | oh. yeah. lemme fix fingergw - I forgot about port drop | 21:53 |
corvus | oh, and i have no idea about nodepool-builder :) | 21:53 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 21:53 |
mordred | corvus: I think just running n-b as root makes sense- otherwise it's just going to be sudoing all over the place anyway | 21:53 |
mordred | oh - hah. we run as nodepool but with privileged: true on | 21:54 |
corvus | huh, we apparently run the builders as the nodepool user | 21:54 |
mordred | yeah | 21:54 |
mordred | so I guess diskimage-builder sudos where necessary? | 21:54 |
mordred | I mean - whatever it's doing is apparently working | 21:55 |
clarkb | ya it should sudo | 21:55 |
clarkb | corvus: mordred it appears to have recreated the containers | 22:00 |
clarkb | that's painful | 22:00 |
clarkb | I'll get links to logs once the buildset reports | 22:00 |
corvus | mordred: we have some files with owner: zuul... we may want to change that to owner: 10001 ? | 22:01 |
corvus | (maybe later we could re-id the zuul / nodepool user as 10001?) | 22:02 |
corvus | mordred: i'm looking at the 'add github key' task in your change | 22:02 |
mordred | clarkb: ok. so we need to emergency review when landing that | 22:03 |
mordred | corvus: the zuul/nodepool user already is | 22:03 |
clarkb | https://zuul.opendev.org/t/openstack/build/521f38be044647e59b4c621749841bbd/log/job-output.txt#17194 it's recreating there, then we fail when checking the old names at https://zuul.opendev.org/t/openstack/build/521f38be044647e59b4c621749841bbd/log/job-output.txt#17477 | 22:03 |
mordred | corvus: we set them as 10001 in the images because that's what they are in opendev :) | 22:03 |
clarkb | mordred: well review doesn't actually docker-compose up ever I think | 22:03 |
clarkb | that's always manual | 22:03 |
corvus | mordred: no way. wow. cool. | 22:04 |
clarkb | but all of the other services we'll need to have a think about? | 22:04 |
corvus | clarkb: yeah, but i suspect they should all be okay. except we'll probably leak something in nb04. but we do anyway. | 22:04 |
clarkb | ya so maybe this is a "land it when there haven't been fires all day and we can pay attention to things as it goes in change" | 22:05 |
openstackgerrit | James E. Blair proposed opendev/system-config master: WIP: add Zookeeper TLS support https://review.opendev.org/720302 | 22:05 |
clarkb | I'll WIP the change now | 22:05 |
fungi | clarkb: i'm looking forward to a day with no fires | 22:06 |
corvus | mordred: ^ if you have a quick second to look at 720302 as an early draft, that'd be great | 22:06 |
clarkb | gitea is the one I worry about most since our restart process relies on a new image being available | 22:06 |
clarkb | we might be able to coincide the docker-compose update with a new image somehow and have it run through its normal updates | 22:07 |
corvus | mainly looking for feedback about how i set it up for delegation. the role is heavyweight, so that using it should basically be a one-liner to each of the zuul/nodepool service roles, then updating their config files to point to the locations. | 22:07 |
corvus | (and yeah, i'm thinking of having the nodepool and zookeeper config files point to /etc/zuul/certs/cert.pem) | 22:08 |
corvus | (cause why not) | 22:08 |
clarkb | fungi: ya it might be wishful thinking. I just want to balance "restart all the things" against "we probably need to make this transition at some point so better when all the things is relatively small" | 22:09 |
clarkb | we could do it service by service too fwiw | 22:09 |
clarkb | then only merge docker-compose install into the ensure-docker role once all existing services use new docker-compose | 22:10 |
mordred | corvus: that's the flock incantation that waits for the lock? | 22:10 |
clarkb | infra-root ^ would you prefer I split it up that way and we can iterate through it? | 22:10 |
corvus | mordred: yep, it's exclusive and waits by default | 22:10 |
mordred | corvus: cool - I think that approach looks good | 22:10 |
fungi | clarkb: what's the list of services we're currently deploying that way? | 22:11 |
fungi | gitea, gerrit, etherpad, one of the nodepool builders... | 22:11 |
clarkb | fungi: https://review.opendev.org/#/c/719589/ the list of services is roughly represented by the playbooks/roles files there | 22:11 |
fungi | just trying to judge possible impact | 22:11 |
mordred | clarkb: honestly - I think I'd go with the bandaid myself - we already serialize gitea, so it should be fine | 22:11 |
mordred | we don't do gerrit by default, so it should be fine | 22:11 |
mordred | so we're really just talking about etherpad and nb04 | 22:12 |
clarkb | etherpad, gerrit, gitea, haproxy, jitsi, nodepool-builder, docker registry, zuul-preview | 22:12 |
mordred | (as things where a restart might have a noticeable impact we should worry about) | 22:12 |
fungi | clarkb: yep, basically the set i was thinking of | 22:12 |
fungi | okay | 22:12 |
clarkb | mordred: that's a good point re gitea. We may have to do a replication to everything after but that's relatively low effort | 22:13 |
fungi | and yeah, the current set seems small enough we can probably just juggle them all in one go | 22:13 |
mordred | yeah - mostly seems like the review/land burden of doing them one at a time might actually be more costly on the team | 22:13 |
mordred | but - definitely not today | 22:14 |
clarkb | ya I'll leave the WIP in place for now but if things are calmer tomorrow maybe we give it a go then | 22:14 |
mordred | ++ | 22:15 |
clarkb | https://review.opendev.org/#/c/720030/ is a related change that is completely safe to land now if anyone wants to look at it (ensures we run jobs when updating dockerfiles) | 22:15 |
mordred | corvus: left one thought on there - it's not important, just a thing we might want to think of as a followup | 22:17 |
*** prometheanfire has quit IRC | 22:17 | |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: ensure-tox: use ensure-pip role https://review.opendev.org/717663 | 22:18 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Update Fedora to 31 https://review.opendev.org/717657 | 22:18 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Make ubuntu-plain jobs voting https://review.opendev.org/719701 | 22:18 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Document output variables https://review.opendev.org/719704 | 22:18 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Python roles: misc doc updates https://review.opendev.org/720111 | 22:18 |
corvus | mordred: cool yeah, i don't like the number in there either. we'll just want to make it work for both zuul and nodepool | 22:19 |
corvus | i'm well past eod now, so i'm going to, well, eod. | 22:19 |
ianw | i just noticed a revert for the suse change ... what's the plan? | 22:19 |
clarkb | ianw: basically get https://review.opendev.org/#/c/720254/ in once sf.io 3pci confirms it works against https://review.opendev.org/717663 | 22:20 |
clarkb | ianw: then we can land the zuul-jobs stack you've got (I think this was the only objections that came up) and then we can retry with new images for suse | 22:20 |
clarkb | ianw: as an alternative midway step dirk asserts that a dib release would make existing builds work | 22:21 |
corvus | when we retry, we should keep the gap between image builds and landing that stack small -- keystone broke which is why we rolled back | 22:21 |
fungi | it was specifically keystone's functional test job, yeah? | 22:22 |
fungi | something which expects virtualenv to be present but isn't a typical tox unit test/linter/whatever model | 22:22 |
openstackgerrit | Merged opendev/system-config master: Add port and user_dir to add_host in prod playbook https://review.opendev.org/720293 | 22:23 |
clarkb | ianw: fwiw now that the docker-compose thing is on semi hold I'm available to keep pushing on the suse things | 22:23 |
clarkb | at least for a few more hours | 22:23 |
fungi | pizza time is just about over and then i can get back to looking at etherpad/apache logs | 22:23 |
ianw | the only thing with rolling back is that new images won't work because pip-and-virtualenv is broken ... i've been trying to avoid making a dib release with a pip-and-virtualenv that only sort of works by accident | 22:24 |
fungi | so far the handful of spot checks i did showed each of the characteristic etherpad warnings was preceded by a request for that pad at the old domain name roughly a minute prior | 22:25 |
ianw | at the time the ensure-pip stack was fully reviewed, so i'd hoped we could push forward with it, that was my thinking, anyway. | 22:25 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run Zuul using Ansible and Containers https://review.opendev.org/717620 | 22:25 |
clarkb | fungi: fwiw mordred was wondering if we should drop the redirect and just continue to serve at the old name too | 22:25 |
clarkb | fungi: maybe we put it in emergency and try that? | 22:25 |
mordred | yeah - maybe something something cookies something state something sad | 22:26 |
clarkb | ianw: not sure I understand your second to last message | 22:26 |
clarkb | pip and virtualenv is broken but would build proper suse images? | 22:26 |
clarkb | likely broken for a different platform I guess | 22:26 |
mordred | ianw: also - not related to suse or pip - we started working on getting zuul+nodepool+zk all up on the ansible so we can roll out zk auth. just as an fyi | 22:26 |
fungi | clarkb: maybe that would be okay... though could make getting people to use the new domain harder and prolong the problem if it's their existing cookies. still if it clears up the problem that's at least a data point | 22:26 |
corvus | mordred, fungi, clarkb: i'd like to keep the redirect... | 22:27 |
fungi | corvus: as would i | 22:27 |
corvus | maybe we can confirm that's the problem before doing that | 22:27 |
* mordred would also like to keep it | 22:27 | |
corvus | maybe by asking people to clear cookies, restart browser, and directly go to the new url... things like that | 22:27 |
ianw | clarkb: https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/pip-and-virtualenv/install.d/pip-and-virtualenv-source-install/04-install-pip#L63 | 22:27 |
mordred | knowing how to reproduce the issue at all would be super great | 22:27 |
clarkb | ianw: oh so we need package lists | 22:28 |
ianw | clarkb: like how tumbleweed is a python3 only platform, but _do_py3 is commented out, so it's using the python2 logic to install the python3 path, and making links with tools with "2" in them and stuff | 22:28 |
clarkb | hrm tumbleweed has python2 | 22:29 |
ianw | but not python2 packages i think? | 22:30 |
ianw | anyway ... i don't want anyone to invest a lot of time fixing things up, and i don't want to spend a lot of time reviewing it, when we want to get rid of it asap | 22:30 |
clarkb | thats fair | 22:31 |
clarkb | hrm git/gerrit/zuul don't like my ensure-pip change being set as a depends-on | 22:32 |
clarkb | maybe I have to rebase it in properly | 22:32 |
clarkb | working on that now | 22:32 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: ensure-pip: export ensure_pip_virtualenv_command https://review.opendev.org/718224 | 22:34 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: fetch-zuul-cloner: use ensure-pip https://review.opendev.org/717882 | 22:34 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: fetch-subunit-output test: use ensure-pip https://review.opendev.org/718225 | 22:34 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: ensure-tox: use ensure-pip role https://review.opendev.org/717663 | 22:34 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Update Fedora to 31 https://review.opendev.org/717657 | 22:34 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Make ubuntu-plain jobs voting https://review.opendev.org/719701 | 22:34 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Document output variables https://review.opendev.org/719704 | 22:34 |
openstackgerrit | Clark Boylan proposed zuul/zuul-jobs master: Python roles: misc doc updates https://review.opendev.org/720111 | 22:34 |
clarkb | there was a conflict between my change and https://review.opendev.org/718224 so ya needed to be rebased :/ | 22:34 |
*** prometheanfire has joined #opendev | 22:39 | |
clarkb | ianw: I guess one risk there is we're sort of equating pip to virtualenv there in the followon change? | 22:39 |
clarkb | ianw: do we also need to check for virtualenv and if it isn't present run the install anyway? | 22:39 |
ianw | clarkb: sorry which one is the followon change? | 22:40 |
clarkb | ianw: https://review.opendev.org/718224 that one | 22:41 |
clarkb | ianw: basically with that change we introduce the idea that if pip is present then so is virtualenv (because we're installing them together when installing pip) | 22:41 |
clarkb | I think in the default case everything will work fine, but tristanC's case might get a little odd if they aren't also installing virtualenv | 22:42 |
clarkb | (and maybe that is ok as power users they can deal with that) | 22:42 |
ianw | clarkb: umm, not really ... i've tried to deliberately make it not install virtualenv | 22:42 |
ianw | it seems like we have to on Xenial, because we found that venv doesn't work there with our mirrors | 22:43 |
clarkb | ianw: https://review.opendev.org/#/c/718224/11/roles/ensure-pip/tasks/RedHat.yaml but it is? | 22:43 |
clarkb | gotcha there might be a few exceptions but in general its relying on python -mvenv which should be there if python is there | 22:43 |
ianw | yeah, where it has to, such as the python2 install | 22:43 |
ianw | but i expect that to be hardly used | 22:43 |
clarkb | ianw: I'm mostly wondering if we need to check for python -m venv and/or virtualenv being valid in addition to `pip` in https://review.opendev.org/#/c/720254/2 | 22:44 |
clarkb | (or in a followup) | 22:44 |
ianw | i don't think so? https://review.opendev.org/#/c/718224/11/roles/ensure-pip/tasks/main.yaml checks and prefers "-m venv" in all cases it can? | 22:45 |
ianw | that should then be tested by the https://review.opendev.org/#/c/718224/11/test-playbooks/ensure-pip.yaml on all our platforms, to ensure that the ensure_pip_virtualenv_command is something valid | 22:46 |
clarkb | ianw: ya but we are skipping the installs entirely if pip is already present | 22:46 |
clarkb | so if you had pip installed but not venv or virtualenv (depending on platform) you would be in a weird spot | 22:46 |
clarkb | I think for now its probably fine | 22:47 |
clarkb | because its a corner case that only power user types like tristanC will run into | 22:47 |
clarkb | thinking about it more I think its ok to not worry about that too much. Basically what we're saying is if you know better then we'll get out of the way | 22:49 |
clarkb | and if that breaks you its on you | 22:49 |
ianw | i'm wondering if we should be doing this for the packaged pip case -> https://review.opendev.org/#/c/720254/2/roles/ensure-pip/tasks/main.yaml | 22:50 |
clarkb | ianw: fwiw this all started because ensure-pip broke sf.io 3pci | 22:51 |
clarkb | and it's my understanding that happened because pip was already installed | 22:51 |
clarkb | and this wasn't reconciling that state for some reason | 22:51 |
ianw | well it is already installed on infra images too | 22:52 |
clarkb | but the ensure-* roles are intended to noop if the thing they ensure is already there | 22:52 |
clarkb | which is why corvus -1'd it | 22:52 |
clarkb | (and why people didn't want to roll forward this morning) | 22:52 |
clarkb | the default is to install from packages so skipping the checks when installing from packages doesn't help I don't think? | 22:53 |
clarkb | at least not with the current sf.io testing | 22:53 |
clarkb | I wonder if they are running jobs with a ro fs? | 22:53 |
ianw | just that the package: install should be idempotent (i.e. noop when already installed) anyway | 22:54 |
tristanC | clarkb: not sure what you mean by power user, but i think that using the tox job with a python container that doesn't have sudo should not be a corner case | 22:54 |
clarkb | tristanC: the corner case is you've preprepped the image. This role is for prepping the image | 22:54 |
clarkb | tristanC: I think the correct way for you to use this would be to not use ensure-* anything if you are using prebuilt images without root | 22:55 |
clarkb | but I'm also happy to try and accommodate the preinstalled case because I think it won't be uncommon | 22:55 |
clarkb | tristanC: the corner case here is that you are using a role that will install things if necessary but you don't let it do that | 22:55 |
ianw | tristanC: so if ensure-pip has a "package:" call with become: yes, that won't work for you, right? | 22:55 |
ianw | even though that is idempotent, as such -- keeping to the rules of ensure-* roles that they don't do anything if the stuff is already there | 22:56 |
tristanC | clarkb: we are not using that role, we just use the tox job provided by the zuul-jobs project. | 22:57 |
clarkb | tristanC: on the root point the whole system has sort of been designed to make using root as safe as possible. because unfortunately a lot of stuff does need root (not necessarily tox though) | 22:58 |
fungi | how exactly did it break for you then? | 22:58 |
clarkb | fungi: its because sudo rpm -q or whatever it does to check if the package is installed failed | 22:58 |
fungi | oh, right the *job* not the *role* | 22:58 |
clarkb | fungi: via ensure-tox consuming ensure-pip | 22:58 |
fungi | the tox job in zuul-jobs tries to install the things it will use, so if you're preinstalling those things the job might still try to sudo even if it'll be a no-op | 23:00 |
fungi | got it | 23:00 |
clarkb | fungi: yup | 23:00 |
ianw | so ... should we make the tox job not call ensure-tox? i thought we decided it wasn't yesterday? | 23:00 |
fungi | so yeah any become would need to be guarded behind whatever conditional ensures it's a no-op | 23:01 |
ianw | playbooks/tox/pre.yaml: - ensure-tox | 23:01 |
clarkb | ianw: well I think there is still value in the check if pip is there without package manager case because it could be installed without the package manager? | 23:01 |
fungi | or else the ensure roles should not be included | 23:01 |
tristanC | clarkb: fungi: iirc we already agreed that the tox job should be usable without sudo access | 23:01 |
fungi | tristanC: yep, makes sense | 23:01 |
clarkb | tristanC: yup I wrote the change to fix it :) | 23:02 |
ianw | tristanC: ++ on tox job not using sudo | 23:02 |
clarkb | but there is a weird side case where the way we pull in pip implies virtualenv (or venv) will be available | 23:02 |
ianw | heh, well we agree on something :) | 23:02 |
clarkb | and if you haven't built the image with virtualenv or venv it will be weird for you | 23:02 |
clarkb | but we can't fix that in any case because there is no sudo so its not worth worrying about I don't think | 23:02 |
ianw | i'm back to why ensure-tox is in the tox role pre.yaml playbook | 23:02 |
clarkb | ianw: how does the job work if it isn't ensuring tox is available? | 23:03 |
clarkb | (I don't think I followed that conversation from before) | 23:03 |
ianw | clarkb: i thought from yesterday, i'll have to go back, we were somewhat of the agreement it was up to you to run "ensure-tox" before running the tox job | 23:04 |
clarkb | ianw: I think the implication was that maybe tristanC should have a different tox job that didn't run any of the roles | 23:04 |
clarkb | *any of the ensure-* roles | 23:04 |
tristanC | perhaps we could drop the assumption that zuul-jobs are not meant to be usable by custom container, and then we should provides a zuul-container-jobs that provides light weight version of the job's play that doesn't use the ensure-* role | 23:05 |
clarkb | but I'm not sure | 23:05 |
tristanC | those jobs could even reference public container images that are known to work with the jobs | 23:05 |
clarkb | tristanC: I wouldn't even label them container jobs as the pattern could be useful in other systems too | 23:05 |
clarkb | fwiw I think my change will fix this particular problem | 23:06 |
clarkb | and we never merged the change that would break tristanC ? | 23:06 |
clarkb | so the system is working? | 23:06 |
ianw | http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-04-14.log.html#t2020-04-14T21:30:31 was the comment i was thinking of | 23:07 |
ianw | "that happens to make it so that tristanC can avoid running the ensure role too if he wants to define a new tox job." | 23:08 |
tristanC | yes, the system is working, and i'm happy to keep supporting sudo-less environment. I'm also happy to drop the support, as long as we have an agreement | 23:08 |
clarkb | https://review.opendev.org/#/c/717663/26 is green now | 23:09 |
clarkb | from sf.io I mean | 23:09 |
corvus | ianw: the context for that quote was that tristanC was concerned that we were doing extra work that wasn't necessary for him. | 23:09 |
corvus | that's different than our current understanding, which is that if we merged that change we would have broken a working system | 23:10 |
corvus | (so, to be clear, i support tristanC optionally creating a new job that is more efficient; but at this point i don't think we're saying that should be required in order for the basic thing to work) | 23:10 |
ianw | right, that's ok | 23:12 |
tristanC | corvus: clarkb: it seems like there is value in being able to associate job with prepared runtime known to be working for a specific task. So perhaps we could start designing an extra zuul-jobs project that provides job play using the role from zuul-jobs. | 23:12 |
tristanC | we could even agree on label names and provide nodesets too | 23:12 |
corvus | tristanC: i'm not sure i'm ready to give up on having a tox job in zuul-jobs that works everywhere | 23:13 |
corvus | it seems like there's a path forward here, so maybe let's see how good we can make that before we fork | 23:13 |
ianw | now i'm starting to wonder if having the virtualenv bits in ensure-pip is a good idea | 23:21 |
openstackgerrit | Merged zuul/zuul-jobs master: Check if pip is preinstalled before installing it https://review.opendev.org/720254 | 23:26 |
ianw | looking at the keystone job | 23:35 |
ianw | https://zuul.opendev.org/t/openstack/build/152dd7622d8b404589d09d120986ed25/log/job-output.txt#1662 | 23:35 |
*** tosky has quit IRC | 23:39 | |
*** mlavalle has quit IRC | 23:47 | |
ianw | cmurphy: ^ i can not understand where this is coming from :/ | 23:55 |
ianw | https://opendev.org/openstack/devstack/src/branch/master/stackrc#L152 ... it should be using venv ... it must be a branch or something i haven't considered | 23:55 |
fungi | stable/stein, right? | 23:56 |
cmurphy | https://zuul.opendev.org/t/openstack/build/152dd7622d8b404589d09d120986ed25/ is on master not stein | 23:57 |
fungi | yeah, just double-checked | 23:57 |
fungi | https://zuul.opendev.org/t/openstack/build/152dd7622d8b404589d09d120986ed25/log/zuul-info/inventory.yaml#131 | 23:57 |
fungi | so codesearch is returning the relevant hits in that case | 23:58 |
fungi | only seems to appear in devstack | 23:58 |
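
The ensure-pip discussion above (22:39 to 23:02) comes down to three rules: skip the install when pip is already present, only use become/sudo when an install is actually needed, and prefer `python -m venv` over `virtualenv`. A minimal Python sketch of that decision follows, assuming a hypothetical helper named `plan_ensure_pip`; it is not the actual zuul-jobs role, and the returned key names are illustrative only.

```python
def plan_ensure_pip(pip_present, venv_present, virtualenv_present):
    """Hypothetical sketch of the no-op logic discussed in the log.

    This is not the real ensure-pip role; names and keys are assumptions.
    """
    # Prefer the stdlib venv module; fall back to virtualenv if it exists.
    if venv_present:
        venv_command = "python3 -m venv"
    elif virtualenv_present:
        venv_command = "virtualenv"
    else:
        # The "weird spot" from the log: pip exists but nothing can
        # create an environment, and without sudo it cannot be fixed.
        venv_command = None

    return {
        # Root (become) is only requested when an install is needed,
        # so a preinstalled, sudo-less image stays a true no-op.
        "needs_root": not pip_present,
        "virtualenv_command": venv_command,
    }


if __name__ == "__main__":
    # Preinstalled image without sudo: nothing to install, venv preferred.
    print(plan_ensure_pip(True, True, False))
```

The point of keeping the decision in one pure function is that the sudo-less case can be checked without touching the system at all.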