ianw | and i need to fix the base deployment issues | 00:00 |
---|---|---|
Ramereth | ianw: I've been trying to catch those instances and clean them up. Do we have some right now? | 00:01 |
ianw | i'm seeing 5 in ERROR status from our client | 00:01 |
ianw | fault | {'message': 'libvirtError', 'code': 500, 'created': '2021-06-16T23:56:47Z'} | 00:02 |
ianw | looks like they've all got libvirt error | 00:03 |
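(A minimal sketch of how those ERROR instances can be listed and inspected with the standard OpenStack client; the cloud name and server UUID below are placeholders, not the actual provider configuration.)

```shell
# List servers stuck in ERROR state for the affected provider (cloud name is illustrative)
openstack --os-cloud osuosl server list --status ERROR

# The libvirtError shown above comes from the server's "fault" field
openstack --os-cloud osuosl server show <server-uuid> -f value -c fault
```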
ianw | fungi: in openstackzuul-linaro-us i only see active servers? | 00:10 |
Ramereth | checking | 00:14 |
Ramereth | yup, more qemu-kvm processes in a defunct state :/ | 00:16 |
Ramereth | I wish I knew why that keeps happening. I don't see this on our ppc or x86 clusters | 00:16 |
Ramereth | i'll have to work on cleaning those up tomorrow as I'm about to head out for the day | 00:21 |
opendevreview | Merged opendev/system-config master: openafs-client: add service timeout override https://review.opendev.org/c/opendev/system-config/+/796578 | 00:22 |
ianw | Ramereth: no problems. i'll likely re-enable osuosl once ^ deploys | 00:22 |
fungi | ianw: nodepool list when run on nl03 at least shows 15 nodes in a deleting state for linaro-us for me | 00:22 |
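(Roughly the check being run on nl03; the provider name comes from the conversation, the grep itself is illustrative.)

```shell
# Nodes nodepool still tracks as "deleting" in the linaro-us provider
nodepool list | grep linaro-us | grep deleting
```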
ianw | ahhh, so ZK is out of sync with the cloud i guess | 00:23 |
ianw | i was looking via openstack client | 00:23 |
fungi | could be, yeah | 00:23 |
fungi | possible these are cluttering up zk and so nodepool thinks they're occupying quota? | 00:24 |
ianw | that could be true. i can delete the nodes by hand | 00:24 |
ianw | (the ZK nodes ... too much overloading of node :) | 00:24 |
fungi | we've got quite the backlog of queued arm64 builds and linaro-us now has only deleting and two ready nodes which are over a day old | 00:25 |
Ramereth | ianw: how long until that deploys? Maybe I should go ahead and do it now | 00:26 |
fungi | not sure why we'd have a ready ubuntu-focal-arm64 and ubuntu-bionic-arm64 node in linaro-us for more than a day | 00:26 |
fungi | would have expected those to get used sooner, but maybe all the arm64 testing is centos/fedora | 00:26 |
ianw | Ramereth: probably an hour, but then i also have to merge the change to re-enable. i think it can wait | 00:26 |
ianw | hrm, system-config at a minimum should slurp those up you'd think | 00:27 |
ianw | but most of the testing is tox-ish things as well, that i think would all be focal | 00:27 |
ianw | ok, i'm into the zk shell just correlating now | 00:31 |
ianw | fungi: hrm, i don't see lots of deleted nodes. i wonder if we're getting launch failures | 00:33 |
ianw | no i tell a lie, sorry, they don't have images attached so weren't matching arm64 grep | 00:34 |
ianw | ok, i've cleared the ZK nodes, nodepool list is no longer showing them | 00:39 |
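(A sketch of the zk-shell correlation and cleanup described above, assuming the standard /nodepool znode layout; the node id is a placeholder.)

```shell
# Inside zk-shell, connected to the nodepool ZooKeeper cluster
ls /nodepool/nodes                # enumerate node znodes
get /nodepool/nodes/<node-id>     # inspect one to confirm provider and state
rmr /nodepool/nodes/<node-id>     # recursively remove a leaked znode
```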
ianw | 2021-06-17 00:39:51,268 DEBUG nodepool.PoolWorker.linaro-us-regionone-main: Active requests: [] | 00:40 |
ianw | it doesn't feel like it thinks it has anything to do | 00:40 |
fungi | mmm | 00:40 |
ianw | i might have to restart nl03, i feel like i've seen this before | 00:41 |
ianw | 2021-06-17 00:43:31,423 DEBUG nodepool.PoolWorker.linaro-us-regionone-main: Active requests: ['300-0014437271', '300-0014440634', '300-0014440637', '300-0014441811', '300-0014455060', '300-0014455555', '300-0014437862', '300-0014437863', '300-0014454786', '300-0014455047', '300-0014462863', '300-0014462864', '300-0014454788', '300-0014454789', '300-0014455322', '300-0014455325', '300-0014463009', '300-0014463010', '300-0014454915', '300-0014409073', | 00:43 |
ianw | '300-0014422609', '300-0014433950', '300-0014434049'] | 00:43 |
ianw | that's more like it | 00:43 |
fungi | ahh, excellent, thanks | 00:43 |
ianw | we now have 25 servers building in linaro | 00:44 |
fungi | yes, that sounds like what we should normally see | 00:45 |
ianw | i think the fact that nodepool didn't notice those VMs were gone *and* thought that there were no requests despite the long queue are related. i'm not sure how though | 00:50 |
Ramereth | ianw: alright, I cleared those out. Heading out for real now | 00:50 |
ianw | Ramereth: thank you! | 00:51 |
ianw | i've enabled the openafs service and am rebooting the osuosl mirror. the timeout has applied "A start job is running for OpenAFS client (1min 2s / 8min 8s)" | 01:07 |
ianw | 2min 54s / 8min 8s and it's up | 01:09 |
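(For reference, the merged change is conceptually a systemd drop-in along these lines; the exact unit, filename, and value used in 796578 may differ — the "8min 8s" above suggests a limit of roughly eight minutes.)

```shell
# Hypothetical drop-in extending the OpenAFS client start timeout
sudo mkdir -p /etc/systemd/system/openafs-client.service.d
sudo tee /etc/systemd/system/openafs-client.service.d/timeout.conf <<'EOF'
[Service]
TimeoutStartSec=8min
EOF
sudo systemctl daemon-reload
```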
opendevreview | Merged openstack/project-config master: Revert "Disable the osuosl arm64 cloud" https://review.opendev.org/c/openstack/project-config/+/796585 | 01:19 |
ianw | i think the hosts missing the EFI mounts are limited | 01:42 |
ianw | mirror01.ord.rax.opendev.org and bridge.openstack.org. since we're not starting bionic nodes, i think i'll just fix this by hand | 01:43 |
melwitt | https://review.opendev.org/c/opendev/jeepyb/+/795912 converts blueprint integration in jeepyb to the gerrit API if anyone is interested. here's the system-config change to re-enable blueprint integration for patchset-created https://review.opendev.org/c/opendev/system-config/+/795914 | 02:06 |
*** ykarel|away is now known as ykarel | 04:14 | |
*** zbr is now known as Guest2488 | 05:03 | |
opendevreview | chandan kumar proposed openstack/project-config master: Added publish-openstack-python-tarball job https://review.opendev.org/c/openstack/project-config/+/791745 | 05:28 |
opendevreview | Dr. Jens Harbott proposed opendev/git-review master: Fix nodeset selections for zuul jobs https://review.opendev.org/c/opendev/git-review/+/796754 | 05:29 |
*** marios is now known as marios|ruck | 05:44 | |
*** raukadah is now known as chandankumar | 05:46 | |
opendevreview | Merged opendev/system-config master: bridge: upgrade to Ansible 4.0.0 https://review.opendev.org/c/opendev/system-config/+/792866 | 06:20 |
opendevreview | Florian Haas proposed opendev/git-review master: Support the Git "core.hooksPath" option when dealing with hook scripts https://review.opendev.org/c/opendev/git-review/+/796727 | 06:52 |
*** rpittau|afk is now known as rpittau | 07:22 | |
*** jpena|off is now known as jpena | 07:31 | |
*** raukadah is now known as chandankumar | 08:28 | |
hjensas | Hi, the change at the top of the queue for TripleO has been sitting there with tripleo-ci-centos-8-undercloud-upgrade-victoria "queued" for 12ish hours. See: https://zuul.opendev.org/t/openstack/status#tripleo | 08:35 |
hjensas | Almost all the jobs behind it in the queue have finished, but with the job at the top of the queue stuck the whole queue seems stuck? Any chance of poking it so that the job in "queued" is started? | 08:36 |
*** ykarel is now known as ykarel|lunch | 08:38 | |
hjensas | #opendev anyone around who can take a look at TripleO's stuck gate queue? https://zuul.opendev.org/t/openstack/status#tripleo | 09:38 |
*** ykarel|lunch is now known as ykarel | 09:45 | |
frickler | hjensas: for a fast workaround, you could likely drop that patch from the queue. not sure we'd get to any further debugging before the event next week, but for that you'd likely have to wait a couple of hours for corvus to show up | 09:46 |
hjensas | frickler: yeah, we may just have to abandon that patch and restore it. Any idea when corvus usually shows up? | 10:02 |
hjensas | fyi, TripleO abandoned the blocking change, and restored it to get the queue unstuck. | 10:09 |
*** jpena is now known as jpena|lunch | 11:34 | |
*** amoralej is now known as amoralej|lunch | 12:00 | |
*** whayutin is now known as weshay|ruck | 12:10 | |
opendevreview | Ananya Banerjee proposed opendev/elastic-recheck master: Run elastic-recheck container https://review.opendev.org/c/opendev/elastic-recheck/+/729623 | 12:14 |
opendevreview | Ananya Banerjee proposed opendev/elastic-recheck master: Run elastic-recheck container https://review.opendev.org/c/opendev/elastic-recheck/+/729623 | 12:20 |
*** jpena|lunch is now known as jpena | 12:32 | |
fungi | hjensas: frickler: it does look more generally like there may be some stuck node requests across the board though... for example an openstack/ovsdbapp has been sitting in check for 137 hours waiting for nodes for some of its unit and static analysis jobs | 12:33 |
fungi | trying to track those down now and probably restart launchers to free any of the locks they're holding on those node requests | 12:34 |
hjensas | fungi: in case you didn't see, we abandoned the stuck change in the tripleo queue. So it is resolved there, but thanks for investigating! :) | 12:47 |
fungi | hjensas: yep, thanks, i've got plenty of other candidates to serve as examples this time | 12:48 |
fungi | here's the node request i'm hunting down: | 12:48 |
fungi | 2021-06-11 19:39:33,602 DEBUG zuul.Pipeline.openstack.check: [e: 08ab26431732453d87d673e2aaaf138e] Adding node request <NodeRequest 300-0014404756 <NodeSet ubuntu-focal [<Node None ('ubuntu-focal',):ubuntu-focal>]>> for job openstack-tox-lower-constraints to item <QueueItem af28c45f00f441238e6bbf3099a4a98c for <Change 0x7fe02f02eeb0 openstack/ovsdbapp 795892,1> in check> | 12:48 |
fungi | looks like it was taken by nl04: | 12:51 |
fungi | 2021-06-11 19:40:43,357 INFO nodepool.NodeLauncher: [e: 08ab26431732453d87d673e2aaaf138e] [node_request: 300-0014404756] [node: 0025081525] Node is ready | 12:52 |
fungi | but it never unlocked the request | 12:52 |
fungi | so same symptoms we've been finding in other cases | 12:52 |
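(The correlation being done here is essentially grepping the request id across the scheduler and launcher debug logs; the scheduler log path is an assumption, the launcher path is the one mentioned below.)

```shell
REQ=300-0014404756
grep "$REQ" /var/log/zuul/debug.log                 # on the scheduler (path assumed)
grep "$REQ" /var/log/nodepool/launcher-debug.log    # on the launcher that took the request
```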
fungi | i wonder if a thread dump will prove useful, i'll trigger one | 12:53 |
*** amoralej|lunch is now known as amoralej | 13:02 | |
fungi | okay, i've sent 12 (sigusr2) to the child launcher process twice now roughly a minute apart and the dumps are in /var/log/nodepool/launcher-debug.log at 2021-06-17 13:11:43,516-13:12:55,304 | 13:13 |
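(A minimal sketch of triggering those dumps, assuming a single launcher process on the host; SIGUSR2 is signal 12 on x86 Linux.)

```shell
# Ask the nodepool launcher to write thread stack traces to its debug log
kill -USR2 $(pgrep -f nodepool-launcher)
# Then look for the dump in /var/log/nodepool/launcher-debug.log
```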
fungi | i'm going to restart the container now to release the locks on those node requests | 13:14 |
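(The restart itself is typically just a compose bounce; the directory and service naming here are guesses rather than what nl04 actually uses.)

```shell
cd /etc/nodepool-launcher-compose   # path is a guess
docker-compose down
docker-compose up -d
docker-compose logs --tail=50       # confirm the launcher came back up
```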
fungi | #status log Restarted the nodepool-launcher container on nl04.opendev.org to release stale node request locks | 13:15 |
opendevstatus | fungi: finished logging | 13:15 |
fungi | and now i'm seeing many of the stuck builds transition from queued to running | 13:17 |
fungi | corvus: probably not particularly urgent as it's not that debilitating and we've been seeing it for months off and on so no idea when it started, but do you have any good suggestions for how to try to track this down? | 13:34 |
fungi | most of the time it seems to happen amid flurries of node launch failures, though that perception could also be selection bias on my part | 13:35 |
fungi | sometimes it's a decline after three launch failures, sometimes it's a successful node launch, but the commonality is that it either doesn't get communicated back to the scheduler or the scheduler loses the event somehow | 13:37 |
fungi | not entirely sure which | 13:37 |
fungi | also the trimmed up thread dumps are now in nl04.opendev.org:~fungi/2021-06-17_threads.log | 13:38 |
corvus | fungi: which nl has that debug log ^? | 13:38 |
corvus | heh :) | 13:38 |
fungi | 04 | 13:38 |
fungi | if catching a launcher actively in this state would help, i can avoid restarting next time it comes up | 13:41 |
fungi | but generally when it happens it's blocking more than just a few changes | 13:41 |
*** marios|ruck is now known as marios|call | 14:02 | |
corvus | fungi: there was a zookeeper connection suspension between when the server finished booting and when nodepool should have marked the request fulfilled and unlocked the nodes. it was only a suspension, which means that it should have been able to pick up without losing anything. additionally, that node request and several others all disappeared from the list of current requests without a log entry; that's not supposed to be possible. | 14:18 |
corvus | i don't think keeping the launcher in that state would provide more info. i think that's enough clues to figure out what debug info we're missing | 14:19 |
fungi | thanks, and i don't recall seeing it prior to maybe february, but i suppose it could have been lurking there for as long as we've been handling node requests via zk... when we're regularly restarting the launchers it tends to just solve itself | 14:21 |
*** marios|call is now known as marios|ruck | 14:44 | |
*** ykarel is now known as ykarel|away | 15:10 | |
*** ysandeep is now known as ysandeep|out | 15:33 | |
clarkb | looks like the osuosl mirror got sorted out, that is great news | 15:34 |
clarkb | fungi: re gerrit yes I should page that back in, then ask for review once I'm back up to speed with the data. I'll see if I can get to that today or tomorrow | 15:34 |
*** rpittau is now known as rpittau|afk | 16:09 | |
*** sshnaidm is now known as sshnaidm|afk | 16:16 | |
*** marios|ruck is now known as marios|out | 16:28 | |
clarkb | I'm looking into why the LE certs didn't refresh after 60 days as expected for the names we got emails about | 16:34 |
clarkb | it appears that nb03.opendev.org failed for some reason in the last couple of LE playbook runs which meant the certs haven't updated anywhere. | 16:34 |
clarkb | looks like a full /opt on nb03 is causing that. | 16:34 |
*** amoralej is now known as amoralej|off | 16:35 | |
*** jpena is now known as jpena|off | 16:36 | |
clarkb | /opt/nodepool_dib (where we store the images that get uploaded) is only about 1/3 of the disk use. This implies we've leaked a bunch in the dib_tmp dir. I'll stop the daemon, clear out dib_tmp, then restart things | 16:38 |
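(Roughly the cleanup that follows; the dib tmp path and container name are assumptions based on the conversation, not verified against nb03.)

```shell
docker stop nodepool-builder     # container name is a guess
rm -rf /opt/dib_tmp/*            # leaked diskimage-builder temp dirs
df -h /opt                       # confirm space is recovered
docker start nodepool-builder
```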
opendevreview | Clark Boylan proposed opendev/system-config master: Add note about afs01's mirror-update vos releases to docs https://review.opendev.org/c/opendev/system-config/+/796893 | 16:53 |
clarkb | infra-root ^ that is an update to the openafs docs based on what I discovered the hard way earlier this week :) | 16:54 |
frickler | clarkb: fungi: I tried to fix the nodeset selection for git-review jobs, which I think succeeded, but now there are some issues with the test instance of gerrit not starting, see https://review.opendev.org/c/opendev/git-review/+/796754 | 17:07 |
frickler | seems to be some kind of race condition or similar, as it hits only one out of 5 jobs | 17:08 |
clarkb | agreed and seems to have only hit 8 tests in that job | 17:10 |
clarkb | we don't seem to collect the gerrit startup logs in that case to see what went wrong though | 17:11 |
clarkb | re next step it is probably modifying the test suite to grab the gerrit error log file to see why it breaks | 17:12 |
fungi | i thought we did collect the gerrit logs | 17:13 |
clarkb | fungi: I don't see them on that job or in the job-output.txt | 17:14 |
rosmaita | when someone has a few minutes, i'm trying to make sure a job has python 3.6 available, but the change i made to .zuul.yaml on this patch isn't working, it's still reporting "InterpreterNotFound: python3.6": https://review.opendev.org/c/openstack/python-brick-cinderclient-ext/+/796835 | 17:31 |
rosmaita | i'm using "python_version" in the job vars, do i need something else? | 17:31 |
clarkb | rosmaita: you have to run the job on a platform that has the python version you want | 17:31 |
fungi | make sure it's ubuntu-bionic | 17:31 |
clarkb | I expect that job is running on focal which doesn't have python3.6. Bionic is probably what you want | 17:32 |
clarkb | ya | 17:32 |
fungi | if you don't specify a nodeset, right now you get ubuntu-focal which is too new | 17:32 |
rosmaita | oh, ok | 17:32 |
rosmaita | and i guess python_version is still a good idea? | 17:32 |
fungi | our abstract py36 jobs set the nodeset so that child jobs will inherit that, but if you're creating a job which doesn't inherit then you need a nodeset which has the version of python you need or you need to specify an alternative means of installing it | 17:33 |
rosmaita | ok, thanks, will revise my patch | 17:34 |
fungi | setting python_version is fine for jobs which switch on that, though it's more useful for selecting a non-default interpreter on whatever node label you're running against | 17:34 |
fungi | e.g. ubuntu-focal defaults to python 3.8 but has 3.9 packages available, so you might want to use python_version to tell the job to use 3.9 on focal | 17:35 |
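(A hedged sketch of what that looks like in a project's .zuul.yaml; the job name, parent, and exact var usage are illustrative rather than the project's actual configuration.)

```yaml
- job:
    name: brick-cinderclient-ext-py36   # illustrative name
    parent: openstack-tox               # assumed parent that honors python_version
    nodeset: ubuntu-bionic              # bionic ships python 3.6
    vars:
      python_version: 3.6
```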
rosmaita | that's helpful, i understand this better now | 17:36 |
clarkb | corvus: reading the matrix spec and one thing that stands out to me is that the homeserver is responsible for maintaining scrollback (until the end of time I guess?). While this makes sense I wonder what sort of storage needs are required? it's basically the entire channel log with a pointer for each user's client indicating where they were last caught up? | 17:43 |
clarkb | corvus: any idea how that scales over time? do we need a potentially ever growing data store for this information (not that channel logs tend to be large but trying to understand what is involved there) | 17:44 |
fungi | clarkb: if it helps, the channel logs for most of the channels we'd be talking about, as recorded by the meetbot, presently occupy 17gb on disk and this includes the htmlified copies as well | 17:48 |
fungi | technically that's ever-growing but we've never really come close to running out of space for it | 17:49 |
clarkb | fungi: that is good info. Probably a decent estimate for matrix disk needs (though I expect matrix adds significantly more metadata, the order of magnitude is probably similar) | 17:49 |
corvus | clarkb: that's my understanding -- at least for the users on that homeserver that are in that room. of course, we will almost certainly have at least one "user" in each room on our homeserver in the form of a bot. | 17:51 |
fungi | though i agree, having someone with a homeserver which has channel history provide an estimate for how much room logging needs by comparison to our meetbot logging might be good | 17:51 |
fungi | like x% the size of a plain text log for the equivalent timeframe | 17:51 |
clarkb | another thing that comes to mind is how upgrades are done. Can those be done without taking chat down? | 17:52 |
clarkb | I suspect we'd be required to run multiple homeservers to accomplish that? | 17:52 |
corvus | clarkb: it's possible there are ways to reduce or truncate the data since our bots wouldn't need history. however, it seems likely that we might want the opendev.org homeserver to provide history as a service for new folks connecting. so i think we should expect to keep indefinite history, not as a technical requirement, but as a good service. | 17:52 |
clarkb | yup | 17:53 |
clarkb | scrollback is one of the key features people always talk about so keeping that around makes sense to me | 17:53 |
corvus | clarkb: synapse is a python program with a database; i'd expect minimal downtime between upgrades. remember that users won't see that downtime, except maybe in just a little bit of extra lag, as their homeserver queues updates. | 17:53 |
corvus | clarkb: i have stopped and restarted my homeserver several times during conversations with you :) | 17:54 |
JayF | fungi: holy @#$%, I've been using `skipdist` in about every tox file I've created since the beginning of time. I was wondering why it always seemed so ineffective | 17:54 |
JayF | thank you for that email :D | 17:54 |
corvus | clarkb: i have no idea how accurate this is: https://matrix.org/docs/projects/other/hdd-space-calc-for-synapse | 17:54 |
clarkb | corvus: ya I guess the impact would be due to prolonged outages, short outages for updates should go unnoticed | 17:54 |
corvus | clarkb: but it is certainly a thing you can put numbers in and get other numbers out of :) | 17:54 |
fungi | JayF: yw | 17:55 |
corvus | clarkb: i'd have to check the actual Matrix Specification for this, but it may be the case that only bots and new users would notice homeserver downtime. | 17:56 |
corvus | (given that we're talking of only using our homeserver for bots) | 17:56 |
clarkb | new users because they wouldn't be able to join the channels in that moment? | 17:57 |
corvus | right | 17:57 |
corvus | clarkb: the high-level thing under "how does it work" at https://matrix.org/ leads me to believe that's the case | 17:59 |
corvus | oh we should steal that "next" animation idea for zuul | 17:59 |
corvus | essentially, our bots would get an eventually-consistent view of the room after the homeserver came back up | 18:01 |
clarkb | I've brought the subject up over at the foundation. Pointed out the oddity in potential pricing for element hosted homeserver for us and asked if there was any reason to not reach out to them now and start a conversation to understand what element can do better. I also mentioned corvus and mordred are willing to meet up and talk about it in more depth | 18:02 |
corvus | (also, fwiw, i think new users could technically join the room if any other homeserver involved chose to publish it at an alternate address) | 18:02 |
clarkb | I'm also trying to build a model in my head for what hosting a homeserver looks like if we do it ourselves. Seems like we should expect some disk and network bw. | 18:03 |
clarkb | And for upgrades maybe we can have docker compose simply update to the latest image constantly? | 18:03 |
mordred | ++ | 18:04 |
corvus | <clarkb "I'm also trying to build a model"> yep; minimal for #zuul, potentially significant for other communities | 18:04 |
corvus | <clarkb "And for upgrades maybe we can ha"> I haven't tried an upgrade yet; that sounds reasonable, but it's definitely a wildcard in my mind. | 18:04 |
*** dviroel is now known as dviroel|away | 18:05 | |
clarkb | having a plan for those as well as understanding what server upgrades look like is probably important before going down the run our own path | 18:05 |
corvus | yup | 18:05 |
clarkb | in particular for the host server I wonder if changing IPs will be a problem or if we can spin up a new one alongside and do a failover of sorts, etc | 18:05 |
clarkb | a sync + failover seems like the sort of thing matrix may just do out of the box given how it does federation | 18:06 |
fungi | yeah, i asked similar questions about server redundancy in the zuul spec | 18:06 |
*** slittle1 is now known as Guest2552 | 18:06 | |
fungi | seems like something we need a little more research around | 18:06 |
corvus | clarkb: also potentially interesting info for the foundation is that we have contacts in ansible, fedora, gnome, and mozilla orgs all of whom are in various stages of the same process (most/all of whom are choosing to use EMS hosting) and they seem happy to help with the process | 18:07 |
corvus | changing ips will not be a problem | 18:07 |
clarkb | corvus: oh ya it would be interesting to get info about EMS experiences from them I bet | 18:08 |
clarkb | On the subject of history collection I wonder if we can simply expose those logs somehow as the canonical historical record without needing an account and authentication (just so we don't end up with a copy of them all on the homeserver and another copy where the bot lives) | 18:11 |
fungi | that might get tricky to filter if the homeserver has history from any private channels for some reason | 18:13 |
fungi | though if we can be sure the only history it has is safe to publish, then perhaps easier | 18:14 |
clarkb | fungi: good point. Point to point private comms are e2e encrypted so it would just be private channels that have this issue I think | 18:15 |
clarkb | but definitely something to check on if we explore that option further | 18:15 |
corvus | clarkb: i don't think they're natively stored in a way that's usable for that purpose (you know, directed graph structure and all). doing that with a web app is probably computationally expensive. my guess is for purposes of search engine indexing, etc, just having flat files is best. but maybe there's a way to export a room history to flat file to avoid needing the bot. | 18:15 |
clarkb | corvus: gotcha | 18:15 |
fungi | or a way to make a plugin which does that in more lightweight ways than a typical "bot" | 18:16 |
corvus | fungi: yeah, an app service may be able to do stuff like that, but i'm not very familiar | 18:17 |
clarkb | corvus: mordred: do you know if https://element.io/contact-sales is the best way to reach out to EMS? | 18:19 |
clarkb | or rather, do you have a better contact? if not then ^ is probably easiest | 18:19 |
corvus | clarkb: i think so | 18:28 |
corvus | er, i think that's the best way to get started; i don't have a better contact :) | 18:29 |
opendevreview | melanie witt proposed opendev/jeepyb master: Convert update_blueprint to use the Gerrit REST API https://review.opendev.org/c/opendev/jeepyb/+/795912 | 18:29 |
*** dviroel|away is now known as dviroel | 18:53 | |
clarkb | melwitt: I left a couple of comments on https://review.opendev.org/c/opendev/jeepyb/+/795912 The first one would probably be a good followup and the other is more of a "is this even possible question" | 19:41 |
fungi | oh, thanks for the reminder, i was going to review that today | 19:44 |
clarkb | ok cleanup on nb03 has completed and I have restarted the builder there | 19:48 |
clarkb | hopefully the periodic LE job tonight runs successfully and we don't get warnings about expiring certs tomorrow | 19:48 |
fungi | unrelated, gitea01 has been reporting backup problems. i saw the e-mails for the past ~week but haven't found time to look into it yet | 19:49 |
clarkb | fungi: those backups are primarily there to keep db updates around and db updates are primarily important for project renames. If we don't do a project rename until that is fixed it may not be urgent. Of course understanding why it broke would be nice | 19:52 |
fungi | yeah, more or less what i was thinking, i just wanted to at least mention it so i know i'm not the only one aware there's a problem | 19:54 |
clarkb | ++ | 19:55 |
melwitt | clarkb: ok thanks, will look | 20:09 |
y2kenny | With OpenDev's Zuul, is it possible for organizations to add additional nodepool and/or executors? If so, what is the process? | 20:38 |
y2kenny | (or perhaps "attach" additional nodepool/executor is better wording...) | 20:40 |
fungi | y2kenny: we haven't worked out a process for adding provider-specific builders/launchers/excutors, so far we've only got central services connecting to publicly accessible cloud apis with public addressing (in some cases ipv6-only) to reach the nodes: https://docs.opendev.org/opendev/system-config/latest/contribute-cloud.html | 20:45 |
y2kenny | fungi: Is it something that's in the plan or is that something that will be tricky? | 20:46 |
y2kenny | fungi: This is still very much on the drawing board right now, but the kind of testing resources I am thinking of is not generally available in the cloud (baremetal HW with GPUs.) | 20:47 |
fungi | y2kenny: we've talked about adding a zoned executor for one prospective donor who has no ipv6 connectivity to their environments and extremely limited ipv4 capacity, but we haven't talked about dedicated builders or launchers at all | 20:48 |
fungi | and even the environment with the dedicated executor would be a slight reduction in functionality since we'd need to attach floating ips to give developers remote access to held nodes | 20:49 |
y2kenny | fungi: ok... so is this something that you guys would be interested in starting a conversation on or is it too complex to do any time soon? | 20:50 |
fungi | y2kenny: i guess we'd need to talk about what the architecture would look like and what amount and sort of capacity we're talking about, to determine whether the engineering work needed to support it would be sufficiently offset, since there's just not that many people helping design and maintain our control plane these days | 21:16 |
y2kenny | fungi: ok, I see. | 21:17 |
y2kenny | fungi: what is the right place to continue the discussion? (service-discuss@lists.opendev.org ?) | 21:22 |
fungi | y2kenny: sure, here or there, though the mailing list is better for longer-term asynchronous discussion | 21:33 |
fungi | the handful of sysadmins who are active on opendev are scattered around the globe, so not all awake/around right now | 21:33 |
y2kenny | fungi: understood. | 21:39 |
opendevreview | melanie witt proposed opendev/jeepyb master: Convert update_blueprint to use the Gerrit REST API https://review.opendev.org/c/opendev/jeepyb/+/795912 | 21:43 |
ianw | clarkb: ahh, thanks for finding the letsencrypt job failure | 21:53 |
ianw | also, i just manually added fstab entries for EFI on our rax bionic hosts | 21:54 |
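(Roughly the manual fix; the UUID is a placeholder and the mount options are a typical guess for an EFI system partition.)

```shell
# Find the EFI system partition, then add it to fstab and mount it
blkid -t TYPE=vfat
echo 'UUID=XXXX-XXXX  /boot/efi  vfat  umask=0077  0  1' | sudo tee -a /etc/fstab
sudo mount /boot/efi
```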
ianw | yesterday on base i was seeing almost random "-13" errors; the only thing i could find was an ancient devstack-gate launchpad bug from yourself that mentioned iptables randomly returning -13 (no permissions) | 21:54 |
ianw | looks like last night https://review.opendev.org/c/opendev/system-config/+/792866 failed to install updated ansible on bridge; looking | 21:56 |
ianw | ### ERROR ### | 21:57 |
ianw | Upgrading directly from ansible-2.9 or less to ansible-2.10 or greater with pip is | 21:57 |
ianw | known to cause problems. Please uninstall the old version found at: | 21:57 |
ianw | i guess this is not worth coding for. i'll manually uninstall and re-install ansible 4.0.0 once to get around this. we don't see this in the gate with fresh installs | 22:00 |
ianw | "Ansible will require Python 3.8 or newer on the controller starting with Ansible 2.12." ... we should think about bridge upgrade process too | 22:04 |
ianw | #status log manually performed uninstall/reinstall for bridge ansible upgrade from https://review.opendev.org/c/opendev/system-config/+/792866 | 22:05 |
opendevstatus | ianw: finished logging | 22:05 |
clarkb | ianw: that is an interesting move by ansible since they have historically kept compat with really ancient python. I guess they decided that isn't sustainable (good for them) | 22:46 |
clarkb | fungi: y2kenny: also note that there are specs up to zuul to make that sort of allocation easier on a per tenant basis | 22:47 |
opendevreview | Merged opendev/system-config master: review02 : switch reviewdb to mariadb_container type https://review.opendev.org/c/opendev/system-config/+/795192 | 22:57 |
ianw | clarkb: it is only on the controller side though | 23:03 |
clarkb | ianw: ah | 23:14 |
fungi | yeah, what version of python interpreter ansible can be installed with, not what version it can orchestrate | 23:18 |
opendevreview | Merged openstack/project-config master: Add gmann to IRC accessbot https://review.opendev.org/c/openstack/project-config/+/795986 | 23:31 |
opendevreview | Merged openstack/project-config master: Added publish-openstack-python-tarball job https://review.opendev.org/c/openstack/project-config/+/791745 | 23:31 |
corvus | clarkb: i just upgraded my homeserver to matrixdotorg/synapse:latest; it automatically applied 1 db schema update | 23:42 |
corvus | that was just a pull followed by docker-compose up -d | 23:43 |
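(i.e. something along these lines; the compose directory is whatever holds the synapse service definition and is a guess here.)

```shell
cd ~/synapse-compose    # path is a guess
docker-compose pull
docker-compose up -d    # synapse applies any pending db schema migrations on start
```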
clarkb | nice | 23:43 |
fungi | so if i want to run a homeserver should i work out the docker orchestration, or do you expect the matrix-synapse 1.36.0 package in debian would suffice? | 23:47 |
opendevreview | Ghanshyam proposed openstack/project-config master: End project gating for retiring arch-wg repo https://review.opendev.org/c/openstack/project-config/+/796962 | 23:51 |
corvus | fungi: i think that would be fine. it's basically a single python daemon, at least unless you want to start running app services like a slack bridge; so if the debian packages have worked out all the python deps already, i don't see a big advantage. it uses sqlite by default for ease of testing, but they recommend postgres for production use. | 23:53 |
corvus | i'm about to migrate from sqlite to postgres; that's going to take a couple of minutes. so i'll go silent for a bit and... hopefully... let you know how it goes :) | 23:53 |
corvus | fungi, clarkb apparently it worked fine :) | 23:57 |