Friday, 2024-08-30

clarkbinfra-root one of the backup servers is complaining about needing a pruning soon00:10
clarkbI smell dinner so will have to defer to someone else on that00:10
fungii can take a look at it in a bit, finally back to my hotel room for the night00:10
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: WIP: 4k Block device support  https://review.opendev.org/c/openstack/diskimage-builder/+/92755000:17
opendevreviewJeremy Stanley proposed opendev/zone-opendev.org master: Add a Rackspace Flex SJC3 mirror server  https://review.opendev.org/c/opendev/zone-opendev.org/+/92755100:23
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Refactor selinux-fixfiles-restore into its own element  https://review.opendev.org/c/openstack/diskimage-builder/+/92755200:31
opendevreviewJeremy Stanley proposed opendev/system-config master: Add a Rackspace Flex SJC3 mirror server  https://review.opendev.org/c/opendev/system-config/+/92755300:34
fungi#status log Pruned backups on backup02.ca-ymq-1.vexxhost bringing volume usage down from 90% to 75%01:20
opendevstatusfungi: finished logging01:21
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add DIB_SKIP_GRUB_PACKAGE_INSTALL for bootloader  https://review.opendev.org/c/openstack/diskimage-builder/+/92755902:05
opendevreviewSteve Baker proposed openstack/diskimage-builder master: Add DIB_SKIP_GRUB_PACKAGE_INSTALL for bootloader  https://review.opendev.org/c/openstack/diskimage-builder/+/92755902:45
opendevreviewSteve Baker proposed openstack/diskimage-builder master: WIP Optionally use guestfish for extract-image  https://review.opendev.org/c/openstack/diskimage-builder/+/92756102:45
SvenKieskeHi, does anybody happen to know if I can somehow query zuul CI, or if there are existing aggregated reports for CI failures? Specifically I'm interested in looking up failures to reach external package mirrors for the kolla project.10:38
SvenKieskeI know I can filter jobs for failure, click through each job for kolla, download logs manually and grep for something like mirror errors (4XX, 5XX codes), but I'd rather avoid that if possible. :D10:39
ykarelSvenKieske, you can use opensearch https://docs.openstack.org/project-team-guide/testing.html#checking-status-of-other-job-results10:43
SvenKieskeah ty!10:54
SvenKieskemhm, I guess our build failure modes are just not fine grained enough for that to be helpful, but at least I guess I can get a quicker overview there and write some custom scripts to query the log files directly.11:00
SvenKieskeI was just wondering, because I know some people do analyze the reason why CI fails already (e.g. recheck reasons), if there is no prior art tooling around somewhere.11:01
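
A minimal sketch of such a custom script, assuming the public Zuul builds API at zuul.opendev.org; the job name, the 4xx/5xx regex, and the job-output.txt path are illustrative assumptions rather than an established tool:

```python
# Sketch: pull recent failed kolla builds from the Zuul builds API and grep
# their job-output.txt for HTTP 4xx/5xx responses from package mirrors.
# Assumptions: the "openstack" tenant, a job name chosen as an example, and
# that job-output.txt sits directly under each build's log_url.
import collections
import re
import requests

ZUUL_API = "https://zuul.opendev.org/api/tenant/openstack/builds"
# Heuristic: a URL followed later on the line by a 4xx/5xx status code.
MIRROR_ERR = re.compile(r"https?://(\S+?)/\S*\s.*\b(4\d\d|5\d\d)\b")

def failed_builds(job_name, limit=50):
    """Return recent failed builds for a job (list of dicts from the API)."""
    resp = requests.get(
        ZUUL_API,
        params={"job_name": job_name, "result": "FAILURE", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def mirror_failures(build):
    """Yield hostnames that appear with a 4xx/5xx in this build's console log."""
    if not build.get("log_url"):
        return
    log = requests.get(build["log_url"] + "job-output.txt", timeout=60)
    if log.status_code != 200:
        return
    for line in log.text.splitlines():
        m = MIRROR_ERR.search(line)
        if m:
            yield m.group(1)

counts = collections.Counter()
for build in failed_builds("kolla-ansible-debian-aarch64"):
    counts.update(mirror_failures(build))

for host, n in counts.most_common(20):
    print(f"{n:5d}  {host}")
```
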
fungireview comments are exposed by gerrit's rest api and so recheck messages can be queried there11:10
fungiat one time there was an elastic-recheck service we ran which used a curated set of queries against job logs in an attempt to identify failures due to already identified problems and report those in review comments, but that system was unmaintained and broken for many years and we finally turned it off because there was nobody with time to fix and maintain it11:12
fungisimilarly, we had bayesian (crm114) classification of log lines that tried to establish correlation between messages that were more likely in failing jobs and could therefore be related to novel failures11:13
fungiand we had a separate relational database of recent subunit exports from tempest jobs which could be used to identify the frequency of specific tests failing in order to determine test confidence and decide which ones should be skipped/fixed11:15
fungiall of these complex solutions, while useful to varying degrees, required 1. people who cared enough about job stability to use them regularly, and 2. people with sufficient time to maintain/troubleshoot/upgrade the implementations11:17
fungisince they would go weeks or months at a time with nobody even noticing they were down, it was clear we didn't even have #1, much less #211:18
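
A rough sketch of the Gerrit REST API approach fungi describes above; the project name and query window are placeholder choices, and the )]}' prefix handling follows Gerrit's documented anti-XSSI behaviour:

```python
# Sketch: count "recheck" comments on recent kolla-ansible changes via
# Gerrit's REST API. The project and query window are placeholders.
import json
import requests

GERRIT = "https://review.opendev.org"

def gerrit_get(path, **params):
    """GET a Gerrit REST endpoint and strip the )]}' anti-XSSI prefix."""
    resp = requests.get(GERRIT + path, params=params, timeout=30)
    resp.raise_for_status()
    return json.loads(resp.text.removeprefix(")]}'"))  # Python 3.9+

# Changes touched in the last 30 days (example query, adjust as needed).
changes = gerrit_get("/changes/", q="project:openstack/kolla-ansible -age:30d", n=100)

rechecks = []
for change in changes:
    for msg in gerrit_get(f"/changes/{change['_number']}/messages"):
        if "recheck" in msg.get("message", "").lower():
            rechecks.append((change["_number"], msg["message"].strip()))

print(f"{len(rechecks)} recheck comments found across {len(changes)} changes")
```
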
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Update the set-zuul-log-path-fact scheme to prevent huge url  https://review.opendev.org/c/zuul/zuul-jobs/+/92758213:10
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Update the set-zuul-log-path-fact scheme to prevent huge url  https://review.opendev.org/c/zuul/zuul-jobs/+/92758213:46
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Update the set-zuul-log-path-fact scheme to prevent huge url  https://review.opendev.org/c/zuul/zuul-jobs/+/92758214:45
SvenKieskefungi: as always, thanks for the background information. the reason I'm asking is that I'm trying to compile a list of external mirrors, sorted by how frequently jobs fail to reach them. the next step would be to bug you to mirror the worst offenders on openinfra premises instead :)14:56
SvenKieskeif you are curious, the list of externally mirrored stuff can be seen here: https://etherpad.opendev.org/p/KollaWhiteBoard#L246 or here: https://github.com/osism/issues/issues/1111 (this also has some more context)14:57
Ramereth[m]fungi: clarkb has there been any change this week in the performance? BTW, I should be getting the RAM today and I'm planning on getting that installed on the nodes later this afternoon. I will be disabling those nodes ahead of time so that no new zuul jobs get started while I do the upgrade. That might impact deployment if we don't have enough resources with the remaining nodes.15:03
SvenKieskemhm, I'm pretty sure I found a bug in gerrit.15:05
SvenKieskewould be nice if someone could reproduce before I report to upstream, it's just around comments, so easy to do.15:06
SvenKieske1. Create a new top-level comment on any changeset via the "REPLY" button at the top. 2. write anything. 3. make sure to UNCHECK the "Resolved" box 4. do _not_ click on the preview button 5. Click "SEND" 6. reload the page 7. watch your new "unresolved" comment be marked as resolved.15:07
SvenKieskethe preview thing seems to be important somehow, if I click on preview, it actually works as expected. tested in firefox on review.opendev.org15:08
fungiRamereth[m]: let me see if i can find the link we used to look at arm job timings previously for comparison15:09
SvenKieskemhm, weird, can't reproduce it every time..15:10
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Update the set-zuul-log-path-fact scheme to prevent huge url  https://review.opendev.org/c/zuul/zuul-jobs/+/92758215:10
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Fix the upload-logs-s3 test playbook  https://review.opendev.org/c/zuul/zuul-jobs/+/92760015:10
*** ykarel__ is now known as ykarel15:25
fungiRamereth[m]: maybe? i see some sub-hour completion times today for jobs that were taking closer to 2 hours previously: https://2f69333936e3feb7cea6-be6253c0e82f1539fed391a5717e06a0.ssl.cf5.rackcdn.com/t/openstack/builds?job_name=kolla-ansible-debian-aarch64&result=SUCCESS&skip=0&limit=10015:29
Ramereth[m]fungi: do you recall what times you were getting before?15:30
fungier, i meant https://zuul.opendev.org/t/openstack/builds?job_name=kolla-ansible-debian-aarch64&result=SUCCESS&skip=0&limit=100 for the url, but same view15:30
fungiRamereth[m]: skipping back to early july i see runs around an hour, though at that point it was a mix of osuosl and linaro resources running them so harder to say15:32
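
For a quick before/after comparison like the one above, a small sketch against the same Zuul builds API, averaging the duration field; the job name and cutoff date are just examples, and end_time is assumed to be a plain ISO timestamp:

```python
# Sketch: compare average run time of a job before/after a given date using
# the Zuul builds API's "duration" (seconds) and "end_time" fields.
from datetime import datetime
from statistics import mean
import requests

ZUUL_API = "https://zuul.opendev.org/api/tenant/openstack/builds"
CUTOFF = datetime(2024, 8, 26)  # example cutoff, e.g. before/after a hardware change

builds = requests.get(
    ZUUL_API,
    params={
        "job_name": "kolla-ansible-debian-aarch64",
        "result": "SUCCESS",
        "limit": 200,
    },
    timeout=30,
).json()

before, after = [], []
for b in builds:
    if not b.get("duration") or not b.get("end_time"):
        continue
    ended = datetime.fromisoformat(b["end_time"])
    (after if ended >= CUTOFF else before).append(b["duration"] / 60.0)

for label, sample in (("before", before), ("after", after)):
    if sample:
        print(f"{label}: {len(sample)} builds, avg {mean(sample):.1f} min")
```
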
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Fix the upload-logs-s3 test playbook  https://review.opendev.org/c/zuul/zuul-jobs/+/92760015:32
fungiinfra-root: related, the build times graph corvus produced a screenshot from on monday could be added to production if someone approves https://review.opendev.org/89976715:33
Ramereth[m]fungi: okay so roughly similar to before. I replaced a drive that was failing with a newer drive that has better performance on the ceph cluster. Hopefully the RAM upgrade later today will help. Plus once I get the SSD pool added 15:34
fungithat change is now 10 months old, and very useful stuff15:34
fungiRamereth[m]: sounds awesome! thanks for the work and resources15:34
Ramereth[m]Sigh, looks like it won't be coming today. Just checked tracking: "A railroad mechanical failure has delayed delivery." Won't be until early next week.15:49
Ramereth[m]Guess that's what I get for using free shipping 😆15:49
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: WIP: 4k Block device support  https://review.opendev.org/c/openstack/diskimage-builder/+/92755015:54
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Debug upload-logs-s3 failure  https://review.opendev.org/c/zuul/zuul-jobs/+/92761315:58
fricklerah, now I know where corvus got the graphs from the other day ;)16:03
Clark[m]SvenKieske have you considered switching to upstream container images for tools that offer them? Then you aren't spending extra time installing those packages and building extra unnecessary images and you get the benefit of the container images cache16:14
Clark[m]MariaDB, grafana, and rabbitmq at least provide such images I think16:15
SvenKieskeClark[m]: that would be a rather enormous change for kolla, and also introduce limitations, e.g. we build arm arch images, some upstreams don't. and if you look into some upstream containers they are not really usable in an openstack production environment16:16
Clark[m]Right but you don't need arch images you just need an official arm MariaDB image16:17
SvenKieskethere's actually a lot of ops knowledge and also security hardening taking place in kolla image builds. sometimes it's reversed of course and we are behind upstream advancements..you win some and lose some, I guess.16:17
Clark[m]We very successfully use the upstream MariaDB images. I can't speak to the others16:19
Clark[m]Essentially what you've done is exploded the matrix of things that need to be cared for into a very large problem space. Personally I would look at shrinking the problem space before trying to optimize for every little thing that can fail16:20
SvenKieskesure, we already shrunk the support matrix a lot, and in general I agree16:22
SvenKieskebut this all breaks down if you look at support cycles, which is already a massive challenge with SLURP, e.g. 2023.1 has rabbitmq 3.11 which is EOL upstream, no container image to be had there.. so we just now build a mechanism to upgrade rmq one minor version at a time for SLURP16:23
SvenKieskethe thing is, if you want to deploy a containerized cloud, imho you need to be able to rebuild all containers, this is imho not possible if you rely on upstream images.16:24
cardoeCache the image and use your cache?16:25
SvenKieskeeven if "only" to be able to rollback to the last known good version and maybe backport a customized patch. we only support/encourage users building their own containers anyway16:25
Clark[m]They publish the docker files for them but ya that doesn't guarantee the things the docker file relies on will be available forever. But neither does doing it the way Kolla does16:25
Clark[m]But even then why not have a single MariaDB Kolla image?16:26
SvenKieskecardoe: caches have the bad habit of expiring, if you can't rebuild your docker container bit-by-bit compatible you're doing containers wrong - and yes I know that means > 90% of people do it wrong.16:26
cardoeSo I'll also assume that kolla and loci don't share anything?16:27
SvenKieskeyou either need to store your artifacts somewhere locally (not in a cache, use a proper registry), or you need to be able to rebuild from source.16:27
SvenKieskewhat is loci?16:27
cardoehttps://opendev.org/openstack/loci16:27
cardoeused by https://opendev.org/openstack/openstack-helm-images16:28
SvenKieskecardoe: no, never used that, didn't even know it exists.16:28
cardoeTL;DR another OpenStack project that builds its own rabbitmq, mariadb, etc images.16:28
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: WIP: 4k Block device support  https://review.opendev.org/c/openstack/diskimage-builder/+/92755016:29
cardoeBut you look at the operators using OpenStack Helm, they're all using the mariadb-operator, the rabbitmq-operator16:29
SvenKieskecardoe: rabbitmq is only mentioned in bindeps.txt there, I guess they do the installation of this external stuff via helm then?16:30
cardoehttps://opendev.org/openstack/openstack-helm-infra16:31
Clark[m]Anyway I'm just hoping we can also look at reducing the need for a bunch of one-off mirrors of content that will inevitably go stale and leave me stuck dealing with the cleanup in a few years.16:31
cardoeAgreed with Clark[m] 16:31
Clark[m]As a side note, we don't archive things, we mirror them, so when the upstream of a mirror cleans up, we clean up too16:31
SvenKieskeso they are - in fact - not rebuilding their own rabbitmq, but using the k8s operator pattern for that, which is fine I guess. this is still totally orthogonal to kolla.16:31
cardoeNo. OSH builds its own16:32
Clark[m]For example when centos 8 stream deleted its mirror content we synced that update16:32
cardoeThe downstream operators like Vexxhost and Rackspace use the k8s operator.16:32
cardoeI also just confirmed that Rackspace's kolla usage doesn't use kolla's rabbitmq or mariadb containers.16:36
SvenKiesketoo bad, I guess ¯\_(ツ)_/¯16:37
cardoeEssentially I'm going back to Clark[m]'s comment about shrinking the problem space.16:37
cardoeCan you upstream your improvements to upstream's containers?16:38
SvenKieskewell, I currently have a tiny subtask to improve our CI by making sure packages can be pulled. You're essentially telling me "hey, why don't you throw out your complete architecture and do everything totally differently instead?" at least that's what it sounds like to me..16:39
SvenKieskeso that doesn't sound really reasonable to me, sorry. maybe you need to explain it in a different way, maybe you know kolla better and can show how you can just use a different base image without killing dozens of use cases and without introducing cycle-long migration paths?16:40
SvenKieskeI mean I didn't come up with the original idea of kolla and I personally am not emotionally attached to any technology, so if there's a better backwards compatible way to do things, I'm all ears. I just don't see how this would work.16:41
fungimy main concern is less testability, and more that openstack (via deployment projects) is publishing its own images of rabbitmq, mariadb, and so on. the tc has a very clear statement about such projects needing to make sure users are clearly warned not to use the upstream built images in production, but it seems like that message isn't being heard and a lot of users are doing that anyway16:42
fungibut to be fair, i and others pushed back against kolla's design choices from the beginning, it's more that the people who made those choices were determined to do things their way and had priorities other than e.g. security and vulnerability handling16:43
SvenKieskefungi: well on that point: I don't really get that tbh. If you want to have an open build process (which I like), people will always be able to pull some artifacts. and we will always rely on third-party OSS components. and I don't see a difference between people pulling a rebuild of e.g. rabbitmq from here or pulling rabbitmq from upstream. what's the point there?16:44
fungithe tc members at the time (myself included) couldn't do much more than ask them to take those concerns into consideration16:44
SvenKieskeguess I need to reread that tc decision again, but I just read it some weeks ago..16:45
fungiSvenKieske: do kolla's rabbitmq builds track rabbitmq security advisories, get patched in a timely fashion and warn users to upgrade it?16:45
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Fix the upload-logs-s3 test playbook  https://review.opendev.org/c/zuul/zuul-jobs/+/92760016:46
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Debug upload-logs-s3 failure  https://review.opendev.org/c/zuul/zuul-jobs/+/92761316:46
cardoeI'm not asking you to throw anything away. I'm just saying you only have so many hours in a day. If you spread yourself across everything it's hard. It's the same argument that OpenStack has with "why aren't downstreams contributing". Cause they disagree with something and fork/build their own.16:46
fungiit was those sorts of yolo approaches made by deployment projects which prompted the vmt to make it very clear we don't have the bandwidth to also track and warn users about vulnerabilities in software that isn't produced by openstack but merely being repackaged and embedded16:47
cardoeIt's something I'm pushing within my org. Contribute back and use upstream unless absolutely impossible.16:47
SvenKieskecardoe: that's a universal truth, so hard to argue about. you have maxed your available time if you managed to delegate all the work ;)16:47
cardoeThis is one of the reasons why I'm running for the TC. I would like to see OpenStack be the best it can be at OpenStack. Not mediocre at OpenStack and mediocre at upstream jobs.16:49
fungibasically, kolla is the reason for requirement #2 at https://security.openstack.org/repos-overseen.html#requirements16:50
cardoeSo to answer fungi's question about security updates of rabbitmq packaging: our internal user says that kolla's position was that downstreams had to package their own to keep up with security. So they opted to just use upstream containers where possible.16:51
cardoeIf that's still true today, he does not know.16:52
SvenKieskenow I'm curious, does anyone track vulns in e.g. nova requirements? when I look at the tarball, the language at least doesn't inspire confidence: "Requirements lower bounds listed here are our best effort to keep them up to date but we do not test them so no guarantee of having them all correct."16:52
fungiwhich isn't just kolla's position, but openstack as a whole where such "binary artifacts" are concerned: https://governance.openstack.org/tc/resolutions/20170530-binary-artifacts.html#guidelines16:53
fungiSvenKieske: nova doesn't ship copies of its requirements16:53
SvenKieskethey ship instructions to install software, which are usually honored during installation from the tarball. I see little difference from a security pov between "I ship vuln $foo" and "I tell you to install vuln $bar".16:54
SvenKieskeif your requirements are listing vulnerable packages I would consider that a security issue.16:54
SvenKieskeand it seems I'm not alone in that, there's a dedicated CWE just for that: https://cwe.mitre.org/data/definitions/1395.html16:55
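
As an illustration of what tracking vulnerabilities in pinned requirements could look like, a hedged sketch that checks simple name==version pins against the OSV database (https://osv.dev); the requirements file path is a placeholder and this is not an existing OpenStack tool:

```python
# Sketch: look up pinned PyPI requirements in the OSV vulnerability database.
# Only plain "name==version" lines are handled; the file path is a placeholder.
import requests

OSV_QUERY = "https://api.osv.dev/v1/query"

def pinned_requirements(path):
    """Yield (name, version) for simple name==version lines."""
    with open(path) as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()  # drop comments
            if "==" in line:
                name, version = line.split("==", 1)
                yield name.strip(), version.strip()

def known_vulns(name, version):
    """Return the list of OSV advisories for a pinned PyPI package."""
    resp = requests.post(
        OSV_QUERY,
        json={"package": {"name": name, "ecosystem": "PyPI"}, "version": version},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("vulns", [])

for name, version in pinned_requirements("requirements.txt"):
    vulns = known_vulns(name, version)
    if vulns:
        ids = ", ".join(v["id"] for v in vulns)
        print(f"{name}=={version}: {ids}")
```
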
fungiSvenKieske: i think projects could do a better job of clarifying that versions pinned in constraints/requirements lists are there for testing purposes and are intentionally not increased in order to track upstream vulnerabilities in those requirements. this comes up frequently and i continue to think adding a disclaimer in those files to that effect is a good idea16:56
SvenKieskefungi: from an abstract standpoint we just do the same then. you can s/requirements.txt/Dockerfile/ . both are instructions to install a package/software. both can be outdated.16:57
fungistable branch constraints are pinned hard, and people do mistakenly think that means they should install those versions of dependencies but the idea is that they shouldn't be pip installing packages in production, they should use curated downstream distributions who backport security fixes to contemporary versions of those dependencies instead16:57
SvenKieskeok we happen to have a public image cache somewhere in a registry for practical purposes, that's s/dockerhub|quay.io/pypi.org/ then16:57
fungiyes, "We have also published developer-focused Python package formats such as sdists and wheels to pypi.python.org because those formats are useful in our own CI systems, and because it seemed clear, if only implicitly, that consumers of the packages would understand they are not supported for production use."16:58
cardoedocker.io/openstackhelm is pushed by Zuul from loci and osh images.16:58
fungi(from the tc resolution i linked above)16:59
SvenKieskethat seems very detached from reality, sorry. "it seems clear that these consumers will understand that the packages on pypi.org are unsafe, but it's also clear that the very same consumers will deem docker images somehow safe"..what?17:01
fungii think the underlying problem at this point is the assumption that it's implicitly clear17:01
SvenKieskeyeah17:01
SvenKieskeneither the dockerhub stuff nor the pypi.org stuff have any disclaimer on them, e.g. https://hub.docker.com/r/openstackhelm/mariadb has over 100k pulls, surely all for testing17:02
fungiideally the tc will ask projects to make it explicitly clear instead, by documenting that in places users will see it17:02
SvenKieskeI mean the first thing I do on the ML and IRC when users ask something is make sure they know they should build their own images. but frequently they don't.17:02
Clark[m]For a long time we didn't publish the applications to PyPI fwiw 17:02
fungiyeah, up until about 2014/2015 we only published libraries on pypi17:03
fungimainly because we were worried users would try to pip install nova17:03
SvenKieskeI think that might be a saner approach: no publishing, maybe only to some internal CI mirror. problem still is devs want to pull this stuff too, for testing etc.17:03
SvenKieskefor most dev setups pip install $foo is fine, for prod, well, yeah. The thing is, I actually agree with your stance. But I think it's inconsistently enforced/viewed across different openstack projects17:05
fungii wholeheartedly agree17:05
SvenKieskepart of the problem will be that this stuff changes over time, and the implications and what users do change over time too, and suddenly old beliefs are no longer true :/17:06
SvenKieskeanyway I should long be gone, I need to do my taxes.. :D17:07
SvenKieskethanks for the discussion!17:07
fungii end up in similar arguments with openstack-ansible folks who think it's wise to have a deployment project pip install packages from a pinned stable constraints list17:07
fungihave a good evening!17:07
Clark[m]And for the record I'm not demanding anything be done I just want people to consider potential alternatives or additional steps that might simplify things and improve reliability in other ways17:07
SvenKieskewell I know why I don't work full time in IT sec anymore. you always have the paradox of: "install only a defined set of versions" vs "update everything asap to get critical updates" then stuff breaks (either security, or features)17:08
Clark[m]Oftentimes when there is a clear failure, like a package install failing due to mirror issues, it's easy to immediately jump to "make the mirror reliable"17:08
Clark[m]And sometimes we need to look at the bigger picture and determine if there are other options 17:08
SvenKieskeClark: I already tried the "make the mirror reliable" part :D one result was a discontinued pypi mirror last year, because it was too unreliable, not exactly the desired outcome17:09
fungimy unhelpful answer to people who say "but that would require a ground-up redesign of our project!" is "yes, it would, since the current design is problematic"17:09
SvenKieske:D I always like that fungi. that's also often my kind of reply when I hear "but that would be a lot of work" and I think "well yes, but that's no reason in itself to not do it?" :D17:10
SvenKieskebut I won't redesign kolla, at least not this weekend ;)17:10
fungiSvenKieske: it's not a paradox, it's what linux distros and other people who manage operating systems are there for. to say "meh, we don't need them" is to ignore the critical role their work plays, and then you get to rediscover it yourself as you become responsible for solving the exact same challenges17:10
SvenKieskefungi: I know, if you run a datacenter/webhost/whatever you soon discover that you either support what the distro you are using supports, or you end up being your own distro with your own security/upgrade process. did that with gentoo, debian, ubuntu, centos..17:12
SvenKieskenobody wants to hear something like that ;)17:12
fungii think the container image ecosystem is slowly coming around to realize that if you want to repackage everything yourself, you have to solve the same problems as the existing curated distros you think you're getting rid of17:12
cardoeI do think it would be great if you could pip install nova. If we published containers that had good SBOMs which could be audited. It would reduce the barrier to entry.17:12
SvenKieskethe problem with most upstream containers is that they are quite often also just a thin docker wrapper around horrible bash scripts :D (this might have gotten better in the last few years)17:13
funginah, they're still pretty much that, yep17:13
fungidockerfile calling a shell script to ./configure;make;make install17:14
SvenKieskelol, okay, hey I've _seen_ some who are actually a bit better :) but yeah, you get the point.17:16
cardoeI mean all software everywhere sucks. The whole "ship fast, break stuff" mentality has reduced the overall quality.17:17
cardoeI just hope enough of our software engineering mistakes are recorded so that the next group after our next dark ages does better. But they probably won't.17:18
SvenKieskenot sure if the quality is reduced, but I tend to think we still live in the "middle/dark age of information technology" like 1050 a.d. or something. future generations will hopefully laugh at our weird it problems17:19
fungiwell, also hubris/overconfidence leading new projects to assume that the ways existing projects do things are unnecessary, so they go on to make all the same mistakes rather than learning from those who came before17:19
SvenKieskethat's just humans, you have to be overconfident to try something new. 1 out of 100 times it works ;)17:19
SvenKieskemade up numbers17:20
SvenKieskenow I really gotta leave and got to do those lovely taxes :D17:20
fungigood luck!17:20
SvenKieskehave a nice, calm weekend, thanks!17:20
fungiyou too!17:21
fungibut yeah, at the start of openstack, people from other, older open source projects were brought in to help establish the culture, community norms, and approaches to problems they knew we were going to run into. that's in part why we do a lot of things that seem old and outmoded to newcomers, but it gave us a means of escaping much of the trap of assuming we could just figure all this out ourselves17:22
fungilater17:22
cardoeUgh. You wanna do my taxes too? I gotta finish mine.17:22
opendevreviewTristan Cacqueray proposed zuul/zuul-jobs master: Update the set-zuul-log-path-fact scheme to prevent huge url  https://review.opendev.org/c/zuul/zuul-jobs/+/92758217:33
frickleris it tax day everywhere? says /me reading backlog while taking a break from doing taxes :D18:27
funginot in the usa, at least not general citizen taxes18:28
fungiwe do our personal income taxes in april18:29
fungi(unless you've requested and been granted a filing extension anyway)18:29
opendevreviewJulia Kreger proposed openstack/diskimage-builder master: 4k Block device support  https://review.opendev.org/c/openstack/diskimage-builder/+/92755020:31
mordredfungi: are you calling me old now?22:39
fungimordred: we're all old23:42
* fungi gets back to shaking his cane at the children who keep appearing on the lawn23:43
fungioldmanyellsatcloud.gif23:44
