Friday, 2021-11-19

ianwfungi: i'm probably just sensitive because i've been using a self-compiled git lately :)00:08
fungithat would certainly make me sensitive00:14
fungii agree about just matching on the first two or three version components thpugh00:14
Clark[m]Ya I can update the change tomorrow to take the first 3 tuple and emit a warning if we don't determine a version01:02
*** rlandy|ruck is now known as rlandy|out01:52
opendevreviewMerged opendev/system-config master: infra-prod: remove master override steps
*** pojadhav|afk is now known as pojadhav04:20
*** ysandeep|out is now known as ysandeep05:14
*** akahat|rover is now known as akahat|lunch08:36
*** jpena|off is now known as jpena08:37
*** akahat|lunch is now known as akahat|rover09:59
*** ysandeep is now known as ysandeep|afk11:04
*** rlandy|out is now known as rlandy|ruck11:11
*** ysandeep|afk is now known as ysandeep12:22
dtantsurfungi: to follow-up on the yesterday's cross-project tests discussion: I've come up with and it seems to work as I wanted, even when run on the library.14:08
fungidtantsur: awesome!14:12
fungiseems fairlt straightforward too14:12
fungier, fairly14:12
dtantsuryeah, zuul for the win :)14:15
fungidtantsur: fwiw, i see other projects omit ansible_user_dir for that:
dtantsurI see. Yeah, I copy-pasted this bit from somewhere in zuul-jobs14:17
mgariepyClark[m], you can release the vm : openstack-ansible-deploy-infra_lxc-ubuntu-focal for root@
fungimgariepy: Clark[m]: i've deleted that autohold. thanks!14:59
mgariepythanks a lot fungi 15:00
fungimy pleasure15:00
*** rlandy|ruck is now known as rlandy|ruck|biab15:08
*** ysandeep is now known as ysandeep|out15:13
clarkbfungi: I think we should be good to try next for gerritbot user updates16:19
fungioh, yep, approved. looks virtually identical, i forgot that was separate16:20
*** rlandy|ruck|biab is now known as rlandy|ruck16:22
opendevreviewClark Boylan proposed opendev/git-review master: Fix use of removed --preserve-merges option
clarkbfungi: ianw ^ I think that addresses the latest comments on that change16:26
opendevreviewClark Boylan proposed opendev/lodgeit master: Update docker image to bullseye and python 3.8
clarkbfungi: ^ re bullseye we basically need a bunch of changes like that. For the most part the updates haven't been too bad. Only zuul executors ran into socat behvior changes and nodepool builders had problems with container stuff? Python apps that don't interact with the system much should be easy (like lodgeit)16:39
fungiyeah, on the whole i expect there would be no real functional difference16:39
fungioccasionally we'll hit things around changes in command-line options16:40
opendevreviewMerged openstack/project-config master: Retire puppet-senlin - Step 3: Remove Project
outbritoG'day folks! Do you happen to know why zuul is not merging and I'm seeing the "submit" button on this change?
outbritoI see it disabled though16:40
fungiclarkb: why did we not need to explicitly declare the older python-builder image though?16:41
clarkbfungi: I think that waas a bug16:41
clarkboutbrito: you should never get a working submit button in gerrit. I think it may be showing you the button because it is submittable but you don't have permissions to do so (only zuul should have permissions for that)16:42
clarkbwhich means we need to figure out why zuul isn't doing that or isn't able to16:42
fungioutbrito: it looks like that change is based on an outdated parent16:43
fungiits git parent is 816259,2 but someone revised 816259 without rebasing 817140 so now it can't merge16:44
fungiif you rebase 817140 onto the master branch at this point it should work16:44
fungi816259,3 is what ended up merging16:44
clarkbI guess gerrit 3.3 stopped showing you an orange warning for that16:44
fungii think that's what it's trying to signal by putting the (Merged) next to the parent is in red16:45
funginormally it would be grey/black16:45
clarkbah yup shows the dark grey color for its merged parent16:46
funginot a good ui choice for accessibility16:46
fungieven something common like red/green color-blindness would make that virtually impossible to notice16:46
mgariepyclarkb, fungi would it be possible to have an auto-hold on vms that timeout for one role we have ?16:46
clarkbmgariepy: we can only filter by project change or job. Not role if that is what you are asking16:47
*** marios is now known as marios|out16:47
mgariepyso you cloud filter on timeout on a specific patch ?16:48
fungiif "role" here means a particular git repository, that's doable16:48
clarkbbut it has to be the project that triggered the job not the timeout if that makes sense16:49
johnsomHmm, zuul status is giving me "Something went wrong", is there a restart going on?16:49
opendevreviewMerged opendev/system-config master: Run matrix-gerritbot with gerritbot user
clarkbI too get the something went wrong16:49
clarkband now it is back16:49
mgariepyok i'll think about it and see what we should do.16:50
clarkbjohnsom: its probably a bug in zuul-web dealing with updating configs/layouts16:50
clarkbjohnsom: the service itself seems to be fine though.16:50
johnsomOr a mis-configured health monitor on the LB pool?16:50
clarkbjohnsom: it isn't an LB pool16:50
johnsomWell, there is your problem. GRIN16:50
clarkbeverything is active active active active and needs to deal with locks and such properly16:51
clarkband currently your webbrowser talks to a single web frontend16:51
clarkbBasically I think it is a bug but only in rendering the info to the end user. The actual zuul processing in the background seems to be happy. And if you wait 30 seconds it resolves itself16:52
johnsomYeah, it looks like the job I was looking for started even though I couldn't see it16:52
fungithe zuul-web logs have a bunch of deserialization exceptions16:53
fungijson.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)16:54
clarkbya so it probably read data before it was properly written. I'm guess we need to add a lock to something to avoid that16:54
fungialso kazoo.exceptions.NoNodeError16:54
clarkb*I guess16:54
fungior just avoid invalidating the cache until it gets a good read16:55
fungii also see some KeyError: 'change_queues'16:56
fungiall of these are potentially the same underlying cause though16:56
fungithe KeyError: 'change_queues' usually follows a kazoo.exceptions.NoNodeError though sometimes i see kazoo.exceptions.NoNodeError without the subsequent KeyError16:57
clarkbI think change_queues is a db record that is kept separate for performance reasons. It wouldn't surprise me if we aren't handling its specialness properly in zuul web16:58
outbritofungi, will try, tks17:08
opendevreviewClark Boylan proposed opendev/system-config master: Switch lodgeit to run under a dedicated user
clarkbfungi: ^ that just reported in matrix testing channel. Lets confirm the matrix gerritbot restarted17:12
clarkbit hasn't updated its docker compose yet or restarted17:14
clarkbI guess it did gerritbot first and I need to be patient :)17:14
clarkboh I see, the deploy job that was running was for a project-config update and since we dno't update to master in deploy we weren't running with latest there17:14
clarkbwe restarted gerritbot for a new project add? But the currently running job is for the matrix-gerritbot update so we should see that update shortly17:15
fungioh, yeah that explains it17:15
clarkbok matrix updte failed beacuse we actually rewrite the config with an ansible task and that couldn't overwrite the existing file bceause it is 644 by root17:21
clarkbI'm going to manually chown the file then the next hourly run should sort us out17:21
clarkbthats done. I need to find breakfast but will check in on this again later (and the hourly runs should update it I think)17:23
fungioh, and the ansible task runs after?17:26
clarkbya this is a pre step to make a config for the bot17:29
clarkbsince the bot takes a dhall config but we want ot maintain yaml configs for humans17:29
*** jpena is now known as jpena|off17:41
opendevreviewGhanshyam proposed opendev/irc-meetings master: Remove Technical committee office hours
clarkbWarning: Could not get or create the default cache directory: <- matrix-gerritbot is unhappy17:51
clarkbunfortunately that string doesn't seem to show up in the matrix-gerritbot source so I'm not sure what the default cache directory is17:55
clarkbtristanC: ^ do you know what the cache directory is?17:57
clarkbmy hunch is that HOME=/root here is the problem18:03
clarkband its trying to write to $HOME/.config or some such18:03
clarkbI'll push up a partial revert for now18:03
fungicould we override $HOME when starting the container?18:04
clarkbfungi: yes, we can, but I don't know to what value. I think we should consider only consuming docker images that are built with standard tools. The nix stuff is hard to process18:05
clarkb(and build our own matrix-gerritbot image if that is necessary)18:05
fungiahh, yeah18:05
clarkbfungi: we could try /tmp maybe18:05
fungiusing docker to build an image ourselves is probably the most straightforward18:05
clarkbfungi: do you want me to try overriding to /tmp by hand before we try the partial revert?18:06
fungii suppose it can't hurt18:06
clarkbsame error18:07
fungihuh, in further exercising the new pep-517 pbr version, i see newer interpreters are complaining that pbr doesn't explicitly close manifests it reads18:07
clarkbI've manually done the partial revert (just commented out the user directive in docker-compose.yaml) and it seems happier. I'll push the revert up which should confirm it is happier too18:09
opendevreviewClark Boylan proposed opendev/system-config master: Partial revert of matrix-gerritbot user change
clarkbthat showed up in the testing channel in matrix I think the revert is sufficient18:12
clarkber I mean the partial revert18:12
clarkbfungi: I'll happily review/write PBR updates to use with open(foo) as bar: context managers or similar to fix that18:13
clarkbjust point me at the location18:13
fungiyeah, once i have a handle on where it's complaining about, i'll push some up18:13
clarkbtristanC: to summarize matrix-gerritbot can't get or create the default cache location when we override the user. It doesn't tell us what the location is. If you can help us understand that better we'd appreciate it18:13
fungii think it's just find_sources and get_version from pbr.packaging where i'm running into it, but i'll try to make sure whatever i put together is comprehensive since i'm sure i'm not exercising every last code path in pbr here18:15
clarkbya I guess we can grep for open() and then update all occurenaces easily enough18:15
fungiconfirmed adding {toxinidir} to deps on all testenvs where i was previously relying on usedevelop is still working too18:22
fungii always worry that just . won't work as expected if tox is called from somewhere which isn't the root of the repo18:23
clarkboh good point18:24
clarkbI'll update my example bindep change to do that18:24
fungii added a comment on the bindep change just so we don't forget18:24
opendevreviewClark Boylan proposed opendev/bindep master: Try out PBR pep 517 support
clarkbthat should do it18:26
fungiwell, good news, the vast majority of open() calls in pbr are already using context managers, and most of the remainder are in tests. i only see one other obvious case besides the two i was hitting in packaging.py18:26
fungioh, in fact i misread, the other one i thought i saw was a subprocess.Popen() it just matched my naive grep for open(18:29
fungione of the two i hit is easy enough to fix, the other will be tricky since it's an open() inside a try/except18:32
clarkbfungi: you should be able to use with there or a finally?18:32
clarkb(then guard againstalready closed fd)18:32
fungiyeah, i can also explicitly close it in another try i guess18:32
fungimaybe i can just combine these two try/except blocks?
fungithen i can do a with inside the try18:35
fungiand catch (IOError, OSError, email.errors.MessageError)18:35
clarkbfungi: I would try: with open(filename, 'r') as pkg_metadata: and then catch whatever needs catching from that18:37
fungisomething like
clarkbits always a continue18:37
fungiyeah, i suppose i could nest the try blocks as an alternative18:38
clarkbno you did what I was thinking about18:38
fungiyeah, this seems more concise18:38
fungishould i fix up the tests to not leave open descriptors too, or will anyone likely care?18:39
clarkbI've figured out the rough area of matrix-gerritbot that is hainv problems. It is after we connect to matrix and validate our session18:39
clarkbI think it is the gerrit connection that is breaking because we also log after gerrit connects that matrix is ready18:40
clarkbhowever haskell does delayed execution so this might be flawed analysis18:42
clarkbaha I was reading it wrong. I think it is in the joinroom area of the code beacuse we validate session then join room and we don't get room join logs18:49
fungiclarkb: aargh, moving target!
fungi"Back out deprecation of setup_requires and replace instead by a deprecation of setuptools.installer and fetch_build_egg. Now setup_requires is still supported when installed as part of a PEP 517 build, but is deprecated when an unsatisfied requirement is encountered."19:00
clarkbI guess that means you still have to specify it in pyproject.toml so that things line up19:01
clarkbgood to know, I think we support that just fine19:01
clarkbI'm having a really hard time finding anything that would need a cache directory in matrix gerrit bot so far.19:01
clarkbIt uses in memory "databases" to store things like room info. It forks `ssh` directly19:02
clarkbIt might be the matrix library?19:02
fungithat certainly seems possible19:04
clarkbI appreciate that python tends to do a better job of identifying the origins of log messages19:05
opendevreviewMerged opendev/system-config master: Partial revert of matrix-gerritbot user change
clarkbtristanC: maybe add -prof to the cabal compile options then we can run the executable with -xc for problems like this? I'm not sure what the -prof impact at runtime is but I imagine its small if you have to explicitly set -xc on the executable to get that info back?19:19
clarkbI feel like I'm reaching the end of my ability to debug this as I don't intend on pulling in nix to build this image.19:19
clarkbit just occured to me that I could lsof the running process to see what cache it might be opening19:25
clarkbI'll do that19:25
fungioh, yep now that it's running19:26
fungiassuming it holds an open descriptor to its cache anyway19:26
fungi(it may not)19:26
fungibut worth a shot19:27
clarkbya I'm not seeing anything that could be the cache19:27
fungiso it probably only opens things there on demand19:27
clarkbya its got fd 0 on /dev/null 1 and 2 on pipes some event loop kernel fds and then tcp sockets19:29
clarkbnothing looks like an on disk caceh19:29
clarkbI could probably strace it and hope to filter out the noise somehow to find reads/writes to a cache19:30
clarkbbut that seems iffy19:30
clarkbThis has led me to suspecting it might be the prometheus health endpoint that is doing caching as that runs a webserver19:35
tristanCclarkb: catching up, let me see if i can reproduce locally19:46
clarkbtristanC: I put the exact output in the comments of
clarkbmy best guess at this point is that it is related to the web server for prometheus metrics. Otherwise I'm not really finding anything that might be trying to cache stuff. But I'm also not finding anything indicating the web server there is caching (based on lsof and my hitting it manually)19:48
tristanCclarkb: i see thanks. So the error couldn't be printed because of a missing utf-8 locale, and it would have showed `$HOME/.cache/dhall`19:48
clarkbtristanC: ok we overrode $HOME to be /tmp and that didn't help19:49
clarkbI would expect that /tmp would be writable by all users on the image but I'm probably making a bad assumption because nix19:49
clarkbtristanC: but also that string isn't utf8 it is ascii? shouldn't putchar be fine with ascii?19:50
tristanCclarkb: the error message is using utf-8 character19:50
clarkbI guess that comes after what I got since it failed19:52
clarkband is in addition to `$HOME/.cache/dhall`19:52
tristanCclarkb: ftr it is
clarkbit is interesting that google hasn't indexed that string19:55
clarkb(I tried googling it several different ways before giving up, probably too much source code out there to index entirely)19:55
clarkbtristanC: any idea why setting $HOME to /tmp in the docker-compose.yaml didn't correct this?19:57
clarkb(we assumed something might be trying to write to $HOME/.cache which is why we tried taht019:57
tristanCclarkb: there is no /tmp in the image19:58
clarkbof course not20:00
clarkbtristanC: would it be crazy to suggest that using slightly bulkier images that are possible to debug and build locally using normal tools is a good idea?20:00
clarkbI appreciate the nix image is super minimal but that makes it very difficult to debug and it uses very specialized tools to do something that doesn't necessarily benefit from that20:01
clarkbthe image also sets a bash prompt but bash isn't even installed20:01
clarkbwe should be able to `cabal build` on something like debian right?20:02
opendevreviewGhanshyam proposed openstack/project-config master: Retire training-labs: remove project infra
clarkbI think OpenDev should probably consider doing that at least.20:03
clarkbThen we won't have to worry about /tmp or bash or utf820:04
opendevreviewGhanshyam proposed openstack/project-config master: Retire training-labs: remove project infra
clarkbI guess the solution here would be to mount something to /root/.cache with appropriate permissions?20:07
clarkbassuming no changes to the image20:07
clarkbHrm but /root is likely to not be o+x20:07
clarkbthat probably won't work either20:08
opendevreviewClark Boylan proposed opendev/system-config master: Give matrix-gerritbot a writeable cache
clarkbthat seems really hacky but might work?20:16
opendevreviewGhanshyam proposed openstack/project-config master: Remove 'publish-training-labs-scripts' definition
tristanCclarkb: i can reproduce the error and i'll provide a fix in the image. Are the host file sharing the same uid as the container one right?20:18
clarkbtristanC: they are. I'm not sure what we need you to hardcode the uid in the image, we just need to be able to have enough of a normal filesystem that we can bind mount appropriately20:19
opendevreviewGhanshyam proposed openstack/project-config master: Remove 'publish-training-labs-scripts' definition
clarkbThe underlying problem here seems to be we've overly minimized (lack of utf8 locale and lack of expect filesytem locations)20:19
clarkband that is made worse by having a bunch of non standard tools. I am really surprised that dhall needs to write to disk20:20
clarkber * Not sure that we need you to hardcode the uid in the image. s/what/that/20:20
tristanCclarkb: i meant with podman, when using the `--user $(id) --volume $HOME/.ssh:/root/.ssh` then the .ssh directory in the container is still owned by root20:22
clarkbtristanC: yes because it is under /root20:22
clarkbthat was my point above we can't mount to /root/.cache. But we could mount to /tmp20:23
clarkbor just use /tmp and not cache since it is all epehermal anyway20:23
tristanCclarkb: i meant without `--userns keep-id`, which i assume is what docker set by default20:23
tristanCclarkb: you can mount in /root, the folder actually doesn't exist20:23
clarkbtristanC: is dhall creating it in that case?20:23
clarkbwell no it can't be because then it wouldn't error20:24
clarkbwe error because that directory is not readable20:24
tristanCi think it happens when the runtime is creating the parent directory20:24
clarkbbut the runtime would be the non root user and create it as itself if that were the case20:24
clarkbbut we have strong evidence that directory is not readable20:24
tristanCiirc, when bind mounting to /a/b, if /a doesn't exists it get created as root by default20:25
clarkbthe docker runtime not the haskell runtime20:25
clarkbtristanC: thinking out loud here: is there any way to tell dhall to not cache to disk?20:28
clarkbsince this is a container caching to disk is as ephemeral as the process so caching to memory seems fine20:28
fungiso with latest pbr i still see setuptools complain about calls to install, originating from pbr.util.setup_cfg_to_setup_kwargs here:
fungithis is the full traceback:
fungiclarkb: does that make any sense to you?21:08
Clark[m]We might need to avoid instantiating the class to get around the warning? That seems odd though. Does bindep do that?21:15
Clark[m]Might be config specific if not? I'm popping out for a bike ride now but can look closer after21:15
fungii'll try to reproduce with bindep in a bit21:17
fungibut yeah it seems to happen when pip calls on setuptools to parse setup.cfg21:18
tristanCclarkb: in that case, dhall is just issuing a warning that it can't access the cache folder, and the unicode character makes the print fails. But I think the main issue is that the HOME directory is not writable when setting an arbitrary user.21:29
tristanCmoreover, when using openssh, the .ssh location is resolved through /etc/passwd with a default to `/.ssh`21:32
tristanCso i think i know how to slightly adjust the image so that it can works with arbitrary uid21:33
opendevreviewTristan Cacqueray proposed opendev/system-config master: Update the gerritbot-matrix image to support arbitrary uid
tristanCclarkb: i'm sorry this caused so much trouble and i hope 818645 should enable what you are trying to do.21:48
tristanCclarkb: and of course you can use `cabal build` to build the gerritbot-matrix binary, but i think the dockerfile will need a similar trick to support arbitrary uid21:58
clarkbtristanC: thanks. Its mostly that I question the utility of some of these decisiosn as they seem at odds with one another. The minimal image build doesn't seem to get along with dhall (and I guess openssh?)22:56
clarkband if not doing a minimal image build makes sense for the software then I question why use nix to build the image22:56
clarkband I don't think anything prevents us from running cabal in a Dockerfile?22:58
clarkbfungi: ok so the issue is that initialize_options is where setuptools raises the deprecation warning beacuse I guess that implies you're calling it on the command line? THat surprises me a little, but I think we work backward from that to figure out how to bypass it with pbr22:59
clarkbfungi: I think this is only an issue if using
clarkbfungi: I don't think bindep has this problem because it doesn't extend this way23:02
clarkbfungi: do you have a link to the repo you're hitting this with?23:02
clarkbbut basically cmdclass is deprecated aiui because you have to run to hit it ratherthan say build23:02
clarkbI think that means this is expected23:03
tristanCclarkb: i would say the benefit of nix container is two folds: it declares all the dependencies in a reproducable setting (think base image + bindep + requirements.txt), and sharable layers (each dependency is a single layer)23:33
clarkbtristanC: I think the second thing only really matters if you're doing a lot of nix containers right? For example in opendev's case this is our only nix container image so we get all the layers and no deduping for additional images23:34
clarkbBut you get the deduping using a consistent base image like opendev does anyway23:34
tristanCclarkb: it does matters even for a single image where update will only pulls missing layers23:35
clarkbthe strict control over all the deps is a neat feature of nix. I'm just not sure if gets us much here for a simple service like this. Cabal is capable of pinning things too right? then you're only dealing with the distro ghc and openssh23:35
fungiclarkb: it's not extending, and this was just the pip install tox was doing23:35
clarkbfungi: I think your setup.cfg sets a cmdclass value23:36
tristanCclarkb: and you can build gerritbot-matrix differently if you prefer, but you would need a similar trick for the home user dir so that it can work with arbitrary uid23:36
clarkbfungi: and cmdclass extends and pbr is trying to make that happen23:36
fungiclarkb: it's here:;a=blob;f=setup.cfg;h=1cbd5501ce8ceecf677085c4272c76468dacc015;hb=HEAD23:36
clarkbtristanC: yup I'm trying to work that through in my head. I'm beginning to think it might be a reaosnable thing for us to do for consistency with our images23:36
tristanCclarkb: having all the deps frozen is helpful to ensure the image can build in the far future23:36
clarkbtristanC: it also ensures that you're not getting security updates23:37
tristanCclarkb: right, so instead of updating a comment in a dockerfile to get a new build, you would update the repository commit instead23:38
clarkbtristanC: you'd also need to unpin things23:38
tristanCclarkb: here is an example dockerfile we use for another cabal base application:
clarkbbut I guess you can do that in the same commit23:38
clarkbtristanC: how did the image update for your change above? I don't see the updated flake.nix in the github repo. Maybe that is just a sync problem though23:40
tristanCclarkb: nix flake update is the command to update dependencies, and you can do a tree diff to see what exactly changes23:41
clarkbre layer splits for updates. I'm not sure there is a ton of value in that. Yes, you'll avoid some network traffic but again that really only matters if you are doing significant numbers of updates that represent large amounts of data23:43
clarkbIt is "neat" but I don't think it ie necessary when you update an image once a week or less23:43
clarkband only have a handful of images that share those layers23:43
clarkbour base debian images with python in them are like 200MB total23:44
clarkbIf we pull that once a week on a number of servers it isn't a big deal23:44
clarkbBasically I'm trying to optimize for simplicity and easy of use. Not for deploying massive amounts of software frequently to large datacenters. There are different needs.23:45
clarkbNix would probably do well if you had hundreds of releases a day hitting tens of thousands of nodes23:45
clarkband you'd accept the complexity and divergence from expected norms as those optimizations become important for you23:45
clarkbfungi: thats interesting because the pbr code is executing that path when you've set [global] commands if I'm reading it correctly23:49
clarkband translating that to cmdlcass23:49
clarkbfungi: what I did in the past was preinstall pbr and then told build to not use isolated build environments. Then I could instrument the pbr installation to sort out what was going on. Might need to do that here23:52
clarkbto see what sorts of values are being handled there to work backwards and figure it out23:52
clarkbI wonder if you're hitting it in a dependency?23:52
fungii doubt it's a dependency (the dependencies are listed there in the setup.cfg, passlib and pyyaml)23:54
fungibut yeah, first i'll try turning on warnings in bindep and see what i can reproduce with it23:55

Generated by 2.17.2 by Marius Gedminas - find it at!