*** openstack has joined #opendev | 07:30 | |
*** ChanServ sets mode: +o openstack | 07:30 | |
ianw | #status log rebooted eavesdrop01.openstack.org as it was not responding to network or console | 07:32 |
openstackstatus | ianw: finished logging | 07:32 |
*** dougsz has quit IRC | 07:36 | |
*** dougsz has joined #opendev | 07:36 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add borg-backup roles https://review.opendev.org/741366 | 07:37 |
*** tosky has joined #opendev | 07:37 | |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Pre-install python3 for CentOS https://review.opendev.org/741868 | 07:44 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [wip] Drop dib-python requirement from several elements https://review.opendev.org/741877 | 07:44 |
*** dtantsur|afk is now known as dtantsur | 07:55 | |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:01 | |
*** priteau has joined #opendev | 08:24 | |
ttx | hmm, jobs in the kata-containers tenant seem to have stopped running on Jul 17: https://zuul.opendev.org/t/kata-containers/builds?pipeline=SoB-check | 08:38 |
ttx | Anyone knows of any obvious culprit, before I dive deeper? | 08:38 |
ttx | seems to match the last reconfiguration/restart time | 08:42 |
openstackgerrit | Merged opendev/irc-meetings master: Change Neutron L3 Sub-team Meeting frequency https://review.opendev.org/741876 | 08:51 |
*** xiaolin has joined #opendev | 09:47 | |
openstackgerrit | Fabien Boucher proposed opendev/gear master: Bump crypto requirement to accomodate security standards https://review.opendev.org/742117 | 10:35 |
openstackgerrit | Daniel Bengtsson proposed openstack/diskimage-builder master: Update the tox minversion parameter. https://review.opendev.org/738754 | 11:26 |
*** sshnaidm|afk is now known as sshnaidm | 11:41 | |
*** iurygregory has quit IRC | 11:49 | |
*** tkajinam has quit IRC | 11:56 | |
*** iurygregory has joined #opendev | 11:59 | |
openstackgerrit | Fabien Boucher proposed opendev/gear master: use python3 as context for build-python-release https://review.opendev.org/742165 | 11:59 |
fungi | ttx: i can take a look in a few hours | 12:09 |
ttx | fungi: thx! | 12:12 |
ttx | I looked but there seems to be no trail at all | 12:12 |
openstackgerrit | Fabien Boucher proposed opendev/gear master: Bump crypto requirement to accomodate security standards https://review.opendev.org/742117 | 12:16 |
fungi | ttx: yeah, i expect i'll have to check the scheduler debug log to see if we're getting any events from github at all | 12:17 |
*** ryohayakawa has quit IRC | 12:25 | |
*** Eighth_Doctor is now known as Conan_Kudo | 12:59 | |
*** Conan_Kudo is now known as Eighth_Doctor | 12:59 | |
*** gema has joined #opendev | 13:01 | |
*** fressi has joined #opendev | 13:21 | |
*** fressi has quit IRC | 13:22 | |
clarkb | fungi: ttx github is deprecating some bits of its application webhooks but that isn't scheduled until october 1st so doubt that is it | 13:33 |
clarkb | seems like something to do with the zuul scheduler update | 13:33 |
clarkb | tobiash: ^ you're probably more up to date on changes that may affect github? have you seen any problems? | 13:33 |
fungi | yeah, i'm digging in logs now, looks like we stopped getting any trigger events from their repos but i haven't yet cross-checked our other github connection repos | 13:34 |
tobiash | was there a scheduler restart on jul 17? | 13:35 |
clarkb | tobiash: yes | 13:35 |
tobiash | are there github related exceptions? | 13:35 |
clarkb | we restarted to pick up zookeeper tls connection setup with the kazoo update. I suppose that could potentially be related too (though doubtful since other aspects are happy) | 13:36 |
tobiash | one potential change that alters github stuff has been merged on jul 16: https://review.opendev.org/710034 but this should have been a noop refactoring | 13:36
fungi | i'm checking now to identify the timestamp of the last GithubTriggerEvent in our debug logs, but... they're large (and compressed) | 13:38 |
fungi | okay, last GithubTriggerEvent we saw for any project (not just limited to the kata-containers tenant) was 2020-07-17 22:28:44 | 13:39
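A hedged sketch of the kind of search being described here; the scheduler's debug log path is an assumption:

    # last GithubTriggerEvent in the current and rotated (compressed) logs;
    # /var/log/zuul/debug.log* is an assumed location
    grep 'GithubTriggerEvent' /var/log/zuul/debug.log | tail -n 1
    zgrep -h 'GithubTriggerEvent' /var/log/zuul/debug.log.*.gz | tail -n 1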
tobiash | quick testing with the github-debugging script shows no general issue against a real github | 13:40 |
fungi | 2020-07-17 22:32:59,413 INFO zuul.Scheduler: Starting scheduler | 13:41 |
tobiash | and if that got broken you would see exception traces or github auth failures | 13:41 |
fungi | so yes, we stopped seeing GithubTriggerEvent at restart | 13:41 |
clarkb | I wonder if we needed to restart zuul-web but didn't | 13:41 |
clarkb | ? | 13:42 |
clarkb | since the github events are all funneled through that | 13:42 |
tobiash | should be compatible actually, but maybe | 13:42 |
fungi | oh, i should be looking at the zuul-web logs right? i forgot about these coming in via webhooks | 13:43 |
clarkb | fungi: well both sets of logs | 13:43 |
clarkb | the entry to the system is zuul-web though | 13:43 |
clarkb | and it was started in june | 13:43 |
fungi | finding a github-related traceback in the scheduler logs will take a while... with all the other tracebacks the scheduler throws it's needle-in-haystack stuff | 13:44 |
fungi | but maybe the zuul-web logs will get me there faster | 13:44 |
clarkb | ya may give you event ids to grep on | 13:45 |
clarkb | since you should be able to trace that from the entry point to job completion and back to reporting | 13:45 |
fungi | argh, i think zoom webclient in chromium has locked up the display on my workstation | 13:47 |
openstackgerrit | Kendall Nelson proposed openstack/project-config master: Updates Message on PR Close https://review.opendev.org/742194 | 13:50 |
fungi | zuul-web is definitely calling into the rpcclient to notify the scheduler of webhook events | 13:52 |
clarkb | we have had to restart zuul-web in the past for it to work after a scheduler restart. Maybe this is that class of problem and we should go ahead and restart it? | 13:53 |
clarkb | I think that will pick up the PF4 changes too | 13:53 |
fungi | the only exceptions in the web-debug.log today are websocket errors | 13:57 |
fungi | trying to see if the scheduler is logging those rpc commands at all | 13:59 |
clarkb | I wonder if that has started the zk transition | 13:59 |
clarkb | and old zuul-web is talking gearman while scheduler is checking zk | 13:59
*** roman_g has joined #opendev | 14:00 | |
fungi | yeah, nothing in the scheduler debug log about rpc calls from the zuul-web | 14:01 |
fungi | corvus: when you're around, do you have an opinion on trying a zuul-web restart to see if we start getting github webhook events again? or is there additional troubleshooting we should consider first? | 14:02 |
AJaeger | fungi, clarkb, ianw: ricolin proposes in https://review.opendev.org/#/c/742090/1 a couple of arm64 jobs, some advise from you would be welcome there, please | 14:08 |
corvus | fungi: so we see the events in zuul-web, but not on the scheduler? | 14:17 |
*** markmcclain has quit IRC | 14:23 | |
fungi | corvus: tobiash helped me find that geard logs them | 14:24 |
fungi | and zuul.GithubEventProcessor handles them but then does nothing | 14:24 |
*** markmcclain has joined #opendev | 14:24 | |
*** mlavalle has joined #opendev | 14:38 | |
clarkb | infra-root fyi http://lists.openstack.org/pipermail/openstack-discuss/2020-July/016022.html some of those changes seem like reasonable improvements. I think my concern is mostly for https://review.opendev.org/#/c/737043/16 which adds centos support back into base. We finally simplified down to not needing to have support for multiple platforms :/ | 15:57
clarkb | Our shift away from trying to provide general purpose roles is a direct consequence of never really having had much help for that so we ended up doing extra work for ourselves and didn't see much benefit. In this case I suppose if rdo is using it then we may get that help (basically we have users before the extra work) | 15:58 |
corvus | we've been down this road before | 15:59 |
corvus | we specifically put roles in playbooks/roles because we are targeting no external users | 16:02 |
clarkb | yup, because we had struggled mightily with trying to accommodate external users with the puppet stuff | 16:03
corvus | it does look well tested | 16:04 |
corvus | that might help make this effort more successful than the last 2 | 16:04 |
clarkb | I think my biggest concern from a mirror perspective is we expect them to be used by our system only. That means we can and do change paths and deprecate and remove old stuff | 16:06 |
fungi | even so, i'm dubious... could the modularity of the existing orchestration be improved so that they can plug in support for the stuff they need which we aren't running? that way they can maintain it separately | 16:06 |
clarkb | making those changes safely for external users becomes much more difficult (though they could guard against issues with testing) | 16:06 |
fungi | also worried that this could be the tip of the iceberg and we wind up back in puppet-openstackci territory again | 16:06 |
*** marios has quit IRC | 16:07 | |
clarkb | for example once we drop xenial test jobs I'd like to force https on our mirrors | 16:07 |
clarkb | but that may break external usage of the role if people are consuming http and can't https (as in the case of xenial apt) | 16:08 |
fungi | yeah, that's the main thing for me. we need to be able to make breaking changes to this without coordinating with downstream users | 16:08 |
clarkb | I'll write up a response with our concerns and why we've structured things the way we have | 16:09 |
clarkb | and see what Javier thinks about that | 16:09 |
fungi | right now, only accepting patches for improvements in what we're running that we can take advantage of helps reinforce that people shouldn't expect help trying to reuse it | 16:09 |
mnaser | i'm using the python builder images and i had some really tricky issue which is -- i need to install Cython -- to install the packages that i need | 16:14 |
mnaser | is this a pattern that may have been addressed in another way? | 16:14 |
mnaser | the trick is we build things inside a venv, and that's created inside `assemble` | 16:15 |
clarkb | mnaser: bindep? | 16:15 |
clarkb | basically the bindep compile target should get installed prior to installing pip packages | 16:15 |
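For reference, a minimal sketch of what clarkb is suggesting; the package name is the one from the conversation (distro names vary), and as the next few lines show this installs against the distro python rather than the image's /usr/local python:

    # hypothetical bindep.txt entry; the builder installs the [compile]
    # profile before building wheels
    python3-cython [compile platform:dpkg]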
mnaser | clarkb: well, i dont think i can do python3-cython in bindep | 16:16 |
clarkb | why not? | 16:16 |
mnaser | because that would install in system python, not the one the image uses inside /usr/local/bin/python | 16:16 |
clarkb | I could be wrong about this but I think it should find your system install via PATH and use it from there | 16:17 |
clarkb | whether or not that would actually work for compiling the cython bits I'm not sure | 16:17 |
mnaser | if i do that i end up with two pythons | 16:17 |
mnaser | the python3 that ships in the docker image is built from source into /usr/local/bin/python | 16:17 |
mnaser | if i install cython bits, it will end up installing debian shipped python | 16:18 |
corvus | how is one expected to install cython when using a built-from-source python? | 16:18 |
mnaser | corvus: pip install Cython worked just fine in the container | 16:18 |
*** chandankumar is now known as raukadah | 16:19 | |
mnaser | (i think they ship a lot of wheels so it works™) | 16:19 |
clarkb | ok so the actual problem is that installing python3-cython deps on python3 under debian and that mixes up the python install | 16:19
clarkb | we do install a dpkg override file thing in those images to avoid that I thought | 16:19 |
clarkb | maybe we need to add more content to that list? | 16:20 |
mnaser | i could install Cython from pip | 16:20 |
mnaser | but it would have to be two staged | 16:20 |
mnaser | install Cython into build venv, then actually install the pkgs i need | 16:20 |
corvus | it doesn't work to just put it first in requirements.txt? | 16:20 |
clarkb | we do the control file for python3-dev | 16:21 |
clarkb | maybe adding one for python3 is sufficient? and that addresses a larger global problem with the images? | 16:21 |
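One way to do what clarkb describes is an equivs-built stub package, so apt dependencies on the distro interpreter are satisfied without installing it alongside /usr/local/bin/python3; whether the opendev images use exactly this mechanism is an assumption:

    # hedged sketch: build and install a stub python3 package
    cat > stub-python3.ctl <<'EOF'
    Package: python3
    Version: 99:99
    Maintainer: nobody <nobody@example.org>
    Description: stub package; the real interpreter lives in /usr/local
    EOF
    equivs-build stub-python3.ctl
    dpkg -i ./python3_*.deb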
mnaser | (trying with Cython first) | 16:22 |
mnaser | WARNING: Cython is not installed. | 16:23 |
mnaser | ERROR: Cannot find Cythonized file rados.c | 16:23 |
mnaser | it's almost as if it was a pre-build step or something | 16:24 |
clarkb | another approach could be to install a wheel of your cythonized package | 16:24 |
mnaser | building this https://github.com/ceph/ceph/tree/master/src/pybind/rados | 16:25 |
*** roman_g has quit IRC | 16:25 | |
mnaser | 'cythonize' is called in setup.cfg | 16:25 |
corvus | mnaser: what line in assemble fails? | 16:27 |
mnaser | `/tmp/venv/bin/pip install -c /tmp/src/upper-constraints.txt --cache-dir=/output/wheels Cython git+https://opendev.org/openstack/glance@stable/ussuri git+https://github.com/ceph/ceph@octopus#subdirectory=src/pybind/rados PyMySQL python-memcached` | 16:27 |
mnaser | http://paste.openstack.org/show/796175/ | 16:28 |
mnaser | this is the dockerfile (but takes a little bit to build because ceph clone) | 16:28 |
*** priteau has quit IRC | 16:28 | |
mnaser | with this bindep - http://paste.openstack.org/show/796176/ | 16:29 |
corvus | i didn't know assemble took args | 16:29 |
corvus | i see that now | 16:29 |
*** priteau has joined #opendev | 16:29 | |
fungi | mnaser: yeah, cython is going to have to be preinstalled for setuptools to make use of it, i guess pip install could be run twice | 16:30 |
mnaser | fungi: pip install didn't work because we build wheels inside a venv | 16:30 |
mnaser | https://opendev.org/opendev/system-config/src/branch/master/docker/python-builder/scripts/assemble#L90 | 16:30 |
fungi | can't create the venv, pip install cython in it, then build the wheels in it? | 16:31 |
corvus | mnaser: oooh so this is all just in assemble. | 16:31 |
mnaser | yeah :) because we build them out of repo so its a little tricky | 16:31 |
clarkb | email reply sent to javier | 16:31 |
corvus | mnaser: i think maybe we should just change how assemble works :) | 16:31 |
*** dougsz has quit IRC | 16:32 | |
mnaser | im hoping to eventually use zuul-checked-out code to build the images but that's a todo | 16:32
clarkb | mnaser: was it confirmed that using debian cython doesn't work? | 16:32 |
clarkb | it wasn't clear to me if that was tried and failed or assumed to fail | 16:32 |
corvus | mnaser: maybe just add a "--pre-install" argument to assemble, so it installs cython in the venv in a separate step? | 16:32 |
clarkb | I do think updating assemble is likely the best longer term option but if system cython works that may get you moving quicker | 16:32 |
mnaser | clarkb: it installs a whole set of python packages and you end up with two pythons, and the build tool (assemble) doesn't actually use that 'newly installed python' | 16:33
mnaser | clarkb: i shortcircuited all of that and straight up added python3-rados but that pulled python and wasn't even usable in /usr/local/bin/python3 (but it was in /usr/bin/python3) | 16:34 |
clarkb | mnaser: right but cython is a command line tool isn't it? so would be in $PATH? and -builder's contents are thrown away except for the built wheels | 16:34 |
*** priteau has quit IRC | 16:34 | |
clarkb | the end result should be fine assuming cython can execute properly | 16:34 |
clarkb | its clunky for sure though | 16:34 |
mnaser | i dont think cython is a command line tool because it's imported in setup.py | 16:34
clarkb | ah | 16:34 |
mnaser | i mean, it might have one, but the setup.py seemed to import it and do things with the api | 16:34 |
clarkb | ya the package itself may call it via imports not cli | 16:35 |
corvus | maybe just stick something like lines 1-3 into assemble? http://paste.openstack.org/show/796178/ | 16:35 |
mnaser | yeah that's what im thinking. probably the most annoying part is adding argument parsing there | 16:35 |
mnaser | :P | 16:35 |
corvus | mnaser: yeah, i was like "let's just pseudocode this" :) | 16:36 |
corvus | anyway, that sounds like a good addition to assemble to me; we'll probably want a nice long comment about why it's necessary | 16:37 |
corvus | cause it sure did take me a few minutes to achieve understanding :) | 16:37 |
mnaser | corvus: this worked locally -- http://paste.openstack.org/show/796179/ so ill integrate it | 16:48 |
corvus | mnaser: sweet! my mental pattern matcher says there's a 60% chance that's bash and 40% it's perl. ;) | 16:50 |
mnaser | hahaha. | 16:50 |
mnaser | getopt is so not my thing :( | 16:50 |
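A minimal sketch of the pre-install idea (not the actual patch in 742249): parse one extra option and pip-install its value into the build venv before the main requirements, so setup.py-time imports like Cython resolve:

    # hedged sketch; /tmp/venv matches the path assemble already uses
    PRE_INSTALL=""
    OPTS=$(getopt -o p: --long pre-install: -n assemble -- "$@")
    eval set -- "$OPTS"
    while true; do
      case "$1" in
        -p|--pre-install) PRE_INSTALL="$PRE_INSTALL $2"; shift 2 ;;
        --) shift; break ;;
      esac
    done
    if [ -n "$PRE_INSTALL" ]; then
      /tmp/venv/bin/pip install $PRE_INSTALL
    fi
    # ...then the existing wheel build/install steps run unchanged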
openstackgerrit | Mohammed Naser proposed opendev/system-config master: python-builder: allow installing packages prior to build https://review.opendev.org/742249 | 16:53 |
mnaser | corvus, clarkb, fungi ^ you can see the paste to toy around with expected behaviour, cause i dont think we really test these images | 16:54
mnaser | i have a change i can test it in but unfortunately its in the vexxhost tenant | 16:55 |
corvus | mnaser: i think the gerrit images in opendev depend on python-builder, so we can do a noop depends-on with that to at least exercise it | 16:55 |
corvus | (because the gerrit images build in jeepyb) | 16:56 |
mnaser | actually, i think the uwsgi images will get rebuilt too and that's a good exercise too | 16:56 |
corvus | k | 16:56 |
mnaser | as that uses the 'arguments' thing i think | 16:56 |
mnaser | like the assemble uWSGI -- i think | 16:56 |
mnaser | ill make a gerrit one too | 16:56 |
mnaser | k that job is already gonna rebuild uwsgi base image so just gotta do a gerrit one | 16:57 |
openstackgerrit | Mohammed Naser proposed opendev/jeepyb master: DNM: testing python-builder image changes https://review.opendev.org/742251 | 17:00 |
*** priteau has joined #opendev | 17:00 | |
mnaser | and that's a test | 17:00 |
*** bolg has quit IRC | 17:01 | |
*** dtantsur is now known as dtantsur|afk | 17:15 | |
*** priteau has quit IRC | 17:21 | |
mnaser | corvus, clarkb, fungi: the docker image change built uwsgi and jeepyb just fine, the integration job failed but that seems unrelated: `/usr/bin/python3: No module named pip` | 17:56 |
mnaser | and the uwsgi image build also uses 'assemble uWSGI' so that means we didn't break that | 17:58 |
fungi | mnaser: yeah, needs ensure-pip probably | 17:59 |
clarkb | https://review.opendev.org/#/c/741277/3 fixes the integration job | 17:59 |
clarkb | fungi: yup | 18:00 |
mnaser | ok, that solved my issue when building locally too | 18:13 |
mnaser | i do have another issue but that's unrelated :) | 18:13 |
corvus | i'm going to restart all of zuul now | 18:30 |
corvus | (i confirmed the promote job ran for https://review.opendev.org/742229 ) | 18:30 |
corvus | i'm going to save queues, then run zuul_restart.yaml playbook | 18:32 |
clarkb | corvus: did the fix for that land to restart executors? | 18:32 |
clarkb | I think it did but I guess we know where to look if that bit fails. Oh also did we ever remove ze* from the emergency file? | 18:32 |
clarkb | if not then they may not automatically pull the new images? /me is checking | 18:32 |
corvus | i believe we removed them in order to switch to containers | 18:33 |
clarkb | ya I don't see them in emergency anymore so should be good | 18:33 |
corvus | okay, i'll execute now | 18:33 |
clarkb | though we update images hourly? so may still be behind? | 18:33 |
corvus | i thought we update when we start? | 18:33 |
clarkb | no, I think we docker-compose pull during main.yaml, and the start and stop playbooks only start and stop | 18:34
clarkb | however does a docker-compose up imply a pull? | 18:34 |
fungi | up does, start does not i think? | 18:35 |
clarkb | ah | 18:35 |
corvus | so i'll hit the button? | 18:35 |
clarkb | ya I think so I'm just looking at docker-compose docs and it does seem that it does pull | 18:35 |
clarkb | --quiet-pull is an option to it | 18:36 |
corvus | it's running | 18:36 |
corvus | this playbook is not an optimal stop/start sequence | 18:37 |
fungi | any moment someone's going to ask in here whether zuul is down ;) | 18:38 |
corvus | we seem to be stuck stopping a merger? i don't know what it's doing. | 18:39 |
corvus | it's gathering facts on mergers | 18:39 |
corvus | and it's said ok for zm01,2,5,6,8 | 18:39
corvus | apparently we are unable to gather facts on zm03,4,7? | 18:40 |
corvus | i can't ssh to any of those | 18:40 |
corvus | maybe our playbook should start with "gather facts on all zuul hosts" before it even starts. | 18:40 |
corvus | i'm going to abort this, add those 3 hosts to emergency, and re-run | 18:41 |
clarkb | 3 and 4 report connection reset by peer. 07 seems to hang for me | 18:41 |
corvus | ok we've moved on to stopping ze | 18:43
corvus | oh and we don't wait for executors to stop | 18:43 |
corvus | i don't know if that's okay | 18:43 |
corvus | if it works, it's going to make for confusing log files | 18:43 |
clarkb | ya that will probably end up starting a second set of processes | 18:44 |
clarkb | which may be a problem for the built in merger? | 18:44 |
corvus | i only see one container process running | 18:44 |
clarkb | oh I wonder if we are less graceful with containers | 18:44 |
corvus | yeah that's what i'm thinking | 18:44 |
corvus | that's probably okay for this | 18:45 |
clarkb | looks like jobs are starting | 18:49 |
corvus | re-enqueing | 18:50 |
fungi | skimming cacti, zm03 died almost 24 hours ago, zm04 roughly 20 hours ago, and zm08 is still responding to snmp | 18:50 |
clarkb | fungi: 07 not 08 | 18:50 |
clarkb | the new web ui is just different enough that you notice :) | 18:50 |
fungi | oh, yep, misread. zm07 died around the same time as zm04 | 18:51 |
fungi | so all three went within a few hours of each other | 18:51 |
fungi | that's an odd coincidence | 18:51 |
clarkb | shared hypervisor? | 18:51 |
fungi | possibly | 18:52 |
clarkb | I'm guessing we will want to reboot them, if services come up on them I'm betting they will be the older version which is actually ok for mergers (but we should probably update anyway?) | 18:53
corvus | the git sha it's reporting that it's running -- i can't find it in the history | 18:53 |
corvus | is that because we promoted a zuul merge commit? | 18:53 |
corvus | Zuul version: 3.19.1.dev125 6f0e46ce | 18:54 |
clarkb | corvus: that would be my hunch (it is the downside to using artifacts built in the gate) | 18:54
corvus | git describe says 3.19.0-141-g9a9b690dc | 18:54 |
clarkb | since timestamp is part of the commit hashing if a merge was used and not a fast forward we'll end up with different hashes | 18:54 |
corvus | i can't reconcile the numbers either | 18:54 |
corvus | shouldn't dev125 == 141? | 18:54 |
clarkb | sort of related I miss that cgit showed refs in the commit view | 18:56 |
clarkb | corvus: yes I would expect the 125 to be 141 | 18:56 |
corvus | i'm wondering if pbr counts differently; i'm re-installing in a venv to ask it | 18:56 |
fungi | `cd /etc/zuul-scheduler/;sudo docker-compose exec scheduler pbr freeze` says "zuul==3.19.1.dev125 # git sha 6f0e46ce" | 18:58 |
corvus | yes, that's what the scheduler says, i'm trying to match that up to the actual source tree | 18:58 |
fungi | so that at least suggests zuul is interpreting what pbr claims | 18:59 |
clarkb | we should copy the whole tree into the docker image during build time, but then on the actual prod image we just use the resulting wheel right? | 18:59 |
corvus | yeah, i know that the version at the bottom of the status page is the pbr version | 18:59 |
corvus | i just don't know what that version *means* | 18:59 |
fungi | oh, right, this is going to be built from a temporary merger state, not the branch history | 18:59 |
clarkb | fungi: ya but we would've expected the commit delta count to be closer | 18:59 |
corvus | so i'm currently updating my local zuul install so i can run pbr freeze in a source tree that i know is up to date. | 19:00 |
corvus | it's just really friggin slow | 19:00 |
clarkb | also its meeting time, not sure if we want to take this over into the meeting and do lunch/breakfast debug meeting or go with our regularly scheduled content | 19:01 |
corvus | zuul==3.19.1.dev137 # git sha 9a9b690dc | 19:01 |
corvus | i have no idea what we're running. | 19:01 |
fungi | current master branch tip installed in a venv here also claims "zuul==3.19.1.dev137 # git sha 9a9b690dc" | 19:02 |
clarkb | may need to go to the build job in the gate? and work forward from there? | 19:02 |
clarkb | is it possible we promoted the stable/3.x branch fix and are running that? | 19:03 |
clarkb | which would in theory have a lower delta count from 3.19? | 19:03
corvus | i'm checking that theory now | 19:04 |
corvus | it seems plausible | 19:04 |
fungi | looks like the devNNN should be two less than `git log --oneline 3.19.0..HEAD|wc -l` | 19:04 |
corvus | tox has decided to recreate again, so it'll be a minute | 19:05 |
fungi | stable/3.x should in theory claim 3.19.1.dev1 | 19:05 |
clarkb | fungi: its ahead of the tag though | 19:06 |
clarkb | (at least minimally) | 19:06 |
fungi | anyway nowhere near 125 | 19:06 |
fungi | huh, nevermind. zuul==3.19.1.dev3 # git sha 5d9119425 | 19:06 |
corvus | fungi: where's that from? | 19:07 |
fungi | pip install of origin/stable/3.x | 19:08 |
fungi | so i wonder why pbr thinks an install of master is dev137 when it has 141 commits since 3.19.0 | 19:08 |
clarkb | https://zuul.opendev.org/t/zuul/build/719c6a2ca02b40c88bd5e1e26da8673c/log/job-output.txt#1743 <- that should be for the master change | 19:08 |
corvus | yes i got that too | 19:09 |
clarkb | https://zuul.opendev.org/t/zuul/build/6b898f8f37f84f589aa796132ff6aa24/log/job-output.txt#1535 <- stable change | 19:09 |
clarkb | I wonder if docker-compose up doesn't do what we think it does | 19:09
clarkb | how old is that image? | 19:09 |
corvus | zuul/zuul-executor latest 6fbba1285c10 4 days ago 1.43GB | 19:10 |
corvus | that seems to be the problem. sigh. | 19:10 |
fungi | so up doesn't pull? | 19:10 |
clarkb | fungi: apparently not, maybe it only pulls if there are missing images | 19:11
clarkb | out of date images may not be sufficient? | 19:11
clarkb | I mean we should try a manual pull and see if it gets anything | 19:11 |
clarkb | if not then we may have an image promotion issue? | 19:11
corvus | pulling on ze01 | 19:11 |
* clarkb will context switch to meeting more now that we have a thread to pull on | 19:12 | |
corvus | zuul/zuul-executor latest 50e6f6d1f5eb About an hour ago 1.43GB | 19:12 |
corvus | so yeah, we need to pull everywhere. | 19:13
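The sequence that turns out to be needed on each host, since `up -d` reuses whatever image is already present locally; the executor compose directory is an assumption (the scheduler's is /etc/zuul-scheduler as shown above):

    cd /etc/zuul-executor/       # assumed per-service compose dir
    sudo docker-compose pull     # actually fetch the newly promoted image
    sudo docker-compose down
    sudo docker-compose up -d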
clarkb | maybe do the reboots of the mergers now and they can be fixed with everything else? | 19:14 |
fungi | or wait for the periodic pull? | 19:14 |
fungi | and yeah, i'm working on the mergers now | 19:14 |
fungi | just wanted to check a console for one first | 19:15 |
fungi | not that i expect to find much | 19:15 |
corvus | fungi: thanks; i'm not going to wait for them -- i'll proceed with pulling and restarting everything else | 19:15 |
fungi | corvus: sounds good | 19:15 |
fungi | they're in emergency disable anyway | 19:15 |
clarkb | thanks! | 19:16 |
fungi | once they're up i'll stop the merger on them as fast as i can and pull | 19:16 |
fungi | yeah, nothing useful on console, some bursts of messages about hung kernel tasks but no idea how old those are since the timestamps are seconds since boot | 19:18 |
fungi | no, carriage return does not produce new login prompts though | 19:19
fungi | anyway, will proceed with those three reboots (zm03,4,7) | 19:19 |
fungi | #status log performed hard reboot of zm03,4,7 via api after they became unresponsive earlier today | 19:21 |
openstackstatus | fungi: finished logging | 19:21 |
fungi | the containers on those three mergers are downed and pulling now | 19:24 |
fungi | and done | 19:24 |
ianw | fungi: last acme.sh update is "[Fri Jul 17 06:30:38 UTC 2020] Skip, Next renewal time is: Sun Jul 19 19:07:05 UTC 2020" | 19:24 |
ianw | it doesn't appear to have run since | 19:24
fungi | ianw: yeah, that's why i wondered if networking issues could be causing ansible trouble running it | 19:24
ianw | Friday 17 July 2020 06:31:22 +0000 (0:00:01.038) 0:03:01.934 *********** | 19:27 |
fungi | okay, i've done docker-compose up -d on the errant mergers now | 19:27 |
ianw | that's from letsencrypt.yaml -- that seems to be the last time it ran | 19:27 |
fungi | so they should be back in the fold | 19:27 |
ianw | so it might be more a bridge problem, and linaro is the canary | 19:27 |
fungi | corvus: do you want me to leave those in the emergency disable list for now, or go ahead and take them back out? | 19:27 |
fungi | don't want to further complicate the restart | 19:28 |
corvus | fungi: i'll handle it | 19:28 |
corvus | i'll take them out of emergency and run a pull on them | 19:28 |
fungi | thanks | 19:28 |
fungi | oh, i already ran a pull on them | 19:28 |
corvus | then it'll noop :) | 19:29 |
fungi | downed them with docker compose, did pull, then up -d | 19:29 |
fungi | cool | 19:29 |
corvus | all done | 19:29 |
fungi | thanks! | 19:29 |
corvus | i'll now do the full restart again | 19:29 |
corvus | Zuul version: 3.19.1.dev137 aaff0a02 | 19:36 |
corvus | that looks to be expected for master | 19:36 |
fungi | agreed, knowing that the id won't match, the count looks correct | 19:36 |
fungi | though i still don't quite get why it's 137 and not 141, i don't feel like digging in pbr internals right now to find out | 19:37 |
corvus | Zuul version: 3.19.1.dev137 | 19:37 |
corvus | that's from the executor | 19:37 |
corvus | so even if we did get the streams crossed, we ended up with the master version both places it matters | 19:38 |
corvus | re-enqueing, and i gave the release team the all-clear | 19:38 |
fungi | thanks! i'll keep an eye out for github events in the log and see if things have gone back to normal there | 19:41 |
corvus | #status restarted all of zuul at 9a9b690dc22c6e6fec43bf22dbbffa67b6d92c0a to fix github events and shell task vulnerability | 19:47 |
openstackstatus | corvus: unknown command | 19:47 |
corvus | #status log restarted all of zuul at 9a9b690dc22c6e6fec43bf22dbbffa67b6d92c0a to fix github events and shell task vulnerability | 19:47 |
openstackstatus | corvus: finished logging | 19:47 |
fungi | so i lied, i did want to go digging in pbr internals apparently (or at least i couldn't resist the urge) | 19:50 |
fungi | pbr is basically iterating over the output from `git log --decorate --oneline` looking for the earliest tag it encounters | 20:13 |
fungi | in the case of the current zuul master branch tip the commit tagged 3.19.0 appears on line 138 of the log, so the dev count is 137 commits since that tag | 20:14 |
fungi | however, if you `git log --oneline 3.19.0..HEAD` you get 141 commits | 20:15 |
fungi | so git is including 4 additional commits since that tag which don't appear before it in the log output | 20:17 |
fungi | er, don't appear after it i mean | 20:17 |
fungi | (above, whatever) | 20:17
fungi | if you ask git to give you the log for 3.19.0..HEAD it includes 170542089, 1d2fd6ff4, 25ccee91a and 99c6db089 because they're not in the history of 3.19.0 and are in the history of master, but because git log sorts them below (earlier chronologically compared to) the tagged commit, pbr isn't counting them toward its dev count | 20:24 |
fungi | this seems like a fundamental flaw in how pbr counts commits "since" a tag | 20:25 |
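Reproducing the discrepancy fungi describes, from a zuul checkout:

    git log --oneline 3.19.0..HEAD | wc -l
    # -> 141 commits reachable from master but not from the tag
    git log --decorate --oneline | grep -n 'tag: 3.19.0' | head -n 1
    # -> the tag decoration appears on line 138, so pbr reports dev137;
    #    the 4 merged commits that sort below the tag are never counted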
ianw | ... so back to the letsencrypt thing ... do we not run it daily? i thought we did, but maybe not | 20:40
ianw | nah, it's in periodic, so why hasn't it run ... | 20:41 |
clarkb | maybe ssh failed due to ipv4 problems | 20:43
clarkb | we converted our inventory to ipv4 not ipv6 addrs | 20:43 |
clarkb | due to rax ipv6 routing problems | 20:43 |
ianw | https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-letsencrypt ... it's getting skipped the last few days | 20:46 |
ianw | dependencies are : infra-prod-install-ansible (soft) | 20:47 |
ianw | but that doesn't seem to be failing... | 20:48 |
ianw | (be cool if that dependency link were clickable :) | 20:48 |
ianw | infra-prod-install-ansible: TIMED_OUT | 20:51 |
ianw | 2020-07-21 19:53:15.312219 | | 20:54 |
ianw | 2020-07-21 19:53:15.312511 | TASK [Make sure a manaul maint isn't going on] | 20:54 |
ianw | 2020-07-21 20:22:57.172003 | RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/opendev/system-config/playbooks/zuul/run-production-playbook.yaml@master] | 20:54 |
ianw | ... we've stopped it? | 20:54 |
ianw | 2020-07-17T21:02:56+00:00 | 20:55 |
ianw | ze rollout corvus | 20:55 |
ianw | so ... that explains that at least | 20:55 |
ianw | corvus: ^ are we ok to remove this now? | 20:55 |
*** roman_g has joined #opendev | 20:58 | |
clarkb | oh | 20:58 |
*** roman_g has quit IRC | 20:58 | |
clarkb | ianw I suspect so as that was for the zuul zk tls rollout which is complete now | 20:59 |
*** Dmitrii-Sh has quit IRC | 21:09 | |
*** Dmitrii-Sh has joined #opendev | 21:10 | |
corvus | yes | 21:20 |
clarkb | ianw: what makes https://mirror.regionone.linaro-us.opendev.org/wheel/ have more content than https://mirror.regionone.limestone.opendev.org/wheel/ ? with our emulated cross builds I think we should make all the arches available everywhere, but it's not clear to me how we distinguish that | 21:20
corvus | sorry about that. good thing we have a comment :) | 21:20 |
clarkb | we don't have any special rewrite rules for aarch64 that I see and the symlink to the web dir is for the parent wheels/ dir | 21:34 |
clarkb | so how do we end up seeing different things there | 21:34 |
clarkb | project rename announcement sent | 21:39 |
ianw | clarkb: they should be in sync ... | 21:43 |
ianw | clarkb: do you see afs errors on limestone? | 21:43 |
clarkb | hrm I had to reboot my laptop and haven't reloaded my key. Let me see | 21:44
ianw | ls: cannot access 'centos-8-x86_64': No such device | 21:44 |
ianw | the answer is yes :/ | 21:44 |
clarkb | I want to say others showed similar too but I haven't done a complete audit | 21:45
ianw | [Thu Jun 25 06:41:09 2020] afs: Lost contact with file server 23.253.73.143 in cell openstack.org (code -1) (all multi-homed ip addresses down for the server) | 21:45 |
ianw | [Thu Jun 25 06:41:47 2020] afs: file server 23.253.73.143 in cell openstack.org is back up (code 0) (multi-homed address; other same-host interfaces may still be down) | 21:45 |
ianw | last messages on limestone | 21:45 |
clarkb | so maybe it's a reboot-and-then-happy situation? or reload afs kernel module? | 21:45
ianw | i've never managed to get it to reload; i think a reboot is the easiest | 21:46 |
ianw | infra-root: just to confirm, i'm going to remove the DISABLE-ANSIBLE file? | 21:46 |
clarkb | ianw: ++ re disable file | 21:46 |
ianw | ok done, that should get certs renewed on linaro | 21:47 |
ianw | at next periodic run | 21:47 |
corvus | ianw: ++ remove disable (ftr) | 21:52 |
clarkb | gra1.ovh, bhs1.ovh, ord.rax, sjc1.vexxhost, and regionone.limestone all exhibit the lack of dirs problem under wheels/ via webserver | 22:00 |
clarkb | the others (rax, inap, and vexxhost regions as well as the linaro mirror) seem fine | 22:01 |
ianw | that's a good mix :/ | 22:01 |
clarkb | considering we already disrupted jobs with the earlier zuul restart maybe give it until tomorrow before I reboot things | 22:01 |
ianw | there's something maybe to try ... umm let me look at notes | 22:01 |
ianw | fs checkvolumes | 22:02 |
clarkb | oh right we've done that before to address mismatched cache info | 22:02 |
clarkb | and that runs on the mirror side right? | 22:02 |
ianw | https://mirror.regionone.limestone.opendev.org/wheel/ ... yeah that's done it | 22:02 |
clarkb | cool I'm loading up my key now and can get the others | 22:03 |
clarkb | just root fs checkvolumes on the mirror? | 22:03 |
ianw | umm i kinit /aklogged first | 22:03 |
clarkb | k | 22:04 |
clarkb | ianw: but you did that on mirror not the fs server? | 22:04 |
ianw | yes on the mirror node | 22:06 |
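The fix as applied here, run on the affected mirror node; the kerberos principal is site-specific and an assumption:

    kinit <admin principal>   # kerberos ticket
    aklog                     # exchange it for an AFS token
    fs checkvolumes           # make the cache manager re-check volume mappings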
clarkb | sjc1.vexxhost is now happy | 22:07 |
clarkb | thanks! I'll finish up the rest of the ones I found | 22:07 |
clarkb | and now we should be able to use region specific mirrors for aarch64 cross arch builds | 22:10 |
clarkb | not that I'm ready to do that given the issues with missing cryptography package but afterwards maybe | 22:11 |
ianw | right, wheel builds ... let me cycle back to that | 22:18 |
ianw | http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-07-20.log.html#t2020-07-20T01:55:35 was the last discussion | 22:20 |
ianw | it looks like the release job ran https://zuul.openstack.org/build/679fdc41cf9846579f9f2bc7100af1f2 | 22:21
clarkb | ianw: when that was happening I noticed difficulty reaching the mirror from home. I don't have ipv6 so attributed it to the same issue. More recently I haven't noticed that | 22:21
clarkb | but I may be getting lucky with network connectivity too | 22:21 |
ianw | releases look good on https://grafana01.opendev.org/d/LjHO9YGGk/afs?orgId=1 | 22:23 |
ianw | clarkb: so basically if your wheel was going to be built, i would assume it would be done now ... cryptography right? | 22:23 |
clarkb | yes cryptography, but maybe they released the new wheel after the job to update it ran | 22:24 |
clarkb | publish-wheel-cache-debian-buster-arm64 /me looks for those logs | 22:24 |
ianw | https://zuul.openstack.org/build/b10b1629af454b7f9e45c6178ef38a1b | 22:25 |
ianw | that's the latest run | 22:25 |
ianw | cryptography\=\=\=2.9.2 | 22:26 |
clarkb | https://zuul.openstack.org/build/b10b1629af454b7f9e45c6178ef38a1b/log/python3/wheel-build.sh.log ya that doesn't show 3.0.0 so I expect tomorrow's run in about 8 hours will get it | 22:26
clarkb | oh wait | 22:28 |
clarkb | does this use constraints? | 22:28 |
clarkb | or just requirements? | 22:28 |
clarkb | looks like it uses constraints so that may be the issue | 22:28 |
clarkb | this is going to be the major issue with trying to use openstack's wheel cache | 22:29 |
ianw | yeah it runs over upper-constraints.txt | 22:31 |
clarkb | corvus: ^ the intermediate cache layer may be worth looking into further now | 22:32
clarkb | I know you're distracted with the zuul release, but wanted to call that out | 22:32
ianw | we could do another run for the latest stuff -- but it already ran @ 1 hr 58 mins 49 secs and that's just doing the latest 2 branches | 22:32 |
clarkb | well openstack constraints hasn't updated yet | 22:33 |
clarkb | but zuul/nodepool don't use constraints so get the latest version | 22:33 |
corvus | clarkb: i think i'm going to have to catch up with you on this tomorrow | 22:33 |
clarkb | we could run a zuul specific job to add wheels | 22:33 |
clarkb | corvus: no worries, my day is near the end anyway. These early mornings are hard | 22:33 |
ianw | clarkb: we'd probably need to run it on a separate host | 22:43 |
clarkb | ianw: ya as a separate job and maybe even order them so that we don't conflict on writes? | 22:43 |
clarkb | some wheels may not build deterministically and could end up with different shas or whatever? | 22:44
ianw | yeah basically no wheels build deterministically | 22:44 |
clarkb | another option is to use openstack constraints file for arm64 builds | 22:45 |
clarkb | as sort of a hack to make jobs go faster | 22:45 |
ianw | one of the issues in https://review.opendev.org/#/c/703916/1/specs/wheel-modernisation.rst | 22:46 |
ianw | we might be able to make the job multi-node ... it already runs under parallel | 22:47 |
ianw | that might almost give us a bunch of "free" builds within the same overall time period, because i think there's some packages that take a very long time to build | 22:48 |
clarkb | ya they are slow under buildx but I imagine not really quick on the actual hardware either | 22:48 |
ianw | so if one host was doing them, the other host might be zooming through the smaller packages | 22:48 |
clarkb | fwiw there is also a bug in pypa's manylinux repo about doing arm stuff and tldr is it's hard for them to make it more generic due to all the flavors | 22:48
clarkb | one approach I considered was working with them to just make manylinux builds easier since all these deps are manylinux on x86 anyway | 22:49 |
fungi | i have an open change to attempt to add some determinism, but as ianw rightly pointed out, that depends on makefiles consistently treating cflags overrides, which is far from the case (some replace cflags, some append, some make up their own alternate xcflags and so on) | 22:49
clarkb | that said there is a manylinux for aarch64 and we could probably talk to cryptography about running those jobs maybe, either via qemu or a cloud provider with arm hardware | 22:49
ianw | fungi: yeah, also add to that list there's no standard way to "-j" | 22:49 |
ianw | https://github.com/pyca/cryptography/issues/5292 | 22:53 |
ianw | it might be worth getting some numbers on how fast linaro runs the test case | 22:53 |
*** tkajinam has joined #opendev | 22:55 | |
*** DSpider has quit IRC | 23:03 | |
*** Dmitrii-Sh has quit IRC | 23:05 | |
*** Dmitrii-Sh has joined #opendev | 23:06 | |
*** mlavalle has quit IRC | 23:08 | |
clarkb | the point about apple hardware is a fun one | 23:10
ianw | i'm running tox now, it's collected ~10,000 tests | 23:13 |
ianw | it's not running in parallel though afaics | 23:13
ianw | coverage run --parallel-mode ... it says it is, but it's not using all the cpus | 23:15 |
openstackgerrit | Pierre-Louis Bonicoli proposed zuul/zuul-jobs master: Use ansible_distribution* facts instead of ansible_lsb https://review.opendev.org/742310 | 23:15 |
ianw | might actually need --concurrency=multiprocessing | 23:16 |
ianw | 782.05user 18.82system 13:21.25elapsed 99%CPU (0avgtext+0avgdata 2212248maxresident)k | 23:21 |
ianw | 8inputs+1171456outputs (0major+1625364minor)pagefaults 0swaps | 23:21 |
ianw | i guess i'll offer to hook zuul up for them; i think all that would need to happen is allow the bot | 23:27 |
ianw | %Cpu(s): 93.9 us ... using xdist, that's more like it | 23:34 |
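A hedged example of the xdist approach ianw is describing; the exact invocation cryptography's test suite expects is an assumption:

    pip install pytest-xdist
    python -m pytest -n auto tests/   # -n auto spreads tests across all CPUs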
*** iurygregory has quit IRC | 23:42 | |
ianw | 574.65s (0:09:34) | 23:42 |
ianw | not as much as i'd thought | 23:42 |
*** tosky has quit IRC | 23:47 | |
openstackgerrit | Pierre-Louis Bonicoli proposed zuul/zuul-jobs master: Avoid to use 'length' filter with null value https://review.opendev.org/742316 | 23:52 |