Monday, 2020-07-27

*** ryohayakawa has joined #opendev  00:02
openstackgerrit: Andrii Ostapenko proposed zuul/zuul-jobs master: Add ability to use (upload|promote)-docker-image roles in periodic jobs
openstackgerrit: Andrii Ostapenko proposed zuul/zuul-jobs master: Add ability to use (upload|promote)-docker-image roles in periodic jobs
openstackgerrit: Merged openstack/project-config master: Update ceph grafana for nova
openstackgerrit: Ian Wienand proposed openstack/project-config master: Add opendev-3pci/project-config
openstackgerrit: Ian Wienand proposed openstack/project-config master: Add OpenDev 3rd-party CI tenant
*** sgw1 has quit IRC  02:40
clarkb: ianw: we discovered that our infra-prod ansible jobs weren't doing anything since the zuul local execution fix  02:47
clarkb: ianw: that got fixed by putting a piece of the infra-prod base job into opendev/base-jobs, as that repo is trusted  02:47
clarkb: just mentioning it as it explains why things like LE certs weren't updating  02:47
clarkb: and so if you notice other laggy applies that could be related  02:48
clarkb: should be good now though  02:48
ianw: ok, cool, thanks :)  glad i didn't have to debug it :)  02:49
*** sgw1 has joined #opendev  02:53
*** sgw1 has quit IRC  02:55
*** sgw1 has joined #opendev  03:11
*** chandankumar has joined #opendev  03:44
*** fdegir2 has joined #opendev  03:48
*** fdegir has quit IRC  03:49
*** Dmitrii-Sh has quit IRC  03:57
*** Dmitrii-Sh has joined #opendev  03:57
*** sgw1 has quit IRC  04:24
*** DSpider has joined #opendev  04:32
*** ysandeep|away is now known as ysandeep  05:19
*** marios has joined #opendev  06:03
*** tosky has joined #opendev  06:42
*** ysandeep is now known as ysandeep|rover  06:46
*** qchris has quit IRC  06:51
*** qchris has joined #opendev  07:04
*** lpetrut has joined #opendev  07:10
*** dpawlik2 has joined #opendev  07:22
*** hashar has joined #opendev  07:24
*** fressi has joined #opendev  07:27
openstackgerrit: Ian Wienand proposed openstack/project-config master: Add opendev-3pci/project-config
openstackgerrit: Ian Wienand proposed openstack/project-config master: Add OpenDev 3rd-party CI tenant
*** moppy has quit IRC  08:01
*** moppy has joined #opendev  08:01
*** tosky has quit IRC  08:03
*** dtantsur|afk is now known as dtantsur  08:05
*** ysandeep|rover is now known as ysandeep|lunch  08:15
*** hrw has joined #opendev  08:16
*** rpittau has joined #opendev  08:31
*** fdegir2 is now known as fdegir  08:40
*** ysandeep|lunch is now known as ysandeep|rover  08:42
*** tosky has joined #opendev  09:33
*** tkajinam has quit IRC  10:10
*** lpetrut has quit IRC  10:17
*** hrw has quit IRC  10:33
*** iurygregory has quit IRC  10:59
*** iurygregory has joined #opendev  11:01
*** ryohayakawa has quit IRC  11:03
*** ysandeep|rover is now known as ysandeep|afk  11:04
*** ysandeep|afk is now known as ysandeep|rover  11:29
*** mordred has joined #opendev  12:11
openstackgerrit: Artom Lifshitz proposed openstack/project-config master: Add nested-virt-ubuntu-focal label
openstackgerrit: Artom Lifshitz proposed openstack/project-config master: Add nested-virt-ubuntu-focal label
*** dtantsur is now known as dtantsur|brb  13:22
openstackgerrit: Pierre Crégut proposed openstack/diskimage-builder master: Adds gnupg2 for apt-keys in ubuntu-minimal
*** mnasiadka has joined #opendev  13:48
*** sgw1 has joined #opendev  13:50
*** lpetrut has joined #opendev  14:14
*** hashar has quit IRC  14:16
*** ysandeep|rover is now known as ysandeep|away  14:17
*** bhagyashris is now known as bhagyashris|away  14:24
openstackgerrit: Pierre Crégut proposed openstack/diskimage-builder master: Makes EFI images, bootable by bios
*** mlavalle has joined #opendev  14:43
openstackgerrit: Pierre Crégut proposed openstack/diskimage-builder master: Adds gnupg2 for apt-keys in ubuntu-minimal
*** lpetrut has quit IRC  14:49
*** mlavalle has quit IRC  14:58
fungi: infra-root: just a heads up, openstack had four more afs-related job failures again today, all ran from ze11  14:59
clarkb: fungi: I'm guessing a reboot is our next step since checkvolumes has been run there, right?  15:00
fungi: yeah, i did fs checkvolumes on both ze10 and ze11 late last week  15:00
fungi: i'm wary of rebooting lest we lose evidence of the underlying problem, but at this point i'm at a loss for what to check that we haven't already  15:01
clarkb: another option may be to restart the container to ensure all the bind mounts are properly in place? But ya, I'm not sure what else could be looked at  15:02
clarkb: fungi: oh, do ze10 and 11 run the same kernel and other hosts have newer or older kernels?  15:03
clarkb: that would depend on when the last reboot was. I wonder if that could explain why some executors are affected but not others  15:04
fungi: yeah, that could explain it since they rebooted recently  15:04
*** dtantsur|brb is now known as dtantsur  15:06
fungi: though so did ze02 and we haven't seen any errors from it yet  15:06
fungi: ze08 is running 4.15.0-70-generic, ze01/05/07 are on 4.15.0-91-generic, ze12 has 4.15.0-101-generic, ze03/04/09 have 4.15.0-106-generic, and ze02/06/10/11 are 4.15.0-107-generic  15:08
fungi: so far we've only seen issues with ze10 and ze11, mostly the latter  15:08
clarkb: there is a .112 iirc  15:11
clarkb: maybe we update 11 to that and reboot and see if it is happier?  15:11
clarkb: (that is based on memory with my local xenial server running hwe kernels)  15:12
fungi: yeah, there's a vmlinuz-4.15.0-112-generic in /boot  15:12
fungi: what's the graceful way to stop an executor now that they're containerized? we don't have initscripts or systemd units any longer, right?  15:12
fungi: just stop the container?  15:13
clarkb: correct, and it's not graceful  15:13
clarkb: zuul did just land graceful support though, so we can possibly update docker-compose to use that when we update the image  15:14
clarkb: but ya, stopping is fast and not graceful for now  15:14
*** mlavalle has joined #opendev  15:17
fungi: so might as well just `sudo reboot` i guess?  15:17
clarkb: yes, I think that may be roughly equivalent. Unless you want to docker-compose down the container so they don't start on boot  15:19
clarkb: then you can start it manually after checking afs? probably not necessary  15:19
fungi: yeah, i don't expect there's anything worth checking at reboot before starting the container  15:20
clarkb: I'll be around to help more directly in a few. Finishing up breakfast now  15:22
fungi: i'll go ahead and reboot ze11  15:22
fungi: #status log rebooted ze11 in hopes of eliminating random afs directory creation permission errors  15:23
openstackstatus: fungi: finished logging  15:23
fungi: Linux ze11 4.15.0-112-generic #113~16.04.1-Ubuntu SMP Fri Jul 10 04:37:08 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux  15:24
fungi: looks like everything came back up normally  15:25
clarkb: also it will be 37C ish today so I may melt  15:25
fungi: time difference between dmesg and syslog timestamps is still 10 minutes 37 seconds  15:27
*** mlavalle has quit IRC  15:27
clarkb: and ntpd shows us forcing a jump I expect?  15:28
*** mlavalle has joined #opendev  15:29
fungi: looks like syslog started out 10 minutes into the future  15:29
*** mlavalle has quit IRC  15:32
fungi: also the container runtime (according to logs) started in the future too (i.e. before ntpd corrected the system clock)  15:32
clarkb: fungi: I wonder if the clock is namespaced? and the container is still out of sync? that could explain why afs within the container is sad?  15:33
fungi: yeah, and i'm not entirely sure how to check that... i tried docker-compose exec but i think that creates a new container, right?  15:34
clarkb: no, exec runs in the existing container  15:34
clarkb: docker run makes a new container. exec attaches to an existing one  15:34
fungi: ahh, well, invoking date via docker-compose exec matched the system time for the server  15:35
fungi: docker-compose logs has some "Temporary failure in name resolution" exceptions raised in the statsd client (and with timestamps which would have been 10 minutes in the future at the time they were logged) but nothing else out of the ordinary  15:39
clarkb: if the issue persists maybe restart the containers so that services there start with the updated clock?  15:40
clarkb: and if that fixes it we can probably add a unit "after" dependency for docker on ntpd  15:41
fungi: yeah, that makes sense. will fix these latest release failures and keep an eye out for any more  15:42
*** lpetrut has joined #opendev  15:42
fungi: since we've also seen some (though fewer) similar failures on ze10 i wonder if i should go ahead and restart the container on it  15:42
fungi: `docker-compose down` followed by `docker-compose up -d` right?  15:43
fungi: though i suppose i have to wait for the ansible processes on it to actually die in between?  15:44
*** mlavalle has joined #opendev  15:44
clarkb: from the dir with docker-compose.yaml in it  15:45
fungi: i'll get started with that on ze10  15:45
fungi: hrm, actually maybe because they're in a common cgroup, the ansible processes all seem to die instantly  15:45
fungi: no lingering ssh persist or agents  15:46
fungi: #status log took the zuul-executor container down and back up on ze10 in order to try to narrow down possible causes for afs permission errors  15:47
openstackstatus: fungi: finished logging  15:47
openstackgerrit: Clark Boylan proposed openstack/project-config master: Run the grafana ansnible when grafana dashboard change
*** marios has quit IRC  16:02
openstackgerrit: melanie witt proposed openstack/project-config master: Fix copy-paste error for the ceph grafana nova ussuri panel
clarkb: fungi: thinking out loud here, if it continues to have troubles maybe we should start doing bionic rebuilds (or even focal?) as that will land VMs on different hypervisors with different clocks  16:08
clarkb: for afs bionic is more of a known quantity  16:08
* mordred waves  16:13
clarkb: mordred: welcome back  16:14
mordred: clarkb: I mean - I hate chasing solutions without fully knowing the problem - but chasing it on focal would likely put us in a better position for debugging  16:14
mordred: we'd at least be chasing recent vs old  16:14
mordred: and our userspace in those images is newer too - which shouldn't matter - but maybe does  16:15
clarkb: ya, what is odd is ze10 and 11 seem to have the problems but none of the others  16:15
clarkb: but the system clock issues stand out  16:17
fungi: yeah, i have a feeling they were reboot-migrated onto hosts with massively skewed clocks  16:18
clarkb: mordred: separately we discovered that buildx does have a pretty big performance impact for cpu intensive activities like compiling software for python wheels. We've had to disable nodepool's arm64 image builds in order to get the kazoo fix released.  16:19
mordred: why does the hypervisor clock matter?  16:19
mordred: clarkb: oh wow  16:19
clarkb: mordred: because VMs boot with the hypervisor clock as the starting point for their clocks  16:19
mordred: clarkb: yah - but wouldn't ntp update the kernel clock?  16:20
clarkb: mordred: and then the VM's ntpd makes a huge jump, and some software (perhaps even afs) dislikes that  16:20
mordred: that makes sense  16:20
fungi: yeah, no clue really, so far it's all conjecture  16:24
fungi: basically we had three executors spontaneously reboot last week  16:25
fungi: all came up with varying degrees of clock skew ntpd had to correct shortly after boot  16:25
fungi: ze02 had the least, ze10 was pretty bad and ze11 was the worst of the three (>10 minutes)  16:25
fungi: we pretty much immediately started getting random afs errors when ansible was asked to recursively ensure the project directories existed before writing tarballs, and also similarly before copying release notes into afs  16:26
fungi: most happened on ze11, some on ze10, so far none reported for ze02  16:27
fungi: is a good example from today  16:27
fungi: a file task to idempotently create /afs/ in case it doesn't exist failed with the error "There was an issue creating /afs/ as requested: [Errno 13] Permission denied: b'/afs/'"  16:29
*** hashar has joined #opendev  16:29
fungi: the "/afs/" in that error seems a bit misleading, i don't think ansible would actually have tried to create /afs/ because the task is recurse: false  16:30
*** dtantsur is now known as dtantsur|afk  16:59
clarkb: fungi: were those jobs reenqueued? I'm wondering if we need to do anything other than wait and monitor at this point?  17:10
fungi: no, for the tarball write errors we can't reenqueue, i have to manually copy them from pypi into afs along with scraping the signatures from the build logs and adding those  17:11
clarkb: right, that is because we've already pushed to pypi at that point and reuploads to pypi are errors, got it  17:12
fungi: but a couple of the failures were release announcements, which might be reenqueueable if they ran in the tag pipeline instead of the release pipeline, i need to check that  17:12
fungi: the tarballs fix looks like: 1. pull wheel and sdist from pypi for the failed release, 2. copy signature text from the build log and put it in similarly-named .asc files, 3. use gpg --verify to check that the signatures are valid for the wheel and sdist, 4. copy the wheel/sdist/sigs into the correct directory in afs, 5. vos release project.tarballs  17:14
fungi: step 3 is how we confirm the files pypi is serving are really the ones we built  17:15
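The five steps above can be sketched as a dry-run script. All names here (project, version, AFS directory) are hypothetical placeholders, and the script only echoes the commands it would run:

```shell
# Dry-run sketch of the manual tarball fix described above; nothing here
# touches PyPI or AFS. All names are hypothetical placeholders.
PROJECT=example-project
VERSION=1.0.0
AFS_DIR="/afs/example/project.tarballs/$PROJECT"   # placeholder path

plan=$(
    for f in "$PROJECT-$VERSION.tar.gz" "$PROJECT-$VERSION-py3-none-any.whl"; do
        echo "1/2. fetch $f from pypi; save its signature text as $f.asc"
        echo "3.   gpg --verify $f.asc $f"
        echo "4.   cp $f $f.asc $AFS_DIR/"
    done
    echo "5.   vos release project.tarballs"
)
printf '%s\n' "$plan"
```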
clarkb: fungi: some of the jobs do run successfully on ze11 with afs writes too, right? I'm wondering if maybe we should add a retries: 3 or something to the job tasks  17:18
clarkb: to at least reduce the impact while we sort it out  17:18
fungi: i think so? i'm really not sure what jobs do this exact same file task with ansible though  17:19
fungi: but maybe we could find other builds of the same jobs succeeding on ze11?  17:19
clarkb: ya, that would at least be able to rule in that it succeeds sometimes  17:19
clarkb: (if we can't find evidence with that one job we'd have to keep looking)  17:20
clarkb: I think the docs jobs are the other afs writers from executors  17:20
fungi: not entirely sure how to query that, maybe mysqlclient  17:20
clarkb: we don't record the executor in the db. It is in the logs though  17:20
clarkb: so maybe grep for the job name in ze11 executor logs?  17:20
fungi: ahh, it's in the manifest, right  17:20
clarkb: and that gives a list of uuids?  17:20
fungi: oh! yeah, executor logs. great idea  17:21
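That grep approach in a self-contained form. The sample line below is a simplified stand-in for the real zuul-executor debug log format, and the 32-hex-character build UUID layout is an assumption:

```shell
# Extract distinct build UUIDs for one job name from executor log text.
# In practice you would grep the executor debug log on ze11 itself.
sample='2020-07-27 05:58:30 DEBUG zuul.AnsibleJob: [build: 0123456789abcdef0123456789abcdef] job release-openstack-python'
uuids=$(printf '%s\n' "$sample" \
    | grep 'release-openstack-python' \
    | grep -oE '[0-9a-f]{32}' \
    | sort -u)
printf '%s\n' "$uuids"
```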
clarkb: the neutron fix for the logstash issue has landed so I'm going to give that whole system a restart  17:22
fungi: looks like the releasenotes failure indicates we're running commands like this in a subprocess: `mkdir -p /afs/`  17:23
fungi: the -p would in theory try to create /afs/ if it wasn't found for some reason  17:25
fungi: so this could also just be good ol' network issues with the server instance the executor's running on  17:26
clarkb: oh interesting  17:27
fungi: all four failures were clustered between 05:58:30 and 06:09:54  17:27
mordred: that does sound like a good possibility for good-ol-network  17:28
fungi: previous failures were clustered together too, so could be brief periods where that executor was unable to reach the afs servers  17:28
fungi: though i find it odd that the file task on the tarball jobs gets a similar error even though it sets recurse: false
fungi: or does that option not work the way i think it works?  17:33
fungi: mmm, so far all the builds of release-openstack-python i can find on ze11 failed the same way  17:39
clarkb: #status log Restarted logstash geard, logstash workers, and logstash now that neutron logs have been trimmed in size.  17:40
openstackstatus: clarkb: finished logging  17:40
fungi: oh, here's one which succeeded 2020-07-22 13:52:20  17:42
clarkb: cool, so it isn't a 100% failure and retries may help  17:43
fungi: and the spontaneous reboot was Wed Jul 22 16:23  17:43
fungi: so yeah, it's a 100% failure since wednesday's reboot  17:43
clarkb: unrelated, it looks like cert check is happy with those expiring certs, so fixing the ansible after the zuul updates seems to have been sufficient  17:43
clarkb: oh fun  17:44
fungi: if i authenticate with my afs creds on ze11 and `mkdir -p /afs/` it doesn't error  17:46
fungi: but it seems to error every time the executor tries it  17:46
clarkb: fungi: have you tried it within the container?  17:47
clarkb: maybe use zuul's creds in there though  17:47
fungi: i'm not sure how to go about trying that  17:47
clarkb: fungi: I think you can docker exec into the executor container and exec bash, then do what you'd normally do?  17:47
clarkb: mordred: ^ you had tested bwrap things with afs in containers and may have steps sorted out?  17:48
fungi: ahh, maybe, didn't realize we had interactive shells in the images but yeah i guess that makes sense  17:48
fungi: neat, we don't set a default kerberos realm in these  17:50
fungi: mkdir: cannot create directory ‘/afs/’: Permission denied  17:51
fungi: ls: cannot access '/afs/': No such file or directory  17:51
fungi: maybe we aren't bindmounting /afs into the containers?  17:51
fungi: `mount` says:  17:52
fungi: /dev/xvda1 on /afs type ext4 (rw,noatime,nobarrier,errors=remount-ro,data=ordered)  17:52
fungi: as opposed to the main system context where it instead says:  17:52
fungi: AFS on /afs type afs (rw,relatime)  17:52
fungi: i think that must be our problem  17:53
clarkb: I wonder if it isn't available on boot when services start due to the ntp issue  17:53
clarkb: but then it is later? you restarted ze10 right? does it show that same issue?  17:54
fungi: i'm about to check now  17:54
fungi: yep! on ze10 it's accessible now  17:57
fungi: so i concur, it seems like maybe it's starting the container before afsd  17:58
clarkb: cool, restarting the containers on ze11 should fix the immediate issue, then we can add unit hints to fix the next reboot  17:59
fungi: #status log took the zuul-executor container down and back up on ze11 in order to make sure /afs is available  18:03
openstackstatus: fungi: finished logging  18:03
fungi: so looks like the reason ze02 hasn't hit this is that it won the race at boot time  18:05
clarkb: I wonder if afs waits for ntp and docker doesn't  18:05
clarkb: the bigger delta likely makes ntp take longer  18:06
fungi: docker definitely doesn't wait for ntp because the docker logs have skewed timestamps from prior to ntpd correcting them  18:06
fungi: okay, i've manually checked all 12 executors and their containers now have a usable /afs, so at least this shouldn't bite us again until one of them gets rebooted  18:08
clarkb: and before that happens we can drop in one of those hint files  18:12
clarkb: we want docker to start after afs  18:12
clarkb: iirc you can make a file that specifies just that relationship without modifying the other units  18:13
clarkb: I thought we did that somewhere else and can look for it in a bit  18:13
fungi: currently /etc/init/docker.conf includes "start on (filesystem and net-device-up IFACE!=lo)"  18:14
fungi: though afsd is handled via sysvinit compat for /etc/init.d/openafs-client  18:15
fungi: "...create a directory named unit.d/ within /etc/systemd/system and place a drop-in file name.conf there that only changes the specific settings one is interested in..."  18:19
JayF: it should literally be as simple as echo -e "[Service]\nAfter=after_unit.service\n" > /etc/systemd/system/unit.service  18:22
fungi: their specific example is for a /etc/systemd/system/httpd.service.d/local.conf  18:22
fungi: looks like we have afs.mount and openafs-client.service units, i'm not sure which we should make docker go after  18:23
fungi: though i suppose we could just specify both  18:23
JayF: I'd 100% do afs.mount (maybe both) if openafs-client.service is sysvcompat  18:24
fungi: that's a good point, it might return control to the calling process before it's actually started  18:24
JayF: because sysvcompat services are "up" as soon as /etc/init.d/servicename start (or similar) returns, which isn't always the same as the service being up  18:24
fungi: i like the cut o' yer jib, matey  18:25
JayF: I really love the systemd service/unit model, and I don't sysadmin much in my current job  18:26
JayF: and happened to look at this channel at the right time to help :D  18:26
fungi: thanks! ;)  18:26
JayF: np; gl  18:26
fungi: currently /lib/systemd/system/docker.service (from the docker-ce package) claims firewalld.service containerd.service  18:28
JayF: don't ever trust files on disk for systemd units  18:28
JayF: because drop-ins can change them  18:29
JayF: systemctl cat docker.service # gives you the effective unit  18:29
fungi: good call. at least it matches in this case  18:29
clarkb: also daemon-reload needs to run to pick up changes  18:29
clarkb: which, if we cared about this outside of boot time, you'd want to add to your steps above  18:29
fungi: thankfully in this case we don't  18:30
clarkb: I'm still not having any luck finding where we do this already (I know we did it for something) but the stuff above lgtm  18:30
fungi: so anyway, we'll have our executor role create a /etc/systemd/system/docker.service.d/after-afs.conf containing a [Unit] section with only " firewalld.service containerd.service afs.mount openafs-client.service"  18:32
JayF: Those drop-ins are additive  18:32
JayF: I believe you just need to add the ones you want to add  18:32
JayF: but I'm not 1000% sure  18:32
clarkb: JayF: fungi: correct, it is additive  18:32
fungi: oh, so it can just be "After=afs.mount openafs-client.service" and that'll be appended  18:32
fungi: even nicer  18:33
fungi: okay, furious typing commences  18:33
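The drop-in fungi is about to write would look roughly like this. This is a sketch; afs.mount and openafs-client.service are the unit names identified above, and since drop-ins are additive only the new ordering entries are needed:

```ini
# /etc/systemd/system/docker.service.d/after-afs.conf
[Unit]
After=afs.mount openafs-client.service
```

A `systemctl daemon-reload` then makes the override visible, and `systemctl cat docker.service` shows the merged result.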
*** lpetrut has quit IRC  18:33
JayF: If you ever need to remove something from a list like that, e.g. you need to remove an After=  18:33
*** tosky has quit IRC  18:33
JayF: you do it like this: [Service]\nAfter=\nAfter=[list of things or multiple After= directives]  18:34
JayF: Basically a blank entry clears a list  18:34
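As a concrete drop-in, that reset idiom looks like this (note that After= belongs in the [Unit] section; the units listed are illustrative only):

```ini
# Drop-in that replaces, rather than extends, an ordering list:
# the empty After= clears anything accumulated so far.
[Unit]
After=
After=network-online.target afs.mount
```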
fungi: neat, though somewhat black magic  18:36
fungi: then again, systemd is nothing if not dark arts  18:36
JayF: It's well documented in long, detailed man pages. Of course, that also means it's hard to find when you wanna do one specific thing.  18:36
fungi: i have my gate seals in place, so this summoning should go okay as long as i don't forget the binding ritual  18:37
JayF: Eh. No harder to understand than reading some of the old-school crazy combinations of bash scripts with libraries, including groups of other bash scripts and using numbers at the beginning to force ordering  18:37
JayF: dark magic is any magic a person isn't familiar with, IME  18:37
JayF: I have many users who'd call OpenStack dark magic ;)  18:37
fungi: count me in those ;)  18:37
clarkb: I think the difference with systemd is a lot of the magic is super implicit. Using an @ in a service name for example  18:38
clarkb: whereas bash is bash  18:38
clarkb: basically it's a matter of learning the magic. with bash many already know it  18:38
JayF: That's fair, I'd complain about it a lot harder if it wasn't so well documented. Most of it I have internalized at this point, too, which helps  18:39
fungi: yep, i'm certainly a shellusionist, need totally different tomes to be a systemdmancer  18:39
JayF: I'm just a bard with a bunch of scrolls in my bag of holding, and I'm not really sure how all of them work  18:40
clarkb: JayF: it is well documented but figuring out @ took me forever because it's not easily searchable  18:41
clarkb: but now that I know it it's fine  18:41
JayF: The @, and the socket activation are the darkest of the systemd magiks  18:41
clarkb: it's the sort of thing where if you want to search "systemd unit parameters" you get immediately to the info you need. But if it's "What is this @ in the unit filename and why is it special" you get to go on a long quest :)  18:43
fungi: socket activation doesn't seem much different from inetd, so doesn't really strike me as strange  18:43
JayF: As part of our Ironic deployment here, we run the nova-computes with an "@" and the string after-the-@ is set to Host=, to make it easy to run multiple instances of nova-compute on a single box  18:43
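A minimal sketch of such a templated unit. The file name, paths, and command here are hypothetical; systemd substitutes %i with whatever follows the @ in the instance name:

```ini
# /etc/systemd/system/nova-compute@.service (hypothetical example)
[Unit]
Description=nova-compute instance for host %i

[Service]
ExecStart=/usr/local/bin/run-nova-compute --host %i
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

`systemctl start nova-compute@node1` would then run an instance with %i expanded to node1.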
openstackgerrit: Jeremy Stanley proposed opendev/system-config master: Start zuul-executor after afsd and /afs is mounted
fungi: clarkb: JayF: mordred: smcginnis: ^  18:51
clarkb: fungi: is where we do it elsewhere fwiw  18:53
clarkb: not sure if you want to add the owner/group/mode settings or not  18:53
fungi: systemd is running as root so i doubt it cares  18:55
clarkb: unless it's worried about unprivileged users modifying privileged services (but I don't think that is an issue here either as ansible runs as root)  18:57
smcginnis: fungi: Great find.  19:01
mordred: fungi: nice! +A but there's a comment from JayF  19:26
openstackgerrit: Clark Boylan proposed opendev/system-config master: Deny Gerrit /p/ requests
clarkb: fungi: ^ that's WIP but we talked about what that would look like earlier if you want to double check it now  19:33
clarkb: infra-root: is a simple easy review for etherpad management  19:50
*** DSpider has quit IRC  19:54
*** hashar has quit IRC  20:02
openstackgerrit: Merged opendev/system-config master: Start zuul-executor after afsd and /afs is mounted
openstackgerrit: Merged opendev/system-config master: Run our etherpad prod deploy job when docker updates
*** fressi has quit IRC  20:46
*** tosky has joined #opendev  21:25
corvus: clarkb: it's a little late, sorry, but i just added a meeting topic for tomorrow  21:45
corvus: (i talked with some folks at google about CI system integration with Gerrit and wanted to share that with folks)  21:45
clarkb: corvus: no worries, I got distracted by other things and haven't sent it out yet  21:46
clarkb: corvus: did you update the wiki? I can wait a bit longer if not  21:46
corvus: clarkb: i did  21:47
clarkb: ah yup, refresh shows it. I'll get the email out soon  21:47
fungi: fwiw, i would love to hear what they had to say  21:50
fungi: so thanks for adding it to the agenda!  21:50
openstackgerrit: Ian Wienand proposed openstack/project-config master: Add opendev-3pci/project-config
openstackgerrit: Ian Wienand proposed openstack/project-config master: Add OpenDev 3rd-party CI tenant
*** tkajinam has joined #opendev  22:52
*** mlavalle has quit IRC  22:56
*** dtroyer has quit IRC  23:03
*** qchris has quit IRC  23:06
openstackgerrit: Merged openstack/project-config master: Run the grafana ansnible when grafana dashboard change
openstackgerrit: Merged openstack/project-config master: Fix copy-paste error for the ceph grafana nova ussuri panel
*** qchris has joined #opendev  23:07
*** tosky has quit IRC  23:25

Generated by 2.17.2 by Marius Gedminas - find it at!