jinyuanliu_ | hi | 02:45 |
jinyuanliu_ | https://review.opendev.org/SignInFailure,SIGN_IN,Contact+site+administrator | 02:46 |
jinyuanliu_ | I have newly registered an account. This error is reported when I log in. Does anyone know how to deal with it | 02:47 |
*** ysandeep|out is now known as ysandeep | 05:11 | |
*** odyssey4me is now known as Guest7196 | 05:47 | |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 07:11 |
*** jpena|off is now known as jpena | 07:23 | |
*** hjensas is now known as hjensas|afk | 07:27 | |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 07:53 |
opendevreview | Oleg Bondarev proposed openstack/project-config master: Update grafana to reflect dvr-ha job is now voting https://review.opendev.org/c/openstack/project-config/+/805594 | 08:02 |
*** ykarel is now known as ykarel|lunch | 08:05 | |
*** ysandeep is now known as ysandeep|lunch | 08:14 | |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 08:20 |
noonedeadpunk | hey there! | 08:30 |
noonedeadpunk | It feels to me like the git repos might be out of sync right now. https://paste.opendev.org/show/809294/ | 08:31 |
noonedeadpunk | and at the same time on the other machine I get the valid 768b8996ba4cb24eb2e5cd5dc149cd114186debd | 08:31 |
*** ykarel|lunch is now known as ykarel | 09:24 | |
ianw | noonedeadpunk: is the other machine coming from another ip address? | 09:25 |
noonedeadpunk | yep | 09:26 |
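A quick way to confirm that kind of skew (editor's sketch; opendev.org fronts several gitea backends and the load balancer keys on the client's source IP, which is why the source address matters here) is to compare the branch tip served by the gitea frontend with what gerrit itself serves over anonymous https:

```bash
# If these two report different SHAs, replication from gerrit to at least one
# gitea backend is lagging (768b8996... was the expected value in this case).
git ls-remote https://opendev.org/openstack/cinder refs/heads/master
git ls-remote https://review.opendev.org/openstack/cinder refs/heads/master
```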
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 09:29 |
ianw | d615a7da 15:30:25.611 [1904d04c] push ssh://git@gitea03.opendev.org:222/openstack/cinder.git | 09:31 |
ianw | infra-root: ^ that to me looks like a stuck replication from gerrit, and matches the repo noonedeadpunk is cloning | 09:32 |
ianw | there are a few others as well, some older, up to sep 9 is the oldest | 09:32 |
ianw | i've killed that process, i don't really have time for a deep debug at this point | 09:43 |
ianw | https://paste.opendev.org/show/809297/ <- i have killed those processes, which all seemed stuck. push queues seem empty now | 09:50 |
ianw | we should keep an eye out to see if more are getting stuck | 09:51 |
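For reference, a sketch of how stuck replication pushes like these are usually inspected and cleared over Gerrit's admin SSH interface (standard Gerrit and replication-plugin commands; admin access and the usual SSH port assumed):

```bash
# List pending tasks; stuck replication shows up as long-lived
# "push ssh://git@giteaNN.opendev.org..." entries like the one quoted above.
ssh -p 29418 review.opendev.org gerrit show-queue --wide

# Kill a task that never completes, by its task id (e.g. d615a7da above).
ssh -p 29418 review.opendev.org gerrit kill d615a7da
```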
noonedeadpunk | git pull worked for my faulty machine | 09:55 |
noonedeadpunk | thanks! | 09:55 |
*** odyssey4me is now known as Guest7209 | 10:02 | |
*** ysandeep|lunch is now known as ysandeep | 10:08 | |
iurygregory | opendev folks, we started to see ironic jobs failing because of reno 3.4.0... I'm wondering if anyone saw the same problem in other projects (e.g https://zuul.opendev.org/t/openstack/build/6f39928da6a04d2ab0a64258d7309bfa ) | 10:38 |
iurygregory | maybe an issue with pip? | 11:01 |
*** dviroel|out is now known as dviroel | 11:10 | |
rosmaita | iurygregory: seeing that on a lot of different jobs, usually see that when the pypi mirror is outdated | 11:13 |
*** jpena is now known as jpena|lunch | 11:24 | |
iurygregory | rosmaita, so we just wait to do some recheck? | 11:36 |
*** ysandeep is now known as ysandeep|brb | 11:46 | |
*** ysandeep|brb is now known as ysandeep | 11:54 | |
rosmaita | iurygregory: other than reporting it here (not sure the infra team can do anything about the mirrors), not sure what else we can do | 12:04 |
iurygregory | I asked in #openstack-infra also =) | 12:19 |
*** jpena|lunch is now known as jpena | 12:20 | |
ykarel | Clark[m], fungi, ^ happened again | 12:25 |
ykarel | some mirrors affected like pip install --index-url=https://mirror.mtl01.iweb.opendev.org/pypi/simple reno==3.4.0 | 12:25 |
ykarel | i ran PURGE as you suggested last time but it doesn't seem to help, maybe i'm running it wrongly | 12:26 |
ykarel | ran curl -XPURGE https://mirror.mtl01.iweb.opendev.org/pypi/simple/reno | 12:26 |
ykarel | seems to work now | 12:27 |
ykarel | this time i ran it without reno, maybe that's why it worked? | 12:27 |
ykarel | curl -v -XPURGE https://mirror.mtl01.iweb.opendev.org/pypi/simple | 12:27 |
ykarel | iurygregory, both the failures you shared were on mirror.mtl01.iweb.opendev.org, which i see working now | 12:30 |
ykarel | have you seen on other mirrors too? | 12:30 |
ykarel | if not can try recheck and see if all good now | 12:30 |
iurygregory | ykarel, I haven't checked other jobs, I will put a recheck and see how it goes | 13:09 |
ykarel | ack | 13:09 |
*** lbragstad_ is now known as lbragstad | 13:15 | |
fungi | ykarel: it's not our mirrors which need to be purged, it's pypi's cdn | 13:19 |
fungi | iurygregory: ^ | 13:19 |
fungi | we're just proxying through whatever pypi's serving, and their cdn seems to sometimes serve obsolete content around montreal canada | 13:19 |
ykarel | fungi, yes correct, i messed up the words | 13:19 |
fungi | no, i mean you're calling curl to purge our mirror which won't do anything, you need to purge that url from pypi's cdn | 13:20 |
fungi | which should cause fastly to re-cache it from the correct content (hopefully) | 13:20 |
ykarel | ohkk, i wasn't aware of that | 13:20 |
ykarel | i just ran the above curl and it worked somehow | 13:21 |
ykarel | maybe it was just timing | 13:21 |
fungi | rosmaita: also we have no pypi mirror to get outdated, we just proxy pypi | 13:22 |
fungi | ykarel: yes, from what we've seen in the past it's intermittent and may recur | 13:22 |
fungi | my suspicion is that pypi still maintains a fallback mirror of their primary site which they instruct fastly to pull from if it has network connectivity issues reaching the main site, and that fallback mirror may be stale, and for whatever reason the fastly cdn endpoints near montreal have frequent connectivity issues and wind up serving content from the fallback site | 13:23 |
rosmaita | fungi: ok, good to know | 13:23 |
ykarel | fungi, ack | 13:25 |
fungi | it used to be the case that the mirroring method pypi used to populate their fallback site didn't create the necessary metadata to make python_requires work, so we'd see incorrect versions of packages selected. it seems like they solved that, but possible that fallback could be very behind in replication or something | 13:26 |
fungi | so instead we're just ending up getting old indices sometimes | 13:26 |
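To make the distinction concrete, a hedged sketch of the usual check: look at what the opendev "mirror" (really a caching proxy) is handing out, and if it is stale, aim the purge at pypi's CDN rather than at the proxy (whether the CDN honors anonymous PURGE requests is up to pypi/Fastly and can change over time):

```bash
# Does the index the proxy serves even list reno 3.4.0?
curl -s https://mirror.mtl01.iweb.opendev.org/pypi/simple/reno/ | grep -c 'reno-3.4.0'

# If not, the stale content is coming from pypi's CDN; purge there, not at the proxy.
curl -X PURGE https://pypi.org/simple/reno/
```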
rosmaita | i'm getting an openstack-tox-docs failure during the pdf build ... sphinx-build-pdf.log tells me "Command for 'pdflatex' gave return code 1, Refer to 'doc-cinder.log' for details" ... but i can't find doc-cinder.log in https://zuul.opendev.org/t/openstack/build/f45e84debbae44449bf6d98cc0807d68/logs ... where should i be looking? | 13:27 |
fungi | it's possible the job isn't configured to collect that log | 13:27 |
rosmaita | oh | 13:27 |
fungi | looking to see if i can tell where it would be written on the node | 13:28 |
tosky | https://47bdf347398a802ddc78-6d4960ba8e43184f8d8ec59d1a3f8e83.ssl.cf2.rackcdn.com/760178/13/check/openstack-tox-docs/f45e84d/sphinx-build-pdf.log | 13:28 |
tosky | rosmaita: ^^ | 13:28 |
fungi | ahh, it just has a different filename | 13:28 |
rosmaita | is that the same log? | 13:28 |
fungi | dunno, do pdf builds work locally for you? if they break similarly you can compare the log content | 13:28 |
rosmaita | fungi: guess i will have to check | 13:29 |
fungi | looks like it contains the same error referring you to the other log, so i think it's not | 13:29 |
fungi | it probably wrote it at /home/zuul/src/opendev.org/openstack/cinder/doc/build/pdf/doc-cinder.log | 13:30 |
iurygregory | fungi, ack | 13:37 |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 13:38 |
Clark[m] | fungi: rosmaita: https://zuul.opendev.org/t/openstack/build/f45e84debbae44449bf6d98cc0807d68/log/job-output.txt#2013 shows the file it redirected to which is the one tosky linked to | 13:42 |
fungi | Clark[m]: yeah, check the end of that file, it refers you to the other file rosmaita is looking for | 13:43 |
fungi | maybe it thinks it's referring to itself by a different name? | 13:43 |
fungi | or maybe it contains all the same contents as the other file it references and is a superset? | 13:44 |
Clark[m] | Well the redirected content is what the build outputs. I guess the build could write a separate log file the jobs don't know to collect. In the past the stdout and stderr have been enough to debug those though iirc | 13:44 |
Clark[m] | fungi: should we trigger replication for the repos ianw had to clean up old replication tasks to be sure they are caught up? | 13:46 |
fungi | Clark[m]: probably, i didn't see ianw mention having triggered a full replication | 13:46 |
fungi | i'll do that in a sec | 13:47 |
fungi | `gerrit show-queue` indicates it's still caught up at least, so no new hung tasks | 13:48 |
Clark[m] | fungi: ykarel: our proxies should proxy the purge requests so I expect those curl commands work. But as fungi points out it creates confusion over where the issue lies. To be very very clear we are only giving the jobs what pypi.org has served to us. | 13:49 |
Clark[m] | Ya those stale ones could be fallout from when we did the server migrations if those weren't as graceful as we expected | 13:49 |
Clark[m] | I don't know if the timing lines up for that or not | 13:50 |
fungi | roughly 1.8k tasks queued now | 13:50 |
fungi | up to 7.8k now | 13:51 |
fungi | 10k... | 13:51 |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 13:52 |
Clark[m] | I think the total for a full replication is around 17k | 13:52 |
Clark[m] | It should go quickly for repos that are up to date | 13:53 |
fungi | topped out a little over 18k and now falling | 13:53 |
fungi | we're down around 16k tasks now | 14:29 |
fungi | i guess we're looking at 4-5 hours for it to finish at this pace | 14:30 |
clarkb | I wonder if the bw between sjc1 and ymq is lower than it was between dfw and sjc1 | 14:34 |
clarkb | fungi: fwiw we can replicate specific repos which goes much quicker. In this case probably a good idea to do everything anyway | 14:35 |
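For the record, the replication-plugin commands being discussed look roughly like this (admin SSH access assumed):

```bash
# Queue a full re-replication of every project to every target (the ~17-18k tasks below)
ssh -p 29418 review.opendev.org replication start --all

# Or re-replicate only the repos known to be stale, which finishes much faster
ssh -p 29418 review.opendev.org replication start openstack/cinder
```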
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 14:38 |
clarkb | fungi: thinking out loud here about lists.o.o reboots. I think step 0 is confirm that auto apt updates haven't replaced our existing boot tools with newer versions including the decompressed kernel. Reboot on what we've got and confirm that works consistently. Then we can replace the decompressed kernel with the compressed kernel and try rebooting again. If that works we are good and | 14:40 |
clarkb | don't really need to do anything more. If that doesn't work we recover with the decompressed kernel and put package pins in place and plan server replacement | 14:40 |
fungi | yep | 14:42 |
fungi | that should suffice | 14:42 |
clarkb | jinyuanliu_ isn't here anymore but chances are an existing account has the same email address as the one they just tried to login with and gerrit caught that as a conflict and prevented login from proceeding | 14:46 |
clarkb | they should either use a different email address or login to the existing account | 14:46 |
fungi | skimming last modified times in /boot on lists.o.o it looks like nothing has been touched since our edits | 14:48 |
clarkb | cool. I still need to load ssh keys for the day. But I guess we can look at rebooting after meetings | 14:52 |
fungi | sgtm | 14:55 |
clarkb | fungi: I just double checked and /boot/grub/grub.cfg shows the 5.4.0-84 kernel as the first entry, that kernel file in /boot is the decompressed version (much larger than the others), and /boot/grub/menu.lst still lists the chain load option first | 14:56 |
clarkb | I agree those files all seem to be as we had them on the previous boot | 14:56 |
*** ykarel is now known as ykarel|away | 15:02 | |
clarkb | fungi: should we status notice the delay in git replication? | 15:03 |
clarkb | something like #status notice Gerrit is currently undergoing a full replication sync to gitea and you may notice gitea contents are stale for a few hours. | 15:04 |
*** ysandeep is now known as ysandeep|dinner | 15:11 | |
fungi | we can, though it's probably not going to be that noticeable. we're already down to ~14k remaining | 15:15 |
clarkb | fungi: https://www.mail-archive.com/grub-devel@gnu.org/msg30984.html I'm no longer hopeful the chainloaded bootloader is any better | 15:17 |
clarkb | https://lists.gnu.org/archive/html/grub-devel/2020-04/msg00198.html too | 15:18 |
clarkb | fungi: in that case maybe we should focus on some combo of pinning the kernel package, doing the apt post install hook thing from that forum suggestion, replacing the server | 15:21 |
fungi | yeah, that seems like the best we can probably arrange | 15:22 |
fungi | it's still unclear to me from those posts why chainloading the bootloader still doesn't provide lz4 decompression, yet somehow grub in pvhvm guests can decompress them just fine | 15:23 |
clarkb | fungi: I suspect the chainloader may not hand off and execute grub2; instead it knows how to read grub2 configs and then negotiates running the kernel directly, like pv wants | 15:25 |
fungi | mmm, i see. yeah there's mention of the kernel file being handed off to the hypervisor, so i suppose it's being handed off compressed and it's up to the hypervisor whether it supports that compression | 15:26 |
fungi | in which case i wonder why the chainloading was even needed | 15:27 |
clarkb | fungi: the only thing I can think of is maybe I got the menu.lst wrong? But menu.lst isn't going to be reliably updated anymore is it? | 15:28 |
fungi | right, i guess rackspace's external bootloader will want to parse menu.lst (though maybe it would parse grub.cfg if menu.lst wasn't there) | 15:29 |
clarkb | based on what I've read I think we should do a reboot to double check it works reliably as is. Then either pin the kernel package or put in place the post install hack. Then start planning a replacement server? | 15:32 |
clarkb | https://lists.xenproject.org/archives/html/xen-users/2020-05/msg00023.html "I think we lost most of them to KVM already anyway :(" Not going to lie it seems like maybe problems like this are a big reason for that | 15:32 |
fungi | maybe this is a good time to try to combine the two ml servers into one, and/or a more urgent push for mm3 migration | 15:35 |
fungi | worth discussing in today's meeting | 15:36 |
clarkb | ya it's on the agenda to discuss this stuff | 15:36 |
mordred | "rackspace's external bootloader" ... do I even want to know? | 15:37 |
clarkb | mordred: basically xen is weird and it can't reliably boot ubuntu focal anymore in pv? mode | 15:38 |
fungi | mordred: it's how xen pv works, the bootloader is external to the server image | 15:38 |
clarkb | mordred: unlike kvm xen isn't running the bootloader if we understand things correctly. Instead it's finding the kernel and running it directly | 15:38 |
clarkb | kvm is like your laptop and runs the actual bootloader avoiding all of these issues | 15:38 |
clarkb | fungi: we can also yolo go for it and see if it can do the compressed file though I'm fairly certain it can't at this point | 15:39 |
fungi | i think xen pvhvm works like kvm though | 15:39 |
clarkb | fungi: ya our pvhvm instances seem fine | 15:39 |
fungi | clarkb: no, i agree with you after digging into those ml threads | 15:39 |
mordred | wow. also - we have non-pvhvm instances? I thought we just used that for everything ... but in any case, just wow | 15:41 |
clarkb | mordred: lists.openstack.org is our oldest instance as we have upgraded it in place to preserve ip reputation for smtp | 15:42 |
clarkb | I think it predates rax offering pvhvm | 15:42 |
mordred | ahhhhhh yes | 15:42 |
clarkb | fungi: the extract tool and the postinstall script don't seem too terrible if we just want to put those in place for now as a CYA maneuver while we plan to replace the server? | 15:42 |
clarkb | New suggestion, reboot on current state to ensure it is stable. Then install extract-vmlinux and the kernel postinstall.d script. Then start planning to replace the server | 15:43 |
clarkb | fungi: ^ does that seem reasonable to you and if so do you think we need to try and ansible the extract-vmlinux and postinst.d stuff or just do it by hand? | 15:44 |
fungi | wiki.o.o may be of a similar vintage to lists (though it's actually a rebuild from an image of the original so not technically an old instance, just an instance on an old flavor) | 15:44 |
fungi | mordred: lists.o.o has been continuously in-place upgraded since it was created with ubuntu precise (12.04 lts) some time in 2013 | 15:45 |
fungi | clarkb: yes, that plan sounds solid | 15:45 |
clarkb | cool in that case should we do a reboot nowish? | 15:46 |
fungi | we can probably by-hand manage the kernel decompression and either put a hold on the current kernel package or just plan to fix it with a recovery boot if unattended-upgrades puts us on a newer kernel and then rackspace reboots the instance for some reason | 15:47 |
clarkb | ya that is an option for us as well | 15:47 |
clarkb | the script from linux is in lists.o.o:~clarkb/kernel-stuff if we want to extract it by hand | 15:48 |
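A rough sketch of what the extract-vmlinux plus postinst.d idea could look like (illustrative only, not what was actually deployed): a hook in /etc/kernel/postinst.d/ that decompresses each newly installed kernel in place, since the Xen PV boot path apparently can't read lz4-compressed images. Kernel hooks receive the version and image path as arguments.

```sh
#!/bin/sh
# Hypothetical /etc/kernel/postinst.d/zz-decompress-kernel (sketch only).
# Assumes extract-vmlinux (scripts/extract-vmlinux from the kernel source tree)
# has been copied to /usr/local/sbin.
version="$1"
image="${2:-/boot/vmlinuz-$version}"
[ -n "$version" ] || exit 0

tmp="$(mktemp)"
if /usr/local/sbin/extract-vmlinux "$image" > "$tmp" && [ -s "$tmp" ]; then
    cp "$image" "$image.orig"   # keep the compressed original around, just in case
    mv "$tmp" "$image"
else
    rm -f "$tmp"                # extraction failed; leave the packaged kernel alone
fi
```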
fungi | and yes, i'm able to do the reboot now if you're ready | 15:48 |
fungi | and at the ready to do the recovery boot if needed | 15:49 |
clarkb | fungi: ya I'm ready. My meeting is over | 15:49 |
clarkb | fungi: will you push the button? or should I? | 15:50 |
fungi | i will | 15:51 |
fungi | sorry, had to switch rooms, pushing the button now | 15:52 |
clarkb | ok | 15:53 |
fungi | looks like it finished shutting down | 15:54 |
clarkb | and now it isn't coming back :/ | 15:54 |
clarkb | I guess it is a good thing to know this isn't reliable. But would be really nice to know why | 15:55 |
clarkb | oh wait it pings now | 15:55 |
fungi | it's booting | 15:55 |
clarkb | huh is it just REALLY slow? | 15:55 |
fungi | fsck ran for a bit | 15:55 |
clarkb | ah | 15:55 |
fungi | also the chainloading may delay it if we didn't take out the keypress timeouts | 15:56 |
fungi | 15:56:05 up 1 min, 2 users, load average: 2.89, 0.81, 0.28 | 15:56 |
fungi | Linux lists 5.4.0-84-generic #94-Ubuntu SMP Thu Aug 26 20:27:37 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux | 15:56 |
clarkb | and it is on the 5.4. kernel | 15:56 |
clarkb | ok rebooting is reliable but slow. | 15:56 |
fungi | ftr, it's not our slowest rebooting server | 15:56 |
fungi | but yeah it did take a bit | 15:57 |
fungi | so looking at mm3 options, we could either go the distro package route for something stable (focal has mm 3.2.2) or the semi-official docker container route (which has options for latest mm 3.3.4 or rolling snapshots of upstream revision control) | 15:58 |
clarkb | I just updated the lists upgrade etherpad with reboot and potential future options. I guess we discuss what we want to do next in our meeting | 15:59 |
fungi | the up side to distro packages, in theory, is we get backported security fixes but fairly stable featureset until the next distro release upgrade. with docker containers we get a version independent of the running distro but end up consuming new versions a lot more frequently if we want security fixes | 15:59 |
clarkb | Do we want to remove lists.o.o from the emergency file before our meeting? The two things to consider there are that we should probably remove the cached ansible facts file for that instance on bridge, and that it will run autoremove of packages, which may remove our 4.x kernels, but we seem to be able to boot 5.4 now | 16:00 |
clarkb | fungi: one thing I have really enjoyed with things like gitea where we consume upstream with containers is that we can update frequently and keep the deltas as small as possible | 16:00 |
clarkb | letting things sit for years results in scariness | 16:00 |
fungi | yeah, i think it's fine. do we want a manual ansible run or just let the deploy jobs do their thing on their own time? | 16:01 |
clarkb | fungi: the deploy jobs already ran so we should probably remove it then manually run the playbook | 16:01 |
clarkb | well, deploy jobs ran for the fixup change I mean | 16:01 |
fungi | do we need to remove cached facts before taking it out of the emergency list? | 16:01 |
clarkb | fungi: yes I think we need to do that otherwise some of our option selections might select xenial options | 16:01 |
clarkb | /var/cache/ansible/facts/lists.openstack.org appears to be the file? | 16:02 |
clarkb | I'm going to find breakfast but can help with that stuff after but if you want to go ahead feel free | 16:03 |
fungi | ahh, we can just delete that file i guess. i was looking through documentation to find out how to clear cached facts | 16:04 |
fungi | hah, first hit on a web search was https://docs.openstack.org/openstack-ansible/12.2.6/install-guide/ops-troubleshooting-ansiblecachedfacts.html | 16:04 |
fungi | that at least seems to confirm your suggested method | 16:05 |
*** ysandeep|dinner is now known as ysandeep | 16:08 | |
fungi | clarkb: after a bit more looking around, i managed to convince myself deleting that file should do what we want, so removed it | 16:11 |
opendevreview | Martin Kopec proposed openstack/project-config master: Move ansible-role-refstack-client from x/ to openinfra/ https://review.opendev.org/c/openstack/project-config/+/765787 | 16:12 |
fungi | and removed the lists.openstack.org entry from the emergency disable list | 16:12 |
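For completeness, the cleanup amounts to roughly this on bridge (the facts path is the one mentioned above; the emergency disable list's exact location isn't shown in this log, so it's referenced only generically):

```bash
# Drop the stale (pre-upgrade) cached facts so the next run regathers them
sudo rm /var/cache/ansible/facts/lists.openstack.org

# Then edit the emergency disable list and delete the lists.openstack.org entry
```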
Clark[m] | Cool I guess next up is running the playbook? I should be back at the keyboard in 20-30 minutes | 16:13 |
fungi | don't rush. i'll get things queued up in a root screen session on bridge.o.o | 16:15 |
fungi | i've queued up a command to run the base playbook first, presumably that's what we want to start with | 16:16 |
Clark[m] | ++ | 16:18 |
Clark[m] | That is what does autoremove fwiw | 16:19 |
*** jpena is now known as jpena|off | 16:26 | |
opendevreview | Jiri Podivin proposed zuul/zuul-jobs master: DNM https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 16:27 |
fungi | hard to say for sure since we just reset it with a reboot, but memory consumption on lists.o.o looks slightly lower after the upgrade than it was prior | 16:28 |
fungi | just based on comparing the memory usage graph in cacti to the same day last week | 16:29 |
clarkb | fungi: I've joined the screen | 16:35 |
clarkb | I guess I'm ready if you are | 16:36 |
fungi | ready for me to flip the big red switch then? | 16:36 |
*** artom_ is now known as artom | 16:36 | |
clarkb | ya I think we run base, then check what it did (if anything) to /boot contents | 16:36 |
clarkb | then we run the service lists playbook and restart mailman and apache services to be sure they are happy with slightly updated contents? | 16:37 |
fungi | starting now | 16:38 |
clarkb | the exim conf task seems to have nooped as expected | 16:40 |
fungi | yeah, so far so good | 16:40 |
clarkb | the older kernels have remained and we didn't end up with a recompressed new kernel | 16:41 |
clarkb | I think that is a happy base run | 16:41 |
fungi | so the service-lists playbook next? is that the correct one? | 16:41 |
clarkb | checking | 16:41 |
clarkb | infra-prod-service-lists job runs service-lists.yaml | 16:42 |
clarkb | yes I believe that command is correct | 16:42 |
fungi | i mean, that's a file anyway, i was able to tab-complete it | 16:42 |
fungi | okay, running now | 16:42 |
*** ysandeep is now known as ysandeep|out | 16:42 | |
clarkb | it is updating the things I expect it to update and skipping creation of lists because they already exist, if I read the log properly | 16:44 |
fungi | yep, that's my take on the output | 16:45 |
clarkb | the updates for airships init script, apache config and the global mm_cfg.py all look as expected to me. Will just have to restart services and ensure they function when ansible is done I think | 16:45 |
fungi | yep, i'll do that via ssh to the lists server | 16:46 |
fungi | i'm already logged in there | 16:46 |
fungi | yay, completed without errors | 16:46 |
clarkb | yup ansible looked good I think | 16:46 |
fungi | so should i test the mailman service restarts then? | 16:46 |
clarkb | fungi: maybe restart apache first and we check apache for each of the list domains then we can restart mailman-opendev and send an email to lists.opendev.org? | 16:47 |
fungi | sure | 16:47 |
clarkb | then if that is happy restart the other 4 | 16:47 |
fungi | apache is fully restarted now | 16:48 |
clarkb | I can browse all 5 list domains via the web and the lists seem to line up with the domain | 16:48 |
clarkb | if that looks good to you I think we are ready to restart mailman-opendev and send a quick test email to it (just a response to your existing thread on service-discuss?) | 16:49 |
fungi | i browsed from the root page of each of the 5 sites all the way to specific archived messages, for some list on them, so lgtm | 16:50 |
fungi | restarting mailman-opendev now | 16:51 |
fungi | restarted. the list owned processes dropped from 45 to 36 and then went back up to 45 again | 16:51 |
fungi | sending a reply now | 16:52 |
clarkb | ya I see 9 new processes | 16:52 |
fungi | sent | 16:53 |
fungi | it's in the archive | 16:54 |
fungi | and i received a copy | 16:54 |
clarkb | I see it in my inbox too | 16:54 |
fungi | i'll proceed with restarting the other four sets of mailman services now | 16:54 |
clarkb | ++ | 16:54 |
fungi | all of them have been cleanly restarted | 16:56 |
clarkb | I see an appropriate number of new processes | 16:56 |
fungi | i confirmed the 9 processes for each went away and returned | 16:56 |
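A sketch of the restart-and-verify loop described above (the per-site service name mailman-opendev is taken from this log; the process count is just a rough sanity check against the ~9 runner processes per site mentioned):

```bash
# Restart one site's mailman services, then confirm its runner processes come back
sudo service mailman-opendev restart
ps -u list --no-headers | wc -l   # should settle back at the pre-restart count
```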
clarkb | gerrit tasks queue down to 9.8k | 16:58 |
fungi | so nearly halfway done | 16:59 |
fungi | i need to take a quick break, and then i'll try to spend a few minutes digging deeper on the current state of containerized mm3 deployments | 17:01 |
clarkb | I think we're done with lists for now with next steps to be discussed in the meeting. I'm going to context switch to catching up on some zuul related changes I've been reviewing | 17:01 |
clarkb | fungi: thanks for the help! | 17:01 |
fungi | almost done with lists? we still have some test servers and autoholds we need to clean up, right? | 17:02 |
fungi | clarkb: mind if i delete autohold 0000000332 "Clarkb checking ansible against focal lists.o.o" | 17:03 |
clarkb | fungi: go ahead | 17:03 |
fungi | done | 17:04 |
clarkb | fungi: I think there is an older hold for lists too that can be cleaned up if it is still there | 17:04 |
fungi | looks like the manually booted test servers are already cleaned up | 17:05 |
clarkb | the one I had originally when checking the upgrade from xenial to focal | 17:05 |
fungi | clarkb: nope, just gerrit revert and gitea upgrade | 17:05 |
fungi | no other lists autohosts | 17:05 |
fungi | autoholds | 17:06 |
clarkb | I guess I cleaned that one up after the lists.kc.io upgrade since that was a better check | 17:06 |
fungi | looks that way | 17:06 |
fungi | on the rackspace side of things, i'll clean up the old image we made for lists.o.o in may, but keep the one from last weekend | 17:07 |
clarkb | ++ | 17:08 |
fungi | and done | 17:09 |
clarkb | fungi: should we let westernkansasnews know they have likely been owned? Their WP install is what these phishing list owner emails link back to | 17:15 |
clarkb | (otherwise why return to them?) | 17:16 |
*** artom_ is now known as artom | 17:22 | |
fungi | i receive countless phishing messages to openstack-discuss-owner, not sure it's worth my time to follow up on every one. i just delete them | 17:22 |
clarkb | ya I'm just noticing that a legit organization (they have a wikipedia entry so they are real!) seems to maybe be compromised. | 17:24 |
fungi | skimming the mm3 docs, and the documentation for the container images, this doesn't look too onerous. three containers (for the core listserv+rest api, the web archiver/search index, and the administrative webui). the web components are uwsgi listening on a high-numbered port and expect an external webserver, so it fits well with our usual deployment model. similarly the core container expects to | 17:26 |
fungi | communicate with an mta on the system | 17:26 |
fungi | they have postfix and exim4 config examples for the mta, only nginx example for the webserver but apache shouldn't be hard to adapt to it | 17:26 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 17:27 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 17:32 |
opendevreview | Martin Kopec proposed openstack/project-config master: Move ansible-role-refstack-client from x/ to openinfra/ https://review.opendev.org/c/openstack/project-config/+/765787 | 18:07 |
clarkb | down to about 7k tasks now | 18:09 |
opendevreview | Clark Boylan proposed opendev/infra-specs master: Spec to deploy Prometheus as a Cacti replacement https://review.opendev.org/c/opendev/infra-specs/+/804122 | 18:32 |
clarkb | infra-root ^ it seems everyone was largely happy with the last patchset of that change. I updated it based on some of the feedback I got. I expect this is mergeable in the near future if you can take a look to re-review it quickly | 18:32 |
corvus | clarkb: i replied on a change there; i'm a little unclear if i should leave a -1 or +1. maybe you can read the comment and let me know :) | 18:45 |
clarkb | corvus: hrm ya good point. I'm thinking maybe we evaluate both as we bootstrap and then commit to one before turning off cacti? | 18:48 |
clarkb | corvus: I can update the text in there to be more explicit along those lines if that makes sense to you | 18:48 |
corvus | ok. tbh, if we want to consider node_exporter, i'd do it now and not spend any wasted effort on snmp_exporter. like, it's a bit "why do it once when we can do it twice?" | 18:49 |
fungi | we're down to ~4k tasks remaining in the gerrit queue now | 18:49 |
clarkb | corvus: if we want to not bother with extra evaluation then I would probably say stick with snmp | 18:51 |
fungi | is there a good summary on the benefits of node_exporter over snmp_exporter? or is it just a way to get rid of yet one more uncontainerized daemon on our servers? | 18:51 |
clarkb | fungi: node_exporter is a bit more precanned for its gathering and graphing | 18:51 |
clarkb | fungi: with snmp we'll have to define the mibs we want to grab and then define graphs for them | 18:51 |
clarkb | the downside to node_exporter is you have to run an additional service, and without docker doing that for node_exporter is likely to be difficult, and we run some services without docker | 18:52 |
fungi | i'll read up on it since i have no idea what precanned means in this sense. net-snmpd seems nicely precanned already | 18:52 |
corvus | yeah, i think that's a fair summary of the trade off | 18:52 |
clarkb | fungi: the service already knows the sorts of stuff you want to report because it is geared towards server performance metrics | 18:52 |
clarkb | snmp is far more generic and you have to configure the snmp exporter to grab what you want | 18:52 |
clarkb | I'll revert the bit about node_exporter since I think the safest course for us is snmp exporter | 18:53 |
corvus | i think my concern is that if the plan is "get node_exporter running once to try it out" we've done 98% of what's needed for "get node_exporter running everywhere" and we should just do that instead of snmp. the advantage of snmp is we don't have to do any node_exporter work. :) | 18:53 |
fungi | ahh, so the summary is that node_exporter is more opinionated and decides what you're likely to want rather than letting you choose? | 18:53 |
corvus | so if we do both, we'll defeat the advantages. :) | 18:53 |
clarkb | fungi: I think it lets you choose too but yes comes with good defaults | 18:53 |
opendevreview | Clark Boylan proposed opendev/infra-specs master: Spec to deploy Prometheus as a Cacti replacement https://review.opendev.org/c/opendev/infra-specs/+/804122 | 18:53 |
clarkb | But for our environment I strongly suspect we need the flexibility of something like snmp | 18:54 |
clarkb | since running docker everywhere is not always going to be good for us | 18:54 |
corvus | node_exporter has a lot of collectors and i think it can be extended | 18:54 |
fungi | oh, i think i get it. it's not that net-snmpd is the problem, it's that snmp_exporter requires fiddling? | 18:54 |
clarkb | fungi: yes | 18:54 |
clarkb | fungi: you have to tell the prometheus snmp collector what to gather and how often to gather it and where to store it. Then you have to tell grafana to pull that data out and render it | 18:55 |
fungi | because prometheus is push-based as opposed to pull-based, so the configuration is on the pushing agent side not the polling server side | 18:55 |
corvus | i thought docker-everywhere was a goal? | 18:55 |
clarkb | corvus: I don't think it ever was? iirc we explicitly didn't use docker for things like the dns servers | 18:55 |
fungi | i can't imagine every single daemon on every server will be in a container though, there's bound to be a line somewhere | 18:55 |
clarkb | fungi: the constraint would be more along the lines of does every server run a dockerd that can have node_exporter running in it | 18:56 |
fungi | for example we've so far considered it simple enough to rely on distro packages of apache rather than requiring an apache container on every webserver | 18:56 |
corvus | running node_exporter everywhere doesn't seem like it should be too much of a challenge? like, there isn't a server where we can't run docker? | 18:56 |
clarkb | corvus: currently there isn't a place where we can't run docker but we have chosen not to on some | 18:56 |
clarkb | the afs infrastructure is without docker and the dns servers are the ones I can think of immediately (mailman too but we're talking about changing that above) | 18:57 |
fungi | i don't really see a problem with deciding to deploy something in docker on every server now that we have orchestration for that though | 18:57 |
corvus | fungi: for node_exporter: prometheus main server will poll node_exporter on leaf-node server. pretty similar to an snmp architecture. just the agent is node_exporter instead of netsnmpd | 18:57 |
clarkb | I guess in today's meeting I'll ask people to specifically think about whether or not they think node exporter with docker everywhere is worthwhile | 18:57 |
clarkb | and to leave comments on the spec with what they decide | 18:58 |
corvus | fungi: for snmp_exporter: prometheus main server will poll snmp_exporter on main server which will poll snmpd on leaf-node servers. | 18:58 |
fungi | is the concern with the additional overhead of dockerd on some of the servers? | 18:58 |
clarkb | fungi: I think that concern is minimal. For me it was more that we had explicitly decided not to run docker as it wasn't necessary for certain services like dns and afs | 18:58 |
fungi | oh, so prometheus does poll, like mrtg/cacti? | 18:59 |
clarkb | and if we want to gather metrics from those services we would need to add docker everywhere | 18:59 |
clarkb | in either case we have to modify firewall rules so that doesn't count against either option | 18:59 |
fungi | and it just doesn't speak snmp, so needs a translating layer to turn its calls into snmp queries? | 18:59 |
tristanC | you can also run the node exporter without docker, the project even publishes static binaries for many architectures as part of their releases | 18:59 |
corvus | fungi: yes it polls | 18:59 |
clarkb | tristanC: I don't think we would do it that way. | 18:59 |
clarkb | tristanC: that would be worse than running it in docker imo | 18:59 |
clarkb | (because now you need some additional system to keep it up to date etc) | 19:00 |
mordred | and we're already pretty well set up to run docker in places | 19:00 |
mordred | a positive about the polling is that it makes some of the firewall rules simpler - just allow from the prom server on all the endpoints (static data) rather than needing to open the firewall on the cacti server for each endpoint. I mean - we have that complexity implemented so it's not an issue, but it's a place where a change to a service also impacts the config management of the cacti so there's an overlap | 19:03 |
corvus | mordred: cacti polls snmp on the leaf-node servers, so it's the same firewall situation | 19:04 |
corvus | (we don't use snmp traps, which would require inbound connections to cacti) | 19:04 |
mordred | corvus: oh - what am I thinking about where we need to open the firewall rules centrally for each leaf node? | 19:04 |
mordred | am I just making up stuff in my head? | 19:05 |
corvus | i think that may be it? :) we do have to add cacti graphs for every host... | 19:05 |
fungi | so then is the difference that snmp_exporter would run on the prometheus server and connect to net-snmpd on every system while node_exporter would run on the individual systems in place of net-snmpd and prometheus would query it remotely via its custom protocol? | 19:05 |
corvus | but i reckon we'd probably end up adding grafana dashboards for every host too | 19:06 |
corvus | fungi: yes (note the custom protocol is http carrying plain text in "key:value" form) | 19:07 |
fungi | got it, so node_exporter basically runs a private httpd on some specific port we would allow access to similar to how we control access to the snmpd service today | 19:08 |
corvus | yeah; i assume we'd pick some value we could use consistently across all hosts. maybe TCP 1061 ;) | 19:09 |
corvus | (that wouldn't interfere with service-level prometheus servers also running on the host) | 19:10 |
fungi | oh, so it doesn't have a well-known port assigned | 19:10 |
fungi | we just pick something | 19:10 |
fungi | wfm | 19:10 |
fungi | this new internet where ip has been replaced by http is still somehow foreign to me ;) | 19:11 |
corvus | fungi: prometheus servers can relay data from other prometheus servers, so there's a prometheus network topology on top of the http layer too :) | 19:13 |
fungi | wow | 19:14 |
clarkb | fungi: also applications like gerrit, gitea, and zuul can expose a metrics port from within themselves that prometheus polls | 19:14 |
fungi | i see, and then prometheus knows the app-specific metrics endpoints to collect from as well as the system endpoint. that makes for rather a lot of sockets in some cases i'm betting | 19:15 |
mnaser | corvus, fungi: i'm jumping into this but there is a 'registry' of exporter ports | 19:49 |
mnaser | it ain't IANA but.. https://github.com/prometheus/prometheus/wiki/Default-port-allocations | 19:50 |
fungi | mnaser: oh cool | 19:50 |
mnaser | node_exporter is 9100, etc, that is also indirectly a nice 'list' of exporters you can look at :p | 19:50 |
corvus | good, so as long as services aren't abusing 9100 that should be fine, and i agree we should try to follow that if we go that route | 19:51 |
* fungi concurs wholeheartedly | 19:51 | |
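To illustrate the earlier point about the "protocol" just being HTTP carrying plain text, a node_exporter scrape is an ordinary GET that any client can issue (9100 being the conventional port from that registry; localhost is used here purely as an example):

```bash
# What prometheus sees when it polls a node_exporter endpoint: plain-text samples
curl -s http://localhost:9100/metrics | grep '^node_load1'
# e.g. output:  node_load1 0.42
```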
mnaser | also, fungi, you mentioned some ipv6 reachability issues over the past .. while, we finally (i believe) got to the bottom of this, could you let me know if you're still seeing those issues | 19:51 |
clarkb | mnaser: the last email we got about failed backups due to ipv6 issues was on the 10th | 19:52 |
fungi | mnaser: i can test again in just a moment | 19:52 |
clarkb | mnaser: seems it has been happier the last few days | 19:52 |
fungi | but yeah, if we're no longer getting notified of backup failures, that's a good sign it's fixed | 19:52 |
fungi | ianw: ^ good news! | 19:52 |
mnaser | wee, awesome. backlogs work! :P | 19:53 |
fungi | mnaser: do you feel like the fix probably took effect on friday or saturday? if so, that coincides with our resumption of backups | 19:53 |
mnaser | fungi: the fix would have been applied by 12pm pt on saturday | 19:54 |
fungi | mnaser: sounds like a correlation to me then | 19:54 |
ianw | fungi / mnaser: debug1: connect to address 2604:e100:1:0:f816:3eff:fe83:a5e5 port 22: Network is unreachable | 19:55 |
mnaser | aw darn | 19:56 |
ianw | so it looks like it's falling back to ipv4, which is making it work | 19:56 |
fungi | aha | 19:56 |
ianw | which is better(?) than before when it just hung? :) | 19:56 |
mnaser | ah dang, i think i have an idea here what's happening | 19:56 |
mnaser | i bet its because the bgp announcement is not being picked up | 19:56 |
mnaser | since the asn that announces that route is the same as all of our regions | 19:56 |
mnaser | and we don't usually install an ipv6 static route | 19:56 |
mnaser | so it makes sense that it cant find a route, gr | 19:57 |
clarkb | fungi: for the kernel pin we have to craft a special file and stick it in a dir under /etc right? The thing that I always get lost on is what goes in the special file (priorities etc) | 19:58 |
clarkb | fungi: maybe if you do the pin I can take a look at the file afterward and ask questions about the semantics of the thing? | 19:58 |
fungi | clarkb: nah, we can just echo the package name followed by the word "hold" and pipe that through dpkg --set-selections | 19:59 |
clarkb | fungi: TIL | 19:59 |
fungi | first step is getting the package name right | 20:00 |
clarkb | fungi: in that case I guess just share the command you end up running and I'll feel filled in | 20:00 |
ianw | mnaser: yeah, that was what i was going to say ... not! :) i mean of course just let us/me know if i can do anything to help! | 20:00 |
fungi | clarkb: `dpkg -S /boot/vmlinuz-5.4.0-84-generic` reports the linux-image-5.4.0-84-generic package is what installs that file, so it's what we want to hold | 20:01 |
clarkb | fungi: that makes sense | 20:01 |
fungi | echo linux-image-5.4.0-84-generic hold | sudo dpkg --set-selections | 20:01 |
fungi | i ran that just now | 20:02 |
clarkb | and then if we dpkg -l it should show a hold attribute on the package listing? | 20:02 |
fungi | if you `dpkg -l linux-image-5.4.0-84-generic` you'll see the desired state is reported as "hold" instead of "install" | 20:02 |
fungi | now apt-get and other tools will refuse to replace that package version until the hold flag is manually removed | 20:03 |
mnaser | ianw: thanks, that error helps :) | 20:03 |
clarkb | fungi: will it still update the kernels and install them in grub? | 20:03 |
fungi | if we wanted to revert it, we'd just say install instead of hold on the command i ran earlier | 20:03 |
clarkb | fungi: if so that hold may not be sufficient because we're relying on the first grub entry aiui | 20:03 |
clarkb | maybe we also hold linux-generic and linux-image-generic ? | 20:04 |
fungi | clarkb: good point, we can also hold the virtual package | 20:04 |
fungi | i've held both of those too now | 20:05 |
fungi | not normally necessary but as you note, with kernel packages they make a new one for each version | 20:05 |
clarkb | there is also linux-image-virtual but it just installs linux-image-generic so I think we're good | 20:05 |
fungi | yes, the problem is with new package names, new versions of the existing package of the same name will be blocked | 20:06 |
fungi | kernel packages are special in that regard | 20:06 |
fungi | most packages don't have version-specific names | 20:06 |
clarkb | right | 20:06 |
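Pulling the hold bookkeeping above together as one sketch (apt-mark is just a front-end over the same dpkg selection state, mentioned here only as an aside):

```bash
# Hold the running kernel plus the meta packages that would pull in newer ones
for pkg in linux-image-5.4.0-84-generic linux-generic linux-image-generic; do
    echo "$pkg hold" | sudo dpkg --set-selections
done

# Verify: the desired-state column in dpkg -l shows "h" (hold) instead of "i"
dpkg -l linux-image-5.4.0-84-generic linux-generic linux-image-generic

# Revert later by setting the selection back to install (or: sudo apt-mark unhold <pkg>)
echo "linux-image-5.4.0-84-generic install" | sudo dpkg --set-selections
```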
clarkb | ok time to eat lunch. | 20:08 |
opendevreview | Marco Vaschetto proposed openstack/diskimage-builder master: Allowing use local image https://review.opendev.org/c/openstack/diskimage-builder/+/809009 | 20:31 |
clarkb | fungi: looking at https://opendev.org/opendev/system-config/src/branch/master/playbooks/gitea-rename-tasks.yaml I think what we want for updating the metadata is a task at the very end of that list of tasks that looks up the metadata from somewhere and applies it | 20:32 |
clarkb | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gitea-git-repos/library/gitea_create_repos.py#L141-L178 is how we do that in normal projcet creation | 20:32 |
clarkb | and that is called if we set the always update flag. I suppose another option here is to just run the manage projects playbook with always update set? | 20:33 |
clarkb | as a distinct step of the rename process rather than trying to collapse it all. The problem with that is it is likely to be much slower than collapsing it down to only the projects that are renamed | 20:33 |
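As a purely illustrative sketch (the playbook name and extra-var below are guesses, not verified against system-config), the "just run manage-projects with always update set" option would be an invocation on bridge along these lines:

```bash
# Hypothetical: playbook name and -e variable are assumptions; only the shape is the point
sudo ansible-playbook playbooks/manage-projects.yaml -e always_update=true
```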
clarkb | https://www.brendangregg.com/blog/2014-05-09/xen-feature-detection.html might be interesting to others (ran into it when digging around to see if there is any documentation on converting from pv to pvhvm) | 20:46 |
clarkb | I've used that to confirm we are PV mode on lists and HVM elsewhere | 20:47 |
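A couple of quick checks along the lines of that post (sketch; output wording varies a bit by distro and kernel version):

```bash
# lscpu reports the virtualization type directly on Xen guests
lscpu | grep -E 'Hypervisor vendor|Virtualization type'
#   "Virtualization type: para" -> PV (as on lists.o.o), "full" -> HVM/PVHVM

# PV guests have no SMBIOS/DMI tables, so this generally only works on HVM guests
sudo dmidecode -s system-product-name
```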
clarkb | Xen does support converting from pv to pvhvm. Not clear if rax/nova do | 20:50 |
clarkb | Looks like the vm_mode metadata in openstack (which you can use to select pv or hvm) is an image attribute. This might be what makes it tricky | 20:52 |
clarkb | looking at the image properties I don't see the vm_mode set though | 20:54 |
opendevreview | Luciano Lo Giudice proposed openstack/project-config master: Add the cinder-netapp charm to Openstack charms https://review.opendev.org/c/openstack/project-config/+/809012 | 20:55 |
clarkb | I'm not finding anything definitive on the internet saying there is a preexisting process for openstack or rackspace. I guess we might have to file a ticket and ask? | 20:57 |
clarkb | Looking at https://github.com/prometheus/node_exporter/blob/master/CHANGELOG.md to get a sense of what bionic node exporter vs focal node exporter looks like if using packages, and they appear to report metrics under different names | 21:16 |
clarkb | 0.18.1 includes a number of systemd related performance updates too | 21:17 |
clarkb | I suppose it is possible to use the distro packages but then when we upgrade servers metric names will change on us | 21:17 |
clarkb | if we deploy latest release with docker we'd avoid that as we could have a consistent version and it seems they have avoided changing names like that on the 1.0 release series | 21:18 |
clarkb | I think if the distro packages were at least 1.0 it wouldn't be as big of a deal | 21:21 |
clarkb | there are ~5 tasks in the gerrit task queue from the great big replication that haven't completed | 21:23 |
clarkb | I wonder if we should kill them and then attempt replication for those specific repos | 21:24 |
clarkb | I guess give them another half hour and if they don't complete then stop them and reenqueue specifically for those repos | 21:25 |
clarkb | makes me wonder if we're potentially having the ipv6 issues in the other direction now | 21:26 |
clarkb | since the replication plugin should fail then retry with a new task entry iirc | 21:27 |
clarkb | I'm actually going to go ahead and stop the deb-liberasurecode task and restart it since that repo is not used anymore aiui | 21:30 |
clarkb | ya it completed when I did that. I'll go through the others and just reenqueue them | 21:32 |
*** dviroel is now known as dviroel|out | 21:36 | |
clarkb | that is done. Seems like that worked to clear things out and then also ensure replication ended up running | 21:36 |
fungi | interesting, so that suggests we're getting some hung tasks, maybe on the order of 0.03% of the time | 21:43 |
fungi | rare, but not so rare that we wouldn't notice at our volume | 21:44 |
clarkb | ya. Most were to gitea03 (3 total) and one each to gitea01 and gitea06 | 21:44 |
clarkb | I suspect it is an issue with creating a network connection similar to the gitea01 backups | 21:47 |
clarkb | since we should retry and create new tasks if we fail, but that doesn't seem to have happened | 21:48 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 22:35 |
opendevreview | Ananya proposed opendev/elastic-recheck rdo: Fix ER bot to report back to gerrit with bug/error report https://review.opendev.org/c/opendev/elastic-recheck/+/805638 | 22:36 |
ianw | clarkb: after killing the ones last night, i'm pretty sure i saw the replication restart | 22:47 |
clarkb | ianw: huh I don't think I saw that here but maybe it did | 22:48 |
clarkb | it could've retried quicker than I could relist | 22:48 |
ianw | that was my thinking behind not setting off a full replication. but anyway it would be good to know why these seem stuck permanently | 22:49 |
ianw | you'd think there'd be a timeout | 22:49 |
clarkb | I wonder if the network level connection is just sitting there, not returning a failure but also never completing the SYN / SYN-ACK / ACK handshake successfully | 22:50 |
clarkb | similar to how ansible will sometimes ssh forever | 22:50 |
opendevreview | Merged openstack/diskimage-builder master: Fix debian-minimal security repos https://review.opendev.org/c/openstack/diskimage-builder/+/806188 | 23:16 |