| Nick | Message | Time |
|---|---|---|
| @priteau:matrix.org | Hello. Do we know if some cloud providers filter ICMP ping to 8.8.8.8? In this case, vexxhost-ca-ymq-1-main. | 06:43 |
| -@gerrit:opendev.org- Jan Gutter proposed: [zuul/zuul-jobs] 989502: Extend copy-build-ssh for custom algorithms https://review.opendev.org/c/zuul/zuul-jobs/+/989502 | 10:19 | |
| @fungicide:matrix.org | Pierre Riteau: i can check, but i doubt it | 13:17 |
| @fungicide:matrix.org | Pierre Riteau: i get 0% packet loss when pinging 8.8.8.8 from our mirror there... https://paste.opendev.org/show/bUW5BpuiaG0SgI7TqP4r/ | 13:20 |
| @fungicide:matrix.org | do you have an example build failure or something i can look at? | 13:20 |
| -@gerrit:opendev.org- Zuul merged on behalf of Mauricio Harley: [opendev/system-config] 988775: Add #openstack-pqc to channel logging and statusbot https://review.opendev.org/c/opendev/system-config/+/988775 | 13:35 | |
| -@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: | 13:38 | |
| - [openstack/project-config] 959893: Remove mirror.openeuler utilization graph https://review.opendev.org/c/openstack/project-config/+/959893 | ||
| - [openstack/project-config] 980500: Run validate-pyproject during package checks https://review.opendev.org/c/openstack/project-config/+/980500 | ||
| -@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 989465: Enable nodejs heap snapshots with SIGUSR2 against etherpad https://review.opendev.org/c/opendev/system-config/+/989465 | 13:40 | |
| -@gerrit:opendev.org- Zuul merged on behalf of Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org: [opendev/system-config] 958302: Correct Kerberos realm var documentation https://review.opendev.org/c/opendev/system-config/+/958302 | 13:40 | |
| @clarkb:matrix.org | bit of a slow start today, but I want to followup with mnasiadka on backup servers if now is still good then I'll try to capture some heap snapshots from etherpad and see if that provides useful info | 15:11 |
| @mnasiadka:matrix.org | Clark: now is fine actually :) | 15:11 |
| @clarkb:matrix.org | mnasiadka: for backup servers I think what we can likely do is boot a new server in vexxhost and have it fully deploy as a backup server. Then swap over the selection of which backup servers to backup to to use that one instead of the old one. Then after confirming successful backups from all of the backup sources we can clean up the old backup server and attach its volume to the new server for historical record keeping | 15:12 |
| @clarkb:matrix.org | mnasiadka: the piece of that that I am not confident in is how the sources select the target backup server. So that is what I'll look at next | 15:12 |
| @clarkb:matrix.org | mnasiadka: then the other thing to keep in mind is the cloud providers and regions of the backup servers are selected to give us some geographical and provider diversity so we should put replacement servers in the same locations as the current servers. | 15:13 |
| @fungicide:matrix.org | revisiting ze07, it now has two hung `docker-compose pull` processes, the second one from about an hour ago. i'll see if i can work out what's going on there | 15:13 |
| @fungicide:matrix.org | oh, actually most (all?) of them have one or more hung pulls in their process lists now | 15:14 |
| @fungicide:matrix.org | ze04 is still clean | 15:14 |
| @clarkb:matrix.org | mnasiadka: https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/borg-backup/tasks/main.yaml#L64-L82 we actually auto enroll the backup servers on the sources. I think that is fine. We'll end up with the new server being attempted for use along side the old servers. If there are failures we can followup from there while the existing backup servers will presumably continue to work as is | 15:14 |
| @clarkb:matrix.org | fungi: we do run hourly jobs against zuul which may explain the more recent process | 15:15 |
| @fungicide:matrix.org | ze10 and ze12 are also okay still but the other executors have one or two pulls in their process lists from hours ago | 15:15 |
| @mnasiadka:matrix.org | Ok, so I need to spawn backup03.ca-ymq-1.vexxhost.opendev.org, add DNS entries and add it to the inventory as a starter | 15:16 |
| -@gerrit:opendev.org- Zuul merged on behalf of Jan Gutter: [zuul/zuul-jobs] 989502: Extend copy-build-ssh for custom algorithms https://review.opendev.org/c/zuul/zuul-jobs/+/989502 | 15:16 | |
| @clarkb:matrix.org | mnasiadka: yup and then theoretically when we add it to the inventory ansible will auto enroll all the sources and then we can work on ensuring backups look good before removing the old server | 15:17 |
| @mnasiadka:matrix.org | Sure, and then attach the old server's volume in the new server for history | 15:17 |
| @clarkb:matrix.org | and the backup servers rely on cinder volumes to store backup data so we should ensure we're creating a new server that looks like the old one with a new volume and all that | 15:17 |
| @fungicide:matrix.org | interestingly, none of the other zuul servers (executors, launchers, mergers) have hung image pulls, only most of the executors | 15:17 |
| @clarkb:matrix.org | exactly | 15:17 |
| @mnasiadka:matrix.org | Ok, I guess that's a plan then - so let me finish some other thing and get started on this. | 15:18 |
| @clarkb:matrix.org | mnasiadka: great. And if there are any questions or something I can help with just let me know | 15:18 |
| @clarkb:matrix.org | looks like etherpad was properly recreated with the new config after that change ran. So now I'm just going to wait for the next auto restart due to OOM to capture my first heap snapshot. Then I'll wait a minute or two and capture a second | 15:34 |
| @clarkb:matrix.org | (I'm waiting since memory utilization is high enough at the moment that I worry a heap snapshot will push it over the edge) | 15:34 |
| @fungicide:matrix.org | makes sense | 15:35 |
| @fungicide:matrix.org | https://status.redhat.com/ doesn't mention any known quay.io incidents since may 9, nor are there any warning banners on the https://quay.io/ site either, so i'm not sure what to make of the hung image pulls on zuul executors | 15:38 |
| @clarkb:matrix.org | fungi: strace may give you clues? | 15:39 |
| @fungicide:matrix.org | that's where i'm headed next, yes | 15:39 |
| @fungicide:matrix.org | though the child process is waiting on a futex | 15:40 |
| @fungicide:matrix.org | not very helpful | 15:40 |
| @fungicide:matrix.org | checked both hung children on ze07 (the one from yesterday and the one from today), and their strace output is identical | 15:41 |
| @clarkb:matrix.org | if you use -f it will follow all the children | 15:41 |
| @fungicide:matrix.org | `futex(0x4383800, FUTEX_WAIT_PRIVATE, 0, NULL` | 15:41 |
| @clarkb:matrix.org | which can sometimes show interactions between processes that are harder to see when tracing specific processes individually | 15:41 |
| @fungicide:matrix.org | if i `strace -fp ...` the parent shell process ansible spawned, all i see is: `wait4(-1,` which is even less useful, sadly | 15:42 |
| @fungicide:matrix.org | the great-grandparent process, as it were | 15:43 |
| @fungicide:matrix.org | the grandparent does seem to be in a polling loop checking a futex, checking the clock, sleeping, checking the futex again... | 15:43 |
| @fungicide:matrix.org | and the parent process shows the same as well, though using an `epoll_pwait()` | 15:44 |
| @fungicide:matrix.org | basically seems like it's probably hung in the middle of downloading or waiting for a response | 15:44 |
| @fungicide:matrix.org | but there are no obvious strings that give away what's going on for sure | 15:45 |
| @clarkb:matrix.org | maybe lsof/open file descriptors will give clues to that if it still has a tcp connection open | 15:49 |
| @clarkb:matrix.org | I now have two heap snapshots. The first was quick to write. The second took a while. I'm glad I double checked the file sizes on disk to notice the second was still increasing in size so I think I now have the complete snapshot | 15:49 |
| @clarkb:matrix.org | now to try and understand if they tell us anything interesting | 15:50 |
| @fungicide:matrix.org | interestingly, lsof isn't indicating there are open network sockets for the child or its immediate parent (plenty of unix sockets though) | 15:55 |
| @fungicide:matrix.org | i wonder if we can just ignore this. on ze07 manually running the same `docker-compose pull` completes normally for me, so i don't think the hung processes are blocking subsequent runs, and will disappear on their own when the executors get rebooted over the weekend | 16:00 |
| @clarkb:matrix.org | ok I think the leak is happening in string objects and likely tied to the database | 16:01 |
| @clarkb:matrix.org | snapshot one has ~747k strings directly consuming 93MB of memory. snapshot two has ~2.1million strings directly consuming ~179MB. Meanwhile the database goes from basically no retained memory (memory that would be freed if this object is freed) to ~256MB | 16:02 |
| @clarkb:matrix.org | I think whatever data management with string data returned by the database is not releasing those objects and is leaking them | 16:03 |
| @clarkb:matrix.org | fungi: makes sense to me if current runs are happy | 16:03 |
| @clarkb:matrix.org | ok this heap snapshot tooling is magic. I can start at the db connection drill down and see there is a query object with an array called _rows with tons of entries. Then I can even inspect those rows to see that they largely seem to map to sessionstorage values. Then I can go look at the etherpad commit history and find: https://github.com/ether/etherpad/commit/605ad280682d522fafa736d4f52f06c0c1973844 and https://github.com/ether/etherpad/commit/da9f5ac4eed750af2448c42d7973b9b0b243fd54 | 16:18 |
| @clarkb:matrix.org | my hunch here is that the session cleanup routine is not memory efficient (and maybe just straight up leaks things?) | 16:18 |
| @clarkb:matrix.org | but those commits also indicate we can delete session cleanups. In theory this is safe since we weren't cleaning up sessions prior to the upgrade? I want to double check that, but I think the next step is to disable that config option and see if things are happier afterwards | 16:19 |
| @clarkb:matrix.org | and if so file a bug | 16:19 |
| @clarkb:matrix.org | (maybe file a bug either way with what we learn after disabling the cleanups) | 16:19 |
| @clarkb:matrix.org | confirmed that sessionCleanup: true comes from the 2.7.3 upgrade | 16:20 |
| @clarkb:matrix.org | so let me get a change to toggle that to false and we can see if that helps if we feel comfortable with it | 16:21 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 989564: Disable etherpad session cleanups https://review.opendev.org/c/opendev/system-config/+/989564 | 16:27 | |
| @clarkb:matrix.org | infra-root ^ while I'm not 100% certain that will fix things I think it is a likely enough thread to pull on to make that change | 16:27 |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zone-opendev.org] 989566: Add backup03.vexxhost https://review.opendev.org/c/opendev/zone-opendev.org/+/989566 | 16:34 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 989567: Add backup03 https://review.opendev.org/c/opendev/system-config/+/989567 | 16:39 | |
| @clarkb:matrix.org | mnasiadka: looks like the new backup server may not have its volume attached and mounted yet? I think its fine to land the dns update before then. but we should hold off on the inventory update until that is done so the server configuration can work properly | 16:40 |
| @mnasiadka:matrix.org | Ah, second volume - right | 16:42 |
| @mnasiadka:matrix.org | wanted to create the changes to not lose the output :) | 16:42 |
| @clarkb:matrix.org | yup thats fine. I just want to avoid landing changes too quickly :) | 16:42 |
| @mnasiadka:matrix.org | Did w -1 ;-) | 16:43 |
| @mnasiadka:matrix.org | Ok, attached - any special script to be run like in mirror server case? | 16:47 |
| @clarkb:matrix.org | mnasiadka: I think its the normal script https://opendev.org/opendev/system-config/src/branch/master/launch/src/opendev_launch/mount_volume.sh which launch node can actually run for you if given the correct set of options when executed. But it should be fine to run now | 16:48 |
| @clarkb:matrix.org | you should be able to compare the fslabel and mount path args to the existing backup server | 16:48 |
| @mnasiadka:matrix.org | the vg on backup02 is main-202010 compared to main in the script | 16:51 |
| @clarkb:matrix.org | mnasiadka: oh we must custom set that to help keep track of the different volumes to make it easier to attach them to the next/new backup server? | 16:51 |
| @clarkb:matrix.org | mnasiadka: I think we should keep that convention. But I don't think we have a special script for it so maybe need to edit the script I linked | 16:52 |
| @mnasiadka:matrix.org | looking at history on backup02 - that was done manually | 16:52 |
| @mnasiadka:matrix.org | let me retrace these steps | 16:53 |
| @clarkb:matrix.org | ok | 16:53 |
| @clarkb:matrix.org | fungi: ^ you may be more help here than I am as my lvm knowledge is minimal | 16:53 |
| @mnasiadka:matrix.org | any guidance on the main- suffix? | 16:53 |
| @mnasiadka:matrix.org | no worries, I'm an ex AIX admin, LVM is my life ;) | 16:54 |
| @clarkb:matrix.org | perfect :) I would call it main-202605 ? | 16:54 |
| @clarkb:matrix.org | that way its clear this is the new one from ~now | 16:54 |
| @fungicide:matrix.org | yeah, i'm happy to provide linux lvm2 management guidance, but we're not doing anything fancy there. just sticking with the naming conventions you see for other servers should be fine for the sake of consistency | 16:56 |
| @fungicide:matrix.org | and correct, the reason we dated the names of the volumes for backups was for when we rotated them out, so we'd have some idea of when we stopped using one and started using another, with the intent of deleting the old ones after a while | 16:57 |
| @mordred:waterwanders.com | smit ftw | 16:59 |
| @fungicide:matrix.org | fond memories of swearing at smit | 16:59 |
| @fungicide:matrix.org | granted, aix was far less painful than hp/ux or *shudder* sco | 17:00 |
| @mnasiadka:matrix.org | Ok then, FS mounted, we're good | 17:01 |
| @mnasiadka:matrix.org | `/dev/mapper/main--202605-backups--202605 1007G 28K 1007G 1% /opt/backups-202605` | 17:01 |
| @fungicide:matrix.org | lgtm | 17:01 |
| @clarkb:matrix.org | I wonder what makes the symlink. Is that automated somehow or manual? | 17:02 |
| @clarkb:matrix.org | hrm actually /opt/backups is defined as a directory in our ansible | 17:03 |
| @fungicide:matrix.org | mnasiadka: also you've likely already seen it due to working on mirror replacements, but a lot of our lvm-oriented conventions and habits are reflected in https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#cinder-volume-management | 17:03 |
| @clarkb:matrix.org | so how does that work if /opt/backups is meant to be a symlink to the data volume mount? | 17:04 |
| @mnasiadka:matrix.org | fungi: thanks, saw that | 17:04 |
| @clarkb:matrix.org | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/borg-backup-server/tasks/main.yaml#L1-L4 this is what I'm referring to | 17:04 |
| @mnasiadka:matrix.org | Clark: I made a symlink and we'll just find it out if Ansible overwrites that? But I assume it shouldn't, because that runs on all other backup servers? | 17:04 |
| @clarkb:matrix.org | do we understand how that works with the existing servers? | 17:04 |
| @clarkb:matrix.org | mnasiadka: ya that. My concern is that it will create a new directory overriding the symlink and then we'll get backups going to the root disk | 17:05 |
| @clarkb:matrix.org | but ya maybe just finding out is the easiest thing. Maybe state: directory follows the symlink to discover the target is a directory and its happy? | 17:05 |
| @clarkb:matrix.org | pvs/lvs/vgs look ok to my untrained eye. Do we want to reboot to ensure the post boot mounting is happy? | 17:08 |
| @clarkb:matrix.org | otherwise I guess we can proceed if we're happy to find out about the symlink behavior. And I agree it seems like it just works on the existing servers? | 17:08 |
| @clarkb:matrix.org | mnasiadka: I also have a question on https://review.opendev.org/c/opendev/system-config/+/989567 | 17:11 |
| @fungicide:matrix.org | i do encourage a reboot just to make sure things (mainly fstab) are set up correctly to mount everything at the expected place during bootup | 17:11 |
| @harbott.osism.tech:regio.chat | not sure if the stuck pulls may be related, but I'm still seeing DNS issues, and there seems to be an issue with IPv6 to ns04, even when testing from bridge, although only sometimes:<br>`frickler@bridge01:~$ ping ns04.opendev.org`<br>`PING ns04.opendev.org(ns04.opendev.org (2604:e100:1:0:f816:3eff:feff:bd1c)) 56 data bytes`<br>`From 2001:550:2:16::2d:2 (2001:550:2:16::2d:2) icmp_seq=1 Destination unreachable: Address unreachable` | 17:11 |
| @fungicide:matrix.org | Jens Harbott: the executors are in rackspace classic dfw, while ns04 is in vexxhost ca-ymq-1 | 17:14 |
| @fungicide:matrix.org | for ns04 being generally unreachable over ipv6 at times, i wonder if it's related to prior incidents we've seen of guests expiring their global v6 addresses for a while before relearning them | 17:19 |
| @fungicide:matrix.org | at the moment `ip -6 ad sh ens3` and `ip -6 ro sh default` look correct to me on ns04 | 17:20 |
| @clarkb:matrix.org | The service-review.yaml playbook still has the net plan static config for review02 in it iirc. We weren't sure if review03 would need that as well | 17:22 |
| @mnasiadka:matrix.org | Clark: I’ll respond tomorrow after understanding what it really does, it’s past 7pm and I have some daughter duties ;) | 17:23 |
| @clarkb:matrix.org | mnasiadka: sounds good. Enjoy your evening | 17:24 |
| -@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 989564: Disable etherpad session cleanups https://review.opendev.org/c/opendev/system-config/+/989564 | 17:36 | |
| @clarkb:matrix.org | the settings file has updated. I'll probably just let it restart on its own to pick that up (worked well enough with the stats enablement) | 17:40 |
| @clarkb:matrix.org | it has restarted. I'm monitoring it to see how memory growth happens (or not) | 17:52 |
| @clarkb:matrix.org | It has been running for 6 minutes and memory usage has actually fallen | 17:57 |
| @clarkb:matrix.org | I want to see it go longer than 15 minutes but I'm feeling pretty confident this was the issue. I'll file a bug if we make it past 15 minutes with consistently good memory usage | 17:58 |
| @fungicide:matrix.org | seems to support your theory | 18:01 |
| @clarkb:matrix.org | we are past 15 minutes and memory looks great. I'm filing a bug now | 18:08 |
| @fungicide:matrix.org | great detective work! | 18:10 |
| @fungicide:matrix.org | (about defective work!) | 18:10 |
| @clarkb:matrix.org | https://github.com/ether/etherpad/issues/7830 | 18:33 |
| @clarkb:matrix.org | I'm going to pop out for lunch now, but etherpad has been running for almost 90 minutes without a restart. | 19:13 |
| @fungicide:matrix.org | yay! | 19:15 |
| @clarkb:matrix.org | fungi: following up on the ansible inventory rewrite change. https://zuul.opendev.org/t/openstack/build/7d3b61522bfd49528e2d8d241bf40ebf/log/job-output.txt this gate job failed and ran in rax ord. This check job succeeded and ran in rax ord as well: https://zuul.opendev.org/t/openstack/build/e420b859144c44a9955a6a1a050ba340/log/job-output.txt | 20:02 |
| @clarkb:matrix.org | I don't think that job behaves different whether it is in check or gate. So I think we can rule out environmental issues or pipeline configs (or at least they seem far less likely now) | 20:02 |
| @clarkb:matrix.org | successful task: https://zuul.opendev.org/t/openstack/build/e420b859144c44a9955a6a1a050ba340/log/job-output.txt#29840-29853 failed task: https://zuul.opendev.org/t/openstack/build/7d3b61522bfd49528e2d8d241bf40ebf/log/job-output.txt#30945-30957 | 20:05 |
| @clarkb:matrix.org | I think these may talk to the acme test instance. Maybe we're hitting a problem with that external communication? | 20:05 |
| @clarkb:matrix.org | fungi: yup that is exactly it https://zuul.opendev.org/t/openstack/build/7d3b61522bfd49528e2d8d241bf40ebf/log/letsencrypt02.opendev.org/acme.sh/acme.sh.log#47-48 | 20:06 |
| @clarkb:matrix.org | https://letsencrypt.org/docs/staging-environment/ here are the limits for that staging env | 20:07 |
| @fungicide:matrix.org | aha | 20:08 |
| @fungicide:matrix.org | i was looking in the completely wrong place | 20:08 |
| @fungicide:matrix.org | so if i recheck now, it'll probably pass | 20:08 |
| @clarkb:matrix.org | https://community.letsencrypt.org/t/new-service-busy-responses-beginning-during-high-load/184174 this indicates that the issue is on their end not ours | 20:08 |
| @clarkb:matrix.org | so ya I think a recheck is appropriate | 20:08 |
| @fungicide:matrix.org | will do, thanks! | 20:08 |
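
Note on the "service busy" failure discussed at 20:05-20:08: the fix chosen in the channel was a manual recheck. For illustration only, the same idea can be sketched as an automatic retry. This is a minimal sketch that assumes the busy response surfaces as an HTTP 503, optionally with a Retry-After hint; the log does not confirm either detail, and `call_with_busy_retry` and the `Response` tuple are hypothetical names, not part of acme.sh or Zuul.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical response shape: (HTTP status code, Retry-After seconds or None).
Response = Tuple[int, Optional[float]]


def call_with_busy_retry(
    request: Callable[[], Response],
    max_attempts: int = 5,
    default_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call `request` up to `max_attempts` times while it reports 503."""
    delay = default_delay
    response = request()
    for _ in range(max_attempts - 1):
        status, retry_after = response
        if status != 503:
            break
        # Prefer the server's hint; otherwise fall back to the backoff delay.
        sleep(retry_after if retry_after is not None else delay)
        delay *= 2  # double the fallback delay after every busy response
        response = request()
    return response
```

The `sleep` parameter is injectable so the behaviour can be exercised without real waiting. If every attempt is busy, the last 503 is returned to the caller rather than raised.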
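
Note on the `/opt/backups` symlink question raised at 17:02-17:08: the open question was whether a "is this a directory" check would be satisfied by a symlink that points at the mounted data volume. The sketch below only shows how plain stat-based checks behave in Python on a throwaway directory tree. It says nothing definitive about what the Ansible task linked in the log does, and the `backups` and `backups-202605` names are stand-ins created under a temporary directory, not the real paths.

```python
import os
import tempfile


def describe_path(path: str) -> dict:
    """Report how link-aware and link-following checks see `path`."""
    return {
        "is_symlink": os.path.islink(path),            # inspects the link itself
        "is_dir_following_links": os.path.isdir(path),  # follows the link
        "resolved": os.path.realpath(path),             # final target path
    }


def demo() -> dict:
    """Build a symlink to a directory in a temp tree and describe the link."""
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "backups-202605")  # stand-in for the mount
        link = os.path.join(root, "backups")           # stand-in for the symlink
        os.mkdir(target)
        os.symlink(target, link)
        info = describe_path(link)
        info["resolves_to_target"] = info["resolved"] == os.path.realpath(target)
        return info


if __name__ == "__main__":
    print(demo())
```

A check that follows links reports the symlink as a directory, which matches the guess in the channel that such a check could be satisfied by the link's target. Running `describe_path("/opt/backups")` on an existing backup server would be one way to confirm how the path is actually laid out there.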