clarkb | bah my ssh key agent just expired my keys | 00:02 |
fungi | probably a sign you woke up too early | 00:02 |
clarkb | the one thing I wanted to check was sudo docker image list on ze01 as that was yellow during the prune (but I didn't think there were any images to prune) | 00:02 |
fungi | looking | 00:02 |
corvus | re-enqueue is done | 00:03 |
clarkb | fungi: I would expect latest and 4.2.0 to be present | 00:03 |
clarkb | I'll just reload my keys for a short time | 00:03 |
fungi | clarkb: yep, just those two | 00:03 |
fungi | i don't see any others, and both are present | 00:03 |
clarkb | cool I got on and checked the zuul.conf and docker-compose.yaml as well on ze01, those look good | 00:05 |
clarkb | zm01 looks good too. And docker compose on scheduler is all using latest | 00:06 |
clarkb | I think we should be in a steady state now | 00:06 |
clarkb | we can leave zuul01 up for now. services are stopped on it and it is in the emergency file. I want to unenroll it from esm before I delete it too | 00:06 |
clarkb | infra-root ^ maybe double check you don't want to preserve anything you've got on that server? | 00:06 |
clarkb | and I can aim to clean it up tomorrow or monday? | 00:06 |
corvus | clarkb: all clear from me; we've done everything on zuul02 i'd want to do on zuul01 | 00:07 |
clarkb | cool | 00:07 |
corvus | unless we want to copy the logs to zuul02 first? | 00:08 |
corvus | but it's pretty rare we need to go back far in scheduler debug logs | 00:08 |
corvus | so i'm okay rolling the dice on that | 00:08 |
clarkb | probably not a bad idea. I'm running out of steam today (as evidenced by my keys expiring) and can do that tomorrow if we think it is a good idea | 00:08 |
clarkb | corvus: re queue dumping I wonder if the background queue dumps are working | 00:09 |
clarkb | that just fetches the json file right? so that should still work on the new server but let me see | 00:09 |
corvus | i don't see zuul02 in cacti | 00:10 |
corvus | did the script to add hosts bitrot? | 00:10 |
clarkb | corvus: I think it's a more subtle issue: the list on the left doesn't update for some reason | 00:11 |
clarkb | if you go to http://cacti.openstack.org/cacti/graph_view.php?action=list and search zuul02 stuff shows up | 00:11 |
corvus | ah | 00:11 |
clarkb | from that you can get things like http://cacti.openstack.org/cacti/graph.php?local_graph_id=70200&rra_id=all | 00:11 |
clarkb | confirmed the json status backups don't seem to be working at the moment | 00:12 |
clarkb | I don't think those are critical though and I can followup with that tomorrow | 00:13 |
clarkb | I've disconnected from the bridge screen but left it running for now in case we want to refer to anything tomorrow. Anything else you can think of that I should be checking on before i call it a day? | 00:13 |
corvus | i see the cacti prob | 00:14 |
clarkb | the deploy pipeline base job failed due to the apt-get autoremove issue I mentioned earlier; it seems to have hit a couple of mirrors | 00:16 |
corvus | the next time the create graphs job runs it should work (the name of the tree in cacti didn't match the name in the script) | 00:16 |
clarkb | ianw: fungi: ^ it seems to be shim-signed and other related problems with dpkg | 00:16 |
clarkb | corvus: thanks! | 00:17 |
clarkb | but ya I think we are sufficiently steady state now that I can go help with dinner and stuff. I'll followup on the json backups and log copies tomorrow | 00:17 |
clarkb | thanks for all the help today | 00:17 |
corvus | ++ | 00:17 |
clarkb | #status Log swapped out zuul01.openstack.org for zuul02.opendev.org. The entire zuul + nodepool + zk cluster is now running on focal | 00:18 |
openstackstatus | clarkb: finished logging | 00:18 |
clarkb | oh good it accepted the Log instead of log | 00:18 |
clarkb | one last thought before my day ends: the json status backups may still work on zuul01 if we need them in the near future | 00:21 |
ianw | i can log into cacti and remove all the old .openstack.org hosts, that's the only way i've found to do it | 01:23 |
ianw | there's a few old mirrors too | 01:23 |
ianw | afs01.dfw.openstack.org 99 48 62 Down 70d 4h 40m | 01:48 |
ianw | i wonder why | 01:48 |
openstackgerrit | yang yawei proposed openstack/project-config master: setup.cfg: Replace dashes with underscores https://review.opendev.org/c/openstack/project-config/+/791343 | 01:51 |
ianw | snmpwalk -v1 -c public afs01.dfw.openstack.org from cacti doesn't return anything | 01:55 |
ianw | udp6 0 0 ::1:161 :::* 957/snmpd | 01:56 |
ianw | seems like it should be listening | 01:57 |
ianw | ok, the server is *getting* snmp requests 01:58:07.218502 IP cacti02.openstack.org.38162 > afs01.dfw.openstack.org.snmp: GetNextRequest(25) | 01:58 |
openstackgerrit | Steve Baker proposed openstack/diskimage-builder master: WIP Add a growvols utility for growing LVM volumes https://review.opendev.org/c/openstack/diskimage-builder/+/791083 | 01:59 |
ianw | excellent, i restarted snmpd and it "just works" ... sigh | 02:01 |
ianw | #status log cleared out a range of old hosts on cacti.openstack.org | 04:27 |
openstackstatus | ianw: finished logging | 04:27 |
ianw | i've restarted a bunch of snmpd's that seemed to have stopped working, although i have no root cause | 04:28 |
openstackgerrit | Merged opendev/glean master: Remove Fedora 32 job https://review.opendev.org/c/opendev/glean/+/790368 | 08:51 |
openstackgerrit | Lucas Alvares Gomes proposed zuul/zuul-jobs master: [dnm] testing devstack 791085 https://review.opendev.org/c/zuul/zuul-jobs/+/791117 | 09:37 |
frickler | clarkb: seems logrotate config is broken for zuul02, is that a known issue? e.g. I see config for /var/log/zuul/zuul-debug.log while our log is named /var/log/zuul/debug.log | 11:25 |
lucasagomes | hi, someone knows how can I test https://review.opendev.org/c/zuul/zuul-jobs/+/791117/ ? Apparently the Depends-On is not honored in the "zuul-jobs-test-ensure-devstack" test run | 12:42 |
openstackgerrit | Lucas Alvares Gomes proposed zuul/zuul-jobs master: [dnm] testing devstack 791085 https://review.opendev.org/c/zuul/zuul-jobs/+/791117 | 13:39 |
dmsimard | btw: https://news.ycombinator.com/item?id=27153338 "I am resigning along with most other Freenode staff" | 14:09 |
tosky | dmsimard: it seems it's still under discussion | 14:11 |
dmsimard | tosky: yeah, it doesn't seem like it's a done deal but concerning in any case | 14:12 |
dmsimard | just sharing for visibility | 14:12 |
fungi | lucasagomes: according to this it did checkout the depends-on change into src/opendev.org/openstack/devstack: https://zuul.opendev.org/t/zuul/build/7263ee5c71c84cc581deb26b4657dfc9/log/zuul-info/inventory.yaml#60-69 | 14:15 |
fungi | it's possible the zuul-jobs-test-ensure-devstack doesn't install devstack the way a normal devstack job would | 14:15 |
gmann | fungi: yeah it needs to be mentioned in ensure_devstack_git_refspec https://review.opendev.org/c/zuul/zuul-jobs/+/791117/4/zuul-tests.d/cloud-roles-jobs.yaml#9 | 14:18 |
fungi | though it looks like it cloned from there into /opt/devstack and then changed to that directory and ran ./stack.sh | 14:18 |
gmann | with new PS it should pickup | 14:18 |
fungi | ahh, okay | 14:18 |
fungi | ahh, i see it got discussed over in #openstack-infra too | 14:20 |
clarkb | frickler: no that isn't a known issue | 14:59 |
clarkb | frickler: the log file was also /var/log/zuul/debug.log on zuul01. I suspect that we added rules for zuul-debug.log in the ansible transition and it was wrong but the old puppet config remained | 15:00 |
clarkb | frickler: I'll take a look at that along with the status json backups today | 15:00 |
fungi | yeah we've generally failed to clean up old cronjobs created by puppet when switching to ansible | 15:01 |
lucasagomes | fungi, sorry for the delay, yeah I need to set those ensure_devstack_git_{refspec,version}. Now it seems to be working... before I was only setting the depends-on | 15:01 |
fungi | so that explanation wouldn't surprise me | 15:01 |
lucasagomes | thanks | 15:01 |
fungi | lucasagomes: no worries, i honestly wasn't sure what the fix was, i just knew that the ensure-devstack tests didn't do quite what the devstack abstract jobs do | 15:02 |
fungi | because they're targeted primarily at use in jobs which just need "a devstack" present to interact with, and not focused on testing any of the components which go into devstack itself | 15:03 |
fungi | namely, testing nodepool, where we need some functional openstack as a fixture to test interactions in the openstack provider driver | 15:03 |
clarkb | infra-root are we generally happy with zuul's operation other than the log rotation and status backups? Should I give the openstack release team the all clear? | 15:05 |
fungi | yeah, things seem fine so far this morning. i caught up on all the irc channels and mailing lists i monitor and see no alarms raised | 15:06 |
clarkb | cool I'll let them know | 15:06 |
openstackgerrit | Lucas Alvares Gomes proposed zuul/zuul-jobs master: [dnm] testing devstack 791085 https://review.opendev.org/c/zuul/zuul-jobs/+/791117 | 15:08 |
clarkb | fwiw I think I see the issues on the config management side for both logrotate and status json backups. But I need to load ssh keys and verify against hosts before I push a change up | 15:09 |
clarkb | also I need tea | 15:09 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Fixup small issues on new zuul scheduler https://review.opendev.org/c/opendev/system-config/+/791508 | 15:23 |
clarkb | infra-root ^ I think that will address the issues we've identified so far | 15:23 |
openstackgerrit | Lucas Alvares Gomes proposed zuul/zuul-jobs master: [dnm] testing devstack 791436 https://review.opendev.org/c/zuul/zuul-jobs/+/791117 | 16:11 |
clarkb | fungi: can I get a review on https://review.opendev.org/c/opendev/system-config/+/791508 as zuul is happy with it now? | 16:15 |
fungi | yeah, can do | 16:15 |
clarkb | I'll double check things after that lands then start looking at copying log files from the old server | 16:16 |
clarkb | Then I guess plan to cleanup the old server monday | 16:18 |
fungi | approved it, but left comments... i don't see fingergw creating any logs on the new server | 16:18 |
clarkb | fungi: ya I saw that too and haven't had a chance to look at it. Same situation on the old server too | 16:18 |
clarkb | I suspect that we don't provide a logging config and it is just going to stdout/stderr? | 16:18 |
fungi | that's what i assumed | 16:19 |
fungi | but the logrotate entries are good for when we decide to change that | 16:19 |
clarkb | yup exactly | 16:20 |
clarkb | if we fix the logging we don't want to miss the rotation because we helpfully cleaned it up :) | 16:20 |
clarkb | if I copy the zuul01 log files over and keep them in logrotate .1.gz .2.gz etc will that confuse logrotate when it runs on zuul02? | 16:22 |
clarkb | I guess I can also manually run logrotate by hand and see what happens | 16:23 |
clarkb | one thing at a time, first one logrotate to be in the correct config | 16:23 |
fungi | it shouldn't confuse them as long as the names are what it expects. logrotate won't know the difference | 16:25 |
clarkb | cool | 16:25 |
fungi | logrotate just looks at filenames, after all | 16:25 |
fungi | well, and file size when determining whether to rotate under certain configurations | 16:26 |
clarkb | well I know it has to run at least once before it starts rotating because it keeps a record of some sort | 16:26 |
fungi | but that generally only matters for the active log | 16:26 |
clarkb | Looking ahead to next week I think it would be good to try and land the mailman ansiblification too before all that context goes away | 16:27 |
fungi | oh, yes absolutely | 16:27 |
fungi | also i have the base nodeset change to ubuntu-focal scheduled for tuesday, planning to approve that an hour before the meeting | 16:27 |
clarkb | the way the changes are stacked is the first change should stop automatic management of the list servers. We can then run it manually against each server (probably with lists.kc.io first as it is simpler) check the results, then land the followup which will add the job to the periodic list | 16:28 |
clarkb | fungi: ++ | 16:28 |
clarkb | There is also a hostvars update that needs to be done with that mailman change | 16:28 |
clarkb | small one not a big deal. Just need to remember to do it | 16:29 |
clarkb | does ansible have a noop mode? that would probably be useful in this scenario | 16:29 |
fungi | ansible-test? | 16:30 |
clarkb | ansible-playbook --check looks like | 16:30 |
fungi | ahh, no ansible-test is for conformance testing of collections | 16:31 |
clarkb | check doesn't provide a way to simulate registered command output though so may not work well for us | 16:31 |
fungi | and yes, ansible-playbook --help indicates --check is what you're looking for | 16:31 |
fungi | well, it does pretty clearly indicate that it only tries to predict what changes might occur when running a playbook | 16:32 |
clarkb | it may still be useful to see that all the file and directory changes noop | 16:32 |
openstackgerrit | Merged opendev/system-config master: Fixup small issues on new zuul scheduler https://review.opendev.org/c/opendev/system-config/+/791508 | 16:54 |
fungi | clarkb: ^ now we just need it to deploy | 16:54 |
fungi | i'm going to self-approve 791176 so i can proceed with some dnm testing of that | 16:55 |
clarkb | fungi: sounds good and ya I'll wait for deploy to get zuul updated then take a look at cleaning stuff up and making sure it is happy now | 16:56 |
clarkb | ok zuul scheduler has status.json backups now and logrotate updated | 17:04 |
clarkb | I'll remove the zuul-debug.log config | 17:04 |
fungi | cool | 17:06 |
clarkb | fungi: `/usr/sbin/logrotate /etc/logrotate.conf` seems to be the command logrotate's systemd timer/service runs; do you think it is worth running that by hand now? or just copy the old logs and let it sort it out on its own? | 17:09 |
fungi | i'm indifferent. if you're impatient or don't want to have to sort it out later, then sure run it manually and make sure it's working as intended | 17:10 |
openstackgerrit | Merged opendev/base-jobs master: Test VERSION_INFO default for mirror-info role https://review.opendev.org/c/opendev/base-jobs/+/791176 | 17:10 |
clarkb | we have plenty of disk there so I don't think its urgent, I'll just do the file copies for now | 17:11 |
clarkb | also I won't bother with the non-debug logs as the debug log should be a superset of the non-debug | 17:12 |
clarkb | infra-root logs for scheduler and web have been moved over | 17:23 |
fungi | awesome, thanks! | 17:23 |
fungi | jrosser: what's an example of a job which was hitting broken version info on bullseye? i'll do some do-not-merge tests of it reparented to base-test now that 791176 has merged and make sure it fixes things there | 17:25 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Sync zuul status json backup list with current tenants https://review.opendev.org/c/opendev/system-config/+/791521 | 17:33 |
clarkb | that is another cleanup / sync up I noticed | 17:33 |
clarkb | I'll cleanup the root screen we used yesterday now | 17:36 |
fungi | sounds good | 17:38 |
clarkb | infra-root for gerrit_ssh_rsa_pubkey_contents should we just update the all.yaml value to be what is in private host and group vars? then we can clean up the private host and group vars? | 17:39 |
clarkb | then everything should be in sync and far less confusing | 17:39 |
clarkb | I wonder if one reason we don't do that is gerrit testing? | 17:39 |
clarkb | we'd end up writing out the wrong ssh host key for the private key and then things won't be happy? | 17:39 |
clarkb | we don't set that var as a test specific var | 17:40 |
jrosser | fungi: the patch which triggered the bullseye version trouble was https://review.opendev.org/c/openstack/openstack-ansible/+/783606 | 17:41 |
clarkb | ya I think that may be the reason it is the way it is | 17:41 |
jrosser | fungi: though i did add a temporary hack to that so i could keep working on the rest of it https://review.opendev.org/c/openstack/openstack-ansible/+/783606/14/scripts/bootstrap-ansible.sh | 17:42 |
jrosser | feel free to adjust that patch to test base-test | 17:43 |
clarkb | Now I'm thinking the right fix for this is to put the public key for testing in the zuul specific group vars then we can put our prod value in all.yaml. I need to look more closely at stuff before I feel confident in that though | 17:44 |
clarkb | hrm I bet it is more than just review that needs that in testing though. I bet that is part of the struggle | 17:48 |
clarkb | however, if the current value is only valid in testing and only valid for test gerrit maybe we can address any of those problems as they pop up | 17:49 |
mordred | clarkb: if you have a sec - https://review.opendev.org/c/openstack/openstacksdk/+/791023 ... there is a feature/r1 branch for openstacksdk but it's not running functional tests. that patch is an attempt from gtema to fix it - which is an obviously wrong patch. but looking at the branch I can't see why they wouldn't be running | 18:35 |
mordred | clarkb: I feel like there is something obvious I'm not seeing | 18:35 |
fungi | branch restrictions placed on the same job in a master branch? | 18:37 |
clarkb | you expect the -ironic job to run? | 18:50 |
clarkb | I think maybe the problem is actually devstack not having a feature/r1 branch | 18:53 |
clarkb | iirc with grenade if you want all the child jobs to stop running on $stablebranch in openstack you just delete the job/branch from grenade? | 18:53 |
clarkb | do you need a pragma that tells it that this is mapped onto master everywhere else? | 18:53 |
fungi | yeah, i forget which of branch-override or override-checkout that is | 18:56 |
fungi | though it should fall back to master | 18:56 |
clarkb | do job definitions fall back to master though? I thought they didn't which is why this works for grenade to simply remove the jobs from the old branches | 18:57 |
fungi | oh, maybe that's only true for checkouts and not for job inheritance | 18:57 |
clarkb | ya I think there is a pragma directive that you can use to avoid this problem | 18:58 |
clarkb | https://zuul-ci.org/docs/zuul/reference/pragma_def.html#attr-pragma.implied-branches | 18:59 |
clarkb | "This may be useful if two projects share jobs but have dissimilar branch names." | 18:59 |
clarkb | mordred: ^ fyi | 18:59 |
fungi | aha, i wasn't familiar with that one | 18:59 |
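[editor's note] A hedged sketch of what that pragma looks like in a `.zuul.yaml`; the branch names follow the feature/r1 discussion here, but which repo's config it ultimately belongs in was still being worked out at this point in the conversation:

```yaml
# Tell Zuul to treat the job definitions in this file as applying to both
# branch names, instead of only the implied branch they were loaded from.
- pragma:
    implied-branches:
      - master
      - feature/r1
```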
openstackgerrit | Merged opendev/system-config master: Sync zuul status json backup list with current tenants https://review.opendev.org/c/opendev/system-config/+/791521 | 19:03 |
clarkb | oh cool I'll remove the kata crontab entries once ^ has had a chance to apply | 19:04 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Double the default number of ansible forks https://review.opendev.org/c/opendev/system-config/+/791528 | 19:15 |
clarkb | after forgetting to use -f 50 on the base playbook I wonder if part of the deploy job throughput slowness is simply using 5 forks by default | 19:15 |
clarkb | that change doubles it to 10 | 19:15 |
fungi | a worthwhile experiment. do we have a good baseline for the current jobs so we can compare? | 19:16 |
clarkb | fungi: we have the logs on bridge we can compare | 19:17 |
clarkb | and also job runtimes in zuul | 19:17 |
fungi | wfm | 19:17 |
clarkb | I have removed the kata containers status json cron job entries on zuul02 | 19:17 |
fungi | clarkb: i wonder if we can extend the current export script to fetch the zuul tenant list and iterate over it? then we don't have to remember to add and remove cronjobs | 19:19 |
clarkb | fungi: ya we probably can | 19:19 |
clarkb | I assume the current dump script does similar but then also does the conversion to reenqueue commands | 19:20 |
clarkb | we'd basically want the everything before reenqueue? | 19:20 |
fungi | https://zuul.opendev.org/api/tenants seems to give us json we can parse | 19:20 |
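[editor's note] A rough sketch of that idea. The tenant names below are sample data standing in for the live `https://zuul.opendev.org/api/tenants` response, and the `print` is a placeholder for whatever the real backup script does per tenant:

```shell
# Iterate over Zuul tenants parsed from the /api/tenants JSON instead of
# hand-maintaining one cron entry per tenant. In production this variable
# would come from: curl -s https://zuul.opendev.org/api/tenants
tenants='[{"name": "openstack"}, {"name": "zuul"}, {"name": "opendev"}]'
echo "$tenants" | python3 -c '
import json, sys
for tenant in json.load(sys.stdin):
    # the real cron job would fetch and archive status.json for each tenant
    print("would back up status.json for tenant:", tenant["name"])
'
```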
clarkb | the dump script also needs updates to spit out docker exec commands so maybe we can sort something out in there | 19:20 |
fungi | yeah, though thinking about it, this is all soon moot | 19:21 |
clarkb | oh ya because zuulv5 | 19:21 |
clarkb | so ya maybe better to just leave this as is and clean it up when we get to v5 | 19:21 |
fungi | so maybe better we just leave it as is. i doubt we'll add or remove tenants before we get to the point that the queues are persisted in zk | 19:21 |
fungi | yep, totes | 19:22 |
mordred | clarkb: hrm. I mean - the issue is that none of the functional jobs are being triggered on patches to that branch: https://review.opendev.org/c/openstack/openstacksdk/+/791527 is running right now | 19:43 |
clarkb | mordred: ya because they all parent to a job in devstack and devstack doesn't define that job on that branch I think | 19:44 |
mordred | and it's running dib-nodepool-functional-openstack-centos-8-stream-src but no other functional | 19:44 |
mordred | OH | 19:44 |
clarkb | mordred: it's the same situation in reverse when we delete a job in old grenade to stop that running everywhere | 19:44 |
mordred | it's the parenting | 19:44 |
clarkb | yes | 19:44 |
clarkb | to devstack-tox-functional I think | 19:44 |
mordred | so we want the pragma in the sdk repo to point at master? | 19:44 |
clarkb | I think so or maybe in devstack-tox-functional to include the r1 branch? I'm not sure which direction would be better | 19:45 |
mordred | doesn't really make much sense for the devstack repo to know anything about the feature/r1 branch in sdk | 19:45 |
clarkb | infra-root most of us have a few things in our zuul01 homedirs. I'd like to delete the server on Monday if possible. Can you check and make sure you don't have anything in there you want to keep? | 19:46 |
clarkb | corvus: ^ you mentioned you have what you want, but not sure if you looked in your homedir? it has a fair bit of stuff | 19:46 |
mordred | nope. pragma in sdk repo doesn't do anything | 19:46 |
mordred | :( | 19:46 |
mordred | is the other option to just delete the stuff in the branch .zuul.yaml and let the master definitions pick up implied branch matchers? | 19:47 |
clarkb | I don't know if it will fallback that way. | 19:48 |
mordred | it doesn't | 19:49 |
mordred | it did not work :) | 19:49 |
mordred | I'm stumped | 19:50 |
mordred | let me try adding the pragma to devstack just to see | 19:50 |
mordred | I don't think that's the right thing to do - but let's test the hypothesis | 19:50 |
fungi | tried turning on debug in the pipeline? | 19:52 |
fungi | in the project pipeline i mean | 19:52 |
mordred | actually - ... | 19:52 |
mordred | nodepool-build-image-siblings is being run | 19:53 |
mordred | and it also doesn't have a feature/r1 in the nodepool repo | 19:53 |
mordred | ok. adding the pragma to the devstack repo worked | 19:53 |
fungi | wacky | 19:53 |
mordred | oh - wait - is it because devstack has branches ? | 19:56 |
mordred | while nodepool doesn't? | 19:57 |
mordred | "In the case of an untrusted-project, if the project has only one branch, no implied branch specifier is applied to Job definitions. If the project has more than one branch, the branch containing the job definition is used as an implied branch specifier." | 19:57 |
fungi | yeah, that would make sense | 19:57 |
fungi | and matches what clarkb was indicating | 19:57 |
clarkb | ya it is unambiguous in the single branch case | 19:57 |
clarkb | in the multi branch case it doesn't know what is correct so does the more conservative thing with an outlet to bypass | 19:58 |
mordred | but there's no mechanism within the sdk repo to steer this in the right direction? Or would adding an explicit branch matcher to the sdk child job help do you think? | 19:59 |
clarkb | I half expected the pragma on the child jobs in sdk to do it, but I guess not | 19:59 |
clarkb | I don't know that an explicit branch matcher would help since it should already implicitly match feature/r1 and setting it to master would do the wrong thing | 20:00 |
clarkb | that's interesting, I've just realized that for whatever reason the swap device that was created by launch node on zuul02 is only 7MB large or so | 20:43 |
clarkb | I wonder if make_swap.sh isn't working properly on focal when memory is quite large? | 20:44 |
clarkb | I noticed because I looked at cacti | 20:44 |
clarkb | I'm not sure what the best approach to fixing that is. Maybe a swapfile on / ? or we can probably schedule a zuul downtime, copy logs off /var/log/zuul, reformat it the way we want, put the logs back, remount and start zuul again? | 20:45 |
clarkb | ianw: ^ review02 has done the same thing | 20:46 |
clarkb | I suspect this is a bug with focal and large memory hosts | 20:47 |
clarkb | ze01 which is also focal but has less memory looks the way I would expect it | 20:48 |
clarkb | oh interesting zk04 is like these other servers though | 20:48 |
clarkb | ugh | 20:48 |
fungi | all the new zk hosts, or just 04? | 20:49 |
clarkb | all of them. Looks like zk04-zk06, zuul02, and review02 exhibit this. zm*, ze, nl, nb seem ok | 20:50 |
fungi | okay, so basically anything we've tried to build on focal recently, i guess | 20:51 |
clarkb | no that is what is confusing. ze, zm and nl are recent too | 20:52 |
clarkb | review02 is older than all of these | 20:52 |
clarkb | review02 uses a swapfile not a swapdevice so I suspect this is related to maths in make_swap.sh | 20:53 |
clarkb | thinking out loud here: since this is a bigger problem than just zuul02 and zuul02 has plenty of memory for the moment I think we should debug the script by asking it to make swapfiles on a test node. Fix the script, then swing around and either add/enlarge swapfiles on zuul02 and zk04-06 and review02 or redo the xvde partitioning on zuul02 and zk04-06 and enlarge the swapfile on | 20:55 |
clarkb | review02 | 20:55 |
clarkb | I think redoing the partitioning on zk04-06 will be much easier than zuul02 since it is just mounted as a tiny swap and /opt there | 20:56 |
clarkb | maybe other infra-root can take a look at that and we can dig into making those changes next week? | 20:56 |
clarkb | fungi: my hunch is that the output of some tool has changed to make things more human readable on newer distros and now whether make_swap.sh works depends on the size of available memory | 20:57 |
fungi | and calculated on several orders magnitude less than it should have | 20:58 |
clarkb | https://opendev.org/opendev/system-config/commit/2e629bfb969c444a345503e5bcb0842f2f467f2d I think that did it | 21:01 |
clarkb | we want MB not GB so the min of 8 should be min of 8192 | 21:02 |
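[editor's note] The arithmetic behind that bug, sketched with shell arithmetic; the variable names are illustrative, not make_swap.sh's actual ones, and 16384 is just an example value for a 16 GB host:

```shell
# make_swap.sh sizes swap as min(available memory, cap), with sizes in MB.
# A cap of 8 (meant to be 8 GB) therefore clamps swap to 8 MB on any host
# with more than 8 MB of memory; the cap needs to be 8192.
MEMORY_MB=16384
BAD_SWAP_MB=$(( MEMORY_MB < 8 ? MEMORY_MB : 8 ))        # buggy cap
GOOD_SWAP_MB=$(( MEMORY_MB < 8192 ? MEMORY_MB : 8192 )) # fixed cap
echo "buggy=$BAD_SWAP_MB fixed=$GOOD_SWAP_MB"
# prints: buggy=8 fixed=8192
```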
clarkb | I think the reason ze's and zm's are ok is that I ran launch out of an older checkout in my homedir | 21:03 |
clarkb | something like that | 21:03 |
clarkb | I'm just trying to double check that parted mkpart wants MB values by default | 21:07 |
clarkb | and I'll push a fix for make_swap.sh after | 21:07 |
clarkb | the manpage is completely useless | 21:07 |
clarkb | https://www.gnu.org/software/parted/manual/parted.html#mkpart implies that megabytes are the default | 21:08 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Fix min swap value in make_swap.sh https://review.opendev.org/c/opendev/system-config/+/791554 | 21:10 |
clarkb | I think that fixes it | 21:10 |
clarkb | well for new boots | 21:10 |
clarkb | anyone know why zk servers get a mostly empty /opt/containerd dir? | 21:11 |
clarkb | `sudo lsof | grep /opt/containerd` doesn't show any results there so I suspect we can simply copy the contents of /opt to another fs, unmount and repartition xvde, remount and copy /opt back again | 21:12 |
clarkb | it's trickier with zuul02 because we write the logs to that partition so we have size constraints (may need to trim logs prior to doing this) as well as active services using the device | 21:12 |
clarkb | considering the zk servers have been up for a while now with no apparent issues and zuul02 has significant memory overhead I think I'm going to pause here, let others take a look and make sure I'm not missing anything obvious then we can dive into fixing them when it isn't beer thirty on a friday :) | 21:14 |
clarkb | But assuming no one finds anything different I guess I'll try starting with one of the zks on monday | 21:15 |
clarkb | (and we should probably do a more thorough audit) | 21:15 |
fungi | yeah, this doesn't seem urgent enough for a friday evening | 21:16 |
fungi | but i agree we should regroup on monday and not lose track of it | 21:17 |
fungi | i'm happy to help swizzle partitions around on servers next week | 21:17 |
clarkb | looking at the logs in our hosts file, the reason the ze, zm, nl servers are good is they were launched before the above change | 21:17 |
clarkb | looking there I've discovered we have two mirror nodes that also exhibit this problem. I think that list is complete | 21:18 |
clarkb | two mirrors, zuul02, zk04-06, and review02 but having a second set of eyes double check would be appreciated | 21:18 |
clarkb | of those I suspect the only one that really poses a problem is zuul02 | 21:18 |
fungi | the mirrors might | 21:18 |
clarkb | the mirrors and review02 should all be swapfiles, we can simply make a new bigger swapfile | 21:19 |
clarkb | fungi: well the mirrors aren't in rax so don't have an xvde so should use swapfiles | 21:19 |
fungi | ahh, yeah if they're not partitions of the ephemeral disk then it's easy | 21:19 |
clarkb | we can just swapoff, rm the swapfile, make a new larger swapfile, swapon | 21:19 |
fungi | agreed, simple as long as memory isn't exhausted at the time | 21:19 |
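The swapfile replacement just outlined (swapoff, remove, recreate larger, swapon) could be prepped as a script like this. The swapfile path and size are assumptions, and the dry-run prefix prints rather than executes each step:

```shell
# Hedged sketch of replacing a swapfile with a larger one.
# /swapfile and 8192 MB are example values.
DRY_RUN=echo
SWAPFILE=/swapfile
NEW_MB=8192

$DRY_RUN swapoff "$SWAPFILE"
$DRY_RUN rm "$SWAPFILE"
# dd rather than fallocate: swapfiles on some filesystems need
# physically allocated blocks to be usable as swap
$DRY_RUN dd if=/dev/zero of="$SWAPFILE" bs=1M count="$NEW_MB"
$DRY_RUN chmod 600 "$SWAPFILE"
$DRY_RUN mkswap "$SWAPFILE"
$DRY_RUN swapon "$SWAPFILE"
```

As fungi notes, this is only simple while memory isn't exhausted, since the old swap has to come offline before the new one exists.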
clarkb | so maybe we do those first next week, then try fixing xvde on a zk since they are redundant; if that goes well enough we do all the zks and then plan for a zuul02 outage | 21:20 |
clarkb | ianw: can you please review https://review.opendev.org/c/opendev/system-config/+/791554 and read the scrollback about make_swap.sh when your weekend ends? | 21:20 |
fungi | for zuul02 we could do it with two scheduler restarts and a temporary cinder volume | 21:21 |
clarkb | fungi: I think we may just have enough space on / to copy the logs over | 21:21 |
fungi | or even just one restart if we pause long enough to move the active logfile over and back. if we force a logrotate before we start that could even be so small it requires very little outage | 21:21 |
clarkb | currently need 16GB for all the zuul logs (this will probably grow a bit as the non debug logs grow since I didn't copy those over) and we have about 35GB free on / | 21:22 |
fungi | ahh, yeah that'll hold through monday at least | 21:22 |
clarkb | fungi: it didn't take super long to copy the logs from one fs to the other on zuul02 after I copied them from 01 to 02 | 21:22 |
clarkb | I think we can probably stop the scheduler and web stuff on 02, copy the logs to a staging dir on /, unmount, partition, format, remount, copy the logs back, then start zuul again | 21:23 |
fungi | so we can force a logrotate, copy all the compressed logs to the rootfs, stop the scheduler, copy the active log to the rootfs, redo partitioning on the ephemeral disk, move the active log back to /opt, start the scheduler, then move all the compressed logs back to /opt | 21:23 |
clarkb | note it isn't /opt on zuul02 it is /var/log/zuul, but ya | 21:24 |
fungi | er, right | 21:24 |
fungi | prepping the repartitioning/formatting commands would also help shorten the outage | 21:24 |
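Prepping those commands ahead of time might look like the following dry-run script. The logrotate config path, compose file location, and device are all assumptions standing in for the real zuul02 layout, and every step is echoed rather than run:

```shell
# Sketch of the zuul02 maintenance plan from the discussion above:
# force a logrotate, stage logs on /, stop services, repartition,
# restore, restart. All paths/names below are illustrative guesses.
DRY_RUN=echo
STAGING=/root/zuul-log-staging
COMPOSE=/etc/zuul-scheduler/docker-compose.yaml   # assumed location

$DRY_RUN logrotate --force /etc/logrotate.d/zuul  # shrink the active log first
$DRY_RUN rsync -a /var/log/zuul/ "$STAGING"/
$DRY_RUN docker-compose -f "$COMPOSE" down
$DRY_RUN umount /var/log/zuul
# ...repartition and format xvde here with the prepped parted commands...
$DRY_RUN mount /dev/xvde1 /var/log/zuul
$DRY_RUN rsync -a "$STAGING"/ /var/log/zuul/
$DRY_RUN docker-compose -f "$COMPOSE" up -d
```

Keeping the service-down window to just the umount/repartition/mount steps is what makes this nearly as quick as a plain scheduler restart.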
clarkb | the two mirrors with this problem are osuosl and inmotion fwiw | 21:25 |
fungi | happy to put together a maintenance plan on an etherpad on monday, fires permitting | 21:25 |
clarkb | fungi: that would be great! I suspect that for the swapfile hosts we don't need such a thing, but something like that for the zks and zuul02 would be great | 21:25 |
fungi | for the zks i wouldn't even bother, we can take one node out of rotation at a time since it's a redundant cluster? | 21:26 |
clarkb | we can and good point | 21:26 |
fungi | mainly concerned with the zuul scheduler host | 21:26 |
fungi | but it shouldn't be hard to shorten that one to almost as quick as a straight up scheduler restart | 21:27 |
clarkb | ++ | 21:27 |
clarkb | fwiw zuul01 has a 30GB swap partition but the min change made to make_swap.sh intended for it to have an 8GB partition | 21:28 |
clarkb | I think I'm ok with 8GB in that case | 21:28 |
fungi | yeah, we also dropped the ram for 02 anyway right? | 21:28 |
clarkb | we did not. I had planned to but corvus requested that we don't | 21:29 |
fungi | ahh, okay | 21:29 |
fungi | once we have redundant schedulers we can change that fairly easily though | 21:29 |
clarkb | yup | 21:31 |
clarkb | seems like, while this is annoying, none of the services that got hit by it are immediately having trouble from it | 21:32 |
clarkb | I'm going to step out for a bit now and enjoy some sunshine. Back in a bit | 21:32 |
fungi | go enjoy, i sat on the patio and grilled hamburgers and corn | 21:33 |
fungi | it was lovely | 21:33 |
clarkb | nice | 21:34 |
fungi | waiting for the hardware store to tell me my chopsaw is ready for pickup | 21:34 |
*** timburke has quit IRC | 22:08 | |
*** timburke_ has joined #opendev | 22:08 | |
*** dpawlik has quit IRC | 22:45 | |
*** dpawlik7 has joined #opendev | 22:52 | |
*** tosky has quit IRC | 23:11 | |
*** mlavalle has quit IRC | 23:46 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!