ianw | i hacked in a mkdir -p and touch/chmod+x of the file in nb02; let's see if the next build works with that | 00:04 |
---|---|---|
*** stevebaker has quit IRC | 00:07 | |
*** tosky has quit IRC | 00:10 | |
openstackgerrit | Merged opendev/system-config master: Use upstream jitsi-meet web image https://review.opendev.org/c/opendev/system-config/+/778308 | 00:23 |
*** hamalq has quit IRC | 00:26 | |
fungi | i'm stumped by the failure on https://review.opendev.org/780942 | 00:40 |
fungi | the error message seems to be completely contradictory to what's implemented by the depends-on change, and the zuul inventory even indicates it included that change | 00:40 |
ianw | Class[Ptgbot]: has no parameter named 'aliases' at /opt/system-config/production/modules/openstack_project/manifests/eavesdrop.pp:114:3 on node eavesdrop01.openstack.org ? | 00:45 |
ianw | yeah, i wonder if we use the zuul checkout correctly? | 00:46 |
ianw | tumbleweed is converting. i guess it's an open question how it goes | 01:19 |
*** lbragstad_ is now known as lbragstad | 01:23 | |
*** stevebaker has joined #opendev | 01:31 | |
*** mlavalle has quit IRC | 01:33 | |
corvus | fungi, ianw: let's not drop that; that could be the sort of error we should be on the lookout for with zuul; i can pitch in tomorrow to help verify if it's not confirmed by then | 01:39 |
corvus | maybe start by seeing if there are git shas in the build log | 01:40 |
fungi | yeah, i'll try to dig into it tomorrow, getting late for me | 01:46 |
*** ysandeep|away is now known as ysandeep|bbl | 01:56 | |
*** redrobot2 has joined #opendev | 01:59 | |
*** redrobot has quit IRC | 02:03 | |
*** redrobot2 is now known as redrobot | 02:03 | |
openstackgerrit | Ian Wienand proposed openstack/project-config master: nodepool elements: create suse boot rc directory https://review.opendev.org/c/openstack/project-config/+/781002 | 02:40 |
*** priteau has quit IRC | 03:03 | |
*** SWAT has quit IRC | 03:37 | |
*** SWAT has joined #opendev | 03:40 | |
ianw | #status log kdc03/04 manually upgraded to focal. they are in emergency until 779890; we will run manually first time to confirm operation | 03:50 |
openstackstatus | ianw: finished logging | 03:50 |
openstackgerrit | Merged opendev/system-config master: kerberos: switch servers to Ansible control https://review.opendev.org/c/opendev/system-config/+/779890 | 04:03 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: borg-backup-server: fix verification run https://review.opendev.org/c/opendev/system-config/+/781010 | 04:10 |
*** ykarel|away has joined #opendev | 04:17 | |
*** ykarel|away is now known as ykarel | 04:17 | |
*** artom has quit IRC | 05:40 | |
*** artom has joined #opendev | 05:40 | |
*** SWAT has quit IRC | 06:09 | |
*** SWAT has joined #opendev | 06:11 | |
*** ykarel_ has joined #opendev | 06:12 | |
*** marios has joined #opendev | 06:12 | |
*** ykarel has quit IRC | 06:13 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/781019 | 06:15 |
*** ykarel_ has quit IRC | 06:26 | |
*** ykarel has joined #opendev | 06:30 | |
*** lpetrut has joined #opendev | 06:41 | |
*** ysandeep|bbl is now known as ysandeep | 06:57 | |
*** ykarel_ has joined #opendev | 07:03 | |
*** ykarel has quit IRC | 07:03 | |
*** ykarel_ has quit IRC | 07:07 | |
*** ykarel_ has joined #opendev | 07:08 | |
*** ykarel_ has quit IRC | 07:12 | |
*** sboyron has joined #opendev | 07:15 | |
*** sboyron has quit IRC | 07:16 | |
*** sboyron has joined #opendev | 07:17 | |
*** eolivare has joined #opendev | 07:22 | |
openstackgerrit | Artem Goncharov proposed opendev/irc-meetings master: Move meeting to 16 UTC as agreed https://review.opendev.org/c/opendev/irc-meetings/+/781033 | 07:35 |
*** ykarel has joined #opendev | 07:42 | |
*** DSpider has joined #opendev | 07:59 | |
*** DSpider has quit IRC | 08:00 | |
*** amoralej|off is now known as amoralej | 08:08 | |
*** andrewbonney has joined #opendev | 08:09 | |
openstackgerrit | Artem Goncharov proposed opendev/irc-meetings master: Move meeting to 16 UTC as agreed https://review.opendev.org/c/opendev/irc-meetings/+/781033 | 08:09 |
*** hashar has joined #opendev | 08:10 | |
*** rpittau|afk is now known as rpittau | 08:23 | |
*** slaweq has joined #opendev | 08:24 | |
*** zbr has joined #opendev | 08:42 | |
*** ysandeep is now known as ysandeep|lunch | 08:46 | |
*** fressi has quit IRC | 08:47 | |
*** fressi has joined #opendev | 08:48 | |
*** LowKey has joined #opendev | 08:52 | |
*** jpena|off is now known as jpena | 08:57 | |
*** dtantsur|afk is now known as dtantsur | 09:01 | |
*** tosky has joined #opendev | 09:03 | |
*** ykarel has quit IRC | 09:03 | |
*** fressi has quit IRC | 09:06 | |
*** fressi has joined #opendev | 09:08 | |
openstackgerrit | Roman Gorshunov proposed opendev/git-review master: Add missing -h to manpage https://review.opendev.org/c/opendev/git-review/+/781053 | 09:16 |
openstackgerrit | Roman Gorshunov proposed opendev/git-review master: Add missing -h to manpage and remove -c from it https://review.opendev.org/c/opendev/git-review/+/781053 | 09:21 |
openstackgerrit | Roman Gorshunov proposed opendev/git-review master: Add missing -h to manpage and remove -c from it https://review.opendev.org/c/opendev/git-review/+/781053 | 09:25 |
openstackgerrit | Roman Gorshunov proposed opendev/git-review master: Add missing -h to manpage and remove -c from it https://review.opendev.org/c/opendev/git-review/+/781053 | 09:26 |
lourot | ttx o/ I have a few governance reviews open. Are you the right person to ping for that? thanks! https://review.opendev.org/q/project:openstack/governance+AND+owner:lourot | 09:30 |
*** openstackgerrit has quit IRC | 09:33 | |
ttx | lourot: no that would be the TC members, I don't approve those changes anymore as I'm not elected to the TC anymore. You can ask them on #openstack-tc | 09:34 |
ttx | I'll do a pass on them and sprinkle Codereview+1 magic, but that only goes so far | 09:34 |
*** ykarel has joined #opendev | 09:44 | |
*** sboyron_ has joined #opendev | 09:47 | |
*** stevebaker_ has joined #opendev | 09:49 | |
*** sboyron has quit IRC | 09:55 | |
*** stevebaker has quit IRC | 09:55 | |
*** ysandeep|lunch is now known as ysandeep | 09:56 | |
*** hemanth_n has joined #opendev | 09:58 | |
lourot | understood, thanks! | 10:00 |
ykarel | looks like openstackgerrit bot is down or has some issues | 10:42 |
ykarel | not getting IRC notification | 10:43 |
ykarel | can someone check | 10:43 |
*** hashar has quit IRC | 10:51 | |
*** artom has quit IRC | 11:04 | |
*** LowKey has quit IRC | 11:14 | |
*** priteau has joined #opendev | 11:24 | |
*** artom has joined #opendev | 11:28 | |
*** hemanth_n has quit IRC | 12:15 | |
*** jpena is now known as jpena|lunch | 12:34 | |
*** fressi has quit IRC | 12:39 | |
*** ykarel has quit IRC | 12:59 | |
*** ykarel has joined #opendev | 13:12 | |
*** marios is now known as marios|call | 13:19 | |
*** amoralej is now known as amoralej|lunch | 13:21 | |
*** marios|call is now known as marios | 13:31 | |
*** jpena|lunch is now known as jpena | 13:32 | |
fungi | ykarel: yep, sorry, taking a look now | 13:45 |
fungi | 2021-03-17 09:33:30 <-- openstackgerrit (~openstack@eavesdrop01.openstack.org) has quit (Quit: Changing servers) | 13:46 |
fungi | i guess that was the last we heard from it | 13:46 |
fungi | #status log Restarted gerritbot container since it never returned after a 09:33 UTC server change | 13:49 |
openstackstatus | fungi: finished logging | 13:49 |
fungi | looks like it reported a change in #openstack-ansible after the restart, 2021-03-17T13:48:04 | 13:50 |
fungi | ykarel: it should be back to normal now | 13:50 |
*** openstackgerrit has joined #opendev | 13:54 | |
openstackgerrit | Gomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size https://review.opendev.org/c/zuul/zuul-jobs/+/773474 | 13:54 |
*** fressi has joined #opendev | 13:57 | |
ykarel | fungi, Thanks it's working now | 13:59 |
*** amoralej|lunch is now known as amoralej | 14:01 | |
fungi | i wish we could figure out why it doesn't detect that it's no longer connected to freenode and keeps shouting into the void | 14:06 |
fungi | it never even logged anything from the server about it | 14:10 |
ykarel | yes that would be good | 14:15 |
fungi | though interestingly, it did log a traceback right at that time because of the length of the commit subject on this change: https://review.opendev.org/780282 | 14:18 |
fungi | author forgot to add a blank as the second line | 14:18 |
fungi | i doubt it's related since the quit message freenode showed was "Changing servers" but i'll try to see if i can recreate the problem later | 14:19 |
*** whoami-rajat_ has joined #opendev | 14:30 | |
*** ysandeep is now known as ysandeep|afk | 14:32 | |
*** ysandeep|afk is now known as ysandeep | 14:55 | |
frickler | fungi: I saw similar logs about overlong messages on earlier disconnects, but never was able to figure out whether they were just happening because the connection got broken just a bit earlier or whether they might actually be related to triggering the issue | 14:56 |
frickler | would sure be interesting to try and reproduce | 14:57 |
*** marios is now known as marios|call | 15:00 | |
*** lpetrut has quit IRC | 15:03 | |
clarkb | fungi: frickler: any objections to me starting to land https://review.opendev.org/c/openstack/project-config/+/780982 https://review.opendev.org/c/opendev/system-config/+/780986 https://review.opendev.org/c/opendev/system-config/+/780989 https://review.opendev.org/c/opendev/zone-opendev.org/+/780988 this morning to begin the rollout of nl02-04 today? | 15:06 |
*** ykarel is now known as ykarel|away | 15:06 | |
clarkb | some of those need reviews too | 15:06 |
clarkb | I'm happy to do the +As and watch them if others can sanity check them first | 15:06 |
fungi | clarkb: no objection | 15:09 |
clarkb | great, I've approved the first 3. The 4th is child of one of the other three and depends on a different one in the other three so I'll let things settle a bit before landing that one | 15:12 |
clarkb | thanks for the reviews! | 15:12 |
fungi | trying to work out why a system-config change with a depends-on to a puppet-ptgbot change seems not to have used the modified module source... how do we go about making install_modules.sh use zuul-prepared repository states? | 15:15 |
fungi | i can see where it cloned puppet-ptgbot in the job, but unfortunately it doesn't say where it cloned it from or what command it ran to do the clone | 15:16 |
clarkb | corvus: is there any reason to not make the change at https://review.opendev.org/c/openstack/project-config/+/780985/ ? I noticed that our existing nodepool configs were confusing when I was updating them to do this rolling replacement of the servers. That change is an attempt to make it less confusing but I am not sure if maybe the confusing aspect was intentional for some reason | 15:16 |
clarkb | fungi: have a link to the job? | 15:16 |
clarkb | I think I'll need the context to help make sense of it at this point | 15:16 |
fungi | this was the build: https://zuul.opendev.org/t/openstack/build/ab3c56b2e58b46afa142e9e8c83814e2 | 15:16 |
fungi | it was triggered by system-config change https://review.opendev.org/780942 | 15:17 |
fungi | which depends-on https://review.opendev.org/780947 | 15:17 |
openstackgerrit | Merged opendev/zone-opendev.org master: Add nl02-04 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/780988 | 15:17 |
fungi | the zuul inventory reflects that set of changes, but i'm trying to figure out if we somehow didn't actually check out the patched version of the module, or my change is just subtly wrong in some way i can't see | 15:18 |
fungi | the log indicates /etc/puppet/modules did not exist on the bridge node, and then it was created and our needed modules (including the ptgbot module) were cloned into it | 15:19 |
clarkb | ya and that uses modules.env in system-config as the input list | 15:21 |
clarkb | looking at modules.env I expect we are supposed to override OPENSTACK_GIT_ROOT maybe? | 15:22 |
fungi | yeah, that's what i'm guessing too | 15:22 |
clarkb | however it doesn't check for a value before setting https://opendev.org | 15:22 |
fungi | well, also codesearch turns up no existing references to that variable outside that one file | 15:23 |
fungi | there's this at the end of the file: https://opendev.org/opendev/system-config/src/branch/master/modules.env#L134-L139 | 15:24 |
openstackgerrit | Merged openstack/project-config master: Add idle configs for nl02-04.opendev.org https://review.opendev.org/c/openstack/project-config/+/780982 | 15:24 |
clarkb | fungi: I think the way it worked with some jobs was we used zuul-cloner to put the repos in place first | 15:24 |
clarkb | and then install_modules.sh would largely noop | 15:25 |
fungi | yeah, maybe system-config-run-eavesdrop is missing that somehow | 15:25 |
clarkb | I expect all the system-config-run jobs are because they rely on more modern zuulisms and we don't seem to have adapted this to new zuul | 15:26 |
clarkb | # If puppet integration tests are not being run, merge SOURCE and INTEGRATION modules <- thats from modules.env and is a clue | 15:28 |
fungi | oh, you know what? system-config-run-eavesdrop probably assumes we're only doing integration testing for ansibilified/containerized services | 15:28 |
clarkb | we need to set PUPPET_INTEGRATION_TEST=1 when running install_modules.sh and separately clone the repos ourselves | 15:28 |
clarkb | fungi: yes exactly | 15:28 |
openstackgerrit | Merged opendev/system-config master: Cleanup nl01.openstack.org https://review.opendev.org/c/opendev/system-config/+/780986 | 15:29 |
clarkb | for the separate clone step we may be able to simply do a ln -s of the module from the zuul dir to puppet modules dir? | 15:29 |
fungi | okay, well, i'm not super concerned about it in that case, it's a problem which will solve itself once we move the rest of eavesdrop's services off puppet | 15:29 |
clarkb | yup | 15:29 |
clarkb | PUPPET_INTEGRATION_TEST=1 seems to be the key though as that is what causes install_modules.sh to say it is your problem | 15:29 |
fungi | i see little point in spending time trying to fix our ansible integration testing to do puppet integration testing correctly | 15:30 |
clarkb | ++ | 15:30 |
fungi | corvus: mystery solved ^ i assumed incorrectly that was one of our existing puppet integration test jobs, it's not | 15:31 |
* clarkb finds breakfast while ansible sorts out the three changes that just landed. Will cleanup nl01.openstack.org afterwards then land the change to add nl02-04.opendev.org | 15:32 | |
*** mlavalle has joined #opendev | 15:35 | |
*** lourot has quit IRC | 15:36 | |
*** lourot has joined #opendev | 15:36 | |
*** lourot has quit IRC | 15:37 | |
*** lourot has joined #opendev | 15:38 | |
dtantsur | anyone knowing about glean around now? I'm pondering adding `NetworkManager restart` somewhere (where?) otherwise static DHCP won't work until reboot.. | 15:39 |
fungi | i'm about to disappear for a routine dental checkup, or i'd try to figure it out now | 15:41 |
*** ykarel|away has quit IRC | 15:42 | |
openstackgerrit | Sorin Sbârnea proposed zuul/zuul-jobs master: Upgrade ansible-lint to 5.0 https://review.opendev.org/c/zuul/zuul-jobs/+/773245 | 15:43 |
fungi | okay, headed out, back as soon as possible | 15:44 |
clarkb | dtantsur: what does static dhcp mean? | 15:44 |
dtantsur | clarkb: ouch. It was static IP. | 15:44 |
* dtantsur brain boiling | 15:44 | |
fungi | ahh, okay, so not a reservation | 15:44 |
openstackgerrit | Sorin Sbârnea proposed zuul/zuul-jobs master: Upgrade ansible-lint to 5.0 https://review.opendev.org/c/zuul/zuul-jobs/+/773245 | 15:45 |
dtantsur | no. I'll post more details to the older thread with ianw | 15:45 |
clarkb | dtantsur: we use glean + static IPs on all our rax nodes. The centos and fedora nodes use network manager. They do not require a separate restart step | 15:45 |
dtantsur | clarkb: yeah, I know. No idea how it works for you, it 100% does not for me. But that's on a ramdisk, that's the only difference. | 15:45 |
clarkb | dtantsur: does the ramdisk use systemd or some other init setup? I bet it has to do with service startup ordering | 15:46 |
dtantsur | systemd, we've already ruled it out with ianw as it seems | 15:46 |
dtantsur | I'll paste links to the relevant ML posts, hold on | 15:46 |
clarkb | ok | 15:46 |
dtantsur | this is my earlier message with systemd-analyze output: http://lists.openstack.org/pipermail/openstack-discuss/2020-November/019022.html | 15:53 |
dtantsur | this is today's with boot logs: http://lists.openstack.org/pipermail/openstack-discuss/2021-March/021155.html | 15:53 |
dtantsur | clarkb: ^^ | 15:53 |
dtantsur | This issue may be related to reading from a (virtual) CD ROM? | 15:54 |
clarkb | dtantsur: you need glean to run before network manager | 15:54 |
clarkb | and that is a unit ordering problem aiui | 15:54 |
dtantsur | so, the DIB simple-init element is broken? | 15:54 |
clarkb | possibly | 15:55 |
dtantsur | I think it simply uses 'glean install' or something like that | 15:55 |
clarkb | no it also sets up the udev rule and unit as explained in ianw's response | 15:55 |
dtantsur | udev rules also come from glean itself: https://opendev.org/opendev/glean/src/branch/master/glean/init/glean-udev.rules | 15:56 |
clarkb | https://opendev.org/opendev/glean/src/branch/master/glean/init/glean-udev.rules and https://opendev.org/opendev/glean/src/branch/master/glean/init/glean-nm@.service | 15:56 |
dtantsur | right. so these are wrong? | 15:56 |
clarkb | dtantsur: yes but the pip install can't install them | 15:56 |
dtantsur | they do run, otherwise glean wouldn't be triggered at all | 15:57 |
clarkb | no, I'm explaining that the simple-init element installs those files. glean is merely a containment vessel since pip isn't robust enough to do those installations properly across various distros | 15:57 |
dtantsur | (and simple-init doesn't do pip install, it uses glean-install) | 15:57 |
dtantsur | okay, fine, I think we're talking about the same thing in the end | 15:57 |
dtantsur | I would assume Before=network-pre.target is enough. maybe it's not? | 15:58 |
dtantsur | maybe it needs Before=networkmanager.service? | 15:58 |
clarkb | possibly. The first thing I notice is that there is a different unit file for nm use and the udev rule refers to the unit file name | 15:59 |
*** lpetrut has joined #opendev | 15:59 | |
clarkb | I wonder if the udev rule isn't configured properly for the nm unit | 15:59 |
dtantsur | "Started Glean for interface enp1s0 with | 15:59 |
dtantsur | NetworkManager | 15:59 |
dtantsur | this seems the correct unit | 15:59 |
dtantsur | matches the description: https://opendev.org/opendev/glean/src/branch/master/glean/init/glean-nm@.service#L2 | 16:00 |
clarkb | do we know if that is https://opendev.org/opendev/glean/src/branch/master/glean/init/glean@.service or https://opendev.org/opendev/glean/src/branch/master/glean/init/glean-nm@.service ? | 16:00 |
clarkb | (I don't think we do) | 16:00 |
dtantsur | clarkb: see "with NetworkManager" above | 16:00 |
dtantsur | the one you link does not have this bit in the description | 16:01 |
clarkb | oh it went across multiple lines | 16:01 |
dtantsur | yeah, sorry for that | 16:01 |
dtantsur | I think I'll try before=networkmanager, it may give us a better insight | 16:01 |
clarkb | ok that is good, that means the udev rule is triggering the expected unit at least | 16:02 |
dtantsur | yep | 16:02 |
dtantsur | I'm worried that NM has a habit of always initializing a connection | 16:02 |
clarkb | looking at your log, not only are nm and glean started roughly at the same time but glean.sh doesn't do anything for almost a minute? | 16:02 |
dtantsur | yeah, that's the most surprising bit | 16:03 |
dtantsur | I wonder if it has anything to do with reading from a virtual CD | 16:03 |
dtantsur | (silly, but that's the only guess I have) | 16:03 |
clarkb | reading the config-drive data from virtual cdrom? ya that could be it I suppose | 16:03 |
dtantsur | yep. nested VM, so everything is slooooooooow. | 16:03 |
clarkb | what you need for NM to be configured properly is for glean.sh to have written the interface config file before NM evaluates the interface | 16:04 |
clarkb | and that log clearly shows we aren't writing that file until well after NM has done its thing | 16:04 |
dtantsur | yep | 16:04 |
clarkb | is network manager also before network-pre.target? | 16:05 |
dtantsur | ouch. I've just realized that Before=NetworkManager may not work since NetworkManager starts very early | 16:05 |
dtantsur | mmm, lemme try to figure out | 16:05 |
clarkb | dtantsur: I think the before=network-pre.target is intended to run glean prior to NM | 16:06 |
clarkb | with the assumption that NM and friends won't happen until network-pre.target starts | 16:06 |
dtantsur | After=network-pre.target dbus.service | 16:06 |
dtantsur | Before=network.target network.service | 16:06 |
dtantsur | This is on my normal machine, not inside the ramdisk | 16:06 |
dtantsur | but it's also centos 8 | 16:06 |
clarkb | my NetworkManager is After=network-pre.target dbus.service and Before=network.target | 16:07 |
clarkb | ya for both of these machines the glean-nm service should be fine | 16:07 |
clarkb | but it certainly seems like this isn't the case in the system whose log you've pasted | 16:07 |
clarkb | (thinking out loud here) could it potentially be that in your nested VM udev doesn't process its events until well after systemd has decided to move along? | 16:08 |
dtantsur | easily.. | 16:08 |
clarkb | the glean unit is also implicitly expecting that udev will have fired off an event to systemd saying add this to your startup graph I think | 16:08 |
clarkb | systemd-udev-trigger.service is the unit on my suse system that coldplugs all devices and it runs Before sysinit.target | 16:10 |
clarkb | double checking ^ on the test env might also be worthwhile | 16:10 |
dtantsur | [ 40.128942] systemd[1]: Started udev Coldplug all Devices. | 16:12 |
dtantsur | it's much earlier than NetworkManager (timestamps 63) | 16:12 |
*** roman_g has joined #opendev | 16:12 | |
*** lpetrut has quit IRC | 16:12 | |
dtantsur | .. which doesn't necessarily mean that it has processed all events by NM start-up... | 16:13 |
roman_g | Hello team. Could I ask you to check kna1 Ubuntu mirror, please? Thank you! https://zuul.opendev.org/t/openstack/build/cfce2713a39a4dcb9b36a93ebd1799a4 | 16:15 |
roman_g | Errors: E: Failed to fetch https://mirror.kna1.airship-citycloud.opendev.org/ubuntu/pool/main/p/python-certifi/python-certifi_2018.1.18-2_all.deb Unable to connect to mirror.kna1.airship-citycloud.opendev.org:https: | 16:15 |
clarkb | dtantsur: ya my udev plug unit does execstart and as soon as the process starts running I think systemd will consider the unit started? | 16:15 |
roman_g | dtantsur Hi Dmitry. Привет, Дмитрий :) | 16:16 |
dtantsur | roman_g: Привет! o/ | 16:16 |
clarkb | roman_g: you can check it too :) https://mirror.kna1.airship-citycloud.opendev.org/ubuntu/ is available from here | 16:16 |
clarkb | roman_g: can you link to the job that hit that problem so we can see timestamps and cross check against the server? (without timestamps the best I can do quickly is say the server is up and running now according to my browser) | 16:16 |
roman_g | clarkb I did. But unreachable from VMs | 16:17 |
roman_g | clarkb this job: https://zuul.opendev.org/t/openstack/build/cfce2713a39a4dcb9b36a93ebd1799a4 | 16:17 |
clarkb | hrm that doesn't log timestamps? | 16:17 |
clarkb | ah the other parts of the job do, that's good | 16:18 |
clarkb | https://zuul.opendev.org/t/openstack/build/cfce2713a39a4dcb9b36a93ebd1799a4/log/job-output.txt#332 | 16:18 |
roman_g | clarkb thanks for pointing out to a way to link to specific line. This seems to be something new. | 16:19 |
clarkb | dmesg on the mirror doesn't show any recent afs connectivity issues. I think we can rule that out | 16:20 |
roman_g | clarkb I'm also missing a tab in Zuul UI to list all jobs on specific provider (label). You know we are often having troubles :) | 16:20 |
roman_g | This could have allowed me to see if this is common problem for all jobs using this provider (or label). | 16:21 |
clarkb | roman_g: zuul doesn't track that iirc | 16:22 |
clarkb | the logstash service tries to fill that gap though | 16:22 |
TheJulia | I guess meetpad got updated? | 16:22 |
clarkb | TheJulia: yes, or at least I think the change to do that landed | 16:22 |
roman_g | clarkb yes, that's righ | 16:22 |
roman_g | right | 16:22 |
clarkb | TheJulia: just the web portion though, we had had a fork for a while but then they upstreamed corvus' changes and so that got updated | 16:22 |
TheJulia | Cool, are we expecting any etherpad integration breakages? | 16:23 |
clarkb | TheJulia: no, but the component that was updated does handle that so is possible | 16:24 |
* dtantsur adds moar logging to glean.sh and rebuilds | 16:24 | |
clarkb | TheJulia: is the etherpad integration not working as expected? | 16:25 |
TheJulia | clarkb: did not but it could just be cached items | 16:25 |
TheJulia | trying another browser | 16:25 |
clarkb | roman_g: I see the test node hitting the access logs around the time the build logs it cannot access things | 16:26 |
*** ysandeep is now known as ysandeep|dinner | 16:26 | |
TheJulia | clarkb: nope, not working. https://meetpad.opendev.org/ironic the embedded etherpad says 400 Bad Request in the background on a brand new fresh browser | 16:26 |
clarkb | TheJulia: ok, it is possible the integration stuff which we thought was working upstream isn't actually working upstream. corvus did they not take your change as is? | 16:27 |
clarkb | we can revert the cleanup and go back to our forked version easily enough. Though I'm helping to debug a couple of other things right now so may be a bit | 16:27 |
TheJulia | I'm not too worried about it, tbh | 16:28 |
TheJulia | so don't rush anything on my account | 16:28 |
clarkb | TheJulia: ok, ya you should be able to screenshare an etherpad window and/or tell people to open it up separately | 16:28 |
clarkb | roman_g: here is a curious thing, I see the same IP hitting the mirror ~13 minutes propr | 16:28 |
TheJulia | clarkb: exactly | 16:28 |
clarkb | s/propr/prior/ | 16:28 |
clarkb | roman_g: but the job had only just started ~3 minutes prior | 16:29 |
roman_g | clarkb Thank you. This is very interesting. I don't have ideas how to debug it then. | 16:29 |
roman_g | :) | 16:29 |
clarkb | roman_g: I wonder if this is an arp caching problem with stale arp tables and IP reuse | 16:29 |
roman_g | clarkb on provider side? | 16:29 |
clarkb | roman_g: basically the packets from the host get to the mirror but then the return path goes to some node that doesn't exist anymore | 16:29 |
clarkb | roman_g: ya, mostly just calling that out as a possibility given the use of the IP outside of this job context only a few minutes prior | 16:29 |
roman_g | Then this is very occasional. Happens not so often | 16:29 |
clarkb | roman_g: let me get a paste together that tries to show this | 16:30 |
*** hamalq has joined #opendev | 16:31 | |
clarkb | roman_g: http://paste.openstack.org/show/TYMBvtqYyKrA2Je4NWow/ | 16:37 |
clarkb | roman_g: assuming apt doesn't say "connection time out" when it gets a 404 (that would be a pretty big bug) I think we can probably blame provider networking since the mirror sees the requests but the host doesn't seem to agree | 16:37 |
clarkb | TheJulia: I can reproduce the etherpad problem | 16:38 |
roman_g | TheJulia, clarkb I confirm that I also see Bad Request page there | 16:39 |
clarkb | https://meetpad.opendev.org/etherpad/p/padnamehere <- is the url that jitsi tries to fetch the etherpad at and that is producing the same result | 16:40 |
*** marios|call is now known as marios | 16:40 | |
clarkb | the nginx config to do that proxying seems to still exist | 16:43 |
clarkb | I need to find the server side logs to see what it doesn't like about this request I guess | 16:44 |
openstackgerrit | Dmitry Tantsur proposed openstack/diskimage-builder master: Install glean from source-repository https://review.opendev.org/c/openstack/diskimage-builder/+/781126 | 16:44 |
openstackgerrit | Dmitry Tantsur proposed openstack/diskimage-builder master: Install glean from source-repository https://review.opendev.org/c/openstack/diskimage-builder/+/781126 | 16:55 |
*** ysandeep|dinner is now known as ysandeep | 16:57 | |
*** marios is now known as marios|out | 16:57 | |
dtantsur | clarkb: added debugging logging, found out something. this first invocation: https://opendev.org/opendev/glean/src/branch/master/glean/init/glean.sh#L46 takes half a minute | 17:00 |
dtantsur | which pushes the actual network configuration to much later. why is it needed? | 17:00 |
dtantsur | it seems like we could just default to --ssh and --hostname and avoid it | 17:01 |
clarkb | dtantsur: I'm not sure, will need to do some digging | 17:03 |
dtantsur | and I can confirm that networkmanager ignores before=network-pre :) | 17:05 |
clarkb | that seems particularly problematic | 17:06 |
dtantsur | yep. it's a recipe for races | 17:06 |
*** rpittau is now known as rpittau|afk | 17:08 | |
fungi | okay, back and trying to catch up while i make lunch | 17:09 |
clarkb | dtantsur: the line you call out is just for when the config drive isn't present. I think there is a bug there where we should put the last line in the script in an else block though | 17:09 |
clarkb | dtantsur: we want one or the other to run not both I think. | 17:10 |
dtantsur | clarkb: the other way around: it's when configdrive IS present | 17:10 |
dtantsur | (which I can confirm in vivo) | 17:10 |
clarkb | dtantsur: however, in your case if the no config drive block is running that will default to dhcp which may explain part of your problem? | 17:10 |
clarkb | dtantsur: the test is -n "$CONFIG_DRIVE_LABEL" | 17:10 |
clarkb | oh ya, the comment there is confusing | 17:11 |
clarkb | ya I really don't understand what that is trying to achieve | 17:11 |
dtantsur | -n checks for presence, you're confusing with -z | 17:11 |
dtantsur | I'm trying without that line now | 17:11 |
clarkb | the code has always been there too :/ | 17:12 |
clarkb | dtantsur: yes, the comment above it was confusing me | 17:12 |
clarkb | it says if the config drive isn't present skip it | 17:12 |
clarkb | which is not what the code is doing at all | 17:12 |
dtantsur | right, yeah. I had a minute of confusion as well | 17:12 |
clarkb | fungi: the meetpad upgrade broke etherpad embedment | 17:12 |
clarkb | fungi: jitsi meet should look to https://meetpad.opendev.org/etherpad/p/padnamehere for the pads. That too fails with an http 400 | 17:13 |
dtantsur | I think I understand why After=network-pre does not work. network-pre is optional, it gets pulled in via Wants=network-pre in glean@ units. But they appear too late for NM. | 17:13 |
dtantsur | this is how I read https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/ | 17:13 |
clarkb | fungi: if you look at the nginx logs on meetpad (using docker logs for the web container) it shows the 400s but I don't see any explanation of why it doesn't like that url | 17:13 |
clarkb | dtantsur: I think I get it, the glean.sh script is only wanting to write out the ssh key and the hostname if the config drive is present as that data is in the config drive only. It then runs again to set up the network which can handle config drive being present or not (fallback to dhcp) | 17:16 |
clarkb | dtantsur: I can write up a patch you can test that executes glean once handling both scenarios | 17:16 |
dtantsur | will gladly do (I'm trying my own hacked-together version now) | 17:16 |
dtantsur | glean takes 40 seconds just to start executing O_____o | 17:18 |
dtantsur | what the.... | 17:18 |
* dtantsur rewrites it in rust | 17:18 | |
dtantsur | :D | 17:18 |
dtantsur | ideally we should be able to deal with networkmanager somehow | 17:19 |
*** amoralej is now known as amoralej|off | 17:19 | |
dtantsur | the only problem is: I haven't found a way to tell it to re-read configuration files short of restarting | 17:20 |
openstackgerrit | Clark Boylan proposed opendev/glean master: Run glean fewer times in glean.sh https://review.opendev.org/c/opendev/glean/+/781133 | 17:20 |
clarkb | dtantsur: you have to delete the interface configuration. It skips if it sees they are already there iirc | 17:20 |
clarkb | dtantsur: ^ something like that maybe | 17:21 |
dtantsur | clarkb: no, I mean a different thing. I think NM keeps its configuration somewhere else. | 17:21 |
clarkb | oh NM yes it does. The /etc/ configs are a compat convenience thing but it maintains a db or something iirc | 17:21 |
dtantsur | yup | 17:22 |
dtantsur | so this DB is initialized early with a dummy "Wired Connection 1" | 17:22 |
dtantsur | and then it refuses to read the files glean creates... | 17:22 |
* dtantsur shakes first at NM | 17:22 | |
dtantsur | first... fist | 17:22 |
*** frigo has joined #opendev | 17:24 | |
dtantsur | frigo (not here) has put together workarounds: https://bugs.launchpad.net/diskimage-builder/+bug/1916348 | 17:25 |
openstack | Launchpad bug 1916348 in diskimage-builder "simple-init/glean missing some requirements (centos-minimal 8)" [Undecided,New] | 17:25 |
dtantsur | but it boils down to restarting NM networking | 17:25 |
dtantsur | (and I question some of these steps) | 17:25 |
dtantsur | hey frigo, I've just posted your link | 17:25 |
frigo | haha:D I did all that without thinking, and did not put the "updated" version of the glean.sh | 17:25 |
dtantsur | I'm not sure why /dev/sr0 doesn't work for you, it works for me | 17:26 |
dtantsur | the NM changes essentially amount to restarting the full networking, right? | 17:26 |
clarkb | frigo: dtantsur: I don't think glean should handle the multiple config drive situation (that isn't valid is it?) | 17:26 |
frigo | /dev/sr0 works but | 17:26 |
clarkb | your cloud has broken something very badly if that happens and you should fix it under glean | 17:26 |
frigo | it does not if you already have a /dev/sda with a config-2 drive label in it | 17:27 |
dtantsur | aaaaaah. oooh! | 17:27 |
* dtantsur runs away | 17:27 | |
clarkb | frigo: but you shouldn't have that? | 17:27 |
frigo | in the context of bifrost, for some reason, the first time you enroll a server, the clean-up does not run | 17:27 |
clarkb | if nova gave me a host with two config drives I would immediately ask the nova devs to fix it :) | 17:27 |
dtantsur | frigo: fair enough. it's an issue with microversions in ansible openstack modules. | 17:27 |
frigo | also sometimes, it's useful to disable the automated clean-up | 17:27 |
dtantsur | every time you disable cleaning somewhere far-far away cries a lonely ironic developer | 17:28 |
clarkb | and yes dhcp-all-interfaces and simple-init need to be used XOR each other | 17:28 |
dtantsur | (but yes, there is an actual bug with enrollment) | 17:28 |
clarkb | but that is up to you as the person compiling the elements list | 17:28 |
frigo | well, I used to envision leveraging the cleaning steps to run a lot of wild things | 17:29 |
frigo | like firmware upgrades | 17:29 |
frigo | then I opened https://storyboard.openstack.org/#!/story/2008643 | 17:29 |
frigo | and more I think | 17:29 |
dtantsur | funny enough, dhcp-all-interfaces worked fine in my earlier testing on debian | 17:30 |
dtantsur | IIRC it has logic to skip DHCP if there is configuration | 17:30 |
dtantsur | frigo: please report it to HPE folks | 17:31 |
frigo | I report things one after the other:D I opened quite a lot of tickets already | 17:32 |
dtantsur | that's good, thank you | 17:32 |
*** hashar has joined #opendev | 17:36 | |
corvus | clarkb: nodepool zk change lgtm | 17:37 |
corvus | fungi: re mystery solved, great! | 17:37 |
clarkb | corvus: cool, just wanted to double check on that | 17:37 |
corvus | clarkb: afaik they took the change as written, but who knows, maybe they renamed the variable or reverted it | 17:38 |
clarkb | corvus: ya digging around it seems they kept it pretty stable. | 17:39 |
clarkb | I have just discovered that the request is making it all the way to etherpad but with an extra /p/ prefix so you get /p//p/padnamehere in the url and that breaks it | 17:40 |
clarkb | I don't understand why this is happening yet though | 17:40 |
*** ysandeep is now known as ysandeep|away | 17:41 | |
*** frigo has quit IRC | 17:41 | |
clarkb | infra-root I'm going to delete nl01.openstack.org shortly (say in 5 minutes). Please say something if that doesn't work for you | 17:42 |
dtantsur | clarkb: left one suggestion on your patch | 17:42 |
openstackgerrit | Merged opendev/irc-meetings master: Move meeting to 16 UTC as agreed https://review.opendev.org/c/opendev/irc-meetings/+/781033 | 17:43 |
clarkb | dtantsur: that is an interesting idea, my only concern is I would need to think about what that means for logging (I don't think it means anything since fds should be inherited but need to run it through in my head) | 17:44 |
dtantsur | well, I'm going to try it out now | 17:44 |
* dtantsur is trying to understand what is taking so much time | 17:45 | |
clarkb | nl01.openstack.org has been cleaned up | 17:54 |
dtantsur | Pondering another idea: a very early execution of glean for whatever interfaces are already initialized (without --interface) | 17:57 |
clarkb | TheJulia: are you done with meetpad? I think I see what a fix is but I'd like to test it before I push anything up | 18:04 |
clarkb | (and to do that I need to restart services) | 18:04 |
clarkb | essentially we set ETHERPAD_URL_BASE=https://etherpad.opendev.org/p/ and that gets picked up by the default nginx config. What I don't understand is how the default nginx config is being used when we supply an alternative | 18:05 |
clarkb | but I think this is close enough that updating ETHERPAD_URL_BASE to drop the p/ is worth a go | 18:05 |
clarkb | and if that fixes things, push that update, then work out why our supplied nginx config is ignored | 18:06 |
*** jpena is now known as jpena|off | 18:06 | |
clarkb | ok that didn't fix things but it did confirm that that var seems to be in play as now I have //p/padnamehere instead of /p//p/padnamehere | 18:09 |
clarkb | but also that may have broken more things? yay | 18:10 |
fungi | lunch prep and consumption took longer than expected, so still catching up | 18:10 |
johnsom | FYI, just got an odd rsync POST_FAILURE | 18:11 |
dtantsur | food, mmmmm | 18:11 |
johnsom | https://www.irccloud.com/pastebin/dmWblaAa/ | 18:11 |
johnsom | https://zuul.opendev.org/t/openstack/build/065d412b8ab04a049bd60eaecc73ca80 | 18:11 |
clarkb | I think restarting services to pick up the config change has resulted in sadness | 18:11 |
clarkb | I'm going to undo the /p/ removal in case that somehow made things worse | 18:12 |
openstackgerrit | Merged opendev/system-config master: Add new opendev.org nodepool launchers https://review.opendev.org/c/opendev/system-config/+/780989 | 18:13 |
clarkb | well I see the /p//p/padnamehere behavior has returned but jitsi still says I've been disconnected | 18:13 |
dtantsur | I think I'll continue suffering tomorrow, dinner is calling | 18:13 |
*** dtantsur is now known as dtantsur|afk | 18:14 | |
clarkb | interestingly the mobile app seems to load it up ok | 18:14 |
fungi | johnsom: that looks like the node died before test result collection was attempted | 18:14 |
fungi | the executor is saying it couldn't reach it | 18:15 |
clarkb | something about my browser? can anyone else confirm or deny that jitsi meet is alive for them | 18:15 |
johnsom | fungi Yeah, not a lot in the main log. Thought I would mention it in case it was a bad sign of things to come. | 18:15 |
fungi | clarkb: i get the jitsi-meet main page when i go to http://meetpad.opendev.org/ | 18:17 |
fungi | and i can start a meeting with the start meeting button | 18:18 |
clarkb | does it say you have been disconnected when you start a meeting? | 18:18 |
*** whoami-rajat_ is now known as whoami-rajat | 18:18 | |
*** andrewbonney has quit IRC | 18:18 | |
fungi | johnsom: thanks, looks like that happened in ovh-gra1 so i guess if we see more and they're in the same region, we'll have reason to believe it's correlated | 18:19 |
fungi | clarkb: oh, i wasn't trying from a machine which i expected to actually be able to use it, switching rooms now | 18:19 |
fungi | connected from my workstation with chromium, i get "you have been disconnected" yeah | 18:21 |
fungi | retrying with firefox because i saw something different there | 18:22 |
clarkb | ya firefox gives you a little dialog that says connection failed | 18:22 |
fungi | oh, firefox shows a little red "connection failed" after the browser warning in the bottom-left | 18:23 |
fungi | so that happened after you switched the etherpad base url? | 18:23 |
clarkb | https://github.com/jitsi/docker-jitsi-meet/issues/902 is the problem I think | 18:23 |
clarkb | fungi: yes, more precisely after I restarted all the jitsi meet services to pick up the etherpad base url change | 18:23 |
clarkb | but I think that issue gives me a fix | 18:24 |
fungi | aha, "proxy the incoming WSS connection, not only WS" | 18:30 |
*** calcmandan_ has quit IRC | 18:33 | |
*** calcmandan has joined #opendev | 18:33 | |
*** marios|out has quit IRC | 18:36 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Disable xmpp websocket in jitsi meet config https://review.opendev.org/c/opendev/system-config/+/781145 | 18:41 |
clarkb | I manually did ^ and that fixed audio and video. etherpad doc sharing is still broken but we have more clues to debug that | 18:41 |
clarkb | I need to eat lunch but I'll pick this back up again after | 18:41 |
*** calcmandan has quit IRC | 18:46 | |
*** sboyron_ has quit IRC | 18:52 | |
*** eolivare has quit IRC | 18:54 | |
clarkb | ok I think I see what may be happening config wise with jitsi meet. We bind mount /var/jitsi-meet/web/ to /config in the container. When the container starts it populates the contents of /config from /defaults with env vars replaced | 19:01 |
clarkb | we then later run ansible which updates the /var/jitsi-meet/web contents causing all sorts of confusion | 19:02 |
clarkb | but the file contents actually loaded by nginx at startup are the ones produced from the defaults file | 19:02 |
clarkb | knowing that I now believe that the problem with the etherpad doc loading is at least partially related to https://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/meet.conf#L84 and https://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/meet.conf#L90 | 19:08 |
clarkb | I think we are sending a Host header to etherpad.opendev.org with the value localhost in it | 19:09 |
clarkb | we may still need to fork their images to fix this; however, it should be possible to use a much lighter weight fork that simply patches the etherpad proxy config | 19:12 |
fungi | aha, and that's causing it to redirect with an additional /p/? | 19:13 |
clarkb | fungi: we set ETHERPAD_URL_BASE with a /p/ suffix. It shouldn't have that suffix at all. But even fixing that (I've manually done this) we still get an http 400 response from etherpad.opendev.org and I suspect it is because of the bad Host header values | 19:14 |
clarkb | I'm now checking if I can manually edit the config and have nginx reload it somehow | 19:15 |
clarkb | this way the container doesn't restart and rewrite the files | 19:15 |
clarkb | you can `nginx -s reload` and that fixed https://meetpad.opendev.org/etherpad/p/padnamehere | 19:17 |
clarkb | however the shared document is still not showing up in the meetings | 19:17 |
clarkb | config.etherpad_base = '"https://meetpad.opendev.org"/etherpad/p/'; I suspect this is the problem | 19:21 |
clarkb | ya so that is being written out from the configs too | 19:24 |
clarkb | I'm somewhat amazed that this is working at all to be honest :( | 19:24 |
clarkb | oh I see the beginning of the generated config has a bunch of generic stuff then it redefines things with the vars we pass in later which is how this works | 19:27 |
clarkb | ok I think I fixed it by hand. But not really sure how to properly fix it yet | 19:30 |
clarkb | I removed the ""s from config.etherpad_base = '"https://meetpad.opendev.org"/etherpad/p/'; and hard refreshed and it loads after I also fixed the ETHERPAD_URL_BASE and the proxy config | 19:30 |
*** stevebaker_ is now known as stevebaker | 19:31 | |
clarkb | I'm beginning to wonder if we should consider a soft revert. Basically add the opendev image back but base it off of an up to date jitsi meet web and then reapply our configs | 19:35 |
fungi | is the problem that there's configuration baked into the image which we can't easily override? | 19:39 |
*** gmann is now known as gmann_afk | 19:41 | |
*** frigo has joined #opendev | 19:42 | |
clarkb | fungi: yes | 19:43 |
clarkb | fungi: the image has a bunch of configs in /defaults it then runs the frep templating tool over them to produce working configs and outputs the results to /config. /config is where we bind mount /var/jitsi-meet/web | 19:43 |
clarkb | which essentially means that all of the configs we manage with ansible are ignored | 19:43 |
fungi | i guess we could try to work out how to bind-mount over top of some of the templates? | 19:44 |
clarkb | fungi: the old setup "worked" because we had the same configs in ansible and in our forked docker image | 19:44 |
clarkb | fungi: I don't know how that would work since the container is writing over what we bind mount | 19:44 |
fungi | i mean bind-mount over the templates in /defaults | 19:45 |
clarkb | oh ya that could potentially work | 19:45 |
fungi | replace what the templating tool reads rather than what it wants to write | 19:45 |
clarkb | fwiw I think most things are working except for the doc sharing due to some issues with our env vars that get substituted and due to the meet.conf setting bad headers | 19:45 |
clarkb | we also set the browser warning and a couple of other flags that I think are probably less of a concern | 19:46 |
clarkb | fungi: ya, that may work | 19:46 |
*** frigo has quit IRC | 19:46 | |
clarkb | I'll give that a try with the meet.conf nginx config shortly | 19:47 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Improve meetpad env options for templating https://review.opendev.org/c/opendev/system-config/+/781152 | 19:49 |
clarkb | fungi: ^ that's not what we have been talking about but also seems to be necessary | 19:49 |
corvus | clarkb: where's your zk 4 letter word change? | 19:53 |
corvus | ah found it https://review.opendev.org/780303 | 19:54 |
clarkb | corvus: https://review.opendev.org/c/opendev/system-config/+/780303 | 19:55 |
clarkb | yup | 19:55 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Add mntr to ZK whitelist https://review.opendev.org/c/opendev/system-config/+/781153 | 19:57 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Manage jitsi-meet meet.conf as a template input for the container https://review.opendev.org/c/opendev/system-config/+/781154 | 20:05 |
clarkb | infra-root ^ maybe that will work? | 20:05 |
clarkb | I think ansible undid my manual changes so jitsi meet may not be working again. The first change in that stack should be very safe. The second too | 20:13 |
clarkb | and that will get at least audio and video working again | 20:14 |
*** roman_g has quit IRC | 20:18 | |
*** roman_g has joined #opendev | 20:19 | |
*** roman_g has quit IRC | 20:19 | |
*** roman_g has joined #opendev | 20:19 | |
*** roman_g has quit IRC | 20:20 | |
*** roman_g has joined #opendev | 20:20 | |
*** roman_g has quit IRC | 20:20 | |
*** roman_g has joined #opendev | 20:21 | |
*** roman_g has quit IRC | 20:21 | |
*** roman_g has joined #opendev | 20:22 | |
*** roman_g has quit IRC | 20:22 | |
*** roman_g has joined #opendev | 20:23 | |
*** roman_g has quit IRC | 20:23 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Restore some meetpad settings we had previously set https://review.opendev.org/c/opendev/system-config/+/781156 | 20:25 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Restore meetpad etherpad settings. https://review.opendev.org/c/opendev/system-config/+/781159 | 20:31 |
clarkb | I think ^ that stack may be largely complete now assuming it works and people are happy with it in review | 20:32 |
clarkb | I haven't touched the cleanups yet because I figure that will be easier once we've got something that works | 20:32 |
clarkb | fungi: do you think we should go ahead and approve the first two changes so that things minimally work again? | 20:32 |
fungi | clarkb: yeah, i'll single-core approve them. if corvus gets a chance to look (since the original implementation was his he might have a different take on it) we can always revert | 20:34 |
corvus | fungi, clarkb: ++ go ahead and approve; i'm about to push up some changes i'd like you to review and i'll retro-review that then. | 20:34 |
fungi | awesome | 20:35 |
clarkb | ansible failed on nl02.opendev.org for an odd reason "reason": "Could not find or access '/home/zuul/src/opendev.org/opendev/system-config/playbooks/logrotate' on the Ansible Controller." but nl03 and nl04 did not and I started their launchers which seem to be idling as expected | 20:37 |
clarkb | the hourly nodepool deployment should run once the directly triggered set of jobs finish and I expect it will finish up nl02 at that point. Once that is done I'll remove my WIP from the change that flips from the openstack.org to opendev.org servers for launching | 20:38 |
fungi | "the ansible controller" is bridge.o.o in that sense? | 20:39 |
clarkb | yes I believe so | 20:39 |
fungi | cannot access '/home/zuul/src/opendev.org/opendev/system-config/playbooks/logrotate': No such file or directory | 20:39 |
fungi | doesn't seem to exist on bridge.o.o | 20:39 |
clarkb | yup it doesn't exist | 20:39 |
clarkb | not in the git repo either | 20:40 |
fungi | the directory is there, yeah | 20:40 |
*** whoami-rajat has quit IRC | 20:40 | |
fungi | odd | 20:40 |
clarkb | no it shouldn't be there | 20:40 |
clarkb | at least from what I can tell it isn't a valid path, I expect ansible had a problem doing logrotate role lookup and somehow that bubbled up as an actual problem | 20:41 |
fungi | right, that's what i mean, the parent directory exists and seems to be a current checkout, so not sure why it was looking for something not in the repo | 20:42 |
fungi | but yeah, maybe that was a fallback and the actual error was earlier | 20:42 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Add zookeeper-statsd https://review.opendev.org/c/opendev/system-config/+/781160 | 20:48 |
fungi | ooh! | 20:48 |
corvus | infra-root: ^ if you could take a look at that relatively soonish, i'd like to get that running ASAP so we have baseline data before we start to increase our load on zookeeper with the zuul ha scheduler work | 20:48 |
corvus | infra-root: also note the parent change... actually, let me just squash them. | 20:49 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Add zookeeper-statsd https://review.opendev.org/c/opendev/system-config/+/781160 | 20:50 |
*** roman_g has joined #opendev | 20:50 | |
corvus | infra-root: ^ squashed | 20:50 |
corvus | mordred, tristanC, tobiash: ^ fyi | 20:51 |
fungi | squishy | 20:51 |
clarkb | corvus: I +2'd it, seems to do what it says on the tin. The only thing I am not quite sure of is if there is any risk to exposing those gauges and counters though I expect not | 20:58 |
fungi | it's hard to cram sensitive information into a gauge/counter | 20:59 |
corvus | i don't think they're sensitive (at least, no more sensitive than any other metric we're exposing for anything) | 20:59 |
corvus | clarkb: meetpad changes seem reasonable; it seems like a lot of stuff has changed in the interim; i'm sure we had a good reason for /p/ at the time but it feels like with so much changing, all original assumptions are invalid, so if it works, great. i don't see any red flags in there. | 21:01 |
corvus | clarkb: (but it sure does seem like there's a bunch of stuff that's going to bitrot before the next update unless we can start to get some things upstream in the dockerfiles) | 21:01 |
clarkb | corvus: ya, it feels like rolling forward with the new assumptions makes sense. A lot of config things seem to have gone away (I think because they are moving forward too) | 21:01 |
fungi | also in testing new meetpad, i spotted an exciting feature: experimental end-to-end encryption! | 21:02 |
clarkb | ya I was starting to think about how to make the meet.conf upstreamable | 21:02 |
clarkb | I think the issue is they primarily support people proxying to localhost for etherpad | 21:02 |
clarkb | in that case they don't want Host header to be localhost | 21:02 |
openstackgerrit | Merged opendev/system-config master: Disable xmpp websocket in jitsi meet config https://review.opendev.org/c/opendev/system-config/+/781145 | 21:02 |
clarkb | but we can probably add in another template switch to toggle that | 21:02 |
corvus | yeah, so seems like we'll need some more conditions in their template. yeah that. :) | 21:02 |
openstackgerrit | Merged opendev/system-config master: Improve meetpad env options for templating https://review.opendev.org/c/opendev/system-config/+/781152 | 21:02 |
clarkb | I think using the room name as the doc name may also be the default | 21:03 |
clarkb | at least meetpad seemed to do that for me when I had it working for a short time | 21:03 |
fungi | corvus: any feel for whether we should be trying to get wss support working/proxied? sounded like they added it because it was more stable and better supported by browsers | 21:03 |
clarkb | and then maybe add another toggle for start the meeting with shared doc open | 21:03 |
corvus | clarkb: cool, definitely was not before | 21:03 |
corvus | fungi: i have no current knowledge of that | 21:03 |
clarkb | fungi: I think it may already be since we're using the upstream meet.conf for nginx | 21:03 |
fungi | part of why they made it the new default, i suppose | 21:04 |
clarkb | fungi: I just didn't want to also debug websockets on top of everything else | 21:04 |
fungi | sure, makes sense to test that separately later | 21:04 |
clarkb | it probably makes sense to do another change on top of all that to toggle that var to 1 if you want to do that | 21:04 |
fungi | yeah, let's save that for after the other changes are in | 21:04 |
clarkb | fungi: https://github.com/jitsi/docker-jitsi-meet/blob/master/web/rootfs/defaults/meet.conf#L55-L68 | 21:04 |
fungi | also easier to revert if we find it's terrible before or even during the ptg | 21:04 |
clarkb | there are definitely upsides to relying on upstream more. Upgrades should be easier as we improve our side | 21:06 |
clarkb | but we're in that awkward place where we need to redo things and then catch upstream up to speed on some of our changes | 21:06 |
clarkb | https://community.jitsi.org/t/etherpad-in-docker-version/24432/12 makes me wonder if I'm missing something with how the config templating is intended to work | 21:10 |
clarkb | like we're doing it the hard way now maybe | 21:10 |
clarkb | actually no. I think people are editing the configs by hand after they are templated then simply restarting services without redoing the containers | 21:13 |
clarkb | which is basically what I was doing earlier | 21:13 |
corvus | i think us building the container was masking that issue earlier | 21:14 |
clarkb | ya | 21:14 |
corvus | clarkb: this looks really weird: https://grafana.opendev.org/d/5Imot6EMk/zuul-status?viewPanel=21&orgId=1 | 21:18 |
corvus | clarkb: i'm guessing that's related to launcher turnover | 21:19 |
corvus | the blue line doesn't usually move like that :) | 21:19 |
clarkb | ya it sets max-servers to 0 for the providers I bet it is related | 21:19 |
clarkb | but I checked the two I started and they both were running without any active requests. | 21:19 |
clarkb | I think it may just be a reporting artifact if max-servers is sent as is | 21:20 |
clarkb | (then the two conflicting servers fight over the statsd data) | 21:20 |
*** roman_g has quit IRC | 21:21 | |
clarkb | corvus: fyi https://github.com/jitsi/docker-jitsi-meet/issues/568 and https://github.com/jitsi/docker-jitsi-meet/issues/609 | 21:23 |
corvus | ++ | 21:24 |
clarkb | fungi: posted a response to your question on https://review.opendev.org/c/opendev/system-config/+/781154 | 21:30 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Restore meetpad etherpad settings. https://review.opendev.org/c/opendev/system-config/+/781159 | 21:33 |
clarkb | yay testing ^ it caught a bug with my ansible update | 21:33 |
corvus | i'm going to restart zuul now | 21:40 |
fungi | thanks for the heads up | 21:42 |
corvus | #status log restarted zuul on commit 8a06dc90101c4b5285aaed858a62dadc5ae27868 | 21:46 |
openstackstatus | corvus: finished logging | 21:46 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Add zookeeper-statsd https://review.opendev.org/c/opendev/system-config/+/781160 | 21:48 |
corvus | clarkb, fungi: ^ that was missing a job dependency | 21:48 |
corvus | oh wait one more error | 21:48 |
clarkb | corvus: the nodepool job | 21:49 |
corvus | ok 2 errors :) | 21:49 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Add zookeeper-statsd https://review.opendev.org/c/opendev/system-config/+/781160 | 21:52 |
corvus | the nice thing is that 2 jobs failed; one because of a missing python import in the testinfra file, the other due to the missing job dependency. the one that failed due to python did have the right job dependencies and it ran the image | 21:53 |
fungi | another example of why short-circuiting after the first failure isn't a clear win | 21:54 |
corvus | i think we may not have the latest images pulled; i may need to redo that restart | 21:55 |
corvus | re-restarting | 21:56 |
ianw | reading https://gerrit-review.googlesource.com/Documentation/backup.html has got me thinking that we should store /home/gerrit2 on the new server on LVM, and modify the backup script to snapshot, incrementally backup from that, then remove the snapshot | 22:04 |
clarkb | ianw: what is the advantage to that? seems like it would suffer from the same problems with git packs changing? | 22:05 |
fungi | that's generally safer if you have data changing continuously, but then you have the problem that the fs you're snapshotting may not be quiescent | 22:05 |
ianw | yeah, that's true | 22:06 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Restore meetpad etherpad settings. https://review.opendev.org/c/opendev/system-config/+/781159 | 22:06 |
fungi | there are filesystems designed for fs-level snapshotting, which solves that particular challenge | 22:07 |
ianw | yeah, we could make /home/gerrit2 btrfs | 22:07 |
fungi | as long as you can be sure transactional writes your application makes are atomic | 22:07 |
fungi | otherwise you still have the same problem another layer up | 22:07 |
ianw | no idea if jgit provides that | 22:07 |
clarkb | btrfs needs defragging too | 22:08 |
clarkb | and they aren't upfront about it until your disk fills and you wonder why and then find some esoteric arch wiki article on the subject | 22:08 |
clarkb | note the disk will report many hundreds of gigabytes of disk free when it happens too | 22:08 |
ianw | i'm pretty sure gerrit's suggestion to copy the h2 .db files does not make for consistent backups | 22:09 |
fungi | basically to get true point-in-time backups, your application needs to be designed to accommodate that first and foremost | 22:09 |
clarkb | corvus: are the hourly opendev-prod-hourly enqueued changes supposed to have a commit of 0000000... ? | 22:09 |
clarkb | I'm slightly concerned that that might not do what we expect there :/ | 22:09 |
fungi | well, there is no valid commit, maybe that's a placeholder? | 22:10 |
clarkb | it could be, but I thought it reported something else like master instead | 22:10 |
clarkb | note it hasn't run any jobs yet because the meetpad job has to finish first but it should start shortly | 22:10 |
clarkb | https://opendev.org/opendev/system-config/commit/0000000000000000000000000000000000000000 is what it is linking to, which is why I'm concerned | 22:11 |
clarkb | should I remove ssh keys from bridge? | 22:11 |
clarkb | or just let it happen and see what happens? | 22:11 |
fungi | oh! | 22:11 |
fungi | mmm | 22:11 |
corvus | let's watch for a sec | 22:12 |
clarkb | ok, it is starting a job now | 22:12 |
corvus | but be ready to kill it | 22:12 |
fungi | gitea says that commit doesn't exist, so it should probably just break? | 22:12 |
fungi | was that from a reenqueu? | 22:12 |
fungi | reenqueue | 22:12 |
fungi | git also says that object doesn't exist in my copy | 22:13 |
clarkb | fungi: yes it would have been manually reenqueued after the restart | 22:13 |
corvus | <Branch 0x7f3b5112b7c0 opendev/system-config deletes refs/heads/master from 0000000000000000000000000000000000000000 | 22:13 |
corvus | so it thinks master was deleted | 22:13 |
fungi | right, wondering if that's a bug in the reenqueue script not handling timer triggered pipelines correctly | 22:14 |
corvus | zuul enqueue-ref --tenant openstack --pipeline opendev-prod-hourly --project opendev.org/opendev/system-config --ref refs/heads/master | 22:14 |
corvus | fungi: that seems plausible | 22:14 |
clarkb | it hasn't tried to run ansible on bridge yet according to my tail -F install-ansible.yaml.log | 22:15 |
corvus | i'm guessing that means "enqueue refs/heads/master with oldrev=0 and newrev=0" | 22:15 |
clarkb | corvus: can you manually dequeue it? | 22:15 |
corvus | clarkb: possibly; i'd like to see if it breaks though | 22:15 |
clarkb | ok | 22:15 |
corvus | that way we know if this is dangerous | 22:15 |
corvus | (if it breaks harmlessly, no big deal, if it doesn't then we know there's danger lurking in the enqueue script) | 22:16 |
clarkb | the job doesn't seem to be doing much. The console doesn't show anything and tailing the log file for the playbook it should run shows it hasn't started doing that yet | 22:16 |
ianw | this feels like http://lists.zuul-ci.org/pipermail/zuul-discuss/2019-May/000909.html which I never quite got to the bottom of | 22:16 |
corvus | ianw: yep | 22:17 |
clarkb | it just started according to the console stream | 22:17 |
corvus | ianw: if that's the case we can expect retry limits | 22:17 |
corvus | though im surprised it made it this far | 22:17 |
corvus | clarkb: if it really checks out the null commit; there won't be any ansible files to run | 22:18 |
clarkb | corvus: good point | 22:18 |
corvus | it looks like it checked out master | 22:19 |
*** hashar has quit IRC | 22:19 | |
clarkb | ya the console stream seems to confirm that | 22:19 |
clarkb | and git log in the actual repo dir does as well | 22:19 |
corvus | oh good i was about to check that; glad you did | 22:19 |
corvus | so for whatever reason, it seems to actually be doing the thing we want it to do | 22:20 |
clarkb | ya, I think the only other concern is if that will somehow find the wrong project-config version but I don't think it will since that commit shouldn't affect project-config right? | 22:20 |
corvus | there's at least a minor bug in that if it's going to checkout master, it should report the correct sha. there may be a larger bug in what it's actually deciding to check out (depending on whether that's undefined behavior) | 22:21 |
corvus | but it doesn't seem to be a major bug | 22:21 |
corvus | clarkb: yeah. i'm inclined to say it's looking harmless and we can let it run | 22:21 |
clarkb | ok | 22:21 |
clarkb | ianw: heh luca is saying to use an h2 db now? ugh I feel like this question comes up constantly and I have to go dig in emails and docs and find where it says to not do that | 22:25 |
ianw | clarkb: yeah. tbh i feel performance is not much of a consideration. being able to manage the db not using odd .jars downloaded from the web and using our (now) standard backup mechanisms i think still makes an external db worth it | 22:27 |
clarkb | that is a fair point | 22:27 |
fungi | saying to use h2 for the reviewed files db? | 22:28 |
clarkb | yes | 22:28 |
ianw | fungi: https://groups.google.com/g/repo-discuss/c/_mXoguCr1kw/m/62dIMrK4BAAJ | 22:29 |
fungi | huh, okay... | 22:29 |
clarkb | I swear I just dug up where someone (I thought luca) said not to use h2 | 22:35 |
clarkb | it was corvus asking iirc | 22:35 |
clarkb | but I cannot find it in my logs | 22:35 |
corvus | i think all the notes i took are in etherpads but i don't have links handy | 22:36 |
ianw | the old documentation says not to https://gerrit-documentation.storage.googleapis.com/Documentation/2.8/database-setup.html#createdb_h2 | 22:37 |
clarkb | ya not sure it is urgent, it just bugs me that I distinctly remember looking this up recently and can't manage it again | 22:37 |
clarkb | ianw: that is for the reviewdb though | 22:37 |
ianw | yeah, but all the points about everything but performance i think still count | 22:38 |
ianw | i'm going to manually run the new ansible kerberos playbook on kdc04 just to make sure it is as idempotent as testing suggests it is | 22:39 |
*** gmann_afk is now known as gmann | 22:39 | |
clarkb | ianw: did you do the double run of playbook idea? | 22:42 |
ianw | yep, that passed | 22:42 |
fungi | one of the changes adds it | 22:42 |
clarkb | cool | 22:42 |
clarkb | infra-root we are ready for https://review.opendev.org/c/openstack/project-config/+/780983 the hourly run updated nl02.opendev.org after its failure and I have started the launcher there | 22:50 |
fungi | lgtm | 22:51 |
fungi | same as the previous flip | 22:52 |
fungi | just several times as many launchers | 22:52 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Clean up the old openstack.org nodepool launchers. https://review.opendev.org/c/opendev/system-config/+/781171 | 22:54 |
clarkb | I've WIP'd ^ as we don't want that landing until the new servers take over | 22:54 |
clarkb | fungi: thanks I went ahead and approved it since we've done this once already | 22:55 |
fungi | i'm around to assist if something goes wrong | 22:57 |
clarkb | fungi: cool, I plan to stick around until we've at least got the old ones headed to idle | 22:58 |
clarkb | then probably pick up meetpad stuff again tomorrow | 22:58 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add kerberos-client group https://review.opendev.org/c/opendev/system-config/+/781173 | 23:00 |
ianw | infra-root: ^ that will help with a missing var | 23:01 |
openstackgerrit | Merged openstack/project-config master: Flip nl02-04.openstack.org to nl02-04.opendev.org https://review.opendev.org/c/openstack/project-config/+/780983 | 23:08 |
clarkb | I've got my nodepool launcher tails running on both old and new to see when old has taken over | 23:11 |
clarkb | and new servers have begun taking requests | 23:17 |
ianw | clarkb: what's with the removal of "--skip-network" in https://review.opendev.org/c/opendev/glean/+/781133/1/glean/init/glean.sh ? | 23:25 |
clarkb | ianw: dtantsur|afk pointed out that one of the things making glean slow in their testing is that glean is run twice | 23:26 |
clarkb | ianw: in the old code we ran glean ignoring the network to configure ssh keys and the hostname if config drive was present then ran it again to configure the network regardless of the config drive state | 23:26 |
clarkb | ianw: instead we should be able to run it once with the config drive and set up ssh keys, hostname, and network and once without config drive where we only configure the network | 23:26 |
ianw | ohh, right, ok | 23:27 |
clarkb | and that should improve startup costs on slow qemu vms pretending to be baremetal | 23:27 |
clarkb | from dtantsur|afk logs it was taking like 30 seconds each time or something like that | 23:27 |
ianw | looking at https://opendev.org/opendev/glean/src/branch/master/glean/cmd.py#L1391 we could probably just leave -ssh and --hostname turned on always, it looks like it will gracefully ignore it if there's no configdrive | 23:30 |
clarkb | ianw: that would simplify the code even more | 23:31 |
clarkb | though it's possible users may not always want it? | 23:31 |
clarkb | (we should be fine as we use it through simple-init but maybe some users dont and write their own units/shell scripts?) | 23:32 |
ianw | we could also optimise in glean there to only read meta_data.json once, i think it's doing multiple times | 23:32 |
clarkb | I think nl02 and nl03.openstack.org have gone idle now. I'll give them a little longer then stop their containers | 23:32 |
ianw | yeah each step opens and json loads it | 23:32 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Reduce metadata read/parsing overhead https://review.opendev.org/c/opendev/glean/+/781174 | 23:41 |
clarkb | ianw: fwiw I think dtantsur|afk intended on testing these changes pre merge with the ironic fake baremetal stuff | 23:42 |
clarkb | so we can probably safely wait on approving things until dtantsur|afk shows they help | 23:42 |
ianw | ok, i think they're generally correct at any rate, and hopefully help | 23:42 |
clarkb | yup and I think yours may really help those small systems | 23:43 |
clarkb | it's our own little version of the gtav startup bug | 23:43 |
ianw | haha yes. i await a bounty ;) | 23:44 |
ianw | (i'm sure they spend more on m&m's for the office than they paid that guy :) | 23:44 |
clarkb | I think my favorite part of the bug is that part of why it was a problem was they were parsing the entire manifest of things they would sell you for real money | 23:45 |
fungi | not sure the sweatshop studios provide office m&ms for their slave laborers | 23:47 |
fungi | it's not like they're valve or something | 23:47 |
fungi | unless the m&ms are laced with amphetamines to keep them awake through those 16-hour workdays | 23:48 |
clarkb | now nl04 has gone idle | 23:49 |
clarkb | I've stopped the nodepool launchers on all 3 old hosts now | 23:50 |
clarkb | I've set https://review.opendev.org/c/opendev/system-config/+/781171 active (it was wip) but I'm happy to land that tomorrow morning after we've confirmed the new launchers are all happy | 23:50 |
clarkb | then once that lands we can land the project-config cleanups | 23:50 |
clarkb | and delete the servers | 23:50 |
clarkb | I'll also try to land the meetpad fixes if they haven't gone in by then (I don't think we're in a rush as the service is minimally functional now or should be anyway) | 23:51 |
*** hamalq has quit IRC | 23:57 | |
*** hamalq has joined #opendev | 23:57 |
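
The exchange around 23:32 notes that each glean step opens and JSON-loads `meta_data.json` separately. A minimal sketch of the general fix, parsing the file once and reusing the result, might look like the following. This is not glean's actual code: the function names (`load_metadata`, `get_hostname`, `get_ssh_keys`) and the `hostname` / `public_keys` keys are assumptions chosen for illustration.

```python
import functools
import json


@functools.lru_cache(maxsize=None)
def load_metadata(path):
    """Parse the JSON file at `path` once; repeat calls return the cached dict.

    Hypothetical helper, not part of glean. Callers should treat the
    returned dict as read-only, since every caller shares the same object.
    """
    with open(path) as f:
        return json.load(f)


def get_hostname(path):
    # Hypothetical step: reads one key from the shared, cached metadata.
    return load_metadata(path).get("hostname")


def get_ssh_keys(path):
    # Hypothetical step: a second consumer that triggers no second parse.
    return load_metadata(path).get("public_keys", {})
```

With this shape, calling `get_hostname` and then `get_ssh_keys` on the same path reads and parses the file a single time, which is where the savings on slow VMs would come from.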