openstackgerrit | Merged opendev/system-config master: Add zookeeper-statsd https://review.opendev.org/c/opendev/system-config/+/781160 | 00:08 |
---|---|---|
*** gothicserpent has joined #opendev | 00:14 | |
*** tosky has quit IRC | 00:22 | |
corvus | i'm going to restart zuul again to get that metrics fix | 00:24 |
*** mlavalle has quit IRC | 00:25 | |
corvus | #status log restarted zuul at commit 4bb45bf2a0223c1c624dbd8f44efff207e6b4097 | 00:31 |
openstackstatus | corvus: finished logging | 00:31 |
*** tbarron has joined #opendev | 00:39 | |
ianw | i don't think i realised we were setting up mirror nodes as kerberos clients | 00:53 |
ianw | if they're not authenticating, i don't think they need client setup? | 00:54 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Add kerberos-client group https://review.opendev.org/c/opendev/system-config/+/781173 | 01:00 |
*** stevebaker has quit IRC | 01:03 | |
fungi | ianw: yeah, doesn't seem like it would be needed for read-only use | 01:07 |
corvus | ftr i'm manually running the service-zookeeper playbook because i don't want to wait forever | 01:34 |
fungi | looks like we're getting node_failure from amd64 jobs | 01:34 |
corvus | we're in the middle of a deployment run with a bunch of non-zk related playbooks, so i don't think there's a conflict | 01:34 |
corvus | btw, here's the stat from zuul: https://graphite.opendev.org/?width=586&height=308&target=stats.timers.zuul.tenant.openstack.event_enqueue_processing_time.mean | 01:36 |
corvus | looks like the zk stats are showing up, but they're all zero, likely because the mntr whitelist needs a zk restart to take effect | 01:39 |
corvus | i'm going to start doing rolling zk server restarts now | 01:39 |
corvus | https://graphite.opendev.org/?width=586&height=308&from=-1hours&target=stats.gauges.zk.*.zk_avg_latency | 01:43 |
corvus | that's looking good -- data from all 3 now | 01:44 |
corvus | https://graphite.opendev.org/?width=586&height=308&from=-1hours&target=stats.gauges.zk.*.zk_followers | 01:44 |
corvus | that one is interesting -- it tells us that zk02 is the leader | 01:44 |
corvus | (it's the only one with followers, and there are 2 of them, so all good) | 01:45 |
corvus | we apparently have about 10500 znodes | 01:45 |
*** brinzhang has joined #opendev | 01:59 | |
*** brinzhang_ has quit IRC | 02:00 | |
openstackgerrit | James E. Blair proposed openstack/project-config master: Add ZooKeeper stats to Zuul dashboard https://review.opendev.org/c/openstack/project-config/+/781182 | 02:07 |
corvus | infra-root: looks like we're collecting all the data now; that should let us see it | 02:07 |
*** hamalq has quit IRC | 02:12 | |
*** hemanth_n has joined #opendev | 02:13 | |
openstackgerrit | Merged opendev/system-config master: Add kerberos-client group https://review.opendev.org/c/opendev/system-config/+/781173 | 02:44 |
*** ykarel has joined #opendev | 04:07 | |
*** bhagyashri|rover is now known as bhagyashris | 04:53 | |
*** chandankumar is now known as chkumar|ruck | 04:57 | |
*** brinzhang_ has joined #opendev | 04:57 | |
*** whoami-rajat_ has joined #opendev | 04:58 | |
*** brinzhang has quit IRC | 05:01 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: kerberos-kdc: add realm value https://review.opendev.org/c/opendev/system-config/+/781192 | 05:05 |
*** roman_g has joined #opendev | 05:30 | |
ianw | kevinz: it seems we're seeing some node failures in the arm cloud again ... | 05:42 |
openstackgerrit | Merged opendev/system-config master: kerberos-kdc: add realm value https://review.opendev.org/c/opendev/system-config/+/781192 | 05:45 |
*** roman_g has quit IRC | 05:47 | |
*** ykarel_ has joined #opendev | 06:03 | |
*** ysandeep|away is now known as ysandeep | 06:03 | |
*** ykarel has quit IRC | 06:05 | |
*** marios has joined #opendev | 06:09 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/781019 | 06:13 |
*** ykarel_ has quit IRC | 06:20 | |
*** jaicaa has quit IRC | 06:31 | |
*** calcmandan has joined #opendev | 06:32 | |
*** jaicaa has joined #opendev | 06:33 | |
*** calcmandan has quit IRC | 06:38 | |
*** calcmandan has joined #opendev | 06:48 | |
*** sboyron has joined #opendev | 06:56 | |
*** ykarel has joined #opendev | 07:05 | |
*** ralonsoh has joined #opendev | 07:18 | |
kevinz | ianw: let me check | 07:26 |
*** eolivare has joined #opendev | 07:41 | |
*** rpittau|afk is now known as rpittau | 07:51 | |
kevinz | ianw: re-trigger to see if it's better? | 07:51 |
*** lpetrut has joined #opendev | 07:53 | |
*** jpena|off is now known as jpena | 08:02 | |
ianw | kevinz: i can see a bunch of queued jobs @ https://zuul.openstack.org/status (search -arm64) | 08:08 |
*** amoralej|off is now known as amoralej | 08:08 | |
ianw | kevinz: none running though | 08:09 |
kevinz | ianw: well, double check | 08:09 |
kevinz | ianw: any clue from Zuul side? | 08:12 |
openstackgerrit | Merged openstack/project-config master: Add ZooKeeper stats to Zuul dashboard https://review.opendev.org/c/openstack/project-config/+/781182 | 08:13 |
kevinz | ianw: actually I can easily create the instance from the dashboard | 08:16 |
openstackgerrit | Rico Lin proposed openstack/project-config master: Add ubuntu-focal-arm64-xxlarge for linaro-us https://review.opendev.org/c/openstack/project-config/+/781219 | 08:17 |
*** tosky has joined #opendev | 08:37 | |
*** ykarel is now known as ykarel|away | 08:46 | |
ianw | 2021-03-17 19:58:56,455 ERROR nodepool.driver.openstack.OpenStackProvider: Couldn't consider invalid node | 08:51 |
ianw | Traceback (most recent call last): | 08:53 |
ianw | File "/usr/local/lib/python3.7/site-packages/nodepool/driver/utils.py", line 280, in estimatedNodepoolQuotaUsed | 08:53 |
ianw | if node.type[0] not in provider_pool.labels: | 08:53 |
ianw | IndexError: list index out of range | 08:53 |
ianw | kevinz, clarkb: ^ i think there's something else going on | 08:53 |
kevinz | ianw: wow... | 08:54 |
kevinz | node.type lost? | 08:54 |
ianw | kevinz, 449f31dc-8207-420a-a182-da7c58d962bd 402bcb9d-f40f-4b71-81e9-3aba5b4b432c 74485f01-22e3-4df5-bc23-d408af99cd6f | 08:55 |
ianw | are three id's in error state, maybe you can grep logs and see what happened? | 08:55 |
ianw | then we have a bunch of ready nodes | 08:58 |
ianw | but nothing seems to be starting | 08:58 |
kevinz | Aha | 08:58 |
kevinz | I see. those servers are deleted but still listed in the DB | 08:59 |
kevinz | I will check | 08:59 |
ianw | 2021-03-18 08:25:28,121 DEBUG nodepool.driver.NodeRequestHandler[nl03.opendev.org-PoolWorker.linaro-us-main-42c43c589f774838b16d1aa77a51216d]: [e: 3a9eeb1213d948cfb88e7f2199e1aec8] [node_request: 300-0013320208] Declining node request because pool is disabled by max_servers | 09:05 |
ianw | clarkb: ^ i'm not 100% sure what state the nl openstack/opendev servers are in. i get the feeling somehow arm64 requests are being rejected by one or the other | 09:05 |
ianw | i'm afraid i'm out of time for now to keep poking at this. but kevinz I do think this is mostly something on our side at this point | 09:06 |
*** andrewbonney has joined #opendev | 09:08 | |
*** hashar has joined #opendev | 09:12 | |
*** ykarel|away has quit IRC | 09:25 | |
*** ysandeep is now known as ysandeep|afk | 09:26 | |
zbr | ianw: do you happen to know why we get NODE_FAILURE on arm64? https://zuul.opendev.org/t/zuul/builds?job_name=zuul-jobs-test-ensure-docker-ubuntu-bionic-arm64 | 09:40 |
openstackgerrit | Sorin Sbârnea proposed zuul/zuul-jobs master: Make ensure-docker ubuntu arm64 non voting https://review.opendev.org/c/zuul/zuul-jobs/+/781234 | 09:47 |
kevinz | ianw: thanks, I will check. No worries | 09:53 |
zbr | there is something really weird around those NODE_FAILUREs, as I did not see them as failing. I wonder if this happens only when lots of jobs are created for a single change, as we maybe reach a limit of nodes to be spawned because we have a buildset that requires more than we can allocate. | 09:53 |
*** whoami-rajat_ is now known as whoami-rajat | 09:53 | |
openstackgerrit | Guillaume Chauvel proposed opendev/gear master: Update testing to Python 3.9 and linters https://review.opendev.org/c/opendev/gear/+/780103 | 10:10 |
openstackgerrit | Guillaume Chauvel proposed opendev/gear master: [DNM] gear debug log https://review.opendev.org/c/opendev/gear/+/780422 | 10:10 |
openstackgerrit | Guillaume Chauvel proposed opendev/gear master: WIP: trying to solve gear+zuul+ssl(tls1.3) https://review.opendev.org/c/opendev/gear/+/781238 | 10:10 |
openstackgerrit | Merged opendev/irc-meetings master: Remove usused Zaqar meeting slot https://review.opendev.org/c/opendev/irc-meetings/+/777194 | 10:24 |
openstackgerrit | Guillaume Chauvel proposed opendev/gear master: WIP: trying to solve gear+zuul+ssl(tls1.3) https://review.opendev.org/c/opendev/gear/+/781238 | 10:34 |
openstackgerrit | Guillaume Chauvel proposed opendev/gear master: [DNM] gear debug log https://review.opendev.org/c/opendev/gear/+/780422 | 10:34 |
openstackgerrit | Guillaume Chauvel proposed opendev/gear master: WIP: trying to solve gear+zuul+ssl(tls1.3) https://review.opendev.org/c/opendev/gear/+/781238 | 10:48 |
openstackgerrit | Guillaume Chauvel proposed opendev/gear master: [DNM] gear debug log https://review.opendev.org/c/opendev/gear/+/780422 | 10:48 |
*** dtantsur|afk is now known as dtantsur | 11:02 | |
*** hashar has quit IRC | 11:11 | |
*** ysandeep|afk is now known as ysandeep | 11:19 | |
*** frigo has joined #opendev | 11:58 | |
zbr | kevinz ianw: let me know if there is something i can do to help with arm64 nodes issue. | 12:05 |
*** slaweq has quit IRC | 12:22 | |
*** slaweq has joined #opendev | 12:25 | |
*** jpena is now known as jpena|lunch | 12:30 | |
*** ricolin has joined #opendev | 12:44 | |
*** hemanth_n has quit IRC | 12:49 | |
*** hashar has joined #opendev | 13:00 | |
openstackgerrit | Dmitry Tantsur proposed opendev/glean master: NM: add an early service to configure networking https://review.opendev.org/c/opendev/glean/+/781460 | 13:02 |
*** artom has quit IRC | 13:24 | |
*** jpena|lunch is now known as jpena | 13:26 | |
*** marios is now known as marios|call | 13:32 | |
dtantsur | clarkb, ianw, another glean awesomeness: it may happen that glean starts before the configdrive device is initialized.. | 13:45 |
dtantsur | or to put it another way: I managed to start glean pretty early, now it starts too early :) | 13:46 |
dtantsur | I honestly have no ideas better than After=systemd-udev-settle.service :( | 14:04 |
*** artom has joined #opendev | 14:14 | |
openstackgerrit | Dmitry Tantsur proposed opendev/glean master: NM: add an optional early service to configure networking https://review.opendev.org/c/opendev/glean/+/781460 | 14:23 |
openstackgerrit | Dmitry Tantsur proposed openstack/diskimage-builder master: simple-init: allow passing additional arguments to glean install https://review.opendev.org/c/openstack/diskimage-builder/+/781491 | 14:26 |
*** fressi has quit IRC | 14:29 | |
*** stand has joined #opendev | 14:45 | |
*** marios|call is now known as marios | 14:48 | |
*** mgoddard has quit IRC | 14:49 | |
*** lpetrut has quit IRC | 14:53 | |
clarkb | ianw: kevinz: nl03.opendev.org should be processing linaro cloud node requests now and nl03.openstack.org has been turned off. I'll check on them now though | 15:08 |
fungi | if you can catch an api error response or something, that would help | 15:10 |
clarkb | nl03.openstack.org is still off as expected and nl03.opendev.org is running. nodepool list shows linaro nodes in use with timestamps as recent as 3 minutes ago | 15:12 |
clarkb | I've found a server it is building now that I'll keep an eye on before proceeding further with the old nodepool server cleanups | 15:15 |
fungi | so maybe the node_failure results are intermittent (repeated boot failures/timeouts exceeding nodepool's retry count?) | 15:16 |
fungi | or maybe they're solved now | 15:16 |
clarkb | ya it sounded like kevinz had found a db sync problem | 15:19 |
clarkb | maybe that was the root cause and it just happened to coincide with replacing the launchers | 15:19 |
clarkb | 2021-03-18 15:15:58,673 INFO nodepool.NodeLauncher: [e: 7b4feb53efac4fada316a7ed5839871e] [node_request: 900-0013327236] [node: 0023563053] Node is ready | 15:19 |
clarkb | it's looking happy from here | 15:20 |
openstackgerrit | Dmitry Tantsur proposed opendev/glean master: Allow disabling DHCP fallback https://review.opendev.org/c/opendev/glean/+/781500 | 15:20 |
fungi | probably skim for any third try boot failures/timeouts, but otherwise it's probably fine now | 15:20 |
*** hashar has quit IRC | 15:21 | |
clarkb | fungi: I've responded to your question at https://review.opendev.org/c/opendev/system-config/+/781154 | 15:22 |
fungi | thanks | 15:23 |
clarkb | also I think I'm ready to approve that change if you are | 15:23 |
fungi | and yeah, that explains my confusion. so the /etc/letsencrypt is actually unused. i couldn't find where we were referencing anything under /config but maybe that's baked into part of the nginx config i didn't see | 15:24 |
clarkb | yup skimming for failed launch attempts does show some are failing with 'Detailed node error: No valid host was found. There are not enough hosts available.' in linaro | 15:24 |
fungi | and yes, i'll approve it now | 15:24 |
fungi | if there's a host shortage, we might need to drop our max-servers? | 15:24 |
clarkb | fungi: yes, the upstream image configs for nginx's ssl.conf assume the cert/key will be in the location we put them | 15:25 |
clarkb | fungi: I suspect that this may be more due to the db issues kevinz identified. In theory our quota and max-servers are double covering that aspect for us | 15:25 |
clarkb | but ya we could reduce max-servers too | 15:25 |
clarkb | importantly, I'm not seeing anything that says the problem is on the new nodepool launcher side of things | 15:26 |
clarkb | it seems to be cloud side or config | 15:26 |
fungi | right, that's reassuring | 15:26 |
clarkb | fungi: do you want to do your own double checking or do you think we should be ok to land https://review.opendev.org/c/opendev/system-config/+/781171 too? | 15:32 |
clarkb | once ^ is landed we can land the project-config nodepool yaml cleanups as well as delete the servers | 15:32 |
*** hamalq has joined #opendev | 15:33 | |
clarkb | openstack.exceptions.ResourceTimeout: Timeout waiting for the server to come up. <- we are also seeing this occasionally in linaro | 15:34 |
*** mlavalle has joined #opendev | 15:37 | |
clarkb | dtantsur: why is glean-early used in addition to glean-nm? | 15:37 |
clarkb | dtantsur: is it because all of the interfaces may not be plugged yet? If so I think we should probably write that down | 15:38 |
clarkb | it feels like the sort of thing that will be forgotten | 15:38 |
fungi | clarkb: i'll refresh my memory on which change that is in a sec. in two meetings simultaneously right now | 15:42 |
*** lpetrut has joined #opendev | 15:43 | |
dtantsur | clarkb: you mean, as a comment in the code? | 15:44 |
clarkb | dtantsur: ya, I'm worried that could easily get optimized away similar to my optimization of reducing two glean calls to one | 15:45 |
clarkb | dtantsur: also similarly I think we should capture motivation on https://review.opendev.org/c/opendev/glean/+/781500 because it isn't clear to me why you'd be doing that | 15:45 |
dtantsur | clarkb: agreed. please leave a comment in gerrit, I'll get to it when I finish my testing around | 15:45 |
dtantsur | * testing round | 15:46 |
clarkb | dtantsur: done comments on both changes. Basically I think we've done a bad job capturing motivations and the why of fixes in glean. Early boot is complicated and can vary across platforms etc and writing down more hints will likely help us | 15:48 |
dtantsur | totally | 15:48 |
*** hashar has joined #opendev | 15:49 | |
dtantsur | I wrote a comment, but will also update both patches | 15:50 |
*** ysandeep is now known as ysandeep|away | 15:51 | |
clarkb | dtantsur: hrm my concern with that is dhcp-all-interfaces and simple-init are not intended to be used together | 15:52 |
clarkb | they are supposed to be used XOR each other iirc | 15:52 |
dtantsur | clarkb: this is true, but they actually do play well together | 15:52 |
dtantsur | simply because one is Before=network-pre and the other is After=network | 15:52 |
dtantsur | and both check for an existing configuration file | 15:53 |
clarkb | right but glean will already configure dhcp and ipv6 | 15:53 |
clarkb | even without a config drive | 15:53 |
dtantsur | this is a grey area for me. I know a lot of fixes have been done by our folks to it. | 15:54 |
clarkb | it just feels like we're trying to address an explicitly undesired state | 15:54 |
openstackgerrit | Merged opendev/system-config master: Manage jitsi-meet meet.conf as a template input for the container https://review.opendev.org/c/opendev/system-config/+/781154 | 15:54 |
dtantsur | also yes, the "no, never use DHCP because it's always wrong" aspect is of value | 15:54 |
*** lpetrut has quit IRC | 15:55 | |
dtantsur | does glean, for example, do all this: https://opendev.org/openstack/diskimage-builder/commit/f94508d537432817619932074a5f98ea08d93055 https://opendev.org/openstack/diskimage-builder/commit/5d23d8e6b09c623cb910159dee40df6af2844856 ? | 15:55 |
* dtantsur is sad we even started with dhcp-all-interfaces, not glean, but the ship has sailed | 15:56 | |
clarkb | right I feel like first of all that is a completely unfair question | 15:57 |
clarkb | you arguably added features to the simple tool that should have gone into the complicated tool | 15:57 |
clarkb | but yes I believe glean supports vlans (and config on top of them) and ipv6 via RAs | 15:57 |
clarkb | we certainly use glean + ipv6 relying on router advertisements on our nodes in clouds like limestone | 15:57 |
clarkb | we don't use vlan features, but they were added for ironic users so I would hope they support ironic's use cases | 15:58 |
fungi | and had to work around some nasty nm behaviors relative to that too | 15:58 |
fungi | ipv6 slaac related i mean | 15:58 |
dtantsur | I personally have little control of dhcp-all-interfaces.. I'm trying to somehow navigate this space without telling folks "hey, I'm switching you over to a new DHCP helper, which I hope works the same as before" :) | 15:58 |
clarkb | dtantsur: right and I'm trying to avoid adding too much special cases to glean that it becomes unwieldy | 15:58 |
clarkb | to me, "don't try to fall back to dhcp" risks becoming that, because falling back to something seems sane in almost every situation | 15:59 |
* dtantsur -> meeting, brb | 16:00 | |
clarkb | dtantsur: looking at the dhcp-all-interfaces change for ipv6 more closely I think I see how that is different than what glean does | 16:00 |
clarkb | dtantsur: in that change you are baking the behavior into the image. With glean you provide that config at boot time with a config drive | 16:00 |
clarkb | but otherwise I would expect similar behaviors at runtime | 16:00 |
dtantsur | yeah | 16:01 |
dtantsur | honestly, to me a switch whether to use fallback or not seems quite logical, but I can understand why you see it otherwise | 16:01 |
dtantsur | but otherwise I'll need to take some completely different direction, like calling glean from IPA manually or something like that.. and do nm restart probably? | 16:02 |
clarkb | dtantsur: fwiw I'm not saying no. At this point I'm mostly saying we need to capture the use case because I don't understand it :) | 16:03 |
clarkb | but I did want to point out that it is my understanding we explicitly say do not use those two elements together | 16:03 |
*** mgoddard has joined #opendev | 16:04 | |
clarkb | not having a fallback likely makes sense without dhcp-all-interfaces; I'm just struggling to understand it. I thought of another potential reason, which may be to speed up time to boot failure so you can retry more quickly | 16:04 |
clarkb | but I don't know if that would help in practice | 16:04 |
dtantsur | the security aspect too (not rogue DHCP on a remote edge node) | 16:05 |
clarkb | ya, though you could subvert config drive in that case too | 16:05 |
clarkb | (plug a usb device in with a fat32 config-2 partition) | 16:06 |
clarkb | but maybe the risk of ^ is lower than the risk of plugging into a switch. As mentioned I'm not against it but think we should capture the why and what better as we seem to struggle with that in glean then have trouble making changes later | 16:07 |
dtantsur | yep, I will | 16:07 |
clarkb | woot etherpads appear to be working again on meetpad though you have to explicitly open the shared document. The flag to open it by default is in a change that I suppose we can keep iterating through now | 16:15 |
clarkb | I've approved the next meetpad fix | 16:16 |
clarkb | this one disables p2p and mutes video on connection | 16:16 |
clarkb | the one after that should open the shared doc by default | 16:17 |
*** frigo has quit IRC | 16:21 | |
*** Tengu has quit IRC | 16:23 | |
fungi | clarkb: sorry for the delay, i've approved 781171 now since the servers are confirmed idle | 16:30 |
clarkb | thanks | 16:32 |
*** roman_g has joined #opendev | 16:39 | |
*** mlavalle has quit IRC | 16:56 | |
openstackgerrit | Dmitry Tantsur proposed opendev/glean master: NM: add an optional early service to configure networking https://review.opendev.org/c/opendev/glean/+/781460 | 16:58 |
dtantsur | okay, the first explanation inserted | 16:58 |
openstackgerrit | Merged opendev/system-config master: Restore some meetpad settings we had previously set https://review.opendev.org/c/opendev/system-config/+/781156 | 16:59 |
openstackgerrit | Merged opendev/system-config master: Clean up the old openstack.org nodepool launchers. https://review.opendev.org/c/opendev/system-config/+/781171 | 16:59 |
*** mlavalle has joined #opendev | 16:59 | |
roman_g | clarkb thank you for the logs! I've tried to filter kna1 on logstash, but seems that it does not get collected. http://logstash.openstack.org/#/dashboard/file/logstash.json query is node_provider:(\"airship-citycloud-kna1\") . What am I missing here? | 17:00 |
*** marios is now known as marios|out | 17:01 | |
clarkb | dtantsur: thanks that really helps. Though I noticed one thing I asked a question about inline | 17:02 |
dtantsur | clarkb: clean-nm@ is not enabled explicitly because it's enabled by udev | 17:02 |
dtantsur | glean-nm of course | 17:02 |
clarkb | dtantsur: aha | 17:02 |
dtantsur | glean-early is a normal service and has to be enabled | 17:03 |
clarkb | roman_g: try node_provider:"airship-kna1" | 17:03 |
openstackgerrit | Dmitry Tantsur proposed opendev/glean master: Allow disabling DHCP fallback https://review.opendev.org/c/opendev/glean/+/781500 | 17:04 |
dtantsur | tried explaining this one as clearly as possible ^^ | 17:04 |
clarkb | dtantsur: ya that helps, thanks | 17:04 |
*** marios|out has quit IRC | 17:08 | |
*** rpittau is now known as rpittau|afk | 17:11 | |
*** eolivare has quit IRC | 17:14 | |
*** jpena is now known as jpena|off | 17:16 | |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/781159 is the last meetpad change. You've already reviewed it so I guess it's a question of whether or not we want to get a second reviewer on it before approving | 17:16 |
clarkb | On the nodepool side of things https://review.opendev.org/c/openstack/project-config/+/780984 has been marked active. Its child should be ready for review now too | 17:17 |
*** amoralej is now known as amoralej|off | 17:18 | |
fungi | clarkb: oh, yep, i forgot i reviewed that. i'll go ahead and approve it | 17:22 |
clarkb | fwiw I discovered why my android jitsi client didn't break when my browser one did due to the xmpp websocket issue | 17:22 |
fungi | looks like corvus was already +2 on an earlier patchset anyway | 17:22 |
clarkb | the apps don't use websockets so continued to use the old config happily | 17:22 |
roman_g | clarkb works! Thank you. | 17:25 |
clarkb | looking at the prosody and web images config templates I'm fairly certain that if we toggle ENABLE_XMPP_WEBSOCKETS over to 1 we should get a working setup | 17:26 |
clarkb | https://github.com/jitsi/docker-jitsi-meet/commit/d747bfbe6b72dd86789f61a97372b94c2c19677f is the commit that added all that | 17:28 |
*** hashar has quit IRC | 17:43 | |
openstackgerrit | Merged opendev/git-review master: Add missing -h to manpage and remove -c from it https://review.opendev.org/c/opendev/git-review/+/781053 | 17:50 |
*** dtantsur is now known as dtantsur|afk | 17:55 | |
openstackgerrit | Merged opendev/system-config master: Restore meetpad etherpad settings. https://review.opendev.org/c/opendev/system-config/+/781159 | 18:05 |
openstackgerrit | Elod Illes proposed openstack/project-config master: Use nodejs8-publish-to-npm for monasca-grafana-datasource https://review.opendev.org/c/openstack/project-config/+/781536 | 18:16 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: More jitsi meet config cleanups https://review.opendev.org/c/opendev/system-config/+/781544 | 19:01 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Enable jitsi-meet xmpp websockets https://review.opendev.org/c/opendev/system-config/+/781545 | 19:01 |
clarkb | fungi: ^ the first change there was inspired by your questions :) I tried to clean up any bits that may cause confusion in the config management or on the server (because ansible and container will fight over the content of configs) | 19:02 |
clarkb | the second thing is a bit scarier, but has a straightforward revert path and apparently it performs better | 19:02 |
*** andrewbonney has quit IRC | 19:03 | |
fungi | yeah, i'd love to test something like that asap, and then *if* we discover it's causing problems before or during the ptg, we can revert quickly | 19:03 |
*** priteau has quit IRC | 19:08 | |
fungi | but hopefully it's an improvement, and leads to a better experience for users than previously | 19:14 |
*** openstackgerrit has quit IRC | 19:34 | |
fungi | aha! ^ | 19:35 |
fungi | if you have weechat smart filters on then you probably don't see it | 19:35 |
fungi | <-- openstackgerrit (~openstack@eavesdrop01.openstack.org) has quit (Quit: Changing servers) | 19:35 |
fungi | that happened the *exact* moment i pushed https://review.opendev.org/781551 | 19:36 |
fungi | i'll restart the bot and then look into possible fixes | 19:36 |
clarkb | wow nice sleuthing | 19:37 |
fungi | #status log Restarted the gerritbot container after it disconnected from Freenode after trying to handle a very long commit message subject | 19:38 |
openstackstatus | fungi: finished logging | 19:38 |
fungi | yeah, the traceback in the log doesn't give any hint it's related, there is no disconnect or anything back from the irc server side about it | 19:38 |
fungi | but if you match the exit time up with the logs, you find a traceback for a very long commit message subject each time we get the "Quit: Changing servers" reason | 19:39 |
fungi | "irc.client.MessageTooLong: Messages limited to 512 bytes including CR/LF" | 19:43 |
fungi | that's the exception we hit | 19:43 |
fungi | but it's wrapped in a try with except Exception that does a self.connection.reconnect() | 19:44 |
fungi | i'm guessing the reconnect just never happens | 19:44 |
fungi | added by https://review.openstack.org/257174 in 2015 | 19:45 |
fungi | so it's not new | 19:45 |
fungi | i need to switch to dinner prep, but will ponder this | 19:45 |
*** openstackgerrit has joined #opendev | 19:46 | |
openstackgerrit | Merged opendev/system-config master: More jitsi meet config cleanups https://review.opendev.org/c/opendev/system-config/+/781544 | 19:46 |
fungi | debugging may be deeper in the irc module, but since we can reproduce the failure we should be able to suss it out | 19:46 |
clarkb | for the nodepool launchers I was going to double check with ianw when ianw's day starts then delete the old servers | 19:46 |
fungi | sounds good to me | 19:46 |
*** roman_g has quit IRC | 19:53 | |
fungi | pondering while i cook, i guess we have to intertwined bugs in gerritbot. one is that the reconnect method seems to only disconnect, the other is that we shouldn't bother reconnecting on a irc.client.MessageTooLong exception | 20:10 |
fungi | two intertwined bugs | 20:10 |
clarkb | with message too long the server will just chomp the end of the message right? so ya shouldn't need to reconnect | 20:11 |
fungi | the server discards it | 20:12 |
fungi | or rather the server never gets it because the irc module isn't sending it | 20:12 |
clarkb | ah | 20:12 |
fungi | the irc module is raising an exception on send | 20:12 |
fungi | not actually sending | 20:12 |
fungi | we could act on that to chomp the message down to some safe length | 20:13 |
fungi | or we could just ignore irc.client.MessageTooLong exceptions and be satisfied that they don't get sent | 20:13 |
clarkb | ignoring it seems like a better behavior than the current one | 20:13 |
fungi | better than trying to reconnect, yes. irc.client.MessageTooLong shouldn't be taken as an indication that the connection is broken | 20:15 |
fungi | now, troubleshooting why reconnection is broken will be the harder bit | 20:16 |
*** roman_g has joined #opendev | 20:17 | |
*** stevebaker has joined #opendev | 20:18 | |
clarkb | meetpad should get the last two updates applied to it in the next little bit. I'll double check on it afterwards | 20:20 |
ianw | clarkb: o/ ... reading | 20:35 |
clarkb | ianw: tldr is it seems the new launchers are fine. But linaro was struggling due to no valid host found and launch timeouts | 20:37 |
clarkb | I don't think that is related to the opendev server swap out | 20:37 |
clarkb | if you agree we've landed the change to pull the old servers from inventory and my next step is to delete the servers then land https://review.opendev.org/c/openstack/project-config/+/780984 and its child | 20:37 |
ianw | ok, i couldn't find what i thought was a smoking gun log message for why nodes weren't coming up, but if they are now, that's good :) | 20:38 |
ianw | looks like plenty are running so LGTM | 20:39 |
clarkb | ya most seem to launch, but have the occasional cloud side error still | 20:39 |
clarkb | thanks for confirming. I'll look at cleaning up the old servers shortly | 20:39 |
clarkb | fungi: ianw I think we can go ahead and land https://review.opendev.org/c/openstack/project-config/+/780984 and its child now too fwiw | 20:41 |
ianw | it would be super to get a second arm64 cloud for backup; if anyone is reading this with access to such a thing :) | 20:41 |
*** Tengu has joined #opendev | 20:42 | |
clarkb | hrm I think I added a buggy config.js to meetpad; used a : instead of a = | 20:42 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Fix jitsi config.js https://review.opendev.org/c/opendev/system-config/+/781565 | 20:44 |
clarkb | fungi: ianw ^ I have not tested that yet but I suspect that is the fix | 20:44 |
clarkb | there is one more infra-prod-run-meetpad queued up and when that is done I'll manually do ^ and confirm | 20:45 |
*** sboyron has quit IRC | 20:46 | |
ianw | everything else config.blah is an = | 20:46 |
clarkb | ok manually applied that fix and restarted and it is happy now. Sorry for that miss | 20:51 |
clarkb | they converted from using json like data structure to var assignments | 20:51 |
clarkb | and when I copy pasta'd I failed to convert | 20:51 |
ianw | clarkb / dtantsur : ok, i sort of buy the theory "udev hasn't activated glean@INTERFACE, so network-pre doesn't know it has anything to wait for, and goes ahead and starts NM anyway" | 20:52 |
ianw | and i can see that NM doesn't want to, by default, wait for udev-settle | 20:52 |
clarkb | changes from https://review.opendev.org/c/opendev/system-config/+/781544 appear to have applied cleanly (redirect is present etc) | 20:53 |
ianw | i wonder if instead of running glean in that early service -- does it just need to be some dummy thing that makes network-pre depend on udev-settle | 20:53 |
clarkb | ianw: oh that is an interesting idea | 20:53 |
openstackgerrit | Merged openstack/project-config master: Remove unused nl0X.openstack.org config files https://review.opendev.org/c/openstack/project-config/+/780984 | 20:53 |
openstackgerrit | Merged openstack/project-config master: Update nodepool zk configs to be a bit less confusing https://review.opendev.org/c/openstack/project-config/+/780985 | 20:53 |
ianw | it feels like that should not be an option -- if this theory is the case, it's only because things are generally faster on hardware that we're getting things in order | 20:55 |
clarkb | I'm deleting old nodepool launchers now | 20:56 |
clarkb | #status log Replaced nl01-04.openstack.org with new Focal nl01-04.opendev.org hosts. | 20:58 |
openstackstatus | clarkb: finished logging | 20:58 |
clarkb | ianw: re glean, normally I really like to do boot tests of changes like this locally to build confidence in them. However I think our integration testing covers things pretty well at this point. Do you think we should go ahead and start landing some of these? | 21:03 |
clarkb | normally == $yearsago when I did more glean work :) | 21:03 |
ianw | clarkb: thinking even more, i think a better way might even be to drop an override file for NetworkManager to make it depend on udev-settle | 21:04 |
clarkb | ya I guess simple-init could do that | 21:06 |
clarkb | since it is modifying the image already | 21:06 |
clarkb | and is probably the cleanest way to express the ordering desired | 21:06 |
ianw | i just have to run morning errands, but yeah i think we can merge the more obvious things as they are gate tested by the boot tests | 21:07 |
ianw | perhaps there are now other ways to find active interfaces that aren't udev too, that wouldn't surprise me | 21:08 |
fungi | well, udev just reacts to kernel events | 21:08 |
clarkb | ianw: I doubt it since udev is part of systemd now | 21:08 |
fungi | udevd does, i mean | 21:08 |
fungi | anything can listen for those | 21:08 |
clarkb | or at least the "proper" systemd method is likely udev | 21:09 |
clarkb | I'd like to fully restart meetpad after the above fix lands and is applied. Just to double check the state of config mgmt produces a happy state. Then I think we should go ahead and land the xmpp websockets change and revert if it causes problems | 21:11 |
fungi | sounds good. i'm transitioning to evening relaxation but am happy to help test that when it deploys | 21:12 |
clarkb | great and thanks | 21:12 |
*** roman_g has quit IRC | 21:18 | |
*** ralonsoh has quit IRC | 21:20 | |
*** sshnaidm is now known as sshnaidmoff | 21:21 | |
*** sshnaidmoff is now known as sshnaidm|off | 21:21 | |
*** whoami-rajat has quit IRC | 21:26 | |
*** openstackstatus has quit IRC | 21:26 | |
openstackgerrit | Merged opendev/system-config master: Fix jitsi config.js https://review.opendev.org/c/opendev/system-config/+/781565 | 21:30 |
*** openstackstatus has joined #opendev | 21:31 | |
*** ChanServ sets mode: +v openstackstatus | 21:31 | |
*** openstack has joined #opendev | 21:33 | |
*** ChanServ sets mode: +o openstack | 21:33 | |
*** openstack has quit IRC | 21:35 | |
*** openstack has joined #opendev | 21:36 | |
*** ChanServ sets mode: +o openstack | 21:36 | |
fungi | i've joined the isitbroken room | 21:38 |
clarkb | yup I'm there too don't hear you though | 21:38 |
*** openstack has joined #opendev | 21:45 | |
*** ChanServ sets mode: +o openstack | 21:45 | |
*** openstack has quit IRC | 21:48 | |
*** openstack has joined #opendev | 21:58 | |
*** irclogbot_3 has joined #opendev | 21:58 | |
*** openstack has quit IRC | 21:59 | |
*** openstack has joined #opendev | 22:01 | |
*** ChanServ sets mode: +o openstack | 22:01 | |
openstackgerrit | Merged opendev/system-config master: Enable jitsi-meet xmpp websockets https://review.opendev.org/c/opendev/system-config/+/781545 | 22:09 |
*** iurygregory has quit IRC | 22:09 | |
*** iurygregory has joined #opendev | 22:09 | |
fungi | any feel for when we should be looking at a gerrit 3.3 upgrade? i guess the dstat metrics are to help us decide that | 22:16 |
clarkb | ya that and gatling git, but I got distracted (if anyone else wants to poke at gatling git feel free) | 22:18 |
clarkb | there were at least 2 people on the repo discuss ml who ended up reverting and it might be good to see if we can find out what caused them to do that | 22:18 |
fungi | yeah, i don't have fond memories of rolling back gerrit upgrades. it's pain, pure and simple | 22:26 |
clarkb | ok xmpp change has deployed. It only restarted 2 of the 4 containers. I think I'll do another full restart to ensure that if/when that happens we are still good | 22:29 |
clarkb | and I get the "you have been disconnected" message so something must be missing | 22:31 |
clarkb | I'll get a revert going and then manually fix it | 22:31 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Revert "Enable jitsi-meet xmpp websockets" https://review.opendev.org/c/opendev/system-config/+/781577 | 22:33 |
clarkb | I've manually applied ^ and will go ahead and approve that | 22:34 |
fungi | thanks for checking it | 22:47 |
openstackgerrit | Ian Wienand proposed opendev/glean master: Override NetworkManager to wait for udev-settle https://review.opendev.org/c/opendev/glean/+/781580 | 22:48 |
ianw | clarkb: ^ see if that tests ok or introduces more problems ... not sure i've even convinced myself on that one :) | 22:48 |
ianw | i'm assuming that udev starting a service "rebuilds" the dependency tree in systemd | 22:51 |
clarkb | dtantsur|afk: ^ fyi since you've been testing your end | 22:57 |
clarkb | ianw: ya I think so | 22:58 |
clarkb | re rebuilding the tree | 22:58 |
clarkb | ianw: I'm not sure I grok the After declaration there though | 23:01 |
clarkb | ianw: does it need a Before=NetworkManager.service too? | 23:01 |
ianw | umm, the override i installed was for NetworkManager, to make it wait for udev settle | 23:02 |
ianw | at least, that's what i was trying to do :) | 23:02 |
fungi | notification says rackspace opened a ticket to warn us about network maintenance in dfw. i'm not where i can log into their dashboard to read the ticket right now, but i'll try to remember to take a look tomorrow if nobody has time before then | 23:02 |
clarkb | ianw: oh I see the target file dest is what is important there | 23:02 |
clarkb | ianw: sorry I missed it because the source file name is arbitrary | 23:02 |
clarkb | ianw: ya I think that is correct | 23:03 |
clarkb | so ya if that passes our testing and dtantsur|afk reports it fixes things then I think that may be the cleanest option | 23:03 |
ianw | fungi: that's a weird account | 23:06 |
fungi | it's got the right project id though | 23:07 |
ianw | the credentials listed for the account that ticket was filed against don't appear to allow me to log in | 23:07 |
clarkb | is this the one that we had shared with the foundation that got split? | 23:09 |
fungi | oh, i think that's a swift-only account i created for testing authenticated write access to a container for story attachments on storyboard-dev | 23:09 |
clarkb | ah | 23:09 |
fungi | it's the same tenant, the user just has limited access to only be able to write to a single swift container | 23:09 |
fungi | wonder why they picked that on | 23:09 |
fungi | that one | 23:10 |
ianw | i can't see the ticket logging in as the "main" user | 23:10 |
fungi | bizarre | 23:10 |
clarkb | maybe it only affects swift? | 23:10 |
fungi | the creds for that account should be in our usual list and at least in hiera, but anyway i can look at it tomorrow, i doubt it's important in that case | 23:11 |
openstackgerrit | Merged opendev/system-config master: Revert "Enable jitsi-meet xmpp websockets" https://review.opendev.org/c/opendev/system-config/+/781577 | 23:17 |
ianw | fungi: yeah, i pulled up the creds but i guess that user can't log into the UI to see the tickets | 23:19 |
fungi | right, probably not | 23:19 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: kerberos-kdc: quote some integers to avoid string/int confusion https://review.opendev.org/c/opendev/system-config/+/781585 | 23:38 |
ianw | ok, did a couple of runs on the kdc servers and all looking good; replication working and everything listening | 23:44 |
clarkb | nice | 23:45 |
clarkb | I just rechecked the meetpad server and it seems happy so I think we can call that done for now | 23:45 |
ianw | #status log all afs and kerberos servers migrated to focal, under ansible control | 23:48 |
openstackstatus | ianw: finished logging | 23:48 |
fungi | yay! excellent | 23:52 |
clarkb | ianw: were the kdcs done in place too? | 23:52 |
clarkb | and yes yay | 23:53 |
fungi | those are new servers, right | 23:53 |
fungi | ? | 23:53 |
ianw | clarkb: yeah, i just left them; since they're serving OPENSTACK.ORG domain it didn't seem logical to move to opendev.org and new servers | 23:53 |
fungi | oh, right | 23:53 |
fungi | they were already 03 and 04 | 23:53 |
clarkb | fungi: ya last round we did new servers and then afterwards discussed maybe not doing that again :) | 23:53 |
clarkb | iirc something didn't catch on that things had moved until we had more forcefully made them check again | 23:54 |
ianw | however, if we wanted a couple of new servers to do an OPENDEV.ORG domain ... that would largely be trivial and it would even bootstrap itself with no intervention | 23:55 |
clarkb | ya the hard part would be migrating to it | 23:56 |
clarkb | unless we also just started over on the target (and copied files in afs as necessary) | 23:57 |
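
The NetworkManager override discussed above (21:04 and 23:01 to 23:03) can be sketched roughly as follows. This is not the content of the glean change under review, only a minimal illustration of the idea: a systemd drop-in placed in NetworkManager's `.service.d` directory that orders the service after `systemd-udev-settle.service`. The drop-in file name and both helper functions are hypothetical. As noted in the discussion, it is the destination directory that ties the override to NetworkManager, so the file name itself is arbitrary.

```python
from pathlib import Path

# Drop-in directories follow the systemd "<unit>.d" convention.
DROPIN_DIR = "NetworkManager.service.d"
# Hypothetical file name; any "*.conf" name in the directory is read.
DROPIN_NAME = "wait-for-udev-settle.conf"


def render_dropin() -> str:
    """Return drop-in text that orders the unit after udev settle."""
    lines = [
        "[Unit]",
        # Wants= pulls the settle unit into the boot transaction;
        # After= makes NetworkManager start only once it has finished.
        "Wants=systemd-udev-settle.service",
        "After=systemd-udev-settle.service",
        "",
    ]
    return "\n".join(lines)


def install_dropin(systemd_dir: Path) -> Path:
    """Write the drop-in under systemd_dir and return its path.

    systemd_dir would be something like /etc/systemd/system inside
    the image being built; it is a parameter so the sketch can be
    exercised against a scratch directory.
    """
    target_dir = systemd_dir / DROPIN_DIR
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / DROPIN_NAME
    target.write_text(render_dropin())
    return target
```

Because the override lives in NetworkManager's own drop-in directory, no `Before=NetworkManager.service` is needed on any other unit; the ordering is expressed entirely from NetworkManager's side.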