clarkb | I've started trying to figure out how the inmotion cloud vip is working. I set up an nc -l 54321 < file on each of the three hosts, and the content of each file was the last octet of that host's actual IP | 00:15 |
---|---|---|
clarkb | then I requested port 54321 against the VIP from home, review-test, and a third host in ovh | 00:15 |
clarkb | each time I got back a consistent IP address. That makes me think that the VIP is doing a 1:1 mapping currently | 00:16 |
clarkb | and they aren't doing higher level proxying or load balancing | 00:16 |
clarkb | as a sanity check I don't see the vip directly on an interface on that host either | 00:17 |
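A minimal sketch of the probe described above, assuming an OpenBSD-style netcat (traditional netcat wants "-l -p 54321") and placeholder addresses:

```sh
# on each of the three backend hosts: serve that host's last IP octet once
echo "23" > /tmp/octet           # e.g. for a host whose IP ends in .23
nc -l 54321 < /tmp/octet

# from several different clients, connect through the VIP
nc 203.0.113.10 54321            # VIP address here is a placeholder

# every client getting the same octet back suggests a 1:1 address mapping
# rather than per-connection proxying or load balancing
```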
*** gmann_afk is now known as gmann | 00:19 | |
clarkb | ok cool this is a kolla managed vip | 00:19 |
clarkb | I see it in the kolla config for that host | 00:20 |
clarkb | I still have no idea how it is functionally working but that is a start | 00:20 |
clarkb | aha, apparently ifconfig isn't showing me all the addresses on the interface kolla is using but ip addr does | 00:25 |
clarkb | cool so now I can see that the VIP address is present on one of the three hosts | 00:25 |
clarkb | that confirms it is effectively 1:1 | 00:25 |
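Worth noting for anyone retracing this: ifconfig predates iproute2 and skips secondary addresses added with "ip addr add" (which is how keepalived/kolla attach VIPs), so only ip sees them. The interface name below is an assumption:

```sh
ip addr show dev eth0    # lists every address on the interface, VIP included
ip -br addr              # brief form, one line per interface
ifconfig eth0            # may silently omit the kolla-managed VIP
```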
clarkb | and then if I exec into the haproxy container and look at /etc/haproxy/services.d/ contents i see haproxy listening on the vip | 00:26 |
clarkb | reading the kolla docs I think we can set some config options (specifically kolla_enable_tls_external, and kolla_external_fqdn_cert) then rerun kolla and that should update the haproxy configs with a cert? | 00:28 |
clarkb | I'll ask them if rerunning kolla is something they expect people to do | 00:29 |
clarkb | the upside to having kolla do that for us is it can be sure to get all the necessary ports in haproxy | 00:33 |
clarkb | but we could put another proxy in front of the haproxy from kolla and do it ourselves | 00:33 |
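A hedged sketch of the kolla-ansible route; kolla_enable_tls_external and kolla_external_fqdn_cert are the options named above, while the cert path and inventory location here are illustrative guesses:

```sh
# in /etc/kolla/globals.yml (illustrative values):
#   kolla_enable_tls_external: "yes"
#   kolla_external_fqdn_cert: "/etc/kolla/certificates/haproxy.pem"

# then rerun kolla so it regenerates the haproxy config with TLS on all ports
kolla-ansible -i /etc/kolla/inventory reconfigure --tags haproxy
```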
openstackgerrit | Merged opendev/system-config master: refstack: move non-private variables to public https://review.opendev.org/c/opendev/system-config/+/774587 | 00:37 |
openstackgerrit | Merged opendev/system-config master: Setup OpenInfra-Board Channel https://review.opendev.org/c/opendev/system-config/+/774706 | 00:40 |
clarkb | I think I have a general idea of how to rerun kolla with an updated config. I expect that to take some time to simply execute and dinner is happening momentarily. I'll see if my questions to inmotion have been answered tomorrow and take it from there | 00:55 |
*** rchurch has quit IRC | 01:05 | |
*** mlavalle has quit IRC | 01:07 | |
*** rchurch has joined #opendev | 01:07 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: borg-backup-server: run a weekly backup verification https://review.opendev.org/c/opendev/system-config/+/774753 | 01:27 |
openstackgerrit | Merged opendev/system-config master: refstack: add production image and deployment jobs https://review.opendev.org/c/opendev/system-config/+/774586 | 01:28 |
openstackgerrit | Merged opendev/system-config master: borg-backup-server: add script for pruning borg backups https://review.opendev.org/c/opendev/system-config/+/774561 | 01:28 |
openstackgerrit | Merged opendev/system-config master: borg-backup-server: volume space monitor https://review.opendev.org/c/opendev/system-config/+/774564 | 01:28 |
openstackgerrit | Merged opendev/system-config master: doc: update backup instructions https://review.opendev.org/c/opendev/system-config/+/774570 | 01:29 |
openstackgerrit | Merged opendev/system-config master: borg testing: catch stdout and stderr from test prune correctly https://review.opendev.org/c/opendev/system-config/+/774745 | 01:33 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: refstack: trigger image upload https://review.opendev.org/c/opendev/system-config/+/774756 | 02:13 |
*** artom has quit IRC | 02:13 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: borg-backup-server: run a weekly backup verification https://review.opendev.org/c/opendev/system-config/+/774753 | 02:39 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: openafs-<db|file>-server: fix role name https://review.opendev.org/c/opendev/system-config/+/774761 | 02:50 |
openstackgerrit | Merged opendev/system-config master: borg-backup: save PIPESTATUS before referencing https://review.opendev.org/c/opendev/system-config/+/774588 | 03:01 |
*** rchurch has quit IRC | 03:14 | |
*** rchurch has joined #opendev | 03:15 | |
*** artom has joined #opendev | 03:19 | |
*** hemanth_n has joined #opendev | 03:25 | |
openstackgerrit | Merged opendev/system-config master: refstack: trigger image upload https://review.opendev.org/c/opendev/system-config/+/774756 | 03:30 |
*** diablo_rojo has quit IRC | 03:41 | |
*** dviroel has quit IRC | 04:07 | |
*** lamt has quit IRC | 04:25 | |
*** mrunge has quit IRC | 04:37 | |
*** dmellado has quit IRC | 04:37 | |
*** JohnnyRainbow has quit IRC | 04:37 | |
*** ykarel has joined #opendev | 04:38 | |
*** mrunge has joined #opendev | 04:42 | |
*** dmellado has joined #opendev | 04:42 | |
*** JohnnyRainbow has joined #opendev | 04:42 | |
*** Eighth_Doctor has quit IRC | 04:47 | |
*** ysandeep|away is now known as ysandeep|rover | 04:49 | |
*** mordred has quit IRC | 04:50 | |
*** whoami-rajat__ has joined #opendev | 04:57 | |
*** openstackstatus has quit IRC | 04:58 | |
*** openstack has joined #opendev | 04:59 | |
*** ChanServ sets mode: +o openstack | 04:59 | |
*** ysandeep|rover is now known as ysandeep|brb | 05:13 | |
*** mordred has joined #opendev | 05:19 | |
*** Eighth_Doctor has joined #opendev | 05:22 | |
ianw | clarkb / kopecmartin : i have run a mysqldump of the refstack db and imported it into https://refstack01.openstack.org | 05:26 |
ianw | clarkb / kopecmartin : to me, it looks like things are not working. | 05:30 |
ianw | SQL connection failed. 10 attempts left.: oslo_db.exception.DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on 'localhost' ([Errno 111] Connection refused)") | 05:32 |
ianw | that's in the container | 05:32 |
*** redrobot4 has joined #opendev | 05:34 | |
*** ysandeep|brb is now known as ysandeep|rover | 05:35 | |
*** redrobot has quit IRC | 05:37 | |
*** redrobot4 is now known as redrobot | 05:37 | |
*** ykarel has quit IRC | 05:55 | |
*** marios has joined #opendev | 05:55 | |
*** ykarel has joined #opendev | 06:12 | |
ianw | i think it probably needs something to wait for the mysql container to be alive | 06:14 |
ianw | but, i've hacked in something like that and it still doesn't work | 06:14 |
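The 'localhost' in that DBConnectionError is itself a clue: inside a container, localhost is the container's own loopback, so the app generally needs the compose service name instead. Some hedged checks (container and service names here are guesses):

```sh
docker ps --format '{{.Names}}\t{{.Status}}'      # is the mariadb container up at all?
docker logs refstack_mariadb_1 --tail 50          # any db init errors?
docker exec -it refstack_api_1 \
    mysql -h mariadb -u refstack -p -e 'SELECT 1' # reachable by service name?
```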
openstackgerrit | Ian Wienand proposed opendev/system-config master: refstack: create database storage area https://review.opendev.org/c/opendev/system-config/+/774773 | 06:35 |
ianw | clarkb / kopecmartin : ^ that's a start i guess ... out of time for today | 06:35 |
*** levalicious has joined #opendev | 07:19 | |
*** eolivare has joined #opendev | 07:32 | |
*** rpittau|afk is now known as rpittau | 07:51 | |
*** ysandeep|rover is now known as ysandeep|lunch | 07:54 | |
*** hashar has joined #opendev | 07:54 | |
*** ralonsoh has joined #opendev | 07:55 | |
*** sboyron has joined #opendev | 08:02 | |
*** andrewbonney has joined #opendev | 08:21 | |
*** slaweq|away is now known as slaweq | 08:29 | |
*** zbr|pto is now known as zbr | 08:35 | |
*** ysandeep|lunch is now known as ysandeep|rover | 08:52 | |
*** jpena|off is now known as jpena | 08:57 | |
*** tosky has joined #opendev | 09:12 | |
*** DSpider has joined #opendev | 09:15 | |
*** ykarel is now known as ykarel|lunch | 09:34 | |
*** dtantsur|afk is now known as dtantsur | 10:38 | |
*** hashar is now known as hasharLunch | 10:45 | |
*** ykarel|lunch is now known as ykarel | 10:53 | |
*** dviroel has joined #opendev | 11:02 | |
*** hasharLunch has quit IRC | 11:19 | |
*** hasharLunch has joined #opendev | 11:42 | |
openstackgerrit | Oleksandr Kozachenko proposed zuul/zuul-jobs master: Update upload-logs-swift and upload-logs-gcs https://review.opendev.org/c/zuul/zuul-jobs/+/774650 | 11:44 |
openstackgerrit | Oleksandr Kozachenko proposed openstack/project-config master: Add zuul-storage-proxy in zuul namespace https://review.opendev.org/c/openstack/project-config/+/772364 | 11:48 |
*** hemanth_n has quit IRC | 11:56 | |
*** cloudnull has quit IRC | 12:02 | |
*** cloudnull has joined #opendev | 12:05 | |
*** eolivare_ has joined #opendev | 12:23 | |
*** eolivare has quit IRC | 12:25 | |
*** hasharLunch is now known as hashar | 12:29 | |
*** ysandeep|rover is now known as ysandeep|call | 12:31 | |
*** jpena is now known as jpena|lunch | 12:36 | |
*** hashar is now known as hasharAway | 12:39 | |
*** eolivare_ has quit IRC | 12:46 | |
*** iurygregory has quit IRC | 12:51 | |
*** ysandeep|call is now known as ysandeep|rover | 13:16 | |
*** eolivare_ has joined #opendev | 13:23 | |
*** ykarel_ has joined #opendev | 13:24 | |
*** ykarel has quit IRC | 13:27 | |
*** jpena|lunch is now known as jpena | 13:33 | |
*** ykarel_ is now known as ykarel | 13:40 | |
ttx | Hi all, I'm working to move the openstackptg bot to #openinfra-events and was taking the opportunity to rename it to "openinfraptg". But to do that it looks like someone will have to manually log in to Nickserv with the openstackptg account and associate an additional nick to it. Someone with access to the ptgbot password in hiera... Also after that the ptgbot_nick entry will have to be changed in hiera. I'm | 13:52 |
ttx | a bit unclear on the process to follow to do hiera things, so any guidance would be appreciated. | 13:52 |
fungi | ttx: i can take care of it shortly, just need to wire up a separate irc client | 13:55 |
ttx | fungi: ok, no urgency at all | 13:55 |
openstackgerrit | Thierry Carrez proposed opendev/system-config master: PTGBot is now openinfraptg on #openinfra-events https://review.opendev.org/c/opendev/system-config/+/774862 | 13:56 |
*** cloudnull has quit IRC | 13:56 | |
*** cloudnull has joined #opendev | 13:59 | |
fungi | config-core: diablo_rojo is volunteering to help with irc channel management, and is working on some foundation channel moves to the #openinfra channel namespace, simple change to add her to our default channel operators list here: https://review.opendev.org/774555 | 14:00 |
*** iurygregory has joined #opendev | 14:13 | |
*** hasharAway has quit IRC | 14:15 | |
openstackgerrit | Merged openstack/project-config master: Add diablo_rojo to AccessBot Operators https://review.opendev.org/c/openstack/project-config/+/774555 | 14:18 |
*** hasharAway has joined #opendev | 14:46 | |
openstackgerrit | Gomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size https://review.opendev.org/c/zuul/zuul-jobs/+/773474 | 14:49 |
*** hasharAway is now known as hashar | 14:54 | |
mordred | ttx: my eyes were reading your ptg bot change and misparsed the new bot name as "open in fraptg" and I was like "what's a fraptg?" I'm clearly not fully awake | 14:57 |
fungi | that's open in frap tg | 14:59 |
mordred | exactly | 15:00 |
mordred | see - I knew I needed more coffee | 15:00 |
fungi | it's all about the frappuccino in here | 15:01 |
openstackgerrit | Gomathi Selvi Srinivasan proposed zuul/zuul-jobs master: Create a template for ssh-key and size https://review.opendev.org/c/zuul/zuul-jobs/+/773474 | 15:08 |
*** ysandeep|rover is now known as ysandeep|dinner | 15:09 | |
*** fressi has quit IRC | 15:23 | |
*** ysandeep|dinner is now known as ysandeep|rover | 15:29 | |
openstackgerrit | Sorin Sbârnea proposed zuul/zuul-jobs master: Upgrade ansible-lint to 5.0 https://review.opendev.org/c/zuul/zuul-jobs/+/773245 | 15:38 |
*** hashar is now known as hasharAway | 15:42 | |
*** ykarel is now known as ykarel|away | 15:54 | |
clarkb | ianw: kopecmartin: I'll take a look after breakfast | 15:57 |
openstackgerrit | Oleksandr Kozachenko proposed zuul/zuul-jobs master: Update upload-logs-swift and upload-logs-gcs https://review.opendev.org/c/zuul/zuul-jobs/+/774650 | 16:08 |
*** mlavalle has joined #opendev | 16:16 | |
*** mlavalle has quit IRC | 16:16 | |
*** mlavalle has joined #opendev | 16:17 | |
openstackgerrit | Oleksandr Kozachenko proposed openstack/project-config master: Add zuul-storage-proxy in zuul namespace https://review.opendev.org/c/openstack/project-config/+/772364 | 16:17 |
*** ykarel|away has quit IRC | 16:17 | |
fungi | ttx: i've grouped openinfraptg into the nickserv registration for the existing openstackptg account, looking at what we need to update in hiera next | 16:23 |
ttx | fungi: probably just the $ptgbot_nick | 16:24 |
ttx | hiera('ptgbot_nick', 'username') | 16:25 |
*** marios has quit IRC | 16:31 | |
fungi | #status log Grouped openinfraptg nick to existing openstackptg account in Freenode and updated ptgbot_nick in our private group_vars accordingly | 16:32 |
openstackstatus | fungi: finished logging | 16:32 |
*** ianw has quit IRC | 16:39 | |
*** ianw has joined #opendev | 16:39 | |
openstackgerrit | Merged openstack/project-config master: Add zuul-storage-proxy in zuul namespace https://review.opendev.org/c/openstack/project-config/+/772364 | 16:40 |
*** hasharAway has quit IRC | 16:47 | |
*** hasharAway has joined #opendev | 16:49 | |
*** ysandeep|rover is now known as ysandeep|away | 16:54 | |
clarkb | ianw: the refstack change lgtm. I did leave a couple of thoughts/questions though; would be great if you can check those before we merge it | 16:55 |
openstackgerrit | Jeremy Stanley proposed opendev/puppet-pip master: Pin get-pip.py to last Python 3.5 version https://review.opendev.org/c/opendev/puppet-pip/+/774900 | 16:58 |
fungi | infra-root: ^ more fallout from pip 21 | 16:58 |
clarkb | +2 | 17:00 |
openstackgerrit | Oleksandr Kozachenko proposed zuul/zuul-jobs master: Update upload-logs-swift and upload-logs-gcs https://review.opendev.org/c/zuul/zuul-jobs/+/774650 | 17:15 |
tobiash | fungi: is the only problem with pip 21 the drop of py 3.5 or should I expect more issues? | 17:25 |
openstackgerrit | Luigi Toscano proposed openstack/project-config master: test-release-openstack: use focal https://review.opendev.org/c/openstack/project-config/+/774906 | 17:26 |
openstackgerrit | Oleksandr Kozachenko proposed zuul/zuul-jobs master: Update upload-logs-swift and upload-logs-gcs https://review.opendev.org/c/zuul/zuul-jobs/+/774650 | 17:35 |
*** d34dh0r53 has quit IRC | 17:43 | |
*** diablo_rojo has joined #opendev | 17:44 | |
*** d34dh0r53 has joined #opendev | 17:45 | |
diablo_rojo | fungi, since all those patches for the irc channel admin have landed, are we good to go on with the next set of commands then? | 17:47 |
diablo_rojo | And hypothetically I will be able to run them given I am now a part of that group? | 17:47 |
clarkb | diablo_rojo: you can check your perms with chanserv first to confirm too | 17:48 |
clarkb | diablo_rojo: /query chanserv access #your-channel list | 17:49 |
clarkb | tobiash: yes pretty much. They just made new pip require python>=3.6 | 17:50 |
clarkb | I don't think much else about it has changed | 17:50 |
clarkb | (previously they added the new dependency resolver, which was a major change, but that happened while 3.5 was still supported) | 17:50 |
tobiash | clarkb: thanks, so probably no problem for us :) | 17:50 |
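For anything still stuck on python 3.5, the fix sketched in the get-pip change above amounts to pinning; pypa publishes per-version bootstrap scripts for EOL interpreters (a sketch of the approach, not the exact change):

```sh
# fetch the last get-pip.py that still supports python 3.5 (pip 20.3.x)
curl -sSLO https://bootstrap.pypa.io/pip/3.5/get-pip.py
python3.5 get-pip.py

# or, when upgrading an existing install, cap the version instead
python3.5 -m pip install --upgrade 'pip<21'
```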
diablo_rojo | Looks like I am good to go clarkb! Thanks for the direction. | 17:52 |
*** hasharAway is now known as hashar | 18:00 | |
*** jpena is now known as jpena|off | 18:01 | |
stephenfin | Hit a weird bug | 18:06 |
stephenfin | I tried to edit a patch's commit message and clicked save | 18:06 |
stephenfin | waait, ignore that | 18:06 |
* stephenfin had to refresh to get the Publish Edit button to appear | 18:07 | |
stephenfin | and as confused by the "Go to latest patch set" bar that was appearing | 18:07 |
stephenfin | *Was | 18:07 |
clarkb | I think the go to latest patchset button may imply you are editing an older patchset | 18:08 |
clarkb | but ya the in browser editor has always been a bit weird | 18:08 |
clarkb | (I think its better now than it was on 2.13 though) | 18:08 |
*** Alex_Gaynor has joined #opendev | 18:10 | |
Alex_Gaynor | 👋 I'm seeing arm64 builds hanging out in queue'd status for an extended period (>1 hour) https://zuul.opendev.org/t/pyca/status/ I don't see anything obvious in grafana that explains this. | 18:10 |
clarkb | https://grafana.opendev.org/d/pwrNXt2Mk/nodepool-linaro?orgId=1 shows that we were recently using at or near capacity, but right now it does look idle | 18:11 |
Alex_Gaynor | And has 0 in building. I'd expect things to be building if I'm queued :-) | 18:12 |
clarkb | me too | 18:12 |
clarkb | there are some errored launch attempts there, I wonder if it is failing so early that we never trigger the building report to graphite /me goes to look at launcher logs | 18:13 |
Alex_Gaynor | The queue also appears to be very deep, though I obviously have no idea if that's related. | 18:15 |
clarkb | a deep queue can also cause jobs to wait before building just due to lack of available resources to process everything at once, but in this case it seems to be that it isn't using any of the available resources for some reason | 18:16 |
clarkb | nodepool just logged 504 Gateway Time-out: The server didn't respond in time. for linaro server deletions | 18:17 |
clarkb | I wonder if the api services just went away /me digs more | 18:17 |
Alex_Gaynor | if it can't handle deletions, seems like you might end up with a "phantom" pool of resources that exist, but are unusable, and also prevent spinning up new ones. | 18:18 |
clarkb | ya its also affecting the listing of resources and services on different ports. | 18:18 |
*** eolivare_ has quit IRC | 18:18 | |
mordred | that reminds me of the old HP Public Cloud bug | 18:18 |
clarkb | my preliminary analysis is that the cloud apis just went away | 18:19 |
clarkb | kevinz: ^ fyi if you happen to be awake | 18:19 |
clarkb | I'll try interacting with it manually to see if I can observe any other useful behaviors | 18:19 |
mordred | with the missing database index that caused both creates and deletes to timeout at the LB but continue running/blocking on the backend - and of course since the LB timed it out, client code would retry the operation just putting more in the backend queue ... | 18:20 |
clarkb | mordred: ya image lists and server shows work | 18:21 |
clarkb | if run manually so I'm suspecting the issues are more narrow (similar to what you describe) | 18:21 |
* clarkb tries to manually boot and delete a server | 18:22 | |
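The manual exercise, roughly, via the standard openstack CLI (cloud, flavor, image, and network names are placeholders):

```sh
export OS_CLOUD=linaro                      # assumed clouds.yaml entry
openstack image list                        # read APIs responded fine
openstack server show some-existing-node    # as did server show
openstack server create --flavor test --image focal \
    --network default clarkb-test           # this returned "No valid host was found"
openstack server delete clarkb-test
```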
*** rpittau is now known as rpittau|afk | 18:23 | |
clarkb | my test node failed with 'No valid host was found. There are not enough hosts available.' | 18:25 |
clarkb | however, if nodepool was hitting ^ I would've expected node failures to bubble up to zuul | 18:26 |
clarkb | this is an interesting situation | 18:26 |
mordred | yeah - why was nodepool getting gateway timeouts - no valid host is a real error | 18:30 |
mordred | unless no valid host is causing nodepool to retry loop and the loadbalancer is rate-limiting nodepool now but not you | 18:30 |
mordred | did you do that manual launch from a nodepool node? | 18:31 |
clarkb | no I did it from bridge | 18:31 |
clarkb | so that could be it, the proxy telling us to go away after tight looping due to failures | 18:31 |
mordred | yeah | 18:31 |
mordred | so could be a double failure sitch | 18:31 |
clarkb | ya grepping on not enough hosts I see a bunch of those errors in a small period of time, then it stops, which would be in line with your hypothesis | 18:32 |
clarkb | hrm except the gateway failures happened first and now its going through and finding no valid host | 18:35 |
clarkb | I expect that what is happening is something broke at a network level and caused the cloud to have a sad. It has since recovered enough to fail node boots with no valid host but not recovered enough to actually boot them | 18:36 |
clarkb | and now jobs are going to start getting node failures | 18:36 |
clarkb | but I'll keep poking and see if I can come up with a more concrete idea of what is going on | 18:36 |
clarkb | just caught it doing another round of attempts and then getting back no valid hosts | 18:40 |
clarkb | do we have a backoff on node relaunch attempts? | 18:40 |
clarkb | that may explain why we aren't seeing this in a tight loop | 18:41 |
clarkb | corvus: ^ | 18:41 |
*** diablo_rojo has quit IRC | 18:44 | |
*** dtantsur is now known as dtantsur|afk | 18:45 | |
fungi | tobiash: that's the main change i'm aware of in pip 21, it drops support for python <3.6 (including dropping 2.7 support) | 18:45 |
fungi | oh, i see clarkb also answered you | 18:46 |
fungi | diablo_rojo seems to have dropped again | 18:46 |
corvus | clarkb: i don't think there's an explicit backoff, just a complex system of loops and timeouts | 18:52 |
clarkb | hrm, it is definitely not progressing through the requests as quickly as I would expect if there is no backoff. I think this is a "good" thing in that it means we may end up with a fixed cloud before everything NODE_FAILUREs though | 18:53 |
*** klonn has joined #opendev | 18:56 | |
*** whoami-rajat__ has quit IRC | 18:57 | |
openstackgerrit | Merged openstack/project-config master: test-release-openstack: use focal https://review.opendev.org/c/openstack/project-config/+/774906 | 19:02 |
clarkb | fwiw still seeing bursts of no valid host found | 19:15 |
*** ralonsoh has quit IRC | 19:17 | |
*** rchurch has quit IRC | 19:17 | |
Alex_Gaynor | FWIW, I'm now seeing clear "node_failure" statuses, so progress? | 19:18 |
fungi | corvus: so the manage-projects run took 1.5 hours | 19:19 |
corvus | infra-root: i'm looking at a manage-projects log, and it output a lot of errors on gitea03 | 19:19 |
corvus | and 02 | 19:19 |
fungi | maybe we're getting slammed again | 19:20 |
*** rchurch has joined #opendev | 19:20 | |
fungi | checking graphs | 19:20 |
clarkb | Alex_Gaynor: yes I think what is happening is the no valid host errors we are seeing more recently are going to start bubbling up as NODE_FAILURES | 19:20 |
tobiash | clarkb: I'm wondering if we should treat no valid host found errors in nodepool like non-fatal quota issues | 19:20 |
corvus | and 05... let's just say several giteas for now since it's hard to read these logs | 19:20 |
fungi | corvus: oh yeah, massive swap thrash and eventual oom in that timeframe | 19:20 |
fungi | so yay, our mystery load generator has returned? and now we have improved logging to investigate with | 19:21 |
clarkb | tobiash: in this case there is only one cloud provider for these node types so that would just cause all jobs to sit and wait until the cloud fixed itself | 19:21 |
clarkb | fungi: fwiw I think its been about exactly one week since last time | 19:21 |
fungi | neat | 19:21 |
clarkb | fungi: a fun cron maybe? | 19:21 |
fungi | last time our suspect was rdo's ci servers, right? | 19:21 |
clarkb | but ya the improved logs will hopefully allow us to identify the source | 19:21 |
tobiash | clarkb: we were getting no valid host found mostly when the cloud was short on resources due to potentially too-high overprovisioning | 19:22 |
clarkb | fungi: yes, they had an order of magnitude more requests on some of the servers (and theory was they tripped that one over which caused a chain reaction as haproxy rebalanced the pool) | 19:22 |
corvus | fungi: gitea03 starting at 2021-02-10T17:05:30.156912 | 19:22 |
clarkb | Alex_Gaynor: unfortunately I don't think there is much more we can do without the cloud intervening. | 19:22 |
Alex_Gaynor | 😢 | 19:22 |
clarkb | kevinz: when your day starts can you sync up with us and see if we can help with further debugging? | 19:22 |
clarkb | corvus: fungi: the rough debugging process with the improved logs is: look at apache2 access logs on affected hosts during the time frame and note the source port for large or out-of-place requests. Then go to the haproxy server syslog and grep for that port and gitea backend | 19:23 |
fungi | clarkb: also sometimes it's helped to e-mail kevinz since he may see that sooner than irc | 19:24 |
clarkb | because there are only 65k possible ports you also typically have to match timestamp ranges too | 19:24 |
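Sketching that correlation, with log paths, the 17z window, and the example port all as assumptions (source port plus a tight timestamp range is the join key, since ports recycle):

```sh
# on the suspect gitea backend: unusual requests in the window, noting client port
grep '10/Feb/2021:17:' /var/log/apache2/gitea-ssl-access.log | less

# on the load balancer: find which real client haproxy mapped to that port
grep ':45678' /var/log/syslog | grep gitea03    # 45678 is the port from step one
```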
clarkb | fungi: good idea. I'll write an email then see if I can help with gitea | 19:24 |
corvus | tobiash, clarkb: i agree, that's usually what that error means. i think i'd be in favor of treating it as a non-fatal error; though since it's not actually reflected in quota, i don't think we'll be able to handle it intelligently. i think we ought to decline the request if we are not the last possible launcher. | 19:24 |
fungi | yeah, i'm doing the thing with gitea02, but someone independently doing the same for another impacted backend would help correlation | 19:24 |
corvus | fungi: i think i need to leave the gitea debugging to you, sorry | 19:25 |
fungi | corvus: no worries, thanks for spotting it! | 19:26 |
fungi | i'll be semi-focused on this for the next little while, but also need to do some cooking shortly | 19:27 |
*** hashar has quit IRC | 19:33 | |
clarkb | ok email sent | 19:34 |
clarkb | tried to accurately describe the transition from 504 gateway errors to no valid host found, with timestamps | 19:35 |
fungi | interestingly, the greatest number of connections i see to gitea02 during the 17z hour was from codesearch.o.o | 19:36 |
*** andrewbonney has quit IRC | 19:39 | |
clarkb | is it possible that creating new projects is doing it? | 19:41 |
clarkb | 16:40:05 openstackgerrit | Merged openstack/project-config master: Add zuul-storage-proxy in zuul namespace https://review.opendev.org/c/openstack/project-config/+/772364 | 19:41 |
clarkb | or is that just getting caught in the fallout? Our gitea testing does actually create all the projects in our project list and it does that successfully | 19:42 |
fungi | yeah, it's mostly come to our attention when new project creation fails, but we create new projects at other times without issue | 19:43 |
fungi | i don't think it's codesearch, because it's also far and away the largest source of connections to gitea02 at other times where this wasn't going on | 19:44 |
clarkb | automated email response has reminded me that it is Chinese New Year | 19:46 |
fungi | d'oh! | 19:47 |
fungi | that doesn't bode well | 19:47 |
*** slaweq has quit IRC | 19:49 | |
*** zimmerry has quit IRC | 20:11 | |
*** zimmerry has joined #opendev | 20:13 | |
*** sboyron_ has joined #opendev | 20:55 | |
*** sboyron has quit IRC | 20:58 | |
*** sboyron_ has quit IRC | 21:09 | |
ianw | clarkb: how strongly do you feel about the /var/refstack v /var/lib/refstack? enough to respin? | 21:29 |
fungi | even with /var/refstack being non-fhs-compliant, i wouldn't want anyone to redo work | 21:30 |
clarkb | ianw: not super. I tend to always look at the docker compose file and work back from there anyway (and that has the /var/lib/refstack pointers) | 21:31 |
ianw | fungi: well the change is moving everything to /var/lib/refstack so i guess we're good from that pov | 21:31 |
clarkb | just calling it out as a difference to gitea if others care more strongly | 21:31 |
ianw | ok i might just go with it, and see if having a persistent db makes things work. i'm not sure though, i did a "mysqldump <trove-details> | mysql" to try and populate it and it didn't seem to work, but i don't know | 21:32 |
ianw | i have to just do school run but can help with gitea things if i can be useful | 21:33 |
fungi | in what way did it not work? i think i've used mysql -e for such things in the past | 21:34 |
fungi | or source the path to the dumpfile in the interactive mysqlclient prompt | 21:34 |
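The import being attempted, sketched; trove host, credentials, and container name are all assumptions:

```sh
# pipe a dump of the old trove db straight into the containerized mariadb
mysqldump -h <trove-host> -u refstack -p"$TROVE_PW" refstack \
    | docker exec -i refstack_mariadb_1 mysql -u root -p"$ROOT_PW" refstack

# or stage a dumpfile and source it, per the suggestion above
mysql -h 127.0.0.1 -u root -p refstack -e 'source /tmp/refstack.sql'
```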
clarkb | re gitea I'm beginning to wonder if it could be our own project description updates that does it | 21:34 |
clarkb | possible that when there is background load on gitea, doing the management stuff all at once like that can make things sad | 21:35 |
clarkb | however, not 100% sure of that yet | 21:35 |
fungi | yeah, i need to find a minute to try and match up the manage-projects ansible log to see if i can tell when it started hitting different backends with when the memory on each of them started to skyrocket | 21:37 |
fungi | ugh, perhaps unsurprisingly, we have bitrot in our puppet-pip module testing | 21:41 |
fungi | looks like it could be a problem with beaker-hiera | 21:42 |
fungi | reading a bit, we may need to pin beaker-hiera<0.2 in our spec helper repo | 21:44 |
clarkb | that seems to be the classic case of bit rot in the puppet space for us | 21:44 |
openstackgerrit | Jeremy Stanley proposed opendev/puppet-openstack_infra_spec_helper master: Pin beaker-hiera<0.2.0 https://review.opendev.org/c/opendev/puppet-openstack_infra_spec_helper/+/775030 | 21:49 |
openstackgerrit | Jeremy Stanley proposed opendev/puppet-pip master: Pin get-pip.py to last Python 3.5 version https://review.opendev.org/c/opendev/puppet-pip/+/774900 | 21:50 |
clarkb | fungi: I seem to recall that Depends-On won't work for that for some reason? we may just have to land the infra spec helper change (which I am reviewing) | 21:51 |
fungi | yeah, maybe | 21:52 |
openstackgerrit | Merged opendev/system-config master: refstack: create database storage area https://review.opendev.org/c/opendev/system-config/+/774773 | 21:54 |
clarkb | ianw: fwiw I would've expected the mariadb to work without the persistent mount but if docker-compose down then up -d was run you'd lose the db | 21:55 |
clarkb | it's definitely a good and correct improvement | 21:56 |
clarkb | fungi: just thinking out loud here about the gitea thing. Maybe we can measure it in our test job and see if that exhibits high load during the project description update pass? | 22:12 |
clarkb | I don't know how easy our existing test tooling makes that though | 22:12 |
fungi | wonder if we could add roles to start dstat and collect its record? | 22:16 |
ianw | devstack already has a background service that does similar | 22:17 |
fungi | yeah, just didn't know if devstack's implementation was easily reused or tightly coupled to devstack's overall design | 22:19 |
ianw | probably jumping on a running host and copying the .service file would be enough | 22:21 |
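If this turns into a role, the core is just a backgrounded dstat writing CSV for the life of the job (flags and output path here are illustrative):

```sh
# sample time/cpu/mem/net/disk/io every 5 seconds, csv for graphing later
nohup dstat -tcmndr --output /var/log/dstat-gitea.csv 5 >/dev/null 2>&1 &
```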
*** diablo_rojo has joined #opendev | 22:21 | |
diablo_rojo | fungi, been trying to run the renaming commands in the docs, but chanserv says I am not authorized? | 22:23 |
clarkb | diablo_rojo: you may have to explicitly op yourself in the channel first | 22:23 |
clarkb | the ability to op and actually being op are separate | 22:23 |
fungi | diablo_rojo: /msg chanserv op #openstack-board | 22:24 |
fungi | i think that's the syntax | 22:24 |
fungi | and then the same but deop instead of op when you're finished | 22:24 |
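Putting those steps together with the forward being set later in this log, the rename dance looks roughly like this (channel names are from this migration; MLOCK syntax per freenode's Atheme ChanServ, and as it turns out below, setting the forward needs the +s flag):

```
/msg ChanServ OP #openstack-ptg
/msg ChanServ SET #openstack-ptg MLOCK +ntcrf #openinfra-events
/msg ChanServ DEOP #openstack-ptg
```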
clarkb | also this could be a behavior change between gitea 1.13 and 1.14? | 22:28 |
clarkb | assuming the project description updates are related. | 22:28 |
clarkb | Just thinking out loud here: we could also remove the description updates for now and monitor the next project creation | 22:29 |
diablo_rojo | I did actually explicitly op myself in the channel first. | 22:33 |
diablo_rojo | But let me try again | 22:33 |
diablo_rojo | Yeah. It still says I am not authorized.. | 22:34 |
clarkb | do you need it on both sides maybe? | 22:35 |
diablo_rojo | In both #openinfra-events and #openstack-ptg I have op | 22:35 |
diablo_rojo | but I can't set the guard on #openstack-ptg | 22:35 |
clarkb | hrm | 22:36 |
diablo_rojo | Right? | 22:36 |
fungi | i'll need to re-check the freenode mode reference | 22:38 |
fungi | most of us are +Aeforstv but diablo_rojo is only +Aefortv | 22:38 |
diablo_rojo | Weird. | 22:38 |
fungi | founder perms are +AFRefiorstv | 22:39 |
fungi | https://freenode.net/kb/answer/channelmodes | 22:39 |
fungi | ahh, nope, i wanted perms | 22:40 |
*** levalicious has quit IRC | 22:40 | |
diablo_rojo | I would guess since I am missing the 's' that's why I can't 'SET' things. | 22:40 |
fungi | oh, though that says "An operator can use MLOCK with +f only if they have access flag +s in both channels, or if the channel to be forwarded to is +F and they have +s in the original channel." | 22:43 |
fungi | so, yep | 22:43 |
fungi | that looks likely | 22:43 |
fungi | i wonder why we don't normally grant +s to our operators list? | 22:43 |
fungi | diablo_rojo: i've added +s to your perms for #openinfra-events and #openstack-ptg, see if that worked? | 22:45 |
fungi | if so i can add you to the others you're working on while we figure out why we're not setting +s on everyone in the access list | 22:45 |
diablo_rojo | fungi, no dice. | 22:47 |
fungi | +s is "Enables use of the set command." according to `/msg chanserv help FLAGS` | 22:47 |
fungi | diablo_rojo: my fault, syntax error. should actually be added now | 22:49 |
fungi | i didn't spot the error response on my first try. probably doing too many things at the same time | 22:49 |
*** klonn has quit IRC | 22:51 | |
fungi | ianw: "This message is to inform you that the host your cloud server, ianw-klog-collector, resides on alerted our monitoring systems at 2021-02-09T12:39:27.515190." | 22:59 |
fungi | i can close that ticket out, just wanted to make sure you knew | 23:00 |
ianw | fungi: oh, we can delete that. that was the server i was using to collect logs from the linaro hosts that kept disappearing | 23:01 |
fungi | ianw: also are the "backup inconsistency" e-mails a test, or false negative? | 23:01 |
openstackgerrit | Merged opendev/puppet-openstack_infra_spec_helper master: Pin beaker-hiera<0.2.0 https://review.opendev.org/c/opendev/puppet-openstack_infra_spec_helper/+/775030 | 23:01 |
clarkb | ianw: re linaro, do you know if there is someone else we should email about the no valid host found errors there? kevinz's autoresponder returned that it is Chinese New Year | 23:01 |
fungi | "Inconsistency found in backup /opt/backups/borg-gitea01/backup on backup01 at Wed Feb 10 01:13:47 UTC 2021" | 23:02 |
fungi | et cetera | 23:02 |
ianw | sorry just pulling up my mail client | 23:02 |
clarkb | fungi: I've yet to receive that one it seems | 23:02 |
fungi | clarkb: they went to the root inbox | 23:02 |
ianw | clarkb: yeah, i don't have another contact unfortunately ... not sure what else to do :( | 23:02 |
diablo_rojo | fungi, good to go now. | 23:02 |
diablo_rojo | I can set stuff. | 23:03 |
ianw | huh, those backup inconsistency ones i would not expect | 23:03 |
fungi | clarkb: ianw: could hrw know who to contact? | 23:03 |
diablo_rojo | If you want to give me that perm on the other channels I can move forward. | 23:03 |
fungi | diablo_rojo: awesome, yeah doing that now, just a sec | 23:03 |
diablo_rojo | fungi, thank you! | 23:03 |
clarkb | fungi: ya not finding it (I did a global search too to rule out it being filed into an unexpected dir) | 23:04 |
ianw | fungi: maybe, he is usually in here but from #linaro we might have just missed him. worth a try | 23:05 |
fungi | clarkb: no, i mean our shared root inbox | 23:05 |
clarkb | oh I see | 23:05 |
fungi | diablo_rojo: i think i got them all | 23:06 |
ianw | oh, hrm we didn't actually approve the backup verification yet @ https://review.opendev.org/c/opendev/system-config/+/774753 | 23:06 |
fungi | clarkb: also inmotion has been sending setup complete notifications to that address, do you need those or can i file them into a subfolder? | 23:06 |
ianw | fungi: ok, they are definitely false positives from when i was testing it and pressed ctrl-c, killing the verification process that the script then warned about | 23:07 |
diablo_rojo | fungi, sweet! Thank you :) | 23:07 |
fungi | cool | 23:08 |
fungi | diablo_rojo: you're welcome, lmk if you run into more problems | 23:08 |
clarkb | fungi: they can be filed away | 23:09 |
diablo_rojo | fungi, will do! | 23:11 |
fungi | clarkb: thanks, done | 23:11 |
fungi | working on closing out the rackspace ticket about ianw-klog-collector as well | 23:12 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Build Gerrit 3.3 images https://review.opendev.org/c/opendev/system-config/+/765021 | 23:12 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Run gerrit 3.2 and 3.3 functional tests https://review.opendev.org/c/opendev/system-config/+/773807 | 23:12 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Cleanup refstack job dependencies https://review.opendev.org/c/opendev/system-config/+/775041 | 23:12 |
ianw | fungi: if you're in the control panel can you just delete it? | 23:12 |
ianw | (the server, otherwise i'll do it later) | 23:12 |
fungi | ianw: oh, yep happy to do that too | 23:12 |
clarkb | ianw: ^ I stuck that refstack cleanup behind my gerrit 3.3 jobs addition because there were merge conflicts | 23:12 |
diablo_rojo | fungi, missing #openstack-summit | 23:13 |
clarkb | I don't think either is urgent but wanted to point that out as I noticed it when fixing the conflicts | 23:13 |
ianw | clarkb: oh, sorry, i owed a review on the gerrit 3.3 things, looking | 23:13 |
fungi | diablo_rojo: oh, hah, i'm not allowed to do that | 23:13 |
diablo_rojo | Ha ha ha | 23:14 |
diablo_rojo | Alright, will circle back to that one then. | 23:14 |
fungi | diablo_rojo: apparently not actually an official channel; the only access is for the founder "spy" who created that channel >10 years ago | 23:14 |
fungi | >10.5 years ago in fact | 23:15 |
fungi | "modified 10y 31w 3d ago" | 23:16 |
fungi | corvus: mordred: does the irc nick "spy" ring a 10.5-year-old bell for you? | 23:16 |
clarkb | that must've been the first summit | 23:16 |
fungi | indeed | 23:16 |
clarkb | (if I've done math right) | 23:16 |
fungi | jbryce: ^ you might remember too, i suppose? | 23:18 |
corvus | yes spy was an OG | 23:20 |
fungi | ianw: i've closed out the ticket and deleted the instance now | 23:20 |
corvus | fungi: need me to ask freenode for it? | 23:20 |
ianw | thankyou! | 23:20 |
fungi | corvus: if you have a moment, that would be much appreciated! | 23:20 |
fungi | at this point we're just trying to set a forward on it anyway | 23:21 |
fungi | that reminds me i still need to start the org application for the #openinfra channel namespace, i found freenode's documentation on the process at least | 23:24 |
corvus | fungi: better sooner than later | 23:26 |
fungi | yup | 23:26 |
fungi | we're only just starting to forward to those, and i didn't want to jump the gun asking for that namespace until we'd given the former occupants of the base channel some time | 23:27 |
corvus | fungi: done; Flags +AFRefiorstv were set on openstackinfra in #openstack-summit. | 23:28 |
fungi | thanks corvus! | 23:28 |
fungi | diablo_rojo: as soon as our next accessbot run fires, we should be all set | 23:29 |
mordred | fungi: wow. spy is old | 23:29 |
diablo_rojo | seems I can't set the MLOCK for the openstack-foundation to openinfra redirect | 23:30 |
ianw | clarkb: did you ever look at making gitea pause until the db container was active? you used to be able to set a "healthcheck" on the mariadb instance and make other containers wait on that with a condition, but for some reason they removed that apparently | 23:31 |
corvus | diablo_rojo: there's an existing forward for the unregistered channel; maybe that needs to be removed first? | 23:32 |
corvus | info #openstack-foundation | 23:33 |
corvus | derp | 23:33 |
diablo_rojo | corvus, ohh that makes sense. Oh nailed it. | 23:33 |
diablo_rojo | That's already in place then | 23:33 |
clarkb | ianw: I think docker-compose doesn't have that ability to wait. You have to do it within the container with like an init script | 23:33 |
clarkb | ianw: I want to say once I couldn't do it with docker-compose I gave up because I didn't want to have a super complicated container image | 23:34 |
clarkb | (but maybe complicated container image is a good idea?) | 23:34 |
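The in-container approach amounts to a tiny entrypoint wrapper; a sketch, assuming the db answers at service name "mariadb" and the image ships mysqladmin:

```sh
#!/bin/sh
# wait-for-db.sh: block until mariadb responds, then exec the real command
until mysqladmin ping -h mariadb --silent; do
    echo "waiting for mariadb..." >&2
    sleep 2
done
exec "$@"
```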
corvus | diablo_rojo: the current status is: Mode lock : +ntcrf #openstack-unregistered | 23:34 |
diablo_rojo | oh so not redirected to the right place | 23:35 |
*** DSpider has quit IRC | 23:35 | |
ianw | clarkb: i got it working with http://paste.openstack.org/show/lCL5sfUhtkXLtvmHwMmV/ using version 2.1, but then i read that apparently that was considered too useful and so they removed it in version 3 :/ | 23:36 |
ianw | i don't actually know if it matters; i'm assuming refstack retries until it connects anyway | 23:36 |
ianw | we didn't deploy because of a typo | 23:36 |
clarkb | gitea retries | 23:36 |
clarkb | ianw: what was the typo? | 23:37 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: refstack: fix typo in role matcher https://review.opendev.org/c/opendev/system-config/+/775044 | 23:37 |
ianw | clarkb: ^ :) | 23:37 |
fungi | so looking at our accessbot config, we say to set +Aeforstv on everyone in the operators list, so i'm not sure why it added diablo_rojo to them without +s | 23:38 |
fungi | actually there are channels it didn't add her to at all, i'll check the accessbot output | 23:39 |
clarkb | ianw: doh | 23:41 |
ianw | i'll just do a manual run to get the new files on | 23:41 |
fungi | 2021-02-10 08:05:41,556 [INFO] setaccess - access #openinfra-board add diablo_rojo -FRis | 23:45 |
ianw | ok, i've started the refstack mariadb container and /var/lib/refstack/db/ is populated. i'm going to run the mysqldump import from the old trove | 23:45 |
fungi | now to figure out where/why accessbot is setting -FRis | 23:46 |
ianw | well https://refstack01.openstack.org/#/community_results still seems to not be happy | 23:48 |
clarkb | fungi: operators don't have FRi (but do have s) | 23:49 |
fungi | yeah, that's what i find weird | 23:49 |
clarkb | -FRi seems like what I would've expected | 23:49 |
fungi | but also she wouldn't have had FRi anyway | 23:49 |
clarkb | as a santiy check other operators do have +s (but no FRi) | 23:51 |
fungi | and it seems like accessbot isn't processing the whole list either, though it's not immediately apparent to me from the log why that is | 23:51 |
*** CeeMac has quit IRC | 23:51 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: refstack: capture container logs to disk https://review.opendev.org/c/opendev/system-config/+/775046 | 23:52 |
fungi | 2021-02-10 14:33:57,877 [DEBUG] irc.client - TO SERVER: QUIT :Connection reset by peer | 23:52 |
fungi | oh maybe we're getting disconnected | 23:53 |
clarkb | is it being rate limited? | 23:53 |
fungi | yeah, could be something like that, though the server doesn't seem to explain | 23:54 |
ianw | "Blocked loading mixed active content โhttp://refstack01.openstack.org:8000/v1/results?page=1โ | 23:55 |
fungi | that's being logged by the apache layer? | 23:55 |
clarkb | ianw: fwiw it gives me json back | 23:56 |
clarkb | but I have to switch it to port 443 | 23:56 |
ianw | yeah, i think the errors might be on the front end and the db is ok, it's just all confused between https/http and its hostname... | 23:57 |
clarkb | ianw: I think the port 8000 stuff isn't meant to be publicly exposed fwiw | 23:57 |
clarkb | but I guess if that is apache complaining then you aren't hitting that | 23:57 |
fungi | looks like tools/apply-test.sh needs some help to deal with latest cryptography now | 23:58 |
clarkb | fungi: that uses ansible to run puppet, and probably uncapped cryptography with old pip being the problem there? | 23:58 |
clarkb | fungi: can probably upgrade pip first or cap cryptography in the ansible install | 23:58 |
fungi | yeah, it's happening when we pip install ansible | 23:59 |
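Either of those is a small patch to the script; a sketch of both options, noting that cryptography 3.4 (current at the time) only ships wheels newer pip can select, and needs Rust to build from source:

```sh
# option 1: modernize pip first so it can use cryptography's newer wheels
pip install --upgrade pip setuptools wheel
pip install ansible

# option 2: cap cryptography at the last pre-Rust release
pip install 'cryptography<3.4' ansible
```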