*** tosky has quit IRC | 00:09 | |
*** sgw has quit IRC | 00:16 | |
*** DSpider has quit IRC | 00:19 | |
openstackgerrit | Ian Wienand proposed opendev/glean master: Add container build jobs https://review.opendev.org/723285 | 00:29 |
openstackgerrit | Merged opendev/system-config master: status.openstack.org: send zuul link to opendev zuul https://review.opendev.org/723282 | 01:14 |
openstackgerrit | Merged opendev/system-config master: Cron module wants strings https://review.opendev.org/723106 | 01:39 |
openstackgerrit | Merged openstack/diskimage-builder master: Add sibling container builds to experimental queue https://review.opendev.org/723281 | 02:07 |
*** rkukura has quit IRC | 02:13 | |
*** rkukura has joined #opendev | 02:27 | |
ianw | mordred: there is something going on with puppet apply where it's somehow restoring back to an old change | 02:49 |
ianw | remote_puppet_else.yaml.log.2020-04-27T01:24:05Z:Notice: /Stage[main]/Openstack_project::Status/Httpd::Vhost[status.openstack.org]/File[50-status.openstack.org.conf]/content: content changed '{md5}9185a2797200c84814be8c05195800fa' to '{md5}c9a8216d842c5c83e6910eb41d4d91ee' | 02:49 |
ianw | remote_puppet_else.yaml.log.2020-04-27T01:35:36Z:Notice: /Stage[main]/Openstack_project::Status/Httpd::Vhost[status.openstack.org]/File[50-status.openstack.org.conf]/content: content changed '{md5}c9a8216d842c5c83e6910eb41d4d91ee' to '{md5}9185a2797200c84814be8c05195800fa' | 02:49 |
ianw | the 01:24 run updated it, then the 01:35 run un-updated it, i think | 02:50 |
ianw | deploy 723282,1 9 mins 16 secs 2020-04-27T01:23:44 | 02:51 |
clarkb | ianw: I think that's a zuul bug that corvus found | 02:51 |
ianw | opendev-prod-hourly master 9 mins 16 secs 2020-04-27T01:35:16 | 02:51 |
clarkb | it uses the change merged against master and that is racy | 02:52 |
ianw | clarkb: hrm, i think it was the opendev-prod-hourly that has seemed to revert the change, that should have seen the new change? | 02:54 |
ianw | the hourly job checked out system-config master to 2020-04-27 01:35:45.184204 | bridge.openstack.org | 2e2be9e6873ffe7dd07d84792b2bbef47e901f02 Merge "Fix zuul.conf jinja2 template" | 02:55 |
clarkb | hrm maybe another bug of similar variety? | 02:56 |
clarkb | like maybe deploy ran out of order so deploy hourly ran head^ ? | 02:56 |
ianw | if i'm correct in calculating https://opendev.org/opendev/system-config/commit/1d0d62c6a61159038be5c4e98bebb0e232131f56 merged at 2020-04-26 23:42 ... so several hours before the hourly job | 02:58 |
ianw | going to see if i can come up with a timeline in https://etherpad.opendev.org/p/DSxEB-ViHzEHDMgxAJDp | 03:00 |
ianw | the next run, running now, appears to have applied it | 03:32 |
*** factor has joined #opendev | 03:34 | |
*** ykarel|away is now known as ykarel | 04:30 | |
openstackgerrit | Merged zuul/zuul-jobs master: Update ensure-javascript-packages README https://review.opendev.org/722354 | 04:52 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] ensure-virtualenv https://review.opendev.org/723309 | 04:57 |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] ensure-virtualenv https://review.opendev.org/723309 | 05:00 |
*** ysandeep|away is now known as ysandeep | 05:12 | |
*** jaicaa has quit IRC | 05:18 | |
*** jaicaa has joined #opendev | 05:20 | |
openstackgerrit | Ian Wienand proposed zuul/zuul-jobs master: [wip] ensure-virtualenv https://review.opendev.org/723309 | 05:37 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: [wip] plain nodes https://review.opendev.org/723316 | 05:41 |
*** dpawlik has joined #opendev | 05:56 | |
AJaeger | infra-root, I just saw a promote job fail with timeout uploading to AFS, see https://zuul.opendev.org/t/openstack/build/413faee223e54bc1bca7051a7b49c59b | 05:58 |
ianw | AJaeger: hrm, weird; i just checked that dir, and even touched and rm'd a file there and it was ok | 06:00 |
ianw | /afs/.openstack.org/docs/devstack-plugin-ceph | 06:01 |
openstackgerrit | Merged openstack/project-config master: Add Airship subproject documentation job https://review.opendev.org/721328 | 06:04 |
AJaeger | ianw: might be a temporary networking problem ;( | 06:25 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Stop translation stable branches on projects without Dashboard https://review.opendev.org/723217 | 06:35 |
*** iurygregory has quit IRC | 07:09 | |
*** iurygregory has joined #opendev | 07:10 | |
*** DSpider has joined #opendev | 07:22 | |
*** rpittau|afk is now known as rpittau | 07:22 | |
*** tosky has joined #opendev | 07:26 | |
*** sshnaidm|afk is now known as sshnaidm | 07:35 | |
*** ysandeep is now known as ysandeep|lunch | 08:16 | |
*** logan_ has joined #opendev | 08:31 | |
*** logan- has quit IRC | 08:32 | |
*** logan_ is now known as logan- | 08:35 | |
*** ykarel is now known as ykarel|lunch | 08:44 | |
hrw | zuul runs everything using ansible. how do I force it to use py3 on zuul? | 09:02 |
hrw | 2020-04-24 12:47:53.223078 | primary | "exception": "Traceback (most recent call last):\n File \"/tmp/ansible_pip_payload_Ffk1eE/__main__.py\", line 254, in <module>\n from pkg_resources import Requirement\nImportError: No module named pkg_resources\n", | 09:05 |
hrw | 2020-04-24 12:47:53.223192 | primary | "msg": "Failed to import the required Python library (setuptools) on debian-buster-arm64-linaro-us-0016157969's Python /usr/bin/python. Please read module documentation and install in the appropriate location" | 09:05 |
frickler | hrw: just set it like this? https://opendev.org/opendev/system-config/src/branch/master/playbooks/group_vars/gitea.yaml#L1 | 09:07 |
hrw | frickler: thx | 09:08 |
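For reference, the group_vars setting frickler links to boils down to a single Ansible variable. A minimal sketch (the file name below is illustrative, and the exact interpreter path used in the linked gitea.yaml may differ):

    # playbooks/group_vars/<your-group>.yaml -- hypothetical location
    # Tell Ansible to run its modules with Python 3 on the managed nodes.
    ansible_python_interpreter: /usr/bin/python3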
*** ykarel|lunch is now known as ykarel | 09:36 | |
*** ysandeep|lunch is now known as ysandeep | 09:53 | |
*** ykarel is now known as ykarel|afak | 10:31 | |
*** ykarel|afak is now known as ykarel|afk | 10:31 | |
*** rpittau is now known as rpittau|bbl | 10:32 | |
*** ykarel|afk is now known as ykarel | 11:31 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use cached buildset_registry fact https://review.opendev.org/723385 | 11:32 |
donnyd | Just an FYI OpenEdge is undergoing maintenance - shouldn't affect the CI - but in case it does you will know why | 11:35 |
*** smcginnis has quit IRC | 11:40 | |
*** DSpider has quit IRC | 11:40 | |
*** smcginnis has joined #opendev | 11:41 | |
*** DSpider has joined #opendev | 11:41 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: haskell-stack-test: add haskell tool stack test https://review.opendev.org/723263 | 11:58 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Support multi-arch image builds with docker buildx https://review.opendev.org/722339 | 12:33 |
*** ykarel is now known as ykarel|afk | 12:38 | |
*** rpittau|bbl is now known as rpittau | 12:49 | |
*** ykarel|afk is now known as ykarel | 12:52 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: omit variable instead of ignoring errors https://review.opendev.org/723524 | 13:19 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: WIP: omit variable instead of ignoring errors https://review.opendev.org/723524 | 13:20 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Use gitea for gerrit gitweb links https://review.opendev.org/723526 | 13:24 |
openstackgerrit | Monty Taylor proposed opendev/base-jobs master: Define an ubuntu-focal nodeset https://review.opendev.org/723527 | 13:27 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Test zuul-executor on focal https://review.opendev.org/723528 | 13:29 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Test zuul-executor on focal https://review.opendev.org/723528 | 13:33 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Do not set buildset_fact if it's not present in results.json https://review.opendev.org/723524 | 14:01 |
corvus | mordred: hrm, it doesn't look like we have a working cert for zuul.openstack.org yet | 14:11 |
fungi | corvus: mordred: ianw spotted that it was getting overwritten and tried to put together a timeline in https://etherpad.opendev.org/p/DSxEB-ViHzEHDMgxAJDp | 14:13 |
fungi | though i guess that was the redirect url itself, not the cert | 14:15 |
fungi | for the link on status.o.o | 14:16 |
fungi | oh, right, ianw spotted that the acme challenge cname hadn't been created for it so added that | 14:17 |
fungi | but also noted that the server is still in the emergency disable list so changes to it aren't getting applied | 14:17 |
fungi | and was reluctant to take it out of the emergency disable list with nobody else around | 14:18 |
fungi | corvus: mordred: are we clear to take zuul01.openstack.org back out of the emergency disable list in that case? there's no comment in the file saying why we disabled it and now i can't remember | 14:19 |
corvus | fungi: i think we need https://review.opendev.org/723107 | 14:20 |
corvus | otherwise the next config change will kill geard | 14:20 |
fungi | got it, reviewing | 14:20 |
fungi | ahh, yep, i remember discussing this one | 14:20 |
corvus | so it seems like we can merge that, then take the scheduler out of emergency, then run the letsencrypt playbook? then run the zuul playbook? | 14:21 |
fungi | seems that way to me, i just approved it moments ago | 14:22 |
mordred | yes - I agree with all of that | 14:29 |
fungi | related, corvus: mordred seems to have addressed your comment on 723048 | 14:30 |
fungi | mordred: i had a question on 723048 about use of sighup there... is that just sending hangup to the scheduler pid, and if so shouldn't we use the rpc client instead? | 14:31 |
mordred | fungi: oh - HUP is probably bad there - maybe we don't need to do anything other than having docker-compose shut down the container? | 14:32 |
mordred | ah - graceful stop in the old init script was USR1 | 14:33 |
fungi | yeah, if the goal was to stop the scheduler, then hup is not the thing | 14:33 |
mordred | yeah- lemme update | 14:33 |
corvus | we don't do any graceful stops of the scheduler at the moment, only hard stops | 14:33 |
corvus | mordred: so i think we just want the scheduler to stop in the normal way | 14:34 |
fungi | also usr1 seems unlikely to be something we would want to use anyway | 14:34 |
fungi | because it could take hours to finish | 14:34 |
corvus | yeah that | 14:34 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Rework zuul start/stop/restart playbooks for docker https://review.opendev.org/723048 | 14:34 |
mordred | oh - yeah? ok. me just takes it out | 14:34 |
fungi | though maybe once we have distributed scheduler, it's basically instantaneous/hitless? | 14:34 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Rework zuul start/stop/restart playbooks for docker https://review.opendev.org/723048 | 14:34 |
mordred | how's that look? | 14:35 |
fungi | lgtm | 14:35 |
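The conclusion above is that the playbook should not signal the scheduler at all and should simply let docker-compose stop the container. A rough sketch of that intent; the compose directory and service name here are assumptions for illustration, not the deployed layout:

    # Hypothetical example of a plain (hard) stop of the scheduler container.
    cd /etc/zuul-scheduler
    docker-compose stop scheduler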
*** mlavalle has joined #opendev | 14:41 | |
fungi | clarkb: cacti indicates we had a hard swap event on lists.o.o (severe enough to cause a 15-minute snmp blackout) around 12:20 | 14:46 |
*** iurygregory has quit IRC | 14:47 | |
*** iurygregory has joined #opendev | 14:48 | |
fungi | oom knocked out 9 python processes between 12:26:48 and 12:33:12 | 14:49 |
fungi | probably earlier in fact, that event seems to have overrun the dmesg ring buffer | 14:50 |
fungi | 11 "Killed process" lines recorded to syslog between 12:27:01 and 12:33:30 | 14:51 |
fungi | i guess the timestamps embedded in the kmesg events are off by a bit | 14:52 |
fungi | oh wow, even dstat was stuttering | 14:54 |
fungi | toward the worst, it was only managing to record roughly one snapshot a minute | 14:55 |
clarkb | fungi: did we see mailman qrunner process memory change upwards in that period? | 14:59 |
openstackgerrit | Merged opendev/system-config master: Run smart-reconfigure instead of HUP https://review.opendev.org/723107 | 14:59 |
clarkb | also we should cross check with that robot too maybe? | 14:59 |
fungi | i'm working to understand the fields recorded in the csv | 14:59 |
fungi | looks like the last two fields are process details | 15:00 |
fungi | ahh, no, the final fields are ,"process pid cpu read write","process pid read write cpu","memory process","used","free" | 15:02 |
fungi | i guess those correspond to --top-cpu-adv --top-io-adv --top-mem-adv and so "memory process" is the field we care about there? | 15:02 |
clarkb | Apr 27 12:25:54 is when OOM killer was first invoked looks like | 15:03 |
clarkb | fungi: ya I think memory process is the most important one | 15:03 |
clarkb | the others probably have useful info too like who was busy during the lead up period | 15:03 |
clarkb | fungi: looks like that same bot is active around the OOM | 15:05 |
clarkb | I kinda want to add a robots.txt that tells it to go away and see if we have a behavior change | 15:05 |
fungi | so going into this timeframe, we had 12:20:00 13543 qrunner / 40660992% | 15:05 |
clarkb | fungi: note the % is a bit weird. Its actually just bytes. So thats 40MB ish which isn't bad | 15:06 |
fungi | as of 12:25:32 16053 listinfo / 50327552% | 15:06 |
fungi | and kswapd0 was the most active cpu and i/o consumer | 15:07 |
clarkb | fungi: what that is telling me is we don't have a single process which is loading up on memory. | 15:07 |
clarkb | which makes me more suspicious of apache | 15:07 |
clarkb | fungi: also we seem to be using mpm_worker and not mpm_event in apache | 15:09 |
clarkb | likely a holdover from upgrading that server in place | 15:09 |
clarkb | iirc mpm event is far more efficient memory wise because it doesn't fork for all the things? | 15:09 |
clarkb | maybe we should try switching that over too | 15:09 |
fungi | can't hurt | 15:09 |
fungi | anyway, i'm going to restart all the mailman sites... we talked about wanting a reboot of this server anyway, should i just go ahead and do that? | 15:10 |
fungi | and then set the dstat collection back up (and rotate the old log) | 15:10 |
clarkb | fungi: ya a reboot seems like it would at least help rule out older kernel bugs (if that is a possibility here) | 15:11 |
clarkb | I seem to recall that xenial kernel of some variety didn't handle buffers and caches properly | 15:11 |
clarkb | and then we need to stop apache2, a2dismod mpm_worker, a2enmod mpm_event, start apache? | 15:12 |
clarkb | mordred: maybe we should encode that into unit files so that systemctl works and ansible can just ensure a service state? | 15:13 |
clarkb | (I realize that will take a bit more work to get the systemd incantations correct, but our testing should help with that) | 15:14 |
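The mpm switch clarkb outlines above is a short command sequence on a Debian/Ubuntu host; a sketch (the configtest step is an extra precaution, not something called out in channel):

    # Switch Apache from mpm_worker to mpm_event on lists.openstack.org.
    systemctl stop apache2
    a2dismod mpm_worker
    a2enmod mpm_event
    apachectl configtest   # extra sanity check before starting again
    systemctl start apache2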
fungi | lists.o.o is currently booted with linux 4.4.0-145-generic with an uptime of 380 days and will be booting linux 4.4.0-177-generic | 15:17 |
fungi | i've checked and apt reports no packages pending upgrade | 15:17 |
fungi | reboot underway | 15:17 |
fungi | taking a while to come back up, probably either a pending host migration or just overdue fsck | 15:19 |
openstackgerrit | Merged opendev/base-jobs master: Define an ubuntu-focal nodeset https://review.opendev.org/723527 | 15:19 |
clarkb | fungi: seems like thats pretty normal for us :/ | 15:20 |
fungi | when you go that long between reboots, yes | 15:20 |
fungi | it came back | 15:21 |
fungi | 41 qrunner processes running according to ps | 15:21 |
clarkb | I see a bunch of mailman processes. It looks happy | 15:21 |
fungi | so seems like the sites all started back as expected | 15:21 |
clarkb | fungi: are you wanting to do the apache thing? or should I plan to do that after breakfast? I'm happy either way, just don't want to step on toes | 15:23 |
fungi | #status log lists.openstack.org rebooted for kernel update | 15:23 |
openstackstatus | fungi: finished logging | 15:23 |
fungi | #status log running `dstat -tcmndrylpg --tcp --top-cpu-adv --top-mem-adv --swap --output dstat-csv.log` in a root screen session on lists.o.o | 15:23 |
openstackstatus | fungi: finished logging | 15:23 |
corvus | mordred, fungi: i think we're ready to remove zuul from emergency and run some playbooks? | 15:23 |
fungi | clarkb: i need to switch gears to do some openstack vmt stuff shortly, but can try to get to it later, or we can just observe first and see if the oom situation persists since the reboot | 15:24 |
openstackgerrit | Merged zuul/zuul-jobs master: hlint: add haskell source code suggestions job https://review.opendev.org/722309 | 15:24 |
corvus | i think so, so i'll do that | 15:25 |
*** ysandeep is now known as ysandeep|away | 15:25 | |
fungi | corvus: i think so, 723107 merged ~25 minutes ago | 15:25 |
fungi | thanks! | 15:25 |
corvus | running le playbook now | 15:25 |
fungi | oui, c'est bon (yes, that's good) | 15:27 |
corvus | mordred, fungi, ianw: https://zuul.openstack.org/status lgtm now | 15:30 |
corvus | looks like i don't need to run the zuul service playbook | 15:30 |
fungi | awesome | 15:32 |
clarkb | fungi: ya I'm mostly just suspicious of apache right now given the qrunner sizes don't go up when we oom and we have an indexer bot running through apache at around that same period | 15:35 |
fungi | oh, me too. if you look back at the cacti graphs, once it's able to get snmp responses again the 5-minute load average is still >50 | 15:37 |
fungi | so likely lots and lots of processes | 15:37 |
clarkb | fungi: did my process above look correct to you for using mpm event? I've also double checked the other xenial hosts are using apache + mpm_event and not worker | 15:38 |
fungi | which could be the mta or mailman handling a bunch of messages, but probably it's apache forking | 15:38 |
*** _mlavalle_1 has joined #opendev | 15:38 | |
fungi | clarkb: yeah, i guess the current puppet-mailman isn't picking an mpm for apache and like you say we've inherited a non-default one due to in-place upgrades | 15:39 |
fungi | your command sequence looks right to me | 15:40 |
clarkb | ya I think we've basically just relied on platform defaults. Unfortunately the platform default has stuck around too long in this case | 15:40 |
*** mlavalle has quit IRC | 15:40 | |
clarkb | infra-root ^ I'd like to switch apache2 from mpm_worker to mpm_event on lists.o.o. Plan is stop apache2; a2dismod mpm_worker; a2enmod mpm_event ; start apache2. This gets it in line with our other apache servers. I'll do that shortly after some tea. Let me know if you'd like me to hold off | 15:41 |
corvus | clarkb: ++ | 15:42 |
fungi | thanks clarkb! | 15:42 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Do not set buildset_fact if it's not present in results.json https://review.opendev.org/723524 | 15:46 |
clarkb | alright tea is consumed proceeding now | 15:49 |
clarkb | and done. website seems to respond to my browser | 15:50 |
*** ykarel is now known as ykarel|away | 15:50 | |
clarkb | #status log Updated lists.openstack.org to use Apache mpm_event instead of mpm_worker. mpm_worker was a holdover from doing in place upgrades of this server. All other Xenial hosts default to mpm_event. | 15:51 |
openstackstatus | clarkb: finished logging | 15:51 |
clarkb | fungi: then assuming the OOM persists tomorrow maybe we try a robots.txt and exclude this particular bot? | 15:52 |
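If the OOMs do continue, the robots.txt clarkb mentions would only need a couple of lines; a sketch, with the user-agent as a placeholder since the bot was not named in channel:

    # /robots.txt on lists.openstack.org -- hypothetical example
    User-agent: ExampleIndexerBot
    Disallow: /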
*** sshnaidm is now known as sshnaidm|afk | 15:53 | |
AJaeger | do we still use git0x.openstack.org? https://review.opendev.org/723251 proposes to kill the only place that I could find... | 15:56 |
clarkb | AJaeger: we do not | 15:56 |
clarkb | AJaeger: do you also want to remove git.openstack.org from that list? | 15:57 |
clarkb | its the line above the block you removed | 15:57 |
AJaeger | clarkb: sure, can do... | 15:57 |
AJaeger | I thought that was in use, so was not sure whether we need it... | 15:57 |
clarkb | AJaeger: it exists as a redirect host on static.opendev.org, but I don't think we need cacti data for it | 15:58 |
clarkb | (since it is just a CNAME to static in dns) | 15:58 |
AJaeger | Ah, good | 15:58 |
openstackgerrit | Andreas Jaeger proposed opendev/system-config master: Remove git*.openstack.org https://review.opendev.org/723251 | 15:59 |
AJaeger | clarkb: updated ^ | 15:59 |
*** rpittau is now known as rpittau|afk | 16:07 | |
redrobot | Would love another set of eyes on this change: https://review.opendev.org/#/c/721349/ | 16:08 |
clarkb | corvus: mordred ^ are we good to add new git repos or do zuul things still need updating? | 16:09 |
corvus | clarkb: i think we're good | 16:09 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Do not set buildset_fact if it's not present in results.json https://review.opendev.org/723524 | 16:13 |
mordred | clarkb: yeah- I think we're good | 16:13 |
clarkb | corvus: mordred did you see the thing from ianw yesterday about periodic deploy and on demand deploy undoing each other? | 16:15 |
fungi | example was in that etherpad i linked earlier | 16:15 |
clarkb | https://etherpad.opendev.org/p/DSxEB-ViHzEHDMgxAJDp appears to be where notes were taken | 16:15 |
fungi | https://etherpad.opendev.org/p/DSxEB-ViHzEHDMgxAJDp | 16:15 |
corvus | did it perhaps enqueue the item before 1:23, thereby enqueuing the old ref? | 16:17 |
openstackgerrit | Merged opendev/system-config master: Rework zuul start/stop/restart playbooks for docker https://review.opendev.org/723048 | 16:17 |
*** iurygregory has quit IRC | 16:17 | |
corvus | (rather, not the old ref, but the state of the repo at that point in time) | 16:17 |
clarkb | looks like hourly starts at the top of the hour with a 2 minute jitter | 16:18 |
mordred | clarkb: yeah - I was thinking it might be what corvus said | 16:18 |
clarkb | so ya it would've enqueued before the non hourly job ran at least. Not sure when the non hourly job enqueued though | 16:18 |
corvus | clarkb: well the non-hourly job would have gotten that specific change | 16:18 |
clarkb | should the hourly jobs maybe not use the zuul provided ref? and always update? | 16:19 |
corvus | so basically, we enqueued an hourly job and froze the repo state for it, then probably due to load didn't get around to running it for a while | 16:19 |
mordred | maybe - I believe the intent of the hourly jobs is "run with the tip of master when you run" as opposed to "run with the tip of master when you are enqueued" - so maybe putting in a flag we can set on the hourly pipeline invocation of the job that would cause the playbooks to do a pull from opendev first? | 16:20 |
mordred | (the maybe there is in response to clarkb's "should the hourly jobs...") | 16:20 |
openstackgerrit | Merged openstack/project-config master: Revert "Disable ovn pypi jobs temporarily" https://review.opendev.org/723073 | 16:21 |
fungi | or should there be a way to tell zuul you want timer triggered jobs to have their heads resolved when started rather than when enqueued? that may be tough to pull off though | 16:22 |
*** hrw has quit IRC | 16:22 | |
corvus | yeah, that's an intentional design decision to ensure that all jobs in a buildset run with the same repo state | 16:22 |
*** elod has quit IRC | 16:22 | |
*** hrw has joined #opendev | 16:22 | |
mordred | in a magical world it would be neat to be able to have a periodic pipeline that only triggers if there has been no corresponding activity in a different pipeline for X duration. I have no idea what that would look like, and would probably require v4 and required db | 16:23 |
corvus | so i don't think changing zuul is appropriate here | 16:23 |
*** elod has joined #opendev | 16:23 | |
fungi | i just get a little twitchy with jobs working around zuul's git handling, but maybe this is one of those circumstances where it's the better solution | 16:24 |
mordred | corvus: what do you think about a "pull latest from opendev" flag for the run-base playbook? | 16:24 |
corvus | maybe a pull in the job is the best workaround here -- other than minimizing what we actually need the hourly pipeline for | 16:24 |
corvus | (eventually, it should just be for letsencrypt, right?) | 16:24 |
mordred | corvus: I think we mostly have hourly pipeline for things that are using images but that we don't have a way to trigger otherwise | 16:24 |
mordred | so that we don't have to wait for a day to pick up a new zuul image or similar | 16:24 |
mordred | but I agree with the goal - it would be great to have only LE in there | 16:25 |
mordred | I can work up a "pull from upstream" flag if we think that's an ok workaround for now | 16:25 |
corvus | sounds reasonable to me | 16:25 |
fungi | yeah, it seems like the most straightforward solution at this point | 16:26 |
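A rough idea of what that "pull from upstream" flag could look like as a task in the run-base playbook; the variable name and destination path here are hypothetical, not the change mordred went on to write:

    # Hypothetical task: refresh system-config on the bridge before the rest of
    # the playbook runs, but only when the hourly pipeline sets the flag.
    - name: Pull latest system-config from opendev
      git:
        repo: https://opendev.org/opendev/system-config
        dest: /home/zuul/src/opendev.org/opendev/system-config
        version: master
        force: true
      when: pull_system_config_from_upstream | default(false) | bool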
*** tobiash has quit IRC | 16:26 | |
*** prometheanfire has quit IRC | 16:26 | |
*** calcmandan has quit IRC | 16:26 | |
*** noonedeadpunk has quit IRC | 16:26 | |
*** jkt has quit IRC | 16:26 | |
*** dirk has quit IRC | 16:26 | |
*** AJaeger has quit IRC | 16:26 | |
mordred | kk | 16:26 |
clarkb | wfm | 16:27 |
*** tobiash has joined #opendev | 16:32 | |
*** prometheanfire has joined #opendev | 16:32 | |
*** calcmandan has joined #opendev | 16:32 | |
*** noonedeadpunk has joined #opendev | 16:32 | |
*** jkt has joined #opendev | 16:32 | |
*** dirk has joined #opendev | 16:32 | |
*** AJaeger has joined #opendev | 16:32 | |
fungi | yoctozepto: fdegir: did your git problems with opendev.org repos persist into today or did they mysteriously clear up? | 16:36 |
yoctozepto | fungi: I didn't do much today regarding opendev.org clone/pull operations so hard to tell; assume they did ;-) | 16:37 |
*** ChanServ has quit IRC | 16:42 | |
fungi | yoctozepto: thanks, hopefully it was just some temporary network problem somewhere out on the internet | 16:42 |
*** ChanServ has joined #opendev | 16:45 | |
*** tepper.freenode.net sets mode: +o ChanServ | 16:45 | |
*** _mlavalle_1 has quit IRC | 17:09 | |
*** mlavalle has joined #opendev | 17:11 | |
openstackgerrit | Merged openstack/project-config master: Define stable cores for horizon plugins in neutron stadium https://review.opendev.org/722682 | 17:16 |
openstackgerrit | Merged openstack/project-config master: Add Portieris Armada app to StarlingX https://review.opendev.org/721343 | 17:16 |
*** tobiash has quit IRC | 17:26 | |
*** prometheanfire has quit IRC | 17:26 | |
*** calcmandan has quit IRC | 17:26 | |
*** noonedeadpunk has quit IRC | 17:26 | |
*** jkt has quit IRC | 17:26 | |
*** dirk has quit IRC | 17:26 | |
*** AJaeger has quit IRC | 17:26 | |
*** tobiash has joined #opendev | 17:29 | |
*** prometheanfire has joined #opendev | 17:29 | |
*** calcmandan has joined #opendev | 17:29 | |
*** noonedeadpunk has joined #opendev | 17:29 | |
*** jkt has joined #opendev | 17:29 | |
*** dirk has joined #opendev | 17:29 | |
*** AJaeger has joined #opendev | 17:29 | |
openstackgerrit | Merged openstack/project-config master: Add ansible role for managing Luna SA HSM https://review.opendev.org/721349 | 17:29 |
fdegir | fungi: i noticed similar issues today as well so I had to switch to mirrors | 17:38 |
*** ChanServ has quit IRC | 17:39 | |
fungi | fdegir: and this was cloning over https via ipv4? | 17:40 |
*** ChanServ has joined #opendev | 17:41 | |
*** tepper.freenode.net sets mode: +o ChanServ | 17:41 | |
fdegir | fungi: yes and i just started another set of clones manually right now and it's hanging - will probably timeout | 17:42 |
fdegir | Cloning into 'shade'... | 17:42 |
fdegir | and just waits | 17:42 |
fungi | i'll switch to trying shade in that case. and see about forcing my testing to go on ipv4 instead of ipv6 | 17:43 |
fdegir | fungi: as i noted yesterday, it could be another repo next time | 17:44 |
fdegir | fatal: unable to access 'https://opendev.org/openstack/shade/': Failed to connect to opendev.org port 443: Connection timed out | 17:44 |
fungi | looks like my git client is new enough to support `git clone --ipv4 ...` | 17:44 |
fdegir | fungi: testing the repos bifrost clones during its installation: https://opendev.org/openstack/bifrost/raw/branch/master/playbooks/roles/bifrost-prep-for-install/defaults/main.yml | 17:44 |
fdegir | *_git_url | 17:45 |
fdegir | now requirements hanging | 17:45 |
fungi | i've got a loop going on my workstation now like `while git clone --ipv4 https://opendev.org/openstack/shade;do rm -rf shade;done` | 17:45 |
fdegir | fungi: if it helps, i can keep this thing running and you can look at logs | 17:46 |
fdegir | i can pass my public ip if it helps | 17:46 |
fungi | fdegir: yes, i can check our load balancer for any hits from your ip address, though if a connection failed to reach the load balancer that will be hard to spot | 17:46 |
fungi | ideally devstack is timestamping when it tries to clone | 17:47 |
fdegir | fungi: we don't use devstack | 17:47 |
fungi | ahh, okay, the other problem report was from a devstack user | 17:48 |
fdegir | fungi: yes - seeing that made me realize it may be an issue on gerrit side | 17:48 |
fdegir | originally i thought i had issues but that bug report made me report as well | 17:48 |
fungi | if you have a timestamp for when one of the failed clone commands was attempted i can hopefully work out whether any connections arrived at the load balancer from you at that time | 17:48 |
fungi | i have exact times for every request which reached the lb from you and what backend they were directed to | 17:49 |
fdegir | fatal: unable to access 'https://opendev.org/openstack/requirements/': Operation timed out after 300029 milliseconds with 0 out of 0 bytes received | 17:49 |
fungi | but obviously if a connection attempt doesn't reach us that won't be logged at our end | 17:49 |
fdegir | i don't have timestamps as we didn't enable timestamping on our jenkins | 17:50 |
fdegir | and can't check the slaves since we use openstack single use slave | 17:51 |
fdegir | Cloning into 'python-ironicclient'... | 17:52 |
fdegir | so it is totally random | 17:52 |
fungi | these are the connections haproxy saw from your ip address: http://paste.openstack.org/show/792771 | 17:52 |
fdegir | need to have dinner now | 17:53 |
fdegir | will be back later tonight | 17:53 |
fungi | cool, thanks! | 17:53 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Do not set buildset_fact if it's not present in results.json https://review.opendev.org/723524 | 17:53 |
fungi | looks like those are all being directed to gitea04.opendev.org, though i may have trouble mapping them to specific requests as haproxy just operating as an osi layer 4 socket proxy isn't doing any transparent forwarding | 17:54 |
fungi | so far my shade clone loop isn't encountering any issues, but i'll switch to hitting gitea04 directly to see if that has anything to do with it | 17:55 |
clarkb | fungi: note that you will only hit a single backend using that url | 17:55 |
fungi | right | 17:56 |
fungi | now trying this: while git clone --ipv4 https://gitea04.opendev.org:3000/openstack/shade;do rm -rf shade;done | 17:56 |
clarkb | and ya mapping onto gitea side requests can be a pain | 17:57 |
*** factor has quit IRC | 17:58 | |
clarkb | fungi: gitea04 shows things like [E] Fail to serve RPC(upload-pack): exit status 128 - fatal: the remote end hung up unexpectedly | 17:58 |
fungi | fdegir: when you're back from dinner, it might be helpful if you could try with a simple reproducer like that from the network where you're seeing that, and then maybe also try from another location if you can, so we can tell whether it's specific to the location you're coming from. if it is, then we can compare traceroutes in both directions and possibly start to work on correlating where the problem might be | 17:59 |
dpawlik | hi. If we would like to switch openstack/validations-common from testr to stestr (https://review.opendev.org/#/c/723529/), the requirements-check CI job raises an error that stestr is not found in lower-constraints. Is there something else that I need to configure or just add stestr==3.0.1 to lower-constraints.txt ? | 17:59 |
fungi | clarkb: yeah, that could be due to a number of reasons, sounds like typical premature socket termination | 17:59 |
fungi | dpawlik: i'm not sure, you may be better off asking in #openstack-requirements as it's probably more on topic there | 18:00 |
clarkb | fungi: fdegir the other thing that might be useful is talking to a backend (or all 8 backends) directly | 18:00 |
dpawlik | thank you fungi | 18:00 |
clarkb | they all have valid tls certs and are exposed publicly; | 18:00 |
fungi | clarkb: yeah, that's what i'm suggesting, test in a loop to one backend directly so we can rule out the source hash directing some clients to good backends and others to bad | 18:01 |
fungi | the while shell loop i pasted just above is exactly that | 18:01 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66700&rra_id=all shows spikes in failed tcp connection attempts which may be related | 18:02 |
clarkb | asymmetric routing could cause that | 18:02 |
fungi | fwiw, i've been running that continuously for nearly 10 minutes, and it's taking roughly 10 seconds each iteration, so far no errors | 18:03 |
clarkb | fungi: the other upside to directly connecting to the backends is we'll be able to filter logs for that more easily | 18:03 |
fungi | exactly | 18:03 |
*** factor has joined #opendev | 18:04 | |
fungi | also the ip address fdegir provided me looks like it's in the citynetwork.se kna3 pop | 18:06 |
*** mehakmittal has joined #opendev | 18:06 | |
fungi | or at least that's the subdomain in reverse dns on the last named core router in my traceroutes to it, but there are a couple hops after that with no ptr records on their serial interfaces | 18:07 |
clarkb | `while true; do for X in `seq 1 8` ; do echo $X ; rm -rf shade-gitea0$X && git clone https://gitea0$X.opendev.org:3000/openstack/shade shade-gitea0$X ; done ; done` I'm running that now just to see if I can trip it. note you should set -x it | 18:07 |
clarkb | er set -e | 18:07 |
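For anyone re-running that check later, here is the same loop with the set -e correction applied, formatted as a small script (same behaviour, just tidier):

    #!/bin/bash
    # Repeatedly clone the shade repo from each of the eight gitea backends;
    # abort immediately if any clone fails.
    set -e
    while true; do
      for X in $(seq 1 8); do
        echo "gitea0$X"
        rm -rf "shade-gitea0$X"
        git clone "https://gitea0$X.opendev.org:3000/openstack/shade" "shade-gitea0$X"
      done
    done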
clarkb | fungi: we have a mirror in that location we could try to reprodcue from there | 18:08 |
fungi | my thoughts as well | 18:08 |
fungi | note that fdegir was seeing it over ipv4, so you might want to add --ipv4 to your git clone command to reproduce faithfully | 18:08 |
clarkb | fungi: I've only got ipv4 locally | 18:09 |
fungi | oh, :( for you | 18:09 |
clarkb | if I want ipv6 I haev to explicitly bounce through my ipv6 cloud node | 18:09 |
fungi | my backwater cable provider has finally managed to do a decent job with v6 prefix delegation over dhcp6 | 18:10 |
clarkb | fungi: my ISP just got bought (effective may 1st iirc). The new company is soliciting questions about the move and I asked if they planned to roll out ipv6. The answer I got did not give me confidence they even know what ipv6 is | 18:10 |
clarkb | no errors yet in that for loop. I'm going to knead some bread now | 18:11 |
fungi | have fun! | 18:11 |
fungi | looks like our mirror server is in kna1 not kna3, but maybe they share the same core | 18:12 |
fungi | testing from mirror01.kna1.citycloud.openstack.org | 18:14 |
fungi | interesting, that server routes outbound from an rfc-1918 address, presumably through a fip | 18:15 |
fungi | i've got a clone loop of shade from gitea04 underway on it now, seeing around 4s for each clone to complete | 18:16 |
fungi | ooh! i've hit it!!! | 18:17 |
fungi | this definitely seems to be client location specific | 18:17 |
fungi | i'll reproduce again with a brief sleep between attempts and some timestamping | 18:18 |
*** mehakmittal has quit IRC | 18:20 | |
*** mehakmittal has joined #opendev | 18:21 | |
fungi | i've got this running in a root screen session on mirror01.kna1.citycloud.openstack.org now: while :;do sleep 10;echo -n 'start ';date -Is;git clone https://gitea04.opendev.org:3000/openstack/shade;echo -n 'end ';date -Is;rm -rf shade;done | 18:22 |
fungi | the spacing should make it easy to find in gitea's web log | 18:22 |
*** muskan has joined #opendev | 18:22 | |
fungi | i also confirmed the timestamps on that server seem to be accurate | 18:23 |
fungi | an attempt to clone just now at 2020-04-27T18:23:35 seems to be hanging | 18:23 |
fungi | yeah, still hanging, this is good! | 18:24 |
fungi | doing `docker-compose logs|grep 91.123.202.253` as root on gitea04 now with pwd /etc/gitea-docker | 18:27 |
fungi | hopefully that's the correct thing | 18:27 |
fungi | clone started at 18:25:53 is still hanging | 18:27 |
clarkb | neat | 18:28 |
clarkb | fungi: do you see it show up on the gitea side? | 18:28 |
fungi | also started a ping from citycloud to opendev.org to see if there's any obvious packet loss | 18:28 |
clarkb | my clone loop from home is still running successfully | 18:28 |
fungi | gitea-web_1 | [Macaron] 2020-04-27 18:23:20: Started POST /openstack/shade/git-upload-pack for 91.123.202.253 | 18:28 |
fungi | that's the last recorded entry for 91.123.202.253 | 18:29 |
fungi | i wonder if the timestamps from gitea are accurate | 18:29 |
clarkb | ok so we get far enough to start the upload-pack but then packets maybe disappear? we can tcpdump those to see what is going on at a lower level maybe? | 18:29 |
clarkb | fungi: it records the start and end timestamps | 18:29 |
clarkb | as separate entries | 18:29 |
fungi | there was a clone from that address which started at 2020-04-27T18:23:35 and ended at 2020-04-27T18:25:43 | 18:29 |
fungi | and was successful | 18:30 |
clarkb | fungi: also maybe have mirror.kna1 fetch resources from mirror.sjc1? | 18:30 |
clarkb | fungi: and see if we can get it to fail doing more basic http requests | 18:30 |
fungi | no, my bad, that one timed out | 18:30 |
fungi | last successful clone started 2020-04-27T18:23:18 and ended at 2020-04-27T18:23:25 | 18:30 |
fungi | so i think the connection is never established | 18:31 |
fungi | i'll switch to tcpdump next | 18:31 |
yoctozepto | fungi: actually kolla CI had issues with opendev: " \"msg\": \"Failed to download remote objects and refs: fatal: unable to access 'https://opendev.org/openstack/ironic-python-agent-builder/': Failed to connect to opendev.org port 443: Connection timed out\\n\"", | 18:32 |
yoctozepto | Mon Apr 27 12:45:20 2020 | 18:32 |
clarkb | yoctozepto: its likely the same issue if its some transatlantic routing problem (or similar) | 18:32 |
fungi | yoctozepto: that (connection timed out) sounds like what we're seeing then | 18:32 |
clarkb | but also please don't talk to gitea in zuul jobs | 18:32 |
clarkb | zuul should provide everything you need | 18:32 |
fungi | the launchpad bug opened for devstack yesterday indicated a "connection refused" error | 18:32 |
fungi | yoctozepto: but since the job did connect to opendev, can you let us know where that failure ran? | 18:33 |
yoctozepto | e90b15791a067a4e6e54-7143c90e898b1b306bc3770ac4d2d8a8.ssl.cf2.rackcdn.com | 18:33 |
yoctozepto | oops | 18:33 |
yoctozepto | https://zuul.opendev.org/t/openstack/build/c0c89c350cb242d7abed88e80de32984 | 18:33 |
clarkb | that job ran in kna1 too | 18:34 |
clarkb | so ya likely the same issue | 18:34 |
*** iurygregory has joined #opendev | 18:34 | |
fungi | provider: airship-kna1 | 18:34 |
fungi | yup | 18:34 |
fungi | starting to suspect this may be a citynetwork issue | 18:35 |
clarkb | I'm going to stop my local clones now that we have narrowed this down with an ability to debug | 18:35 |
clarkb | my local clones did not have any problem | 18:35 |
clarkb | fungi: are you testing kna to all gitea backends or just 04? | 18:35 |
fungi | clarkb: just gitea04 | 18:36 |
clarkb | fungi: might be worth checking if it is all 8 (if its a bitmask problem or something like that then some may work while others dont) | 18:36 |
fungi | yeah, have definitely seen that in the past when you have flow-based distribution routing hashed on addresses and one of your cores is blackholing stuff | 18:37 |
fungi | okay, i have tcpdump running in a root screen session on gitea04 streaming to stdout and filtering for the kna1 mirror's ip address | 18:38 |
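That capture amounts to something along these lines; the interface name is an assumption, and the filter address is the kna1 mirror's fip mentioned above:

    # On gitea04: watch for any traffic from the kna1 mirror hitting the gitea port.
    tcpdump -n -i eth0 host 91.123.202.253 and tcp port 3000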
yoctozepto | clarkb: ack, it's actually bifrost that talked to it and we have little control over it (it is to be deprecated and replaced by a kolla-containerised solution when time allows - hopefully soon) | 18:38 |
fungi | assuming this is reproducible, mnaser may want to get in touch with the network folks at citynet | 18:39 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: tox: Use 'block: ... always: ...' instead of ignore_errors https://review.opendev.org/723640 | 18:39 |
fungi | they'll probably have a faster time of working out the connectivity issues | 18:39 |
fungi | tcpdump is definitely capturing packets on successful clone runs | 18:40 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: ensure-sphinx: use failed_when: false instead of ignore_errors: true https://review.opendev.org/723642 | 18:41 |
fungi | as soon as i snag another hung clone i'll be able to work out whether the tcp/syn ever arrived. if it does and a response is generated, i'll probably need to start up a similar tcpdump on the mirror server to see if the syn+ack ever arrives | 18:41 |
fungi | okay, caught one | 18:42 |
fungi | start 2020-04-27T18:41:46+00:00 | 18:42 |
fungi | and last packet to arrive at gitea04 was 18:42:01.640261 IP 38.108.68.147.3000 > 91.123.202.253.39696 (end of the previous completed clone) | 18:43 |
clarkb | fungi: so not even getting the SYN | 18:44 |
fungi | that's how it's looking to me | 18:44 |
fungi | my 1k echo slow ping is just about to wrap up and i can get some icmp delivery stats | 18:44 |
fungi | 1000 packets transmitted, 1000 received, 0% packet loss, time 999929ms, rtt min/avg/max/mdev = 176.894/177.360/267.018/3.398 ms | 18:45 |
fungi | so icmp doesn't seem impacted | 18:45 |
fungi | i could probably install hping or something to do syn/syn+ack pings but may be best if we just hand this off to mnaser and whoever we usually talk to at citycloud | 18:46 |
*** dpawlik has quit IRC | 18:47 | |
fungi | though first i guess we can try some connections to other places from citycloud if we want | 18:47 |
clarkb | fungi: ya I think so. Especially since its the initial SYN disappearing | 18:47 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fo: Use 'block: ... always: ...' and failed_whne instead of ignore_errors https://review.opendev.org/723643 | 18:47 |
fungi | maybe best to do an easier reproducer with nc or something | 18:47 |
clarkb | or even just ping? | 18:47 |
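A simpler reproducer in the spirit of fungi's nc suggestion could be a bare IPv4 TCP connect in a loop; a sketch (the timeout and sleep values are arbitrary):

    # Attempt a plain TCP connection to the gitea backend every few seconds and
    # note any attempt that fails to establish within 10 seconds.
    while :; do
      date -Is
      timeout 10 nc -4 -z gitea04.opendev.org 3000 || echo "connect failed"
      sleep 5
    done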
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: go: Use 'block: ... always: ...' and failed_when instead of ignore_errors https://review.opendev.org/723643 | 18:47 |
mnaser | let me try and ping people.. | 18:48 |
clarkb | fungi: fwiw tobias is usually who I email | 18:48 |
fungi | thanks mnaser! i know you know some folks there | 18:48 |
clarkb | and mnaser is always around :) | 18:48 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: ara-report: use failed_when: false instead of ignore_errors: true https://review.opendev.org/723644 | 18:49 |
fungi | yeah, i don't usually see tobberydberg around in irc | 18:50 |
fungi | oh, he's actually in #openstack-infra at the moment | 18:51 |
fungi | but anyway, sounds like maybe mnaser has this well in hand | 18:51 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: k8-logs: use failed_when: instead of ignore_errors: https://review.opendev.org/723647 | 18:51 |
mnaser | anything in specific i can forward? | 18:51 |
mnaser | it looks like hitting opendev.org is timing out? | 18:51 |
fungi | mnaser: we're getting reports from users of citycloud (including ourselves) that a small percentage of tcp connections from kna to servers we have in your sjc location have their initial tcp/syn packet never make it | 18:53 |
fungi | the result is "connection timed out" for some tcp sockets | 18:53 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: container-logs: use failed_when: instead of ignore_errors: https://review.opendev.org/723648 | 18:54 |
fungi | generally manifesting so far in `git clone` connections for the opendev.org gitea load balancer (though we've reproduced it with direct connections to the backend as well) | 18:54 |
fungi | an example is 91.123.202.253 in citycloud (a fip for 10.0.1.9) stalls attempting to establish a socket to 38.108.68.147 3000/tcp | 18:55 |
fungi | most connections attempts are fine, but sometimes the initial tcp/syn packet from 91.123.202.253 never makes it to 38.108.68.147 according to tcpdump listening on the destination | 18:56 |
mnaser | fungi: wonderful, thank you, i handed that over | 18:56 |
fungi | thanks! | 18:56 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: tox: Use 'block: ... always: ...' instead of ignore_errors https://review.opendev.org/723640 | 18:57 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: ensure-sphinx: use failed_when: false instead of ignore_errors: true https://review.opendev.org/723642 | 18:57 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: go: Use 'block: ... always: ...' and failed_when instead of ignore_errors https://review.opendev.org/723643 | 18:57 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: ara-report: use failed_when: false instead of ignore_errors: true https://review.opendev.org/723644 | 18:57 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: fetch-subunit-output: use failed_when: instead of ignore_errors: https://review.opendev.org/723653 | 18:57 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: add-build-sshkey: use failed_when: instead of ignore_errors: https://review.opendev.org/723654 | 18:57 |
mnaser | i dont have an ack from them but its 9pm-ish so | 18:57 |
fungi | yep, it's likely not urgent | 18:57 |
fungi | fdegir: ^ to summarize, we think there's something going on between citycloud kna and vexxhost sjc, likely close to (or maybe even inside) the citycloud side of the connection | 18:58 |
fungi | mainly because i've so far been unable to reproduce from elsewhere, though i'll try to test from citycloud lon and ovh gra just to get some more transatlantic datapoints | 19:00 |
fdegir | thanks for the debugging fungi | 19:04 |
fdegir | i was thinking london but instead try frankfurt and stockholm regions | 19:05 |
fdegir | plus the us one perhaps | 19:05 |
fungi | we conveniently already have servers in kna and lon which is why i tested those | 19:05 |
fdegir | ok | 19:06 |
fungi | so far i'm not able to reproduce from citycloud lon nor from ovh gra | 19:06 |
fungi | so i have doubts it's a general transatlantic issue | 19:06 |
fungi | were your connections coming from directly-addressed servers, or through a (layer 3 or 4) nat? | 19:07 |
fungi | all our systems in citycloud are behind fips, so that could be a common factor too | 19:07 |
fdegir | same as your systems | 19:07 |
fdegir | we are running in kna as well | 19:07 |
fungi | yeah, so *could* just be their nat layer is overrun in that pop | 19:07 |
fungi | and some new flows are getting dropped | 19:08 |
fungi | fdegir: if you have quota you can shift to one of their other pops, that might be a workaround for you | 19:09 |
openstackgerrit | Merged zuul/zuul-jobs master: fetch-sphinx-tarball: use remote_src true https://review.opendev.org/721237 | 19:09 |
fdegir | fungi: i think we do and can try moving to london | 19:10 |
fungi | fdegir: if that solves it for you, that'll also be a useful datapoint for us | 19:10 |
fdegir | fungi: this was really helpful as i was puzzled and searching opendev/openstack-infra maillists to see if there was a planned maintenance | 19:10 |
fungi | i'm putting together bidirectional traceroutes now to see if they're symmetrical | 19:11 |
fdegir | fungi: will let you know when i do that but it may not happen tomorrow | 19:11 |
clarkb | fungi: the nat on our mirrors is 1:1 | 19:12 |
fungi | clarkb: yep, but it may very well be the same systems doing the binat and the overload pat | 19:13 |
clarkb | but I suppose if global tables are full that won't help much | 19:13 |
*** muskan has quit IRC | 19:14 | |
fungi | both vexxhost and citynetwork seem to be peering with cogent and preferring them, though from kna3 the traceroute seems to go through citynetwork sto2/cogent sto03 peering, while on the way back from vexxhost packets are arriving at the cogent lon01/citynetwork lon1 peering and then traverse sto2 to kna3 | 19:17 |
fungi | so basically symmetrical on the vexxhost end but somewhat asymmetric on the citynetwork end | 19:19 |
openstackgerrit | Merged zuul/zuul-jobs master: fetch-sphinx-tarball: Do not keep owner of archived files https://review.opendev.org/721248 | 19:20 |
fungi | testing with our mirror in lon1, routing is (unsurprisingly) fully symmetrical at least to the pop level | 19:20 |
fungi | so this suggests the problem is likely in citynetwork kna3 or sto2 | 19:20 |
fungi | or possibly cogent sto03 | 19:21 |
fungi | given i can't reproduce the issue from lon, which is following basically the same routes through cogent's core | 19:22 |
*** mehakmittal has quit IRC | 19:22 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Set owner to executor user https://review.opendev.org/701381 | 19:28 |
mnaser | fungi: seems like they were having some similar issues -- "As for traffic to opendev.org regardless of which transit provider I push traffic through there is packetloss far out on the net. So they are most likely having issues (or their transit(s))" | 19:40 |
mnaser | we use cogent at sjc1 so i guess that theory may add up | 19:40 |
mnaser | there is other transit but, yeah. | 19:41 |
mnaser | only reported issue is "Welcome to the Cogent Communications status page. Some customers may be seeing latency between Singapore and Hong Kong due to a submarine fiber issue. At this time there is no ETR. The ticket for this issue is HD11103479." | 19:42 |
mnaser | fungi: can we have mtr running from kna1, that info may be useful to reach out to transit | 19:43 |
clarkb | mnaser: do you want the pings too or is a simple traceroute sufficient? | 19:43 |
mnaser | clarkb: traceroutes is usually what makes transit providers happy | 19:44 |
clarkb | well thats a first, traceroute isn't installed but mtr is | 19:46 |
mnaser | heh | 19:46 |
clarkb | mnaser: http://paste.openstack.org/show/792777/ | 19:49 |
clarkb | that 486 rtt to first router is rough | 19:51 |
openstackgerrit | Merged zuul/zuul-jobs master: tox: allow running default envlist in tox https://review.opendev.org/721796 | 19:56 |
openstackgerrit | Merged opendev/gerritlib master: Use ensure-* roles https://review.opendev.org/719404 | 20:02 |
fungi | clarkb: an mtr in the other direction would probably also be good | 20:08 |
fungi | sometimes when you see a jump like that, it's the point of convergence for an asymmetric route where the return path is going through a significant latency increase somewhere else many hops out | 20:09 |
clarkb | fungi: k let me install mtr on opendev lb | 20:09 |
clarkb | it doesn't have traceroute either | 20:09 |
fungi | though also since the hops after that one are lower latency, it could just be that router is under load and deprioritizing icmp messages | 20:10 |
fungi | not at all uncommon | 20:10 |
fungi | especially since it looks like it's probably their datacenter distribution layer | 20:11 |
clarkb | fungi: mnaser http://paste.openstack.org/show/792779/ the other direction | 20:16 |
clarkb | I used traceroute there because I had to install either it or mtr and mtr has a million deps | 20:16 |
corvus | clarkb: are we running an apache on zuul01 now? | 20:34 |
clarkb | corvus: system-config/playbooks/roles/zuul-web seems to imply we are but I haven't double checked yet | 20:36 |
clarkb | yup seems that we are | 20:36 |
clarkb | I'm thinking maybe we want to compress the javascript html and css resources | 20:36 |
corvus | clarkb: it looks like apache is still configured to serve out of /opt/zuul-web-content | 20:37 |
corvus | which is making me wonder if we're positive anything has changed? | 20:38 |
clarkb | corvus: the main reason I thought we had changed was the headers for main.js in my browser came from cherrypy | 20:40 |
clarkb | it sets the server: header | 20:40 |
clarkb | we rewrite /.* to localhost:9000/.* | 20:41 |
clarkb | also it doesn't seem like the deflation of status.json is actually working. If I request it with accept-encoding: deflate set I get back plain text | 20:41 |
clarkb | this might need a bit more in depth debugging | 20:42 |
corvus | hrm, the timestamps on the apache config files are old though | 20:42 |
corvus | clarkb: i don't see "/.* to localhost:9000/.*" | 20:43 |
clarkb | corvus: thats in the zuul role | 20:43 |
clarkb | corvus: 000-default.conf seems to be where we write that too | 20:43 |
clarkb | and since it comes before the other files it wins? I think we should maybe remove the old files if they are no longer expected to be valid (to reduce confusion) | 20:44 |
corvus | oooooh | 20:44 |
corvus | yes those are brand new | 20:44 |
corvus | this is a very confusing situation | 20:44 |
clarkb | I agree | 20:44 |
corvus | diff 40-zuul.opendev.org.conf 000-default.conf | 20:45 |
corvus | that seems to suggest we have indeed lost some features | 20:45 |
clarkb | corvus: if I'm reading it correctly I think a big change is going to cherrypy for all requests | 21:04 |
clarkb | which I think is desirable, we wanted to stop consuming the js tarball, but maybe we need to figure out how to make that more efficient (better js compiles, compression, etc) | 21:04 |
clarkb | corvus: I think the /api/status caching is all wrong now that zuul's api has been redone too? | 21:05 |
corvus | clarkb: yeah, i don't think anything has to be different than before; apache as a reverse proxy should be able to cache the data, it should be served by cherrypy with correct headers | 21:06 |
corvus | so i guess we need to identify what we think is different or should be improved and see if we can improve the apache config to make that happen | 21:06 |
clarkb | corvus: for caching I think its just the path | 21:09 |
clarkb | its /api/tenant/.*/status now iirc | 21:09 |
*** jrichard has joined #opendev | 21:11 | |
clarkb | testing status retrieval in my browser it is coming back as gzip according to headers | 21:13 |
clarkb | so the DEFLATE may be working with gzip and not deflate | 21:13 |
clarkb | aha thats normal because apache | 21:14 |
clarkb | that gives me an idea for an improvement here one moment please | 21:14 |
clarkb | corvus: it doesn't look like cherrypy is setting content-type on static files it is serving | 21:16 |
clarkb | corvus: but if it were we could do something like: AddOutputFilterByType DEFLATE application/json text/css text/javascript application/javascript | 21:17 |
clarkb | I'll go ahead and push ^ up as well as caching improvements then if cherrypy starts doing that we'll be ready for it | 21:17 |
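Pulled together, the vhost tweaks being discussed might look roughly like the following; this is a sketch only (the cache directive and paths are illustrative, not the deployed 000-default.conf, and it assumes mod_deflate, mod_cache_disk, mod_rewrite and mod_proxy are enabled):

    # Compress API and static asset responses by content type.
    AddOutputFilterByType DEFLATE application/json text/css text/javascript application/javascript

    # Cache API responses; the interesting path is now /api/tenant/<tenant>/status.
    CacheEnable disk "/api"

    # Proxy everything else through to zuul-web (cherrypy) on localhost.
    RewriteEngine On
    RewriteRule ^/(.*)$ http://localhost:9000/$1 [P,L]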
jrichard | My change ( https://review.opendev.org/#/c/721343/ ) went in today to create the starlingx/portieris-armada-app repo, but I don't see it under https://zuul.opendev.org/t/openstack/projects . Do I need to do something else to add the project there? | 21:17 |
clarkb | jrichard: no, we've been having some issues with config management that we thought were addressed but that indicates it probably isn't yet | 21:18 |
clarkb | jrichard: is the project in gerrit? | 21:19 |
clarkb | yes looks like gitea and gerrit are happy so its just the zuul config reload that isn't firing properly | 21:19 |
corvus | looks like it ran manage-projects and puppet-else but not zuul | 21:19 |
clarkb | mordred: corvus ^ fyi I know you were looking at that | 21:19 |
corvus | i don't think i was looking at that but i can | 21:20 |
jrichard | I do see it in gerrit. Is there anything I can do now to get it added there? | 21:22 |
corvus | clarkb, mordred: a cursory look makes me think that project-config is just configured to run remote-puppet-else and hasn't been updated to run service-zuul | 21:22 |
clarkb | corvus: I was assuming it was related to the sighup thing but I guess you think its earlier in the stack (not firing the job at all?) | 21:23 |
corvus | clarkb: yeah, sighup should be fixed; i'll see about making a change to the job config | 21:26 |
corvus | all of the job descriptions say "Run the playbook for the docker registry." | 21:27 |
corvus | i feel like those could be more correct | 21:27 |
corvus | clarkb: i'm really looking forward to your reorg patch | 21:28 |
clarkb | corvus: ya I'll need to resurrect that once the dust has settled on zuul and nodepool and codesearch and eavesdrop | 21:29 |
clarkb | I think nodepool is the last remaining set of services? | 21:29 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Improve zuul-web apache config https://review.opendev.org/723711 | 21:29 |
clarkb | thats the first bit in making performance better I think | 21:29 |
redrobot | Hmm... I don't think Zuul is picking up this patch to a new repo? https://review.opendev.org/#/c/723692/ Maybe I missed something? 🤔 | 21:30 |
redrobot | I had to add Zuul to reviewers manually | 21:30 |
redrobot | but I don't think that helped, hehe | 21:30 |
clarkb | redrobot: its the same issue jrichard has but against a different new repo | 21:31 |
clarkb | redrobot: we basically haven't signalled zuul to let it know there are new projects | 21:32 |
redrobot | clarkb, gotcha. Thanks! | 21:32 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Clean up some job descriptions https://review.opendev.org/723717 | 21:40 |
openstackgerrit | James E. Blair proposed openstack/project-config master: Run the zuul service playbook on tenant changes https://review.opendev.org/723718 | 21:43 |
corvus | clarkb, mordred: ^ i think that should fix the issue redrobot and jrichard observed | 21:44 |
clarkb | looking | 21:45 |
clarkb | corvus: also fwiw I've read up on zuul's cherrypy static file serving and it should look up mimetypes by file extension | 21:45 |
corvus | it looks like the tenant config is in place, so i will manually run a smart-reconfigure | 21:45 |
clarkb | corvus: I think maybe having two .'s in the file extensions like we do with our js may confuse it? I need to set up a test for that | 21:45 |
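One quick way to test that hypothesis from a shell is Python's standard mimetypes lookup (the hash in the filename below is made up for illustration):

    # Does a double-dotted filename still map to a javascript content type?
    python3 -c 'import mimetypes; print(mimetypes.guess_type("main.5a7bc9e2.js"))'
    python3 -c 'import mimetypes; print(mimetypes.guess_type("main.css"))'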
mordred | corvus: ah - yeah - I think that looks solid | 21:45 |
mordred | clarkb, corvus: related: https://review.opendev.org/#/c/723022/ | 21:47 |
mordred | that will make sure service-zuul uses the zuul prepared copy of project-config | 21:47 |
clarkb | corvus: in https://review.opendev.org/#/c/723718/1 I think we want puppet else and zuul | 21:48 |
mordred | (which is a thing we added to other jobs after the initial zuul patch was written) | 21:48 |
mordred | clarkb: why puppet else/ | 21:48 |
clarkb | mordred: nodepool for now | 21:48 |
clarkb | I think it may be the last thing though | 21:48 |
corvus | clarkb: i'm not following; we only ran puppet-else on changes to zuul/main.yaml | 21:49 |
*** DSpider has quit IRC | 21:49 | |
mordred | yeah - I think it's ok to wait for service-nodepool before triggering nodepool config changes on p-c changes | 21:49 |
ianw | corvus/mordred: thanks, i didn't consider enqueue vs runtime | 21:49 |
clarkb | oh I see ya ok | 21:49 |
clarkb | I think the original code should've maybe been run more aggressively but if we weren't already then its fine | 21:50 |
corvus | clarkb: are you suggesting we should run puppet-else on changes to nodepool/.* ? | 21:50 |
corvus | clarkb: it looks like service-nodepool runs puppet on the old puppet servers | 21:51 |
corvus | so i don't think we need puppet-else | 21:51 |
mordred | oh good point | 21:51 |
clarkb | oh I didn't realize that had gotten split out already | 21:52 |
corvus | jrichard, redrobot: you should be good to go now; you'll probably need to recheck those changes | 21:58 |
openstackgerrit | Merged openstack/project-config master: Run the zuul service playbook on tenant changes https://review.opendev.org/723718 | 22:07 |
clarkb | I was mistaken about cherrypy not sending content-type. It seems that firefox forgets that info if working with a cached file | 22:07 |
clarkb | but forcing cache bypass shows that it does send the content-type | 22:07 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Improve zuul-web apache config https://review.opendev.org/723711 | 22:08 |
clarkb | infra-root ^ I think that may make zuul a bit more responsive for users | 22:08 |
clarkb | I need to pop out for a bike ride now. Back in a bit | 22:08 |
openstackgerrit | Merged opendev/system-config master: Clean up some job descriptions https://review.opendev.org/723717 | 22:32 |
*** jrichard has quit IRC | 22:35 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Test zuul-executor on focal https://review.opendev.org/723528 | 22:46 |
redrobot | corvus, awesome, thanks for the help! | 23:01 |
*** tosky has quit IRC | 23:02 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Increase timeout on system-config-run-zuul https://review.opendev.org/723756 | 23:41 |
clarkb | my apache2 vhost change hit a timeout on that job so I'm bumping it | 23:41 |
clarkb | looking at logs it seems to have been compiling openafs when it triggered the timeout | 23:41 |