*** ryohayakawa has joined #opendev | 00:02 | |
clarkb | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_261/738714/2/check/system-config-run-gitea/261d1a6/gitea99.opendev.org/logs/access.log we have ports there | 00:06 |
clarkb | fungi: corvus ianw ^ https://review.opendev.org/#/c/738714/ could use rereview, though I need to call it a day then can restart things with that tomorrow | 00:06 |
ianw | i can restart if we like; but overall i'm not sure | 00:15 |
ianw | the UAs look a lot like those listed in https://www.informationweek.com/pdf_whitepapers/approved/1370027144_VRSNDDoSMalware.pdf, a 2013 article about a 2011 ddos tool, russkill | 00:16 |
ianw | however, https://amionrails.wordpress.com/2020/02/27/list-of-user-agent-used-in-ddos-attack-to-website/ is another one that links to | 00:19 |
ianw | https://github.com/mythsman/weiboCrawler/blob/master/opener.py | 00:19 |
ianw | that has the very specific | 00:20 |
ianw | ua's we see -- compare and contrast to http://paste.openstack.org/show/795414/ | 00:21 |
ianw | like "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)" ... that's no co-incidence | 00:22 |
ianw | https://github.com/mythsman/weiboCrawler is also suspicious | 00:23 |
ianw | "Opener.py independently encapsulates some anti-reptile header information." | 00:33 |
*** rchurch has quit IRC | 00:34 | |
ianw | kevinz: ^ i hope i'm not being rude asking but maybe you could translate more of what the intent is? | 00:35 |
*** Dmitrii-Sh has quit IRC | 00:38 | |
*** Dmitrii-Sh has joined #opendev | 00:43 | |
*** diablo_rojo has quit IRC | 00:44 | |
*** DSpider has quit IRC | 00:57 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add gitea proxy option https://review.opendev.org/738721 | 01:48 |
corvus | ianw: an alternate google translate of that phrase is "some anti-anti crawlers" | 02:16 |
ianw | corvus: yeah, it's pretty much a smoking gun for what's hitting us ... whether it is malicious or a university project gone wrong is probably debatable ... the result is the same anyway | 02:17 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add gitea proxy option https://review.opendev.org/738721 | 02:20 |
corvus | ianw: the code is clearly intended to avoid detection as a crawler. it looks like it's designed to crawl weibo without weibo detecting that it's a crawler | 02:27 |
corvus | perhaps repurposed | 02:29 |
*** sgw1 has quit IRC | 02:49 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] crawler ua reject https://review.opendev.org/738725 | 02:54 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] add gitea proxy option https://review.opendev.org/738721 | 03:23 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] crawler ua reject https://review.opendev.org/738725 | 03:23 |
*** sgw1 has joined #opendev | 03:27 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] crawler ua reject https://review.opendev.org/738725 | 03:51 |
openstackgerrit | yatin proposed openstack/diskimage-builder master: Revert "Make ipa centos8 job non-voting" https://review.opendev.org/738728 | 03:57 |
*** ykarel|away is now known as ykarel | 04:24 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: [wip] crawler ua reject https://review.opendev.org/738725 | 04:28 |
*** iurygregory has quit IRC | 04:31 | |
*** sgw1 has quit IRC | 04:40 | |
*** mugsie has quit IRC | 04:53 | |
*** mugsie has joined #opendev | 04:57 | |
*** ysandeep|away is now known as ysandeep | 05:09 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea: Add reverse proxy option https://review.opendev.org/738721 | 05:36 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea: crawler UA reject rules https://review.opendev.org/738725 | 05:36 |
openstackgerrit | Federico Ressi proposed openstack/project-config master: Create a new repository for Tobiko DevStack plugin https://review.opendev.org/738378 | 05:48 |
*** factor has quit IRC | 06:03 | |
*** factor has joined #opendev | 06:03 | |
*** icarusfactor has joined #opendev | 06:05 | |
*** factor has quit IRC | 06:06 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea: crawler UA reject rules https://review.opendev.org/738725 | 06:16 |
*** icarusfactor has quit IRC | 06:22 | |
*** bhagyashris is now known as bhagyashris|brb | 06:59 | |
*** hashar has joined #opendev | 07:09 | |
*** iurygregory has joined #opendev | 07:09 | |
*** sorin-mihai_ has joined #opendev | 07:15 | |
*** bhagyashris|brb is now known as bhagyashris | 07:15 | |
*** sorin-mihai_ has quit IRC | 07:16 | |
*** sorin-mihai_ has joined #opendev | 07:16 | |
*** sorin-mihai has quit IRC | 07:17 | |
*** sorin-mihai_ has quit IRC | 07:19 | |
*** sorin-mihai_ has joined #opendev | 07:19 | |
*** sorin-mihai_ has quit IRC | 07:21 | |
*** sorin-mihai_ has joined #opendev | 07:21 | |
*** sorin-mihai__ has joined #opendev | 07:23 | |
*** sorin-mihai_ has quit IRC | 07:26 | |
*** dtantsur|afk is now known as dtantsur | 07:35 | |
*** tosky has joined #opendev | 07:40 | |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:01 | |
*** DSpider has joined #opendev | 08:12 | |
*** hashar has quit IRC | 08:17 | |
*** hashar has joined #opendev | 08:18 | |
openstackgerrit | Daniel Bengtsson proposed openstack/diskimage-builder master: Update the tox minversion parameter. https://review.opendev.org/738754 | 08:19 |
*** ysandeep is now known as ysandeep|lunch | 08:21 | |
*** ysandeep|lunch is now known as ysandeep | 09:21 | |
*** hashar has quit IRC | 09:25 | |
*** ryohayakawa has quit IRC | 09:31 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 09:34 |
*** hashar has joined #opendev | 09:42 | |
*** hashar is now known as hasharAway | 09:47 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 09:53 |
*** hasharAway has quit IRC | 09:57 | |
*** hashar has joined #opendev | 09:58 | |
*** tkajinam has quit IRC | 09:59 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 10:13 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 10:26 |
*** ysandeep is now known as ysandeep|brb | 10:27 | |
*** dtantsur is now known as dtantsur|brb | 10:27 | |
openstackgerrit | Slawek Kaplonski proposed openstack/project-config master: Update Neutron Grafana dashboard https://review.opendev.org/738784 | 10:28 |
*** ysandeep|brb is now known as ysandeep | 10:37 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 11:00 |
*** priteau has joined #opendev | 11:02 | |
*** kevinz has quit IRC | 11:11 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 11:12 |
sshnaidm|afk | just fyi, I saw a failure of the cleanup playbook in the console today. It didn't affect the job results or anything else, but in case you weren't aware: http://paste.openstack.org/show/795427/ | 11:19 |
*** sshnaidm|afk is now known as sshnaidm|ruck | 11:19 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Allow deleting workspace after running terraform destroy https://review.opendev.org/738771 | 11:25 |
*** owalsh_ has joined #opendev | 11:55 | |
*** owalsh has quit IRC | 11:58 | |
*** ysandeep is now known as ysandeep|afk | 12:45 | |
*** dtantsur|brb is now known as dtantsur | 12:45 | |
*** sgw1 has joined #opendev | 12:55 | |
clarkb | sshnaidm|ruck: the way we are using the cleanup playbook is as a last effort to produce debug data at the end of jobs. The unreachable there is expected in the case of job failures that would otherwise break the job. In this case the job looks mostly happy at the end though. Is it possible that is the centos8 issue showing up later than usual? Maybe that points to the ssh keys being modified on the host by | 12:57 |
clarkb | the job somehow? | 12:57 |
clarkb | sshnaidm|ruck: long story short we expect it to fail, but really only when the job node was left that unhappy by the job | 12:57 |
sshnaidm|ruck | clarkb, ok, that's fine then | 13:05 |
sshnaidm|ruck | clarkb, I'm not sure any centos issue exists btw | 13:05 |
sshnaidm|ruck | clarkb, we tried the same image and job on third party CI and never got retry_limit | 13:06 |
sshnaidm|ruck | clarkb, even today there are far fewer retry_limits than in the previous 2 days, and I think I saw multiple attempts in non-tripleo jobs as well | 13:06 |
clarkb | yes, I've offered now like 5 times to attempt to hold nodes on our CI system or boot test nodes in our clouds, but no one will take me up on it | 13:07 |
clarkb | the issue is clearly centos8 related in that it happens there | 13:07 |
clarkb | it may require specific "hardware" or timing races to be triggered though | 13:07 |
sshnaidm|ruck | clarkb, if it was a centos issue, I | 13:07 |
openstackgerrit | Lance Bragstad proposed openstack/project-config master: Create a new project for ansible-tripleo-ipa-server https://review.opendev.org/738842 | 13:07 |
sshnaidm|ruck | 'd expect it to be more consistent | 13:07 |
sshnaidm|ruck | 30% retry_limits yesterday and only a few today | 13:08 |
clarkb | sshnaidm|ruck: is the job load different though? | 13:08 |
clarkb | tripleo's queue looks very small right now | 13:08 |
clarkb | (we can be our own noisy neighbor, etc) | 13:09 |
sshnaidm|ruck | clarkb, during these 2 days it was usual, maybe yesterday more because of rechecks | 13:09 |
sshnaidm|ruck | clarkb, usually it's growing to US afternoon | 13:10 |
clarkb | also I think its still happening if you look at the queue there are many retry attempts | 13:10 |
clarkb | you're just getting lucky in that its not failing 3 in a row consistently | 13:10 |
clarkb | we should be careful about treating this as fixed if we're relying on it passing on attempt 2 or 3 | 13:12 |
clarkb | sshnaidm|ruck: ^ you may also want to double check that your third party CI checks weren't doing the same thing | 13:13 |
sshnaidm|ruck | clarkb, I set attempts:1 in both patches, on upstream and third party | 13:13 |
clarkb | ok, I'm just calling it out because I see the current jobs with a bunch of retries | 13:14 |
sshnaidm|ruck | clarkb, never got retry limit on 3party though, and today is much better in upstream too: https://review.opendev.org/#/c/738557/ | 13:14 |
sshnaidm|ruck | oh, got one | 13:14 |
fungi | also when i analyzed the per-provider breakdown, it was disproportionately more likely in some than others (not proportional to our quotas in each) suggesting that the timing/load/noisy-neighbor influence could be greater in some places than others | 13:15 |
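The per-provider breakdown fungi describes can be approximated with something like the sketch below; the failure tuples and quota numbers here are placeholders, not the real data:

```python
"""Sketch: count unreachable/retry failures per provider and normalize by
quota so disproportionately affected providers stand out.  All values
below are placeholders."""
from collections import Counter

failures = [  # (build, provider) pairs as extracted from logs/elasticsearch
    ("build-a", "rax-iad"),
    ("build-b", "inap-mtl01"),
    ("build-c", "rax-iad"),
]
quota = {"rax-iad": 100, "inap-mtl01": 50, "ovh-bhs1": 80}  # placeholder max-servers

counts = Counter(provider for _, provider in failures)
for provider, max_servers in sorted(quota.items()):
    print(f"{provider}: {counts[provider]} failures, "
          f"{counts[provider] / max_servers:.3f} per quota slot")
```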
sshnaidm|ruck | fungi, the problem is that I don't have logs, usually it's only a build id | 13:18 |
clarkb | sshnaidm|ruck: yes, this is why I've offered to set holds or boot test nodes | 13:18 |
clarkb | the logs are on the zuul test node and zuul can't talk to them to get the logs | 13:18 |
sshnaidm|ruck | clarkb, let's do it, I'm only for it | 13:18 |
clarkb | which means its difficult for zuul to do anything useful when we get in this situation. But if we set a hold maybe we can reboot the test node and log in or boot it on a rescue host or something | 13:19 |
clarkb | sshnaidm|ruck: is there a specific job + change combo we should set a hold that is likely to have this problem? | 13:19 |
sshnaidm|ruck | clarkb, all jobs here for example: https://review.opendev.org/#/c/738557/ | 13:20 |
clarkb | thats the big problem I have with setting a hold. I don't know what to set it on | 13:20 |
sshnaidm|ruck | all of them with attempts:1 | 13:20 |
clarkb | sshnaidm|ruck: do the -standalone jobs meet that criteria? Those would be good simply because they are single node which keeps the cost of holds down | 13:21 |
sshnaidm|ruck | clarkb, I think it's fine, the whole list is: http://paste.openstack.org/show/795437/ | 13:22 |
*** jhesketh has quit IRC | 13:22 | |
sshnaidm|ruck | clarkb, you can just remove "multinode" from there | 13:22 |
*** jhesketh has joined #opendev | 13:24 | |
clarkb | sshnaidm|ruck: I've set it on tripleo-ci-centos-8-standalone tripleo-ci-centos-8-scenario001-standalone and tripleo-ci-centos-8-scenario002-standalone. The first two are running in check so maybe we'll catch one shortly | 13:24 |
clarkb | I'll do 003 and 010 | 13:24 |
sshnaidm|ruck | clarkb, great | 13:25 |
clarkb | thats a good spread based on the jobs that are running I think | 13:25 |
sshnaidm|ruck | clarkb, can we pull any statistics which jobs have more attempts? | 13:25 |
sshnaidm|ruck | looking at this patch, seems like not only centos have more attempts: https://zuul.opendev.org/t/openstack/status/change/737983,2 | 13:26 |
clarkb | sshnaidm|ruck: sort of. I think we may only really record that if/when an attempt succeeds because we're logging that in elasticsearch. There is/was work to put it in the zuul db which may get it in all cases but I'm not sure if that has landed | 13:26 |
sshnaidm|ruck | "puppet-openstack-lint-ubuntu-bionic (2. attempt)" | 13:26 |
clarkb | ya retries aren't abnormal | 13:26 |
fungi | reattempts can be for a number of reasons | 13:26 |
fungi | failed to download a package during a pre phase playbook? retry | 13:27 |
fungi | in this case we specifically care about retries from the node becoming unreachable | 13:27 |
sshnaidm|ruck | I see, though I still suspect it's not centos, I couldn't reproduce any problem with the same image on a different ci | 13:29 |
clarkb | I don't think any of the jobs in this pass will hit it. Looking at console logs they seem to be further along than the toci quickstart | 13:30 |
clarkb | we'll just have to keep trying until one trips over the hold | 13:30 |
*** ysandeep|afk is now known as ysandeep | 13:31 | |
sshnaidm|ruck | clarkb, ok, so I will leave standalone, 001-003, 010 in this patch and will keep rechecking? | 13:32 |
sshnaidm|ruck | clarkb, please also add tripleo-ci-centos-8-scenario010-ovn-provider-standalone, seems like it may actually have a problem with the network: http://paste.openstack.org/show/795439/ | 13:33 |
clarkb | sshnaidm|ruck: yup, then let us know if one of those hits the issue and we'll see if we can reboot the host and get you ssh'd in | 13:33 |
sshnaidm|ruck | clarkb, great | 13:33 |
clarkb | sshnaidm|ruck: if that reboot doesn't fix things we can try to boot from snapshot or use a rescue instance too | 13:33 |
clarkb | added that hold | 13:34 |
fungi | in the past we've also seen job nodes which go unreachable suddenly become reachable again soon after the job ends | 13:34 |
sshnaidm|ruck | clarkb, now the different problem, what can be an issue with post_failure in finger://ze05.openstack.org/7be6bcfbd9274abba5caf69c65a3d519 ? No logs from job, it just builds containers, no tripleo there | 13:34 |
clarkb | sshnaidm|ruck: thats the same failure mode. If the job fails during the run step then it doesn't upload logs and you get the finger url | 13:35 |
clarkb | does the tripleo quickstart script build images too? | 13:35 |
sshnaidm|ruck | clarkb, no | 13:35 |
clarkb | perhaps this is a networking issue with docker/podman? | 13:35 |
sshnaidm|ruck | clarkb, completely only build containers job | 13:36 |
clarkb | and doing container things trips it | 13:36 |
sshnaidm|ruck | clarkb, can not be | 13:36 |
sshnaidm|ruck | like completely nothing touches network there | 13:36 |
clarkb | containers do though | 13:36 |
clarkb | including container image builds | 13:36 |
fungi | 2020-07-01 12:53:36,912 DEBUG zuul.AnsibleJob: [e: 7f319ac2c4284af3b8d5381a995ee25d] [build: 7be6bcfbd9274abba5caf69c65a3d519] Ansible complete, result RESULT_UNREACHABLE code None | 13:36 |
clarkb | because containers, if they are namespacing the network, need networking | 13:36 |
sshnaidm|ruck | a-ha, so node disappeared | 13:37 |
sshnaidm|ruck | clarkb, it doesn't run containers, just build them | 13:37 |
clarkb | iirc building a container image happens in a container | 13:37 |
fungi | https://review.opendev.org/738668 should help get more public info for those cases | 13:37 |
clarkb | I believe that is true for buildah and docker | 13:38 |
sshnaidm|ruck | clarkb, sorry, but it is not realistic that buildah does something to the node's network | 13:38 |
clarkb | why? | 13:38 |
clarkb | the container for the image build needs network access | 13:38 |
clarkb | it has to do something | 13:38 |
clarkb | I'm not sure what exactly that is, but it is doing something if I understand container image builds properly | 13:39 |
clarkb | basically I'm trying to not rule anything out | 13:39 |
corvus | clarkb, sshnaidm|ruck: the patch to store retried builds in the db has landed; but it's not fully deployed in opendev's zuul | 13:40 |
clarkb | let's see if we can recover system logs and go from there | 13:40 |
clarkb | but ruling things out before we have any logs is not very helpful | 13:40 |
clarkb | fungi: I've rechecked https://review.opendev.org/#/c/738710/1 and plan to approve the gitea side as the conference winds down | 13:42 |
*** bhagyashris is now known as bhagyashris|afk | 13:42 | |
fungi | i was considering pasting the executor loglines for build 7be6bcfbd9274abba5caf69c65a3d519 but it's 9302 lines and each is so long that i can only fit about 200 in a paste, so not wanting to make a series of 50 pastes out of that | 13:43 |
sshnaidm|ruck | clarkb, let's add also tripleo-build-containers-centos-8 tripleo-build-containers-centos-8-ussuri | 13:43 |
sshnaidm|ruck | clarkb, they're pretty short jobs | 13:44 |
sshnaidm|ruck | maybe we could catch it | 13:44 |
clarkb | sshnaidm|ruck: ok thats done. Lets get some of these jobs running again. Maybe push a new ps that disables the other jobs and then start iterating to see if we catch some? | 13:45 |
sshnaidm|ruck | clarkb, ack, retriggered now | 13:45 |
sshnaidm|ruck | clarkb, will try to figure out how to disable others, not so simple now.. | 13:46 |
sshnaidm|ruck | fungi, clarkb from this last post_failure: | 13:48 |
sshnaidm|ruck | 2020-07-01 11:51:13.001938 | TASK [upload-logs-swift : Upload logs to swift] | 13:48 |
sshnaidm|ruck | 2020-07-01 12:51:00.211178 | POST-RUN END RESULT_TIMED_OUT: [trusted : opendev.org/opendev/base-jobs/playbooks/base/post-logs.yaml@master] | 13:48 |
clarkb | sshnaidm|ruck: that means the job was uploading logs for about an hour and didn't finish in time | 13:49 |
clarkb | we have seen that happen due to sheer volume of log uploads in a job. | 13:49 |
clarkb | either a massive log file or many many files etc | 13:50 |
fungi | though we tended to see it a lot more before we started limiting the amount of log data a job could save | 13:50 |
clarkb | we do have a watchdog for that but it relies on checking after the fact iirc so isn't perfect | 13:50 |
fungi | yep | 13:50 |
clarkb | ya | 13:50 |
sshnaidm|ruck | this job has a few logs, unlike usual tripleo jobs, I believe it's network hiccup | 13:51 |
sshnaidm|ruck | Provider: vexxhost-ca-ymq-1 | 13:51 |
clarkb | sshnaidm|ruck: it could be, though I would expect ssh to complain in that case | 13:51 |
clarkb | because we rsync the logs from the test node to the executor then upload to swift | 13:52 |
clarkb | if it was a network issue on the test node the rsync should've failed. It's possible it's a bandwidth limitation between the executor and the swift endpoint I guess? | 13:52 |
sshnaidm|ruck | these jobs timed out to upload logs: http://paste.openstack.org/show/795442/ | 13:53 |
clarkb | different zuul executors | 13:54 |
sshnaidm|ruck | and clouds, and jobs.. | 13:54 |
clarkb | sshnaidm|ruck: did they both fail during upload-logs-swift? | 13:54 |
sshnaidm|ruck | clarkb, yes | 13:54 |
clarkb | ok, that step is swift upload from executor to one of 5 swift regions. 3 in rax and 2 in ovh | 13:54 |
sshnaidm|ruck | and this one as well http://paste.openstack.org/show/795443/ | 13:58 |
clarkb | the logs for successful runs of that job aren't particularly large. Not tiny either. Looks like ~20MB which is reasonable. I've also found successful builds with uploads to both ovh regions | 13:58 |
*** mlavalle has joined #opendev | 13:58 | |
sshnaidm|ruck | maybe the node also dies when uploading logs, but.. such timing seems weird | 13:58 |
fungi | it's more likely the count of individual log files being uploaded has a bigger impact on upload time than the aggregate byte count | 13:59 |
clarkb | sshnaidm|ruck: also the swift upload is not from the node to swift. Its executor to swift | 13:59 |
clarkb | fungi: the count is not bad either, though there are a few files | 13:59 |
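Since both aggregate size and file count influence the swift upload time, a quick way to check a collected log tree is a walk like this sketch (the path is a placeholder):

```python
"""Sketch: report file count and total bytes under a log directory."""
import os
import sys

def tally(root):
    files = total = 0
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
                files += 1
            except OSError:
                pass
    return files, total

if __name__ == "__main__":
    count, size = tally(sys.argv[1] if len(sys.argv) > 1 else "logs/")
    print(f"{count} files, {size / (1024 * 1024):.1f} MiB")
```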
sshnaidm|ruck | clarkb, well, then even dying node doesn't explain.. | 13:59 |
clarkb | sshnaidm|ruck: the process is rsync the logs from node to executor then swift upload from executor to swift | 13:59 |
clarkb | what could be happening is we spend a large amount of time doing the earlier steps like that rsync | 14:00 |
sshnaidm|ruck | clarkb, ack | 14:00 |
clarkb | then by the time we do the swift upload we fail because there isn't much time left | 14:00 |
fungi | yeah, we'd need to profile the start/end times of the individual tasks | 14:01 |
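One way to do that profiling, as a rough sketch: parse the timestamps on a job-output style console log (the format seen in the paste above) and print how long each task/phase line ran before the next one started:

```python
"""Sketch: profile task durations from a job-output.txt style console log."""
import re
import sys
from datetime import datetime

LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \| "
    r"(?P<name>(PRE-RUN|RUN|POST-RUN|TASK).*)")

def profile(path):
    entries = []
    with open(path, errors="replace") as f:
        for line in f:
            m = LINE_RE.match(line)
            if m:
                ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
                entries.append((ts, m.group("name").strip()))
    # duration of each entry = gap until the next marker line
    for (start, name), (end, _next_name) in zip(entries, entries[1:]):
        print(f"{(end - start).total_seconds():8.1f}s  {name}")

if __name__ == "__main__":
    profile(sys.argv[1])
```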
clarkb | ya though looking at the timestamps sshnaidm|ruck posted here its spending an hour doing the upload logs to swift step | 14:02 |
clarkb | full context for the post-run stage would be good though | 14:03 |
clarkb | sshnaidm|ruck: can you paste that somewhere? | 14:03 |
clarkb | but also my ability to debug things deteriorates as the number of simultaneous issues increases | 14:03 |
sshnaidm|ruck | clarkb, yeah, preparing.. | 14:03 |
clarkb | the build uuid is useful too because we can use that to find the logs on the executor to see if there is any hints there | 14:04 |
*** hashar has quit IRC | 14:06 | |
corvus | clarkb: you're saying there's a second issue related to log uploading? | 14:08 |
clarkb | sshnaidm|ruck: to stop running jobs you want to edit this file: https://review.opendev.org/#/c/738557/3/zuul.d/layout.yaml. Under templates remove everything but the container builds and standalone template. Under check remove everything I think. Set check to [] or remove the check key entirely | 14:08 |
clarkb | corvus: yes, or perhaps its the same issue with different symptoms (like maybe the network issues are on the executors) | 14:08 |
clarkb | I don't think that is the case because I was able to reproduce failure to connect to test nodes from home on monday | 14:08 |
clarkb | my hunch is it is two separate issues | 14:09 |
corvus | clarkb: would you like me to look into http://paste.openstack.org/show/795443/ ? (is that a good example)? | 14:09 |
clarkb | corvus: I haven't been able to look at those more closely since they don't have build uuids and sshnaidm|ruck's examples pasted directly into irc weren't directly attributed to anything, but those are what we've got until sshnaidm|ruck pastes more context somewhere | 14:10 |
clarkb | basiclly I think that is the breadcrumb we've got but I don't know how good an example they are | 14:10 |
corvus | it has an event id and a job, i can track it down | 14:10 |
fungi | i concur, the node we witnessed go unreachable was really unreachable from everywhere, not just from the executor (i couldn't reach it from home either) | 14:10 |
sshnaidm|ruck | clarkb, https://pastebin.com/Zn30H9sj https://pastebin.com/4nz8V1W7 | 14:10 |
corvus | sshnaidm|ruck: what are those pastes? | 14:11 |
sshnaidm|ruck | corvus, consoles where upload logs times out | 14:11 |
corvus | cool -- since we're tracking multiple issues, it's good to be explicit :) | 14:11 |
sshnaidm|ruck | clarkb, corvus one build id is 0e70e37c7f8948a581483123aaab98a2 | 14:12 |
sshnaidm|ruck | second is bf81f447b4924b56a613c3449a474d30 | 14:13 |
clarkb | thanks | 14:13 |
corvus | i'll track down the executor logs for d30 | 14:14 |
clarkb | 0e70e37c7f8948a581483123aaab98a2 seems to have been unreachable when the cleanup playbook ran | 14:15 |
sshnaidm|ruck | yeah, second one was more stable | 14:16 |
corvus | same is true for d30 -- it spent 1 hour trying to upload logs, timed out, then unreachable for cleanup; that sounds like a hung ssh connection during the 1 hour upload playbook, then no further connections? | 14:17 |
corvus | when the final playbook fails, it logs the json output from the playbook; i'm looking through that | 14:18 |
corvus | it only has the json output from the previous playbook, likely because it killed the process running the final playbook, so that's not much help | 14:22 |
sshnaidm|ruck | clarkb, tripleo-ci-centos-8-scenario003-standalone has retry limit | 14:22 |
sshnaidm|ruck | \o/ | 14:22 |
clarkb | sshnaidm|ruck: cool let me see what we can find | 14:22 |
corvus | clarkb: however, the second-to-last playbook removes the build ssh key -- why does the cleanup playbook work at all? | 14:23 |
corvus | clarkb: (are we relying on a persistent connection lasting through that?) | 14:23 |
clarkb | corvus: oh we may with the control persistent thing | 14:24 |
sshnaidm|ruck | OMG almost all of them start to fail | 14:24 |
clarkb | 23.253.159.202 | 14:24 |
clarkb | it actually pings | 14:24 |
clarkb | and I can ssh in | 14:25 |
corvus | clarkb: i think we should take any unreachable errors on the cleanup playbook with a grain of salt. especially if we sat there for an hour doing the swift upload, it's likely controlpersist will timeout and we won't re-establish. so i think that's a red herring. | 14:25 |
clarkb | which suddenly has me thinking: zk connections dropping maybe? | 14:25 |
clarkb | sshnaidm|ruck: do you have an ssh key somewhere that I can put on the host? You'd be better able to look at tripleo logs to check them for oddities | 14:25 |
corvus | there are no zk errors in the scheduler log | 14:25 |
sshnaidm|ruck | clarkb, https://github.com/sshnaidm.keys | 14:26 |
fungi | and the scheduler isn't running out of memory (or even close) | 14:26 |
clarkb | sshnaidm|ruck: your key has been added. Let's not change anything on the server just yet as we try and observe what may be happening | 14:27 |
corvus | clarkb: do you have a build id handy for that? | 14:27 |
clarkb | corvus: not yet that was going to be my next thing | 14:27 |
sshnaidm|ruck | clarkb, thanks | 14:27 |
clarkb | corvus: finger://ze05.openstack.org/603b84def27f47ee99585d58436cdc72 I think | 14:27 |
sshnaidm|ruck | clarkb, which user? | 14:27 |
fungi | sshnaidm|ruck: root | 14:28 |
sshnaidm|ruck | I'm in | 14:28 |
clarkb | now this is interesting | 14:28 |
clarkb | its an ipv6 host | 14:29 |
clarkb | but ifconfig doesn't show me the ipv6 addr? | 14:29 |
clarkb | ya I think its ipv6 stack is gone | 14:29 |
clarkb | and we'd be using that for the job I'm pretty sure | 14:29 |
clarkb | sshnaidm|ruck: ^ | 14:29 |
sshnaidm|ruck | clarkb, hmm.. not sure why | 14:30 |
clarkb | I'm double checking that zuul would use that IP now | 14:30 |
fungi | looks like the first task to end in unreachable was prepare-node : Assure src folder has safe permissions | 14:30 |
clarkb | 2001:4802:7803:104:be76:4eff:fe20:3ee4 is the ipv6 addr nodepool knows about | 14:31 |
sshnaidm|ruck | fungi, this task takes a lot of time from what I saw in jobs.. | 14:31 |
clarkb | we don't seem to record either ip in the executor log | 14:31 |
clarkb | so need to do more digging to see what ip was used | 14:31 |
corvus | fungi: for build 603b84def27f47ee99585d58436cdc72 ? that looks like it succeeded to me. | 14:31 |
fungi | corvus: oh, that was the last task of 11 for the play where one task was unreachable | 14:32 |
clarkb | corvus: 603b84def27f47ee99585d58436cdc72 hit unreachable | 14:32 |
corvus | clarkb: yes, i was saying i disagreed with fungi about what task was running at the time | 14:33 |
clarkb | ah | 14:33 |
clarkb | sorry, I read it as successful job; you meant successful task | 14:33 |
corvus | yep, i'll take my own advice and use more words | 14:33 |
corvus | we do have all of the command output logs available in /tmp on the host | 14:34 |
corvus | /tmp/console-* | 14:34 |
fungi | i'm still having trouble digesting ansible output. the way it aggregates results makes it hard to spot which task failed in that play | 14:34 |
clarkb | ooh should we copy those off so that a reboot doesn't clear tmp? | 14:34 |
corvus | yes | 14:34 |
corvus | note that the most recent one is console-bc764e04-a4bf-cd3a-2382-000000000016-primary.log and exited 0 | 14:35 |
clarkb | k I'll do that copy once I've confirmed the ip used was the ipv6 addr | 14:35 |
clarkb | the copy of the logs | 14:35 |
corvus | clarkb: save timestamps | 14:35 |
corvus | since that's about the only way to line them up with tasks | 14:35 |
corvus | oh they have timestamps in them :) | 14:36 |
corvus | so not critical, but still helpful :) | 14:36 |
clarkb | rgr | 14:36 |
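For reference, one way to stash the console logs somewhere a reboot won't clear while keeping their timestamps (shutil.copy2 preserves mtime); the destination path here is arbitrary, not what was actually used:

```python
"""Sketch: copy /tmp/console-* off to a safer location, preserving mtimes."""
import glob
import os
import shutil

DEST = os.path.expanduser("~/console-logs")  # arbitrary destination
os.makedirs(DEST, exist_ok=True)
for path in sorted(glob.glob("/tmp/console-*")):
    shutil.copy2(path, DEST)  # copy2 keeps mtime/atime
    print(f"copied {path}")
```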
fungi | this is a rax node, so would have gotten its ipv6 address via glean reading it from configdrive, right? | 14:37 |
clarkb | fungi: oh yup. Also I've just found other rax-iad jobs and they use the ipv4 addr as ansible_host | 14:38 |
fungi | weird | 14:38 |
clarkb | so ipv6 may just be a shiny nuisance | 14:38 |
clarkb | I think glean may not configure ipv6 on red hat distros | 14:38 |
clarkb | with static config drive config | 14:38 |
clarkb | we are swapping quite a bit on ze05 as well | 14:39 |
clarkb | copying logs now | 14:39 |
sshnaidm|ruck | clarkb, node 23.253.159.202 - from which jobs is it? | 14:42 |
sshnaidm|ruck | fungi, clarkb, do you have console logs for it just in case? | 14:42 |
clarkb | sshnaidm|ruck: tripleo-ci-centos-8-scenario003-standalone | 14:43 |
sshnaidm|ruck | ack | 14:43 |
*** diablo_rojo has joined #opendev | 14:43 | |
clarkb | sshnaidm|ruck: the console logs are all in /tmp/console-* on that host | 14:43 |
corvus | clarkb: i'm having trouble lining this up with the executor log | 14:43 |
clarkb | I've pulled them onto my desktop bceause /tmp there could be cleared out on a restart | 14:43 |
corvus | can we double check that the ip and uuid match | 14:43 |
clarkb | corvus: oh I can do that | 14:43 |
corvus | clarkb: my understanding is we're looking at 603b84def27f47ee99585d58436cdc72 on ze05, right? | 14:44 |
clarkb | | 0017542272 | rax-iad | centos-8 | e7bd207a-87d2-4394-877e-696b381c34f2 | 23.253.159.202 | 2001:4802:7803:104:be76:4eff:fe20:3ee4 | hold | 00:00:45:56 | unlocked | main | centos-8-rax-iad-0017542272 | 10.176.193.212 | None | 22 | nl01-10-PoolWorker.rax-iad-main | 300-0009721443 | openstack | 14:44 |
clarkb | opendev.org/openstack/tripleo-ci tripleo-ci-centos-8-scenario003-standalone refs/changes/57/738557/.* | tripleo with sshnaidm debug retry failures | | 14:44 |
clarkb | that is the nodepool hold | 14:44 |
clarkb | and that tripleo-ci-centos-8-scenario003-standalone job had the finger url I posted above. But I'm double checking again | 14:44 |
sshnaidm|ruck | wow, so many console logs in /tmp/ | 14:44 |
clarkb | 2020-07-01 13:49:50,678 INFO zuul.AnsibleJob: [e: e0622c34b7944cfcadddb92079a3e537] [build: 603b84def27f47ee99585d58436cdc72] Beginning job tripleo-ci-centos-8-scenario003-standalone for ref refs/changes/57/738557/3 (change https://review.opendev.org/738557) | 14:45 |
clarkb | the job and change are correct at least | 14:45 |
clarkb | still trying to be sure on the build. Not sure what key we use between nodepool and executor logs | 14:45 |
clarkb | ok thats weird | 14:46 |
corvus | should be able to correlate node id to build on scheduler | 14:46 |
corvus | 2020-07-01 11:27:56,980 INFO zuul.ExecutorClient: [e: 7f319ac2c4284af3b8d5381a995ee25d] Execute job tripleo-ci-centos-8-scenario003-standalone (uuid: 1f2db893d451453bacfb80face66ec92) on nodes <NodeSet single-centos-8-node [<Node 001754227 ('primary',):centos-8>]> for change <Change 0x7fab39611990 openstack/tripleo-ci 738557,2> with dependent changes [{'project': {'name': 'openstack/tripleo-ci', | 14:47 |
corvus | 'short_name': 'tripleo-ci', 'canonical_hostname': 'opendev.org', 'canonical_name': 'opendev.org/openstack/tripleo-ci', 'src_dir': 'src/opendev.org/openstack/tripleo-ci'}, 'branch': 'master', 'change': '738557', 'change_url': 'https://review.opendev.org/738557', 'patchset': '2'}] | 14:47 |
clarkb | the ansible facts for 603b84def27f47ee99585d58436cdc72 show the hostname is centos-8-inap-mtl01-0017550604 | 14:47 |
corvus | clarkb: the scheduler log i just pasted is for the node id you pasted, but it's a different uuid | 14:47 |
corvus | it's also several hours old | 14:47 |
clarkb | its like we held a rax node but the job ran on inap? | 14:48 |
fungi | 23.253.159.202 is centos-8-rax-iad-0017542272 if that's what everyone's still looking at | 14:48 |
sshnaidm|ruck | clarkb, do you have other nodes? | 14:49 |
fungi | huh, yeah that's extra weird | 14:49 |
corvus | clarkb: how did you arrive at the uuid 603b84def27f47ee99585d58436cdc72 ? | 14:49 |
*** mwhahaha has joined #opendev | 14:49 | |
clarkb | sshnaidm|ruck: yes all of the standalone jobs ended up being held | 14:49 |
clarkb | corvus: I went to the zuul web ui and retrieved the finger url for the job that triggered the hold | 14:50 |
clarkb | corvus: the jobs have been modified on that change to set attempts: 1 and so won't retry beyond the first failure | 14:50 |
corvus | clarkb: uuid 603b84def27f47ee99585d58436cdc72 ran on node 001754227 | 14:51 |
corvus | is that one held? | 14:51 |
weshay_ruck | sshnaidm|ruck, mwhahaha https://review.opendev.org/#/c/738557/ | 14:51 |
clarkb | corvus: yes | 0017542272 | rax-iad | centos-8 | e7bd207a-87d2-4394-877e-696b381c34f2 | 23.253.159.202 is the held node | 14:52 |
clarkb | oh yours is short a digit? is that a mispaste? | 14:52 |
corvus | checking | 14:52 |
clarkb | [build: 603b84def27f47ee99585d58436cdc72] Ansible output: b' "ansible_hostname": "centos-8-inap-mtl01-0017550604",' <- from the executor logs | 14:54 |
clarkb | I would expect the ansible_hostname fact to be centos-8-rax-iad-0017542272 | 14:54 |
corvus | yes -- i think that whole number was a mispaste; let's start over | 14:54 |
corvus | 603b84def27f47ee99585d58436cdc72 ran on 0017550604 | 14:55 |
corvus | 1f2db893d451453bacfb80face66ec92 ran on 0017542272 | 14:55 |
clarkb | ok that matches the ansible host value we see in the build so that's good. That doesn't explain why the finger url on the web ui seems to be wrong | 14:56 |
corvus | which one are we going to look at? | 14:56 |
clarkb | the hold is for 0017542272 so we want 1f2db893d451453bacfb80face66ec92 | 14:56 |
clarkb | oh you know what | 14:56 |
sshnaidm|ruck | clarkb, seems like 23.253.159.202 is the wrong node | 14:56 |
clarkb | I think I know what happened. sshnaidm|ruck abandoned the change and restored it to rerun jobs | 14:57 |
clarkb | the hold was already in place when that happened. Do we hold when jobs are cancelled for that state change? | 14:57 |
clarkb | 603b84def27f47ee99585d58436cdc72 is what we want a hold for, but not what we got a hold for | 14:57 |
clarkb | the other holds we got for the other jobs are much newer and all in inap | 14:58 |
clarkb | I'm thinking lets ignore standalone003 and look at the other jobs? | 14:58 |
clarkb | | 0017550599 | inap-mtl01 | centos-8 | d4bf117d-ef6d-427e-9081-fb37cd069594 | 198.72.124.39 | | hold | 00:00:26:07 | unlocked | main | centos-8-inap-mtl01-0017550599 | 198.72.124.39 | nova | 22 | nl03-7-PoolWorker.inap-mtl01-main | 300-0009723644 | openstack | 14:58 |
clarkb | opendev.org/openstack/tripleo-ci tripleo-ci-centos-8-standalone refs/changes/57/738557/.* | tripleo with sshnaidm debug retry failures | | 14:58 |
clarkb | That hold looks like a proper hold | 14:58 |
corvus | clarkb: the autohold info may be the most concise way to correlate them | 14:59 |
sshnaidm|ruck | clarkb, 198.72.124.39 ? | 14:59 |
clarkb | and I think finger://ze08.openstack.org/b5b5835abca34522933ee496c49514a6 is related to 0017550599 but I'm double checking that now before I do anything else | 14:59 |
clarkb | sshnaidm|ruck: yes, but let me double check my values :) | 14:59 |
sshnaidm|ruck | clarkb, sure | 14:59 |
corvus | clarkb: what autohold number is that? | 15:00 |
clarkb | [build: b5b5835abca34522933ee496c49514a6] Ansible output: b' "ansible_hostname": "centos-8-inap-mtl01-0017550599",' | 15:00 |
corvus | autohold-info 0000000160 Held Nodes: [{'build': 'b5b5835abca34522933ee496c49514a6', 'nodes': ['0017550599']}] | 15:00 |
corvus | clarkb: is that correct ^ ? | 15:01 |
clarkb | corvus: 0000000160 yup | 15:01 |
clarkb | that server does not ping | 15:01 |
clarkb | which is more like what we expected | 15:01 |
corvus | build b5b5835abca34522933ee496c49514a6 ran on ze08 | 15:02 |
corvus | it failed 24 minutes into 2020-07-01 13:57:59,664 DEBUG zuul.AnsibleJob.output: [e: e0622c34b7944cfcadddb92079a3e537] [build: b5b5835abca34522933ee496c49514a6] Ansible output: b'TASK [run-test : run toci_gate_test.sh executable=/bin/bash, _raw_params=set -e' | 15:03 |
clarkb | server show on the server instance shows no errors from nova | 15:03 |
clarkb | I can try a reboot and see if it comes back | 15:03 |
fungi | TASK: oooci-build-images : Run build-images.sh | 15:03 |
clarkb | but I'll hold off on that to ensure we've done all the debugging we can | 15:04 |
clarkb | mgagne: ^ if you're around you may be able to help as well as this is an inap instance | 15:04 |
fungi | oh, again, i'm looking at the last task in the play which had the first unreachable result... how do you identify the task which actually was unreachable? | 15:05 |
corvus | fungi: i don't see that line at all. are you looking at build b5b5835abca34522933ee496c49514a6 on ze08? | 15:05 |
fungi | oh, i know what i'm doing wrong | 15:06 |
fungi | i resolved build b5b5835abca34522933ee496c49514a6 to event e0622c34b7944cfcadddb92079a3e537 so i'm looking at the entire buildset | 15:06 |
fungi | now that i'm not looking at interleaved logs from multiple builds, yes this is easier to follow | 15:08 |
fungi | i agree it was run-test : run toci_gate_test.sh during which the node became unreachable | 15:08 |
clarkb | the neutron port for that IP address shows it is active and I don't see any errors | 15:08 |
clarkb | I'm now trying to confirm that that port is actually attached to the instance | 15:09 |
clarkb | the port device id matches our instance id | 15:09 |
clarkb | now I'm going to double check security groups | 15:10 |
clarkb | the security group is the expected default group with all ingress and egress allowed as expected | 15:11 |
fungi | unfortunately the nova console log buffer has been overrun by network traffic logging | 15:11 |
clarkb | fungi: the time range is about 45 minutes though | 15:12 |
clarkb | we can find when the instance booted to see if any crashes should show up in that | 15:12 |
clarkb | | created | 2020-07-01T13:48:50Z | 15:13 |
fungi | yep, current console log covers 50 minutes of time and ends at 4827 seconds | 15:14 |
clarkb | 4827seconds + 2020-07-01T13:48:50Z is ~= 10 minutes ago? | 15:14 |
clarkb | maybe a bit more recent? | 15:14 |
fungi | so basically an hour and twenty minutes after 13:48:50 | 15:14 |
fungi | ~15:08z yep | 15:15 |
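Spelling out that arithmetic: the instance creation time plus the console-log uptime puts the last console output at roughly 15:09 UTC:

```python
from datetime import datetime, timedelta

created = datetime.strptime("2020-07-01T13:48:50Z", "%Y-%m-%dT%H:%M:%SZ")
print(created + timedelta(seconds=4827))  # -> 2020-07-01 15:09:17
```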
clarkb | 2020-07-01 14:21:08,075 is when failure occurred | 15:15 |
fungi | so maybe 7 minutes ago | 15:15 |
*** ysandeep is now known as ysandeep|away | 15:15 | |
clarkb | its possible a crash occurred earlier and ansible didn't know about it I guess | 15:15 |
fungi | it's continuing to log | 15:15 |
clarkb | I'm running out of things to consider from the api side of things | 15:15 |
clarkb | anything else people want to try before we ask for a reboot? | 15:16 |
fungi | so yes the instance is still running and outputting network traffic loggnig on its console | 15:16 |
fungi | looks like it's logging its own dhcp requests | 15:16 |
fungi | oh, actually these look like other machines's dhcp requests maybe | 15:17 |
fungi | i suppose they could be virtual machines running on this instance | 15:17 |
corvus | fungi: any identifying info like mac? | 15:17 |
clarkb | fa:16:3e:fa:da:5e is the port mac according to the openstack api | 15:18 |
fungi | MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:e9:0c:f4:08:00 | 15:18 |
fungi | MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:16:ce:62:08:00 | 15:18 |
fungi | i keep seeing those same two logged on ens3 | 15:18 |
*** _mlavalle_1 has joined #opendev | 15:19 | |
fungi | udp datagrams from 0.0.0.0:68 to 255.255.255.255:67 so definitely dhcp discovery | 15:19 |
corvus | oh you know what, i'm silly -- that's not going to tell us anything, because of course the internal vms are openstack, and the real vm is also openstack, so they're all going to be fa:16:3e :) | 15:19 |
clarkb | centos-8-1593208195 is MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:16:ce:62:08:00 | 15:20 |
clarkb | corvus: ya | 15:20 |
* clarkb figures out the other one | 15:20 | |
fungi | yup. also if they're other machines and not this one but somehow still showing up on our port (because they're broadcast) they'll also likely be openstack | 15:20 |
clarkb | er thats a mispaste image name | 15:21 |
clarkb | centos-8-inap-mtl01-0017550603 is MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:16:ce:62:08:00 | 15:21 |
clarkb | centos-8-inap-mtl01-0017550600 is MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:e9:0c:f4:08:00 | 15:21 |
clarkb | those also happen to be two held nodes for other jobs that failed | 15:22 |
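The MAC= field in those netfilter log lines is destination MAC (6 bytes), source MAC (6 bytes), then the ethertype, which is how the source MACs above were pulled out and matched to the neighbouring nodes; a small sketch:

```python
def split_nflog_mac(mac_field):
    """Split the MAC= field from a kernel netfilter log line."""
    parts = mac_field.split(":")
    return (":".join(parts[:6]),    # destination (ff:ff:... = broadcast)
            ":".join(parts[6:12]),  # source
            ":".join(parts[12:]))   # ethertype (08:00 = IPv4)

print(split_nflog_mac("ff:ff:ff:ff:ff:ff:fa:16:3e:e9:0c:f4:08:00"))
# -> ('ff:ff:ff:ff:ff:ff', 'fa:16:3e:e9:0c:f4', '08:00')
```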
*** mlavalle has quit IRC | 15:22 | |
corvus | so there's a dhcp process running on our failed node which is continuning to receive and log requests from other inap vms nearby? | 15:22 |
clarkb | corvus: I think its just the iptables logging from the kernel that is receiving them | 15:23 |
clarkb | what that implies to me is l2 is fine | 15:23 |
clarkb | its l3 that is failing | 15:23 |
corvus | ack | 15:23 |
clarkb | and those other two hosts also don't ping | 15:23 |
clarkb | so ya they all seem to have working links and ethernet is functioning. IP is not functioning | 15:24 |
clarkb | I'm ready to try a reboot if no one objects | 15:24 |
corvus | ++ i think we learned a lot. reboot ++ | 15:24 |
fungi | no objection here | 15:24 |
fungi | we also have a couple more candidates to compare, sounds like | 15:24 |
fungi | it's also possible these are reachable from other nodes in the same network but not from outside (lost their default route?) | 15:25 |
fungi | we can check that with one of the others | 15:25 |
clarkb | it pings now | 15:26 |
clarkb | and I'm in | 15:26 |
clarkb | ssh root@198.72.124.39 for the others (I'll get sshnaidm|ruck's key shortly) | 15:26 |
clarkb | the console logs are still in /tmp so reboot didn't blast that away | 15:26 |
sshnaidm|ruck | clarkb, did you restart it? | 15:26 |
clarkb | sshnaidm|ruck: yes | 15:26 |
sshnaidm|ruck | ack | 15:26 |
fungi | rebooted via nova api | 15:26 |
clarkb | sshnaidm|ruck: your key should be working now | 15:27 |
sshnaidm|ruck | I'm in | 15:27 |
sshnaidm|ruck | clarkb, which job is it? | 15:27 |
clarkb | Jul 1 14:16:56 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <warn> [1593613016.1172] dhcp4 (ens3): request timed out | 15:28 |
clarkb | sshnaidm|ruck: tripleo-ci-centos-8-standalone | 15:28 |
fungi | i wonder, could something be putting a firewall rule in place on the public interface which blocks dhcp requests? | 15:29 |
clarkb | fungi: I was just going to mention that earlier in syslog there is a bunch of iptables updates :) | 15:29 |
clarkb | I don't know that it is breaking dhcp yet but that definitely seems to be an order of ops | 15:30 |
fungi | that would explain the random behavior... depends on if it needs to renew a lease during that timeframe | 15:30 |
clarkb | ansible-iptables does a bunch of work then soon after dhcp stops | 15:30 |
clarkb | fungi: ya | 15:30 |
fungi | i wonder if i can find the dhcp requests getting blocked in the console log | 15:31 |
fungi | checking | 15:31 |
fungi | fa:16:3e:fa:da:5e is the mac for ens3 | 15:31 |
clarkb | sshnaidm|ruck: ^ maybe you can look into that? I'm not sure what tripleo's iptables rulesets are intended to do. Also you could cross check against successful jobs to see if they fail to renew a lease (usually you renew at 1/2 the lease time so you'd possibly still fail to renew but have a working IP until a bit later) | 15:31 |
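A tiny illustration of that renewal-timing point, with placeholder times: a client normally renews at half the lease time, so an address survives for a while after renewals start failing, which would make any breakage look random:

```python
from datetime import datetime, timedelta

lease_acquired = datetime(2020, 7, 1, 13, 50)  # placeholder
lease_time = timedelta(hours=2)                # placeholder

renew_at = lease_acquired + lease_time / 2     # first renewal attempt
expires_at = lease_acquired + lease_time       # address dropped if never renewed
print(f"renew from {renew_at}, address lost at {expires_at}")
```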
sshnaidm|ruck | mwhahaha, weshay_ruck ^ | 15:31 |
sshnaidm|ruck | brb | 15:32 |
*** sshnaidm|ruck is now known as sshnaidm|afk | 15:32 | |
*** factor has joined #opendev | 15:32 | |
clarkb | that host has rebooted with no iptables rules ? | 15:32 |
clarkb | which is I guess good for debugging but surprising since we write out a ruleset I thought | 15:32 |
fungi | indeed, iptables -L and ip6tables -L are both entirely empty | 15:33 |
clarkb | maybe that doesn't work on centos8 | 15:34 |
clarkb | something to investigate | 15:34 |
fungi | i'm not finding any blocked dhcp requests logged by iptables, but i have a feeling the iptables logging is not comprehensive (probably does not include egress, for example) | 15:37 |
clarkb | I'm also having a hard time finding the lease details for our current lease | 15:37 |
clarkb | oh neat I think beacuse ens3 config is static now | 15:38 |
clarkb | the network type in config drive is ipv4 not ipv4_dhcp or whatever the value is | 15:40 |
clarkb | so I think static config is actually what we expect | 15:40 |
mwhahaha | so the mtu is bad i believe | 15:40 |
clarkb | is something kicking it over to dhcp which is then failing because there is no dhcp server? | 15:40 |
mwhahaha | it shouldn't be 1500 | 15:40 |
mwhahaha | ? | 15:40 |
clarkb | mwhahaha: why not? | 15:40 |
mwhahaha | most clouds it's not 1500 | 15:40 |
mwhahaha | because of tenant networking | 15:40 |
clarkb | most of ours are | 15:40 |
mwhahaha | you sure? | 15:40 |
clarkb | I think openedge is our only cloud with a smaller mtu | 15:40 |
clarkb | mwhahaha: ya they do jumbo frames to carry tenant networks allowing tenant traffic to have a proper 1500 mtu | 15:41 |
mwhahaha | k | 15:41 |
fungi | Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593612925.5251] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds) | 15:41 |
fungi | that's almost an hour after the node booted | 15:41 |
fungi | could the job be reconfiguring ens3 to dhcp? | 15:41 |
clarkb | fungi: cool I think thats the smoking gun. This should be statically configured based on the config drive stuff | 15:41 |
mwhahaha | centos did have a bug with an ens3 existing always btw | 15:41 |
* mwhahaha digs up bug | 15:41 | |
mwhahaha | the cloud image that is | 15:42 |
clarkb | it switches to dhcp, fails to dhcp because neutron isn't doing dhcp there and then network manager helpfully unconfigures our interface | 15:42 |
clarkb | we reboot and go back to our existing glean config and are statically configured again | 15:42 |
fungi | that may also explain why the dhcp discovery datagrams we were seeing logged by the node's iptables drop rules were from other nodes experiencing the same issue | 15:42 |
mwhahaha | https://bugs.launchpad.net/tripleo/+bug/1866202 | 15:42 |
openstack | Launchpad bug 1866202 in tripleo "OVB on centos8 fails because of networking failures" [Critical,Fix released] - Assigned to wes hayutin (weshayutin) | 15:42 |
clarkb | fungi: oh ya they're all asking for dhcp and no one can respond | 15:42 |
clarkb | and this would happen in any other cloud that is statically configured too | 15:42 |
mwhahaha | so if /etc/sysconfig/network-scripts/ifcfg-en3 exists, when we restart legacy networking it'll nuke the address | 15:42 |
fungi | because that cloud doesn't have a dhcpd to answer | 15:42 |
clarkb | rax is at least | 15:42 |
clarkb | mwhahaha: that will do it | 15:43 |
clarkb | /etc/sysconfig/network-scripts/ifcfg-en3 is how glean configures the interface | 15:43 |
clarkb | ens3 but ya | 15:43 |
mwhahaha | though on that node it's configured static | 15:43 |
mwhahaha | so seems weird | 15:43 |
clarkb | mwhahaha: it is statically configured via the network script and NM's sysconfig compat layer | 15:43 |
mwhahaha | ovs and networkmanager don't play nicely so i wonder if there's an issue around that | 15:44 |
clarkb | and this could explain why your third party ci doesn't see it. If that cloud is configured to use dhcp then it will work fine | 15:44 |
fungi | also explains why we saw disproportionately more of these in some providers... those are likely the ones with no dhcpd | 15:45 |
clarkb | fungi: ya I think rax and inap are the non dhcp cases for us | 15:45 |
clarkb | everyone else does dhcp I think | 15:45 |
mwhahaha | hrm no, /etc/sysconfig/network-scripts is empty in the image i pulled from nb02 | 15:45 |
mwhahaha | so we're not hitting that bug where the left over thing is there | 15:45 |
clarkb | mwhahaha: its written by glean on boot based on config drive information | 15:45 |
mwhahaha | yea but we shouldn't be changing it to dhcp (and it's not dhcp on that node) | 15:46 |
clarkb | mwhahaha: well something is telling NM to dhcp | 15:46 |
clarkb | per the syslog fungi pasted above | 15:46 |
mwhahaha | clearly | 15:46 |
clarkb | and its happening about an hour after the node boots so it isn't glean | 15:46 |
* mwhahaha goes digging | 15:46 | |
clarkb | (at least it would be super weird for boot units to fire that late) | 15:47 |
clarkb | mwhahaha: we can add your ssh key to this node if it helps to dig with logs | 15:47 |
clarkb | I just need a copy of the pubkey | 15:47 |
mordred | wow. I pop in to see how things are going and I see a conversation about NM randomly dhcping an hour after boot | 15:48 |
mwhahaha | i think sshnaidm|afk added me if you're looking at centos-8-inap-mtl01-0017550599 | 15:48 |
clarkb | mwhahaha: yup thats the one | 15:48 |
clarkb | mwhahaha: not sure if you had all the background on it. We caught the held node. Observed its link layer was working based on iptables drop logging in the console log but it did not ping or ssh | 15:49 |
fungi | i'll correlate the timestamp to the job | 15:49 |
clarkb | mwhahaha: after double checking the cloud hadn't error'd with openstack apis we rebooted the instance and it came back | 15:49 |
clarkb | currently it appears to be using the glean static config as expected which is why it is working | 15:49 |
clarkb | syslog shows at some point NM switched to dhcp which failed and unconfigured the interface | 15:49 |
clarkb | we've also got ~2 more instances in the unpingable state that we can probably reboot if necessary but for now keeping them in that state might be good to have in our back pocket | 15:51 |
clarkb | note the container image build jobs seem to hit this too. I wonder if podman/buildah are doing NM config? | 15:51 |
fungi | looks like the "run toci_gate_test.sh" task starts at 13:57:59 and then dhcp4 is activated on ens3 by NetworkManager at 14:15:25 we log the unreachable state for the node at 14:21:07 | 15:52 |
clarkb | and with that I'm going to find breakfast and maybe do a bike ride since we're somewhere this is deuggable | 15:52 |
clarkb | fungi: mwhahaha sshnaidm|afk /tmp/console-* will have console logs generated by the host too | 15:52 |
clarkb | those may offer a more exact correlation to whatever did the thing | 15:52 |
mwhahaha | so we run os-net-config right before networkmanager starts doing stuff | 15:53 |
mwhahaha | let me trouble shoot this | 15:53 |
fungi | there is a bit of time between dhcp starting to try to get a lease and giving up/unconfiguring the interface (more likely picking a v4 linklocal address to replace the prior static one with) and then a bit more delay before ansible decides ssh access is timing out | 15:55 |
mwhahaha | had glean always used network manager? | 15:55 |
fungi | i believe it had to for centos-8 and newer fedora | 15:55 |
mwhahaha | k | 15:55 |
clarkb | we switched centos7 to it too I think | 15:55 |
mwhahaha | we aren't touching ens3 in our os-net-config configuration | 15:55 |
clarkb | but would need to double check that | 15:55 |
fungi | ianw knows a lot more of the history there, he had to struggle mightily to find something consistent for everything | 15:56 |
mwhahaha | we're adding a bridge br-ex to br-ctlplane which we already configured | 15:56 |
mwhahaha | so let me see if i can figure out why networkmanager tries to dhcp ens3 | 15:56 |
clarkb | mwhahaha: fungi we could rerun suspect things on that host and reboot to bring it back if it breaks | 15:56 |
clarkb | that may help narrow it down quickly | 15:56 |
mwhahaha | Jul 1 14:15:25 centos-8-inap-mtl01-0017550599 NetworkManager[1023]: <info> [1593612925.4888] device (ens3): state change: activated -> deactivating (reason 'connection-removed', sys-iface-state: 'managed') | 15:57 |
mwhahaha | is likely the cause but i don't know why that's occurring | 15:57 |
mwhahaha | so we do a systemctl restart network which is the legacy network stuff and it seems to mess with networkmanager | 15:57 |
fungi | neat | 15:58 |
clarkb | you could try it on that held node and see if it breaks | 15:59 |
mwhahaha | so we restart networking, network manager bounces ens3 and it tries to reconnect it using dhcp | 15:59 |
clarkb | I wonder if that is an 8.2 change | 15:59 |
mwhahaha | even though ifcfg-ens3 is configured as static | 15:59 |
mwhahaha | i bet it is but not certain | 15:59 |
*** shtepanie has joined #opendev | 15:59 | |
mwhahaha | can you point me to the glean config code | 16:00 |
clarkb | https://opendev.org/opendev/glean/src/branch/master/glean/cmd.py | 16:00 |
mwhahaha | thx | 16:00 |
*** sshnaidm|afk is now known as sshnaidm|ruck | 16:01 | |
clarkb | that is the bulk of it and there should be a systemd unit that calls that for ens3 on the host | 16:01 |
clarkb | though it also uses udev so it is a parameterized unit | 16:01 |
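For context, a very rough sketch of the end result of glean's static configuration: an ifcfg file that NetworkManager consumes through its sysconfig compat plugin. The real logic lives in glean/cmd.py linked above; the netmask/gateway values and the exact key set here are assumptions, not what glean literally writes:

```python
IFCFG_TEMPLATE = """\
DEVICE={device}
BOOTPROTO=static
ONBOOT=yes
IPADDR={address}
NETMASK={netmask}
GATEWAY={gateway}
"""

def write_ifcfg(device, address, netmask, gateway,
                path="/etc/sysconfig/network-scripts"):
    # Write the static config the sysconfig compat layer will pick up.
    with open(f"{path}/ifcfg-{device}", "w") as f:
        f.write(IFCFG_TEMPLATE.format(
            device=device, address=address, netmask=netmask, gateway=gateway))

# e.g. write_ifcfg("ens3", "198.72.124.39", "255.255.255.0", "198.72.124.1")
# (netmask/gateway above are placeholders, not the real inap values)
```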
mwhahaha | you folks rebooted this to get the node back right? | 16:02 |
clarkb | yes | 16:02 |
mwhahaha | yea so that points to networkmanager doing silly stuff | 16:03 |
mwhahaha | i think we can work around it by just disabling networkmanager in a pre task | 16:04 |
mwhahaha | for now | 16:04 |
clarkb | will that unconfigure the static config NM sets? | 16:05 |
clarkb | worth a try I guess | 16:05 |
fungi | yeah, easy enough to test | 16:06 |
mwhahaha | it shouldn't | 16:06 |
mwhahaha | stopping the networkmanager service should still leave the networking configured | 16:07 |
mwhahaha | it should prevent networkmanager from waking up and touching the interface | 16:07 |
mwhahaha | weshay_ruck, sshnaidm|ruck: we should be able to reproduce this by launching a vm on a network w/o DHCP and configuring the interface statically via network manager. then 'service restart network' | 16:11 |
mwhahaha | on a centos8.2 vm | 16:11 |
*** etp has quit IRC | 16:16 | |
fungi | looking at cacti graphs for gitea-lb01, the background noise from our rogue distributed crawler is still ongoing | 16:19 |
*** sshnaidm|ruck is now known as sshnaidm|afk | 16:20 | |
clarkb | fungi: ya I'm trying to get out the door for a bike ride but afterwards I'd like to land and apply our logging updates for haproxy and gitea | 16:21 |
clarkb | and see if the post failures for swift uploads are persitent and dig into that more | 16:21 |
fungi | sounds good. i'm still trying to catch up since half of every day this week has been conference | 16:24 |
clarkb | fungi: also did you see ianw found the likely tool that is hitting us? | 16:25 |
clarkb | based on the UA values | 16:25 |
fungi | yup | 16:25 |
fungi | i fiddled with machine translating the readme | 16:25 |
fungi | seems a very likely suspect | 16:26 |
*** hashar has joined #opendev | 16:29 | |
*** ykarel is now known as ykarel|away | 16:34 | |
openstackgerrit | Merged openstack/project-config master: Update Neutron Grafana dashboard https://review.opendev.org/738784 | 16:39 |
*** dtantsur is now known as dtantsur|afk | 16:44 | |
*** factor has quit IRC | 16:51 | |
*** icarusfactor has joined #opendev | 16:51 | |
*** hashar has quit IRC | 16:58 | |
*** moppy has quit IRC | 17:15 | |
*** corvus has quit IRC | 17:15 | |
*** bolg has quit IRC | 17:15 | |
*** guillaumec has quit IRC | 17:15 | |
*** andreykurilin has quit IRC | 17:15 | |
*** factor has joined #opendev | 17:16 | |
*** icarusfactor has quit IRC | 17:16 | |
*** hashar has joined #opendev | 17:17 | |
*** corvus has joined #opendev | 17:18 | |
*** guillaumec has joined #opendev | 17:18 | |
*** moppy has joined #opendev | 17:18 | |
*** andreykurilin has joined #opendev | 17:19 | |
*** slittle1 has quit IRC | 17:26 | |
*** weshay_ruck has quit IRC | 17:26 | |
*** mtreinish has quit IRC | 17:26 | |
*** cloudnull has quit IRC | 17:26 | |
*** hrw has quit IRC | 17:26 | |
*** AJaeger has quit IRC | 17:26 | |
*** slittle1 has joined #opendev | 17:29 | |
*** weshay_ruck has joined #opendev | 17:29 | |
*** mtreinish has joined #opendev | 17:29 | |
*** cloudnull has joined #opendev | 17:29 | |
*** hrw has joined #opendev | 17:29 | |
*** AJaeger has joined #opendev | 17:29 | |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Test multiarch release builds and use temp registry with buildx https://review.opendev.org/737315 | 17:30 |
hrw | tarballs.opendev.org have PROJECTNAME-stable-BRANCH.tar.gz tarballs for most of projects. What defines which projects get them? | 17:39 |
*** bolg has joined #opendev | 17:41 | |
fungi | hrw: those are created by the publish-openstack-python-branch-tarball job included in the post pipeline by lots of project-templates and also directly by some projects | 17:42 |
fungi | mostly it'll be projects which use the openstack-python-jobs project-template or one of its dependency-specific variants | 17:43 |
hrw | fungi: thanks. | 17:45 |
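For reference, hooking a project into that job is normally just a matter of including the template in its zuul configuration; a minimal sketch (the project name here is hypothetical):

    - project:
        name: example/some-project
        templates:
          # pulls in publish-openstack-python-branch-tarball via the post pipeline
          - openstack-python-jobs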
*** hashar is now known as hasharAway | 17:47 | |
openstackgerrit | Merged zuul/zuul-jobs master: Test multiarch release builds and use temp registry with buildx https://review.opendev.org/737315 | 18:19 |
*** hasharAway is now known as hashar | 18:25 | |
openstackgerrit | Lance Bragstad proposed openstack/project-config master: Create a new project for ansible-tripleo-ipa-server https://review.opendev.org/738842 | 18:28 |
*** chandankumar is now known as raukadah | 18:39 | |
openstackgerrit | Merged zuul/zuul-jobs master: ensure-pip debian: update package lists https://review.opendev.org/737529 | 18:39 |
clarkb | I've approved https://review.opendev.org/#/c/738714/ | 18:41 |
clarkb | fungi: I'm thinking maybe we single core approve https://review.opendev.org/#/c/738710/ ? I think having that extra logging will be important to continue monitoring this ddos | 18:43 |
fungi | wfm, it's been tested | 18:44 |
corvus | +3 | 18:44 |
fungi | i was only hesitating to self-approve in case folks disagreed with the approach there | 18:44 |
clarkb | corvus: thanks! | 18:44 |
fungi | since it's a general haproxy role we might use for other things with different logging styles | 18:45 |
fungi | so i could have set it per-backend instead of in the default section | 18:45 |
clarkb | fungi: ya I think we can sort that out if we end up adding different backend types | 18:45 |
corvus | i bet if we used it for others we'd want consistency | 18:45 |
corvus | (oh, yeah, if we used different backend types that might be different.) | 18:45 |
fungi | though the way we're generating the backends from a template now would still need reworking to support other forwarders | 18:45 |
corvus | anyway, that's a future problem | 18:46 |
corvus | and i'm in the present | 18:46 |
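For context, the gist of the change being discussed is dropping 'option tcplog' in favour of an explicit log-format in the defaults section so the selected backend server's ip:port is recorded alongside the client's. A rough sketch only, not the exact reviewed config:

    defaults
        log global
        mode tcp
        # roughly the stock tcplog fields plus the server address:port (%si:%sp)
        log-format "%ci:%cp [%t] %ft %b/%s %si:%sp %Tw/%Tc/%Tt %B %ts"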
*** factor has quit IRC | 18:46 | |
*** factor has joined #opendev | 18:46 | |
clarkb | mwhahaha: if you all end up sorting out the underlying issue it would be great to hear details so that we're aware of those issues generally (since others do use those images too) | 16:48 |
mwhahaha | will do | 18:48 |
mwhahaha | i don't think it affects others because most folks integrate with networkmanager; os-net-config's NM support is still WIP so we still use the legacy network scripts | 16:48 |
clarkb | gotcha | 18:49 |
mwhahaha | but if i can figure out the RCA i'll get a bz and a ML post together | 18:49 |
fungi | that would be awesome | 18:49 |
fungi | we sort of expect the situation with glean+nm to be a little fragile after everything we discovered about the interactions between ipv6 slaac autoconf in the kernel and nm fighting over interfaces | 18:50 |
clarkb | unfortunately it's the system rhel and friends are committed to, so we're trying to play nice | 16:50 |
mwhahaha | yea the best thing to do would be NM_managed=No | 18:50 |
mwhahaha | it looked like it was still yes for ens3 | 18:51 |
mwhahaha | anyway time to try and reproduce :D | 18:51 |
mwhahaha | at least now we have a direction, thanks for your efforts | 18:51 |
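For reference, the usual spelling of that knob in initscripts-style ifcfg files is NM_CONTROLLED=no; an illustrative static config (addresses are placeholders):

    # /etc/sysconfig/network-scripts/ifcfg-ens3
    DEVICE=ens3
    ONBOOT=yes
    BOOTPROTO=static
    IPADDR=203.0.113.10
    NETMASK=255.255.255.0
    GATEWAY=203.0.113.1
    NM_CONTROLLED=no    # keep NetworkManager's hands off this interface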
*** factor has quit IRC | 19:04 | |
*** factor has joined #opendev | 19:04 | |
*** factor has quit IRC | 19:10 | |
*** _mlavalle_1 has quit IRC | 19:10 | |
clarkb | I'm going to clean up some of those holds now. In particular the one that had us confused as it isn't useful | 19:12 |
clarkb | corvus: fungi is the proper way to do that via zuul autohold commands or nodepool delete? | 19:12 |
clarkb | I think I've been doing nodepool delete in the past but realized when corvus asked for a hold id earlier today that maybe I can do that via zuul? | 19:13 |
clarkb | ya I think I may have been doing this wrong | 19:15 |
*** _mlavalle_1 has joined #opendev | 19:15 | |
mwhahaha | i'm still poking at centos-8-inap-mtl01-0017550599 so if you could keep that around for a few that'd be great | 19:16 |
mwhahaha | i did grab logs/configs but i want to make sure i got everything | 19:16 |
fungi | clarkb: i had been doing it through nodepool delete, but recently learned that removing the autohold is better since that will still cause nodepool to clean up the nodes | 19:18 |
clarkb | mwhahaha: yup will leave that one alone | 19:18 |
mwhahaha | thanks | 19:18 |
clarkb | fungi: cool I was just about to test that | 19:18 |
fungi | yeah, better because it doesn't leave orphaned autohold entries around | 19:18 |
clarkb | ya I've got a couple to clean up | 19:19 |
clarkb | now to double check which autohold id is the one mwhahaha is on | 19:21 |
clarkb | id 0000000160 is the one mwhahaha is on and I kept 0000000161 too in case we need a second one but cleaned up the others | 19:22 |
clarkb | the gitea and haproxy things should be landing soon too. I'll be sure to restart giteas for the new format once that config is applied | 19:24 |
mwhahaha | is there an easy way to manually run glean to configure the network? | 19:25 |
mwhahaha | trying to reproduce how it would configure it vs how the installer does | 19:25 |
clarkb | mwhahaha: you can invoke the script that the unit runs either directly or by triggering the unit. Let me take a quick look to be more specific | 19:26 |
clarkb | mwhahaha: the unit is glean@.service and the @ means it takes a parameter, in this case the interface name. I think you trigger that with systemctl start glean@ens3 ? But also I notice the unit has a condition where it won't run if the sysconfig files exist | 19:27 |
clarkb | in this case it may be easiest to run it directly, and for that you want Environment="ARGS=--interface %I" and ExecStart=/usr/local/bin/glean.sh --use-nm --debug $ARGS | 19:28 |
clarkb | %I is magical systemd interpolation for the argument and in this case it would be ens3 | 19:28 |
mwhahaha | yea that's what i'm looking for, thanks | 19:28 |
mwhahaha | i'll figure it out from there | 19:28 |
clarkb | and if you want to know what triggers it with the parameter I think it is a udev rule | 19:29 |
clarkb | ya /etc/udev/rules.d/99-glean.rules SUBSYSTEM=="net", ACTION=="add", ATTR{addr_assign_type}=="0", TAG+="systemd", ENV{SYSTEMD_WANTS}+="glean@$name.service" | 19:29 |
clarkb | glean@ens3 is the name to use with systemctl | 19:29 |
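Putting that together, roughly how the pieces fit (a sketch paraphrasing the lines quoted above rather than the exact files shipped on the image):

    # glean@.service is a parameterized (templated) oneshot unit, abridged:
    #   [Service]
    #   Environment="ARGS=--interface %I"
    #   ExecStart=/usr/local/bin/glean.sh --use-nm --debug $ARGS

    # trigger it for ens3 via systemd (subject to the unit's condition that the
    # sysconfig file does not already exist):
    systemctl start glean@ens3

    # or bypass the unit and run the script directly:
    /usr/local/bin/glean.sh --use-nm --debug --interface ens3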
openstackgerrit | Merged opendev/system-config master: Update gitea access log format https://review.opendev.org/738714 | 19:33 |
*** yoctozepto7 has joined #opendev | 19:37 | |
openstackgerrit | Merged opendev/system-config master: Remove the tcplog option from haproxy configs https://review.opendev.org/738710 | 19:40 |
corvus | clarkb, fungi: yes, delete via zuul autohold delete | 19:42 |
corvus | one command instead of 2 | 19:42 |
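So the cleanup flow is roughly as follows (command spellings are from memory and may vary slightly between zuul client versions; the tenant is a placeholder, the hold id is the one mentioned below):

    # find the hold id
    zuul autohold-list --tenant <tenant>

    # deleting the hold lets nodepool reclaim the node on its own
    zuul autohold-delete --tenant <tenant> 0000000160

    # older approach: delete the node directly, which leaves the autohold entry behind
    nodepool delete <node-id>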
*** yoctozepto has quit IRC | 19:45 | |
*** yoctozepto7 is now known as yoctozepto | 19:45 | |
openstackgerrit | Sean McGinnis proposed openstack/project-config master: update-constraints: Install pip for all versions https://review.opendev.org/738926 | 19:53 |
clarkb | I've restarted gitea on gitea01 to pick up the access log format change | 19:55 |
clarkb | it is happy so I'm working through the others then will double check the lb is similarly updated | 19:55 |
weshay_ruck | clarkb, fungi thanks guys! | 19:59 |
*** sorin-mihai__ has quit IRC | 20:01 | |
mwhahaha | so it looks like the ens3 config gets lost | 20:01 |
mwhahaha | glean writes out the file and it's named 'System ens3': NetworkManager[1023]: <info> [1593611369.2978] device (ens3): Activation: starting connection 'System ens3' (21d47e65-8523-1a06-af22-6f121086f085) | 20:02 |
clarkb | infra-root we now have ip:port recorded in haproxy and gitea logs so we can map between them | 20:02 |
mwhahaha | but when we restart networking, it doesn't have this so it creates a Wired connection 1 | 20:02 |
clarkb | mwhahaha: so when I rebooted it, it didn't just reread the existing config; instead glean reran and rewrote the config? interesting | 20:04 |
mwhahaha | maybe? | 20:04 |
mwhahaha | because now it's back to being 'System ens3' | 20:04 |
clarkb | ianw should be waking up soon and may have NM thoughts | 20:05 |
fungi | oh, i'm sure he has nm thoughts, but most are probably angry ones | 20:07 |
mwhahaha | ha | 20:08 |
mwhahaha | let me see if i can see where we might be removing this file all of the sudden | 20:08 |
mwhahaha | seems weird still | 20:08 |
mwhahaha | yea it's like ifcfg-ens3 goes missing. the behaviour is that of a node where ifcfg-ens3 is removed and when you restart networkmanager it creates the 'Wired connection 1' | 20:11 |
mwhahaha | we shouldn't be touching that | 20:11 |
openstackgerrit | Sean McGinnis proposed openstack/project-config master: Use python3 for update_constraints https://review.opendev.org/738931 | 20:13 |
mwhahaha | sweet, it's os-net-config | 20:14 |
mwhahaha | could you do me a favor and restart that node | 20:16 |
* mwhahaha nukes the network | 20:16 | |
clarkb | mwhahaha: I can | 20:16 |
mwhahaha | thank you | 20:16 |
clarkb | mwhahaha: ready for that now? | 20:16 |
mwhahaha | yes plz | 20:16 |
clarkb | reboot issued, will probably be a minute before ssh is accessible again | 20:17 |
*** sgw1 has quit IRC | 20:35 | |
mwhahaha | yea i don't think it's coming back, oh well | 20:38 |
clarkb | mwhahaha: we have another held node, and now that we know what is going on we can boot out-of-band centos8 images in inap too | 20:53 |
clarkb | assuming we need them let me know | 20:53 |
mwhahaha | no i think we have some direction. you should be able to release them for now | 20:53 |
mwhahaha | we're going to disable networkmanager as a starting point while we investigate what's happening to that config | 20:54 |
clarkb | k I'll clean those up in a bit | 20:56 |
clarkb | tomorrow I'll plan to land https://review.opendev.org/#/c/737885/ as that should make the gitea api interactions a bit more correct | 21:02 |
clarkb | but will want to watch it afterwards and it's already "late" in my day relative to my early start | 21:02 |
clarkb | the follow on to ^ needs work though | 21:02 |
clarkb | also I've just noticed that ianw implemented the apache proxy for gitea with UA filtering | 21:03 |
clarkb | so reviewing that now too | 21:03 |
clarkb | those changes actually look good. fungi corvus https://review.opendev.org/#/c/738721/4 the way it's done we don't cut over haproxy to it, so we could land those changes, deploy the apache, test it, then switch haproxy. Should be very safe | 21:06 |
clarkb | ianw: ^ thanks for doing that I like the approach being able to be a measured transition in prod too | 21:07 |
openstackgerrit | Dmitriy Rabotyagov (noonedeadpunk) proposed opendev/system-config master: Add copr-lxc3 to list of mirrors https://review.opendev.org/738942 | 21:15 |
corvus | clarkb, ianw: +2, but i hope we don't have to use it | 21:20 |
*** priteau has quit IRC | 21:21 | |
*** factor has joined #opendev | 21:29 | |
*** factor has quit IRC | 21:33 | |
*** factor has joined #opendev | 21:33 | |
*** hashar has quit IRC | 21:49 | |
*** factor has quit IRC | 21:49 | |
*** factor has joined #opendev | 21:50 | |
fungi | well, the traffic we saw yesterday seems to be continuing, so i expect it may come down to either that or leaving every customer of the largest isp in china blocked with iptables | 21:54 |
fungi | no idea when (or if) it will ever subside | 21:55 |
ianw | hey, around now | 21:59 |
clarkb | ianw: no rush, was just looking at your change to apache filter UAs in front of gitea. I think we can roll that out if we decide it is necessary | 22:01 |
ianw | i figure with the proxy we could just watch the logs for 301's until they disappear, and then turn it off | 22:01 |
clarkb | ya | 22:01 |
clarkb | though aren't you doing 403? | 22:01 |
ianw | sorry 403 yeah | 22:01 |
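A 403-on-matching-user-agent rule in Apache generally looks something like this (illustrative only; the pattern shown is one of the crawler UA fragments from the earlier discussion, not the exact list in ianw's change):

    # mod_setenvif tags the offending UAs, mod_authz_core denies them with a 403
    SetEnvIfNoCase User-Agent "SE 2\.X MetaSr 1\.0" bad_crawler
    <Location "/">
        <RequireAll>
            Require all granted
            Require not env bad_crawler
        </RequireAll>
    </Location>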
ianw | if anyone is feeling containerish and wants too look over https://review.opendev.org/#/q/topic:grafana-container+status:open that would be great too | 22:03 |
ianw | really the only non-standard thing in there is in graphite where i've used ipv4 and ipv6 socat on 8125 to send into the graphite container | 22:04 |
clarkb | ianw: also we managed to catch a centos8 node that lost networking during a tripleo job. Rebooting it brought it back. Turns out that something to do with restarting networking in os-net-config or similar caused our static config for the interface to be wiped out and NM switched to dhcp | 22:05 |
clarkb | tripleo is going to try and workaround it by disabling NM but also looking into possibility that centos8.2 changed that and broke things | 22:05 |
clarkb | ianw: small thing on https://review.opendev.org/#/c/737406/18 | 22:11 |
clarkb | and question on https://review.opendev.org/#/c/738125/7 | 22:15 |
*** DSpider has quit IRC | 22:18 | |
ianw | clarkb: yeah, the host network thing is the ipv6 thing, which is addressed in https://review.opendev.org/#/c/738125/7/playbooks/roles/graphite/tasks/main.yaml @ 63 | 22:20 |
ianw | basically, i found that when setting up port 8125 to go into the container, docker would bind to ipv6 | 22:20 |
ianw | but the container has no ipv6 handling at all | 22:21 |
clarkb | ianw: what about host networking? | 22:21 |
clarkb | is the issue that with host networking the graphite services only bind on 0.0.0.0 ? | 22:21 |
clarkb | maybe it would be better to use host networking with the proxy (if we can configure graphite services to bind on other ports then proxy to them), that way we can be consistent? | 22:22 |
ianw | hrm, yeah what i tried first was having host with 8125 and hoping i could run a socat proxy for ipv6 8125 but found docker took it over | 22:24 |
ianw | yeah, i guess if it's different ports it doesn't matter, let me try | 22:25 |
clarkb | ya in my suggestion we'd put statsd on 8126 and then the proxy listens on 8125 and forwards | 22:25 |
clarkb | at least that way it's weird, but consistently weird with our normal setup :) | 22:25 |
ianw | systemd actually has a nice socket activated forwarding service | 22:26 |
ianw | ... that doesn't support udp | 22:26 |
clarkb | ha | 22:26 |
fungi | "nice" | 22:29 |
clarkb | I guess they do that because it just wants to stop at the syn ack syn | 22:31 |
fungi | it's synful | 22:31 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Graphite container deployment https://review.opendev.org/738125 | 22:31 |
ianw | clarkb: ^ ok, so that has socat listening on ipv4&ipv6 8125 which forwards to 8825 (8126 is taken by statsd admin) which docker should map to 8125 in the container ... phew! | 22:32 |
clarkb | hrm I'm not sure docker will do the port mapping with host mode | 22:33 |
clarkb | (it might) | 22:33 |
clarkb | we may need to configure the services to use a different port if that doesn't work | 22:33 |
ianw | OOOHHHHH yeah THAT's right! | 22:33 |
ianw | that's why i did it | 22:33 |
*** icarusfactor has joined #opendev | 22:33 | |
ianw | docker silently says "oh btw i'm not doing the port mapping" but continues on | 22:34 |
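The socat arrangement being described is roughly the following (a sketch; exact options in the review may differ, and as just noted the 8825-to-8125 remap only happens if docker is actually doing port mapping, i.e. not in host network mode):

    # relay the host's IPv4 and IPv6 statsd port to the container's alternate port
    socat -u UDP4-RECVFROM:8125,fork UDP4-SENDTO:127.0.0.1:8825 &
    socat -u UDP6-RECVFROM:8125,ipv6only,fork UDP4-SENDTO:127.0.0.1:8825 &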
*** factor has quit IRC | 22:36 | |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Handle multi-arch docker manifests in promote https://review.opendev.org/738945 | 22:39 |
ianw | i wonder if with host networking ipv6 gets into the container | 22:41 |
fungi | i would expect so | 22:43 |
fungi | we have other containerized services listening on v6 sockets | 22:43 |
fungi | our gitea haproxy, for example | 22:44 |
ianw | the problem is i'll have to re-write the upstream container to have ipv6 bindings | 22:45 |
ianw | maybe it's worth the pain | 22:45 |
clarkb | they don't make it configurable? that's too bad if so | 22:45 |
ianw | they did take one of my patches fairly quickly | 22:45 |
*** tkajinam has joined #opendev | 22:46 | |
corvus | why are there port bindings with host network mode? | 22:50 |
clarkb | corvus: they were there from an earlier ps which did not use host networking | 22:51 |
corvus | ok, so next ps will remove those? | 22:51 |
corvus | (cause ps8 has both) | 22:51 |
ianw | corvus: this is the question ... i'm trying to avoid having to hack the pre-built container for ipv6 | 22:52 |
clarkb | another option could be to drop our AAAA record from dns for that service | 22:52 |
ianw | yeah, i feel like that's the worst option, it's a regression on what we have | 22:53 |
openstackgerrit | Merged zuul/zuul-jobs master: Handle multi-arch docker manifests in promote https://review.opendev.org/738945 | 22:53 |
corvus | i'd love to stick with host networking like all the others, so if we could change the bind that would be great | 22:58 |
ianw | there's two things to deal with in the container; gunicorn presenting graphite and statsd | 22:59 |
clarkb | it looks like we can mount in configs for both | 23:01 |
clarkb | we'd then need to write the configs but I think that will allow us to change the bind? | 23:01 |
ianw | statsd maybe, that launches from a config file | 23:01 |
clarkb | as an alternative we could change the bind upstream since ipv6 listening at :: will also work for ipv4 | 23:01 |
ianw | it looks like gunicorn is started from the script with "-b" arguments | 23:02 |
ianw | urgh, the other thing is it has no support for ssl | 23:05 |
ianw | it runs nginx | 23:08 |
*** _mlavalle_1 has quit IRC | 23:11 | |
corvus | wonder why they didn't just use uwsgi | 23:16 |
*** tosky has quit IRC | 23:16 | |
ianw | if we map in the keys, a nginx config, a statsd config it might just work | 23:20 |
ianw | ... assuming upstream never changes anything | 23:20 |
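For illustration, the kind of bind overrides being floated might look like this (hedged: the statsd config keys are the standard etsy/statsd ones, but the file locations inside the upstream image and the wsgi module name are assumptions):

    // statsd config: bind to the IPv6 wildcard, which also accepts IPv4 on a dual-stack host
    { "address": "::", "port": 8125, "graphiteHost": "127.0.0.1", "graphitePort": 2003 }

    # gunicorn takes the same idea via -b (module name is a placeholder)
    gunicorn -b '[::]:8080' wsgi:application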