*** Dmitrii-Sh has quit IRC | 00:02 | |
*** ryohayakawa has joined #opendev | 00:04 | |
*** Dmitrii-Sh has joined #opendev | 00:07 | |
*** rh-jelabarre has quit IRC | 00:43 | |
*** rh-jelabarre has joined #opendev | 00:44 | |
*** rh-jelabarre has quit IRC | 00:48 | |
*** gouthamr_ has quit IRC | 00:59 | |
*** gouthamr_ has joined #opendev | 01:05 | |
*** diablo_rojo has quit IRC | 01:12 | |
*** gouthamr_ has quit IRC | 01:25 | |
*** gouthamr_ has joined #opendev | 01:37 | |
mnaser | has there been any issues with promotion by any chance | 02:49 |
---|---|---|
mnaser | https://hub.docker.com/r/vexxhost/magnum-api/tags for example has no promotion since merging | 02:50 |
* mnaser double checks job config... | 02:51 | |
mnaser | `openstack-operator:images:promote:magnum` last ran 2020-06-22 19:28:29 | 02:52 |
*** gouthamr_ has quit IRC | 02:52 | |
mnaser | https://zuul.opendev.org/t/vexxhost/builds?change=739774%2C7 | 02:52 |
mnaser | looks like the promote never got enqueued.. even for docs, this was 5 changes that merged all at once | 02:52 |
*** gouthamr_ has joined #opendev | 02:56 | |
*** gouthamr_ has quit IRC | 03:26 | |
*** gouthamr_ has joined #opendev | 03:32 | |
*** gouthamr_ has quit IRC | 04:00 | |
*** gouthamr_ has joined #opendev | 04:11 | |
*** gouthamr_ has quit IRC | 04:29 | |
*** gouthamr_ has joined #opendev | 04:30 | |
*** gouthamr_ has quit IRC | 04:35 | |
*** gouthamr_ has joined #opendev | 04:40 | |
*** ysandeep|away is now known as ysandeep|rover | 05:03 | |
*** marios has joined #opendev | 05:39 | |
*** gouthamr_ has quit IRC | 06:50 | |
*** gouthamr_ has joined #opendev | 06:52 | |
*** knikolla has quit IRC | 06:54 | |
*** knikolla has joined #opendev | 06:56 | |
*** gouthamr_ has quit IRC | 07:04 | |
*** gouthamr_ has joined #opendev | 07:05 | |
*** gouthamr_ has quit IRC | 07:15 | |
*** gouthamr_ has joined #opendev | 07:22 | |
*** zbr7 is now known as zbr|ruck | 07:23 | |
*** hashar has joined #opendev | 07:45 | |
*** gouthamr_ has quit IRC | 07:46 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Update synchronize-repos https://review.opendev.org/740110 | 07:47 |
*** ysandeep|rover is now known as ysandeep|lunch | 07:52 | |
*** gouthamr_ has joined #opendev | 07:52 | |
*** ysandeep|lunch is now known as ysandeep | 07:53 | |
*** fressi has quit IRC | 07:53 | |
*** ysandeep is now known as ysandeep|lunch | 07:53 | |
*** sshnaidm has joined #opendev | 07:55 | |
*** gouthamr_ has quit IRC | 07:57 | |
*** fressi has joined #opendev | 08:01 | |
*** moppy has quit IRC | 08:01 | |
*** moppy has joined #opendev | 08:01 | |
*** gmann has quit IRC | 08:03 | |
*** gmann has joined #opendev | 08:06 | |
*** gouthamr_ has joined #opendev | 08:09 | |
*** iurygregory has quit IRC | 08:10 | |
*** jaicaa has quit IRC | 08:17 | |
openstackgerrit | Jeffrey Zhang proposed openstack/diskimage-builder master: support non-x86_64 centos7 to change DIB_DISTRIBUTION_MIRROR variable https://review.opendev.org/740183 | 08:18 |
*** jaicaa has joined #opendev | 08:20 | |
*** iurygregory has joined #opendev | 08:24 | |
*** tobiash has quit IRC | 08:26 | |
*** ysandeep|lunch is now known as ysandeep|rover | 08:34 | |
*** DSpider has joined #opendev | 08:36 | |
*** tobiash has joined #opendev | 09:03 | |
*** tosky has joined #opendev | 09:10 | |
openstackgerrit | Thierry Carrez proposed openstack/project-config master: maintain-github-mirror: fix dependency name https://review.opendev.org/740199 | 09:17 |
*** dtantsur|afk is now known as dtantsur | 09:29 | |
*** ryohayakawa has quit IRC | 09:47 | |
openstackgerrit | Dmitry Tantsur proposed openstack/project-config master: Account for ironic bugfix branches https://review.opendev.org/740212 | 09:49 |
*** roman_g has joined #opendev | 09:54 | |
*** roman_g has quit IRC | 10:14 | |
*** fressi has quit IRC | 10:16 | |
*** roman_g has joined #opendev | 10:16 | |
openstackgerrit | Dmitry Tantsur proposed openstack/project-config master: Account for ironic bugfix branches https://review.opendev.org/740212 | 10:26 |
*** hashar has quit IRC | 10:27 | |
*** ysandeep|rover is now known as ysandeep|afk | 10:42 | |
*** roman_g has quit IRC | 10:44 | |
*** ysandeep|afk is now known as ysandeep|rover | 10:58 | |
*** tkajinam has quit IRC | 11:02 | |
noonedeadpunk | fungi: so returning to the multinode issue - I localized the issue, but not sure why it is raised.. So I got these iptables rules on hold nodes http://paste.openstack.org/show/795701/ | 11:41 |
noonedeadpunk | and I think what happens, is that they don't allow ovs communication. as once I remove last reject rule everything becomes good | 11:42 |
noonedeadpunk | but according to them, icmp and things supposed to be good... | 11:43 |
noonedeadpunk | ok, so they're created with multinode-bridge role for some reason https://zuul.opendev.org/t/vexxhost/build/88fa15b6f7b4438eb1555f519aace349/log/job-output.txt#521 | 11:47 |
frickler | noonedeadpunk: this has the primary node as both switch and peers, but I think it should be only switch https://9f0840c51d5746d19be2-4a01e595aa1b522e94de715b2ff31aa2.ssl.cf1.rackcdn.com/739717/9/check/ffrouting-deploy/ebd87dd/zuul-info/inventory.yaml | 11:55 |
frickler | c.f. https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/multi-node-bridge | 11:55 |
noonedeadpunk | frickler: but I have same rules on the secondary as well | 11:56 |
noonedeadpunk | so they don't depend on the switch only | 11:56 |
*** rh-jelabarre has joined #opendev | 11:57 | |
frickler | noonedeadpunk: maybe the issue isn't firewall but some other setup. c.f. also this for a working setup https://7407bee42906727a6720-dc7e48b48408cf00a200e97d0ee2c855.ssl.cf5.rackcdn.com/734621/6/check/devstack-multinode/111424e/zuul-info/inventory.yaml | 11:57 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Use ansible_ssh_private_key_file when setting up windows https://review.opendev.org/740254 | 12:02 |
*** bhagyashris is now known as bhagyashris|brb | 12:04 | |
noonedeadpunk | frickler: so I added clear-firewall role after multi-node-bridge and test pass https://review.opendev.org/#/c/739717/12..13/tests/test.yml | 12:04 |
*** zbr_ has quit IRC | 12:05 | |
frickler | infra-root: reqs tox is failing, I cannot reproduce locally and there is no change in the tox freeze, so likely something that changed on our nodes, see e.g. https://zuul.opendev.org/t/openstack/build/1b1159196b0b4b91af987ceece5f22a3 | 12:05 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Enforce BatchMode when setting up ssh for windows https://review.opendev.org/740254 | 12:08 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Enforce BatchMode when setting up ssh for windows https://review.opendev.org/740254 | 12:10 |
frickler | noonedeadpunk: oh, actually do you include this role? https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/multi-node-firewall | 12:20 |
noonedeadpunk | frickler: nope, I didn't. But it's included by default without any option to omit it | 12:21 |
noonedeadpunk | oh, wait | 12:21 |
noonedeadpunk | I mixed up with persistent-firewall one | 12:21 |
noonedeadpunk | that's what the playbook looked like https://review.opendev.org/#/c/739717/12/tests/test.yml | 12:22 |
noonedeadpunk | btw, can you release hold for that job? | 12:22 |
*** ysandeep|rover is now known as ysandeep|afk | 12:37 | |
*** bhagyashris|brb is now known as bhagyashris | 12:37 | |
frickler | noonedeadpunk: so I guess you might want most of the roles here, I'm not sure in detail how they might interact https://opendev.org/zuul/zuul-jobs/src/branch/master/playbooks/multinode/pre.yaml | 12:39 |
noonedeadpunk | yeah, btw I saw that. Eventually firewall cleanup works for me and I guess we don't want it to interfere. | 12:40 |
noonedeadpunk | So thanks) | 12:40 |
frickler | noonedeadpunk: if you completely clean the firewall, you need to make sure that you don't have any exploitable services running like recursive DNS resolvers | 12:45 |
noonedeadpunk | if they're not pre-installed with the image only... | 12:47 |
openstackgerrit | Merged zuul/zuul-jobs master: Enforce BatchMode when setting up ssh for windows https://review.opendev.org/740254 | 12:47 |
noonedeadpunk | but we have bgp running as part of frr (and maybe ospf as well). but not sure how exploitable that is... | 12:47 |
noonedeadpunk | considering the job time is like 5 mins max | 12:48 |
frickler | noonedeadpunk: not sure about bgpd, we also have ntpd running by default, not sure about its config, maybe some other infra-root has some better grasp on this | 12:53 |
frickler | deleted the held nodes, too, thx | 12:53 |
openstackgerrit | Merged openstack/project-config master: maintain-github-mirror: fix dependency name https://review.opendev.org/740199 | 12:55 |
openstackgerrit | Merged openstack/project-config master: Publish Airship governance https://review.opendev.org/739790 | 12:55 |
openstackgerrit | Merged openstack/project-config master: add vexxhost/openstack-tools https://review.opendev.org/739627 | 12:57 |
*** ysandeep|afk is now known as ysandeep | 12:58 | |
openstackgerrit | Merged zuul/zuul-jobs master: emit-job-header: add inventory hostname https://review.opendev.org/738963 | 13:02 |
frickler | oh, that reqs issue is another setuptools thing it seems. passes with <=48, fails with the 49.2.0 we seem to install | 13:03 |
frickler | prometheanfire: smcginnis: ^^ | 13:03 |
mordred | frickler: :( | 13:05 |
smcginnis | Oh joy. | 13:07 |
smcginnis | frickler: Any idea if there is a setuptools bug filed? | 13:11 |
frickler | smcginnis: I haven't found one, feel free to do so | 13:18 |
*** roman_g has joined #opendev | 13:18 | |
frickler | according to git bisect, this seems to be the first broken commit, but it gives a different failure https://github.com/pypa/setuptools/commit/78d2a3bfafd38112dc3c486cd478e4cee1f782ec | 13:25 |
frickler | smcginnis: ah, got it, we need to change the expected exception | 13:29 |
frickler | something like http://paste.openstack.org/show/795705/ , but not sure yet how to make it work with the old one, too. | 13:30 |
*** mtreinish has quit IRC | 13:33 | |
*** mtreinish has joined #opendev | 13:33 | |
mordred | corvus: the container job did not fail yesterday - I rechecked it again this morning | 13:41 |
smcginnis | frickler: Maybe instead of a with block, we can just put it in a try/except and catch either one? | 13:53 |
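The try/except idea smcginnis suggests can be sketched roughly as below. This is only an illustration: the real exception classes live in the paste linked above and are not shown in this log, so `OldExpectedError`, `NewExpectedError`, and `assert_raises_either` are hypothetical stand-ins.

```python
class OldExpectedError(Exception):
    """Hypothetical stand-in for the exception raised with older setuptools."""


class NewExpectedError(Exception):
    """Hypothetical stand-in for the exception raised with setuptools 49.2.0."""


def assert_raises_either(func, *args, expected=(OldExpectedError, NewExpectedError)):
    """Call func(*args) and return the exception if it is one of `expected`.

    Raises AssertionError when func returns normally. Any exception that is
    not listed in `expected` propagates unchanged, so unrelated failures
    still surface in the test run.
    """
    try:
        func(*args)
    except expected as exc:
        return exc
    raise AssertionError(f"{func!r} did not raise any of {expected}")
```

If the test already uses `unittest`, passing a tuple of exception classes to `assertRaises` should achieve the same thing while keeping the `with` block.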
fungi | mnaser: i'll take a look in the scheduler logs and see if i can work out why it wasn't enqueued | 14:10 |
fungi | noonedeadpunk: does ovs use any sort of encapsulation? i'm not too familiar with it. if so, though, it's possible the encapsulated traffic (whatever form that takes) might be getting blocked on the node's real network interfaces | 14:13 |
fungi | we'd really prefer if you didn't run those nodes with no packet filtering. it's a good safety net in case a job happens to add exploitable services which get caught up in a global reflection ddos or something (we get reports about that pretty regularly, unfortunately, and have to hunt down who turned off firewalling on a job and exposed a socks5 proxy or a recursive dns resolver or a memcached or...) | 14:15 |
fungi | one way to check would be to put the default iptables rules back but modify the default block rule to log any hits, then reproduce your ping test and see what actually got blocked by iptables | 14:18 |
*** qchris has joined #opendev | 14:22 | |
frickler | fungi: well I'm pretty sure the issue is not running the multi-node-firewall role | 14:27 |
fungi | frickler: ahh, so it was actually blocking the vxlan packets then probably | 14:32 |
*** roman_g has quit IRC | 14:33 | |
*** roman_g has joined #opendev | 14:44 | |
*** mlavalle has joined #opendev | 14:50 | |
*** bhagyashris is now known as bhagyashris|dinn | 14:52 | |
sgw1 | morning folks. can someone help me with getting permissions to push branches to the starlingx/manifest repo, I need to create our r/stx.4.0 release branch | 14:56 |
AJaeger | sgw1: https://review.opendev.org/#/admin/projects/starlingx/manifest,access shows that starlingx-release has these permissions | 14:57 |
AJaeger | sgw1: you're part of that team, aren't you? | 14:57 |
AJaeger | sgw1: oh, wait - branch creation. | 14:58 |
AJaeger | sgw1: that needs an ACL update AFAIK | 14:58 |
*** mnasiadka has joined #opendev | 14:58 | |
AJaeger | sgw1: https://docs.opendev.org/opendev/infra-manual/latest/creators.html#creation-of-branches explains what's needed | 14:59 |
yoctozepto | thanks fungi, my spell checker must have kicked in there | 14:59 |
AJaeger | sgw1: so, please send a change to openstack/project-config and update the ACL | 14:59 |
sgw1 | Ok, I guess I should have checked there first, I had it for all the other repos | 15:00 |
sgw1 | AJaeger: thanks, I guess I had permission after all! Sorry for the noise | 15:06 |
fungi | yoctozepto: no worries, i've heard other folks in the community actually call it "jitsu" too, so i thought there might be some general confusion over what the project is named. just trying to correct it whenever i run across that | 15:16 |
fungi | mnaser: so according to the scheduler's debug log 739774,7 was removed from the promote pipeline when another change for the same project merged after it. probably better if i check the most recent change merged for that repo | 15:19 |
mnaser | fungi: right, but my 'promote image' doesn't happen every single time, so it shouldn't have been dequeued | 15:20 |
fungi | when changes merge in a batch zuul is generally only going to run jobs for the first and last change when using the supercedent pipeline manager | 15:20 |
mnaser | i have different images being promoted if the image was changed or not | 15:20 |
mnaser | ah -- i wonder if that means i have to build all the images, all the time :X | 15:20 |
fungi | if you want jobs to run for every single change, the supercedent isn't what you want | 15:21 |
fungi | it assumes all jobs run for all changes | 15:21 |
fungi | definitely avoid things like file filters on jobs for supercedent pipelines or you'll get inconsistent results | 15:21 |
mnaser | that would really increase the # of jobs running by a lot i guess ;( | 15:22 |
fungi | mnaser: one option would be to switch your promote pipeline to dependent instead of supercedent | 15:22 |
mnaser | fungi: right, but i think we're using the opendev promote pipeline config | 15:22 |
fungi | or create a different dependent pipeline for those jobs | 15:22 |
fungi | oh, i thought this was in the vexxhost tenant | 15:23 |
fungi | zuul.Pipeline.vexxhost.promote according to the scheduler | 15:23 |
fungi | so not in the opendev tenant | 15:23 |
corvus | mnaser, fungi: you may want the serial pipeline manager for this: https://zuul-ci.org/docs/zuul/reference/pipeline_def.html#value-pipeline.manager.serial | 15:24 |
corvus | (ie, what we use for our deploy pipeline) | 15:24 |
fungi | ahh, yeah, if you need to make sure only one runs at a time | 15:24 |
fungi | but supercedent pipeline manager and file filters are two different (and basically incompatible) mechanisms for reducing the number of jobs you run | 15:25 |
fungi | so i would pick one or the other and not attempt to combine them | 15:26 |
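Why the two mechanisms clash can be seen in a toy model. The sketch below is not Zuul code: it assumes every change in the batch arrives while the first one is still being processed, so each later arrival replaces the one waiting behind it, and it approximates file filters with plain regular-expression searches. The job names, paths, and change names are all made up for illustration.

```python
import re


def supercedent_survivors(changes):
    """Return the items a supercedent-style queue would actually process.

    Simplifying assumption: the first item is still running while the rest
    arrive, so each new arrival replaces whatever was waiting behind it.
    """
    running, waiting = None, None
    for change in changes:
        if running is None:
            running = change
        else:
            waiting = change  # supersedes the previously waiting item
    return [c for c in (running, waiting) if c is not None]


def jobs_triggered(change_files, job_filters):
    """Names of jobs whose file pattern matches at least one changed path."""
    return sorted(
        job
        for job, pattern in job_filters.items()
        if any(re.search(pattern, path) for path in change_files)
    )


def promoted(merged, job_filters):
    """Map each surviving change to the jobs its files would trigger."""
    return {
        name: jobs_triggered(files, job_filters)
        for name, files in supercedent_survivors(merged)
    }


# Hypothetical jobs and a hypothetical batch of five merged changes.
JOB_FILTERS = {
    "promote-keystone": r"^images/keystone/",
    "promote-magnum": r"^images/magnum/",
    "promote-docs": r"^docs/",
}
MERGED = [
    ("change-1", ["images/keystone/Dockerfile"]),
    ("change-2", ["images/magnum/Dockerfile"]),
    ("change-3", ["docs/index.rst"]),
    ("change-4", ["images/magnum/bindep.txt"]),
    ("change-5", ["README.rst"]),
]
```

With this data, `promoted(MERGED, JOB_FILTERS)` only covers `change-1` and `change-5`, so the magnum and docs jobs never run even though changes touching those paths merged. That is the kind of inconsistency described above.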
corvus | 15:27 < mordred> corvus: well - if we're not going to be waiting on subchecks now - it makes me want to consider just doing 2.13->3.2 over a weekend | 15:30 |
mordred | corvus: so - if we're not likely going to be moving anything to checks or hoping subchecks lands, I'm kind of wondering if we should consider a flag-day weekend upgrade all the way to 3.2 | 15:30 |
corvus | mordred: were we thinking we wanted to stop at 2.16 and make sure we have all the ci stuff sorted out? | 15:30 |
*** ysandeep is now known as ysandeep|away | 15:30 | |
corvus | we'll still need some kind of hideci thing, right? | 15:31 |
mordred | infra-root: email on gerrit mailing list about the google folks abandoning the checks plugin. scrollback in #zuul talks about some things related - including a 3.x frontend plugin wikimedia is using similar to our current ci results table | 15:31 |
mordred | corvus: yeah - but maybe we can do that in the similar structure to what wikimedia has in their plugin until the new labels stuff is there? | 15:32 |
mordred | or maybe we'll determine that we do need to stop at 2.16 to wait - I mean, we still need to run an upgrade test to see | 15:32 |
corvus | mordred: well, i'm unclear on what we actually have prepared for 3.x -- do we have that functionality ready at all? | 15:32 |
mordred | nope. we have nothing for 3.x currently | 15:33 |
mordred | I'm also unclear as to how compatible the different versions of polymer are - so will a polymer plugin for 2.16/3.0 be the same as a polymer plugin for 3.2 - or will we need to write the same thing twice | 15:34 |
corvus | mordred: then i think i agree with you that if our main hesitation for proceeding past 2.16 was that we needed a hideci solution and were hoping for it to be checks, that that is now changed, and we ought to be able to proceed faster, but we probably do still need a hideci solution. we probably don't need to stop at 2.16 for it though, if we come up with something that works for 3.2 in our testing, i think we can upgrade with confidence. | 15:35 |
mordred | ++ | 15:35 |
mordred | agree | 15:35 |
corvus | mordred: so i think the only remaining reason to stop at 2.16 would be luca's suggestion about acclimating folks. but i bet folks can muddle through. :) | 15:35 |
mordred | and I think that's likely to be the "best" path forward - we can develop (and by develop, I mean steal the wikimedia's already developed thing) for 3.2 | 15:36 |
fungi | so is it that the gerrit community is officially abandoning checks, or just discussing whether they should? | 15:36 |
mordred | corvus: yeah | 15:36 |
corvus | fungi: i think it's that the google folks (who are 99% of the contributors to checks) are abandoning it | 15:36 |
mordred | corvus: fwiw - the wikimedia polymer plugin is WAY more readable and is using an actual API to get messages | 15:36 |
mordred | fungi: I think it's effectively official | 15:36 |
mordred | since it was the google folks who were staffing it | 15:36 |
fungi | got it. disappointing reversal of direction on their part | 15:36 |
mordred | yeah | 15:37 |
corvus | i think anyone is welcome to continue working on it if they want, but that leaves about 0 people, plus or minus two. | 15:37 |
mordred | yeah. but ... who knows, maybe it's for the best because subchecks was still quite a ways off from being a thing, and we can likely adapt https://github.com/wikimedia/puppet/blob/production/modules/gerrit/files/homedir/review_site/static/gerrit-theme.html#L195 for our use reasonably quickly? | 15:37 |
*** sgw1 has quit IRC | 15:44 | |
*** sgw1 has joined #opendev | 15:44 | |
*** bhagyashris|dinn is now known as bhagyashris | 15:49 | |
*** roman_g has quit IRC | 15:50 | |
*** diablo_rojo has joined #opendev | 15:51 | |
fungi | infra-root: while the crawler-induced load we've been seeing from non-firewalled sources seems to be continuing, the volume has slowly trailed off, so we might be safe to lift the china unicom drop rules on the lb now without yet pulling the trigger on the apache proxy ua filter work: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all | 15:57 |
*** marios has quit IRC | 16:00 | |
yoctozepto | fungi: meetpad did very fine for us | 16:08 |
fungi | yoctozepto: great! | 16:09 |
fungi | ironic is apparently using it for some get-togethers as well | 16:09 |
fungi | as is the osd d&i wg | 16:09 |
fungi | er, osf d&i wg | 16:09 |
fungi | glad to see it's continuing to be useful outside the ptg context | 16:10 |
yoctozepto | fungi: that's nice to know! | 16:12 |
yoctozepto | fungi: what is d&i? | 16:12 |
fungi | diversity and inclusion | 16:21 |
fungi | one of the osf board of directors chartered working groups | 16:21 |
*** mnasiadka_ has joined #opendev | 16:48 | |
*** jbryce_ has joined #opendev | 16:48 | |
*** mwhahaha_ has joined #opendev | 16:48 | |
*** mwhahaha has quit IRC | 16:55 | |
*** mnasiadka has quit IRC | 16:55 | |
*** jbryce has quit IRC | 16:55 | |
*** rm_work has quit IRC | 16:55 | |
*** mordred has quit IRC | 16:55 | |
*** jbryce_ is now known as jbryce | 16:55 | |
*** mwhahaha_ is now known as mwhahaha | 16:55 | |
*** mnasiadka_ is now known as mnasiadka | 16:55 | |
*** Eighth_Doctor has quit IRC | 16:56 | |
*** rm_work has joined #opendev | 16:58 | |
*** Eighth_Doctor has joined #opendev | 17:03 | |
*** sshnaidm is now known as sshnaidm|afk | 17:12 | |
fungi | infra-root: based on the graph i linked a little while ago, i propose we revert the temporary iptables rules we were using to block traffic from china unicom. i'm available to keep an eye on things for the next ~6 hours still. any objections? | 17:14 |
*** mordred has joined #opendev | 17:31 | |
*** hashar has joined #opendev | 17:35 | |
*** dtantsur is now known as dtantsur|afk | 17:45 | |
corvus | fungi: no objections | 17:59 |
fungi | it looks like we can undo them with `sudo systemctl restart netfilter-persistent` | 18:00 |
fungi | and if we need to put them back, it's: | 18:01 |
fungi | for X in $(cat ~clarkb/china_unicom_ranges) ; do echo $X ; sudo iptables -I openstack-INPUT -j DROP -s $X ; done | 18:01 |
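A dry-run variant of the loop above could validate each range before anything touches iptables. This is a sketch under assumptions: it only builds argument lists and never executes them, the range file is assumed to be whitespace-separated like the `cat` loop implies, and the function name and the switch to `ip6tables` for IPv6 entries are additions that do not appear in the log.

```python
import ipaddress


def drop_rule_commands(ranges_text, chain="openstack-INPUT", action="-I"):
    """Build one iptables argv per address range without running anything.

    ranges_text is split on whitespace, like `$(cat file)` in the shell loop.
    ipaddress.ip_network raises ValueError for a malformed entry, so a typo
    is caught before any rule is inserted. Use action="-D" to build the
    matching delete commands instead.
    """
    commands = []
    for entry in ranges_text.split():
        network = ipaddress.ip_network(entry, strict=False)
        # Assumption: IPv6 ranges, if present, belong in ip6tables.
        binary = "iptables" if network.version == 4 else "ip6tables"
        commands.append([binary, action, chain, "-j", "DROP", "-s", str(network)])
    return commands
```

Each returned list could then be passed to `subprocess.run` under sudo, or simply printed for review first.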
*** qchris has quit IRC | 18:09 | |
*** qchris has joined #opendev | 18:22 | |
fungi | okay, i gave it an hour, no objections (thanks corvus for at least not letting me worry i was talking into a black hole). i'll proceed with the netfilter-persistent restart on gitea-lb01.opendev.org | 18:22 |
fungi | and done. continuing to keep an eye on this graph since that's where we saw the first signs of impact (we were maxing out the connection limit): http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all | 18:24 |
fungi | later today i'll do a status update and follow up to the service-announce thread if we haven't needed to take other action yet | 18:24 |
*** avass has joined #opendev | 18:26 | |
openstackgerrit | Merged openstack/project-config master: Account for ironic bugfix branches https://review.opendev.org/740212 | 18:33 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: add-build-sshkey: Ensure .ssh exists, enable admin authorized_keys https://review.opendev.org/740350 | 19:02 |
fungi | established tcp connection count has increased slightly since the rules were removed, but not substantially | 19:04 |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: add-build-sshkey: Ensure .ssh exists, enable admin authorized_keys https://review.opendev.org/740350 | 19:05 |
*** tosky has quit IRC | 19:26 | |
smcginnis | If anyone has a minute, I'd love to know if this is on the right track: https://review.opendev.org/#/c/739272/ | 20:18 |
fungi | i'm not sure either or i'd have weighed in... it's switching from distro-packaged pip to pypi-server pip to work around the fact that the ensure-pip role doesn't have a mechanism for installing multiple versions of distro-packaged pip when there are multiple distro-packaged python interpreters present | 20:26 |
smcginnis | fungi: Maybe I don't understand, but it's not switching from distro to pypi. It should be installing multiple distro-packaged packages based on the versions of python requested. | 20:28 |
smcginnis | At least that's what I thought I was doing, but definitely a possibility that I don't really know what I'm doing. :) | 20:28 |
fungi | oh, did we improve ensure-pip to cover that case? | 20:28 |
smcginnis | That's what I'm trying to do. | 20:28 |
fungi | oh! this is a different change than i was thinking of | 20:29 |
smcginnis | https://review.opendev.org/#/c/739272/2/roles/ensure-pip/tasks/Debian.yaml should be looping through and trying to apt install each specific version. | 20:29 |
fungi | this one is for zuul-jobs. maybe better to discuss in #zuul | 20:29 |
smcginnis | Hmm, looks like I forgot to rejoin after my last reboot. | 20:30 |
fungi | the change i was thinking of was where you were switching the job to use the feature of the from-pypi method which can install multiple versions of pip | 20:30 |
fungi | anyway, getting ready to cook dinner, but can take a look afterward | 20:30 |
*** dirk has quit IRC | 20:42 | |
tbarron | so gouthamr and I have been banging our heads trying to debug an issue he reported the other day | 20:47 |
tbarron | with one of our jobs where it reboots -- only on rax nodes apparently | 20:47 |
tbarron | http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-07-07.log.html#t2020-07-07T23:37:24 | 20:48 |
tbarron | http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-07-08.log.html#t2020-07-08T00:29:46 | 20:48 |
tbarron | I'm wondering if we can get some tips on how to instrument the job effectively. | 20:48 |
*** bolg has quit IRC | 20:48 | |
tbarron | syslog doesn't usually have anything relevant | 20:49 |
tbarron | journal seems to get lost before the reboot | 20:49 |
tbarron | one exception on syslog. once we saw this: | 20:49 |
tbarron | https://zuul.opendev.org/t/openstack/build/1bc61a82ace14217807f28c5b8c9debe/log/controller/logs/syslog.txt#8463 | 20:49 |
tbarron | But I don't know if it's a fluke or a real clue. | 20:50 |
tbarron | That general protection fault was right before the reboot and pertained to ipv6 packet filtering iiuc | 20:51 |
tbarron | by "journal gets lost" i mean we are lacking entries until after the reboot | 20:53 |
openstackgerrit | Sean McGinnis proposed zuul/zuul-jobs master: Install venv for all platforms in ensure-pip https://review.opendev.org/739272 | 20:57 |
openstackgerrit | Sean McGinnis proposed zuul/zuul-jobs master: Install venv for all platforms in ensure-pip https://review.opendev.org/739272 | 20:58 |
*** dirk has joined #opendev | 21:01 | |
fungi | tbarron: using any fancy vm features? rackspace uses xen while all our other providers (as far as i know) are kvm | 21:02 |
openstackgerrit | Sean McGinnis proposed zuul/zuul-jobs master: Install venv for all platforms in ensure-pip https://review.opendev.org/739272 | 21:02 |
fungi | another difference is that rackspace nodes have something like 20gb in their rootfs and then an ephemeral disk you can format and mount (devstack mounts it on /opt for example), could it be running out of disk space? | 21:03 |
tbarron | fungi: I don't think so. This job does a bunch of lvm commands and uses kernel nfs and uses neutron dynamic routing and quagga etc. to advertise self-service ipv6 tenant networks so it's a bit "fancy" | 21:05 |
tbarron | fungi: i see 40G root partition on rax vs 80 on, say, vexxhost so i was suspecting that the root disk might have filled but | 21:05 |
fungi | ahh, 40gb, that sounds right yeah | 21:06 |
tbarron | fungi: i instrumented a periodic task (share-service reporting pool usage to the scheduler) to log 'df -h' (and 'free' for that matter) and | 21:06 |
tbarron | there was lots of disk space and free memory right before the reboot | 21:07 |
tbarron | 'free' output pretty much matched what dstat was showing | 21:07 |
tbarron | the lvm action is on /opt/stack/manila/data and there was plenty of disk free there too | 21:08 |
fungi | does the job continue after the reboot? do we collect syslog from it and did that write anything? | 21:10 |
tbarron | we collect syslog before and after the boot but in general it doesn't have anything interesting except 'reboot' | 21:11 |
fungi | i think we also have the option to set up a remote system console stream, i forget the kernel module for that, and then capture that in case it still has working networking and is able to send a copy of the panic | 21:11 |
tbarron | the one exception was that general protection fault at https://zuul.opendev.org/t/openstack/build/1bc61a82ace14217807f28c5b8c9debe/log/controller/logs/syslog.txt#8463 | 21:11 |
*** dviroel has joined #opendev | 21:12 | |
tbarron | fungi: that console stream thing sounds useful, I guess equivalent of 'openstack console log ...' | 21:13 |
tbarron | fungi: I don't know why the journal is missing the needed entries unless perhaps the file system it writes to is getting corrupted | 21:14 |
fungi | or the reboot happens before fd writes can be flushed and the fs synced | 21:14 |
tbarron | fungi: but the syslog after reboot isn't showing some big fsck/recovery | 21:15 |
tbarron | so yeah, flush/sync may be more likely | 21:15 |
fungi | ahh, the lkm i was thinking of is netconsole | 21:15 |
fungi | and yeah it's basically kernel console redirection to a tcp socket | 21:15 |
fungi | er, no, udp datagrams i guess | 21:16 |
fungi | so you set up a netconsole listener, then load that kernel module with appropriate parameters telling it the destination address of the netconsole stream client | 21:18 |
fungi | and then on the client side you can basically just capture the udp stream and do whatever you like with it (write it to a file, et cetera) | 21:19 |
tbarron | so we can run the client from wherever (e.g. my notebook) but need to run the modprobe cmd as part of the devstack setup? | 21:20 |
tbarron | s/my notebook/some target with a public ip/ - and need firewalls to allow target and dest ports | 21:26 |
fungi | yep, the default iptables ruleset on the job nodes should allow all egress just fine, but your client side firewall/nat would need to make sure whatever destination udp port you choose is allowed through and goes to the right place | 21:29 |
tbarron | fungi: thanks for the idea, i'll try to get it working locally first, then in dsvm job | 21:31 |
fungi | Documentation/networking/netconsole.rst in the linux kernel docs, according to Documentation/networking/netconsole.rst | 21:31 |
fungi | er, according to https://www.kernel.org/doc/html/latest/admin-guide/serial-console.html | 21:32 |
fungi | aha, here https://www.kernel.org/doc/html/latest/networking/netconsole.html | 21:32 |
fungi | tbarron: that ^ should be pretty easy to follow, but let us know if you run into trouble getting it to work | 21:33 |
tbarron | fungi: will do, and thanks again | 21:33 |
fungi | my pleasure, as always | 21:34 |
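The receiving side fungi describes (capture the UDP stream and write it to a file) might look something like the sketch below. It is a minimal illustration, not a tested netconsole setup: the helper names are made up, the port is whatever the job's netconsole parameters point at, and the sending side still has to be configured as the kernel documentation linked above explains.

```python
import socket


def open_listener(bind_addr, port):
    """Return a UDP socket bound to (bind_addr, port); port 0 picks a free one."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def capture(sock, out_path, max_datagrams=None, timeout=None):
    """Append received datagrams to out_path and return how many were written.

    Stops after max_datagrams datagrams, or when no datagram arrives within
    `timeout` seconds. With both left as None it runs until interrupted.
    Each datagram is flushed right away so little is lost if the capture
    host itself goes down.
    """
    sock.settimeout(timeout)
    count = 0
    with open(out_path, "ab") as out:
        while max_datagrams is None or count < max_datagrams:
            try:
                data, _peer = sock.recvfrom(65535)
            except socket.timeout:
                break
            out.write(data)
            out.flush()
            count += 1
    return count
```

A capture host with a public address would call `open_listener("0.0.0.0", port)` and then `capture(sock, "netconsole.log")`, with its own firewall or NAT allowing the chosen UDP port through as discussed above.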
*** avass has quit IRC | 21:53 | |
*** hashar has quit IRC | 22:13 | |
fungi | #status log The connection flood from AS4837 (China Unicom) has lessened in recent days, so we have removed its temporary access restriction for the Git service at opendev.org as of 18:24 UTC today. | 22:15 |
openstackstatus | fungi: finished logging | 22:15 |
*** rh-jelabarre has quit IRC | 22:30 | |
*** DSpider has quit IRC | 22:43 | |
*** hrw has quit IRC | 22:43 | |
*** hrw has joined #opendev | 22:53 | |
*** tkajinam has joined #opendev | 23:00 | |
*** mlavalle has quit IRC | 23:06 |