openstackgerrit | Merged openstack/project-config master: Adds docs_branch_path value needed for promoting release branches. https://review.opendev.org/c/openstack/project-config/+/788593 | 00:16 |
kevinz | ianw: Morning! I wonder if there is something wrong with zk02? I cannot ping it anymore | 00:25 |
ianw | kevinz: ahh, yes clarkb just removed it :) they've moved to zk06/07/08 now | 00:25 |
fungi | 4-6 | 00:25 |
fungi | but yes | 00:25 |
ianw | sorry, 04/05/06 | 00:26 |
ianw | yeah, keyboard typing error :) | 00:26 |
fungi | keyboard are where i make most of my typing errors as well | 00:26 |
ianw | fungi: only until elon gets some electrodes in your skull :) | 00:27 |
fungi | i'm saving up for wristjacks, not sure a skulljack is entirely hygienic | 00:27 |
ianw | kevinz: we're still seeing dropouts from nb03 to the new server(s), however. | 00:28 |
ianw | probably the more annoying thing is the aborted uploads to OSU (see thread) | 00:28 |
kevinz | ianw: OK, Let me check | 00:30 |
kevinz | ianw: ping zk06.openstack.org | 00:32 |
kevinz | ping: zk06.openstack.org: Name or service not known | 00:32 |
fungi | opendev.org | 00:32 |
fungi | we're in progress renaming our servers into the new domain | 00:32 |
fungi | basically new servers get names in opendev.org as we phase out use of openstack.org for anything which isn't openstack-specific | 00:33 |
*** brinzhang0 is now known as brinzhang | 00:33 | |
kevinz | fungi: OK, please let me know when the rename is finished, so that I can continue testing | 00:48 |
kevinz | fungi: Oh, I checked and the opendev name works. Thanks | 00:48 |
fungi | kevinz: sorry, i probably phrased that confusingly. zk01.openstack.org, zk02.openstack.org and zk03.openstack.org were replaced by zk04.opendev.org, zk05.opendev.org and zk06.opendev.org | 00:49 |
fungi | hopefully that makes sense | 00:49 |
gmann | clarkb: fungi: the nova grenade job is failing frequently (~90%). Are you aware of this error - https://zuul.opendev.org/t/openstack/build/599cfa422a0648168c8b00a27fbd3114/log/logs/grenade.sh.txt#46891-46912 | 00:56 |
gmann | Failed to start rtslib-fb-targetctl.service: Unit rtslib-fb-targetctl.service is not loaded properly: Exec format error. | 00:57 |
fungi | gmann: heh, i guess you're not alone, if it helps... https://askubuntu.com/questions/1334619/failed-to-start-rtslib-fb-targetctl-service | 01:01 |
gmann | yeah, this legacy grenade job is also running on Ubuntu 18.04, which should have been 20.04 since the wallaby gate | 01:04 |
*** brinzhang_ has joined #opendev | 01:05 | |
gmann | i remember, legacy jobs are not upgraded to ubuntu 20.04 | 01:05 |
*** brinzhang has quit IRC | 01:08 | |
ianw | clarkb: i've put holds on the nodepool jobs and rechecked 788553, see zuul ~ianw/nodepool-holds.sh | 01:09 |
*** d34dh0r53 has quit IRC | 01:24 | |
*** hamalq has quit IRC | 01:34 | |
kevinz | fungi: OK, Thanks for clarifying. so zk05 is the right one :-) | 01:52 |
*** brinzhang_ is now known as brinzhang | 02:07 | |
*** xinliang has joined #opendev | 02:25 | |
*** hemanth_n has joined #opendev | 02:45 | |
*** xinliang has quit IRC | 04:03 | |
*** vishalmanchanda has joined #opendev | 04:24 | |
kevinz | ianw: fungi: I cannot observe packet loss in other tenants and infra hosts, but I can observe a lot of packet loss within the os-control project (I just created one instance under os-control for a network test) | 04:33 |
ianw | kevinz: i guess we're just lucky :) | 04:34 |
kevinz | Also sometimes ssh gets a broken pipe in the os-control instance | 04:34 |
ianw | interesting, i don't think i've had ssh drop out, but i'm sort of glad it's not just me noticing the issue! :) | 04:34 |
kevinz | ianw: well, I will check the os-jobs tenant first, to rule out IPv6/IPv4 configuration (the other tenants just have IPv4 enabled) | 04:35 |
ianw | ahh. yeah, i think the test nodes would be less susceptible since they run far fewer things that want to hang around forever. we'd barely notice a few retries etc. | 04:37 |
kevinz | ianw: OK, it sounds like os-control is the lucky one :-( | 04:38 |
ianw | as i mentioned in the email thread, we do have a certain uncanny ability to break things :) | 04:40 |
ianw | i think we lost openstack gerrit | 05:00 |
*** openstackgerrit has quit IRC | 05:01 | |
*** sboyron has joined #opendev | 05:31 | |
*** ysandeep|away is now known as ysandeep | 05:36 | |
*** snapdeal has joined #opendev | 05:43 | |
*** slaweq has joined #opendev | 05:55 | |
*** raukadah is now known as chandankumar | 05:56 | |
*** marios has joined #opendev | 06:00 | |
ianw | clarkb: i think we got one | 06:10 |
ianw | https://53cc3facebff961adc76-37cbc92cf6f6e06a61846b0d3fa08d8d.ssl.cf2.rackcdn.com/788553/1/check/dib-nodepool-functional-openstack-opensuse-15-src/6ef9cf0/ | 06:10 |
ianw | 158.69.69.156 | 06:11 |
*** avass has quit IRC | 06:13 | |
*** eolivare has joined #opendev | 06:13 | |
ianw | #status log updated the hosts entry for freenode on eavesdrop, restart gerritbot | 06:15 |
openstackstatus | ianw: finished logging | 06:15 |
ianw | i think it's back now, our chosen host was mia | 06:15 |
*** ralonsoh has joined #opendev | 06:25 | |
ianw | clarkb: ok so ... | 06:44 |
ianw | http://paste.openstack.org/show/OOOuDOIRBf1jTQEXvgaM/ | 06:46 |
ianw | basically, if we drop the "--network public" from the end, server creation works | 06:46 |
ianw | RESP BODY: {"NeutronError": {"type": "NetworkNotFound", "message": "Network public could not be found.", "detail": ""}} | 06:47 |
ianw | now, why that leads the overall command to return a 500 error is an open question, but i think that's something like the root cause | 06:47 |
ianw | of course, "openstack --os-cloud=devstack network show public" works :/ | 06:49 |
ianw | oh boo, this might be a red herring | 06:51 |
ianw | "GET call to network for https://158.69.69.156:9696/v2.0/networks?name=public used request id req-b4b8d066-b0e4-434f-afd0-91237890fea5" is just below that | 06:51 |
ianw | and works | 06:51 |
*** avass has joined #opendev | 06:58 | |
ianw | http://paste.openstack.org/show/804851/ | 07:00 |
ianw | ^ the good request, and the bad request (i.e. with --network public, and without) | 07:00 |
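For context, a minimal sketch of the two cases being compared in that paste (the `--os-cloud=devstack` name comes from the earlier `network show` command; the cirros image and flavor names are taken from the server listing later in this log and may differ on other deployments):

```bash
# Fails immediately with a 500 when nova-api looks the network up in neutron:
openstack --os-cloud=devstack server create \
    --image cirros-0.5.2-x86_64-disk --flavor cirros256 \
    --network public test-server

# Without --network the create call appears to work, but the instance later goes
# to an ERROR state with the same "Expecting to find domain in project" fault
# (see the debugging further down):
openstack --os-cloud=devstack server create \
    --image cirros-0.5.2-x86_64-disk --flavor cirros256 test-server
```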
frickler | gmann: fungi: the issue with grenade/devstack on bionic was fixed by https://review.opendev.org/c/openstack/devstack/+/788429 . note however that we want to drop support for bionic on devstack in master, so folks should really migrate their jobs to focal | 07:02 |
*** hashar has joined #opendev | 07:04 | |
frickler | ianw: hmm, I never used the "--network xx" option before, is that new? the usual way for me is to do "--nic net-id=yy" but that needs the uuid of the network. | 07:06 |
ianw | frickler: definitely not new, but ... this error is :) | 07:06 |
*** amoralej|off is now known as amoralej | 07:07 | |
*** fressi has joined #opendev | 07:18 | |
*** andrewbonney has joined #opendev | 07:19 | |
*** jpena|off is now known as jpena | 07:21 | |
*** openstackgerrit has joined #opendev | 07:23 | |
openstackgerrit | Merged opendev/glean master: Move to Zuul standard hacking rules https://review.opendev.org/c/opendev/glean/+/788127 | 07:23 |
*** rpittau|afk is now known as rpittau | 07:27 | |
ianw | well i'm out of time | 07:33 |
ianw | pretty easy to replicate | 07:33 |
*** dtantsur|afk is now known as dtantsur | 07:40 | |
*** tosky has joined #opendev | 07:44 | |
*** dirk has quit IRC | 07:45 | |
frickler | hmm, the public network isn't shared, so it shouldn't be able to be used for instances iiuc anyway, under which condition would this work? | 07:49 |
* frickler can look closer after some upcoming meetings | 07:49 | |
*** dirk has joined #opendev | 07:51 | |
*** jaicaa has quit IRC | 08:02 | |
*** jaicaa has joined #opendev | 08:04 | |
*** ysandeep is now known as ysandeep|lunch | 08:09 | |
*** jaicaa has quit IRC | 08:11 | |
*** jaicaa has joined #opendev | 08:14 | |
kevinz | ianw: It looks like the packet loss has disappeared | 09:01 |
kevinz | ianw: fungi: What I found is that the virtual router sync mechanism was wrong, causing 3 virtual router backends to be alive for os-control-router. So I restarted the l3_agent service and the virtual router sync mechanism works again. Now pinging zk05 works without packet loss... | 09:03 |
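A rough sketch of the kind of check and restart described there (the router name comes from the message above; the `--router` filter on `openstack network agent list` and the l3-agent service name are assumptions and vary by deployment):

```bash
# List the L3 agents that claim to be hosting the router; with the sync bug,
# multiple backends showed up as alive for os-control-router:
openstack network agent list --router os-control-router --long

# Restart the L3 agent on the affected network node so the router state resyncs:
sudo systemctl restart neutron-l3-agent
```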
openstackgerrit | Merged opendev/irc-meetings master: Update TC office hours time for Xena cycle https://review.opendev.org/c/opendev/irc-meetings/+/788552 | 09:09 |
*** amoralej has quit IRC | 09:36 | |
*** fbo has quit IRC | 09:36 | |
*** ysandeep|lunch is now known as ysandeep | 09:40 | |
*** jpena is now known as jpena|off | 09:45 | |
*** hashar has quit IRC | 09:48 | |
*** jpena|off has quit IRC | 09:54 | |
ianw | kevinz: thanks for investigating! i will check it thoroughly in the morning :) | 10:01 |
*** fbo has joined #opendev | 10:10 | |
*** jpena has joined #opendev | 10:46 | |
*** akahat is now known as akahat|ruck | 10:50 | |
*** whoami-rajat has joined #opendev | 10:52 | |
*** hemanth_n has quit IRC | 10:53 | |
*** iurygregory has quit IRC | 10:54 | |
*** chrome0 has quit IRC | 10:55 | |
*** chrome0 has joined #opendev | 10:55 | |
*** iurygregory has joined #opendev | 10:58 | |
*** jpena is now known as jpena|lunch | 11:31 | |
*** hashar has joined #opendev | 12:09 | |
*** snapdeal has quit IRC | 12:20 | |
*** hrw has joined #opendev | 12:25 | |
hrw | morning | 12:25 |
fungi | hrw: not sure if you saw but ianw found the centos arm64 bug, apparently it's since been fixed in rhel but there's no clear picture of how long the new packages will take to make it into centos | 12:26 |
hrw | fantastic! | 12:27 |
*** jpena|lunch is now known as jpena | 12:27 | |
fungi | (summary: binutils was updated to set more aggressive compiler optimizations, which have since been walked back but lots of stuff built with one of the "bad" binutils versions needs recompiling) | 12:27 |
hrw | let me dig into logs | 12:29 |
fungi | https://bugzilla.redhat.com/show_bug.cgi?id=1946518 | 12:31 |
openstack | bugzilla.redhat.com bug 1946518 in binutils "binutils-2.30-98 are causing go binaries to crash due to segmentation fault on aarch64" [Unspecified,Modified] - Assigned to nickc | 12:31 |
hrw | yeah | 12:37 |
hrw | just finished reading http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2021-04-27.log.html | 12:37 |
hrw | and I lack access to bug 1875912 just like ianw | 12:38 |
openstack | bug 1875912 in pulseaudio (Ubuntu) "Selected audio output always USB device as default, but no control" [Undecided,Expired] https://launchpad.net/bugs/1875912 | 12:38 |
hrw | no openstack, https://bugzilla.redhat.com/show_bug.cgi?id=1875912 one | 12:39 |
openstack | hrw: Error: Error getting bugzilla.redhat.com bug #1875912: NotPermitted | 12:39 |
hrw | ;d | 12:39 |
fungi | neat | 12:41 |
*** fbo has quit IRC | 13:27 | |
*** fbo has joined #opendev | 13:29 | |
*** vishalmanchanda has quit IRC | 13:30 | |
gmann | frickler: thanks, once this merges I think we can remove bionic support; i think that is the last dependency on legacy jobs https://review.opendev.org/c/openstack/nova/+/778885/10 | 13:33 |
*** fressi has left #opendev | 13:43 | |
*** fbo has quit IRC | 14:03 | |
*** fbo has joined #opendev | 14:08 | |
*** d34dh0r53 has joined #opendev | 14:09 | |
*** marios is now known as marios|call | 14:14 | |
*** hashar has quit IRC | 14:26 | |
openstackgerrit | Ade Lee proposed zuul/zuul-jobs master: Add role to enable FIPS on a node https://review.opendev.org/c/zuul/zuul-jobs/+/788778 | 14:49 |
frickler | gmann: sadly it is not the last one, there seem to be some in cinder, neutron and octavia, too http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22Failed%20to%20start%20rtslib-fb-targetctl.service%5C%22 | 14:53 |
tosky | frickler : legacy jobs? Bionic jobs? | 14:54 |
gmann | frickler: did not know they also have jobs on bionic. anyways, i pushed the patch with -W and will announce this on the ML to give 1-2 weeks or so of notice https://review.opendev.org/c/openstack/devstack/+/788754 | 14:55 |
gmann | tosky: yeah on bionic, example cinder-plugin-ceph-tempest-mn-aa - https://opendev.org/openstack/cinder/src/branch/master/.zuul.yaml#L166 | 14:56 |
tosky | gmann: uh, that may be an oversight, I will ask about it, thanks | 14:58 |
gmann | octavia has nested-virt-ubuntu-bionic | 14:58 |
openstackgerrit | Ade Lee proposed zuul/zuul-jobs master: Add role to enable FIPS on a node https://review.opendev.org/c/zuul/zuul-jobs/+/788778 | 14:59 |
*** marios|call is now known as marios | 15:00 | |
tosky | gmann: for jobs which inherit from devstack, whenever you need a multinode job, you need to explicitly set nodeset: openstack-two-node-<version>; shouldn't we have a generic nodeset which matches the default base node? Like openstack-two-node-devstackdefaultplatform? | 15:01 |
gmann | tosky: we have base multinode job in devstack running on latest distro | 15:02 |
tosky | gmann: but if you have a custom multinode job which inherits from another job you want to just set the nodeset | 15:02 |
tosky | that's the case for that bionic job | 15:02 |
tosky | what we lack now is a "two node nodeset which use the default platform that devstack jobs use" | 15:03 |
clarkb | frickler: were you able to look closer at the server create thing yet? I'll try looking at it in a bit if not | 15:06 |
gmann | tosky: will check, in the tc meeting currently. we can discuss this in the qa channel maybe | 15:08 |
tosky | sure | 15:12 |
tosky | I may not be around in a bit, but... async IRC, I will answer at some point! | 15:12 |
*** fressi has joined #opendev | 15:46 | |
*** fressi has quit IRC | 15:47 | |
*** ysandeep is now known as ysandeep|away | 15:49 | |
clarkb | ianw: frickler: first thing I've checked is that `source /opt/devstack/openrc admin admin` produces env vars for the user and project domain (it does, both set to default) | 16:00 |
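Roughly what that check looks like (the OS_* variable names are the standard ones devstack's openrc exports; treat the exact set as an assumption):

```bash
source /opt/devstack/openrc admin admin
env | grep -E 'OS_.*DOMAIN'
# Expected something like:
# OS_USER_DOMAIN_ID=default
# OS_PROJECT_DOMAIN_ID=default
```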
*** mlavalle has joined #opendev | 16:02 | |
clarkb | ianw: frickler: next thing I notice is that when you drop the --network specification you still get the same error. It just happens later. The instance is successfully created but then enters an error state later | 16:06 |
clarkb | | fault | {'code': 500, 'created': '2021-04-29T16:04:25Z', 'message': 'Build of instance 0dc6e600-908c-4e06-aff4-955dc8a22ee9 aborted: Expecting to find domain in project. The server could not comply with the request since it is either malformed or otherwise incorrect. The client is assumed to be in error. (HTTP 400) (Reques'} | | 16:06 |
clarkb | I think specifying --network Public just trips over the problem quicker | 16:06 |
clarkb | if I create the server as admin admin rather than demo demo it too fails. However when you are admin you get the full traceback back in the fault entry | 16:09 |
clarkb | er --network public not --network Public | 16:10 |
clarkb | explicitly setting --os-user-domain-name default --os-project-domain-name default --os-domain-name default doesn't seem to help | 16:13 |
clarkb | looking at the nova config I see [neutron] user_domain_name = Default but no project_domain_name like there is under [keystone_authtoken] | 16:15 |
clarkb | I also wonder if the domain name is case sensitive as we have Default in nova.conf and default in openrc | 16:15 |
clarkb | gmann: ^ do you know if that could be related? | 16:16 |
*** jpena|off has joined #opendev | 16:18 | |
*** hamalq has joined #opendev | 16:22 | |
*** hamalq has quit IRC | 16:23 | |
*** hamalq has joined #opendev | 16:24 | |
*** jpena has quit IRC | 16:25 | |
clarkb | ok the domain id is 'default' and the domain name is 'Default' (I'm sure this is incredibly difficult to change now but making those different and not making a uuid for the id is incredibly confusing) | 16:26 |
clarkb | I added project_domain_name = Default to the [neutron] section and did systemctl stop devstack@n-api.service && systemctl start devstack@n-api.service and no change | 16:26 |
clarkb | as a further sanity check project show demo and project show admin both report domain_id | default | 16:28 |
*** marios is now known as marios|out | 16:28 | |
mordred | clarkb: devstack sets a domain name of Default and a domain id of default | 16:30 |
clarkb | mordred: ya, its incredibly confusing | 16:31 |
mordred | ah - I see you found that | 16:31 |
clarkb | I've also confirmed that this domain info is passed in the token request "domain": {"id": "default", "name": "Default"} | 16:31 |
mordred | yes. it is INSANELY confusing | 16:31 |
mordred | actually, it's not just a devstack thing | 16:31 |
mordred | it's a keystone thing | 16:31 |
mordred | and it would be super hard to change at this point iirc | 16:32 |
clarkb | I suspect this is either a setup issue with keystone (that is what i did the project shows above) or a bug in nova/neutron/keystone | 16:32 |
*** jpena|off has quit IRC | 16:32 | |
clarkb | everything I can see on the client side seems to be accurate for v3 domain usage | 16:32 |
clarkb | and maybe tempest doesn't ever hit this because it creates a tempest-specific project, domain, user etc and that setup is correct? I dunno, just throwing out ideas right now | 16:32 |
clarkb | user show admin and user show demo also show domain_id | default | 16:33 |
clarkb | domain show default shows enabled | True | 16:33 |
clarkb | the next thing to look at is probably the request from nova to keystone for a token to do the neutron work | 16:35 |
clarkb | the one that fails | 16:35 |
gmann | clarkb: in tempest, we also use 'default' as default domain id | 16:36 |
clarkb | gmann: do you create a new user and project? | 16:38 |
gmann | unless creds are asked for a particular domain where tempest create new domain | 16:38 |
gmann | clarkb: for dynamic creds (default one), yes | 16:38 |
clarkb | gmann: I wonder if that is why we aren't seeing this with tempest then | 16:38 |
clarkb | since we're just trying to use the default demo project and demo user | 16:38 |
clarkb | does anyone know if you can convince keystone to do unsafe logging? | 16:39 |
clarkb | I can see the request but none of the request details, which makes debugging not easy | 16:39 |
clarkb | (I can probably tcpdump between the tlsproxy and the keystone process too) | 16:39 |
clarkb | side note: we're running this job on bionic and devstack master wants focal I think | 16:39 |
clarkb | (though I really doubt that would cause this problem) | 16:40 |
gmann | yeah, devstack master is moving to focal completely, as soon as we can manage it https://review.opendev.org/c/openstack/devstack/+/788754 | 16:41 |
clarkb | hrm we seem to use a unix socket to proxy to keystone. Can I dump the traffic on that somehow? maybe just with cat? | 16:45 |
clarkb | (I don't think so) | 16:45 |
clarkb | or at least not with cat | 16:45 |
clarkb | heh the internet says use socat in the middle | 16:45 |
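A hedged example of that socat-in-the-middle approach (the socket path is a placeholder; point it at wherever keystone's uwsgi socket actually lives on the host):

```bash
# Move the real socket aside, then let socat listen on the original path and
# relay to the real socket while logging both directions of the traffic:
sudo mv /path/to/keystone-wsgi-public.socket /path/to/keystone-wsgi-public.socket.real
sudo socat -v \
    UNIX-LISTEN:/path/to/keystone-wsgi-public.socket,fork \
    UNIX-CONNECT:/path/to/keystone-wsgi-public.socket.real \
    2> /tmp/keystone-traffic.log
```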
clarkb | oh neat, adding the project_domain_name to the [neutron] section of nova.conf allows --network public to work, but then it fails later with a similar error | 16:49 |
clarkb | so that did make a difference | 16:49 |
*** dtantsur is now known as dtantsur|afk | 16:49 | |
clarkb | the fault is now fault | {'code': 500, 'created': '2021-04-29T16:43:00Z', 'message': 'Build of instance 8714c2b0-fefc-4c8d-abe5-ed19c099e397 aborted: It is not allowed to create an interface on external network 5aafef62-c241-4da4-b4c7-5d5dffa916e8'} | 16:49 |
clarkb | hrm but if I use --network private or leave off --network it is back to 'aborted: Expecting to find domain in project. The server could not comply with the request since it is either malformed or otherwise incorrect. The client is assumed to be in error. (HTTP 400) (Reques'} | 16:54 |
*** iurygregory has quit IRC | 16:58 | |
*** rpittau is now known as rpittau|afk | 16:59 | |
clarkb | BOOM | 1b0c01c7-81ab-4bcc-9d40-7214db8e58ea | clarkb-test | ACTIVE | private=10.1.0.27, fd1f:8d2f:a260:0:f816:3eff:fe8f:4609 | cirros-0.5.2-x86_64-disk | cirros256 | | 16:59 |
clarkb | turns out that n-cpu has a separate config file from the rest of nova and you have to add project_domain_name there too | 17:00 |
clarkb | gmann: the problem is that nova.conf and nova-cpu.conf do not specify project_domain_name under [neutron] | 17:00 |
clarkb | and getting a 500 error back from the create call is when you hit the failing path in nova-api using nova.conf, and when you get an ERROR instance it is because n-cpu failed using nova-cpu.conf | 17:01 |
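Spelled out, the manual workaround described above looks roughly like this (iniset is devstack's ini helper, so this assumes a devstack shell with its functions sourced; crudini or an editor works just as well, and the devstack@n-cpu unit name is an assumption alongside the devstack@n-api one quoted earlier):

```bash
# Add the missing key to both nova configs, then restart the API and compute services:
iniset /etc/nova/nova.conf neutron project_domain_name "Default"
iniset /etc/nova/nova-cpu.conf neutron project_domain_name "Default"
sudo systemctl restart devstack@n-api.service devstack@n-cpu.service
```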
*** marios|out has quit IRC | 17:02 | |
gmann | but i think devstack set in both | 17:03 |
clarkb | gmann: not on this default devstack install. I had to manually add them | 17:03 |
clarkb | gmann: I think that it must be setting them most of the time though because most of the time this stuff works | 17:04 |
clarkb | maybe there is a race in copying configs around? | 17:04 |
clarkb | let me find the devstack log for this job | 17:04 |
gmann | yeah that is what i am suspecting | 17:04 |
clarkb | gmann: https://53cc3facebff961adc76-37cbc92cf6f6e06a61846b0d3fa08d8d.ssl.cf2.rackcdn.com/788553/1/check/dib-nodepool-functional-openstack-opensuse-15-src/6ef9cf0/job-output.txt the devstack log is in the job output | 17:04 |
clarkb | (note the job runs on ubuntu bionic and builds an opensuse 15 image for testing in the nested openstack but none of that even starts to run because devstack isn't working) | 17:05 |
gmann | clarkb: here, but it's the neutron/neutron-legacy script that sets it. https://opendev.org/openstack/devstack/src/branch/master/lib/neutron-legacy#L387 | 17:12 |
gmann | and nova.conf is what gets copied to nova-cpu.conf initially, before the compute-specific configs | 17:13 |
gmann | can you check the order of the 'cp /etc/nova/nova.conf /etc/nova/nova-cpu.conf' copy relative to neutron/neutron-legacy setting it in nova.conf. | 17:15 |
gmann | the copy should happen later, once neutron-legacy has set it in nova.conf | 17:15 |
clarkb | gmann: ya looking at the log above the iniset on nova.conf for neutron project_domain_name happens at 2021-04-29 06:03:51.802688 then the copy to nova-cpu.conf is at 2021-04-29 06:06:17.979775 | 17:15 |
clarkb | gmann: I wonder if this is related to parallelized setups? | 17:16 |
clarkb | gmann: is it possible that the neutron config for nova happens before nova does its setup and things get overwritten? | 17:16 |
gmann | humm, good point | 17:16 |
gmann | clarkb: if i remember you said its happening always? | 17:18 |
clarkb | gmann: no this issue is maybe 10% of the time | 17:18 |
clarkb | but when it breaks that break is 100% fatal | 17:18 |
gmann | ok, maybe we can check by disabling parallelization | 17:18 |
gmann | to confirm | 17:18 |
clarkb | gmann: well, we have the log above; we should be able to work from that to understand how it happens? | 17:19 |
*** andrewbonney has quit IRC | 17:19 | |
clarkb | gmann: also another interesting thing is that [neutron] user_domain_name is set at nearly the same point as project_domain_name, but user_domain_name remains in the config | 17:20 |
gmann | humm, is it just project_domain_name missing or any other config too https://opendev.org/openstack/devstack/src/branch/master/lib/neutron#L355 | 17:21 |
clarkb | gmann: I'll check. Also I do see a bug with order of operations. 2021-04-29 06:03:51.205615 | ubuntu-bionic | + ./stack.sh:main:1243 : iniset /etc/nova/nova-cpu.conf key_manager fixed_key happens before 2021-04-29 06:06:17.979775 | ubuntu-bionic | + lib/nova:start_nova_compute:903 : cp /etc/nova/nova.conf /etc/nova/nova-cpu.conf | 17:25 |
clarkb | gmann: just project_domain_name | 17:26 |
*** ralonsoh has quit IRC | 17:26 | |
clarkb | also all of the keystone_authtoken config seems to be good and that is configured before the [neutron] section | 17:27 |
*** iurygregory has joined #opendev | 17:31 | |
*** iurygregory has quit IRC | 17:31 | |
clarkb | gmann: looks like nova-cpu.conf has its config merged from localrc (though our localrc doesn't seem to contain the post-config stuff it operates on) and it does inidelete on some sections like the database | 17:33 |
clarkb | I need to step out for a bit, unfortunately haven't found any clear indications for why that config is missing yet | 17:33 |
gmann | yeah, same, i was searching in case it gets deleted during the merge: 2021-04-29 06:06:18.000195 | ubuntu-bionic | + lib/nova:start_nova_compute:905 : merge_config_file /opt/devstack/local.conf post-config '$NOVA_CPU_CONF' | 17:34 |
*** iurygregory has joined #opendev | 17:38 | |
*** eolivare has quit IRC | 17:51 | |
gmann | nova-cpu.conf is also touched in rpc_backend, but I could not see anything removing the project_domain_name: 2021-04-29 06:06:18.262601 | ubuntu-bionic | + lib/rpc_backend:iniset_rpc_backend:158 : local file=/etc/nova/nova-cpu.conf | 18:08 |
clarkb | gmann: it is also nova.conf that had the problem. Almost like the original iniset against nova.conf failed | 18:16 |
clarkb | and then we just copied that from nova.conf to nova-cpu.conf | 18:16 |
frickler | oh, wow, looks like that's exactly the kind of weird races I feared would happen with the devstack async code | 18:29 |
clarkb | frickler: the way the logging records things doesn't seem to be out of order though | 18:30 |
clarkb | but maybe the recording isn't quite as it seems? | 18:30 |
clarkb | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/ensure-devstack/templates/local.conf.j2 that is the local.conf we run with. One thing I notice in there is we disable a bunch of services and that could change async ordering and explain why others haven't seen similar | 18:32 |
clarkb | /opt/stack/async/configure_neutron_nova.log does exist which implies that was done asynchronously with $otherthings | 18:37 |
clarkb | ok I have a theory | 18:38 |
clarkb | https://opendev.org/openstack/devstack/src/branch/master/stack.sh#L1202-L1250 | 18:39 |
clarkb | we start the async neutron config of nova then run merge_config_group | 18:40 |
clarkb | we also iniset in nova.conf and nova-cpu.conf | 18:40 |
clarkb | I think it is this https://opendev.org/openstack/devstack/src/branch/master/stack.sh#L1237-L1244 that runs around the same time as the async configure_neutron_nova | 18:42 |
clarkb | and they will both be reading and writing the same files so there is a race | 18:42 |
clarkb | I'll push a patch | 18:43 |
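To make the suspected race concrete, a minimal illustration (paths and keys simplified; the point is that devstack's iniset does a whole-file read-modify-write, so two concurrent editors of the same file can silently drop each other's keys):

```bash
# async task (configure_neutron_nova) adding the neutron credentials:
iniset /etc/nova/nova.conf neutron project_domain_name "Default" &

# meanwhile the main stack.sh flow edits the same file:
iniset /etc/nova/nova.conf key_manager fixed_key "$FIXED_KEY"

wait
# If the second iniset read nova.conf before the first wrote it back out, the
# project_domain_name key is gone again, and the file is later copied into
# nova-cpu.conf in that broken state.
```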
*** sboyron has quit IRC | 18:44 | |
*** sboyron has joined #opendev | 18:44 | |
gmann | in the recording it seems there is a .1 sec difference between the two. | 18:46 |
gmann | anyways, this would be good to move to the lib/nova side https://opendev.org/openstack/devstack/src/branch/master/stack.sh#L1237-L1244 | 18:48 |
gmann | in nova.conf only and then start_nova_compute will copy it to nova-cpu.conf | 18:48 |
clarkb | remote: https://review.opendev.org/c/openstack/devstack/+/788820 Fix async race updating nova configs | 18:49 |
clarkb | I think that may fix it | 18:49 |
frickler | clarkb: small nit, but I was looking at that range, too. seems plausible that this is uncovered only with swift disabled, otherwise the start_swift would probably take long enough to take the race away | 18:53 |
clarkb | frickler: ++ | 18:53 |
clarkb | fixing up the change now | 18:53 |
clarkb | I'll also note that it may be related to disabling swift | 18:53 |
frickler | o.k., I'll check results tomorrow, going offline now | 18:54 |
clarkb | frickler: thanks | 18:55 |
*** hashar has joined #opendev | 19:13 | |
*** sboyron has quit IRC | 19:18 | |
openstackgerrit | Albin Vass proposed zuul/zuul-jobs master: Reset connection before testing build ssh-keys https://review.opendev.org/c/zuul/zuul-jobs/+/788826 | 19:56 |
*** cenne|out is now known as cenne | 20:32 | |
clarkb | fungi: for elod's project-config changes to reparent openstack stuff, the first 10 or so lgtm. If you are around enough maybe you want to take a look too and approve some of them (though we probably don't want to approve all at once?) | 20:55 |
ianw | clarkb: great find!!! i suspected async must have been in there somewhere | 21:07 |
clarkb | ianw: ya there was a lot of confusion over why it would happen in different ways, but once I realized there were two configs involved and both lacked the expected project_domain_name everything started to come together | 21:09 |
ianw | the fact that server creation worked without "--network" would have also helped hide this further | 21:11 |
clarkb | ianw: ya but it didn't actually succeed | 21:11 |
clarkb | nova would just hit the same error later | 21:11 |
clarkb | I think specifying the network upfront caused n-api to validate it against neutron and fail early. If you didn't specify it then n-cpu would tell neutron to do the right thing and fail at that point | 21:12 |
ianw | it did create the server, but maybe network wasn't functional | 21:17 |
ianw | anyway, very glad to have found that one ... $RANDOM errors are the worst | 21:17 |
ianw | kevinz: YAY!! i think you fixed it :) overnight we've uploaded several images to OSU and not one dropped out | 21:18 |
clarkb | ianw: ya it created the server then it entered an ERROR state. I suspect that if nodepool had tried to run against it we would've failed there | 21:18 |
clarkb | since the nodepool tests ssh in and confirm some glean stuff as well as the growroot iirc | 21:18 |
clarkb | gmann: can you review https://review.opendev.org/c/openstack/devstack/+/788820 ? if we can get that landed I think it will help nodepool and dib testing (and then we can land some dib changes for bullseye) | 21:19 |
*** slaweq has quit IRC | 21:21 | |
*** SWAT has quit IRC | 21:27 | |
ianw | i got a question about intel_zuul not being able to comment, did anyone already look into that? | 21:33 |
openstackgerrit | Merged openstack/project-config master: Move projects under meta-config acl (2) https://review.opendev.org/c/openstack/project-config/+/786739 | 21:33 |
openstackgerrit | Merged openstack/project-config master: Move projects under meta-config acl (3) https://review.opendev.org/c/openstack/project-config/+/786740 | 21:33 |
clarkb | ianw: no, but chances are they are trying to vote and the project doesn't allow it | 21:33 |
openstackgerrit | Merged openstack/project-config master: Move projects under meta-config acl (4) https://review.opendev.org/c/openstack/project-config/+/788555 | 21:34 |
openstackgerrit | Merged openstack/project-config master: Move projects under meta-config acl (5) https://review.opendev.org/c/openstack/project-config/+/788556 | 21:34 |
clarkb | a number of third party ci systems have run into that recently (I'm guessing that means $projects went and updated group membership on which CI systems can vote) | 21:34 |
openstackgerrit | Merged openstack/project-config master: Move projects under meta-config acl (6) https://review.opendev.org/c/openstack/project-config/+/788557 | 21:34 |
fungi | i think a bunch of third-party ci operators have been rebuilding their ci systems, and their new configurations aren't an exact match for their old behaviors | 21:34 |
ianw | yeah the exact query is "we are still not able to get Intel_Zuul to comment on nova / neutron etc - comments on the sandbox just fine" | 21:35 |
clarkb | ianw: I think only these third party ci systems can vote on nova for example https://review.opendev.org/admin/groups/841ff9d50c89ab50925f127c8b388792639af64f,members | 21:36 |
clarkb | and Intel_Zuul is not one of them | 21:36 |
ianw | that makes sense. i thought this was a longer-standing CI but as you say it was either dropped, or something else changed | 21:37 |
clarkb | they should be able to leave comments without votes though (any account can do that) | 21:38 |
ianw | yeah, not clear exactly what's going on, but good place to start :) | 21:39 |
fungi | to restate, wallaby dropping support for legacy jobs and devstack-gate is (thankfully) pushing a lot of zuul v2 + jenkins ci systems to be rebuilt with newer zuul so they can continue to use upstream job definitions | 21:39 |
fungi | i'd be willing to bet many of them are following an example some where which includes configuration to +1/-1 changes | 21:40 |
clarkb | wouldn't surprise me. I think zuul's default examples include that for a check queue too | 21:42 |
*** hrw has quit IRC | 21:48 | |
*** hrw has joined #opendev | 21:48 | |
gmann | clarkb: +A, was waiting for ate result. | 21:50 |
gmann | gate | 21:50 |
clarkb | gmann: thanks! | 21:50 |
clarkb | (Depends-On won't work for us I don't think, so getting that landed is great) | 21:50 |
fungi | yeah, the pending bullseye changes need another dib release and nodepool image bump after they merge anyway | 21:53 |
ianw | i'm still seeing glibc 155 packages in https://mirror.iad.rax.opendev.org/centos/8-stream/BaseOS/aarch64/os/Packages/ | 21:54 |
ianw | so i guess "soon" hasn't occurred yet | 21:55 |
ianw | i'll keep an eye on things to try and get the gate flowing | 21:55 |
ianw | "The spice must flow!" | 21:57 |
clarkb | ianw: do you know why these fixes didn't end up in stream first? I thought that was supposed to be the direction? | 21:57 |
clarkb | seems like stream will be difficult to use if it gets the new stuff first but the fixes last | 21:58 |
clarkb | basically you get all the risk and none of the mitigation | 21:58 |
ianw | i don't fully understand the flow, and yeah the latency on pulling it, and the unclear way that happens, i don't find ideal | 22:00 |
*** hashar has quit IRC | 22:06 | |
fungi | i wouldn't be surprised if they're still trying to figure out the sequence themselves | 22:07 |
ianw | yeah, i'm remaining calm :) | 22:22 |
clarkb | ianw: the other thing we should do is update these nodepool jobs to run on focal because devstack is dropping bionic support | 22:26 |
ianw | clarkb: sure. you would have seen that i proposed we remove the non-containers test with https://review.opendev.org/c/zuul/nodepool/+/788406 | 22:27 |
clarkb | ianw: yup, that was what started my debugging of the registry server iirc | 22:28 |
clarkb | maybe not, there have been a lot of changes all trying to get through and failing on a variety of unrelated problems lately | 22:28 |
ianw | anyway, i have that wip to use clouds.yaml for the containers test, i can also tweak one ontop to use focal and bring the whole thing up to 2021 :) | 22:29 |
clarkb | oh I see I saw the dib side change, this is the nodepool side | 22:35 |
clarkb | I +2'd the dib change but didn't approve it as I think we want to approve things once this devstack fix lands | 22:37 |
clarkb | (otherwise we're playing the odds at the casino) | 22:37 |
clarkb | fungi: looks like manage-projects reports success from that set you approved | 22:39 |
clarkb | fungi: https://zuul.opendev.org/t/openstack/builds?job_name=infra-prod-manage-projects+ fwiw | 22:39 |
fungi | clarkb: yeah, spot checks show they updated. i guess i'll approve another batch | 22:43 |
clarkb | fungi: I left some notes on a few of them where we may want to be careful and I -1'd the one that updates openstack/project-config | 22:44 |
clarkb | but otherwise ya I think we can do a few batches | 22:44 |
openstackgerrit | Merged openstack/project-config master: Move projects under meta-config acl (7) https://review.opendev.org/c/openstack/project-config/+/788558 | 22:52 |
openstackgerrit | Merged openstack/project-config master: Move projects under meta-config acl (8) https://review.opendev.org/c/openstack/project-config/+/788561 | 22:57 |
openstackgerrit | Merged openstack/project-config master: Move projects under meta-config acl (9) https://review.opendev.org/c/openstack/project-config/+/788567 | 22:57 |
openstackgerrit | Merged openstack/project-config master: Move projects under meta-config acl (10) https://review.opendev.org/c/openstack/project-config/+/788569 | 22:57 |
openstackgerrit | Merged openstack/project-config master: Move projects under meta-config acl (11) https://review.opendev.org/c/openstack/project-config/+/788571 | 22:57 |
clarkb | arg, nova-ceph-multistore failed on the devstack fix | 23:02 |
clarkb | ianw: ^ fyi | 23:02 |
clarkb | it failed 1 tempest test | 23:02 |
ianw | test_image_glance_direct_import[id-32ca0c20-e16f-44ac-8590-07869c9b4cc2] | 23:04 |
ianw | fail | 23:04 |
ianw | testtools.matchers._impl.MismatchError: 'success' != 'processing' | 23:04 |
ianw | i wonder if this is a case of not giving it enough time | 23:04 |
clarkb | I wonder if ceph is the glance image store too (so could be specific to the ceph job) | 23:04 |
*** whoami-rajat has quit IRC | 23:13 | |
ianw | i have no idea how to correlate things across that test | 23:27 |
clarkb | I rechecked the change figuring it is unlikely related to my change | 23:29 |
*** tosky has quit IRC | 23:40 | |
ianw | https://review.opendev.org/c/openstack/openstacksdk/+/786814 is released now, so we should be able to fix our stats reporting from nodepool | 23:52 |