corvus | i believe it will ignore it | 00:01 |
---|---|---|
pabelanger | yup, just testing on 3.1.1. No errors on PR | 00:03 |
*** efried has quit IRC | 00:05 | |
*** hwoarang has quit IRC | 00:06 | |
*** hwoarang has joined #openstack-infra | 00:08 | |
*** efried has joined #openstack-infra | 00:10 | |
*** slaweq has joined #openstack-infra | 00:11 | |
*** efried has quit IRC | 00:14 | |
*** efried has joined #openstack-infra | 00:14 | |
*** efried has quit IRC | 00:15 | |
*** slaweq has quit IRC | 00:16 | |
*** rh-jelabarre has quit IRC | 00:18 | |
*** rh-jelabarre has joined #openstack-infra | 00:23 | |
clarkb | I've not seen a successful upload to inap yet, but I think those that have failed may have started before mgagne_ fixed things? if they are still failing in another hour or so then likely not fixed sdk side | 00:28 |
clarkb | an image just went ready in inap \o/ | 00:32 |
clarkb | mnaser: ^ fyi we should be closing in on fixing that centos image problem | 00:32 |
ianw | :/ it looks like the readthedocs isn't triggering correctly any more. unfortunately to not echo out the password we have no_log on the important bits | 00:32 |
mordred | clarkb: I was only mildly following - mgagne found a thing? | 00:33 |
clarkb | mordred: oui | 00:33 |
mordred | clarkb: awesome | 00:33 |
clarkb | so this may have been entirely cloud side | 00:33 |
clarkb | I think it helped that osc was able to reproduce a failure if not the same one | 00:34 |
*** wolverineav has quit IRC | 00:36 | |
*** wolverineav has joined #openstack-infra | 00:37 | |
clarkb | | 0000000040 | 0000000017 | inap-mtl01 | centos-7 | centos-7-1544483708 | 9416a0d2-48f9-43c3-9aed-271635b897dd | ready | 00:00:02:26 | | 00:39 |
clarkb | osa centos jobs should be happy now | 00:39 |
clarkb | if they start on new nodes | 00:39 |
*** wolverineav has quit IRC | 00:42 | |
*** kjackal has joined #openstack-infra | 00:43 | |
*** jcoufal has quit IRC | 00:44 | |
ianw | ... {"detail":"CSRF Failed: CSRF cookie not set."} ... i do not like the look of this, rtd might have broken access to the authenticated endpoint | 00:45 |
*** yamamoto has quit IRC | 00:49 | |
*** _alastor_ has joined #openstack-infra | 00:59 | |
*** rockyg has quit IRC | 00:59 | |
*** sthussey has quit IRC | 01:03 | |
*** wolverineav has joined #openstack-infra | 01:05 | |
*** rockyg has joined #openstack-infra | 01:07 | |
*** wolverineav has quit IRC | 01:07 | |
*** wolverineav has joined #openstack-infra | 01:07 | |
ianw | well i don't think there's much we can do ... filed https://github.com/rtfd/readthedocs.org/issues/4986 | 01:08 |
*** rkukura has quit IRC | 01:12 | |
*** rockyg has quit IRC | 01:14 | |
mnaser | clarkb: thank you so much! | 01:14 |
*** ianychoi has quit IRC | 01:20 | |
*** _alastor_ has quit IRC | 01:25 | |
*** bobh has joined #openstack-infra | 01:27 | |
*** bobh has quit IRC | 01:31 | |
*** hwoarang has quit IRC | 01:36 | |
*** hwoarang has joined #openstack-infra | 01:37 | |
*** kjackal has quit IRC | 01:40 | |
*** rkukura has joined #openstack-infra | 01:55 | |
*** neilsun has joined #openstack-infra | 01:58 | |
*** _alastor_ has joined #openstack-infra | 02:01 | |
*** _alastor_ has quit IRC | 02:06 | |
*** mrsoul has joined #openstack-infra | 02:07 | |
*** bobh has joined #openstack-infra | 02:08 | |
*** bobh has quit IRC | 02:12 | |
*** jistr has quit IRC | 02:42 | |
*** jistr has joined #openstack-infra | 02:50 | |
*** psachin has joined #openstack-infra | 02:52 | |
*** anteaya has quit IRC | 03:01 | |
*** bhavikdbavishi has joined #openstack-infra | 03:07 | |
*** apetrich has quit IRC | 03:15 | |
*** hongbin has joined #openstack-infra | 03:15 | |
*** rh-jelabarre has quit IRC | 03:38 | |
*** bobh has joined #openstack-infra | 03:39 | |
*** ykarel|away has joined #openstack-infra | 03:42 | |
*** bobh has quit IRC | 03:43 | |
*** udesale has joined #openstack-infra | 03:48 | |
*** gyee has quit IRC | 03:53 | |
*** jamesden_ has joined #openstack-infra | 03:54 | |
*** agopi_ has joined #openstack-infra | 03:54 | |
*** agopi has quit IRC | 03:54 | |
*** jamesdenton has quit IRC | 03:55 | |
*** markvoelker has joined #openstack-infra | 03:57 | |
*** ramishra has quit IRC | 03:59 | |
*** bobh has joined #openstack-infra | 04:01 | |
*** markvoelker has quit IRC | 04:02 | |
*** bobh has quit IRC | 04:06 | |
*** udesale has quit IRC | 04:08 | |
*** mriedem_away has quit IRC | 04:15 | |
*** _alastor_ has joined #openstack-infra | 04:23 | |
*** wolverineav has quit IRC | 04:29 | |
*** wolverineav has joined #openstack-infra | 04:30 | |
*** slaweq has joined #openstack-infra | 04:30 | |
*** jamesmcarthur has joined #openstack-infra | 04:31 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul master: executor: add support for generic build resource https://review.openstack.org/570668 | 04:38 |
*** udesale has joined #openstack-infra | 04:43 | |
*** yamamoto has joined #openstack-infra | 04:46 | |
*** lpetrut has joined #openstack-infra | 04:55 | |
*** wolverineav has quit IRC | 05:05 | |
*** hongbin has quit IRC | 05:07 | |
*** hwoarang has quit IRC | 05:10 | |
*** hwoarang has joined #openstack-infra | 05:12 | |
*** wolverineav has joined #openstack-infra | 05:13 | |
*** jamesmcarthur has quit IRC | 05:18 | |
*** jamesmcarthur has joined #openstack-infra | 05:18 | |
*** ykarel|away has quit IRC | 05:19 | |
*** chandan_kumar has joined #openstack-infra | 05:21 | |
*** jamesmcarthur has quit IRC | 05:23 | |
*** ramishra has joined #openstack-infra | 05:26 | |
*** agopi_ is now known as agop | 05:30 | |
*** agop is now known as agopi | 05:30 | |
*** ykarel|away has joined #openstack-infra | 05:35 | |
*** lpetrut has quit IRC | 05:35 | |
*** ykarel|away is now known as ykarel | 05:42 | |
*** jamesmcarthur has joined #openstack-infra | 05:50 | |
*** _alastor_ has quit IRC | 05:50 | |
*** wolverineav has quit IRC | 05:54 | |
*** yboaron_ has joined #openstack-infra | 06:00 | |
openstackgerrit | OpenStack Proposal Bot proposed openstack-infra/project-config master: Normalize projects.yaml https://review.openstack.org/624277 | 06:06 |
openstackgerrit | gaobin proposed openstack-infra/zuul master: Modify some file content errors https://review.openstack.org/624278 | 06:08 |
openstackgerrit | gaobin proposed openstack-infra/zuul master: Modify some file content errors https://review.openstack.org/624278 | 06:11 |
*** wolverineav has joined #openstack-infra | 06:18 | |
*** _alastor_ has joined #openstack-infra | 06:19 | |
*** _alastor_ has quit IRC | 06:23 | |
*** betherly has quit IRC | 06:24 | |
*** ahosam has joined #openstack-infra | 06:29 | |
*** jmccrory has quit IRC | 06:34 | |
*** sdake has quit IRC | 06:35 | |
*** jmccrory has joined #openstack-infra | 06:40 | |
*** sdake has joined #openstack-infra | 06:40 | |
*** apetrich has joined #openstack-infra | 06:40 | |
*** bobh has joined #openstack-infra | 06:41 | |
*** rcernin has quit IRC | 06:43 | |
*** wolverineav has quit IRC | 06:44 | |
*** bobh has quit IRC | 06:46 | |
*** jamesmcarthur has quit IRC | 06:50 | |
*** rlandy has quit IRC | 06:51 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul-base-jobs master: Add base openshift job https://review.openstack.org/570669 | 06:53 |
*** wolverineav has joined #openstack-infra | 07:01 | |
*** ahosam has quit IRC | 07:08 | |
*** e0ne has joined #openstack-infra | 07:10 | |
*** e0ne has quit IRC | 07:12 | |
*** e0ne has joined #openstack-infra | 07:13 | |
*** e0ne has quit IRC | 07:14 | |
*** quiquell|off is now known as quiquell | 07:14 | |
*** ramishra has quit IRC | 07:15 | |
*** e0ne has joined #openstack-infra | 07:15 | |
*** e0ne has quit IRC | 07:17 | |
openstackgerrit | Merged openstack-infra/project-config master: Normalize projects.yaml https://review.openstack.org/624277 | 07:22 |
*** bobh has joined #openstack-infra | 07:25 | |
*** wolverineav has quit IRC | 07:29 | |
*** jtomasek has joined #openstack-infra | 07:29 | |
*** bobh has quit IRC | 07:29 | |
*** ykarel is now known as ykarel|lunch | 07:31 | |
*** kjackal has joined #openstack-infra | 07:34 | |
*** yboaron_ has quit IRC | 07:35 | |
openstackgerrit | Merged openstack-infra/zuul master: Add spacing to Queue lengths line https://review.openstack.org/623960 | 07:37 |
*** jtomasek has quit IRC | 07:42 | |
*** jtomasek has joined #openstack-infra | 07:43 | |
*** ahosam has joined #openstack-infra | 07:43 | |
*** rossella_s has quit IRC | 07:46 | |
*** ahosam has quit IRC | 07:49 | |
*** quiquell is now known as quiquell|brb | 07:53 | |
*** oanson has quit IRC | 07:54 | |
*** agopi_ has joined #openstack-infra | 07:56 | |
*** tosky has joined #openstack-infra | 07:58 | |
*** agopi has quit IRC | 07:59 | |
*** ramishra has joined #openstack-infra | 08:01 | |
*** bobh has joined #openstack-infra | 08:01 | |
*** ginopc has joined #openstack-infra | 08:02 | |
*** longkb has joined #openstack-infra | 08:02 | |
*** rossella_s has joined #openstack-infra | 08:03 | |
*** ccamacho has joined #openstack-infra | 08:04 | |
*** agopi_ is now known as agopi | 08:04 | |
*** bobh has quit IRC | 08:05 | |
*** kjackal has quit IRC | 08:06 | |
*** mgoddard has quit IRC | 08:10 | |
*** mgoddard has joined #openstack-infra | 08:10 | |
*** agopi_ has joined #openstack-infra | 08:11 | |
*** agopi has quit IRC | 08:14 | |
*** kjackal has joined #openstack-infra | 08:15 | |
*** shardy has joined #openstack-infra | 08:15 | |
amorin | hey frickler and others, I am moving your instances onto separate hosts | 08:20 |
amorin | in the meantime, we found an issue on the hypervisors | 08:20 |
amorin | about RAM usage | 08:20 |
amorin | if the instances do not have enough memory, they may be using swap instead, which can cause them to slow down a lot | 08:21 |
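For context, a quick way to confirm from inside a guest whether it is dipping into swap is to watch memory and swap-in/swap-out counters; a minimal sketch, assuming a Linux guest with the usual procps tools installed:

```shell
# show total/used memory and swap in megabytes
free -m
# sample vm statistics once a second for five seconds; non-zero "si"/"so"
# columns mean pages are actively being swapped in/out
vmstat 1 5
```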
*** imacdonn has quit IRC | 08:22 | |
*** imacdonn has joined #openstack-infra | 08:23 | |
*** _alastor_ has joined #openstack-infra | 08:25 | |
*** agopi_ is now known as agopi | 08:27 | |
*** bhavikdbavishi has quit IRC | 08:28 | |
*** shardy has quit IRC | 08:28 | |
*** hwoarang has quit IRC | 08:28 | |
*** shardy has joined #openstack-infra | 08:29 | |
*** ykarel|lunch is now known as ykarel | 08:30 | |
*** _alastor_ has quit IRC | 08:30 | |
*** hwoarang has joined #openstack-infra | 08:30 | |
*** ahosam has joined #openstack-infra | 08:32 | |
*** ahosam has quit IRC | 08:32 | |
*** priteau has joined #openstack-infra | 08:39 | |
*** quiquell|brb is now known as quiquell | 08:40 | |
*** ramishra has quit IRC | 08:44 | |
*** ramishra has joined #openstack-infra | 08:51 | |
*** bobh has joined #openstack-infra | 08:51 | |
*** bobh has quit IRC | 08:56 | |
*** yamamoto has quit IRC | 09:01 | |
*** ahosam has joined #openstack-infra | 09:02 | |
*** jpena|off is now known as jpena | 09:03 | |
*** dpawlik has quit IRC | 09:03 | |
*** dpawlik has joined #openstack-infra | 09:04 | |
*** eumel8 has joined #openstack-infra | 09:05 | |
*** ahosam has quit IRC | 09:05 | |
*** wolverineav has joined #openstack-infra | 09:07 | |
*** jtomasek_ has joined #openstack-infra | 09:08 | |
*** dpawlik has quit IRC | 09:08 | |
*** jpich has joined #openstack-infra | 09:09 | |
*** jtomasek has quit IRC | 09:10 | |
*** wolverineav has quit IRC | 09:12 | |
*** kjackal has quit IRC | 09:13 | |
*** kjackal has joined #openstack-infra | 09:14 | |
*** yamamoto has joined #openstack-infra | 09:18 | |
*** lpetrut has joined #openstack-infra | 09:21 | |
AJaeger | ianw: looking at https://review.openstack.org/621840 - do you have a change that tests it and shows that it does the right thing? | 09:21 |
*** yamamoto has quit IRC | 09:27 | |
*** bobh has joined #openstack-infra | 09:30 | |
*** derekh has joined #openstack-infra | 09:36 | |
*** dpawlik has joined #openstack-infra | 09:39 | |
*** dpawlik has quit IRC | 09:39 | |
*** dpawlik has joined #openstack-infra | 09:39 | |
*** aojea has joined #openstack-infra | 09:40 | |
*** pbourke_ has quit IRC | 09:54 | |
*** yamamoto has joined #openstack-infra | 10:06 | |
*** yamamoto has quit IRC | 10:10 | |
*** e0ne has joined #openstack-infra | 10:12 | |
*** rossella_s has quit IRC | 10:21 | |
*** priteau has quit IRC | 10:24 | |
*** pbourke has joined #openstack-infra | 10:26 | |
*** electrofelix has joined #openstack-infra | 10:28 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Only reset working copy when needed https://review.openstack.org/624343 | 10:31 |
*** gfidente has joined #openstack-infra | 10:34 | |
*** rossella_s has joined #openstack-infra | 10:35 | |
*** yamamoto has joined #openstack-infra | 10:45 | |
*** yamamoto has quit IRC | 10:47 | |
*** udesale has quit IRC | 10:56 | |
*** tobias-urdin is now known as tobias-urdin|lun | 11:00 | |
*** tobias-urdin|lun is now known as tobias-urdin_afk | 11:01 | |
*** yamamoto has joined #openstack-infra | 11:11 | |
*** yamamoto has quit IRC | 11:16 | |
*** yamamoto has joined #openstack-infra | 11:16 | |
*** rfolco has quit IRC | 11:18 | |
*** dtantsur|afk is now known as dtantsur | 11:18 | |
*** rfolco has joined #openstack-infra | 11:23 | |
*** tobias-urdin_afk is now known as tobias-urdin | 11:27 | |
*** rossella_s has quit IRC | 11:28 | |
*** longkb has quit IRC | 11:39 | |
*** rossella_s has joined #openstack-infra | 11:43 | |
*** quiquell is now known as quiquell|brb | 11:50 | |
*** dpawlik has quit IRC | 11:56 | |
*** ahosam has joined #openstack-infra | 11:57 | |
*** dpawlik has joined #openstack-infra | 11:57 | |
ssbarnea|rover | i've seen an interesting spike in timeouts which seems to re-occur after exactly one week: http://status.openstack.org/elastic-recheck/ | 12:00 |
*** ahosam has quit IRC | 12:01 | |
ssbarnea|rover | i am considering adding a new query for POST-specific timeouts, as the existing one seems too generic and we have a significant number of POST ones. anyone against? | 12:02 |
*** quiquell|brb is now known as quiquell | 12:13 | |
sean-k-mooney | are there any docs on how to create an elastic recheck query | 12:16 |
sean-k-mooney | i want to create one for "os_vif error: [Errno 24] Too many open files" in the nova compute agent log | 12:17 |
*** wolverineav has joined #openstack-infra | 12:18 | |
*** wolverineav has quit IRC | 12:22 | |
*** yamamoto has quit IRC | 12:23 | |
*** fresta_ is now known as fresta | 12:27 | |
*** yamamoto has joined #openstack-infra | 12:28 | |
*** jamesden_ is now known as jamesdenton | 12:29 | |
*** jpena is now known as jpena|lunch | 12:32 | |
*** yamamoto has quit IRC | 12:32 | |
*** psachin has quit IRC | 12:35 | |
*** rh-jelabarre has joined #openstack-infra | 12:39 | |
*** jamesmcarthur has joined #openstack-infra | 12:42 | |
*** jamesmcarthur has quit IRC | 12:46 | |
*** e0ne has quit IRC | 12:47 | |
*** dave-mccowan has joined #openstack-infra | 12:53 | |
frickler | amorin: oh, that could indeed explain our issues. can you work around it by adjusting quota? do you still want us to proceed with the other tests? | 12:55 |
openstackgerrit | Chris Dent proposed openstack-infra/project-config master: Change os-resource-classes acl config to placement https://review.openstack.org/624387 | 13:00 |
ssbarnea|rover | sean-k-mooney: just create another file in the queries/ folder, that's all. look at the existing files to reverse engineer the docs ;) | 13:00 |
ssbarnea|rover | in the end it's a 4-5 line yaml file | 13:00 |
*** dave-mccowan has quit IRC | 13:01 | |
*** yamamoto has joined #openstack-infra | 13:05 | |
*** gfidente has quit IRC | 13:09 | |
sean-k-mooney | ya i figured that out but i can't figure out the kibana/elastic search query | 13:09 |
sean-k-mooney | tags:"screen-n-cpu.txt" and message:"os_vif error: [Errno 24] Too many open files" and project:"openstack/neutron" | 13:09 |
sean-k-mooney | that does not seem to work | 13:09 |
*** boden has joined #openstack-infra | 13:09 | |
*** panda|off is now known as panda | 13:10 | |
*** ykarel is now known as ykarel|afk | 13:12 | |
*** trown|outtypewww is now known as trown | 13:13 | |
*** jamesmcarthur has joined #openstack-infra | 13:15 | |
fungi | amorin: oh, were the hosts oversubscribed on ram? i agree that could have been an explanation | 13:18 |
*** gfidente has joined #openstack-infra | 13:21 | |
fungi | sean-k-mooney: do you have a recent example of a job log in which that string appeared? | 13:23 |
fungi | message:"os_vif error: [Errno 24] Too many open files" isn't found in any indexed job logs for at least the past week | 13:23 |
*** zul has quit IRC | 13:26 | |
*** zul has joined #openstack-infra | 13:26 | |
*** rlandy has joined #openstack-infra | 13:28 | |
sean-k-mooney | fungi: yes, so currently it's coming up as an uncategorised issue | 13:31 |
sean-k-mooney | one sec | 13:31 |
sean-k-mooney | http://logs.openstack.org/49/622449/4/check/neutron-tempest-iptables_hybrid/aa25876/logs/screen-n-cpu.txt.gz?level=TRACE#_Dec_11_10_29_35_876336 | 13:32 |
*** jpena|lunch is now known as jpena | 13:32 | |
sean-k-mooney | fungi: the neutron-tempest-iptables_hybrid entry on http://status.openstack.org/elastic-recheck/data/integrated_gate.html | 13:33 |
sean-k-mooney | is caused by https://bugs.launchpad.net/os-vif/+bug/1807949 | 13:33 |
openstack | Launchpad bug 1807949 in os-vif "os_vif error: [Errno 24] Too many open files" [High,Triaged] - Assigned to sean mooney (sean-k-mooney) | 13:33 |
sean-k-mooney | or rather by pyroute2 | 13:33 |
fungi | we do seem to be indexing that file in that job | 13:44 |
fungi | since build_name:"neutron-tempest-iptables_hybrid" AND filename:"logs/screen-n-api.txt" returns plenty of hits in the past 6 hours | 13:45 |
fungi | but appending AND message:"Too many open files" has 0 matches in 24 hours | 13:46 |
fungi | or even 48 hours, so should have caught that run | 13:47 |
*** jamesmcarthur has quit IRC | 13:47 | |
*** jamesmcarthur has joined #openstack-infra | 13:48 | |
fungi | build_short_uuid:"aa25876" has hits for that file too though | 13:48 |
*** kgiusti has joined #openstack-infra | 13:50 | |
sean-k-mooney | ok so at least i did not completely misunderstand how to use kibana | 13:50 |
fungi | well, either that or i completely misunderstand how to use kibana too ;) | 13:51 |
fungi | certainly not ruling that out | 13:51 |
sean-k-mooney | filename:"logs/screen-n-api.txt" is the wrong file by the way | 13:51 |
sean-k-mooney | it should be screen-n-cpu.txt | 13:52 |
fungi | d'oh, thanks! | 13:53 |
fungi | that seemed to make a difference, though still no lines indexed with message:"os_vif error" | 13:54 |
sean-k-mooney | can you share your query by the way | 13:54 |
sean-k-mooney | this could become a neutron gate blocker or it could just be intermittent, so i wanted to get a query to try and monitor it | 13:55 |
fungi | i'm currently combing through build_name:"neutron-tempest-iptables_hybrid" AND filename:"logs/screen-n-cpu.txt" AND build_short_uuid:"aa25876" AND message:"error" | 13:56 |
fungi | trying to work out why that line is missing | 13:56 |
fungi | noting the entries are in reverse-chronological order | 13:56 |
*** ykarel|afk has quit IRC | 13:57 | |
fungi | found! | 13:57 |
fungi | the message it parsed out for that line is "error: [Errno 24] Too many open files" | 13:57 |
fungi | okay, now working to generalize | 13:58 |
*** sthussey has joined #openstack-infra | 14:00 | |
*** jamesmcarthur has quit IRC | 14:02 | |
*** jamesmcarthur has joined #openstack-infra | 14:03 | |
fungi | sean-k-mooney: is the project:"openstack/neutron" part critical to this query? | 14:03 |
fungi | is this showing up in multiple jobs, but only jobs run on changes to neutron and not to any other projects? | 14:04 |
fungi | tags:"screen-n-cpu.txt" AND message:"error: [Errno 24] Too many open files" AND project:"openstack/neutron" shows up starting around 09:00 utc today | 14:05 |
sean-k-mooney | no | 14:06 |
fungi | if i drop the project filter, it's still the same number of hits | 14:06 |
sean-k-mooney | i think i have a query | 14:06 |
sean-k-mooney | http://logstash.openstack.org/#/dashboard/file/logstash.json?query=tags:%5C%22screen-n-cpu.txt%5C%22%20AND%20message:%5C%22OSError:%20%5BErrno%2024%5D%20Too%20many%20open%20files%5C%22%20AND%20module:%5C%22os_vif%5C%22%20AND%20loglevel:%20%5C%22ERROR%5C%22 | 14:06 |
*** mriedem has joined #openstack-infra | 14:06 | |
*** jamesmcarthur has quit IRC | 14:07 | |
sean-k-mooney | fungi: i actually want to check the nova, neutron and kuryr-kubernetes jobs | 14:07 |
sean-k-mooney | so dropping it is fine | 14:07 |
fungi | lgtm | 14:07 |
*** dave-mccowan has joined #openstack-infra | 14:08 | |
*** rossella_s has quit IRC | 14:08 | |
fungi | first hit does still seem to be around 09:00 | 14:08 |
sean-k-mooney | ya so we did a release yesterday of os-vif | 14:08 |
sean-k-mooney | the thing is i don't know if this is intermittent or if it always happens | 14:08 |
sean-k-mooney | i think the issue is caused by pyroute2 however | 14:09 |
*** rossella_s has joined #openstack-infra | 14:09 | |
fungi | the other litmus test is that appending AND build_status:"SUCCESS" returns 0 hits, which seems to be the case | 14:09 |
fungi | so we know this pattern is present only in failed job runs | 14:09 |
sean-k-mooney | cool so this is the tracking bug https://bugs.launchpad.net/os-vif/+bug/1807949 | 14:10 |
openstack | Launchpad bug 1807949 in os-vif "os_vif error: [Errno 24] Too many open files" [High,Triaged] - Assigned to sean mooney (sean-k-mooney) | 14:10 |
*** psachin has joined #openstack-infra | 14:11 | |
sean-k-mooney | if i add tags:"screen-n-cpu.txt" AND message:"OSError: [Errno 24] Too many open files" AND module:"os_vif" AND loglevel: "ERROR" as the query in the elastic-recheck repo, the file just needs the same name as the bug number, right? | 14:11 |
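For reference, a minimal sketch of what such a query file could look like, assuming the usual elastic-recheck convention of one YAML file per bug in the queries/ directory, named after the Launchpad bug number:

```yaml
# queries/1807949.yaml (illustrative sketch based on the query above)
query: >-
  tags:"screen-n-cpu.txt" AND
  message:"OSError: [Errno 24] Too many open files" AND
  module:"os_vif" AND
  loglevel:"ERROR"
```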
*** dave-mccowan has quit IRC | 14:14 | |
*** quiquell is now known as quiquell|lunch | 14:18 | |
*** ykarel|afk has joined #openstack-infra | 14:18 | |
*** smarcet has joined #openstack-infra | 14:20 | |
smarcet | fungi: clarkb: morning, as i mentioned before, we need to migrate openstackid to the latest Laravel version (5.6) and migrate the puppet to start using php 7.x. you mentioned that the newest ubuntu version you support is xenial, but xenial by default only supports php 7.0 and i have a hard requirement for PHP >= 7.1.3 because of https://laravel.com/docs/5.6 | 14:22 |
smarcet | is it possible for me to update the puppet to use the ppa:ondrej/php PPA and be able to use php 7.2? | 14:22 |
*** rossella_s has quit IRC | 14:24 | |
fungi | i guess laravel has decided they don't support any ubuntu lts other than the latest one at this point? bionic (18.04 lts) seems to have php 7.2 but we currently have problems using puppet on it and are looking at solutions for deploying containerized services on bionic as a result | 14:26 |
fungi | smarcet: given that ppa:ondrej/php is maintained by one of the official ubuntu php package maintainers, it seems like a safe enough compromise | 14:27 |
fungi | i guess this is his alternative to getting the php7.2 packages into xenial-backports | 14:28 |
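Enabling that PPA on a xenial host would look roughly like the following; a sketch only, and the php7.2 package names are assumptions (the exact set of extensions depends on what the openstackid puppet module ends up requiring):

```shell
# illustrative only: enable the ondrej/php PPA on Ubuntu 16.04 (xenial)
sudo add-apt-repository -y ppa:ondrej/php
sudo apt-get update
# package names assumed; install whichever php7.2 pieces the app needs
sudo apt-get install -y php7.2 php7.2-fpm php7.2-cli
```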
*** quiquell|lunch is now known as quiquell | 14:30 | |
openstackgerrit | sean mooney proposed openstack-infra/elastic-recheck master: add query for os-vif pyroute2 open files https://review.openstack.org/624412 | 14:30 |
smarcet | fungi: ok cool, if that's ok then i will update the puppet to work that way. may i ask you to remove the openstackid production server from the puppet agent runs, so i can test on the dev server? | 14:31 |
*** rossella_s has joined #openstack-infra | 14:31 | |
*** udesale has joined #openstack-infra | 14:33 | |
fungi | #status log added openstackid.org to the emergency disable list while smarcet tests out php7.2 on openstackid-dev.openstack.org | 14:34 |
openstackstatus | fungi: finished logging | 14:34 |
fungi | smarcet: i see that we're still running ubuntu trusty (14.04 lts) on both of those servers too | 14:35 |
fungi | maybe this is an opportunity to rebuild them on xenial (16.04 lts) too? | 14:35 |
*** smarcet has quit IRC | 14:36 | |
*** ykarel|afk is now known as ykarel | 14:37 | |
*** smarcet has joined #openstack-infra | 14:38 | |
smarcet | fungi: yes of course | 14:38 |
smarcet | i will test that and we could try first on dev server :) | 14:38 |
smarcet | thx u | 14:38 |
*** rossella_s has quit IRC | 14:39 | |
*** e0ne has joined #openstack-infra | 14:45 | |
*** rossella_s has joined #openstack-infra | 14:46 | |
*** gfidente has quit IRC | 14:59 | |
*** eharney has joined #openstack-infra | 15:00 | |
*** markvoelker has joined #openstack-infra | 15:00 | |
openstackgerrit | Fabien Boucher proposed openstack-infra/zuul master: WIP - Pagure driver https://review.openstack.org/604404 | 15:05 |
*** psachin has quit IRC | 15:05 | |
*** smarcet has quit IRC | 15:09 | |
*** oanson has joined #openstack-infra | 15:17 | |
*** smarcet has joined #openstack-infra | 15:20 | |
*** eharney_ has joined #openstack-infra | 15:23 | |
*** agopi has quit IRC | 15:24 | |
*** eharney has quit IRC | 15:26 | |
*** eharney_ is now known as eharney | 15:27 | |
*** agopi has joined #openstack-infra | 15:29 | |
*** geguileo has joined #openstack-infra | 15:31 | |
geguileo | dmsimard: hi, I'm trying to run this playbook https://review.openstack.org/#/c/620671/7/playbooks/cinderlib/run.yaml | 15:32 |
geguileo | dmsimard: and it's being called from here https://review.openstack.org/#/c/620671/7/playbooks/legacy/cinder-tempest-dsvm-lvm-lio-barbican/run.yaml | 15:32 |
*** bobh has quit IRC | 15:32 | |
geguileo | dmsimard: and I'm running into this error http://logs.openstack.org/71/620671/7/check/cinder-tempest-dsvm-lvm-lio-barbican/6de7951/job-output.txt.gz#_2018-12-04_19_52_26_753969 | 15:32 |
geguileo | dmsimard: which is a little opaque for me | 15:33 |
dmsimard | geguileo: there's a bit more info in the ara report: http://logs.openstack.org/71/620671/7/check/cinder-tempest-dsvm-lvm-lio-barbican/6de7951/ara-report/result/abc1dc34-2d56-43e9-9c11-730cf6ec8d1d/ | 15:33 |
dmsimard | (from http://logs.openstack.org/71/620671/7/check/cinder-tempest-dsvm-lvm-lio-barbican/6de7951/ara-report/ ) | 15:33 |
geguileo | dmsimard: thanks! | 15:34 |
dmsimard | does that directory exist or not ? there's the notion of sudoers in your playbook -- do the tests need to run with superuser privileges ? | 15:34 |
geguileo | dmsimard: how can I know where devstack is installed? | 15:34 |
*** agopi has quit IRC | 15:35 | |
dmsimard | geguileo: the devstack installation occurs in a previous task: http://logs.openstack.org/71/620671/7/check/cinder-tempest-dsvm-lvm-lio-barbican/6de7951/ara-report/result/b1365e39-3d97-48e5-a474-e65e50aba1ff/ | 15:36 |
dmsimard | I'm not super familiar with devstack but it looks like there's stuff in /opt/stack for sure | 15:37 |
*** bobh has joined #openstack-infra | 15:38 | |
*** ykarel is now known as ykarel|away | 15:38 | |
geguileo | dmsimard: thanks | 15:39 |
geguileo | dmsimard: I'll try to figure out if there's a variable with the directory | 15:39 |
*** bobh has quit IRC | 15:40 | |
*** neilsun has quit IRC | 15:41 | |
openstackgerrit | Chris Dent proposed openstack-infra/project-config master: Change os-resource-classes and os-traits acl config to placement https://review.openstack.org/624387 | 15:47 |
*** gfidente has joined #openstack-infra | 15:52 | |
*** wolverineav has joined #openstack-infra | 15:54 | |
*** markvoelker has quit IRC | 15:55 | |
clarkb | our inap images are all up to date now | 15:56 |
*** markvoelker has joined #openstack-infra | 15:56 | |
*** bobh has joined #openstack-infra | 15:57 | |
*** tpsilva has joined #openstack-infra | 15:58 | |
*** smarcet has quit IRC | 15:58 | |
*** wolverineav has quit IRC | 15:58 | |
*** smarcet has joined #openstack-infra | 15:59 | |
*** markvoelker has quit IRC | 16:01 | |
*** bobh has quit IRC | 16:01 | |
*** ccamacho has quit IRC | 16:09 | |
*** jamesmcarthur has joined #openstack-infra | 16:10 | |
*** udesale has quit IRC | 16:14 | |
*** bobh has joined #openstack-infra | 16:20 | |
*** bhavikdbavishi has joined #openstack-infra | 16:24 | |
*** bhavikdbavishi has quit IRC | 16:25 | |
*** bhavikdbavishi has joined #openstack-infra | 16:31 | |
*** e0ne has quit IRC | 16:36 | |
*** sean-k-mooney has quit IRC | 16:43 | |
*** quiquell is now known as quiquell|off | 16:48 | |
*** sean-k-mooney has joined #openstack-infra | 16:49 | |
*** eharney has quit IRC | 16:51 | |
*** d0ugal has quit IRC | 16:56 | |
*** bhavikdbavishi1 has joined #openstack-infra | 16:58 | |
*** kjackal has quit IRC | 16:59 | |
*** kjackal has joined #openstack-infra | 17:00 | |
*** bhavikdbavishi1 has quit IRC | 17:00 | |
*** bhavikdbavishi has quit IRC | 17:02 | |
*** bhavikdbavishi has joined #openstack-infra | 17:05 | |
*** rossella_s has quit IRC | 17:07 | |
*** eharney has joined #openstack-infra | 17:07 | |
*** jamesmcarthur has quit IRC | 17:13 | |
*** yamamoto has quit IRC | 17:19 | |
*** zul has quit IRC | 17:20 | |
*** gyee has joined #openstack-infra | 17:22 | |
*** jpich has quit IRC | 17:24 | |
clarkb | A lot of email to get through this morning. Probably a fairly slow start for me today between that and our meeting | 17:35 |
*** ykarel|away has quit IRC | 17:35 | |
*** pgaxatte has quit IRC | 17:37 | |
fungi | mordred: corvus: clarkb: jpmaxman is hacking on a gerrit backend driver for netlify cms and interested in having a repo in our gerrit for some test content. any concerns? | 17:43 |
clarkb | fungi: could possibly reuse the sandbox repo? (though that might get abused). I don't see any issues with having a test repo | 17:43 |
fungi | yeah, i figure it might be cleaner to use a dedicated repo and then just retire it once no longer needed (or keep it around for similar future sorts of netlify backend testing) | 17:44 |
fungi | i think he wants to be able to test-drive it with zuul doing gating of content changes and stuff | 17:45 |
fungi | which is why i didn't suggest just using the official gerrit container to test with | 17:45 |
*** JpMaxMan has joined #openstack-infra | 17:45 | |
corvus | fungi: no objection here | 17:46 |
corvus | and also, now that i've read all the requirements -- no better ideas :) | 17:46 |
*** xarses has joined #openstack-infra | 17:46 | |
*** sshnaidm is now known as sshnaidm|afk | 17:47 | |
fungi | and exciting as this may mean easier collaboration on site content for zuul-ci.org and opendev.org | 17:47 |
*** xarses has quit IRC | 17:47 | |
JpMaxMan | yes that's the dream :) | 17:48 |
*** xarses has joined #openstack-infra | 17:48 | |
fungi | clarkb: should it just go in the openstack-infra namespace? seems more related to infra/opendev efforts than to openstack anyway, even if it's not something that would necessarily be an official deliverable repo of the infra team | 17:49 |
clarkb | that is fine with me. | 17:50 |
JpMaxMan | right now we're just working up a POC using the starlingx site as it is already in netlify | 17:50 |
JpMaxMan | https://github.com/StarlingXWeb/starlingx-website | 17:52 |
fungi | JpMaxMan: want me to get the project-config change going to create the repository? do you want starlingx-website imported as the initial repository content? | 17:53 |
*** lpetrut has quit IRC | 17:56 | |
JpMaxMan | I'm happy to take a stab at it - and yes we'd start with the starlingx-website as an initial repo. | 17:56 |
*** bobh has quit IRC | 17:56 | |
*** gyee has quit IRC | 17:56 | |
fungi | JpMaxMan: in that case we have instructions at https://docs.openstack.org/infra/manual/creators.html and are happy to help answer any questions you have | 17:57 |
fungi | JpMaxMan: i recommend something like openstack-infra/netlify-sandbox to fit with existing naming conventions for other repos in our gerrit | 17:57 |
JpMaxMan | excellent! Thank you - will let you know as I proceed. And yes, any naming conventions suggestions welcome - will use that to start :) | 17:58 |
fungi | note that a lot of what's in there isn't relevant for this particular case so you'll end up skipping some of it (e.g., anything having to do with pypi) | 17:59 |
fungi | and if you miss something or include something unnecessary, that's why we have automated checks and reviewers | 18:00 |
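To give a rough idea of scope, the Gerrit-side part of that process is mostly a short stanza in project-config's gerrit/projects.yaml; a sketch only, with the description and upstream import taken from the discussion above rather than from an actual change:

```yaml
# gerrit/projects.yaml (illustrative sketch)
- project: openstack-infra/netlify-sandbox
  description: Sandbox repository for testing a Netlify CMS Gerrit backend
  upstream: https://github.com/StarlingXWeb/starlingx-website
```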
*** derekh has quit IRC | 18:00 | |
*** aojea has quit IRC | 18:01 | |
clarkb | unrelated but it is really cool that university researchers are starting to figure out we've got all this real world data freely available for research on software development process activity | 18:01 |
*** trown is now known as trown|lunch | 18:02 | |
JpMaxMan | ok good to know - I'll give the automation a run for its money :P | 18:02 |
fungi | we all do | 18:02 |
fungi | clarkb: yes, i love that academic research sees our work as a gold mine of behavioral (both human and systems) data | 18:06 |
*** dtantsur is now known as dtantsur|afk | 18:07 | |
clarkb | mwhahaha: ssbarnea|rover EmilienM I'm still in a "trying to better understand the failures we are experiencing" state, and looking at http://logs.openstack.org/22/605722/2/gate/tripleo-ci-centos-7-undercloud-containers/d1a7140/logs/undercloud/ I see the undercloud failed while configuring keepalived? Having a hard time seeing why/where keepalived failed. Can you help me find the appropriate | 18:07 |
clarkb | logs? | 18:07 |
mwhahaha | clarkb: error mounting image volumes: unable to find user root: no matching entries in passwd file | 18:07 |
mwhahaha | is a bug in podman (probably runc) | 18:08 |
clarkb | mwhahaha: which logfile do I look in for that? | 18:08 |
mwhahaha | http://logs.openstack.org/22/605722/2/gate/tripleo-ci-centos-7-undercloud-containers/d1a7140/logs/undercloud/home/zuul/undercloud_install.log.txt.gz#_2018-12-11_17_23_18 | 18:09 |
mwhahaha | https://bugs.launchpad.net/tripleo/+bug/1803544 | 18:09 |
openstack | Launchpad bug 1803544 in tripleo "unable to find user root: no matching entries in passwd file" [High,Triaged] | 18:09 |
*** Swami has joined #openstack-infra | 18:09 | |
clarkb | aha I needed to scroll up for more ERROR messages. Thank you | 18:09 |
mwhahaha | http://status.openstack.org/elastic-recheck/index.html#1803544 | 18:10 |
*** e0ne has joined #openstack-infra | 18:10 | |
mwhahaha | we're trying to figure it out, it's one of those really obscure bugs | 18:10 |
* mwhahaha wanders off | 18:10 | |
clarkb | cool so its being tracked already. Thanks | 18:10 |
EmilienM | clarkb: hi, yes I'm working with the podman team today and we have a fix already : https://github.com/containers/libpod/pull/1978 | 18:11 |
EmilienM | clarkb: I'm working on getting the fix merged and built asap... | 18:12 |
*** gfidente has quit IRC | 18:12 | |
clarkb | EmilienM: good to know. FWIW not singling out this specific bug, I was just going through and trying to find the breadcrumbs and got lost. Thank you for pointing me at the other error messages and the bug and the fix | 18:12 |
clarkb | (this is me trying to better understand the variety of testing we run so that the infra team can help debug and/or fix things when it is on our end) | 18:12 |
EmilienM | yeah it makes sense | 18:13 |
clarkb | mriedem: http://logs.openstack.org/76/582376/8/gate/tempest-full-py3/a8f62b6/job-output.txt.gz#_2018-12-11_10_50_01_185172 is that one you recognize? looks like either the test node ran out of disk or the devstack test flavor is too small for cirros | 18:16 |
clarkb | unfortunately dstat doesn't capture disk usage | 18:17 |
clarkb | http://logs.openstack.org/76/582376/8/gate/tempest-full-py3/a8f62b6/controller/logs/df.txt.gz (from whenever devstack runs that df) indicates we have a lot of disk there though | 18:18 |
fungi | clarkb: could also be bubbling up from lack of disk space at the hypervisor layer, though that build was in inap-mtl01 which isn't somewhere we've seen disk issues like that in the past as far as i'm aware | 18:18 |
clarkb | fungi: ya the df shows we have 150GB of disk which is a lot more than we promise to have | 18:18 |
clarkb | maybe someone can boot the cirros image and check how much disk it ends up using (or is that something we can ask qemu-img) | 18:19 |
clarkb | oh you mean the hypervisor in inap, thats a good point | 18:19 |
*** wolverineav has joined #openstack-infra | 18:19 | |
clarkb | sorry misread it as the test node being cirros' hypervisor | 18:19 |
fungi | yeah, i can see now how that might have been vague on my part | 18:19 |
fungi | the provider's hypervisor layer/compute host | 18:19 |
fungi | not devstack's hypervisor layer | 18:20 |
clarkb | yup | 18:20 |
fungi | i think enospc gets plumbed up into the guest anyway | 18:20 |
clarkb | let us see what logstash says. If it's inap specific then ya, probably a full hypervisor. If we have more occurrences across clouds then maybe cirros is too big | 18:20 |
fungi | good thinkin | 18:20 |
*** _alastor_ has joined #openstack-infra | 18:22 | |
*** d0ugal has joined #openstack-infra | 18:22 | |
clarkb | there is a blip of it in inap on the 11th. Then a smaller blip in rax-iad | 18:23 |
clarkb | though I'm only searching recent days /me expands search | 18:23 |
*** wolverineav has quit IRC | 18:24 | |
clarkb | it happens in rax-iad, ord, inap and ovh gra1 | 18:24 |
clarkb | inap is about 2/3 of the occurrences and rax ord half that | 18:24 |
clarkb | mgagne_: ^ if it is easy for you to check, any idea what disk pressure looks like on those hypervisors? Also thank you for the image upload fix. Our images are up to date now | 18:25 |
*** wolverineav has joined #openstack-infra | 18:26 | |
*** wolverineav has quit IRC | 18:27 | |
*** wolverineav has joined #openstack-infra | 18:27 | |
*** rkukura_ has joined #openstack-infra | 18:32 | |
*** rkukura has quit IRC | 18:32 | |
*** rkukura_ is now known as rkukura | 18:32 | |
mgagne_ | clarkb: didn't check all hypervisors but disk is far from being full. and now going into a meeting. | 18:33 |
openstackgerrit | Clark Boylan proposed openstack-infra/elastic-recheck master: Add query for bug 1808010 https://review.openstack.org/624458 | 18:35 |
openstack | bug 1808010 in OpenStack-Gate "Tempest cirros boots fail due to lack of disk space" [Undecided,New] https://launchpad.net/bugs/1808010 | 18:35 |
clarkb | mgagne_: thanks | 18:35 |
clarkb | started tracking it ^ there | 18:35 |
clarkb | mriedem: ^ fyi | 18:35 |
*** trown|lunch is now known as trown | 18:36 | |
*** rfolco is now known as rfolco_brb | 18:38 | |
fungi | we're now down to 15 zuul mergers, and the merger queue seems to be getting backed up more often (though still clears fairly quickly) | 18:40 |
*** jpena is now known as jpena|off | 18:40 | |
clarkb | we expect 20 right? 12 executors + 8 dedicated mergers | 18:42 |
fungi | we don't seem to register mergers distinctly in gearman, they just show up in the merger:merge, merger:refstate, merger:fileschanges and merger:cat buckets so hard to tell which ones are missing | 18:42 |
fungi | yeah, should be 20 | 18:42 |
mriedem | clarkb: ack, | 18:43 |
mriedem | note that until https://review.openstack.org/#/c/619319/ | 18:43 |
mriedem | the flavors used by tempest via devstack specify 0 root_gb, | 18:43 |
mriedem | meaning compute uses whatever is the size of the image | 18:43 |
*** Adri2000 has quit IRC | 18:43 | |
clarkb | mriedem: possible that the image is too small for some of the writes then? you'd expect that to be more consistent though, so maybe it does point to the test node or host hypervisor | 18:44 |
fungi | it was 20 mergers registered for just a split second back on the 6th/7th (when we brought ze12 into production right after restarting everything): http://grafana.openstack.org/d/T6vSHcSik/zuul-status?panelId=30&fullscreen&orgId=1&from=now-7d&to=now | 18:44 |
fungi | looks like we were already down 2 before the restarts, so probably been going on for a while | 18:44 |
mriedem | clarkb: hmm, maybe, not sure what size the config drive is | 18:45 |
mriedem | looks like vfat is a fixed 64MB | 18:46 |
mriedem | but we don't use vfat by default | 18:46 |
fungi | seems two died around utc midnight on november 13th, prior to that we were running with a full complement since the beginning of october at least, so maybe we added something in early-to-mid november which made merger threads crashy? | 18:47 |
mriedem | oh wait is this config drive in the test node or a nested virt guest created by tempest? | 18:47 |
*** armax has joined #openstack-infra | 18:47 | |
clarkb | mriedem: this is the cirros nested "virt" guest created by tempest failing to configure networking because its disk is full (now that could be because the hypervisor running devstack has a full disk, or the hypervisor running the test node is itself running with a full disk) | 18:48 |
mriedem | looks like this is by far happening in networking-odl-tempest-fluorine | 18:48 |
clarkb | in particular it appears that it can't set the default route (I'm guessing because that needs disk to write to) | 18:48 |
*** Adri2000 has joined #openstack-infra | 18:48 | |
clarkb | and without a default route it seems that ssh is failing from tempest to the cirros node | 18:49 |
clarkb | ~10 minutes to the infra meeting | 18:49 |
fungi | `pgrep -c zuul-executor` returns "2" on all the executors | 18:50 |
clarkb | fungi: possibly the dedicated mergers have died? | 18:50 |
clarkb | or maybe haven't reconnected to gearman after restarting the scheduler? | 18:50 |
fungi | `pgrep -c zuul-merger` returns "1" on all the standalone mergers | 18:50 |
fungi | also we seem to lose mergers one or two at a time, over time, according to the graph | 18:51 |
fungi | not corresponding to scheduler/geard restarts | 18:51 |
fungi | we'll likely need to dig into merger logs on the servers to find out what's going on | 18:52 |
*** smarcet has quit IRC | 18:52 | |
*** bhavikdbavishi has quit IRC | 18:52 | |
clarkb | fungi: check netstat connections to 4730 on all of the executors and mergers? | 18:52 |
clarkb | and vice versa from the gearman scheduler | 18:52 |
clarkb | that should narrow down where the connections don't exist | 18:52 |
*** bhavikdbavishi has joined #openstack-infra | 18:53 | |
fungi | good idea | 18:53 |
*** _alastor_ has quit IRC | 18:53 | |
fungi | odd that some are ipv4 and some v6 | 18:53 |
fungi | wonder if this is network instability in rax-dfw at play | 18:53 |
clarkb | I did confirm with the logstash switch to just geard that gear will fall back appropriately | 18:54 |
clarkb | https://review.openstack.org/#/c/611920/ was another output of that to make geard a bit more ipv6 friendly | 18:55 |
clarkb | mriedem: actually /run on cirros isn't necessarily a real fs either. It is possible that it is tmpfs or similar, in which case it could be memory pressure? | 18:56 |
fungi | confirmed that all executors see 2 established gearman connections and all standalone mergers 1 | 18:56 |
clarkb | I probably need to boot a cirros image locally | 18:56 |
fungi | will check from the scheduler end now | 18:56 |
*** rlandy is now known as rlandy|brb | 18:59 | |
mriedem | clarkb: on one of the failures i looked at, the config drive was .5 MB | 18:59 |
*** _alastor_ has joined #openstack-infra | 18:59 | |
openstackgerrit | Merged openstack-infra/nodepool master: Fix race in test_handler_poll_session_expired https://review.openstack.org/623269 | 19:00 |
fungi | clarkb: i think that got it: http://paste.openstack.org/show/737045/ (we have 5 executors showing only 1 established gearman connection on the geard end) | 19:01 |
fungi | ze02, ze03, ze07, ze08 and ze11 seem to have probably lost their gearman connections for their merger threads | 19:01 |
fungi | Shrews: this may also be up your alley to dig into once you get working internets again | 19:03 |
fungi | i guess we don't get separate merger logs on the executors, the messages are just all mixed into the executor logs? | 19:04 |
*** bobh has joined #openstack-infra | 19:07 | |
*** jamesmcarthur has joined #openstack-infra | 19:09 | |
*** wolverineav has quit IRC | 19:09 | |
Shrews | fungi: the gearman stuff? maybe alley-adjacent :) i can help you poke around in logs after the meeting | 19:10 |
*** wolverineav has joined #openstack-infra | 19:10 | |
fungi | yeah, no rush | 19:10 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Fix node leak when skipping child jobs https://review.openstack.org/613261 | 19:10 |
*** bhavikdbavishi has quit IRC | 19:11 | |
fungi | checking ze02 for a start, seems it logged zuul.Merger entries other than "Updating local repository" up until 2018-11-28 23:03:52,646 and then abruptly ceased | 19:11 |
openstackgerrit | Merged openstack-infra/elastic-recheck master: Add query for bug 1808010 https://review.openstack.org/624458 | 19:12 |
openstack | bug 1808010 in OpenStack-Gate "Tempest cirros boots fail due to lack of disk space" [Undecided,New] https://launchpad.net/bugs/1808010 | 19:12 |
*** jamesmcarthur has quit IRC | 19:14 | |
*** rlandy|brb is now known as rlandy | 19:14 | |
*** yamamoto has joined #openstack-infra | 19:17 | |
*** xarses has quit IRC | 19:19 | |
*** _alastor_ has quit IRC | 19:25 | |
*** electrofelix has quit IRC | 19:26 | |
*** xarses has joined #openstack-infra | 19:27 | |
*** lbragstad has quit IRC | 19:30 | |
*** lbragstad has joined #openstack-infra | 19:31 | |
fungi | digging into the log around that time, i don't see any exceptions/tracebacks | 19:32 |
fungi | current theory: network issues resulted in geard dropping the connection from the merger, but the merger on the executor still thinks the socket is established. lack of keepalive/dpd(?) means the merger thread is humming along blissfully unaware that it will never see any new requests | 19:35 |
*** shardy has quit IRC | 19:37 | |
*** wolverineav has quit IRC | 19:37 | |
fungi | related question: why is this only affecting the tag-along merger threads on the executors and not the stand-alone merger daemons? | 19:40 |
fungi | we've lost 25% of our mergers, and none are stand-alone even though those account for 40% of the total | 19:40 |
fungi | statistically unlikely it's random distribution there | 19:41 |
*** jamesmcarthur has joined #openstack-infra | 19:43 | |
*** wolverineav has joined #openstack-infra | 19:46 | |
*** smarcet has joined #openstack-infra | 19:47 | |
*** jamesmcarthur has quit IRC | 19:47 | |
*** wolverineav has quit IRC | 19:48 | |
*** wolverineav has joined #openstack-infra | 19:49 | |
tobiash | fungi: related to your theory: https://review.openstack.org/599567 | 19:52 |
tobiash | fungi: we observed the same after a vm crash hosting the scheduler/geard | 19:52 |
fungi | tobiash: thanks!!! that's indeed interesting | 19:53 |
fungi | tobiash: any idea why it might affect the merger threads on our executors but not affect our stand-alone mergers? | 19:53 |
tobiash | fungi: that's just co-incidence, on our scheduler crash it affected *all* mergers | 19:54 |
fungi | got it, thanks again | 19:54 |
tobiash | fungi: the point is that if a merge was in progress while having network issues, the merger will try to send the result and notice that the connection is broken while an idle merger won't notice it | 19:54 |
fungi | makes sense. perhaps our stand-alone mergers are more active than our tag-along mergers | 19:55 |
fungi | and so statistically more likely to be in the middle of something when the disconnect occurs, so notice and reconnect | 19:56 |
fungi | no idea if our data backs that up, but one possible explanation anyway | 19:56 |
tobiash | maybe | 19:56 |
openstackgerrit | Chris Dent proposed openstack-infra/project-config master: Change os-resource-classes and os-traits acl config to placement https://review.openstack.org/624387 | 19:57 |
*** wolverineav has quit IRC | 19:58 | |
corvus | tobiash, fungi: don't we have keepalives on the server? shouldn't that be enough? | 19:59 |
tobiash | corvus: no, because an idle merger won't notice until it tries to send something | 19:59 |
fungi | corvus: if we do, then i'm indeed curious why it's not helping | 19:59 |
tobiash | so you need keepalive in both directions | 19:59 |
corvus | tobiash: oh, i get it. thanks :) | 20:00 |
tobiash | corvus: the server correctly notices that the client is gone, so that's fine | 20:00 |
fungi | we definitely seem to have connections which are marked as established on the client but absent on the server | 20:00 |
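For illustration, enabling TCP keepalive on a client socket in Python looks roughly like this; a generic sketch of the mechanism the gear change adds rather than gear's actual API, with an assumed hostname and arbitrary interval values:

```python
import socket

# hypothetical scheduler address, for illustration only
sock = socket.create_connection(("zuul-scheduler.example.org", 4730))

# ask the kernel to probe an otherwise idle connection so a dead peer is
# noticed even if this side never tries to send anything itself
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Linux-specific tuning: first probe after 60s idle, then every 30s,
# give up (and error out the socket) after 5 unanswered probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)
```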
clarkb | corvus: ianw re https://review.openstack.org/#/c/605585/14 I left a comment on what I think is the issue and how to fix it. Do you think that fix is reasonable? if so I can get it up pretty quickly | 20:00 |
fungi | tobiash: yep, that's i think what we're seeing here then | 20:00 |
clarkb | oh wait there is another issue too | 20:01 |
corvus | tobiash, fungi: +3 | 20:01 |
fungi | thanks! | 20:01 |
tobiash | corvus, fungi: the according zuul change is 599567 (which needs an update to the requirements after a geard release) | 20:01 |
tobiash | corvus: thanks :) | 20:01 |
corvus | clarkb: yep; i think you or i may have suggested that originally too | 20:02 |
tobiash | er 599573 | 20:02 |
fungi | i'm just glad this is probably explained (and even known) and i can hopefully stop worrying about the cause now ;) | 20:02 |
clarkb | corvus: just left a second comment on another failure | 20:02 |
clarkb | corvus: this one will need a little more thought but I think we can safely converge that rule across our control plane | 20:02 |
corvus | clarkb: that's very amusing, btw -- this was my yesterday: https://review.openstack.org/619643 | 20:03 |
clarkb | ha | 20:04 |
ianw | clarkb: hrm, FORWARD DROP seems safer anyway? | 20:04 |
clarkb | ianw: ya I think FORWARD DROP is currently the more correct rule for how we use our nodes | 20:04 |
corvus | clarkb: but i agree that -- at least until we're running our own kubernetes clusters on top of our normal infrastructure, that should be fine | 20:05 |
*** zul has joined #openstack-infra | 20:05 | |
clarkb | it's possible that kubernetes, if we switch to it, will change that, as corvus has found (docker wants it set to DROP as well, then very carefully punches holes for what it passes through NAT) | 20:05 |
clarkb | since we'll run docker with the host network namespace it's a noop for our docker | 20:05 |
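The policy being discussed is just the default for the FORWARD chain; a minimal sketch of setting it by hand (the puppet/ansible rules in the reviews above would express the same thing declaratively):

```shell
# default-deny forwarded traffic; hosts that do not route or NAT for
# other machines should not notice any difference
sudo iptables -P FORWARD DROP
sudo ip6tables -P FORWARD DROP
```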
corvus | clarkb: re https://review.openstack.org/624246 maybe we should just do it in project-config? | 20:09 |
clarkb | corvus: ya we could add stub projects for the tripleo repos | 20:09 |
clarkb | and project config is listed first so will win right? | 20:09 |
corvus | yep | 20:09 |
* corvus lunches | 20:10 | |
openstackgerrit | Merged openstack-infra/gear master: Add support for keepalive to client https://review.openstack.org/599567 | 20:11 |
clarkb | mriedem: that cirros instance seems to boot with 64MB of ram according to http://logs.openstack.org/76/582376/8/gate/tempest-full-py3/a8f62b6/controller/logs/libvirt/qemu/instance-00000022_log.txt.gz | 20:16 |
clarkb | (I think I mapped the instance id properly from the console log) | 20:17 |
*** smarcet has quit IRC | 20:17 | |
mriedem | is that what -m 64 is? | 20:17 |
clarkb | which seems to be the m1.nano flavor. I'm going to boot cirros here with 64MB memory and see if it's unhappy | 20:18 |
clarkb | mriedem: ya | 20:18 |
fungi | i would rather plan n64 | 20:18 |
fungi | er, play n64 | 20:18 |
mriedem | only if not bond | 20:18 |
mriedem | clarkb: yeah http://logs.openstack.org/76/582376/8/gate/tempest-full-py3/a8f62b6/controller/logs/devstacklog.txt.gz#_2018-12-11_10_39_47_721 | 20:18 |
*** yamamoto has quit IRC | 20:19 | |
mriedem | i *think* i might have gotten to the bottom of this multiattach swap volume multinode race bug... | 20:19 |
mriedem | oh it would be so sweet | 20:19 |
*** tobiash has quit IRC | 20:19 | |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: Add a script to generate the static inventory https://review.openstack.org/622964 | 20:20 |
clarkb | nope I'm not going to boot cirros locally because apparmor says libvirtd is not allowed to start | 20:20 |
fungi | it knows best | 20:21 |
ianw | clarkb: ^^ the inventory script was a little too bare-bones i think, suggested updates | 20:21 |
clarkb | anyone have a quick easy way to boot https://download.cirros-cloud.net/0.3.5/cirros-0.3.5-x86_64-disk.img locally under qemu/kvm with 64MB memory to see if /run is a tmpfs or similar? | 20:21 |
clarkb | I want to rule out that the low memory environment is itself the source of the cp errors | 20:21 |
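One way to do that locally, assuming KVM is available and the 0.3.5 image has been downloaded to the current directory (a sketch; cirros prints its console on the serial port, so -nographic is enough to watch it boot and log in):

```shell
# boot the cirros image with 64MB of RAM, roughly matching m1.nano
qemu-system-x86_64 -enable-kvm -m 64 -nographic \
    -drive file=cirros-0.3.5-x86_64-disk.img,format=qcow2,if=virtio
# once logged in as the cirros user, check how /run is mounted:
#   mount | grep /run
#   df -h /run
```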
*** mriedem has quit IRC | 20:22 | |
*** tobiash has joined #openstack-infra | 20:23 | |
*** bobh has quit IRC | 20:23 | |
openstackgerrit | Jean-Philippe Evrard proposed openstack-infra/zuul-jobs master: Add docker insecure registries feature https://review.openstack.org/624484 | 20:23 |
clarkb | I'm going to find lunch, then maybe when I get back I'll figure out apparmor | 20:23 |
*** mriedem has joined #openstack-infra | 20:24 | |
fungi | clarkb: also board meeting at 2100z if you are interested in dialling in | 20:24 |
clarkb | ya Ill have that in the background likely | 20:25 |
*** bobh has joined #openstack-infra | 20:26 | |
openstackgerrit | Jean-Philippe Evrard proposed openstack-infra/zuul-jobs master: Add docker insecure registries feature https://review.openstack.org/624484 | 20:26 |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul master: Use gearman client keepalive https://review.openstack.org/599573 | 20:32 |
*** eharney has quit IRC | 20:34 | |
*** wolverineav has joined #openstack-infra | 20:36 | |
*** wolverineav has quit IRC | 20:36 | |
*** wolverineav has joined #openstack-infra | 20:36 | |
frickler | clarkb: tmpfs on /run type tmpfs (rw,nosuid,relatime,size=200k,mode=755) | 20:46 |
frickler | clarkb: so that is bound to fail if the config drive contains > 200k data | 20:47 |
clarkb | frickler: thanks I think that means maybe 64MB isnt big enough | 20:47 |
clarkb | ya | 20:47 |
frickler | 64MB is pretty huge compared to that | 20:47 |
clarkb | oh ya 200k | 20:47 |
clarkb | mriedem: ^ fyi | 20:48 |
*** hamerins has joined #openstack-infra | 20:48 | |
*** d0ugal has quit IRC | 20:48 | |
*** eharney has joined #openstack-infra | 20:49 | |
fungi | why would we create the tmpfs in /run anyway? that's supposed to just be for things like pidfiles during early boot | 20:49 |
mriedem | clarkb: but the 64MB here http://logs.openstack.org/76/582376/8/gate/tempest-full-py3/a8f62b6/controller/logs/libvirt/qemu/instance-00000022_log.txt.gz is the root disk, not the config drive | 20:49 |
fungi | er, i mean create the configdrive in /run | 20:49 |
clarkb | mriedem: it's ram, but that may be orthogonal if the tmpfs is that small | 20:50 |
mriedem | oh right, was thinking root disk, nvm | 20:50 |
clarkb | a 200kb tmpfs is pretty tiny | 20:50 |
fungi | we really should never use /run for *anything* | 20:50 |
clarkb | fungi: thats likely cirros/smoser | 20:51 |
frickler | it's the cirros init script that uses it | 20:51 |
clarkb | since it doesnt run glean or cloud init it does its own thing | 20:51 |
fungi | it's for pidfiles and fifos for services starting before /var/run is available | 20:51 |
clarkb | does the 4.0 image change that, I wonder | 20:52 |
fungi | and if you want a reasonable-sized tmpfs for data you generally mount one yourself (like on /tmp) | 20:52 |
clarkb | could be a reason to switch if so | 20:52 |
clarkb | frickler: ^ maybe you can check the newer 4.0 image too? | 20:52 |
*** fuentess has joined #openstack-infra | 20:52 | |
frickler | cirros 0.4 doesn't work in devstack last I checked, so not a short term option | 20:54 |
clarkb | ah | 20:54 |
frickler | I'm more wondering why the config-drive gets so large | 20:54 |
clarkb | mriedem may know | 20:56 |
clarkb | we did add a debugging script in tempest as user data | 20:56 |
clarkb | its not huge but could contribute maybe | 20:56 |
*** bobh has quit IRC | 20:56 | |
clarkb | also, the reason not setting the route matters is that we ssh via the fip | 20:57 |
clarkb | so it isn't shared l2 from cirros' perspective | 20:57 |
mriedem | clarkb: not sure, wondering if something changed in tempest recently | 20:57 |
*** rfolco_brb has quit IRC | 20:57 | |
frickler | clarkb: oh, where was that script added? "df -h /run" gives me 92% used, 16k free, so not much headroom there | 21:00 |
clarkb | frickler: it's in tempest itself, for the heavyweight ssh tests. it was added to dump debug info to the console | 21:00 |
clarkb | i forget where exactly I added it though but its my name on the commit if that helps to find it (eating lunch and listening to board meeting now) | 21:01 |
frickler | I found a patch from 2017, so that by itself wouldn't explain any recent breakage | 21:02 |
clarkb | it may no longer be helpful and we could remove it if that helps | 21:02 |
clarkb | ya it wasn't super recent | 21:03 |
frickler | hmm, that only looks to be three lines of script. removing it may help a bit, but if things are really so tight I think we need some more general measures | 21:04 |
frickler | anyway, eod for me, will followup tomorrow | 21:05 |
clarkb | ++ | 21:06 |
clarkb | thank you for getting that booted | 21:06 |
*** d0ugal has joined #openstack-infra | 21:06 | |
mriedem | clarkb: looks like we need https://review.openstack.org/#/c/623597/ on stable/rocky | 21:06 |
mriedem | because grenade on master is failing | 21:06 |
mriedem | if you want to cherry pick | 21:07 |
clarkb | I'll look after lunch | 21:09 |
clarkb | have a link to failure? | 21:09 |
mriedem | logstash still shows it hitting | 21:10 |
mriedem | in grenade jobs | 21:10 |
mriedem | so it's probably devstack in stable/rocky | 21:10 |
clarkb | ah | 21:10 |
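A rough sketch of the backport flow being discussed, assuming the fix has already merged on master; the local branch name and the sha placeholder are illustrative:

    git fetch origin
    git checkout -b backport-623597 origin/stable/rocky   # hypothetical local branch name
    git cherry-pick -x <master-commit-sha>                # keeps a "cherry picked from" note
    git review stable/rocky                               # propose it against stable/rocky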
*** bobh has joined #openstack-infra | 21:18 | |
*** bobh has quit IRC | 21:19 | |
*** bobh has joined #openstack-infra | 21:19 | |
*** bobh has quit IRC | 21:21 | |
*** auristor has quit IRC | 21:22 | |
openstackgerrit | MarcH proposed openstack-infra/git-review master: tox.ini: add passenv = http_proxy https_proxy # _JAVA_OPTIONS https://review.openstack.org/624496 | 21:28 |
*** kgiusti has left #openstack-infra | 21:28 | |
JpMaxMan | Hey random question - I'm helping someone get their git review for gerrit going - should this be 404'ing ? https://git.openstack.org/tools/hooks/commit-msg it's causing an error in the git review. | 21:30 |
openstackgerrit | MarcH proposed openstack-infra/git-review master: tox.ini: add passenv = http_proxy https_proxy # _JAVA_OPTIONS https://review.openstack.org/624496 | 21:30 |
clarkb | yes that should be served by review.openstack.org | 21:30 |
clarkb | what is your .gitreview file gerrit server value set to? | 21:31 |
clarkb | JpMaxMan: ^ | 21:31 |
JpMaxMan | lemme see | 21:31 |
fungi | JpMaxMan: when you run, e.g., `git review -s` it should just work. if this is in an empty repository you may need to create a .gitreview file to commit to it | 21:31 |
JpMaxMan | I was having him follow the instructions for the sandbox | 21:32 |
JpMaxMan | https://docs.openstack.org/infra/manual/sandbox.html | 21:32 |
*** auristor has joined #openstack-infra | 21:33 | |
fungi | https://git.openstack.org/cgit/openstack-dev/sandbox/tree/.gitreview#n2 looks correct | 21:33 |
corvus | JpMaxMan: we can get more debug info by running "git review -s -v" and copy/pasting the output to http://paste.openstack.org/ | 21:33 |
JpMaxMan | yeah checked the .gitreview it looks right | 21:34 |
JpMaxMan | host=review.openstack.org | 21:34 |
fungi | yes, i wonder if something is going sideways/getting guessed wrong due to a problem with a gerrit account | 21:34 |
fungi | so the verbose output will help | 21:34 |
*** eernst has joined #openstack-infra | 21:34 | |
JpMaxMan | http://paste.openstack.org/show/737053/ | 21:36 |
JpMaxMan | hmmm it seems to work if I clone... git clone https://review.openstack.org/openstack-dev/sandbox.git | 21:37 |
JpMaxMan | review instead of git ... | 21:38 |
corvus | fungi, JpMaxMan: the first two lines of the debug output are interesting -- apparently gitreview.remote is set | 21:38 |
fungi | could be set in ~/.gitconfig already? | 21:39 |
JpMaxMan | oh yes sorry I think I did that in my first bit of troubleshooting - it was complaining that there wasn't a remote named gerrit | 21:39 |
JpMaxMan | I looked and the remote was set to origin so I set that | 21:39 |
corvus | JpMaxMan: where did you set that? | 21:39 |
fungi | aha, yes if there is already a git remote named "gerrit" then git-review will assume that's what it should use to reach the gerrit server | 21:40 |
corvus | fungi: JpMaxMan said the opposite of that | 21:40 |
JpMaxMan | git config --global gitreview.remote origin | 21:41 |
fungi | oh, yep | 21:41 |
JpMaxMan | I first tried renaming the remote to gerrit which produced the same output | 21:41 |
corvus | JpMaxMan: can you run "git config --global --unset gitreview.remote" please? and then run 'git review -s -v' and paste the new output? | 21:41 |
JpMaxMan | sure | 21:42 |
*** eharney has quit IRC | 21:42 | |
fungi | git review should normally set a git remote named "gerrit" for you based on the content of the .gitreview file and the account name it attempts to determine via a test connection. if something goes wrong with the connection test that's when i've seen users start trying random things | 21:43 |
fungi | in the future we might want to revisit how it performs username determination | 21:43 |
*** jamesmcarthur has joined #openstack-infra | 21:44 | |
JpMaxMan | ok I think I see what happened one second | 21:44 |
*** markvoelker has joined #openstack-infra | 21:45 | |
*** e0ne has quit IRC | 21:47 | |
*** eernst has quit IRC | 21:47 | |
JpMaxMan | Ok - so the initial error was caused by a bad username: "We don't know where your gerrit is. Please manually create a remote named 'gerrit' and try again." | 21:47 |
*** jamesmcarthur_ has joined #openstack-infra | 21:48 | |
JpMaxMan | and yes @corvus - thank you - unsetting that did fix the issue | 21:48 |
JpMaxMan | but using the correct username ;) | 21:48 |
JpMaxMan | he had originally put in email instead of username and I didn't notice | 21:48 |
corvus | JpMaxMan: aha! glad it worked :) | 21:48 |
clarkb | mriedem: remote: https://review.openstack.org/624499 Set apache proxy-initial-not-pooled env var | 21:48 |
*** markvoelker has quit IRC | 21:49 | |
JpMaxMan | makes sense now - appreciate it | 21:49 |
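A short recap of the fix as a sketch; the username value is a placeholder:

    git config --global --unset gitreview.remote               # drop the override added while troubleshooting
    git config --global gitreview.username <gerrit-username>   # the gerrit username, not the email address
    git review -s -v                                           # verbose setup; creates the "gerrit" remote
    git remote -v                                              # confirm gerrit points at review.openstack.org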
*** jamesmcarthur has quit IRC | 21:50 | |
openstackgerrit | Clark Boylan proposed openstack-infra/system-config master: Import install-docker role https://review.openstack.org/605585 | 21:54 |
openstackgerrit | Clark Boylan proposed openstack-infra/system-config master: Set iptables forward drop by default https://review.openstack.org/624501 | 21:54 |
*** wolverineav has quit IRC | 21:54 | |
clarkb | corvus: ianw mordred ^ that's the outcome of the iptables discussion from a bit earlier | 21:54 |
*** wolverineav has joined #openstack-infra | 21:55 | |
*** wolverineav has quit IRC | 21:55 | |
*** wolverineav has joined #openstack-infra | 21:55 | |
clarkb | jungleboyj: any idea why cinder + lower constraints tests seem to be unhappy fairly often? | 21:57 |
jungleboyj | clarkb: No idea. I was wondering that too. | 21:58 |
clarkb | jungleboyj: http://logs.openstack.org/42/600442/1/gate/openstack-tox-lower-constraints/6592c5d/job-output.txt.gz#_2018-12-11_21_48_46_655602 seems related to database migrations? | 21:58 |
clarkb | but it isn't the old "disk is slow" timeout error. Instead this seems to complain about data types | 21:59 |
*** rcernin has joined #openstack-infra | 21:59 | |
jungleboyj | Jeez. I haven't seen that test case fail in a long time. | 22:00 |
*** smarcet has joined #openstack-infra | 22:00 | |
jungleboyj | It is strange that that would be seen more in the LowerConstraints test. | 22:01 |
*** hamerins has quit IRC | 22:01 | |
*** bobh has joined #openstack-infra | 22:03 | |
fungi | if you haven't seen it in a while and it's failing with older versions of deps... | 22:03 |
jungleboyj | :-) Yeah. | 22:03 |
*** trown is now known as trown|outtypewww | 22:04 | |
clarkb | I've updated https://bugs.launchpad.net/openstack-gate/+bug/1808010 to indicate I think it's an interaction with the cirros tmpfs and not a cloud issue | 22:04 |
openstack | Launchpad bug 1808010 in OpenStack-Gate "Tempest cirros boots fail due to lack of disk space" [Undecided,New] | 22:04 |
ianw | clarkb: ++ thanks. i like it when a change gets like 3 authors ... shows the system is working :) | 22:04 |
clarkb | ianw: I think we are all invested in getting this going :) | 22:05 |
ianw | clarkb: hrm, this isn't related to a recent change we made calculating tempest disk size? not sure if that merged ... | 22:05 |
clarkb | ianw: it hasn't; mriedem linked to it and it's unmerged. But also cirros mounts /run as tmpfs so it's actually in memory | 22:05 |
clarkb | ianw: and it's only 200kb according to frickler's testing | 22:05 |
ianw | ah, ok, should read the bug | 22:06 |
fungi | well, /run is pretty ubiquitously mounted tmpfs by all distros | 22:06 |
clarkb | fungi: ya that's why it occurred to me it may not be disk when I saw it was /run that had a problem | 22:06 |
fungi | they don't generally even create a /run directory unless it's going to be used for pre-rootfs situations | 22:07 |
*** EmilienM has quit IRC | 22:07 | |
fungi | and it's pretty much always teensy too | 22:07 |
clarkb | fungi: my hunch here is that cirros is abusing /run this way because config drive can tell you things about what goes into fstab | 22:08 |
*** EmilienM has joined #openstack-infra | 22:08 | |
clarkb | so it's processing the config drive before it has a real disk to write to, because it may have to set up those real disks itself | 22:08 |
clarkb | but unfortunately it is leading to broken networking due to constraints being run up against | 22:08 |
clarkb | it's also not a super common error, so we may not want to spend too many cycles on it while debugging more common ones first. It is being tracked by e-r now, so we should see if it persists, gets worse, or bubbles to the top of the list as we fix other stuff | 22:10 |
clarkb | according to e-r the top four bugs seem related to timeouts and network issues | 22:11 |
clarkb | there was a spike in those that I haven't debugged because it went away. Guessing a temporary provider issue | 22:12 |
clarkb | after that is http://status.openstack.org/elastic-recheck/#1807518 which I just pushed a backport to rocky in devstack for (so hopefully those go away) | 22:12 |
clarkb | then it's a long, long tail of all the random things that are unreliable | 22:12 |
clarkb | mwhahaha: EmilienM ssbarnea|rover it seems that the centos-ceph-luminous mirror your jobs are talking to may be getting increasingly flaky | 22:18 |
EmilienM | damn | 22:18 |
EmilienM | weshay: ^ | 22:18 |
clarkb | http://mirror.dfw.rax.openstack.org/centos/7/storage/x86_64/ceph-luminous/ is something that I think we do mirror for you | 22:18 |
mwhahaha | we don't have any of those jobs in the gate anymore | 22:19 |
clarkb | so may just be a matter of switching to the in region mirrors for ceph-luminous | 22:19 |
mwhahaha | but yes we should check that out, not sure which we're using | 22:19 |
clarkb | I'm looking at gate e-r graphs (and the logstash links for them) | 22:19 |
clarkb | http://status.openstack.org/elastic-recheck/gate.html#1708704 specifically that one | 22:20 |
mwhahaha | we're not running any jobs that should require ceph | 22:20 |
mwhahaha | but will need to look | 22:20 |
clarkb | more than 50% of the failures in the gate are in the last 24 hours and they fail fetching centos-ceph-luminous from mirror.centos.org | 22:20 |
mwhahaha | hmm it's mirror.centos.org | 22:21 |
*** anteaya has joined #openstack-infra | 22:21 | |
mwhahaha | Failed to connect to 2607:f130:0:87::10: Network is unreachable | 22:21 |
clarkb | yup | 22:21 |
mwhahaha | ipv6'd | 22:21 |
clarkb | but we mirror it for you locally at http://mirror.dfw.rax.openstack.org/centos/7/storage/x86_64/ceph-luminous/ (replace region-specific data as necessary) | 22:21 |
mwhahaha | yea let me go find the config | 22:22 |
mwhahaha | is that built into the image maybe? | 22:22 |
mwhahaha | cause i'm seeing NODEPOOL_CENTOS_MIRROR referenced in quickstart | 22:23 |
clarkb | zuul drops some hints as to where to find the various mirrors (nodepool did it in the past so the vars say nodepool for compat) | 22:23 |
clarkb | it writes /etc/ci/mirror_info.sh iirc. Let me see if I can find that | 22:23 |
mwhahaha | yea we use that | 22:23 |
mwhahaha | so i need to find out why that one isn't set | 22:23 |
*** markvoelker has joined #openstack-infra | 22:24 | |
mwhahaha | oh this is before we even get to our config | 22:24 |
mwhahaha | so yea it's the repos from the image | 22:24 |
mwhahaha | http://logs.openstack.org/23/624323/1/gate/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/98dc676/job-output.txt#_2018-12-11_21_45_14_659825 | 22:24 |
clarkb | http://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/roles/mirror-info/templates/mirror_info.sh.j2 | 22:24 |
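For reference, a minimal sketch of how a job consumes those hints; NODEPOOL_CENTOS_MIRROR is the variable mentioned above, and the example URL and path below are only illustrative:

    source /etc/ci/mirror_info.sh
    echo "$NODEPOOL_CENTOS_MIRROR"   # e.g. http://mirror.dfw.rax.openstack.org/centos
    curl -sI "$NODEPOOL_CENTOS_MIRROR/7/storage/x86_64/ceph-luminous/" | head -n1   # sanity-check the in-region mirror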
*** wolverineav has quit IRC | 22:24 | |
mwhahaha | this is in pre-run | 22:24 |
clarkb | ya that should run very early in our base job | 22:25 |
mwhahaha | right so the pre roles don't properly configure the mirrors | 22:25 |
mwhahaha | not the tripleo stuff | 22:25 |
mwhahaha | we're configuring to use the mirrors | 22:25 |
mwhahaha | so this is likely the repo config of the image | 22:25 |
*** wolverineav has joined #openstack-infra | 22:25 | |
clarkb | the image doesn't have that data, we apply it in the job itself | 22:25 |
clarkb | and our base pre run should run before your pre run does | 22:26 |
clarkb | yes it is part of the base jobs defined in project-config | 22:26 |
mwhahaha | the images come with /etc/yum.repos.d configured | 22:26 |
mwhahaha | with the defaults from centos | 22:26 |
*** slaweq has quit IRC | 22:26 | |
mwhahaha | we're actually clearing out those configs when our code starts | 22:27 |
clarkb | why would centos have random repos enabled by default | 22:27 |
clarkb | (I've quickly grepped and project-config dib elements don't add it at least) | 22:27 |
* mwhahaha shrugs | 22:27 | |
clarkb | ianw: ^ this may interest you | 22:27 |
clarkb | I wonder if this is new with 7.6 | 22:28 |
mwhahaha | so by default all the CentOS-* files are in the cloud image | 22:29 |
mwhahaha | http://logs.openstack.org/23/624323/1/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/bcbde3b/logs/undercloud/etc/yum.repos.d/ | 22:29 |
mwhahaha | we turn them off when we run our ci code | 22:29 |
clarkb | but those failures are happening before the disabling occurs? | 22:30 |
mwhahaha | yes | 22:31 |
mwhahaha | this is before any of the tripleo code runs | 22:31 |
mwhahaha | this is just basic infra prep | 22:31 |
mwhahaha | to install OVS | 22:31 |
mwhahaha | for the multinode setup | 22:32 |
clarkb | but why would it care about the ceph repo in that case? I guess yum has to scan all the repos to see where the most appropriate ovs package lives? | 22:32 |
mwhahaha | yum update tries to get all the metadata | 22:32 |
mwhahaha | or yum install | 22:32 |
mwhahaha | if it doesn't exist | 22:32 |
mwhahaha | so it errors | 22:32 |
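In other words, any enabled repo whose metadata is unreachable can sink an unrelated install. A hedged illustration; the repo id and package name are assumptions based on the repos and the OVS install discussed here:

    yum repolist enabled                     # every repo listed here gets its metadata refreshed
    sudo yum install -y openvswitch          # fails if any enabled repo's metadata can't be fetched
    sudo yum install -y --disablerepo=centos-ceph-luminous openvswitch   # one-off workaround while debugging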
fungi | infra-root: is someone grooming the openstackadmin account on github right now? seeing some address removals/confirmations and just want to be sure it's one of us (i expect it's related to the discussion in our meeting but would like to be sure) | 22:33 |
clarkb | fungi: ianw volunteered in the meeting today | 22:33 |
*** jamesmcarthur_ has quit IRC | 22:33 | |
fungi | cool. ianw: i guess those are you? | 22:34 |
fungi | (removed root@o.o, confirmed infra-root@o.o...) | 22:34 |
mwhahaha | clarkb: so it's that role | 22:34 |
ianw | fungi: yep, poking at it now | 22:34 |
mwhahaha | clarkb: http://git.openstack.org/cgit/openstack-infra/zuul-jobs/tree/roles/multi-node-bridge/tasks/common.yaml#n10 probably drops that storage repo in place | 22:34 |
*** jamesmcarthur has joined #openstack-infra | 22:35 | |
fungi | ianw: perfect. thanks again! | 22:35 |
mwhahaha | clarkb: but there is no code to swap out the mirrors | 22:35 |
mwhahaha | clarkb: so it uses what is shipped from centos-release-openstack-queens | 22:35 |
*** jamesmcarthur has quit IRC | 22:35 | |
clarkb | mwhahaha: gotcha, fwiw http://logs.openstack.org/18/607318/1/gate/tripleo-ci-centos-7-standalone/fbbd3a3/zuul-info/ also exhibits this behavior and is a single node test | 22:35 |
mwhahaha | yea so it's any centos job that installs OVS | 22:35 |
*** jamesmcarthur has joined #openstack-infra | 22:35 | |
clarkb | mwhahaha: not sure why it would be running multinode setup if it is single node (that might be a separate cleanup) | 22:35 |
mwhahaha | clarkb: we use ovs for fake interfaces | 22:36 |
*** jamesmcarthur has quit IRC | 22:36 | |
mwhahaha | but the issue is that the multi-node-bridge role does not properly configure mirrors to install ovs from | 22:36 |
clarkb | does centos-release-openstack-queens imply centos-ceph-luminous transitively? | 22:36 |
mwhahaha | likely | 22:36 |
fungi | clarkb: do we miss setting a mirror url for the ovs packages? | 22:36 |
mwhahaha | clarkb: yes, https://rpmfind.net/linux/RPM/centos/extras/7.6.1810/x86_64/Packages/centos-release-openstack-queens-1-2.el7.centos.noarch.html | 22:37 |
fungi | is that the summary? | 22:37 |
*** jamesmcarthur has joined #openstack-infra | 22:37 | |
clarkb | fungi: possibly? I'm not sure if we set the mirror properly for the rdo/openstack repo | 22:37 |
clarkb | and then ceph is an unexpected addition | 22:37 |
clarkb | or if we fail to set both of them | 22:37 |
mwhahaha | so the repos get added and removed in multi-node-bridge | 22:37 |
mwhahaha | http://git.openstack.org/cgit/openstack-infra/zuul-jobs/tree/roles/multi-node-bridge/tasks/common.yaml#n45 | 22:37 |
clarkb | looks like we don't really do much between add repo and install package | 22:37 |
mwhahaha | so it adds the stock repos, installs ovs, removes the repos | 22:37 |
clarkb | so likely unset for both repos | 22:37 |
*** bobh has quit IRC | 22:38 | |
*** bobh has joined #openstack-infra | 22:38 | |
ianw | clarkb / fungi: so that turned out to be rather easy ... when you get a minute do you want to look at the password file and try logging into the shared account with 2fa token as described there? | 22:39 |
*** jamesmcarthur has quit IRC | 22:40 | |
clarkb | ianw: ya I can try when I've paged this ovs/ceph stuff out | 22:40 |
*** jtomasek_ has quit IRC | 22:40 | |
*** bobh has quit IRC | 22:41 | |
*** _alastor_ has joined #openstack-infra | 22:41 | |
*** jamesmcarthur has joined #openstack-infra | 22:42 | |
*** slaweq has joined #openstack-infra | 22:44 | |
*** bobh has joined #openstack-infra | 22:44 | |
ianw | clarkb: also can you take a look at the stein mirroring request, seems straightforward -> https://review.openstack.org/#/c/621231/ | 22:45 |
*** boden has quit IRC | 22:46 | |
*** jamesmcarthur has quit IRC | 22:46 | |
*** slaweq has quit IRC | 22:48 | |
clarkb | mwhahaha: ianw: configure-mirror role tries to do this for centos but only applies it for epel and the base os/ portion of the mirror | 22:51 |
clarkb | I think it will work if we write out the file that specifies centos-ceph-luminous and disable it like we do with epel. Any idea where I can find a copy of that file? | 22:52 |
clarkb | https://github.com/CentOS-Storage-SIG/centos-release-ceph-luminous/blob/master/CentOS-Ceph-Luminous.repo that maybe? | 22:54 |
ianw | clarkb: won't the package install overwrite that? in the epel case, we have epel-release package installed | 22:55 |
*** yamamoto has joined #openstack-infra | 22:55 | |
clarkb | ianw: maybe? I know very little about how centos is expected to work. It's all a foreign language to me, particularly the way everything is in a different repo and you have to do something special to install what seems like every other package | 22:57 |
clarkb | ianw: we configure epel with this j2 file https://git.openstack.org/cgit/openstack-infra/zuul-jobs/tree/roles/configure-mirrors/templates/etc/yum.repos.d/epel.repo.j2 | 22:58 |
clarkb | seems like we set it to enabled=0 then expect something else to enable it. Can we write out a CentOS-Ceph-Luminous.repo file in a similar way and have the package that installs the repo flip the bit or will it overwrite entirely? | 22:59 |
ianw | clarkb: i think the package will overwrite it. for epel, we have pre-installed the package with https://git.openstack.org/cgit/openstack/diskimage-builder/tree/diskimage_builder/elements/epel | 23:00 |
clarkb | as an alternative we can have the multi-node-bridge role do a text substitution on that file after the packages install the repo | 23:00 |
clarkb | but before we install ovs | 23:00 |
clarkb | oh got it | 23:00 |
ianw | the idea for epel is that you do "yum install --enablerepo=epel ..." so we know what we're dragging in explicitly | 23:00 |
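Roughly what that pattern looks like on disk; the stanza below is an assumption sketched from the template linked above rather than its literal contents, and <package> is a placeholder:

    cat /etc/yum.repos.d/epel.repo
    # [epel]
    # name=Extra Packages for Enterprise Linux 7
    # baseurl=http://mirror.dfw.rax.openstack.org/epel/7/x86_64/   # in-region mirror, assumed path
    # enabled=0                                                    # the key bit: off by default
    # gpgcheck=0
    sudo yum install --enablerepo=epel -y <package>   # jobs opt in explicitly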
clarkb | in that case maybe the text substitution in the multi-node-bridge role is better | 23:00 |
mwhahaha | it's really specific to that role, so if the mirrors exist in the ansible vars then do a text substitution after the repo install and before the package install | 23:01 |
mwhahaha | this is the annoying problem with the CI repo configs: we end up duplicating this same thing all over the place | 23:01 |
ianw | hrm, i forget, we uninstall the repos after right? | 23:02 |
mwhahaha | in that role, yes | 23:02 |
mwhahaha | http://git.openstack.org/cgit/openstack-infra/zuul-jobs/tree/roles/multi-node-bridge/tasks/common.yaml#n45 | 23:02 |
mwhahaha | it's literally to just get the queens version of OVS | 23:02 |
ianw | yeah, that's right, but if they were there we don't | 23:02 |
ianw | and i think at one point we used to install RDO in the base package, but that caused problems, which is why we moved it "up" to this point | 23:03 |
clarkb | mwhahaha: yes, and every other distro avoids this problem by having A repo | 23:03 |
clarkb | (even fedora has everything in a single repo iirc) | 23:03 |
mwhahaha | pretty sure ubuntu has more than one | 23:03 |
mwhahaha | UCA is the extra one | 23:03 |
mwhahaha | anyway | 23:03 |
* mwhahaha then points to pypi, yum, docker, etc mirrors | 23:04 | |
*** eernst has joined #openstack-infra | 23:04 | |
ianw | yeah, not really centos's fault because its raison d'être is to be rhel-like, so if rhel doesn't have ovs in base then we end up like this | 23:04 |
clarkb | ya it just gets really complicated quickly | 23:05 |
*** bobh has quit IRC | 23:05 | |
ianw | we could install rdo like epel and disable it | 23:06 |
clarkb | ianw: I'm working on a lineinfile patch for multi-node-bridge | 23:06 |
*** bobh has joined #openstack-infra | 23:06 | |
clarkb | which will replace the remote with the mirror node | 23:06 |
clarkb | (I hope) | 23:07 |
mwhahaha | we used to always have the N-1 version installed by default but i think that caused more problems | 23:07 |
mwhahaha | it would be nice if we got OVS from something that only contained OVS | 23:07 |
clarkb | mwhahaha: ya we removed it from the image because that caused confusion too | 23:07 |
ianw | ++ | 23:07 |
mwhahaha | at this point i think just lineinfile mirror.centos.org with the local mirrors is probably the best bet | 23:08 |
ianw | yes the KISS approach | 23:08 |
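The change at 624525 does this with an Ansible lineinfile task; below is a rough shell equivalent of the substitution, run after the release RPMs drop their repo files and before installing OVS. The repo file names are taken from the pastes above; treat the exact paths as assumptions.

    source /etc/ci/mirror_info.sh
    for f in /etc/yum.repos.d/CentOS-OpenStack-queens.repo \
             /etc/yum.repos.d/CentOS-Ceph-Luminous.repo; do
        # swap the upstream mirror for the in-region one
        sudo sed -i "s|http://mirror.centos.org/centos|$NODEPOOL_CENTOS_MIRROR|g" "$f"
    done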
mwhahaha | though i wonder how that plays in with the uninstall if the file is changed | 23:08 |
* mwhahaha shrugs | 23:08 | |
ianw | i think we almost had linuxbridge working for multinode too? i remember that being a possibility for removing ovs | 23:08 |
ianw | by "we", i mean clarkb, i didn't do anything useful :) | 23:09 |
clarkb | ianw: neutron assumed ovs unfortunately | 23:10 |
clarkb | so it got tricky to untangle the unfortunate dep in devstack + neutron on that ovs bridge existing | 23:11 |
clarkb | and I gave up | 23:11 |
clarkb | it's entirely doable at this point if we get devstack + neutron to learn how to plug the linux bridge bridge into its own ovs bridges | 23:11 |
clarkb | anyone know where to get a copy of /etc/yum.repos.d/CentOS-OpenStack-queens.repo ? | 23:12 |
openstackgerrit | Jp Maxwell proposed openstack-infra/project-config master: Adding the netlify-sandbox project https://review.openstack.org/624523 | 23:13 |
openstackgerrit | Matt Riedemann proposed openstack-infra/elastic-recheck master: Add query for glance-api proxy error bug 1808063 https://review.openstack.org/624524 | 23:13 |
openstack | bug 1808063 in OpenStack-Gate "glanceclient.exc.HTTPBadGateway: 502 Proxy Error during server snapshot" [Undecided,Confirmed] https://launchpad.net/bugs/1808063 | 23:13 |
mriedem | clarkb: ^ | 23:13 |
*** slaweq has joined #openstack-infra | 23:14 | |
ianw | clarkb: http://paste.openstack.org/show/737099/ i think, from https://www.rdoproject.org/repos/rdo-release.rpm | 23:15 |
*** kjackal has quit IRC | 23:15 | |
mwhahaha | http://mirror.centos.org/centos/7/extras/x86_64/Packages/centos-release-openstack-queens-1-2.el7.centos.noarch.rpm | 23:15 |
* mwhahaha is downloading to fetch | 23:16 | |
mwhahaha | http://paste.openstack.org/show/737100/ | 23:17 |
mwhahaha | it's more than just the rdo-release | 23:17 |
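For anyone repeating that check, a sketch of peeking at the repo files a release RPM ships without installing it; the URL is the one pasted above:

    curl -O http://mirror.centos.org/centos/7/extras/x86_64/Packages/centos-release-openstack-queens-1-2.el7.centos.noarch.rpm
    rpm2cpio centos-release-openstack-queens-1-2.el7.centos.noarch.rpm | cpio -idmv
    cat etc/yum.repos.d/*.repo   # extracted relative to the current directory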
*** _alastor_ has quit IRC | 23:17 | |
mwhahaha | if you swap out mirror.centos.org and buildlogs.centos.org i think we have mirrors for those | 23:17 |
mwhahaha | though only mirror.centos.org is the one that is enabled | 23:18 |
openstackgerrit | Clark Boylan proposed openstack-infra/zuul-jobs master: Use mirrors if available when installing OVS on centos https://review.openstack.org/624525 | 23:18 |
clarkb | ya I was just doing mirror.centos.org since it is the only one enabled | 23:18 |
clarkb | I think something like ^ should work | 23:18 |
mwhahaha | http://paste.openstack.org/show/737101/ is the ceph one | 23:19 |
clarkb | I don't think multi-node-bridge is a trusted role, so we should be able to Depends-On that change from a tripleo change to make sure it works | 23:19 |
mwhahaha | yea that should work | 23:19 |
*** slaweq has quit IRC | 23:19 | |
clarkb | mwhahaha: care to push that Depends-On test change? (I don't know what would be a good representative set) | 23:20 |
mwhahaha | sure | 23:20 |
clarkb | thanks | 23:20 |
mwhahaha | https://review.openstack.org/#/c/624526/ | 23:21 |
mwhahaha | will get an assortment of jobs | 23:21 |
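The cross-repo test works through a commit message footer on 624526; Zuul then runs that change's jobs with the unmerged zuul-jobs patch applied. A sketch of adding the footer (the amend step is illustrative):

    git commit --amend   # append a footer line: Depends-On: https://review.openstack.org/624525
    git review           # re-push the test change; its jobs now pull in the zuul-jobs fix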
*** jamesmcarthur has joined #openstack-infra | 23:22 | |
*** eernst has quit IRC | 23:25 | |
*** jamesmcarthur has quit IRC | 23:26 | |
melwitt | clarkb: mriedem just told me about https://bugs.launchpad.net/openstack-gate/+bug/1808010 while I was looking at a failed job run, but in the log I see "WARN: failed: route add -net "0.0.0.0/0" gw "10.1.0.1"" but not any messages about no space left. is that a separate known launchpad bug or do you think it's the same thing? | 23:27 |
openstack | Launchpad bug 1808010 in OpenStack-Gate "Tempest cirros boots fail due to lack of disk space" [Undecided,New] | 23:27 |
melwitt | http://logs.openstack.org/82/623282/3/check/nova-next/a900344/logs/testr_results.html.gz | 23:27 |
clarkb | melwitt: I thought it was the same thing | 23:28 |
*** smarcet has quit IRC | 23:28 | |
melwitt | ok, thanks | 23:28 |
clarkb | melwitt: in the bug it has messages about the disk errors | 23:28 |
clarkb | they happen before failing to set the route | 23:29 |
melwitt | yeah, I don't see them in the cirros log excerpt on the job I was looking at (linked above) so I wasn't sure | 23:29 |
clarkb | huh maybe disk space isn't the root cause then | 23:29 |
clarkb | I'm pretty sure the broken default route is what breaks ssh | 23:29 |
clarkb | and thought it was caused by the disk issue | 23:29 |
melwitt | but indeed when I search on logstash for the failed route add, I see most of the hits coming from the networking-odl-tempest-fluorine job, all failures | 23:30 |
*** slaweq has joined #openstack-infra | 23:36 | |
openstackgerrit | Jp Maxwell proposed openstack-infra/project-config master: Adding the netlify-sandbox project https://review.openstack.org/624523 | 23:38 |
clarkb | melwitt: we probably want to better understand what could cause that route add failure | 23:40 |
clarkb | and go from there | 23:40 |
clarkb | cirros runs busybox so it may be different than whatever distro you have locally too | 23:40 |
*** slaweq has quit IRC | 23:40 | |
*** armax has quit IRC | 23:42 | |
*** xek_ has joined #openstack-infra | 23:43 | |
*** xek has quit IRC | 23:46 | |
*** smarcet has joined #openstack-infra | 23:46 | |
melwitt | clarkb: ack, thanks | 23:49 |
melwitt | I added a note to the launchpad | 23:49 |
*** dklyle has joined #openstack-infra | 23:51 | |
openstackgerrit | Ian Wienand proposed openstack-infra/system-config master: Enable github shared admin account https://review.openstack.org/624531 | 23:52 |
*** xarses has quit IRC | 23:59 |