fungi | i was able to launch one myself in az2 (and delete it) | 00:02 |
---|---|---|
*** david-lyle_ has quit IRC | 00:06 | |
*** sarob has joined #openstack-infra | 00:13 | |
*** sdake has quit IRC | 00:14 | |
*** gokrokve has quit IRC | 00:17 | |
*** sarob has quit IRC | 00:18 | |
fungi | oh, i see one issue... we have name-filter: 'Performance' on the bare-precise entries for hpcloud | 00:33 |
fungi | that explains why we're getting the flavor error | 00:37 |
clarkb | odd that it seems to have recently stopped building slaves though | 00:37 |
fungi | yeah, however i can launch one myself manually, so something's up with nodepool i'm thinking | 00:38 |
fungi | still digging | 00:38 |
*** salv-orlando_ has joined #openstack-infra | 00:40 | |
*** salv-orlando has quit IRC | 00:40 | |
*** salv-orlando_ is now known as salv-orlando | 00:40 | |
clarkb | we should fix the performance thing. if you edit the file locally nodepool should just pick it up | 00:41 |
fungi | yeah, i'm going to | 00:41 |
fungi | i'm uploading a patch too while i'm thinking about it so we don't forget | 00:41 |
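For context, the filter fungi is removing lives in the image entries of nodepool.yaml, where it narrows which flavor names nodepool will pair with `min-ram`. An illustrative fragment, with the provider name, base image, and values assumed rather than copied from the real file:

```yaml
providers:
  - name: hpcloud-az2              # hypothetical provider entry
    images:
      - name: bare-precise
        base-image: 'Ubuntu Precise 12.04 LTS Server 64-bit'   # assumed
        min-ram: 8192                                          # assumed
        # name-filter: 'Performance'   # the entry being removed; no flavor
        #                              # here matches it, hence the error
```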
mordred | clarkb, fungi: we should also figure out how to get off of az[1-3] and on to the 1.1 cloud at HP - because the old azs are going to go away at some point | 00:42 |
mordred | sdague: yes. we should put that in the check queue - but that requires slightly more tooling - which I'm working on | 00:42 |
clarkb | mordred "figure out" | 00:42 |
*** gokrokve has joined #openstack-infra | 00:43 | |
mordred | clarkb: yeah. | 00:43 |
clarkb | mordred I'm not sure there is anything we can do. we just increased test time by a major factor | 00:43 |
mordred | clarkb: we still need to test nodes that are twice the size | 00:43 |
clarkb | mordred we can use rax for all those tests and hpcloud for single use unittesters | 00:43 |
mordred | since that's supposed to give us twice the cpu throttling allocation | 00:43 |
mordred | oh wow: min-ram: 30720 | 00:46 |
mordred | we're asking for pretty large nodes in region-b | 00:46 |
clarkb | mordred: initial testing of that had very poor results | 00:46 |
mordred | spectacular | 00:46 |
clarkb | yes and it didnt help | 00:46 |
clarkb | well it helped a tiny bit but not 2x | 00:46 |
mordred | well - maybe lifeless cloud will save us | 00:46 |
clarkb | mt rainier is out! | 00:47 |
mordred | nice! | 00:47 |
openstackgerrit | Jeremy Stanley proposed a change to openstack-infra/config: Remove incorrect name filters from nodepool config https://review.openstack.org/67684 | 00:51 |
*** talluri has joined #openstack-infra | 00:52 | |
*** talluri has quit IRC | 00:53 | |
*** talluri has joined #openstack-infra | 00:53 | |
fungi | aha, okay so manually launching a node in az2 from the webui seems to work, but launching one using novaclient hangs at "Instance building... 0% complete" | 00:54 |
clarkb | mordred: mattoliverau: I have to say the tmpfs/eatmydata zuul idea was really good. seems to still be humming along | 00:54 |
clarkb | fungi: weird, nova api version trouble maybe? | 00:55 |
sdague | clarkb: so fwiw, people are still approving stable/havana changes into the gate | 00:58 |
sdague | there are a few still in there | 00:58 |
sdague | I'm very tempted to bulk -2 all of stable/havana | 00:58 |
sdague | to prevent more of that | 00:58 |
*** talluri has quit IRC | 00:59 | |
clarkb | sdague: or, and this is eviler, delete the stable/havana branch permissions :) | 00:59 |
sdague | actually, that might be less evil. Then I don't have to bulk unset it | 00:59 |
fungi | we *could* just remove approve from the all-projects acl entry for refs/stable/* | 00:59 |
fungi | that way stable release managers can still +2, just can't approve | 01:00 |
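The change fungi is describing would go in the All-Projects access rules (project.config in refs/meta/config). A hedged sketch of the idea, with the ref pattern, group name, and label name assumed rather than taken from the live ACL:

```ini
[access "refs/heads/stable/*"]
    label-Code-Review = -2..+2 group stable-maint    # +2 stays available
    # label-Approved = +0..+1 group stable-maint     # approve permission dropped
```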
sdague | fungi: sounds good to me | 01:00 |
fungi | lemme finish formulating this support case with hpcloud first while you discuss amongst yourselves | 01:00 |
sdague | I also think we should remove reverify completely | 01:01 |
clarkb | my reason for it being eviler is it is a bit like a coup | 01:01 |
sdague | because what happens is patch authors will reverify their code because they want it in, and are never reading the ML about things that will or will not work | 01:02 |
*** odyssey4me has quit IRC | 01:02 | |
sdague | if it's only cores that can toggle the approved, then it should be a responsible set | 01:02 |
*** Sonicwall[A] has joined #openstack-infra | 01:03 | |
*** Sonicwall[A] has left #openstack-infra | 01:03 | |
fungi | how about this... we unset any approval votes on stable changes and send an e-mail to the stable management ml pleading with them not to approve until the bug(s) linked in that message are resolved (and to please pitch in if they can) | 01:03 |
clarkb | removing reverify entirely has long been my stance; definite +2 for that from me | 01:03 |
sdague | fungi: so emails aren't helping, that was a set of approves | 01:03 |
fungi | did e-mails about it go to the -dev ml or the stable branch ml? | 01:04 |
sdague | -dev | 01:04 |
fungi | i wonder if some of them don't read -dev as regularly | 01:04 |
sdague | well they should be | 01:05 |
fungi | 'course they might just not read any lists regularly, the stable branch ml included | 01:05 |
sdague | seriously, we can't be going and tracking down every freaking bad actor | 01:05 |
fungi | oh, my boot test finally went to 100% but now any ssh attempt to the resulting vm gets an immediate connection closed | 01:07 |
mordred | clarkb: yay re: tmpfs | 01:07 |
fungi | clarkb: it was pointed out last night that we missed one more thing... i got the rechecks page working again by stopping recheckwatch on the old zuul, copying the pickle and report from it to new zuul and starting the service there | 01:09 |
*** dcramer_ has quit IRC | 01:09 | |
mordred | I approve removing +A from refs/stable/* | 01:09 |
mordred | and then we can let bad actors declare themselves | 01:09 |
mordred | when they complain | 01:09 |
*** odyssey4me has joined #openstack-infra | 01:10 | |
sdague | you also need to reset all the approved bits, so reverifies don't happen | 01:12 |
sdague | or kill reverify | 01:12 |
mordred | we could block reverify on stable/* too | 01:12 |
sdague | ok, well, I have a bulk -2 script I can loop on now. Or someone else can take those on | 01:13 |
openstackgerrit | Derek Higgins proposed a change to openstack-infra/config: Add some dependencies required by toci https://review.openstack.org/67685 | 01:13 |
*** sarob has joined #openstack-infra | 01:13 | |
mordred | fungi: do we have any example negative lookahead regexes in zuul anywhere? | 01:14 |
mordred | ref: ^(?!(refs/stable/.*)).*$ | 01:16 |
mordred | perhaps? | 01:16 |
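Zuul's layout.yaml already uses this kind of pattern for ref filters, so a negative lookahead like the one above would sit on a pipeline trigger. A rough sketch, with the pipeline name and event chosen for illustration:

```yaml
pipelines:
  - name: post
    trigger:
      gerrit:
        - event: ref-updated
          ref: ^(?!(refs/stable/.*)).*$    # match everything except refs/stable/*
```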
sdague | actually, I'm going to bulk recheck all the stable havana jobs | 01:18 |
*** sarob has quit IRC | 01:18 | |
sdague | that don't already have a -1 | 01:18 |
mordred | ++ | 01:18 |
sdague | then they'll get a -1 and hopefully the reviewers won't be silly | 01:18 |
sdague | though with the node starvation, it will slow down the rest of things. But best idea I have. | 01:21 |
mordred | sdague: why not remove +A? | 01:21 |
sdague | doesn't solve reverify | 01:21 |
clarkb | it does | 01:21 |
sdague | also, I don't have those permissions. | 01:21 |
clarkb | zuul wont reverify without the votes | 01:21 |
sdague | ok, well, I already fired off the bulk recheck | 01:22 |
*** gokrokve has quit IRC | 01:23 | |
mordred | well - I just removed +A on stable/* from stable-maint | 01:23 |
mordred | between the two, let's see how it goes | 01:24 |
sdague | mordred: cool | 01:24 |
mordred | I'm going to announce that I've done that too | 01:24 |
sdague | mordred: or don't, and see who complains :) | 01:24 |
sdague | who didn't keep up with the list | 01:24 |
*** derekh has quit IRC | 01:24 | |
*** Hefeweizen has quit IRC | 01:26 | |
*** Hefeweizen has joined #openstack-infra | 01:26 | |
mordred | :) | 01:26 |
mordred | sdague: so how do we work on fixing the problem if stable/* is blocked for +A? | 01:27 |
sdague | mordred: a possible patch is in the queue | 01:28 |
sdague | https://review.openstack.org/#/c/67425/ | 01:28 |
sdague | though I have not tested a stable/havana change behind it | 01:29 |
sdague | so we could promote that to see | 01:29 |
sdague | I found and pulled out two more stable/havana changes from the queue | 01:34 |
sdague | and getting called to dinner, night all | 01:34 |
clarkb | ok, into areas of I-5 with poor service; going to afk now too | 01:38 |
*** praneshp has joined #openstack-infra | 01:38 | |
*** morganfainberg|z has quit IRC | 01:39 | |
*** morganfainberg|z has joined #openstack-infra | 01:40 | |
*** morganfainberg|z is now known as morganfainberg | 01:40 | |
*** morganfainberg is now known as Guest52195 | 01:40 | |
*** FallenPegasus has joined #openstack-infra | 01:44 | |
fungi | that failing glance change in the gate hit a socket timeout pip-installing sqlalchemy on both its pep8 and python27 jobs | 01:47 |
clarkb | fungi I think rax has had network blips today | 01:48 |
clarkb | I did not check their status page though | 01:48 |
*** FallenPegasus has quit IRC | 01:48 | |
fungi | yeah, two different bare-precise nodes in iad | 01:49 |
fungi | status.r.c says investigating potential issue for next-gen cloud servers in london, but otherwise green | 01:50 |
*** talluri has joined #openstack-infra | 01:55 | |
*** talluri has quit IRC | 01:59 | |
fungi | actually, playing around with the az2 problem, i think it may just be our old friend ssh timeout | 02:00 |
clarkb | 120 seconds not long enough? this edge connection is surprisingly useable for irc | 02:02 |
fungi | yeah, i think it's taking longer. i was able to get into an az2 i launched after waiting a few minutes | 02:09 |
*** milki has quit IRC | 02:09 | |
*** milki has joined #openstack-infra | 02:10 | |
*** sarob has joined #openstack-infra | 02:13 | |
*** sarob has quit IRC | 02:18 | |
*** oubiwann_ has quit IRC | 02:18 | |
*** oubiwann_ has joined #openstack-infra | 02:20 | |
*** thuc has joined #openstack-infra | 02:21 | |
fungi | yep, bumping my test script to sleep 300 seconds allows me to continue building... | 02:28 |
fungi | in fact, we've already got it set to 180 | 02:31 |
fungi | for hpcloud | 02:31 |
*** thuc has quit IRC | 02:32 | |
*** thuc has joined #openstack-infra | 02:33 | |
*** thuc has quit IRC | 02:37 | |
*** rfolco has joined #openstack-infra | 02:40 | |
fungi | actually 180 seems to be enough for my tests too | 02:42 |
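A rough sketch of the kind of manual boot-and-ssh check fungi describes, with the image, flavor, and key names as placeholders; the only real point is the longer wait before the first ssh attempt:

```bash
#!/bin/bash
# Hypothetical reproduction script for the az2 ssh-timeout theory.
nova boot --image precise-server --flavor standard.medium \
    --key-name infra-root testnode-az2
sleep 300   # 180 seconds turned out to be enough; 120 was not
IP=$(nova show testnode-az2 | awk '/accessIPv4/ {print $4}')
ssh -o StrictHostKeyChecking=no ubuntu@"$IP" true && echo "ssh works"
```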
*** talluri has joined #openstack-infra | 02:56 | |
*** talluri has quit IRC | 03:01 | |
*** ok_delta has quit IRC | 03:06 | |
*** HenryG has joined #openstack-infra | 03:07 | |
*** rfolco has quit IRC | 03:10 | |
*** talluri has joined #openstack-infra | 03:11 | |
*** ok_delta has joined #openstack-infra | 03:13 | |
*** sarob has joined #openstack-infra | 03:13 | |
fungi | mmm, i'm starting to think nodepoold might be in a bad way with respect to az2, because all those "building" status nodes are from 4-7 hours ago, don't show up at all in nova list, and can't be nodepool delete'd | 03:16 |
fungi | http://paste.openstack.org/show/61504 | 03:16 |
fungi | sqlalchemy.exc.OperationalError: (OperationalError) (1205, 'Lock wait timeout exceeded; try restarting transaction') 'UPDATE node SET state=%s, state_time=%s WHERE node.id = %s' (4, 1390101192, 1131534L) | 03:17 |
*** sarob has quit IRC | 03:18 | |
fungi | i don't think i'll be able to cleanly stop nodepool, but will get it restarted and see whether that helps | 03:19 |
*** talluri_ has joined #openstack-infra | 03:20 | |
*** ok_delta has quit IRC | 03:20 | |
*** talluri has quit IRC | 03:24 | |
*** talluri_ has quit IRC | 03:29 | |
fungi | huh, actually it stopped cleanly | 03:31 |
fungi | seems to have fixed the inability to delete at least | 03:33 |
fungi | yeah, i think it's okay now. i'm deleting all the remaining stale nodes now | 03:41 |
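The cleanup being described uses the nodepool CLI; a minimal sketch (the grep pattern and the node id are illustrative, and assume the id is shown by `nodepool list`):

```bash
# Find nodes stuck in the building state, then delete them one at a time.
nodepool list | grep building
nodepool delete 1131534    # repeat for each stale node id
```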
*** yamahata has quit IRC | 03:45 | |
*** rakhmerov has joined #openstack-infra | 03:52 | |
*** rakhmerov1 has joined #openstack-infra | 03:53 | |
*** rakhmerov has quit IRC | 03:54 | |
*** rakhmerov1 has quit IRC | 03:55 | |
*** jhesketh has quit IRC | 04:03 | |
*** rakhmerov has joined #openstack-infra | 04:04 | |
*** obondarev_ has joined #openstack-infra | 04:04 | |
*** sarob has joined #openstack-infra | 04:13 | |
*** senk has joined #openstack-infra | 04:15 | |
*** sarob has quit IRC | 04:18 | |
*** obondarev_ has quit IRC | 04:28 | |
*** obondarev_ has joined #openstack-infra | 04:29 | |
*** senk has quit IRC | 04:31 | |
*** senk has joined #openstack-infra | 04:31 | |
*** obondarev_ has quit IRC | 04:45 | |
*** odyssey4me has quit IRC | 04:48 | |
*** DennyZhang has joined #openstack-infra | 04:53 | |
*** odyssey4me has joined #openstack-infra | 04:55 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 05:00 | |
clarkb | fungi weird, glad to know all is better now | 05:01 |
*** DinaBelova_ is now known as DinaBelova | 05:02 | |
fungi | still keeping an eye on it, but probably passing out soon | 05:03 |
*** DinaBelova is now known as DinaBelova_ | 05:09 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 05:09 | |
*** rakhmerov1 has joined #openstack-infra | 05:10 | |
*** rakhmerov has quit IRC | 05:10 | |
*** sarob has joined #openstack-infra | 05:13 | |
*** sarob has quit IRC | 05:18 | |
*** david-lyle_ has joined #openstack-infra | 05:21 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 05:28 | |
*** david-lyle_ has quit IRC | 05:32 | |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 05:37 | |
*** vkozhukalov has joined #openstack-infra | 05:38 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 05:43 | |
*** SergeyLukjanov is now known as SergeyLukjanov_a | 05:43 | |
*** senk has quit IRC | 05:44 | |
*** SergeyLukjanov_a is now known as SergeyLukjanov_ | 05:45 | |
*** DinaBelova_ is now known as DinaBelova | 05:45 | |
*** rakhmerov1 has quit IRC | 05:46 | |
*** DinaBelova is now known as DinaBelova_ | 05:49 | |
*** rakhmerov has joined #openstack-infra | 05:54 | |
*** rakhmerov has quit IRC | 05:55 | |
*** rakhmerov has joined #openstack-infra | 06:09 | |
*** sarob has joined #openstack-infra | 06:13 | |
*** rakhmerov has quit IRC | 06:13 | |
*** San_D has quit IRC | 06:14 | |
*** sarob has quit IRC | 06:18 | |
*** oubiwann_ has quit IRC | 06:19 | |
*** nati_ueno has joined #openstack-infra | 06:20 | |
*** nati_uen_ has joined #openstack-infra | 06:25 | |
*** nati_ueno has quit IRC | 06:28 | |
*** sarob has joined #openstack-infra | 06:45 | |
*** odyssey4me has quit IRC | 06:45 | |
*** sarob has quit IRC | 06:50 | |
*** sarob has joined #openstack-infra | 07:06 | |
*** rakhmerov has joined #openstack-infra | 07:09 | |
*** rakhmerov1 has joined #openstack-infra | 07:11 | |
*** rakhmerov has quit IRC | 07:11 | |
*** DennyZhang has quit IRC | 07:16 | |
*** madmike has joined #openstack-infra | 07:21 | |
*** rakhmerov1 has quit IRC | 07:22 | |
*** salv-orlando has quit IRC | 07:22 | |
*** salv-orlando has joined #openstack-infra | 07:22 | |
*** bnemec_ has joined #openstack-infra | 07:24 | |
*** rakhmerov has joined #openstack-infra | 07:25 | |
*** crank has quit IRC | 07:25 | |
*** mfink has quit IRC | 07:25 | |
*** bnemec has quit IRC | 07:25 | |
*** lifeless has quit IRC | 07:25 | |
*** akscram has quit IRC | 07:25 | |
*** bradm has quit IRC | 07:25 | |
*** obondarev has quit IRC | 07:25 | |
*** obondarev has joined #openstack-infra | 07:26 | |
*** sandywalsh has quit IRC | 07:26 | |
*** rakhmerov has quit IRC | 07:26 | |
*** rakhmerov1 has joined #openstack-infra | 07:26 | |
*** crank has joined #openstack-infra | 07:26 | |
*** bradm has joined #openstack-infra | 07:27 | |
*** lifeless has joined #openstack-infra | 07:27 | |
*** akscram has joined #openstack-infra | 07:28 | |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/storyboard-webclient: Allow use of node from packages https://review.openstack.org/67604 | 07:29 |
*** rakhmerov1 has quit IRC | 07:33 | |
*** sandywalsh has joined #openstack-infra | 07:39 | |
*** sarob has quit IRC | 07:56 | |
*** julim has joined #openstack-infra | 08:21 | |
*** julim has quit IRC | 08:25 | |
*** yolanda has joined #openstack-infra | 08:26 | |
*** derekh has joined #openstack-infra | 08:27 | |
*** sarob has joined #openstack-infra | 08:28 | |
*** rakhmerov has joined #openstack-infra | 08:30 | |
*** sarob has quit IRC | 08:33 | |
*** sarob has joined #openstack-infra | 08:44 | |
*** rakhmerov has quit IRC | 08:45 | |
*** sarob has quit IRC | 08:49 | |
*** derekh has quit IRC | 09:02 | |
*** sarob has joined #openstack-infra | 09:13 | |
*** sarob has quit IRC | 09:19 | |
*** flaper87|afk is now known as flaper87 | 09:31 | |
*** rakhmerov has joined #openstack-infra | 09:41 | |
*** rakhmerov has quit IRC | 09:47 | |
*** rakhmerov has joined #openstack-infra | 09:57 | |
*** praneshp_ has joined #openstack-infra | 10:02 | |
*** praneshp has quit IRC | 10:02 | |
*** praneshp_ is now known as praneshp | 10:02 | |
*** rakhmerov has quit IRC | 10:03 | |
*** elasticio has joined #openstack-infra | 10:17 | |
*** yolanda has quit IRC | 10:49 | |
*** emagana has joined #openstack-infra | 10:50 | |
*** rakhmerov has joined #openstack-infra | 11:00 | |
*** emagana has quit IRC | 11:02 | |
*** thuc has joined #openstack-infra | 11:04 | |
*** yolanda has joined #openstack-infra | 11:06 | |
*** rakhmerov has quit IRC | 11:09 | |
*** thuc has quit IRC | 11:13 | |
*** matrohon has quit IRC | 11:40 | |
*** odyssey4me has joined #openstack-infra | 11:42 | |
*** odyssey4me has quit IRC | 11:44 | |
*** rakhmerov has joined #openstack-infra | 12:05 | |
*** rakhmerov has quit IRC | 12:10 | |
*** praneshp has quit IRC | 12:12 | |
*** obondarev_ has joined #openstack-infra | 12:17 | |
*** elasticio has quit IRC | 12:34 | |
sdague | so I'm starting to feel that we need to take the jenkins outage and get the logs fixed, because our fail rate is as high as it was before russellb's concurrency patch, but we're pretty blind on what's causing it without console logs | 12:53 |
*** dizquierdo has joined #openstack-infra | 12:57 | |
*** DinaBelova_ is now known as DinaBelova | 12:57 | |
*** afazekas_ has joined #openstack-infra | 12:57 | |
openstackgerrit | Andreas Jaeger proposed a change to openstack-infra/config: Remove non-voting documentation gate job https://review.openstack.org/67702 | 13:03 |
*** rakhmerov has joined #openstack-infra | 13:07 | |
*** rakhmerov has quit IRC | 13:11 | |
*** sarob has joined #openstack-infra | 13:13 | |
*** sarob has quit IRC | 13:18 | |
*** DinaBelova is now known as DinaBelova_ | 13:25 | |
*** salv-orlando has quit IRC | 13:33 | |
*** DinaBelova_ is now known as DinaBelova | 13:40 | |
openstackgerrit | Sean Dague proposed a change to openstack-infra/config: add in elastic-recheck-unclassified report https://review.openstack.org/67591 | 13:44 |
*** obondarev_ has quit IRC | 14:06 | |
*** rakhmerov has joined #openstack-infra | 14:08 | |
*** rakhmerov has quit IRC | 14:12 | |
openstackgerrit | Sean Dague proposed a change to openstack-infra/devstack-gate: Timestamp setup logs https://review.openstack.org/67086 | 14:13 |
*** sarob has joined #openstack-infra | 14:13 | |
*** sarob has quit IRC | 14:18 | |
*** oubiwann_ has joined #openstack-infra | 14:29 | |
*** beagles has quit IRC | 14:43 | |
*** oubiwann_ has quit IRC | 14:46 | |
*** bknudson has left #openstack-infra | 14:52 | |
sdague | there also seems to be a zuul issue that if a failing change makes it to the top of queue, it's rerun - http://ubuntuone.com/49l5sd2U7JLrMzx8RGavPU | 14:56 |
*** dizquierdo has quit IRC | 15:04 | |
*** rakhmerov has joined #openstack-infra | 15:09 | |
*** rakhmerov has quit IRC | 15:14 | |
*** bknudson has joined #openstack-infra | 15:15 | |
*** coolsvap has joined #openstack-infra | 15:25 | |
*** obondarev has quit IRC | 15:41 | |
*** obondarev has joined #openstack-infra | 15:42 | |
*** skraynev has quit IRC | 15:44 | |
*** skraynev has joined #openstack-infra | 15:44 | |
*** luqas has joined #openstack-infra | 15:46 | |
mordred | sdague: I agree with all of the things in your email | 15:52 |
sdague | mordred: great, now we just need to implement them :) | 15:54 |
sdague | the early kick out is something I'd like to understand, specifically how we'd handle it | 15:54 |
sdague | mostly how do we signal back about it | 15:55 |
sdague | basically, under the current state of things, I don't think icehouse-2 is possible next week | 15:57 |
sdague | we've got comment typo fixes in the gate queue that have been there for 40 hrs | 15:58 |
mordred | sdague: early kick is tricky (responded to email) | 15:58 |
sdague | cool | 15:58 |
*** yolanda has quit IRC | 15:59 | |
mordred | sdague: the theory is that we should start streaming subunit results back to zuul, I believe | 15:59 |
mordred | hrm | 15:59 |
sdague | the gate is at a new level of bad | 15:59 |
mordred | and/or have the thing running/processing the tests be attached to the gearman bus so that it could read the stream and return a gearman status on detected fail | 15:59 |
sdague | and we're actually completely blind as to why, because we're losing at least 3/4 of the console logs from elastic search | 16:00 |
mordred | the work towards getting rid of jenkins was work towards being able to do early kick | 16:00 |
mordred | I wonder if maybe we should think about how to implement it without not-jenkins | 16:00 |
mordred | sdague: I _think_ zaro has the scp-plugin fix | 16:00 |
sdague | yeh, I think it needs implementing by march 1st, otherwise i3 will not be possible | 16:01 |
sdague | mordred: right, I keep hearing that :) | 16:01 |
mordred | sdague: :) | 16:01 |
mordred | sdague: is our incoming velocity higher than usual? or have things just gotten worse in the racy department? | 16:02 |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/config: Remove reverify entirely https://review.openstack.org/67708 | 16:05 |
sdague | so we've merged 48 changes since friday | 16:05 |
sdague | means we're merging only 30 changes / day right now | 16:06 |
sdague | git log --since=2014-01-17 --author=jenkins | grep '^commit' | wc -l on openstack/openstack | 16:06 |
mordred | wow | 16:06 |
notmyname | wow | 16:06 |
bknudson | how many could merge in a day? | 16:07 |
sdague | bknudson: if we weren't resetting... hundreds | 16:07 |
sdague | but the biggest issue in my mind right now is we're actually completely blind to *why* we are failing, and largely have been for the last 2 weeks | 16:08 |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/config: Early fail on pep8 in the check pipeline https://review.openstack.org/67709 | 16:08 |
mordred | sdague: ok. there's two of your things | 16:08 |
sdague | as we're seeing huge loss of logs going into elastic search | 16:08 |
sdague | mordred: awesome | 16:08 |
mordred | fungi: you awake? ^^ | 16:09 |
sdague | also, out of those 48, at least 2 were ninja merges that I had fungi do to relieve some of the fails | 16:09 |
mordred | sdague: https://github.com/jenkinsci/scp-plugin/pull/8 | 16:09 |
mordred | anybody else who wants to review some java ^^ | 16:09 |
*** DinaBelova is now known as DinaBelova_ | 16:10 | |
*** SergeyLukjanov_ is now known as SergeyLukjanov | 16:10 | |
*** rakhmerov has joined #openstack-infra | 16:10 | |
sdague | mordred: we could also post-populate the missing log files not from jenkins. We don't lose them to the log server, just to ES | 16:11 |
mordred | agree | 16:11 |
sdague | that would at least let the bulk query for ES actually help us sort issues | 16:11 |
sdague | there also seems to be a new zuul bug, where if a failing change hits the top of the gate, it gets restarted again instead of thrown out | 16:12 |
notmyname | mordred: why not run all the project-specific tests before the common integration tests (instead of just pep8)? | 16:13 |
*** sarob has joined #openstack-infra | 16:13 | |
sdague | notmyname: I think it's a judgement call of bang vs. buck. Throwing out on pep8 takes very little time, and saves a bunch of nodepool resources. | 16:15 |
*** rakhmerov has quit IRC | 16:15 | |
*** luqas has quit IRC | 16:16 | |
mordred | notmyname: I agree with you on that too - but it would require adding some new logic to zuul's config processing, where I can do the pep8 early right now | 16:16 |
notmyname | ok | 16:16 |
mordred | notmyname: specifically, I don't have a way of saying "run these _Three_ things and then when all are done run this additional thing" | 16:17 |
mordred | I think it's a feature we need, tbh | 16:17 |
notmyname | I was looking at the neutron one that just fell off the top of the gate queue (of course causing a gate flush). https://review.openstack.org/#/c/67475/ but I didn't realize those neutron tests were taking longer than the integration ones | 16:17 |
notmyname | mordred: isn't a feature of zuul the composability of the jobs? run set one, then set two | 16:17 |
*** sarob has quit IRC | 16:18 | |
zaro | clarkb: i've added the additional logging to scp-plugin. new build is on review-dev.o.o | 16:19 |
sdague | notmyname: well, actually neutron at top was the other issue | 16:19 |
sdague | where that thing failed deeper in the queue | 16:19 |
fungi | zaro: jenkins-dev? | 16:19 |
sdague | the stuff above it merged | 16:19 |
sdague | and zuul then reconnected the pipeline to it | 16:19 |
zaro | ohh yeah, jenkins-dev. | 16:19 |
sdague | <sdague> there also seems to be a zuul issue that if a failing change makes it to the top of queue, it's rerun - http://ubuntuone.com/49l5sd2U7JLrMzx8RGavPU | 16:20 |
bknudson | I've seen the failure from 67475 in another review, since I was just looking into why the other one failed ... "tempest.scenario.test_cross_tenant_connectivity.TestNetworkCrossTenant.test_cross_tenant_traffic" | 16:21 |
sdague | yeh, so that's just which test failed. That's not why it failed. We need to know why that was expected to work and did not | 16:23 |
notmyname | sdague: when you added the timer to the zuul queue (you did that right?), did you by any chance add in a statsd timing metric for graphite? I'd love to graph the average time a patch spends in the gate queue | 16:23 |
sdague | notmyname: nope, didn't touch graphite at all | 16:24 |
sdague | I actually think the graphite metrics in this space are kind of broken. The jenkins interrupt which happens on resetting nodes is often classified as failure | 16:25 |
sdague | which it isn't | 16:25 |
sdague | so all the graphite numbers are worse that reality | 16:25 |
*** jkt has joined #openstack-infra | 16:25 | |
sdague | s/that/than/ | 16:26 |
bknudson | looks like the failure from 67475 and my own review is a known problem -- https://bugs.launchpad.net/tempest/+bug/1262613 | 16:26 |
jkt | hi there, I'm reading through your openstack-infra/config repo, and have noticed the remark about an ongoing transition towards puppet modules "straight from puppetforge" | 16:26 |
fungi | sdague: your screen capture doesn't look to me like it's showing what your comment implies | 16:26 |
sdague | fungi: so I don't have the capture from before | 16:27 |
sdague | that was already in a fail state | 16:27 |
jkt | I'm trying to deploy openstack setup at $job, to be managed by puppet, and I've never done a green-field deployment of puppet before | 16:27 |
sdague | got moved to the head | 16:27 |
bknudson | there's an e-r check for bug/1262613 already | 16:27 |
sdague | then got the entire stream behind it | 16:27 |
jkt | I'm wondering about the security implications of using "random" version of code from "random" site on the web | 16:27 |
bknudson | the e-r check says it was resolved. | 16:27 |
sdague | fungi: you kind of have to be watching zuul to see these happen | 16:27 |
jkt | I mean, I'm OK with using their packages for puppet itself, but blindly loading modules from forge makes me a bit uneasy | 16:28 |
fungi | sdague: that screen capture says that the failing head was severed and still has 19 minutes remaining until its other tests complete, but a recalculation of all the rest of the gate is underway so the following changes don't have all their jobs started yet | 16:28 |
jkt | is that really the plan, and do you have something for version management of these? | 16:28 |
bknudson | oh, I guess I got the wrong bug. | 16:28 |
mordred | jkt: hi! | 16:28 |
mordred | jkt: do you mean an install of the openstack-infra testing stuff? or of openstack itself? | 16:29 |
sdague | fungi: I'm pretty sure that change was off failing, the jobs behind were running, after merge they all reset again | 16:29 |
sdague | I've seen this twice this morning | 16:29 |
jkt | mordred: in the end, I'd like to install and configure openstack, but I'm not that far yet. What I'm doing now is getting familiar with puppetizing the infrastructure from day zero | 16:29 |
fungi | sdague: oh, okay. in that case i'll keep an eye out and see if i can catch it doing what you're suggesting | 16:29 |
jkt | mordred: and I'm asking here because I'm more or less copying the setup you have documented in that repo | 16:30 |
mordred | jkt: gotcha. so, those of us in here don't know anything about installing openstack itself via puppet - so I wanted to be clear on expectations :) | 16:30 |
mordred | jkt: for the other stuff - yeah, it's still in the plans to use what we can from puppet forge | 16:30 |
mordred | but we don't really do it blindly - we take one module at a time as we can - and already use several key ones - like the puppet-mysql module | 16:31 |
mordred | that said - one of the other goals of that is to break that repo apart and treat several of the modules like they're forge modules | 16:31 |
mordred | even though we wrote them | 16:31 |
clarkb | zaro: thanks | 16:31 |
mordred | for better lifecycle and composability | 16:31 |
jkt | mordred: so essentially specifying a version beforehand, in that install_modules.sh script, and running it every now and then on the master? | 16:32 |
mordred | clarkb, fungi: could you +2/+A the two config changes I posted above - I agree with sdague that we should do them | 16:32 |
mordred | jkt: that's right | 16:32 |
fungi | mordred: will 67709 deal more sanely now with the situation which caused us to turn it off before (changes to requirements result in very obscure failures on the pep8 jobs which are hard for devs to diagnose)? | 16:32 |
jkt | what I like a *lot* in your setup is that everything is in one repo; and having to update modules manually is something which, to me, looks a bit against that goal | 16:32 |
mordred | fungi: I believe so - but even if it doesn't , I think the tradeoff is worth it for the next couple of weeks | 16:33 |
mordred | jkt: the things we want in external modules are the things that don't change much - or that _we_ don't change that much | 16:33 |
jkt | I've just learned about `git subtree add --prefix modules/... ... --squash`, and I have to admit I like it a lot | 16:33 |
mordred | hehe | 16:33 |
mordred | we don't use submodules at all :) | 16:34 |
jkt | subtree != submodule, that's the catch | 16:34 |
mordred | $ git subtree --help | 16:34 |
mordred | No manual entry for gitsubtree | 16:34 |
jkt | it's essentially "get a checkout of that remote ref I specify as ..., squash it as a single commit, and merge it as a subdirectory under the --prefix" | 16:34 |
jkt | http://blogs.atlassian.com/2013/05/alternatives-to-git-submodule-git-subtree/ | 16:35 |
mordred | ah. interesting. so like a way to keep a super-repo for composability - but not attempt to treat the subtrees as things you'd do dev on in that context | 16:35 |
jkt | it's also 100% client-side; what gets pushed is a boring old commit | 16:36 |
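For reference, the workflow jkt is describing looks roughly like this; the module path, upstream URL, and tags are placeholders:

```bash
# Vendor an external puppet module as a single squashed commit.
git subtree add --prefix=modules/mysql \
    https://github.com/puppetlabs/puppetlabs-mysql.git 0.6.1 --squash

# Later, pull a newer upstream release into the same subdirectory.
git subtree pull --prefix=modules/mysql \
    https://github.com/puppetlabs/puppetlabs-mysql.git 1.0.0 --squash
```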
mordred | it's also not installed in the version of git in debian | 16:37 |
mordred | :) | 16:37 |
mordred | anyway - looks like interesting evening reading | 16:37 |
jkt | yeah, looks like it got only added in 2012 | 16:38 |
clarkb | subtree is bad for other reasons :) | 16:38 |
fungi | mordred: oh, i see... 67709 only skips py26/27 and docs, but not other things like requirements checks, dsvm jobs, py33, one-offs and so on | 16:38 |
jkt | clarkb: I would love to listen to them | 16:38 |
mordred | fungi: oh! piddle. that's a bug and you're right | 16:38 |
mordred | hrm. in fact | 16:38 |
clarkb | biggest problem for us immediately would be no gerrit support | 16:38 |
mordred | with the new template org - I do not think it's possible to do what I was trying to do | 16:38 |
fungi | mordred: well, i already approved | 16:38 |
mordred | clarkb: it doesn't need it | 16:38 |
fungi | but i can -2 | 16:38 |
mordred | fungi: well, it'll be _something_ | 16:38 |
jkt | mordred: and btw, on that Atlassian page, they even show how to push commits back to the original repo (which provided the subtree contents) | 16:39 |
mordred | fungi: but yeah - go ahead and -2 and I'll rework | 16:39 |
clarkb | second problem is you have to manually know to write commits that don't span trees iirc | 16:39 |
mordred | that's the main problem I'd see | 16:39 |
jkt | clarkb: it's "subtree", not "submodule", not what you'll see in Gerrit (or any other git client) is a simple commit adding whole directory at once | 16:39 |
clarkb | and humans fail at things like that | 16:40 |
zaro | clarkb, sdague : i tried disabling drafts on review-dev.o.o but was _not_ able to. | 16:40 |
clarkb | jkt i know | 16:40 |
sdague | zaro: ok, thanks | 16:40 |
mordred | zaro: darn | 16:40 |
clarkb | jkt but you can't have commits that span trees | 16:40 |
jkt | clarkb: how come? | 16:40 |
sdague | ok, I need to get away from the computer for a bit. I'll check back in during football later. | 16:40 |
clarkb | without gerrit support you can't enforce that easily | 16:40 |
mordred | clarkb: so, early fail - how do we do that without getting rid of jenkins first? | 16:41 |
clarkb | jkt because then you can't split trees iirc (or can but need filter branching) | 16:41 |
clarkb | its a matter of sanity | 16:41 |
clarkb | mordred our test runner jenkins side, e.g. run-testr, needs to do testr run --subunit and return 1 as soon as a fail happens | 16:42 |
jkt | clarkb: https://github.com/git/git/blob/master/contrib/subtree/git-subtree.txt shows the split feature (even with stable, i.e. deterministic and non-changing commit IDs, but I have no experience with them | 16:42 |
mordred | clarkb: but won't that abort the rest of the test run? | 16:42 |
clarkb | mordred not hard to do but would require a special subunit parser unless lifeless has a flag for that | 16:42 |
clarkb | mordred yes | 16:43 |
clarkb | without changing zuul to understand fail without return 1 I think that is our option | 16:43 |
mordred | clarkb: yeah - I don't think there's any quick and dirty way to do it - I'm just trying to figure out if there is any conceivable way at all to get there that I could parcel some tasks out to achieve | 16:44 |
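A minimal sketch of the "special subunit parser" clarkb mentions, assuming a subunit v2 stream on stdin and the python-subunit/testtools stream APIs; it is illustrative only, not actual infra code, and simply exits non-zero as soon as a failing test shows up:

```python
#!/usr/bin/env python
# Hypothetical fail-fast filter: testr run --subunit | ./fail_fast.py
# (python 2 assumed; on python 3 read from sys.stdin.buffer instead)
import sys

import subunit
import testtools


class FailFast(testtools.StreamResult):
    def status(self, test_id=None, test_status=None, **kwargs):
        # 'fail' and 'uxsuccess' are the failing statuses in the stream API.
        if test_status in ('fail', 'uxsuccess'):
            sys.stderr.write('first failure: %s\n' % test_id)
            sys.exit(1)


case = subunit.ByteStreamToStreamResult(sys.stdin, non_subunit_name='stdout')
case.run(FailFast())
```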
mordred | also - anybody know when we get jeblair back? today? tomorrow? tuesday? | 16:44 |
clarkb | either tomorrow or tuesday | 16:45 |
jkt | anyway, thanks mordred and clarkb, your experience is appreciated | 16:46 |
*** elasticio has joined #openstack-infra | 16:46 | |
mordred | jkt: you're welcome - thanks for the pointer to the blog - I'm less unhappy about it than clarkb is - although I've got a really narrow usecase for it I'd like to poke at | 16:46 |
clarkb | mordred because I have seen people use subtree thinking it fixes the world but really it just changes the problems :) | 16:47 |
*** sdake has joined #openstack-infra | 16:47 | |
mordred | clarkb: yeah - but it seems like it could be a specific solution for module composability for _us_ instead of puppet librarian or whatnot | 16:48 |
clarkb | oh and since subtree smashes trees together you have an extra element of license stuff to consider | 16:48 |
clarkb | mordred no that is the case people thought it would fix | 16:48 |
clarkb | they wrote r10k instead | 16:48 |
*** rakhmerov has joined #openstack-infra | 16:48 | |
mordred | but purely for deployment mechanics - not as something I'd expect us to ever check out ourselves | 16:48 |
mordred | clarkb: I betcha they thought they could use it for composability and development | 16:49 |
*** sandywalsh has quit IRC | 16:49 | |
clarkb | mordred: so you are thinking some post merge step that builds a new tree that only deployments use? | 16:51 |
mordred | clarkb: yes | 16:51 |
mordred | and that way we'd only do commits to that repo ourselves to update the commit tracking the external module | 16:52 |
mordred | it's probably a bad idea still and I should probably figure out r10k | 16:52 |
clarkb | I think r10k is simpler and it can consume items not in git as well | 16:53 |
mordred | yeah | 16:53 |
*** rakhmerov has quit IRC | 16:53 | |
clarkb | back to jenkins. should I shutdown a jenkins and try new scp plugin? | 16:55 |
mordred | clarkb: yes | 16:55 |
clarkb | ok starting that shortly | 16:55 |
mordred | clarkb: is there any specific reason why the dsvm jobs don't have a template? | 16:56 |
mordred | or just not gotten to? | 16:56 |
clarkb | not gotten to | 16:58 |
mordred | k | 16:58 |
clarkb | we have been doing that refactor with small deltas to make it easier to review and help prevent massive test breakage | 16:59 |
mordred | gotcha | 16:59 |
mordred | I LOVE that work, btw | 16:59 |
*** senk has joined #openstack-infra | 16:59 | |
clarkb | mordred: what we need is a template that uses the envinject plugin to set all of the various d-g flags without changing the actual script calls | 16:59 |
clarkb | since the env vars are what vary test to test | 17:00 |
mordred | clarkb: we need many things | 17:00 |
clarkb | I am going to put the scp plugin on jenkins02 because it runs the largest variety of tests | 17:00 |
mordred | okie | 17:00 |
clarkb | well 01 does too but 01 is old jenkins and old scp plugin so it's fine | 17:00 |
clarkb | is fungi still around? | 17:01 |
clarkb | fungi: any opinions on ^ | 17:01 |
clarkb | 02 is in shutdown mode | 17:03 |
*** senk has quit IRC | 17:06 | |
clarkb | estimated time remaining 54 minutes :( | 17:07 |
mordred | clarkb: I'm too dumb to have actually followed the whole thing - can you give me the tl;dr on why we have different jenkins versions? | 17:07 |
clarkb | mordred: we upgraded one jenkins host (02) then went on holidays | 17:07 |
clarkb | spun up 03 and 04 on the new version but haven't upgraded jenkins.o.o and jenkins01 yet | 17:07 |
clarkb | we were being conservative | 17:08 |
*** luqas has joined #openstack-infra | 17:08 | |
openstackgerrit | Khai Do proposed a change to openstack-infra/jenkins-job-builder: Fix references to examples in api documentation https://review.openstack.org/67712 | 17:08 |
mordred | gotcha | 17:08 |
Mithrandir | (that jjb change)> what? Literalinclude looks pretty broken then | 17:10 |
Mithrandir | and should rather be fixed | 17:10 |
clarkb | Mithrandir: that is kind of funny | 17:11 |
clarkb | would ./ work too? that might be more intuitive | 17:11 |
Mithrandir | I'd be fine with ./, but / meaning "start looking from where this file is located" is.. not how paths work. | 17:11 |
clarkb | Mithrandir: yup | 17:12 |
Mithrandir | The file name is usually relative to the current file’s path. However, if it is absolute (starting with /), it is relative to the top source directory. | 17:12 |
Mithrandir | is what the docs say | 17:12 |
Mithrandir | so the current one should work, barring bugs | 17:12 |
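Put differently, per the docs Mithrandir quotes, both of these spellings should resolve (the filenames are placeholders):

```rst
.. literalinclude:: ../samples/example-job.yaml

.. literalinclude:: /samples/example-job.yaml
```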
zaro | the './' didn't work, got the same warning. | 17:16 |
*** gokrokve has joined #openstack-infra | 17:16 | |
Mithrandir | can we use a custom sphinx tag instead? nonbrokenliteralinclude? | 17:17 |
clarkb | mordred: I am going to afk for a short time while I wait for tests to finish on 02 | 17:17 |
Mithrandir | or override the built-in | 17:17 |
mordred | clarkb: kk | 17:18 |
zaro | Mithrandir: how would custom tag work? | 17:20 |
*** senk has joined #openstack-infra | 17:20 | |
openstackgerrit | Brant Knudson proposed a change to openstack-infra/elastic-recheck: Add check for bug 1270608 https://review.openstack.org/67713 | 17:20 |
*** rakhmerov has joined #openstack-infra | 17:20 | |
Mithrandir | zaro: http://sphinx-doc.org/extensions.html, apparently | 17:21 |
* fungi is back... reading now | 17:21 | |
zaro | Mithrandir: i'm not following, is literalinclude an extension? | 17:24 |
Mithrandir | zaro: it's a directive, I'm not sure if it's core or not. | 17:25 |
Mithrandir | and we could have our own that works like literalinclude, but with non-crazy semantics. | 17:25 |
fungi | clarkb: on getting jeblair back, keep in mind that he, i and the rest of the foundation staff will be in utah or in transit most of the week. i'll be working from airline seats and airport lounges most of tuesday and friday | 17:25 |
*** rakhmerov has quit IRC | 17:25 | |
clarkb | gah | 17:26 |
clarkb | its like the conference madness never ends | 17:26 |
fungi | also, r10k was the cpu for the sgi o2. what else is it in this context? | 17:26 |
mordred | Mithrandir: perhaps it has to do with how we're running sphinx | 17:26 |
clarkb | and we are rotating batters | 17:26 |
mordred | Mithrandir: and what it thinks the top of our source dir is | 17:26 |
mordred | rather than being a bug in literalinclude itself | 17:26 |
clarkb | fungi puppet librarian that actually works | 17:26 |
mordred | fungi: jesus, really/ | 17:27 |
mordred | ? | 17:27 |
Mithrandir | according to the docs, if it starts with something else than /, it should be relative to the file. | 17:27 |
mordred | oh - salt conf | 17:27 |
Mithrandir | mordred: so either it's a bug in the implementation or the docs. | 17:27 |
mordred | hrm | 17:27 |
fungi | clarkb: upgrading scp plugin on 02 sounds good | 17:27 |
clarkb | fungi great. currently waiting for tests to finish there | 17:27 |
mordred | Mithrandir: but we're doing extraction with an extension of our own | 17:28 |
clarkb | I copied the scp.jpi from -dev to my homedir on 02 | 17:28 |
mordred | Mithrandir: so the file that it's relative to might not be the file we think it is | 17:28 |
*** sdake has quit IRC | 17:28 | |
Mithrandir | mordred: oh, that might make for extra fun. | 17:28 |
Mithrandir | mordred: maybe that extension should adjust the paths or something, then? | 17:29 |
mordred | Mithrandir: so, while I do think that it's a bug somewhere ... yeah - we might want to investigate the yaml extension we're using, and/or the results of pulling docstrings and generating sphinx from them | 17:30 |
*** sdague has quit IRC | 17:30 | |
fungi | mordred: well, we happen to be there coincident with saltconf, but it's mainly a couple days mid-cycle for the staff to get face time not at a summit | 17:30 |
mordred | fungi: wait - so you're saying we don't REALLY get jeblair back, AND we lose you? | 17:30 |
*** sdague has joined #openstack-infra | 17:30 | |
clarkb | mordred: yes | 17:30 |
clarkb | you and I are batting next | 17:30 |
bknudson | there's an e-r check for 1269940 that hit on https://review.openstack.org/#/c/66209/ | 17:30 |
bknudson | but the string in the yaml doesn't match anything in console.html | 17:30 |
mordred | fungi: just so you know, I don't think it's valuable to anyone for the foundation staff to have face time | 17:31 |
fungi | mordred: well, i'll probably be worthless wednesday/thursday, but will be working with lossy/high-latency network access on tuesday and friday | 17:31 |
mordred | fungi: but I have no leg to stand on as I've been afk for several weeks | 17:31 |
mordred | and i'll be going to brussels week after next | 17:31 |
clarkb | jeblair is going to brussels too | 17:31 |
fungi | mordred: annual performance reviews and whatnot. i guess there's some perceived benefit to do face-to-face team building | 17:32 |
clarkb | we need to clone a few fungis | 17:32 |
mordred | fungi: but your team is all of openstack - not each other | 17:32 |
*** gokrokve has quit IRC | 17:32 | |
* fungi reproduces asexually through spore propagation, so should be doable | 17:32 | |
clarkb | fungi also north carolinians are all robots | 17:33 |
mordred | fungi: you have ZERO goals separate from the project's goals, or at least you _shouldn't_ have any goals separate from the project's goals | 17:33 |
mordred | clarkb: ++ | 17:33 |
fungi | mordred: that i agree with. i'd rather see my performance review come from random cross-sections of the project ;) | 17:33 |
mordred | fungi: ++ | 17:33 |
clarkb | we just cp your AI from one machine to another :P | 17:33 |
mordred | fungi: in fact, seriously - performance review for foundation staff should be done by the project - maybe using condorcet | 17:33 |
fungi | clarkb: i haven't figured out how to upload my consciousness yet, but once i get that working we should be able to make copies just fine | 17:33 |
fungi | we have clouds | 17:33 |
fungi | mordred: sounds like a motion for the board | 17:34 |
openstackgerrit | A change was merged to openstack-infra/storyboard-webclient: Allow use of node from packages https://review.openstack.org/67604 | 17:34 |
*** nati_uen_ has quit IRC | 17:35 | |
mordred | clarkb: oh! you know what - if both fungi and jeblair are afk, we can finally push those HP-specific goals we've been hiding caring about!!! | 17:35 |
fungi | heh | 17:35 |
*** dcramer_ has joined #openstack-infra | 17:42 | |
*** gokrokve has joined #openstack-infra | 17:42 | |
*** coolsvap has quit IRC | 17:50 | |
openstackgerrit | Khai Do proposed a change to openstack-infra/jenkins-job-builder: Fix references to examples in api documentation https://review.openstack.org/67712 | 17:51 |
*** gokrokve has quit IRC | 17:53 | |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/storyboard-webclient: Allow people to source bin/setenv.sh https://review.openstack.org/67714 | 17:54 |
clarkb | jenkins02 is idle now | 18:00 |
clarkb | fungi: mordred: I am going to turn it off and start it with zaro's scp plugin build | 18:00 |
clarkb | jenkins02 is starting again | 18:03 |
bknudson | is there a good way to download logs in http://logs.openstack.org/47/66247/4/check/check-tempest-dsvm-full/0d6e9cc/logs/ ? | 18:03 |
bknudson | when I wget screen-n-cpu.txt.gz it's downloading forever | 18:03 |
bknudson | the logs page shows 10M but I canceled the download after 50M | 18:04 |
clarkb | bknudson: it is a massive file :), if you set the encoding type to gzip you will get gzipped files instead of uncompressed data | 18:04 |
clarkb | bknudson: it is on disk as 10MB compressed, wget requests uncompressed data though | 18:04 |
bknudson | I can uncompress it after I download if I can figure out how to get it compressed. | 18:05 |
clarkb | bknudson: right you set the encoding header to gzip to get it compressed | 18:05 |
dims | http://www.commandlinefu.com/commands/view/7180/get-gzip-compressed-web-page-using-wget. | 18:05 |
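Concretely, the trick is just to ask the log server for the gzip-encoded body instead of letting it decompress on the fly; the URL below is shortened to a placeholder:

```bash
wget --header="Accept-Encoding: gzip" \
    "http://logs.openstack.org/.../logs/screen-n-cpu.txt.gz" -O screen-n-cpu.txt.gz
gunzip screen-n-cpu.txt.gz
```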
*** senk has quit IRC | 18:06 | |
*** thuc has joined #openstack-infra | 18:07 | |
clarkb | a quick spot check of console logs in logstash after the new scp plugin went in looks good | 18:08 |
clarkb | sdague: fungi zaro I figure we let that burn in a bit today then do the others | 18:08 |
clarkb | I also need to run errands shortly so letting it run for a bit is my excuse to let me do that :) | 18:08 |
mordred | clarkb: awesome! | 18:10 |
bknudson | dims: Thanks, --header="Accept-Encoding: gzip" worked for me. | 18:12 |
*** salv-orlando has joined #openstack-infra | 18:12 | |
*** persia has quit IRC | 18:12 | |
*** 23LAAXGQE has joined #openstack-infra | 18:15 | |
*** pcrews has joined #openstack-infra | 18:15 | |
zaro | clarkb: sounds good. | 18:15 |
*** 23LAAXGQE has quit IRC | 18:16 | |
*** 23LAAXGQE has joined #openstack-infra | 18:16 | |
*** 23LAAXGQE is now known as persia | 18:16 | |
*** gokrokve has joined #openstack-infra | 18:17 | |
openstackgerrit | Felipe Reyes proposed a change to openstack-infra/jenkins-job-builder: Adds Mercurial SCM support https://review.openstack.org/61547 | 18:20 |
fungi | clarkb: awesome. i'm happy to iterate over the other jenkins masters later if we want | 18:27 |
*** nati_ueno has joined #openstack-infra | 18:35 | |
*** nati_ueno has quit IRC | 18:35 | |
*** nati_ueno has joined #openstack-infra | 18:36 | |
*** ewindisch is now known as zz_ewindisch | 18:38 | |
notmyname | if someone wants to abort the jobs running for 65399,3 I'm ok with that. I've got the log I need (https://jenkins02.openstack.org/job/gate-swift-python26/3523/console). Looks like a timing thing, but I'll make sure there is a LP bug for it in case it comes up again | 18:41 |
notmyname | bug 1224208 | 18:45 |
notmyname | well, actually since it's not at the top of the queue, some failure ahead of it could re-enqueue it and it would pass next time | 18:48 |
*** thuc has quit IRC | 18:52 | |
fungi | yep | 18:54 |
fungi | unless you expect that to be a consistent failure, better to just let it ride | 18:54 |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/storyboard-webclient: Add tox.ini file to run things via tox https://review.openstack.org/67721 | 18:55 |
mordred | ok. | 18:55 |
mordred | that's the craziest thing I've written in quite some time | 18:55 |
mordred | ttx, fungi, clarkb, sdague ^^ that, I'm not even kidding, allows you to run javascript toolchain stuff via tox and virtualenvs without having to install any javascript toolchain globally | 18:56 |
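One way to get there, and roughly the idea behind the change, is to have tox pull the node toolchain into the virtualenv via the nodeenv package; the env name and commands below are assumptions, not the actual tox.ini:

```ini
# Hypothetical tox.ini fragment: node/npm are installed into the virtualenv
# by nodeenv, so no global javascript toolchain is required.
[testenv:grunt]
deps = nodeenv
commands =
    nodeenv -p --prebuilt
    npm install
    npm run build
```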
openstackgerrit | Andreas Jaeger proposed a change to openstack-infra/config: Early abort documentation builds https://review.openstack.org/67722 | 18:56 |
notmyname | fungi: nope. I don't expect it to be constant. thanks | 18:56 |
*** Ajaeger has joined #openstack-infra | 18:57 | |
openstackgerrit | Jeremy Stanley proposed a change to openstack-infra/nodepool: Delete nodes more aggressively https://review.openstack.org/67723 | 18:58 |
notmyname | fungi: yup. just reset | 19:01 |
*** gokrokve has quit IRC | 19:06 | |
*** gokrokve has joined #openstack-infra | 19:07 | |
lifeless | o/ | 19:08 |
lifeless | clarkb: flag for what? | 19:08 |
*** oubiwann_ has joined #openstack-infra | 19:08 | |
*** thuc has joined #openstack-infra | 19:08 | |
*** thuc has quit IRC | 19:11 | |
*** thuc has joined #openstack-infra | 19:11 | |
*** gokrokve has quit IRC | 19:12 | |
*** thuc has quit IRC | 19:15 | |
clarkb | lifeless: fail fast, which we can do in the runner now that I think of it | 19:17 |
fungi | clarkb: if we do it in the runner, will that still let it run to completion and simply signal zuul earlier that the job is not going to succeed so it can get a head start on the ensuing gate reset? | 19:19 |
*** gokrokve has joined #openstack-infra | 19:19 | |
fungi | or would we be sacrificing remaining test results for the failing job? | 19:20 |
*** thuc has joined #openstack-infra | 19:20 | |
mordred | fungi: I almost think that sacrificing remaining test results at this point might be acceptable sacrifice | 19:22 |
mordred | fungi: in fact, given the state of the gate right now - I'd say that sacrificing remaining test results is almost certainly acceptable sacrifice | 19:23 |
mordred | sdague: ^^ ? | 19:23 |
fungi | i'm inclined to agree, just curious of the implications | 19:23 |
mordred | yeah | 19:23 |
mordred | I mean, I think ultimately it's not what we want | 19:23 |
mordred | ultimately we want all of the tests to run to completion and we want to fail fast | 19:24 |
*** nati_uen_ has joined #openstack-infra | 19:24 | |
mordred | but perfect might be getting in the way of good here, and just having the test runner hard fail and exit on first subunit stream failure might be what we want today | 19:24 |
*** flaper87 is now known as flaper87|afk | 19:24 | |
fungi | also, 67723 is intended to relieve me from scraping nodepool list for nodes in a delete state for more than 20 minutes and manually retrying to delete them to get available resources back | 19:25 |
fungi | left unchecked, they end up accounting for more than 50% of our aggregate quota | 19:26 |
mordred | fungi: lgtm | 19:26 |
*** nati_ueno has quit IRC | 19:27 | |
sdague | mordred: reading scroll back... | 19:30 |
mordred | sdague: most pressing thing is the idea of doing hard-fail with no run continuation on first fail | 19:31 |
sdague | so if we can get it out of zuul faster, that's a clear win. Losing the rest of the tests might not be. | 19:32 |
*** sarob has joined #openstack-infra | 19:33 | |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/storyboard-webclient: Add tox.ini file to run things via tox https://review.openstack.org/67721 | 19:34 |
mordred | sdague: that's the question - how much not would it be? | 19:37 |
*** sarob has quit IRC | 19:38 | |
mordred | sdague: in the balance between early ejection and keeping the remaining tests after the fail | 19:38 |
mordred | do we, in our current issues, get a lot of good data from the stuff that happens on tests after the first fail? | 19:38 |
sdague | well if we end up aborting, we might abort too early for the service logs to continue | 19:40 |
sdague | I know, for instance the issue on some of the network tests was the fact the allocation was taking too long | 19:40 |
sdague | it actually shows up later in the logs | 19:41 |
sdague | that would be missed | 19:41 |
*** DinaBelova_ is now known as DinaBelova | 19:42 | |
*** senk has joined #openstack-infra | 19:42 | |
sdague | so basically we can't get fail fast but keep running? | 19:42 |
sdague | zuul also failed to allocate a dsvm node on job #2, given starvation, I wonder when that is going to show up | 19:43 |
lifeless | clarkb: fungi: mordred: I've said before :) - don't make zuul parse subunit | 19:44 |
mordred | sdague: fast fail but keep running is a harder problem. fast fail and stop running is easy and can be done now | 19:44 |
lifeless | it's a central non distributed process | 19:44 |
mordred | lifeless: yeah - I don't mean zuul parse subunit | 19:45 |
mordred | I mean SOMETHING needs to parse subunit, and that something needs to be able to talk back to the gearman status | 19:45 |
lifeless | testr can be taught to raise a signal (e.g. run a script that calls something over gear) on detecting a failure | 19:45 |
lifeless | mordred: so testr is the thing that parses subunit here; I don't see why testr *itself* needs to talk gearman | 19:45 |
lifeless | I mean it could, but it seems overly intimate coupling | 19:45 |
mordred | I agree. and in fact, it's problematic for testr to be the one talking gear | 19:46 |
mordred | because that means that we've violated the trust boundry | 19:46 |
mordred | we need the thing that is in the context to talk to gearman to be able to peer in to what's going on as it happens | 19:46 |
lifeless | mordred: so move testr to a different context | 19:47 |
lifeless | mordred: it's not part of the code under test | 19:47 |
lifeless | mordred: and it can run things remotely, on multiple machines etc | 19:47 |
mordred | possibly - but now we're once again talking about massive changes that will not happen soon | 19:47 |
lifeless | so, the reason I'm pushing back on zuul (or even jenkins) parsing the full subunit stream is because 400 test machines will overwhelm 5 jenkins or 1 zuul | 19:48 |
*** sdake has joined #openstack-infra | 19:48 | |
lifeless | many MB of data to look for one bit | 19:48 |
mordred | sure | 19:48 |
mordred | but the reason that I'm pushing back on testr being involved with distributing work across workers | 19:48 |
lifeless | we could use subunit as the encoding but have testr only send failure signals | 19:48 |
mordred | is that there are too few devs hacking on testr and our ability to fix ui issues in it this past year has been very poor | 19:49 |
mordred | so placing it in a position of more operational complexity at this point would be a bad idea | 19:49 |
fungi | sdague: i don't think it actually failed to allocate a node. we get that behavior when jenkins fails to start a job i think or screws it up in certain ways which zuul recognizes as a need to restart that job | 19:49 |
lifeless | mordred: I've merged every patch I've been given :(, but I ack your point | 19:49 |
sdague | yeh, testr being locked in bzr means testr changes to fix this are basically a non-starter | 19:49 |
lifeless | sdague: so its not locked in bzr | 19:49 |
lifeless | sdague: it just needs tuits to move it | 19:49 |
clarkb | fungi: yup should reschedule on another node | 19:49 |
sdague | tuits? | 19:50 |
fungi | sdague: the round variety | 19:50 |
lifeless | sdague: http://2.bp.blogspot.com/-op8uJYMwdfI/TYpTkmur5hI/AAAAAAAAAd0/ty8WqHjiS58/s1600/A_Round_Tuit_Picture.jpg | 19:50 |
mordred | sdague: "round-to-it == tuit" | 19:50 |
fungi | well, "a round tuit" (you generally only need one) | 19:51 |
sdague | right. well, we're basically at the point of having to start a separate project to work around testr ui for tempest. | 19:51 |
*** DinaBelova is now known as DinaBelova_ | 19:51 | |
mordred | lifeless: in any case - the things you and I are both talking about here are implementation details of the fact that the system overall needs to be designed to handle detect-fail-and-keep-running | 19:51 |
sdague | so it would be good if the tuits got prioritized | 19:51 |
lifeless | sdague: I'd like to talk in detail before such a project is started; it might be the right answer, but I suspect not. | 19:52 |
sdague | it was really clear at the neutron code sprint that we are inflicting massive pain on our developers by making them manually dig through the test layers to do the things they need to do | 19:53 |
lifeless | sdague: is that this discussion, or a different one ? | 19:53 |
sdague | it was a separate one, which sort of joined on the first one | 19:53 |
sdague | lifeless: I 100% agree it's a less than ideal solution. But testr is locked in bzr in a corner. So I'm going to be pragmatic about it; we just have to stop copying and pasting workarounds between projects. | 19:54 |
lifeless | sdague: you said that already; I got that much. | 19:55 |
lifeless | sdague: I am happy to prioritise the tuits if they are the actual blocker, but until I understand the problem I won't know whether, even after a move to git, a separate project might be the right thing | 19:56 |
lifeless | sdague: so when I say I want to talk about it, I mean I want to talk about it :) | 19:56 |
lifeless | sdague: but since its a different discussion, lets not distract the fix-the-gate thread | 19:56 |
sdague | sure, fair | 19:56 |
*** david-lyle_ has joined #openstack-infra | 19:57 | |
lifeless | I will have to go out in ~ 10m to run some minor errands, including picking up my pushbike after repair, should be < 1 hr | 19:57 |
mordred | I do not think that we're going to magic in the real solution to fast-fail-continue-running this week, next week, or the week after | 19:57 |
*** yolanda has joined #openstack-infra | 19:57 | |
*** vkozhukalov has quit IRC | 19:58 | |
mordred | like, we've deep dived into what needs to happen for it already, and it's not quick, or it would be done already | 19:58 |
mordred | especially with jeblair largely out for more time the next two weeks | 19:58 |
lifeless | fail fast and stop is fairly straight forward | 19:58 |
mordred | yes | 19:58 |
lifeless | what proposed impl is on the table ? | 19:59 |
sdague | yeh, I'm not sure fail fast is a win at this point if we have to stop | 19:59 |
mordred | fail fast and stop is straight forward and actionable | 19:59 |
mordred | but it may not be a win | 19:59 |
mordred | sdague: would it be worth trying? or are you pretty sure about that | 19:59 |
* mordred is fine either way - just wants to be presenting options when we have them | 19:59 | |
sdague | mordred: given what I know of the delayed network allocations, I think it would have completely masked that issue and made it undebuggable | 20:00 |
sdague | and I'm going to assume there are other such issues | 20:00 |
*** sarob has joined #openstack-infra | 20:00 | |
mordred | sdague: so that I understand - you're saying the appropriate logging happened AFTER the test timed out? | 20:00 |
bknudson | seems like we need the bugs fixed more than we need workarounds | 20:00 |
sdague | mordred: correct | 20:00 |
mordred | sdague: ok. I grok | 20:01 |
mordred | we could put a delay on returning the error to jenkins perhaps | 20:01 |
lifeless | whats the proposed implementation ? | 20:01 |
sdague | mordred: because 2 minutes later we'd get a message from a network service that it allocated the network | 20:01 |
lifeless | it will help a little if I know that ;) | 20:01 |
mordred | wow. gotcha | 20:01 |
sdague | that was one of the things that russellb saw, that had us bring down the concurrency | 20:02 |
mordred | lifeless: proposed impl was just to have our testr runner scan for failures and exit 1 if it sees one | 20:02 |
lifeless | righto | 20:02 |
lifeless | so a small patch to testr - ack | 20:02 |
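The proposal above amounts to a small wrapper around the test runner: read the subunit stream it emits and exit non-zero if a failure shows up, without zuul or jenkins ever parsing the full stream. A minimal sketch, assuming a subunit v1 stream on stdin (e.g. piped from `testr run --subunit`) and the python-subunit library:

    #!/usr/bin/env python
    # Minimal fail-fast sketch: replay a subunit stream against a plain
    # TestResult and exit 1 if the run contained any failure or error.
    import sys
    import unittest

    import subunit  # python-subunit


    def main():
        stream = getattr(sys.stdin, 'buffer', sys.stdin)
        result = unittest.TestResult()
        # ProtocolTestCase replays the serialized test run against our result.
        subunit.ProtocolTestCase(stream).run(result)
        if result.failures or result.errors:
            # A real runner would also signal "stop the job" here (this is
            # where the gearman discussion above comes in) rather than just
            # returning a non-zero exit code.
            return 1
        return 0


    if __name__ == '__main__':
        sys.exit(main())
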
mordred | sdague: maybe we only fail-fast in the gate, and we leave it on run-all-the-way in check? | 20:02 |
lifeless | ^ this | 20:02 |
lifeless | I don't think the gate should be about diagnostics | 20:02 |
lifeless | except for things that *only* run in the gate | 20:03 |
sdague | mordred: so if we don't keep enough of those logs, we lose our fingerprinting | 20:03 |
sdague | or potentially lose our fingerprinting | 20:03 |
mordred | that's true - but if we're ramping up auto-recheck, we should still have tons of check-level fails | 20:03 |
sdague | sure, but then we don't know what's *actually* failing us in the gate | 20:04 |
mordred | just ones that aren't also wrecking-balls | 20:04 |
mordred | nod | 20:04 |
sdague | the check queue is super noisy | 20:04 |
lifeless | because bad code + flake | 20:04 |
sdague | because there is tons of bad code in it | 20:04 |
* mordred has to run - back in a bit... | 20:04 | |
lifeless | how do you determine signal? | 20:04 |
sdague | lifeless: basically job fail rate check vs. gate | 20:05 |
*** sarob has quit IRC | 20:05 | |
lifeless | on the fingerprint of the failure, right? | 20:05 |
sdague | this is overall fails. We don't have things broken down yet per job the way I'd like. That's blocked by ES having lost data, and just time. | 20:06 |
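For concreteness, the "signal" being described is just the per-queue failure rate over a window of build records; a toy sketch with made-up data (in practice the records come out of logstash/elasticsearch):

    # Toy sketch of "job fail rate, check vs. gate"; the build tuples here
    # are illustrative only.
    from collections import Counter

    builds = [
        ('check', 'FAILURE'), ('check', 'SUCCESS'), ('check', 'FAILURE'),
        ('gate', 'SUCCESS'), ('gate', 'FAILURE'), ('gate', 'SUCCESS'),
    ]

    totals, fails = Counter(), Counter()
    for queue, status in builds:
        totals[queue] += 1
        if status != 'SUCCESS':
            fails[queue] += 1

    for queue in sorted(totals):
        print('%s: %d/%d failed (%.0f%%)' % (
            queue, fails[queue], totals[queue],
            100.0 * fails[queue] / totals[queue]))
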
lifeless | tuits again :P | 20:07 |
lifeless | so | 20:07 |
lifeless | the question is, if we didn't have full gate logs | 20:07 |
lifeless | it seems like that will limit the ER evolution | 20:07 |
sdague | yes | 20:07 |
lifeless | because gate signal will be extremely hard to correlate to checks | 20:07 |
sdague | correct | 20:07 |
lifeless | unless we have some canaries - no-op changes that we use to probe for flakiness | 20:08 |
*** luqas has quit IRC | 20:08 | |
lifeless | that run on the gate machines, in the gate trust (thinking ahead to baremetal stuff), but with no changes merged | 20:08 |
lifeless | and more often than 1/day | 20:08 |
sdague | yep, we need probably 100 / day to have a solid baseline | 20:09 |
lifeless | I'm thinking while true: do | 20:09 |
lifeless | as a starting point | 20:09 |
lifeless | which would get 24 odd | 20:09 |
lifeless | ok, bb in an hour | 20:09 |
sdague | yep | 20:09 |
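The canary idea above is essentially a timed loop: re-run a known no-op change often enough to establish a baseline failure rate independent of incoming patches. A rough sketch, with the trigger mechanism left as a placeholder since how the no-op change actually gets re-run is not decided in this discussion:

    # Rough canary-loop sketch: re-run a no-op change N times per day to
    # measure baseline flakiness. trigger_canary() is a placeholder; the
    # real mechanism (e.g. a recheck comment via gerrit) is an assumption.
    import time

    RUNS_PER_DAY = 100                    # the suggested baseline above
    INTERVAL = 86400 // RUNS_PER_DAY      # ~864 seconds between runs


    def trigger_canary():
        print('would trigger a canary run here')


    while True:
        trigger_canary()
        time.sleep(INTERVAL)
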
fungi | that plays back into the idle pool jobs idea | 20:10 |
fungi | but we'd need a dedicated idle pool if we compromise gate results in that manner | 20:10 |
sdague | yes | 20:11 |
fungi | rather than just available nodes to round out the count when the gate is less busy | 20:11 |
sdague | correct | 20:12 |
*** sarob has joined #openstack-infra | 20:13 | |
*** salv-orlando has quit IRC | 20:15 | |
*** luqas has joined #openstack-infra | 20:15 | |
sdague | hmmm... I really wish we had the data over in ES. I think basically neutron jobs are 100% failing right now in gate, but it's hard to see | 20:16 |
*** sarob has quit IRC | 20:18 | |
fungi | sdague: well, jobs run via jenkins01 or jenkins02 should be getting their console logs into elasticsearch currently | 20:18 |
fungi | it's just 03 and 04 which aren't | 20:18 |
fungi | (well, and jenkins.o.o but it's special-purpose jobs only anyway) | 20:19 |
*** afazekas_ has quit IRC | 20:20 | |
sdague | http://logstash.openstack.org/#eyJzZWFyY2giOiJmaWxlbmFtZTpjb25zb2xlLmh0bWwgQU5EIGJ1aWxkX25hbWU6Z2F0ZS10ZW1wZXN0LWRzdm0tbmV1dHJvbi1pc29sYXRlZCBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBcIiBBTkQgYnVpbGRfcXVldWU6Z2F0ZSIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDE2Mjk5MTgyM30= | 20:23 |
sdague | so of the logs we have, the isolated job in neutron has a 66% failure rate in the gate for the last 48hrs | 20:23 |
sdague | based on that, I think we should remove all neutron from the gate | 20:24 |
*** luqas has quit IRC | 20:25 | |
sdague | the regular neutron job is at 45% failure - http://logstash.openstack.org/#eyJzZWFyY2giOiJmaWxlbmFtZTpjb25zb2xlLmh0bWwgQU5EIGJ1aWxkX25hbWU6Z2F0ZS10ZW1wZXN0LWRzdm0tbmV1dHJvbiBBTkQgbWVzc2FnZTpcIkZpbmlzaGVkOiBcIiBBTkQgYnVpbGRfcXVldWU6Z2F0ZSIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiMTcyODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTM5MDE2MzA5Mzk5NiwibW9kZSI6InRlcm1zIiwiYW5hbHl6ZV9maWVsZCI6ImJ1aWxkX3N0YXR1cyJ9 | 20:25 |
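Those two dashboard links encode a query-string search plus a breakdown on build_status; roughly the same check can be done programmatically. A hedged sketch using the elasticsearch python client, where the endpoint, index selection and field names are assumptions based on what the logstash UI exposes:

    # Hedged sketch: count gate results for one job via elasticsearch.
    # Endpoint, index and field names are assumptions for illustration.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(['http://logstash.openstack.org:9200'])
    query = ('filename:"console.html" AND build_queue:"gate" AND '
             'build_name:"gate-tempest-dsvm-neutron-isolated" AND '
             'message:"Finished: "')
    result = es.search(index='_all', body={
        'size': 0,
        'query': {'query_string': {'query': query}},
        'aggs': {'status': {'terms': {'field': 'build_status'}}},
    })
    for bucket in result['aggregations']['status']['buckets']:
        print(bucket['key'], bucket['doc_count'])
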
*** thuc has quit IRC | 20:26 | |
*** thuc has joined #openstack-infra | 20:27 | |
*** yolanda has quit IRC | 20:29 | |
mattoliverau | Good morning infra peeps! | 20:30 |
*** thuc has quit IRC | 20:32 | |
fungi | sdague: what are the chances for the improvements from the sprint? i gathered a couple of them needed updated patchsets to pass tests, but that combined they should provide drastic improvement. any idea if they're ready to go yet? | 20:36 |
fungi | morning mattoliverau | 20:36 |
mikal | Morning | 20:36 |
sdague | fungi: honestly, I don't know. | 20:37 |
clarkb | fungi: they didn't pass testing | 20:39 |
clarkb | and were rechecked with "recheck no bug", so no debug context | 20:39 |
fungi | clarkb: right, just didn't know if their respective owners had been working on them since | 20:39 |
*** SergeyLukjanov is now known as SergeyLukjanov_ | 20:39 | |
sdague | fungi: I don't know | 20:39 |
sdague | I think there is a base failure rate that's crept back up in the isolated jobs now | 20:39 |
sdague | part of the challenge for the week was it was taking 4 - 5 hrs to get check results back | 20:40 |
sdague | so the timing of the zuul overload was unfortunate | 20:40 |
*** pcrews has quit IRC | 20:40 | |
fungi | or the timing of the sprint coincided with the timing of other sprints and general pre-milestone patch rush | 20:40 |
sdague | yeh, agreed | 20:41 |
sdague | i2 runup, definitely an issue | 20:41 |
fungi | i know of at least one other major openstack project which had a sprint the same week | 20:41 |
*** pcrews has joined #openstack-infra | 20:41 | |
fungi | but there may have been more | 20:41 |
sdague | but I think we're basically at a point now where there no longer is a "good" time during the cycle to get together because activity is always so high | 20:41 |
lifeless | back | 20:45 |
*** elasticio has quit IRC | 20:48 | |
*** mrodden has joined #openstack-infra | 20:50 | |
*** Ajaeger has quit IRC | 20:52 | |
*** dcramer_ has quit IRC | 20:52 | |
*** rakhmerov has joined #openstack-infra | 20:53 | |
*** mrodden1 has joined #openstack-infra | 20:53 | |
sdague | fungi: so we are about to get a merge | 20:55 |
*** mrodden has quit IRC | 20:55 | |
sdague | we can see if the change after the neutron fail is made to start over | 20:55 |
sdague | nope, seemed to do the right thing | 20:56 |
fungi | yeah, i've watched several such incidents since you mentioned it, and haven't seen it happen yet | 20:57 |
fungi | so must be an odd combo of circumstances triggering it | 20:58 |
sdague | I think in those cases where I saw it the jobs were still running on the failed job | 21:02 |
sdague | I'm now cron grabbing zuul every 60 seconds so I can reconstruct some of these | 21:02 |
fungi | good idea | 21:02 |
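Grabbing the zuul status periodically is a simple polling loop against the JSON feed the status page itself uses (the URL here is an assumption for the sketch); keeping timestamped snapshots lets a reset be reconstructed afterwards:

    # Poll the zuul status feed every 60 seconds and keep timestamped copies.
    # The status.json URL is assumed for this sketch.
    import time

    try:
        from urllib.request import urlopen   # python 3
    except ImportError:
        from urllib2 import urlopen          # python 2

    URL = 'http://zuul.openstack.org/status.json'

    while True:
        stamp = time.strftime('%Y%m%dT%H%M%S')
        data = urlopen(URL).read()
        with open('zuul-status-%s.json' % stamp, 'wb') as f:
            f.write(data)
        time.sleep(60)
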
clarkb | was it just continuing to run tests detached from the rest of the queue? | 21:03 |
clarkb | it will do that | 21:03 |
sdague | clarkb: it looked like it reset the job below it | 21:05 |
*** flaper87|afk is now known as flaper87 | 21:05 | |
*** mrda has joined #openstack-infra | 21:07 | |
fungi | sdague: it does that when the change fails, because the change behind it needs to be retested against the branch tip as the new head of the gate, rather than on top of the failing change | 21:08 |
fungi | but it sounded like you were describing something else, like a second gate reset | 21:09 |
openstackgerrit | Eli Klein proposed a change to openstack-infra/jenkins-job-builder: Added rbenv-env wrapper https://review.openstack.org/65352 | 21:09 |
sdague | yeh, that's what it looked like | 21:11 |
lifeless | mordred: on fail-early-keep-running; what if we added a second untrusted geard that can *only* signal failure | 21:13 |
*** sarob has joined #openstack-infra | 21:13 | |
*** sarob has quit IRC | 21:17 | |
sdague | so I'm basically manually pulling all of neutron out right now | 21:21 |
sdague | any neutron or python neutron client job has something like a 5% chance of passing at the moment | 21:22 |
sdague | and there were 5 of them in a row in there | 21:22 |
mikal | What's the state of stable at the moment? | 21:24 |
mordred | lifeless: right. so, honestly I can't remember the state of the design for that - but jeblair has a plan for implementing the complex version of this | 21:24 |
mikal | Rechecks are ok but the gate is still busted? | 21:24 |
mordred | lifeless: but with all things being perfect, I would expect it to take us at least a month to get all of the various pieces landed | 21:24 |
mordred | lifeless: the issue there isn't figuring it out - it's just working through the steps to do it | 21:25 |
lifeless | ok | 21:25 |
mordred | lifeless: today's question is more "are there any less-ideal shortcuts we can take to help the current gate-slam" | 21:25 |
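The second-geard idea above needs nothing more from the untrusted side than the ability to submit one narrowly defined job. A hedged sketch of what the signalling end could look like, assuming the python gear client library; the server address and job name are made up for illustration, not an agreed protocol:

    # Hedged sketch: the only thing the untrusted context can do is submit
    # a "failure" signal to a dedicated geard. Names are placeholders.
    import gear

    client = gear.Client()
    client.addServer('failure-geard.example.org')
    client.waitForServer()

    job = gear.Job(b'signal:test-failure', b'tempest.scenario.test_foo FAILED')
    client.submitJob(job)
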
*** salv-orlando has joined #openstack-infra | 21:25 | |
lifeless | sure | 21:25 |
mordred | lifeless: btw - not related to this, but just because it's the other thing I'm hacking on right now ... apparently it's possible to comingle js tooling in a python venv | 21:26 |
fungi | mikal: all of stable is still broken because of some exercises not passing on grizzly (which in turn means grenade can't test upgrading to havana changes) | 21:31 |
fungi | mikal: sdague mass-rechecked all outstanding stable changes so that they would get an obvious -1 from jenkins, to prevent stable cores from approving any more of them | 21:32 |
sdague | yep | 21:32 |
sdague | fungi: it's actually because cinder can't run because of stevedore version checking explosion | 21:33 |
fungi | right, but that's where it manifests in the jobs | 21:34 |
sdague | yeh | 21:34 |
sdague | well it also manifests in *all* grizzly being broken | 21:34 |
sdague | but stable maint didn't seem to care on that one :) | 21:34 |
fungi | i thought it was only devstack/tempest failures for grizzly, but regardless yeah | 21:35 |
sdague | sure, but that means you couldn't land any changes | 21:35 |
fungi | if it was also failing grizzly changes on non-integration jobs i missed that | 21:35 |
*** dizquierdo has joined #openstack-infra | 21:38 | |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/config: Use nodeenv via tox to do javascript testing https://review.openstack.org/67729 | 21:39 |
mordred | there we go. generic tox-based js-unittest job template that has docs-draft-like functionality | 21:39 |
notmyname | "BuildErrorException: Server %(server_id)s failed to build and is in ERROR status" seems familiar to me, but I can't seem to find anything in LP | 21:43 |
notmyname | ring a bell with anyone else or is it something new? | 21:44 |
bknudson | https://bugs.launchpad.net/nova/+bug/1266740 | 21:44 |
notmyname | hmm...this is in test_volume_boot_pattern | 21:45 |
notmyname | same bug or should it be filed as something new in LP? | 21:45 |
notmyname | logs at http://logs.openstack.org/16/66916/1/gate/gate-tempest-dsvm-full/b7f51bb/console.html | 21:45 |
*** gokrokve has quit IRC | 21:46 | |
*** senk has quit IRC | 21:47 | |
bknudson | notmyname: I opened this bug https://bugs.launchpad.net/nova/+bug/1270608 | 21:47 |
bknudson | which has the same log message from n-cpu. | 21:47 |
notmyname | bknudson: thanks. I'll use that one | 21:47 |
bknudson | notmyname: I added a e-r check for it https://review.openstack.org/#/c/67713/ | 21:48 |
portante | notmyname: I filed https://bugs.launchpad.net/cinder/+bug/1270350 so that I could find it searching LP | 21:55 |
portante | bknudson: shall I close that one in favor of 1270608? | 21:56 |
mikal | sdague: I think you missed at least one, because my script just rechecked it | 21:56 |
mikal | sdague: so, same outcome... | 21:56 |
mikal | Oh, no I see. | 21:57 |
mikal | sdague rechecked it, jenkins passed | 21:57 |
mikal | This is yesterday | 21:57 |
mikal | https://review.openstack.org/#/c/62206 | 21:57 |
openstackgerrit | Monty Taylor proposed a change to openstack-infra/config: Genericize javascript release artifact creation https://review.openstack.org/67731 | 21:57 |
mikal | Ahhh, because grenade is non-voting for neutron | 21:58 |
sdague | yeh | 21:58 |
sdague | neutron, ceilometer, and oslo still pass on havana | 21:58 |
sdague | because they don't attempt to upgrade | 21:58 |
mikal | That's good. Users never upgrade those things. | 21:58 |
sdague | nope | 21:59 |
sdague | bknudson: before putting this through - https://review.openstack.org/#/c/67713/ question in line | 21:59 |
notmyname | sdague: on the gate timings, how feasible is adding the "time in gate" to the log message (both success and failure)? without a statsd timing metric, it would at least give the ability to track how long a piece of code stays in the gate | 21:59 |
sdague | notmyname: to what log message? | 22:00 |
notmyname | sdague: the jenkins message in gerrit | 22:00 |
sdague | don't know | 22:01 |
sdague | you could dive into the zuul code to see | 22:01 |
*** gokrokve has joined #openstack-infra | 22:01 | |
notmyname | sdague: ie I'm now looking at another many hours to get https://review.openstack.org/#/c/66916/ to the top of the gate (which has a 61% chance of passing). and the zuul status page has now been reset to 1 minute | 22:01 |
notmyname | sdague: got a starting point to look at? | 22:01 |
sdague | nope, I don't know zuul code very well, I just dove in to do the enqueue time stuff | 22:02 |
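If zuul already records when an item was enqueued (the enqueue-time work mentioned above), reporting time-in-queue is just a subtraction at report time. A toy sketch, where the field name and millisecond units are assumptions for illustration:

    # Toy sketch: derive "time in queue" for a zuul item at report time.
    # The enqueue_time field and its millisecond units are assumptions.
    import time


    def hours_in_queue(item):
        enqueued = item['enqueue_time'] / 1000.0   # assumed epoch milliseconds
        return (time.time() - enqueued) / 3600.0


    fake_item = {'enqueue_time': (time.time() - 5 * 3600) * 1000}  # enqueued 5h ago
    print('%.1f hours in queue' % hours_in_queue(fake_item))
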
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Add query for bug 1270309 https://review.openstack.org/67594 | 22:02 |
*** gokrokve_ has joined #openstack-infra | 22:02 | |
*** gokrokve has quit IRC | 22:06 | |
*** gokrokv__ has joined #openstack-infra | 22:06 | |
*** gokrokve_ has quit IRC | 22:06 | |
openstackgerrit | A change was merged to openstack-infra/elastic-recheck: Add noirc option to bot https://review.openstack.org/67525 | 22:07 |
sdague | so the stable/grizzly fix made it to top of queue now, with any luck | 22:09 |
bknudson | portante: is it the same problem? see the e-r query | 22:10 |
bknudson | portante: if the query for the e-r works for https://bugs.launchpad.net/cinder/+bug/1270350 then close https://bugs.launchpad.net/nova/+bug/1270608 as a dup | 22:11 |
bknudson | and I'll update the e-r change to use https://bugs.launchpad.net/cinder/+bug/1270350 | 22:11 |
bknudson | portante: I just want there to be an e-r query for it. | 22:11 |
portante | bknudson: agreed, looking | 22:13 |
*** sarob has joined #openstack-infra | 22:13 | |
*** sarob has quit IRC | 22:18 | |
*** sarob has joined #openstack-infra | 22:19 | |
*** salv-orlando has quit IRC | 22:20 | |
*** sarob has quit IRC | 22:25 | |
bknudson | portante: logstash query with 'message:"BuildErrorException: Server %(server_id)s failed to build and is in ERROR status" AND filename:"console.html"' | 22:26 |
bknudson | gets more hits than 'filename:"logs/screen-n-cpu.txt" AND message:"Error: iSCSI device not found at /dev/disk/by-path/"' | 22:26 |
bknudson | but they're all failures either way. | 22:27 |
portante | yes, and I made 1270350 a dupe of 1270608 since it is more specific | 22:27 |
sdague | bknudson: the n-cpu.txt message is better, as that is specific of an underlying error, not just the symptom it causes | 22:29 |
sdague | mordred: so thinking about this more, while we are still at starvation, fast fail doesn't really help all that much, we're still going to be waiting around for nodes to tear down and rebuild | 22:31 |
*** dizquierdo has quit IRC | 22:31 | |
sdague | that's another bit of why we are hurting right now. We can't restart the changes behind the fail point very quickly | 22:32 |
*** salv-orlando has joined #openstack-infra | 22:39 | |
*** 45PAA4WSM has joined #openstack-infra | 22:41 | |
*** 45PAA4WSM is now known as jhesketh | 22:41 | |
*** dcramer_ has joined #openstack-infra | 22:43 | |
*** yassine has joined #openstack-infra | 22:43 | |
*** yassine has quit IRC | 22:43 | |
fungi | i've got a few things in play to help with some of the starvation... manually removed nodepool tracking for several nodes which are hung deleting at the provider or for a nonexistent provider, cleaning up some stray "alien" vms which nodepool has forgotten it created through unclean daemon restarts, and going to give 67723 a whirl to see if we reclaim some deleted nodes faster | 22:52 |
*** gokrokv__ has quit IRC | 22:52 | |
fungi | we'll still be resource-starved, but at least the available resources should be increased a bit | 22:52 |
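The "alien" cleanup described above is conceptually a set difference: anything the provider reports that nodepool has no record of was leaked by an unclean restart and is a candidate for deletion. A toy sketch with placeholder data:

    # Toy sketch of alien detection: provider-side instances minus the ones
    # nodepool knows about. The UUIDs here are placeholders.
    provider_servers = {'uuid-aaa', 'uuid-bbb', 'uuid-ccc'}  # e.g. from nova list
    nodepool_nodes = {'uuid-aaa', 'uuid-ccc'}                # e.g. from nodepool's db

    for server in sorted(provider_servers - nodepool_nodes):
        print('alien instance, candidate for cleanup: %s' % server)
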
*** rakhmerov has quit IRC | 23:03 | |
*** praneshp has joined #openstack-infra | 23:05 | |
*** gokrokve has joined #openstack-infra | 23:08 | |
*** jamielennox|away is now known as jamielennox | 23:12 | |
sdague | yeh, I've had to walk away from beating my head on the gate for a while. I'm off trying to clean up nova request logs now | 23:14 |
*** sarob has joined #openstack-infra | 23:20 | |
*** rakhmerov has joined #openstack-infra | 23:36 | |
*** flaper87 is now known as flaper87|afk | 23:47 | |
openstackgerrit | lifeless proposed a change to openstack-infra/nodepool: Don't load system host keys. https://review.openstack.org/67738 | 23:54 |
openstackgerrit | lifeless proposed a change to openstack-infra/nodepool: Ignore vim editor backup and swap files. https://review.openstack.org/67651 | 23:54 |
openstackgerrit | lifeless proposed a change to openstack-infra/nodepool: Add some debugging around image checking. https://review.openstack.org/67650 | 23:54 |
openstackgerrit | lifeless proposed a change to openstack-infra/nodepool: Only attempt to copy files when bootstrapping. https://review.openstack.org/67678 | 23:54 |
openstackgerrit | lifeless proposed a change to openstack-infra/nodepool: Document that fake.yaml isn't usable. https://review.openstack.org/67679 | 23:54 |
*** sarob has quit IRC | 23:55 |