*** willthames has joined #zuul | 00:17 | |
*** adam_g has quit IRC | 00:37 | |
*** adam_g_ is now known as adam_g | 00:37 | |
*** jamielennox is now known as jamielennox|away | 00:45 | |
*** jamielennox|away is now known as jamielennox | 01:50 | |
*** Cibo_ has joined #zuul | 02:55 | |
openstackgerrit | Merged openstack-infra/zuul: Remove v3 project template test https://review.openstack.org/395722 | 03:49 |
*** harlowja has joined #zuul | 04:37 | |
*** bhavik1 has joined #zuul | 05:00 | |
*** harlowja has quit IRC | 05:46 | |
*** bhavik1 has quit IRC | 05:58 | |
*** abregman has joined #zuul | 06:22 | |
*** saneax-_-|AFK is now known as saneax | 07:59 | |
*** hashar has joined #zuul | 08:23 | |
*** jamielennox is now known as jamielennox|away | 08:33 | |
*** hogepodge has quit IRC | 09:45 | |
*** hogepodge has joined #zuul | 09:46 | |
*** saneax is now known as saneax-_-|AFK | 10:08 | |
*** saneax-_-|AFK is now known as saneax | 10:09 | |
*** hogepodge has quit IRC | 11:01 | |
*** hogepodge has joined #zuul | 11:21 | |
*** hogepodge has quit IRC | 11:26 | |
*** hogepodge has joined #zuul | 11:38 | |
*** hogepodge has quit IRC | 11:48 | |
*** hogepodge has joined #zuul | 12:00 | |
*** bhavik1 has joined #zuul | 13:39 | |
*** abregman has quit IRC | 14:26 | |
*** abregman has joined #zuul | 14:49 | |
*** abregman has quit IRC | 15:02 | |
*** abregman has joined #zuul | 15:08 | |
*** abregman has quit IRC | 15:29 | |
*** saneax is now known as saneax-_-|AFK | 15:41 | |
*** hashar is now known as hasharCall | 17:02 | |
*** adam_g is now known as adm_g | 17:46 | |
*** adm_g is now known as adam_g | 17:47 | |
*** adam_g has quit IRC | 17:47 | |
*** adam_g has joined #zuul | 17:47 | |
pabelanger | o/ | 18:06 |
pabelanger | just checking nb01, I see we are getting an exception | 18:07 |
*** hasharCall is now known as hashar | 18:07 | |
pabelanger | http://paste.openstack.org/show/591842/ | 18:07 |
Shrews | pabelanger: is that recent? | 18:10 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Use diskimage.name for _checkForScheduledImageUpdates exception https://review.openstack.org/408756 | 18:10 |
pabelanger | Shrews: just happened | 18:10 |
pabelanger | Shrews: however, its stopped | 18:11 |
pabelanger | let me check nb02 to see what happened | 18:11 |
Shrews | pabelanger: ok. that's due to the change jeblair made yesterday. i thought that was transient | 18:11 |
pabelanger | oh, no. It is still happening | 18:11 |
pabelanger | let me check zk-shell | 18:11 |
pabelanger | Shrews: okay, lets see what jeblair wants to do | 18:12 |
jeblair | drat | 18:13 |
Shrews | we might just need to start fresh :( | 18:13 |
jeblair | yes, though i think if we land pabelanger's change, we will know which note to go in and fix harder | 18:13 |
jeblair | er node | 18:13 |
Shrews | which change? | 18:13 |
jeblair | 756 | 18:14 |
pabelanger | I don't mind a fresh start, rebuilds and uploads are working very well | 18:14 |
*** hashar has quit IRC | 18:16 | |
*** harlowja has joined #zuul | 18:17 | |
jeblair | let's restart with 756, poke at that node (mostly i want to see what i missed earlier). if that's confusing or doesn't work, let's reboot. | 18:17 |
pabelanger | wfm | 18:20 |
*** harlowja_ has joined #zuul | 18:20 | |
Shrews | maybe our config objects should just implement __repr__() | 18:22 |
*** harlowja has quit IRC | 18:22 | |
jeblair | Shrews: ++ | 18:27 |
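(Editorial note: a minimal sketch of the `__repr__` idea Shrews floats above, later proposed as change 408776 "Add __repr__ to ConfigValue objects". The attribute-dumping strategy and the `DiskImage` subclass here are illustrative assumptions, not the actual patch.)

```python
# Hypothetical sketch: giving config objects a useful __repr__ so log
# messages and exceptions show attribute values instead of a bare
# "<object at 0x...>" id. The real nodepool change (408776) may differ.
class ConfigValue(object):
    def __repr__(self):
        # Dump all instance attributes, sorted for stable output.
        attrs = ', '.join('%s=%r' % kv for kv in sorted(vars(self).items()))
        return '<%s %s>' % (self.__class__.__name__, attrs)


class DiskImage(ConfigValue):
    """Illustrative config subclass; name borrowed from the discussion."""
    def __init__(self, name):
        self.name = name


print(repr(DiskImage('ubuntu-trusty')))  # <DiskImage name='ubuntu-trusty'>
```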
pabelanger | jeblair: regarding https://review.openstack.org/#/c/404976/, which moves fake-image-create out of production code, how would you like to see it reworked? | 18:27 |
jeblair | pabelanger: i'm fine with that change as-is -- the thing i wanted to communicate is that i don't think we should change how we cause failures to happen. we can set that command in our test configs, but we should not change the command to cause failures, we should continue using the 'should_fail' (or whatever it's called) image metadata attribute to cause build failures. | 18:29 |
jeblair | pabelanger: (basically, my -1 was in reference to the last sentence of the commit message) | 18:30 |
pabelanger | okay, I am okay with that | 18:30 |
pabelanger | I'll update it here in a minute | 18:30 |
jeblair | (the reason being: by using the metadata, we can say "image A should succeed, image B should fail". we can't do that if we set the command to fail for all images) | 18:31 |
pabelanger | That makes sense | 18:34 |
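(Editorial note: a toy illustration of the approach jeblair describes above: the fake build command stays the same for every image, and a per-image metadata flag decides whether a given build fails, so "image A should succeed, image B should fail" is expressible. The function and flag name `should_fail` follow the chat's wording but are assumptions about the real test fixture.)

```python
# Hypothetical sketch: failures driven by image metadata rather than by
# swapping in a different build command for all images.
def fake_disk_image_create(image_metadata):
    """Stand-in for the configurable fake-image-create test command."""
    if image_metadata.get('should_fail'):
        raise RuntimeError('simulated build failure')
    return 'build ok'


images = {
    'image-a': {},                     # should succeed
    'image-b': {'should_fail': True},  # should fail
}

results = {}
for name, meta in images.items():
    try:
        results[name] = fake_disk_image_create(meta)
    except RuntimeError:
        results[name] = 'failed'
```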
openstackgerrit | Merged openstack-infra/nodepool: Use diskimage.name for _checkForScheduledImageUpdates exception https://review.openstack.org/408756 | 18:36 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Make diskimage-builder command configurable for testing https://review.openstack.org/404976 | 18:38 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Make diskimage-builder command configurable for testing https://review.openstack.org/404976 | 18:39 |
pabelanger | k, 756 has landed on disk. going to stop / start nodepool-builder | 18:44 |
pabelanger | http://paste.openstack.org/show/591848/ | 18:47 |
jeblair | okay, i'll poke around in zkshell | 18:52 |
pabelanger | ya, just looking now too. | 18:54 |
pabelanger | /nodepool/images/ubuntu-trusty/builds does exist | 18:54 |
pabelanger | which I think is the path it should be using | 18:55 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Add __repr__ to ConfigValue objects https://review.openstack.org/408776 | 18:57 |
jeblair | cversion is 17, numchildren is 3, 17-3+1=15 which is > 11.... | 18:58 |
jeblair | (i'm poking at zk -- ignore json object decode errors) | 19:01 |
pabelanger | ack | 19:02 |
pabelanger | we should dump threads again too, when finished with zookeeper, nodepool-builder still using 139% CPU. Figured it would have dropped a little | 19:03 |
jeblair | okay, i did a create/rm cycle 2 times and now it works | 19:04 |
jeblair | i no longer think i understand the math about how it picks the next sequence number | 19:04 |
jeblair | i think if we want to be able to understand how to do this in the future, we should probably read some code. | 19:05 |
pabelanger | indeed, I have no idea why it was failing :) | 19:05 |
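(Editorial note: a toy simulation, not real ZooKeeper, of why the numbers jeblair quotes can look inconsistent. ZooKeeper derives a sequential znode's number from the parent's `cversion`, a counter bumped by every child create *and* delete, so deletes keep advancing it while `numchildren` shrinks. This model is a simplification for illustration.)

```python
# Simulate ZooKeeper sequential-znode numbering: the next sequence number
# comes from the parent's cversion (child-change counter), which counts
# deletes as well as creates -- so cversion, numchildren, and the next
# sequence number can diverge after create/delete churn.
class FakeParentZnode:
    def __init__(self):
        self.cversion = 0       # bumped on every child create or delete
        self.children = set()

    def create_sequential(self, prefix):
        name = '%s%010d' % (prefix, self.cversion)
        self.children.add(name)
        self.cversion += 1
        return name

    def delete(self, name):
        self.children.discard(name)
        self.cversion += 1


parent = FakeParentZnode()
names = [parent.create_sequential('build-') for _ in range(3)]
parent.delete(names[0])
# cversion advanced 4 times (3 creates + 1 delete), but only 2 children
# remain -- inspecting the stats afterward no longer tells you the history.
```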
jeblair | pabelanger, Shrews: we *could* just let things proceed and if it crops up again, perform a fix like i just did. i think there's a chance that this may happen a few more times as other images age out and the builders decide they need to be rebuilt. there's also a chance that things could work now but break unexpectedly in the future because of some quirk of the math. so if we want to play it safe and scrap the whole thing and restart, that's ... | 19:07 |
jeblair | ... okay too. | 19:07 |
Shrews | i feel like us not understanding the knobs we are twiddling is a good reason to just start fresh. but we could wait and see what happens, too. | 19:08 |
pabelanger | Ya, I think a fresh start if we are going live with nodepool.o.o is a good idea. But same, if we want to see what happens, I'm sure we'll learn more | 19:09 |
jeblair | okay, let's do a fresh start then. | 19:10 |
jeblair | pabelanger: do you want to do the honors? | 19:10 |
pabelanger | jeblair: sure, how do we want to do the fresh start? | 19:11 |
Shrews | pause everything, delete images, shutdown builder, rmr /nodepool, unpause everything, start builder? | 19:12 |
pabelanger | ya, that's what I've come up with too | 19:13 |
jeblair | Shrews: yeah, i think that's it | 19:13 |
pabelanger | okay, let me put nb01 / nb02 into emergency file so puppet doesn't do things | 19:13 |
Shrews | make sure to hit both nb01 and nb02 | 19:13 |
pabelanger | since we don't have a pause via CLI yet | 19:13 |
Shrews | pabelanger: oh, should probably delete local dib files before the restart, too | 19:15 |
pabelanger | ++ | 19:15 |
jeblair | Shrews: yeah, any that are left that the delete's don't take care of | 19:16 |
Shrews | yeah, hopefully there are none | 19:17 |
pabelanger | okay, deleting centos-7 dibs | 19:22 |
pabelanger | let's see what happens | 19:22 |
clarkb | is there a tldr of what the problem is/was? | 19:25 |
Shrews | pabelanger: image-delete properly did the things? | 19:25 |
Shrews | (since we had a bug there previously) | 19:26 |
pabelanger | Shrews: I believe so, looking at logs now | 19:27 |
Shrews | clarkb: this https://review.openstack.org/407736 was a non-backward compatible change, causing manual intervention via zk-shell | 19:28 |
pabelanger | we have 1 build in progress, ubuntu-xenial | 19:28 |
pabelanger | going to try deleting that | 19:29 |
Shrews | clarkb: that caused the zk sequence nodes (used for build and upload IDs) to stop working | 19:29 |
pabelanger | not sure if it will work | 19:29 |
*** bhavik1 has quit IRC | 19:33 | |
pabelanger | clean up still working | 19:33 |
Shrews | pabelanger: hmm, unlikely to work via CLI | 19:33 |
pabelanger | Ya, we don't have a good way to abort a DIB in progress from CLI | 19:34 |
pabelanger | aside from -9 disk-image-create | 19:35 |
Shrews | pabelanger: hrm, the CLI has a bug. we *should* report that we can't delete it b/c it's in progress. i don't have a lock around it :( | 19:35 |
pabelanger | same goes for uploads in progress, they didn't stop. | 19:37 |
pabelanger | k | 19:37 |
pabelanger | I had to stop nodepool-builder to stop the uploads, started again | 19:38 |
pabelanger | all uploaded images are now gone | 19:38 |
pabelanger | we have 3 DIBs stuck in deleting, but I think we expected that | 19:39 |
pabelanger | both nb01 / nb02 have empty /opt/nodepool_dib | 19:39 |
pabelanger | I think we are good | 19:40 |
pabelanger | okay, both builders stopped | 19:40 |
pabelanger | just for fun, running sudo -H -u nodepool nodepool alien-image-list on nodepool.o.o | 19:41 |
pabelanger | it should be empty | 19:41 |
pabelanger | and it is, no images from nb01 / nb02 | 19:42 |
pabelanger | rmr /nodepool done | 19:43 |
pabelanger | starting builders again | 19:43 |
pabelanger | and removing nb01 / nb02 from emergency file | 19:44 |
Shrews | oh, no. i was wrong. code is fine | 19:45 |
Shrews | can only delete READY builds (or uploads, for that matter) | 19:45 |
Shrews | ugh, no | 19:47 |
pabelanger | and diskimage builds started again | 19:47 |
pabelanger | Shrews: jeblair: we are back online | 19:49 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Check for in progress build/upload in CLI https://review.openstack.org/408794 | 19:49 |
pabelanger | only 2 issues, if we want to call them issues. | 19:49 |
pabelanger | 1) cannot abort inprogress diskimage builds, had to -9 disk-image-create | 19:49 |
pabelanger | 2) cannot abort uploads, had to stop nodepool-builder | 19:50 |
pabelanger | aside from that, dib-image-delete worked great | 19:50 |
Shrews | 794 should at least fix the CLI and it will say you can't do those things | 19:50 |
Shrews | being able to *actually* do those things is a new feature. Could you do that in the old builder? | 19:51 |
Shrews | i don't recall seeing code to handle that (without actually stopping the builder itself) | 19:51 |
pabelanger | no, I don't think we could | 19:51 |
Shrews | oh, so this is feature creep :) | 19:52 |
jeblair | clarkb: note that we chose to go ahead and do 407736 knowing we might have to restart, as an opportunity to potentially learn some things about zk. we did. :) i think we know enough to come up with a better migration plan next time something like this comes up. | 19:52 |
pabelanger | however, that was the first time I've ever had to purge all images in nodepool :) | 19:52 |
Shrews | pabelanger: lol, touche | 19:52 |
pabelanger | but ya, feature creep :) | 19:53 |
jeblair | pabelanger: yeah, i think restarting builder after setting pause in order to abort builds/uploads is the right thing | 19:53 |
pabelanger | Yup, it wasn't too painful | 19:53 |
jeblair | it looks like f23 and f24 failed? | 19:55 |
pabelanger | checking | 19:55 |
jeblair | 2016-12-08 19:50:28,424 INFO nodepool.image.build.fedora-24: Error: Failed to synchronize cache for repo 'updates' | 19:56 |
jeblair | they both say that | 19:56 |
pabelanger | oh, ya | 19:57 |
pabelanger | that happens from time to time | 19:57 |
pabelanger | actually | 19:57 |
pabelanger | 2016-12-08 19:47:43,817 INFO nodepool.image.build.fedora-24: mount: none already mounted or /opt/dib_tmp/dib_build.bIPidWaE/mnt/proc busy | 19:57 |
jeblair | 'info' huh? :) | 19:58 |
jeblair | pabelanger: i think killing dib was the wrong thing :( | 19:58 |
pabelanger | yes | 19:59 |
jeblair | pabelanger: probably should have let the nodepool-builder shutdown handle stopping it | 19:59 |
pabelanger | agreed | 19:59 |
pabelanger | I didn't check the dib_tmp folder | 20:00 |
Shrews | db/lockfile errors seem to be happening frequently today | 20:02 |
pabelanger | http://paste.openstack.org/show/591854/ | 20:03 |
jeblair | pabelanger: i don't understand that error -- that path should be for the current build, not something left over | 20:03 |
pabelanger | when should we expect the cleanup worker to pick up the deleting state? | 20:03 |
pabelanger | jeblair: ya, this might be something in diskimage-builder, http://logs.openstack.org/88/408288/8/check/gate-dib-dsvm-functests-ubuntu-xenial/9c7656f/console.html#_2016-12-08_19_23_48_471594 has the same message | 20:04 |
jeblair | pabelanger: we may never delete those builds from zk since there are no files for them :( -- no builder will think it's responsible for it. | 20:07 |
pabelanger | jeblair: sad face indeed | 20:08 |
pabelanger | will poke at it shortly | 20:09 |
jeblair | we may need to make sure the builder field is set correctly and then use that to determine whether a given builder should delete the zk record | 20:10 |
jeblair | (it looks like the builder got unset when the state changed to deleting) | 20:10 |
Shrews | touch fedora-23-0000000001.raw should do it | 20:11 |
jeblair | agreed | 20:11 |
pabelanger | k, trying that | 20:12 |
pabelanger | yup, worked | 20:13 |
*** jamielennox|away is now known as jamielennox | 20:13 | |
pabelanger | did fedora-24 too | 20:14 |
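(Editorial note: a sketch of the placeholder-file workaround Shrews suggests above. The premise, per the discussion, is that the cleanup worker only claims a build's ZooKeeper record if it finds a matching local image file, so touching an empty file with the expected name lets the builder delete the orphaned record. The directory layout and helper name here are assumptions for illustration.)

```python
# Hypothetical sketch: create a zero-byte placeholder image file so the
# builder thinks it owns the stuck build record and cleans it up.
import tempfile
from pathlib import Path


def touch_placeholder(images_dir, image_name, build_id, ext='raw'):
    # Filename pattern mirrors the one in the chat: fedora-23-0000000001.raw
    path = Path(images_dir) / ('%s-%010d.%s' % (image_name, build_id, ext))
    path.touch()  # an empty file is enough for the existence check
    return path


workdir = tempfile.mkdtemp()
p = touch_placeholder(workdir, 'fedora-23', 1)
```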
Shrews | i'm not sure what the code path is that would NOT set the hostname | 20:17 |
Shrews | or would remove it | 20:17 |
pabelanger | Shrews: Would it be line 451 that uses a new ImageBuild()? over updating the existing data? | 20:20 |
pabelanger | in builder.py | 20:20 |
jeblair | pabelanger: yep | 20:20 |
Shrews | oh, i see it | 20:20 |
Shrews | yes | 20:20 |
pabelanger | cool | 20:21 |
Shrews | that *should* just be reusing 'build' | 20:21 |
Shrews | fix coming | 20:21 |
pabelanger | k | 20:22 |
pabelanger | getting a coffee | 20:22 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Re-use build data when we set for DELETING https://review.openstack.org/408808 | 20:24 |
jeblair | ianw: do you know what went wrong with this build? http://nb01.openstack.org/dib.fedora-23.log | 20:24 |
jeblair | ianw: same thing happened to f24 i think | 20:24 |
ianw | jeblair: 2016-12-08 19:48:19,954 INFO nodepool.image.build.fedora-23: Error: Failed to synchronize cache for repo 'updates' | 20:26 |
jeblair | Shrews: should we lock around that? | 20:26 |
jeblair | ianw: yeah, that looked suspicious to me. but i don't understand what that means or why it would happen 3 times on 2 hosts. | 20:26 |
ianw | i think upstream mirror issues. we had a couple of failures in dib CI similar | 20:26 |
jeblair | ianw: okay, so hopefully will be fixed by the time the builders cycle around to them again | 20:27 |
ianw | jeblair: yeah, i think so. i really need to get rid of f23! keep getting sidetracked | 20:27 |
Shrews | jeblair: that really shouldn't be necessary as it's not a current build or in progress build | 20:29 |
Shrews | we short-circuit the loop if it is | 20:29 |
jeblair | Shrews: yeah, but any builder can choose to mark it as deleting, so i'm wondering about two builders doing that at the same time, with the second one possibly issuing the store call after the first one actually deleted the znode. | 20:30 |
Shrews | jeblair: good point | 20:31 |
jeblair | Shrews: (also, we should probably only store it if the state has changed so we don't do so more than necessary) | 20:32 |
Shrews | jeblair: how do you mean? | 20:33 |
Shrews | oh, if it's not already DELETING | 20:33 |
jeblair | ya | 20:33 |
Shrews | k | 20:33 |
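(Editorial note: a sketch combining the two points from the exchange above: serialize the DELETING transition behind a lock, re-read the record inside it, and skip the store when the record is gone or already DELETING, so a second builder can't re-store after the first deleted the znode. The client class and method names are invented stand-ins, not the real nodepool ZooKeeper API.)

```python
# Hypothetical sketch: guarded, idempotent state change to DELETING.
import threading

DELETING = 'deleting'


class FakeZK:
    """Stand-in for the real ZooKeeper-backed build store."""
    def __init__(self):
        self._data = {'state': 'ready'}
        self._lock = threading.Lock()
        self.stores = 0  # count writes to show the short-circuit working

    def get_build(self):
        return dict(self._data) if self._data is not None else None

    def store_build(self, data):
        self.stores += 1
        self._data = data

    def mark_deleting(self):
        with self._lock:  # serialize competing builders
            build = self.get_build()
            if build is None or build['state'] == DELETING:
                return False  # record gone, or someone else got here first
            build['state'] = DELETING
            self.store_build(build)
            return True


zk = FakeZK()
first = zk.mark_deleting()
second = zk.mark_deleting()
# The first call stores once; the second sees DELETING and skips the write.
```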
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/zuul: Fix retry accounting off-by-one bug https://review.openstack.org/408814 | 20:40 |
SpamapS | jeblair: ^ found it ;) | 20:40 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Re-use build data when we set for DELETING https://review.openstack.org/408808 | 20:42 |
jeblair | SpamapS: \o/ | 20:58 |
SpamapS | jeblair: hey, looking at test_swift_instructions .. looks like that will need some glue written. That's still a thing in v3, yes? | 21:07 |
jeblair | SpamapS: yes, there's a little bit in the spec about moving it to the new auth section of the job config. so yeah, will require some re-plumbing. | 21:26 |
jeblair | SpamapS: it looks like the master branch has tries as base 1 as well? http://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/model.py#n675 | 21:29 |
SpamapS | jeblair: entirely possible. I did not look at master, I looked at why we gave up early in v3. | 21:53 |
SpamapS | jeblair: Since I have 0 familiarity with master, it's a lot harder for me to go back that far (though I have done it a bit lately) | 21:54 |
SpamapS | much simpler for me to try and grok the test, and make sure v3 does what the test wanted and the spec wants. :-P | 21:54 |
SpamapS | so my commit message does in fact contain an assumption that may be untrue | 21:56 |
SpamapS | equally possible is that in master we don't set tries until we've _tried_ once... and in v3 we set it as part of the model creation | 21:57 |
* SpamapS should be more careful with words | 21:57 | |
jeblair | SpamapS: yeah, maybe this is the right fix (it seems to make what we expect happen)... i probably won't be able to sleep until i know what changed so i can update my mental map, but i'm happy to dig into that since that might just be a personal character flaw. :) | 21:59 |
SpamapS | jeblair: I get the benefit of having no base on which to build my anxiety .. my condolences to your sleep ;) | 22:03 |
*** jamielennox is now known as jamielennox|away | 22:29 | |
*** jamielennox|away is now known as jamielennox | 22:34 | |
*** saneax-_-|AFK is now known as saneax | 23:12 | |
openstackgerrit | James E. Blair proposed openstack-infra/zuul: WIP triggers https://review.openstack.org/408848 | 23:25 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul: WIP organize connections into drivers https://review.openstack.org/408849 | 23:25 |
jeblair | jamielennox, jhesketh: ^ that isn't remotely ready for review; i'm literally about halfway into the change. i'm mostly pushing it up so i don't lose it. the commit message is literally my shorthand notes. don't worry if you can't make heads or tails of it -- i think it will be clear when i'm done (and i'll add nice docstrings). but basically i think that direction gives us the ability to have nice singleton objects (drivers, connections) for ... | 23:29 |
jeblair | ... the things we don't want lots of copies of, but also the ability to have lots of per-tenant/pipeline objects (triggers, reporters, etc). and to nicely organize all the supporting code. i think it will make it easier to make extensible using entrypoints, and provide a nice foundation for new drivers (github, sqlalchemy, etc) | 23:29 |
jeblair | jamielennox, jhesketh: i have to run now, but i think i can finish that change tomorrow and maybe have it ready to look at next week | 23:30 |
jlk | oh sorry, I already peeked at it :) | 23:31 |
jeblair | jlk: np :) but hopefully that ^ explains the 'print' statements remaining :) | 23:32 |
jlk | yup well, that and the giant "WIP" | 23:32 |
jlk | and part of commenting was so that I see updates, since I've been staring at the 2.5 + github code | 23:32 |
jlk | and am eager to get to porting it over :) | 23:32 |
*** jamielennox is now known as jamielennox|away | 23:49 | |
*** Cibo_ has quit IRC | 23:58 | |
*** Cibo_ has joined #zuul | 23:59 |
Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!