jeblair | jamielennox: i think the runtime would be the same even if we make the triggers smart enough to act like that -- because ultimately the goal of project-change-merged is to put every open change for a project in the queue | 00:00 |
---|---|---|
jeblair | jamielennox: so whether that's one event which enqueues multiple changes (your suggestion), or multiple events each enqueuing one change (current), we still need to go ask gerrit for all the changes | 00:01 |
*** yolanda has joined #zuul | 00:01 | |
*** saneax is now known as saneax-_-|AFK | 00:01 | |
jeblair | the current approach at least has the advantage of matching up pretty well with the "events map directly to a ref/change" idea, so while creating the synthetic events is weird, the actual event matching is very simple and behaves like the gerrit trigger | 00:02 |
*** pabelanger_ has joined #zuul | 00:05 | |
*** yolanda has quit IRC | 00:05 | |
pabelanger_ | webchat FTW | 00:06 |
pabelanger_ | can | 00:06 |
pabelanger_ | err | 00:06 |
pabelanger_ | can't get to IRC proxy atm | 00:06 |
pabelanger_ | ianw: mordred: rbergeron: harlowja: I'm hoping to show up next week with a CORS repo for zookeeper for EPEL7 next week. Like ianw said, not a priority for anybody on internal list. | 00:08 |
harlowja | cool | 00:08 |
mordred | awesome | 00:09 |
pabelanger_ | if that goes well, I'll see what is needed to sync into EPEL7 | 00:09 |
harlowja | sweet | 00:09 |
pabelanger_ | otherwise, roll the CORS repo until centos8? | 00:09 |
mordred | yah - sounds like a good plan | 00:09 |
pabelanger_ | since rawhide has it | 00:09 |
mordred | pabelanger_: is a CORS repo like a PPA? | 00:09 |
pabelanger_ | ya | 00:09 |
mordred | neat | 00:09 |
mordred | pabelanger_: I, for one, welcome our CORS repo overlords | 00:10 |
ianw | pabelanger_: yeah, just working with ggillies on it a bit right now :) | 00:10 |
* mordred hands pies to pabelanger_ and ianw | 00:10 | |
ianw | i think pabelanger_ means COPR | 00:10 |
pabelanger_ | my other thought, was just to roll the packaging in openstack-infra, using the same process as zigo | 00:10 |
pabelanger_ | oh, haha, ya | 00:10 |
pabelanger_ | that | 00:10 |
pabelanger_ | COPR | 00:10 |
pabelanger_ | now I disappear again | 00:11 |
*** pabelanger_ has quit IRC | 00:11 | |
*** yolanda has joined #zuul | 00:19 | |
*** yolanda has quit IRC | 00:23 | |
*** yolanda has joined #zuul | 00:26 | |
*** yolanda has quit IRC | 00:29 | |
*** yolanda has joined #zuul | 00:30 | |
jamielennox | jeblair: sorry to keep coming in and out, any advice on where this event searching would live? | 00:35 |
jeblair | jamielennox: i think the event searching is currently in the gerrit connection and can stay there. that's called by the onChangeMerged method which i think should move from zuultrigger into the scheduler | 00:36 |
*** yolanda has quit IRC | 00:38 | |
jamielennox | so wouldn't the event searching have to be common across connections? | 00:38 |
*** yolanda has joined #zuul | 00:44 | |
*** yolanda has quit IRC | 00:49 | |
jeblair | jamielennox: yeah, i think this needs to be scoped to the source/connection associated with the originating pipeline. i'm starting to think that maybe you should kick this over to me for a bit -- i keep typing and deleting things i'm not 100% sure of and i'm afraid i may be about to send you astray. | 00:51 |
jamielennox | jeblair: i'm fine to let you play with it for a while | 00:52 |
jamielennox | there does seem to be scoping problems going on that might be bigger issues | 00:52 |
jeblair | jamielennox: yeah, and i think this might have an "interesting" interaction with dynamic layouts | 00:52 |
jamielennox | also for our usage i'm pretty sure we'd be fine with just deleting the zuultrigger and adding it back if and when it's required | 00:52 |
jeblair | jamielennox: anyway, i'll poke at it tomorrow and see where i get | 00:53 |
jamielennox | jeblair: ok, i'll leave it with you | 00:53 |
jeblair | jamielennox: did you grab a test in storyboard? | 00:55 |
jamielennox | jeblair: i don't think so for zuultrigger, i was trying to see if i knew enough to fix it first then went down the rabbithole | 00:56 |
jeblair | ok, i'll grab it tomorrow then | 00:57 |
jamielennox | night - and thanks | 00:57 |
*** yolanda has joined #zuul | 00:59 | |
*** yolanda has quit IRC | 01:07 | |
*** jamielennox is now known as jamielennox|away | 01:07 | |
*** jamielennox|away is now known as jamielennox | 01:21 | |
*** yolanda has joined #zuul | 01:31 | |
*** yolanda has quit IRC | 01:33 | |
*** yolanda has joined #zuul | 01:34 | |
*** yolanda has quit IRC | 01:36 | |
*** yolanda has joined #zuul | 01:37 | |
*** yolanda has quit IRC | 01:41 | |
*** yolanda has joined #zuul | 01:53 | |
*** yolanda has quit IRC | 01:58 | |
*** yolanda has joined #zuul | 02:10 | |
Shrews | jeblair: my idea for 406411 wasn't really great. let's go with PS10 | 02:28 |
*** yolanda has quit IRC | 02:33 | |
*** yolanda has joined #zuul | 02:34 | |
*** hogepodge has quit IRC | 02:38 | |
*** yolanda has quit IRC | 02:42 | |
*** yolanda has joined #zuul | 02:54 | |
*** yolanda has quit IRC | 02:58 | |
*** yolanda has joined #zuul | 03:05 | |
*** yolanda has quit IRC | 03:11 | |
*** yolanda has joined #zuul | 03:11 | |
*** yolanda has quit IRC | 03:16 | |
*** yolanda has joined #zuul | 03:28 | |
*** yolanda has quit IRC | 03:33 | |
*** yolanda has joined #zuul | 03:34 | |
*** yolanda has quit IRC | 03:40 | |
*** yolanda has joined #zuul | 03:47 | |
*** yolanda has quit IRC | 03:51 | |
*** yolanda has joined #zuul | 04:05 | |
*** yolanda has quit IRC | 04:10 | |
*** Cibo_ has joined #zuul | 04:17 | |
*** yolanda has joined #zuul | 04:22 | |
*** yolanda has quit IRC | 04:26 | |
*** yolanda has joined #zuul | 04:29 | |
*** yolanda has quit IRC | 04:35 | |
*** yolanda has joined #zuul | 04:37 | |
*** yolanda has quit IRC | 04:41 | |
*** yolanda has joined #zuul | 04:57 | |
*** yolanda has quit IRC | 05:02 | |
*** yolanda has joined #zuul | 05:14 | |
*** mgagne has quit IRC | 05:15 | |
*** morgan has quit IRC | 05:15 | |
*** tflink has quit IRC | 05:16 | |
*** saneax-_-|AFK has quit IRC | 05:16 | |
*** jamielennox has quit IRC | 05:16 | |
*** yolanda has quit IRC | 05:19 | |
*** tflink has joined #zuul | 05:21 | |
*** morgan has joined #zuul | 05:23 | |
*** saneax-_-|AFK has joined #zuul | 05:27 | |
*** jamielennox has joined #zuul | 05:31 | |
*** yolanda has joined #zuul | 06:12 | |
*** saneax-_-|AFK is now known as saneax | 06:24 | |
*** yolanda has quit IRC | 06:28 | |
*** yolanda has joined #zuul | 06:30 | |
*** abregman has joined #zuul | 06:31 | |
*** yolanda has quit IRC | 06:37 | |
*** yolanda has joined #zuul | 06:40 | |
*** yolanda has quit IRC | 06:45 | |
*** yolanda has joined #zuul | 06:46 | |
*** willthames has quit IRC | 06:52 | |
*** yolanda has quit IRC | 07:10 | |
*** jamielennox is now known as jamielennox|away | 07:11 | |
*** yolanda has joined #zuul | 07:14 | |
*** yolanda has quit IRC | 07:23 | |
*** yolanda has joined #zuul | 07:24 | |
openstackgerrit | Joshua Hesketh proposed openstack-infra/nodepool: Merge branch 'master' into feature/zuulv3 https://review.openstack.org/407923 | 08:23 |
*** Cibo_ has quit IRC | 09:35 | |
*** bhavik1 has joined #zuul | 09:48 | |
*** Cibo_ has joined #zuul | 09:50 | |
*** mgagne has joined #zuul | 10:47 | |
*** mgagne is now known as Guest2615 | 10:47 | |
*** bhavik1 has quit IRC | 10:48 | |
*** openstackgerrit has quit IRC | 11:32 | |
*** hashar has joined #zuul | 11:51 | |
*** hashar_ has joined #zuul | 11:54 | |
*** hashar has quit IRC | 11:57 | |
*** hashar_ is now known as hashar | 13:33 | |
*** Guest2615 is now known as mgagne | 13:54 | |
*** mgagne has quit IRC | 13:54 | |
*** mgagne has joined #zuul | 13:54 | |
*** saneax is now known as saneax-_-|AFK | 14:16 | |
*** abregman has quit IRC | 14:55 | |
*** yolanda has quit IRC | 14:59 | |
*** yolanda has joined #zuul | 14:59 | |
*** openstackgerrit has joined #zuul | 15:25 | |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Add --checksum support to disk-image-create https://review.openstack.org/406411 | 15:25 |
pabelanger | jeblair: Shrews: clarkb: revert back to PS10^ for --checksum | 15:25 |
*** rcarrillocruz has quit IRC | 15:31 | |
Shrews | pabelanger: +1'd | 15:31 |
Shrews | pabelanger: btw, what's the git or gerrit magic to easily revert to a previous patchset? | 15:32 |
Shrews | oh, maybe review -m | 15:34 |
clarkb | git review -d change,patchset && git commit --amend #change something because gerrit && git review | 15:34 |
pabelanger | ya | 15:35 |
pabelanger | I've been known to cherry-pick the previous patchset too from gerrit ui | 15:36 |
*** hashar is now known as hasharAway | 15:36 | |
mordred | fascinating - gerrit has re-applied the votes from clarkb and jeblair from ps10 to ps12 | 15:40 |
clarkb | yup I learned it does that when it applied a -1 of mine when an old patchset was pushed and I couldnt figure out why | 15:42 |
clarkb | "I didnt -1 this patchset" later "oh its that old patchset again that I -1'd" | 15:42 |
*** hogepodge has joined #zuul | 15:50 | |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul: Add reset of watchdog timeout flag https://review.openstack.org/408194 | 15:59 |
*** abregman has joined #zuul | 16:04 | |
pabelanger | okay, doing some ops things with nb01 and nb02 | 16:11 |
pabelanger | image-build ubuntu-precise: build started on nb02 | 16:11 |
pabelanger | image-build ubuntu-trusty: build started on nb01 | 16:11 |
pabelanger | image-build ubuntu-xenial: nothing listed in dib-image-list | 16:11 |
pabelanger | looks like we don't display pending builds | 16:12 |
pabelanger | only active | 16:12 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul: Add reset of watchdog timeout flag https://review.openstack.org/408194 | 16:16 |
rbergeron | ianw / pabelanger / mordred: re: zookeeper -- i find it odd that someone is ... an approver / owner for it for epel (different from the owner in fedora, which isn't unheard of) -- but not sure if anyone has pinged that human to see if he's ... going to ever do anything on that front or not. | 16:17 |
rbergeron | but i can rustle up package approvers and all that faiiiirly easily | 16:17 |
pabelanger | Right, I think what i was looking for, was to get zookeeper package into some product manager's pipeline at Red Hat, and having some team become responsible for it. After some emails, it's now clear, no such team exists. The current roadmap is tooz and etcd | 16:18 |
pabelanger | so, guess ianw and I will push on zookeeper package ourselves and see where to maintain it | 16:19 |
pabelanger | COPR for now, maybe into epel7 | 16:19 |
pabelanger | Shrews: ^ questions on image-build when you are free | 16:23 |
openstackgerrit | Monty Taylor proposed openstack-infra/zuul: Add reset of watchdog timeout flag https://review.openstack.org/408194 | 16:26 |
*** abregman_ has joined #zuul | 16:28 | |
*** abregman has quit IRC | 16:31 | |
mordred | pabelanger, rbergeron: well - not to take over the channel with red hat things completely- but what I got was that it's not on any _openstack_ team's radar | 16:38 |
mordred | but I'm not convinced that there is no interest anywhere in the company with zookeeper, kafka or mesos (kafka and mesos both use zk too) | 16:38 |
mordred | pabelanger: so we may still be able to find some product team somewhere - maybe over in jboss land? | 16:39 |
Shrews | pabelanger: no, pending image builds are not displayed. the only things stored in ZK are active or past builds | 16:52 |
Shrews | pabelanger: but honestly, pending builds should not be pending for long. builders should build them as soon as it notices they need building | 16:54 |
pabelanger | in our case, it takes about 1h20m to do a build | 16:54 |
pabelanger | only issue I have, I don | 16:55 |
pabelanger | err | 16:55 |
Shrews | i'm not sure what that has to do with it | 16:55 |
pabelanger | only issue I have, I don't actually have a way to confirm builds are queued up, without using zk-shell (found the key it sets) | 16:55 |
Shrews | pabelanger: are you saying that a build in the 'building' state is not being displayed, even though it is being built? | 16:56 |
pabelanger | Shrews: no, other way around. I have 2 building images, but issues image-build 3 times | 16:56 |
pabelanger | issued* | 16:57 |
*** hasharAway has quit IRC | 16:58 | |
Shrews | pabelanger: i'm failing to grok something. so, you issued 'image-build ubuntu-xenial', it is not actually building, but you see the build request node for it in zk-shell? is that correct? | 16:59 |
pabelanger | Shrews: okay, give me a sec, I'll get a pastebin | 17:00 |
Shrews | k. thx | 17:00 |
pabelanger | current value of dib-image-list: http://paste.openstack.org/show/591687/ | 17:01 |
pabelanger | Shrews: from that, we don't actually know we have a pending ubuntu-xenial build | 17:02 |
pabelanger | using zk-shell, I can tell: http://paste.openstack.org/show/591688/ | 17:02 |
pabelanger | since I think that is the key we use to trigger it | 17:02 |
pabelanger | the issue, I see, if clarkb can along and looked at dib-image-list, he would have no idea I've already queued up the ubuntu-xenial build | 17:03 |
pabelanger | came* | 17:03 |
Shrews | pabelanger: is that image paused? | 17:04 |
pabelanger | Hmm, | 17:04 |
pabelanger | I don't think so | 17:04 |
pabelanger | let me check | 17:04 |
pabelanger | no | 17:05 |
Shrews | pabelanger: then that is odd. if there is a builder thread free to build it (i'm assuming there is), then seems like a bug since the request should not be unhandled for very long | 17:06 |
jeblair | Shrews, pabelanger: both builders are currently occupied | 17:07 |
pabelanger | yes | 17:07 |
jeblair | based on that pastebin | 17:07 |
pabelanger | nb01 is almost done | 17:07 |
Shrews | how many build threads though? | 17:07 |
pabelanger | 1 per server | 17:07 |
jeblair | Shrews: 1 per build machine | 17:07 |
Shrews | OH! | 17:07 |
Shrews | well, yeah | 17:07 |
Shrews | i assumed we used at least the default # of build workers | 17:08 |
Shrews | which is 4 IIRC | 17:08 |
jeblair | Shrews: i think that is the default | 17:08 |
jeblair | Shrews: 4 is uploaders (we use 16) | 17:08 |
pabelanger | I don't think diskimage-builder will support parallel builds | 17:08 |
Shrews | pabelanger: really? wow | 17:09 |
pabelanger | Shrews: I believe so, I'll have to check again | 17:09 |
Shrews | then perhaps we shouldn't support more than one build worker? or at least warn about it | 17:10 |
clarkb | dib can do it so I think its worth having the option | 17:10 |
clarkb | you just have to be extremely careful doing it | 17:10 |
clarkb | mostly in your use of the cache | 17:10 |
clarkb | (tl;dr its mostly up to your elements not dib itself which should be fine as it builds things with all sorts of unique ids and properly loopbacks etc) | 17:11 |
jeblair | so that's probably the warning we should put in the docs :) | 17:11 |
pabelanger | I found this old blueprint a few weeks ago: https://blueprints.launchpad.net/tripleo/+spec/tripleo-diskimage-builder-parallel-builds that's what I am taking my cue from | 17:11 |
Shrews | pabelanger: so, yeah, unhandled manual build requests will not show up. Would be easy enough to add, but there would be no other information in the output other than image name and some sort of new "pending" status | 17:13 |
Shrews | so, if that's useful, we can add it | 17:14 |
pabelanger | okay, something to think about. Not a blocker | 17:16 |
pabelanger | ubuntu-xenial is now building too | 17:16 |
pabelanger | okay, not to stop zookeeper | 17:16 |
pabelanger | now* | 17:16 |
jeblair | yeah, it's an 'image' attribute, not a 'build' attribute. i'm thinking we may need another set of commands for those... | 17:16 |
jeblair | (like image-list, build-list, upload-list) | 17:17 |
clarkb | maybe even possibly a summarize command too? | 17:18 |
clarkb | summarize ubuntu-xenial outputs "pending: 1 builds: 2 uploaded_to: rax-dfw, ovh-gra1, osic-cloud1" or something | 17:19 |
pabelanger | zookeeper stopped / started | 17:19 |
pabelanger | now to see what happened | 17:19 |
*** abregman_ has quit IRC | 17:28 | |
*** adam_g has quit IRC | 17:33 | |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Clean up exception message to use image / provider name https://review.openstack.org/408239 | 17:47 |
pabelanger | Shrews: what we saw on nb02 when zookeeper was stopped / started: http://paste.openstack.org/show/591694/ | 17:58 |
pabelanger | currently waiting to see if all ubuntu-precise images will be uploaded | 17:58 |
*** adam_g has joined #zuul | 18:00 | |
jeblair | Shrews: see reply on 408239 | 18:01 |
Shrews | jeblair: see my reply to myself :) | 18:02 |
jeblair | Shrews: aha :) | 18:02 |
Shrews | pabelanger: that looks normal. i'm very interested in what happens when ZK is killed during a build or during an upload. | 18:02 |
pabelanger | Shrews: that's what i did for this test, we had 2 builds going and uploads. Was stopped for 15 seconds | 18:03 |
pabelanger | so far, I don't see problems | 18:03 |
Shrews | pabelanger: should see something when the upload actually completes | 18:04 |
Shrews | pabelanger: because our upload lock *should* be lost | 18:04 |
pabelanger | hmm | 18:04 |
pabelanger | 2016-12-07 17:28:41,710 INFO nodepool.builder.UploadWorker.7: Image build ubuntu-precise-0000000011 in infracloud-vanilla is ready | 18:05 |
Shrews | pabelanger: did you stop zk gracefully or -9 it? | 18:05 |
pabelanger | that is the first image uploaded after stop / start | 18:05 |
pabelanger | graceful | 18:05 |
pabelanger | stop / start | 18:05 |
pabelanger | I can do -9 next | 18:05 |
jeblair | let's etherpad this: https://etherpad.openstack.org/p/nN69gpYoAO | 18:05 |
Shrews | ah, that could be different. i think zk does some storing of session state (at least, that's what it looked like when i played around with it) | 18:06 |
Shrews | so graceful might not break locks | 18:06 |
pabelanger | let me add some logs | 18:06 |
jeblair | pabelanger: i don't see an UploadWorker.7 from before the shutdown | 18:07 |
openstackgerrit | Merged openstack-infra/nodepool: Add --checksum support to disk-image-create https://review.openstack.org/406411 | 18:08 |
jeblair | pabelanger: oh are you on nb01 or 02? | 18:08 |
pabelanger | jeblair: ya, this is nb02 | 18:08 |
jeblair | ah ok, that's better | 18:08 |
jeblair | pabelanger, Shrews: i agree, it looks like uploadworker.7 survived the graceful zk restart with an upload in progress | 18:12 |
jeblair | neat :) | 18:12 |
pabelanger | I don't actually think UploadWorker.11 was actually uploading anything | 18:13 |
Shrews | anyone know offhand how long until a session is considered expired by zk/kazoo? those SUSPENDED messages indicate it had NOT expired | 18:13 |
pabelanger | I am not sure | 18:13 |
Shrews | wondering if the results would be different if we waited longer to restart | 18:14 |
*** adam_g has quit IRC | 18:14 | |
pabelanger | For sure, we should do that too. | 18:14 |
jeblair | pabelanger: yeah, uploadworker.11 (and the others) were likely in their poll loop looking for things to upload | 18:16 |
jeblair | harlowja: ^ do you know the answer to Shrews question? | 18:17 |
Shrews | kazoo code seems to indicate 10s | 18:17 |
pabelanger | jeblair: on nb01, you can see uploadworker.06 has an exception, but just after zookeeper is back online, it finds an image | 18:17 |
*** adam_g has joined #zuul | 18:18 | |
jeblair | pabelanger: yeah, i think that agrees with what i said | 18:19 |
pabelanger | \o/ | 18:19 |
harlowja | jeblair not sure, ha | 18:20 |
pabelanger | just waiting for rax-ord ubuntu-precise image to come online, if that happens we are good | 18:20 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool: Delete builds when diskimage removed from config https://review.openstack.org/400421 | 18:21 |
jeblair | Shrews: can you take a quick look at https://review.openstack.org/407124 ? | 18:22 |
Shrews | jeblair: seems harmless | 18:23 |
Shrews | jeblair: fyi, 407736 to pluralize things is fine, but it isn't backwards compatible | 18:26 |
Shrews | like, pabelanger couldn't just use a new builder with that. the current ZK nodes would need to change, or else start fresh | 18:26 |
jeblair | Shrews: yeah; i could do a bunch of backwards compat code, or we could just manually fix the production zk db, or start over | 18:27 |
jeblair | i assume we're still the only production db :) | 18:27 |
jeblair | i'd be happy to shepherd that through | 18:27 |
Shrews | i don't mind either way. just mentioning it | 18:27 |
jeblair | i'll leave a note on the review to let me approve it and i'll fix up the db when it lands | 18:28 |
openstackgerrit | Merged openstack-infra/nodepool: Sort images and providers in zookeeper https://review.openstack.org/407124 | 18:28 |
openstackgerrit | Merged openstack-infra/nodepool: Merge branch 'master' into feature/zuulv3 https://review.openstack.org/407923 | 18:34 |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool: Don't use taskmanagers in builder https://review.openstack.org/405663 | 18:34 |
openstackgerrit | Merged openstack-infra/zuul: Add reset of watchdog timeout flag https://review.openstack.org/408194 | 18:43 |
openstackgerrit | Merged openstack-infra/nodepool: Fix zookeeper config in test fixture https://review.openstack.org/407632 | 18:47 |
pabelanger | i'll restart nodepool-builder here shortly, once the git repo is updated on disk | 18:56 |
pabelanger | aside from that, I don't think I have any outstanding issues right now | 18:57 |
pabelanger | \o/ | 18:57 |
jeblair | pabelanger: did you want to perform more zk kill tests? | 18:58 |
pabelanger | jeblair: not yet, if we are good with first round, I'll start another build then -9 zookeeper | 18:58 |
pabelanger | let me pick up latest code first | 18:58 |
Shrews | pabelanger: lol, your exception cleanup change failed in all sorts of fantastic ways | 19:01 |
Shrews | none of which were related to your patch | 19:01 |
pabelanger | nice, didn't see that | 19:01 |
jeblair | yeah was just going through those -- regular tests failed with the mysql db timeout thing | 19:02 |
pabelanger | oh, pep8 failed because of bhs1 mirror issue from this morning | 19:02 |
jeblair | (it's not as obvious now, but i think it's that issue because it hit during a lockfile action in the cleanup) | 19:02 |
pabelanger | nods | 19:03 |
Shrews | is that NodeDeleter exception common? | 19:03 |
jeblair | and i think the coverage job failed with two similar timeouts | 19:04 |
SpamapS | jeblair: fungi Can we talk about PTG space for Zuul some time soon? | 19:04 |
jeblair | Shrews: the FakeError? | 19:04 |
SpamapS | Or has it already been arranged? | 19:04 |
jeblair | er "Fake Error" | 19:04 |
* SpamapS isn't sure where to look. | 19:04 | |
Shrews | jeblair: yeah | 19:04 |
Shrews | jeblair: oh, that one is expected, i guess | 19:05 |
jeblair | Shrews: i think that's a test simulation | 19:05 |
Shrews | yeah | 19:05 |
* Shrews should have looked at the name of the test more closely | 19:05 | |
fungi | SpamapS: we have ptg space for infra, and i'm happy for much of that to be devoted to working on zuul | 19:06 |
fungi | it's by far our most complex deliverable now | 19:06 |
jeblair | that seems useful to me | 19:06 |
jeblair | i *hope* we'll be at a point where we'll be ready to do some work on planning the openstack rollout. if we're not there, then we will still have plenty of work to do. | 19:08 |
fungi | yeah, either way we won't be done with this undertaking by time for the ptg | 19:09 |
fungi | so it's safe to say there will be plenty of zuulishness to be had there | 19:10 |
fungi | given how far along it's likely to be, i expect we'll want it to be the primary focus for our infra days anyway | 19:10 |
Shrews | fungi: your optimism is amusing to me | 19:13 |
Shrews | (but hopefully it *will* be far along) | 19:13 |
jeblair | SpamapS: are you looking for setting an agenda now, or agreement that "yeah, we do enough zuul things to warrant attending"? | 19:13 |
fungi | it's a survival mechanism | 19:13 |
SpamapS | jeblair: I'm justifying travel budgets right now. | 19:14 |
SpamapS | Which means I need to make sure people have space and something important to do while there. :) | 19:14 |
fungi | SpamapS: i will make sure if you come you can spend as much time collaborating on zuul as you want (at least for the horizontal team days where infra gets a space) | 19:14 |
SpamapS | It will help if agendas start to show zuul when there are, in fact, agendas. :) | 19:15 |
mordred | Shrews: I figure the node launcher work is gonna take like ... a week, right? | 19:15 |
Shrews | mordred: your optimism is even MORE amusing than fungi's | 19:15 |
fungi | SpamapS: ttx seems to want us to play a little fast and loose with "agendas" for this, but to the degree that we can declare the things we intend to work on while we're there i'll make certain zuul gets a very prominent spot on that list | 19:16 |
fungi | that is unless i'm replaced as ptl before the ptg anyway ;) | 19:17 |
pabelanger | fungi: by a robot fungi? | 19:17 |
fungi | in which case i will merely strongly advise that it should be a major focus | 19:17 |
* fungi thought he _was_ the robot fungi | 19:18 | |
pabelanger | doh | 19:18 |
openstackgerrit | Merged openstack-infra/nodepool: Clean up exception message to use image / provider name https://review.openstack.org/408239 | 19:25 |
*** rcarrillocruz has joined #zuul | 19:47 | |
pabelanger | okay, nodepool-builder restarted on nb01.o.o / nb02.o.o | 19:51 |
jeblair | Shrews: https://review.openstack.org/405663 is ready now | 19:52 |
jeblair | wait maybe not | 19:52 |
jeblair | just noticed the nv tests are failing | 19:52 |
*** hashar has joined #zuul | 19:53 | |
openstackgerrit | James E. Blair proposed openstack-infra/nodepool: Don't use taskmanagers in builder https://review.openstack.org/405663 | 19:58 |
*** adam_g_ has joined #zuul | 20:06 | |
pabelanger | okay, all the leaked checksum files have been manually cleaned up | 20:38 |
*** harlowja has quit IRC | 20:41 | |
jeblair | pabelanger: is now a good time for me to stop the builders and do the manual zk work? | 20:54 |
pabelanger | jeblair: yes, feel free | 20:56 |
jeblair | pabelanger: looks like i stopped nb01 in the middle of a dib run | 20:57 |
jeblair | dib is still running, however... i think we normally expect it to stop | 20:58 |
pabelanger | indeed, looks like fedora-24 is doing checksums now | 20:58 |
jeblair | it's stopped now | 20:59 |
jeblair | maybe it just needed to finish the checksum programs | 20:59 |
pabelanger | ya | 20:59 |
jeblair | "cd nodepool" "cp image images true" "rmr image" is what i'm doing for the move | 21:02 |
openstackgerrit | Merged openstack-infra/nodepool: Pluralize zk nodes with children https://review.openstack.org/407736 | 21:02 |
jeblair | the 'true' in the cp command means 'recursive' | 21:02 |
Shrews | jeblair: hmmm, wonder how that affects sequence node numbers | 21:02 |
Shrews | if at all | 21:02 |
jeblair | Shrews: good question, we should look for that in the next build | 21:03 |
pabelanger | I tried using dstat logs on https://review.openstack.org/#/c/405663 but didn't see much difference with a single provider. But +2'd | 21:05 |
jeblair | pabelanger: hrm, it should remove about 250 threads | 21:06 |
jeblair | pabelanger: oh, *single provider* | 21:07 |
pabelanger | ya | 21:07 |
jeblair | pabelanger: yes, there will be very little difference in that case. | 21:07 |
pabelanger | nods | 21:07 |
pabelanger | that is what I figured | 21:07 |
jeblair | pabelanger: threads = providers * workers | 21:07 |
pabelanger | k | 21:08 |
jeblair | mordred just approved that so i'll wait till it lands to restart builders | 21:08 |
pabelanger | Yay | 21:08 |
jeblair | (which works well since i have more zk moves to do) | 21:08 |
jeblair | pabelanger: fedora-24/builds/00..10 and 11 are empty but still exist | 21:09 |
jeblair | do you know the story there? | 21:09 |
pabelanger | jeblair: I wonder if that is a result of me stopping nodepool-builder during builds too | 21:09 |
pabelanger | let me check the logs | 21:10 |
mordred | jeblair: double plus bonus points for use of boartty as a library in your storyboard script | 21:10 |
jeblair | 15 and 16 are empty too | 21:10 |
jeblair | mordred: i am lazy :) | 21:10 |
Shrews | jeblair: hrm, i think we'll have problems | 21:10 |
jeblair | Shrews: because the new nodes will have reset version numbers? | 21:11 |
Shrews | jeblair: quick test, using the method you outlined, makes a subsequent create fail with NodeExistsError | 21:11 |
Shrews | jeblair: and i don't know why | 21:11 |
Shrews | (this is a hacked kazoo script i'm using, not the builder) | 21:11 |
pabelanger | jeblair: fedora-24-00..16 is empty too? | 21:12 |
jeblair | pabelanger: yep 10,11,15,16 all empty | 21:12 |
pabelanger | jeblair: if so, that must be a result of us stopping nodepool-builder when we have builds in progress | 21:13 |
pabelanger | because that 0016 was the build that was just running | 21:13 |
jeblair | pabelanger: 13 14 exist, 12 does not. | 21:13 |
jeblair | pabelanger: well, that one is fine -- there's nothing to clean it up right now | 21:13 |
jeblair | pabelanger: but the others may be an error | 21:13 |
pabelanger | yes, 13 and 14 are valid builds which are ready | 21:13 |
mordred | jeblair: 407135 has 2x+2 but I think should go in the list of jeblair reviews | 21:14 |
jeblair | mordred: yeah, i'm going to switch to zuul soon | 21:14 |
jeblair | thanks | 21:14 |
mordred | jeblair: I've been trying to clear out things that don't need your attention | 21:15 |
jeblair | mordred: oh thanks | 21:15 |
pabelanger | I'll also work on more test coverage around stopping nodepool-builder during a diskimage build | 21:16 |
pabelanger | should be easy to expose the issue | 21:16 |
Shrews | jeblair: http://paste.openstack.org/show/591725/ | 21:17 |
jeblair | Shrews: hrm, that wfm | 21:20 |
Shrews | O.o | 21:20 |
jeblair | ii zookeeper 3.4.5+dfsg-1 all High-performance coordination service for distributed applications | 21:20 |
Shrews | jeblair: check zk-shell listing. do you have a 'junks2' | 21:21 |
Shrews | ? | 21:22 |
jeblair | Shrews: no, just junk and junks | 21:22 |
Shrews | i missed a step in that paste... i deleted the first sequence node, then re-ran k.py | 21:22 |
jeblair | (i also did an attempt where i 'rmr junk' after cping. that also worked) | 21:22 |
jeblair | oh i'll try that | 21:23 |
Shrews | but, if i don't delete that, i get a junks2 | 21:23 |
Shrews | perhaps i'm using an older version of zookeeper | 21:23 |
openstackgerrit | Merged openstack-infra/zuul: Add roadmap to README https://review.openstack.org/407213 | 21:23 |
openstackgerrit | Merged openstack-infra/zuul: Re-enable TestScheduler.test_rerun_on_error https://review.openstack.org/406416 | 21:24 |
openstackgerrit | Merged openstack-infra/zuul: Re-enable test_rerun_on_abort https://review.openstack.org/407000 | 21:24 |
openstackgerrit | Merged openstack-infra/nodepool: Delete builds when diskimage removed from config https://review.openstack.org/400421 | 21:24 |
jeblair | Shrews: aha, if i delete the initial sequence znode i get the error | 21:24 |
Shrews | jeblair: if those tests work against the production zookeeper, then go for it. we might be seeing version differences | 21:24 |
openstackgerrit | Merged openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/404487 | 21:24 |
Shrews | jeblair: neat | 21:24 |
Shrews | so, that's not going to work | 21:25 |
Shrews | most likely | 21:25 |
jeblair | fascinating :) | 21:25 |
Shrews | jeblair: we are learning more ZK things | 21:26 |
jeblair | yay us! | 21:26 |
Shrews | fyi, i have zk 3.4.8, which is newer | 21:29 |
jeblair | i like that the kazoo docs explicitly state this will never happen | 21:31 |
Shrews | lol! link? | 21:31 |
jeblair | http://kazoo.readthedocs.io/en/latest/api/client.html#kazoo.client.KazooClient.create | 21:31 |
jeblair | Note that since a different actual path is used for each invocation of creating sequential nodes with the same path argument, the call will never raise NodeExistsError. | 21:31 |
Shrews | something about the 'cp' probably makes it lose some sequence attributes | 21:33 |
jhesketh | Morning | 21:35 |
jeblair | Shrews: i think i understand -- i think it uses the cversion stat to get the next seqno | 21:35 |
jeblair | the new node starts with cversion at 0 and then increments with each new child | 21:35 |
Shrews | so that's the attribute it loses :) | 21:36 |
jeblair | so if you had 4 children, you would have cversion=4. then delete 1, you still have cversion=4. copy that to a new node, cversion=3. | 21:36 |
Shrews | ah | 21:37 |
jeblair | so in production, we're going to have cversion=2 on most of these things; we'll probably end up having sequence numbers start at 3. they'll work for a while until they hit a collision with 11 or whatever we're up to. unless 11 gets deleted in time. which it probably would. except for fedora-24. | 21:38 |
jeblair | Shrews: so i think we can fix this with a script that looks at all the sequence number containers, and adds/removes children until cversion==max(sequence number) | 21:38 |
jeblair | (even though this is ridiculous, i'm glad we're learning this now before we had to learn it for something that's actually important :) | 21:39 |
Shrews | yeah | 21:40 |
jeblair | i guess we want cversion to be max(sequence number)+1 | 21:40 |
jeblair | i'll write a script to do that real quick | 21:41 |
*** harlowja has joined #zuul | 21:41 | |
*** jamielennox|away is now known as jamielennox | 21:49 | |
jeblair | Shrews: experimentally, it seems the next sequence number is cversion-numChildren+1 | 21:54 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Update waitForBuildDeletion() to protect against delete race https://review.openstack.org/408324 | 21:54 |
pabelanger | jeblair: Shrews: should fix the race condition in: http://logs.openstack.org/63/405663/6/gate/gate-nodepool-python27-ubuntu-xenial/91a1e7d/console.html | 21:54 |
jeblair | Shrews: okay i ran this: http://paste.openstack.org/show/591733/ | 22:06 |
jeblair | i think we should be ready to restart now | 22:06 |
jeblair | wait what we decided to go with the activate virtualenv thing? | 22:07 |
jeblair | mordred: fyi https://review.openstack.org/403966 | 22:10 |
jeblair | mordred: you may want to read the entire discussion there, and on the linked change | 22:12 |
jeblair | pabelanger, Shrews: i'm going to restart the builders now | 22:13 |
Shrews | jeblair: okie dokie. will keep my fingers and toes crossed | 22:13 |
jeblair | Shrews: i apparently missed something: http://paste.openstack.org/show/591734/ | 22:15 |
Shrews | ugh | 22:15 |
jeblair | Shrews: hrm, actually, any chance that's normal for a lock collision? | 22:16 |
Shrews | jeblair: nope. they have their own exceptions | 22:16 |
Shrews | but that's failing on the sequence node create anyway | 22:17 |
Shrews | jeblair: heading out to meet ansible type folks (hi rbergeron!) but will try to pay attention to my phone pings if you need me | 22:22 |
jeblair | Shrews: ok. i think the situation corrected itself after some deletions or something anyway | 22:24 |
jeblair | the other builder got it. :) | 22:24 |
openstackgerrit | Merged openstack-infra/nodepool: Don't use taskmanagers in builder https://review.openstack.org/405663 | 22:40 |
jeblair | SpamapS: in 407000, i don't understand why the test is changing. i can't think of a reason for it to do so, and anyway, the final assertion in that test is basically confirming that we're running one less job than before. | 22:50 |
jeblair | SpamapS: looking at the test output from master, "Launch job project-test1" shows up 5 times, and attempts is set at 4. that's correct because the 5th attempt is the one that returns RETRY_LIMIT -- so from the user POV, we tried to run the job 4 times. with 407000 the v3 branch shows that job launching 4 times, and it's the 4th that returns RETRY_LIMIT (so the user told us to run it 4 times and sees it run 3). | 22:51 |
SpamapS | jeblair: yeah, I wasn't sure that changing the test was the right call. | 22:55 |
SpamapS | jeblair: I thought what I saw was just that we ended up not recording one of the tries anymore. | 22:55 |
SpamapS | but I may have misunderstood how things are recorded. | 22:55 |
jeblair | SpamapS: yeah -- that's what i went looking for, but i figured "Launching job project-test1" was a pretty good proxy for "how many times did we run this job" since that's emitted by the pipeline manager when it tells the launcher to run a job (so it shouldn't be as affected by things changing around the launcher) | 22:57 |
openstackgerrit | Merged openstack-infra/zuul: Re-enable test_client_get_running_jobs https://review.openstack.org/407135 | 23:04 |
jeblair | jhesketh: https://review.openstack.org/406699 | 23:06 |
* jhesketh looks | 23:07 | |
jhesketh | oh, I reviewed that last night... why don't I have a vote | 23:07 |
jeblair | jhesketh: did you use gertty? | 23:07 |
jhesketh | nope, I suspect it was a human error | 23:07 |
* jhesketh tries to regain state | 23:08 | |
jeblair | well there's your problem! ;) | 23:08 |
jhesketh | heh, that's a default for me ;-) | 23:08 |
SpamapS | jeblair: so then it's possible the logic on retries changed and we're actually doing it wrong. | 23:11 |
jeblair | SpamapS: yeah, nothing about what might cause that comes immediately to mind, so it may require spelunking. | 23:12 |
jhesketh | jamielennox: left a question on 406699 if you have a moment | 23:14 |
SpamapS | jeblair: worth doing. I'll take a deeper look. | 23:15 |
*** harlowja has quit IRC | 23:16 | |
jeblair | SpamapS: thanks. note that the change has merged. | 23:16 |
jamielennox | jhesketh: yea, it's probably easier to just make them separate git repos on the filesystem, they're already separate pipelines and everything | 23:16 |
SpamapS | jeblair: indeed, I think I just glossed over a bug, rather than introduced one. :-/ | 23:17 |
jamielennox | i'll spin a new one today | 23:17 |
jeblair | SpamapS: yeah, odds are i introduced it. but we've lost our tracking of it, so we might want to do something (revert, propose a todo note, or file a story) | 23:18 |
jhesketh | jamielennox: cool, mostly curious about the second comment though as I think we need that to make sure it's behaving as expected | 23:19 |
openstackgerrit | Merged openstack-infra/zuul: Re-enable merge-mode config option and add more tests https://review.openstack.org/406361 | 23:19 |
SpamapS | jeblair: I'll add a story with the v3 tag and assign myself | 23:22 |
SpamapS | https://storyboard.openstack.org/#!/story/2000827 | 23:24 |
openstackgerrit | James E. Blair proposed openstack-infra/zuul: Remove v3 project template test https://review.openstack.org/395722 | 23:26 |
*** jamielennox is now known as jamielennox|away | 23:27 | |
*** jamielennox|away is now known as jamielennox | 23:28 | |
*** hashar has quit IRC | 23:36 | |
*** hashar has joined #zuul | 23:39 | |
*** Cibo_ has quit IRC | 23:50 | |
*** hashar has quit IRC | 23:52 |
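
Editor's sketch of the sequence-number collision discussed between 21:17 and 21:41. This is a toy in-memory model, not ZooKeeper or kazoo: `ToyContainer`, `ToyNodeExistsError`, `copy_container`, and `repair` are hypothetical names, and the single `counter` field only approximates the cversion behaviour jeblair describes (a fresh node starts at 0 and counts up as children are added; deleting a child does not lower it). Under those assumptions it shows why copying a container that has a gap in its children leads to a name collision, and how adding and removing throwaway children until the counter reaches max(sequence number)+1 avoids it.

```python
class ToyNodeExistsError(Exception):
    """Raised when a generated sequence name is already taken (toy analogue only)."""


class ToyContainer:
    """Toy stand-in for a znode that holds sequence children."""

    def __init__(self):
        self.children = set()
        self.counter = 0  # assumed stand-in for the cversion-derived counter

    def create_sequential(self, prefix):
        # The next name is derived from the counter, not from existing names.
        name = "%s%010d" % (prefix, self.counter)
        if name in self.children:
            raise ToyNodeExistsError(name)
        self.children.add(name)
        self.counter += 1
        return name

    def delete(self, name):
        # Deleting a child leaves the counter untouched.
        self.children.remove(name)


def copy_container(src):
    """Copy children into a fresh container; its counter restarts from 0."""
    dst = ToyContainer()
    for name in sorted(src.children):
        dst.children.add(name)
        dst.counter += 1
    return dst


def repair(container, prefix):
    """Add and remove a throwaway child until counter == max(sequence number) + 1."""
    numbers = [int(n[len(prefix):]) for n in container.children
               if n.startswith(prefix)]
    target = max(numbers, default=-1) + 1
    while container.counter < target:
        container.children.add("__throwaway__")
        container.counter += 1
        container.children.remove("__throwaway__")


def demo():
    src = ToyContainer()
    names = [src.create_sequential("junks") for _ in range(3)]
    src.delete(names[0])        # leaves a gap at sequence number 0
    dst = copy_container(src)   # two children copied, counter restarts at 2
    try:
        dst.create_sequential("junks")
    except ToyNodeExistsError as exc:
        print("collision on", exc)
    repair(dst, "junks")
    print("after repair:", dst.create_sequential("junks"))


if __name__ == "__main__":
    demo()
```

Without the delete step the copy ends with the counter equal to the number of children, so the next create succeeds, which matches the "junks2" result Shrews reports at 21:23.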