*** threestrands has joined #zuul | 00:14 | |
SpamapS | why on earth are the drivers doing this btw? | 00:29 |
SpamapS | Just occurred to me that this seems like a generic problem for all providers. | 00:29 |
clarkb | SpamapS: my guess is because figuring out usage is driver specific | 00:30 |
SpamapS | But not at the max-servers level | 00:30 |
clarkb | we could push that into the drivers and then, ya, have that | 00:30 |
SpamapS | That's entirely nodepool's problem. | 00:30 |
clarkb | prior to actually checking quota it was entirely in the scheduler iirc | 00:31 |
SpamapS | There's "how many are running in the cloud vs. what you want to launch" and then there's "how many are in nodepool's database vs. what you want to run?" | 00:31 |
clarkb | nodepool used its local db to calculate usage | 00:31 |
SpamapS | Yeah, seems like both should happen. | 00:31 |
clarkb | but now it uses a combo of both | 00:31 |
SpamapS | seems like nodepool should handle max-servers itself. | 00:31 |
clarkb | SpamapS: I think this is largely just an overlooked item in the split out of drivers from only having an openstack driver | 00:31 |
SpamapS | makes sense | 00:32 |
*** pwhalen has quit IRC | 00:32 | |
*** tobiash has quit IRC | 00:32 | |
*** timburke has quit IRC | 00:32 | |
clarkb | but ya nodepool could ask the driver of a provider for its reported usage, grab usage from the db and compare the two at a higher level | 00:33 |
clarkb | then drivers would only need to implement providing the reported usage data | 00:33 |
*** tobiash has joined #zuul | 00:33 | |
*** timburke has joined #zuul | 00:34 | |
*** mgoddard has quit IRC | 00:34 | |
SpamapS | That's also good, but I need even less. | 00:35 |
SpamapS | Nodepool could look at the number of nodes a provider owns, and if it's >= max-servers, reject the request. | 00:35 |
SpamapS | No cloud request needed for that. | 00:35 |
clarkb | ya I actually think the openstack driver does that short cut | 00:35 |
SpamapS | I actually assumed it did that | 00:35 |
clarkb | but you'd still need the usage check for the general case | 00:36 |
SpamapS | and luckily I have a 30 instance limit.. or I might have eaten up my entire AWS budget yesterday. ;) | 00:36 |
clarkb | I have max-server headroom, but do I actually have quota? | 00:36 |
SpamapS | another broken assumption from the openstack driver that we accidentally carried into AWS is that this provider owns every instance it can see. | 00:37 |
*** mgoddard has joined #zuul | 00:37 | |
SpamapS | AWS definitely does not work like that. | 00:37 |
SpamapS | (well, it can, but only with a really carefully constructed IAM policy) | 00:37 |
SpamapS | so I'm adding a tag system so listNodes filters out nodes that don't belong to nodepool. | 00:38 |
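A rough sketch of the tag-based ownership filter being described, using boto3; the tag key (`nodepool-managed`) and region below are placeholders, not the names the driver actually uses:

```python
import boto3

ec2 = boto3.resource('ec2', region_name='us-west-2')  # region is illustrative

# Count only instances carrying the nodepool ownership tag, so instances
# launched by anything else in the same AWS account are ignored.
owned = ec2.instances.filter(
    Filters=[
        {'Name': 'tag:nodepool-managed', 'Values': ['true']},
        {'Name': 'instance-state-name', 'Values': ['pending', 'running']},
    ]
)
print(sum(1 for _ in owned))
```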
clarkb | oh it already does the thing I said via hasRemainingQuota | 00:38 |
clarkb | but aws driver doesn't implement that | 00:38 |
SpamapS | That shortcut should be moved to a concrete function in the parent class that gets called before hasRemainingQuota. | 00:39 |
SpamapS | something like if self.pool.max_servers is not None: ... | 00:39 |
SpamapS | anyway... for now.. just adding some filtering on tags and counting and it should work | 00:39 |
clarkb | and it doesn't shortcut like I thought | 00:40 |
clarkb | SpamapS: I think updating the base class hasRemainingQuota() to check local db against max* would get you what you want | 00:41 |
clarkb | then drivers can selectively do more by completely overriding the base class method or do both by supercalling it and then doing more | 00:41 |
SpamapS | clarkb: max*? Are there cpu counts hiding in there too? | 00:41 |
clarkb | SpamapS: ya ram, cpu, and instances | 00:41 |
clarkb | looks like | 00:41 |
clarkb | hrm I suppose you'd really only be able to do instances that way | 00:42 |
clarkb | because you won't know ram or cpu usage | 00:42 |
clarkb | but still an improvement | 00:42 |
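For context, a minimal sketch of the kind of base-handler check being discussed: compare instances recorded in nodepool's own database against max-servers, with no cloud call required. The attribute and method names here are illustrative rather than the real nodepool API:

```python
class PoolRequestHandler:
    """Hypothetical base handler; only the max-servers check is shown."""

    def hasRemainingQuota(self, node_types):
        max_servers = self.pool.max_servers
        if max_servers is None:
            return True
        # Count nodes this provider pool already owns according to the
        # local (ZooKeeper-backed) records; RAM/CPU quota cannot be
        # checked generically, so instances are the only dimension here.
        owned = [n for n in self.listPoolNodes()  # illustrative helper
                 if n.provider == self.provider.name]
        return len(owned) + len(node_types) <= max_servers
```

Drivers that can ask their cloud for more detail could override this, or call it via super() and layer RAM/CPU checks on top, as suggested above.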
SpamapS | ah ok, interesting.. AWS actually does have a price list API that has some of what the openstack flavors API has... so we could actually do this | 00:44 |
SpamapS | https://docs.aws.amazon.com/aws-cost-management/latest/APIReference/API_pricing_GetProducts.html | 00:44 |
SpamapS | In fact with that API, one could actually set up a run-rate as a limit | 00:47 |
SpamapS | but yeah, for now, I just want max-servers :-P | 00:47 |
* SpamapS working through just querying EC2. | 00:47 | |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool master: Add simple max-server sanity check to base handler class https://review.openstack.org/651676 | 01:01 |
clarkb | SpamapS: ^ I haven't tested that but it may be as simple as that for your needs | 01:01 |
*** jamesmcarthur has joined #zuul | 01:34 | |
*** jamesmcarthur has quit IRC | 01:48 | |
*** jamesmcarthur has joined #zuul | 01:49 | |
*** jamesmcarthur has quit IRC | 02:25 | |
*** bhavikdbavishi has joined #zuul | 02:42 | |
*** bhavikdbavishi has quit IRC | 03:05 | |
*** jamesmcarthur has joined #zuul | 03:25 | |
*** jamesmcarthur has quit IRC | 03:41 | |
*** bhavikdbavishi has joined #zuul | 03:44 | |
*** bhavikdbavishi1 has joined #zuul | 03:49 | |
*** bhavikdbavishi has quit IRC | 03:49 | |
*** bhavikdbavishi1 is now known as bhavikdbavishi | 03:49 | |
*** jamesmcarthur has joined #zuul | 04:01 | |
*** jamesmcarthur has quit IRC | 04:19 | |
*** bjackman_ has joined #zuul | 04:25 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/nodepool master: Add python-path option to node https://review.openstack.org/637338 | 04:44 |
*** mhu has quit IRC | 04:45 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul-jobs master: Add ansible-lint job https://review.openstack.org/532083 | 04:46 |
*** quiquell|off is now known as quiquell|rover | 05:32 | |
*** gouthamr has quit IRC | 05:56 | |
*** gouthamr has joined #zuul | 06:00 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul-jobs master: Add ansible-lint job https://review.openstack.org/532083 | 06:12 |
*** quiquell|rover is now known as quique|rover|brb | 06:26 | |
*** gtema has joined #zuul | 06:57 | |
*** quique|rover|brb is now known as quiquell|rover | 07:06 | |
*** threestrands has quit IRC | 07:30 | |
*** jpena|off is now known as jpena | 07:36 | |
*** hashar has joined #zuul | 08:18 | |
*** yolanda_ has joined #zuul | 08:29 | |
*** mhu has joined #zuul | 09:26 | |
*** electrofelix has joined #zuul | 09:40 | |
*** bhavikdbavishi has quit IRC | 10:12 | |
*** yolanda_ has quit IRC | 10:40 | |
*** yolanda_ has joined #zuul | 10:52 | |
*** jpena is now known as jpena|lunch | 10:56 | |
quiquell|rover | tristanC: do you know if a job that depends on a non-voting job will start even if the dependency fails? | 11:01 |
quiquell|rover | nhicher: ^ do you know ? | 11:02 |
quiquell|rover | I am going to test that | 11:03 |
*** bhavikdbavishi has joined #zuul | 11:08 | |
AJaeger | I hope it does not ;) | 11:12 |
quiquell|rover | testing it but our pipeline looks like | 11:20 |
quiquell|rover | AJaeger: openstack-periodic at https://softwarefactory-project.io/zuul/t/rdoproject.org/status | 11:20 |
quiquell|rover | AJaeger: the dependency is the buildah job and it's failing | 11:20 |
quiquell|rover | AJaeger: testing it here https://review.rdoproject.org/r/#/c/20143/ | 11:21 |
quiquell|rover | AJaeger: nah, it's working as expected https://review.rdoproject.org/r/#/c/20143/ | 11:31 |
AJaeger | good ;) | 11:34 |
quiquell|rover | AJaeger: then I don't know why our jobs are running | 11:35 |
quiquell|rover | :-/ that's worse | 11:35 |
quiquell|rover | AJaeger: do you have some brain cycles for me ? | 11:35 |
quiquell|rover | AJaeger: this is the stuff I am talking about https://github.com/rdo-infra/review.rdoproject.org-config/blob/master/zuul.d/tripleo.yaml#L45-L47 | 11:40 |
AJaeger | sorry, not enough brain right now to dig into that ;( | 11:41 |
*** rlandy has joined #zuul | 11:57 | |
*** rlandy is now known as rlandy|ruck | 11:58 | |
*** jamesmcarthur has joined #zuul | 12:13 | |
*** jpena|lunch is now known as jpena | 12:26 | |
pabelanger | clarkb: mordred: tobiash: care to add https://review.openstack.org/163922 to your review queue, that should be a small improvement on zuul-merger with stacked commits | 12:29 |
*** jamesmcarthur has quit IRC | 12:34 | |
*** jamesmcarthur has joined #zuul | 12:47 | |
*** quiquell|rover is now known as quiquell|lunch | 12:59 | |
*** bjackman_ has quit IRC | 13:04 | |
*** jamesmcarthur has quit IRC | 13:04 | |
nhicher | quiquell|lunch: did you find the issue with your non-voting job ? | 13:06 |
quiquell|lunch | nhicher: there was no issue, just a brain fart on my part | 13:07 |
quiquell|lunch | thanks anyways | 13:07 |
nhicher | quiquell|lunch: ok =) | 13:08 |
*** webknjaz has joined #zuul | 13:09 | |
webknjaz | Hello everyone! | 13:10 |
* webknjaz wonders whether this is the right place to criticise the usage of CherryPy in Zuul... | 13:10 | |
*** bhavikdbavishi has quit IRC | 13:17 | |
Shrews | webknjaz: perhaps you mean "discuss" rather than "criticise"? but yes, this is the correct channel, though the person that switched us to that is on vacation at the moment | 13:17 |
*** pcaruana has quit IRC | 13:20 | |
webknjaz | @Shrews: yeah... The thing is that I've been exposed to the source code which made me a bit unhappy :) | 13:22 |
webknjaz | I'm a CherryPy maintainer (along with other things like aiohttp, ansible core). | 13:22 |
webknjaz | So I just wanted to point out how to do a few things cherrypy-way. I'm especially interested in GitHub Apps now. | 13:22 |
webknjaz | Here's a cleaner example of having event handlers from one of my PoCs, for example: https://github.com/webknjaz/ansiwatch-bot/blob/58246a8/ansiwatch_bot/apps/github_events.py#L136-L157 | 13:22 |
webknjaz | Oh, and I'm currently developing a framework for writing github apps and actions — https://tutorial.octomachinery.dev | 13:22 |
webknjaz | @Shrews: so I just wanted to say that if interested parties want some feedback, maybe you could tell them to ping me once they are back? | 13:23 |
Shrews | webknjaz: we'd LOVE feedback on how to do things better, or where we do things in not the best way. anything you can share to help us improve is much appreciated. No need to wait for that particular dev to return. | 13:30 |
webknjaz | Is this channel logged? Would it be better to share it in some more discoverable place? | 13:31 |
Shrews | webknjaz: yes, it is logged: http://eavesdrop.openstack.org/irclogs/%23zuul/ | 13:32 |
*** quiquell|lunch is now known as quiquell|rover | 13:37 | |
*** pcaruana has joined #zuul | 13:42 | |
webknjaz | I think that some parts of http://git.zuul-ci.org/cgit/zuul/tree/zuul/web/__init__.py#n647 could use plugins like from here: https://github.com/cherrypy/cherrypy/blob/b648977/cherrypy/process/plugins.py#L484-L580 | 13:51 |
webknjaz | This class also looks like it should be a plugin: http://git.zuul-ci.org/cgit/zuul/tree/zuul/driver/github/githubconnection.py#n101 | 13:53 |
webknjaz | people don't seem to realize that the core of CherryPy is a pubsub bus which they can actually use as well | 13:54 |
*** bjackman_ has joined #zuul | 13:56 | |
webknjaz | https://www.irccloud.com/pastebin/1QzyW2Aj/ | 13:59 |
*** bjackman_ has quit IRC | 14:01 | |
webknjaz | But I think that this is still a wrong approach. Architecturally, it's way nicer to customize the routing layer. CherryPy has pluggable and replaceable request dispatchers. | 14:02 |
webknjaz | For example, here's what I've done: https://github.com/webknjaz/ansiwatch-bot/blob/master/ansiwatch_bot/request_dispatcher.py | 14:02 |
webknjaz | This allows me to have per-event methods: https://github.com/webknjaz/ansiwatch-bot/blob/58246a8/ansiwatch_bot/apps/github_events.py#L136-L157 | 14:02 |
*** swest has quit IRC | 14:03 | |
webknjaz | You can have a decorator to unpack all webhook payload keys as handler method arguments: https://github.com/webknjaz/ansiwatch-bot/blob/master/ansiwatch_bot/tools.py | 14:03 |
*** swest has joined #zuul | 14:04 | |
webknjaz | Overall working with GitHub can be offloaded to plugins: https://github.com/webknjaz/ansiwatch-bot/blob/master/ansiwatch_bot/plugins.py | 14:05 |
Shrews | webknjaz: i'm not sure how much sense it makes for us to use the cherrypy dispatchers. we have other parts of the app (non-cherrypy based) that we are communicating with via gearman, which is what you see there in the payload() method | 14:13 |
webknjaz | what do you mean? does gearman act as a proxy? | 14:13 |
*** ianychoi has quit IRC | 14:13 | |
*** ianychoi has joined #zuul | 14:14 | |
Shrews | we have several pieces/daemons to "zuul" that all communicate with each other (submitting jobs/requests/etc) using gearman | 14:14 |
mordred | yeah. it's a network bus/ job worker system | 14:14 |
pabelanger | webknjaz: https://zuul-ci.org/docs/zuul/admin/components.html might help explain the different parts of zuul | 14:14 |
Shrews | pabelanger: ah yes, was just looking for that. thx | 14:15 |
pabelanger | np! | 14:15 |
mordred | so this http://git.zuul-ci.org/cgit/zuul/tree/zuul/driver/github/githubconnection.py#n1556 is the only bit running on the zuul-web server, the other bits are running in other processes probably on other machines that don't do anything with http requests | 14:15 |
webknjaz | so it's vice versa | 14:16 |
webknjaz | anyway, it still needs to have validation in a tool | 14:16 |
mordred | yeah- the web interaction is largely "receive payload, validate signature, put on gearman bus" - but yeah, I think doing the validate as a decorator like you mention could be potentially nice | 14:17 |
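As a sketch of the "validate as a decorator" idea mentioned above, using only the standard library: the header and digest follow GitHub's documented X-Hub-Signature scheme, while the decorator and attribute names are invented for illustration, not taken from zuul's code.

```python
import functools
import hashlib
import hmac

import cherrypy


def verified_github_payload(handler):
    """Reject deliveries whose HMAC signature does not match the secret."""
    @functools.wraps(handler)
    def wrapper(self, *args, **kwargs):
        body = cherrypy.request.body.read()
        signature = cherrypy.request.headers.get('X-Hub-Signature', '')
        expected = 'sha1=' + hmac.new(
            self.webhook_secret.encode('utf-8'), body, hashlib.sha1).hexdigest()
        if not hmac.compare_digest(signature, expected):
            raise cherrypy.HTTPError(401, 'Invalid webhook signature')
        return handler(self, *args, **kwargs)
    return wrapper
```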
webknjaz | for interfacing with jobs it may be better to have a CherryPy plugin | 14:17 |
webknjaz | because otherwise implementation details of the rpc leak into the web handler layer | 14:18 |
mordred | like a cherrypy plugin that does the submit job? that's an interesting idea | 14:18 |
webknjaz | yep | 14:18 |
webknjaz | you'd use pub-sub to interface with it | 14:19 |
webknjaz | https://www.irccloud.com/pastebin/MlVJDNcG/ | 14:20 |
mordred | nod. I could see that plugin being potentially nice - passing things onto the gearman bus is a common pattern | 14:20 |
webknjaz | or without that helper func it's: https://github.com/webknjaz/ansiwatch-bot/blob/master/ansiwatch_bot/apps/github_events.py#L93-L94 | 14:21 |
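A minimal sketch of that kind of plugin, following the SimplePlugin pattern from cherrypy.process.plugins; the channel name and the gearman client call are placeholders for whatever zuul-web actually uses:

```python
import json

from cherrypy.process import plugins


class GearmanSubmitPlugin(plugins.SimplePlugin):
    """Own the gearman client; handlers talk to it only via the bus."""

    def __init__(self, bus, client_factory):
        super().__init__(bus)
        self._client_factory = client_factory
        self._client = None

    def start(self):
        self._client = self._client_factory()
        self.bus.subscribe('gearman-submit', self.submit)

    def stop(self):
        self.bus.unsubscribe('gearman-submit', self.submit)
        self._client = None

    def submit(self, job_name, payload):
        # Placeholder call; the real method depends on the gearman library.
        return self._client.submitJob(job_name, json.dumps(payload))


# A handler would then do:
#   cherrypy.engine.publish('gearman-submit', 'zuul:github_event', data)
# instead of holding a gearman client itself.
```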
mordred | webknjaz: https://opendev.org/openstack-infra/zuul/src/branch/master/zuul/web/__init__.py#L263-L270 ... is an example of one of the simpler but common patterns of "turn this http request into a gearman call" - is there a better way to set the CSRF header more generally than just grabbing the response and setting the header directly like we're doing there? | 14:25 |
webknjaz | I believe you could use another tool | 14:26 |
webknjaz | http://docs.cherrypy.org/en/latest/extend.html#hook-point | 14:26 |
webknjaz | Try using `before_finalize` hook point | 14:27 |
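A sketch of the hook-point suggestion: a small Tool registered on before_finalize that stamps the response header, so individual handlers stop touching cherrypy.response directly. The tool name and header value are illustrative:

```python
import cherrypy


def _add_cors_header():
    # Runs just before the response is finalized, for every handler that
    # enables the tool.
    cherrypy.response.headers['Access-Control-Allow-Origin'] = '*'


cherrypy.tools.allow_origin = cherrypy.Tool('before_finalize', _add_cors_header)


class Info:
    @cherrypy.expose
    @cherrypy.tools.allow_origin()
    def index(self):
        return '{"info": {}}'
```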
webknjaz | After looking at http://git.zuul-ci.org/cgit/zuul/tree/zuul/driver/github/githubconnection.py#n241 I think that you really need a custom dispatcher... | 14:28 |
webknjaz | maybe you could even use a WSGI app interface "on the other end" of gearman | 14:29 |
*** quiquell|rover is now known as quiquell|off | 14:34 | |
*** hashar has quit IRC | 15:03 | |
*** bhavikdbavishi has joined #zuul | 15:13 | |
*** pcaruana has quit IRC | 15:42 | |
*** jamesmcarthur has joined #zuul | 15:49 | |
*** jamesmcarthur_ has joined #zuul | 15:50 | |
*** jamesmcarthur has quit IRC | 15:50 | |
openstackgerrit | Matthieu Huin proposed openstack-infra/zuul master: [DNM] admin REST API: docker-compose PoC, frontend https://review.openstack.org/643536 | 15:55 |
*** bhavikdbavishi has quit IRC | 16:01 | |
*** chandankumar is now known as raukadah | 16:01 | |
*** quiquell|off has quit IRC | 16:14 | |
*** ianychoi has quit IRC | 16:20 | |
SpamapS | Shrews:you around? I have a question about NodeRequestHandler.decline_request ... it fails if all launchers have declined.. but.. don't we want to retry it again at some point? | 16:26 |
SpamapS | Like, if all the providers are just busy... we want that request to queue, right? | 16:26 |
SpamapS | But it seems like that will just lead to NODE_FAILURE | 16:27 |
clarkb | when they are busy they stop processing requests | 16:29 |
clarkb | so it shouldn't decline unless it actually failed up to the retry count or the provider does not have the label | 16:30 |
SpamapS | clarkb:oh? how do they know that? | 16:30 |
clarkb | SpamapS: the hasRemainingQuota check | 16:30 |
SpamapS | Mine hits hasProviderQuota first, and fails the request. | 16:30 |
SpamapS | So there's a window right as you get busy, where requests fail. | 16:31 |
clarkb | it is possible this is another openstack specific behavior that should be at a higher level | 16:31 |
SpamapS | Or it's just a rare thing and you don't see it that often? | 16:32 |
SpamapS | Like, to make it super careful, you'd need to call hasRemainingQuota before accepting requests. | 16:32 |
SpamapS | And ideally that would reserve resources, so you don't accept two and then one gets failed. | 16:33 |
*** zbr has quit IRC | 16:34 | |
clarkb | yes I think it gets the request and checks remaining quota, and if there is none, blocks until resources are freed | 16:34 |
clarkb | that serialization prevents sending back inappropriate declines | 16:34 |
SpamapS | Well in the test suite what I have is one active request, a quota of 1, and when I send another request for 1, it's failed. | 16:39 |
SpamapS | I'll push up what I have now, and maybe you can spot the assumptions. | 16:39 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/nodepool master: Implement max-servers for AWS driver https://review.openstack.org/649474 | 16:40 |
*** zbr has joined #zuul | 16:40 | |
SpamapS | clarkb: I would very much appreciate your eyes when you have some time. Thanks ^^ | 16:40 |
clarkb | SpamapS: https://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/driver/__init__.py#n480 | 16:41 |
clarkb | that is the code that pauses when has remaining quota isn't true | 16:41 |
SpamapS | clarkb:yeah, in my test runs I never see that "Pausing" | 16:43 |
SpamapS | so there is at least a narrow window between accepting the request that puts you at quota and pausing, that results in a failed request. I wonder if I add a sleep if that will remain true | 16:46 |
SpamapS | In fact, the request gets declined, which is why we don't go to waitForNodeSet. | 16:49 |
SpamapS | Not sure how to get through to that next loop iteration that pauses things once we're at quota actually. Seems like everything will bounce off as declined. | 16:51 |
* Shrews catches up | 16:52 | |
SpamapS | Seems to me that you shouldn't ever decline an over quota request, you should just pause. | 16:52 |
*** jamesmcarthur has joined #zuul | 16:52 | |
Shrews | SpamapS: i suspect part of the problem may be that you are sharing resources across provider pools, and we don't really support that right now | 16:55 |
*** jamesmcarthur_ has quit IRC | 16:56 | |
Shrews | so your check for quota should never race with another pool thread trying to use that same quota pool | 16:56 |
Shrews | (if you follow that configuration rule, that is) | 16:56 |
Shrews | i realize that's probably not ideal with AWS in its current form | 16:56 |
SpamapS | This is a test | 16:56 |
SpamapS | 1 pool | 16:56 |
SpamapS | 1 provider | 16:56 |
SpamapS | Also no, the quota check I put in place is scoped to the pool. | 16:57 |
Shrews | hrm | 16:57 |
SpamapS | But in this case, the problem is the algorithm by which we pause. I believe it has a race condition in it, where you will decline a request if you are already exactly at quota. | 16:58 |
SpamapS | And that may not happen often in OpenStack because of the many quotas. | 16:58 |
SpamapS | Some min-ready comes along and unwedges things by slipping under the quotas. | 16:58 |
SpamapS | But with just max-servers quota.. it's always going to be the case that we walk right up to the quota. And then there's no path I can see through the code to self.paused = True | 16:59 |
SpamapS | Every subsequent request to that pool will be declined until the quota is released. | 16:59 |
SpamapS | The comment at the top of _waitForNodeSet I think calls out the race at the driver level, but it's actually deeper. | 17:01 |
*** gtema has quit IRC | 17:04 | |
SpamapS | I'm actually having trouble figuring out how the pause in _waitForNodeSet is ever reached. | 17:04 |
*** pcaruana has joined #zuul | 17:05 | |
Shrews | SpamapS: if you have to launch a node, but you are out of quota, then you reach that pause. | 17:05 |
SpamapS | Shrews:not in the test case here: https://review.openstack.org/649474 | 17:06 |
SpamapS | You can't even get to "need to launch a node" because you're already failing hasProviderQuota. | 17:06 |
SpamapS | Or maybe I misunderstood what hasProviderQuota is supposed to do. | 17:07 |
SpamapS | I wonder if OpenStack gets around this because of the estimatedNodepoolQuota .. it maybe passes hasProviderQuota in that next request instance. | 17:09 |
SpamapS | Yeah, the caching probably hides this. | 17:09 |
SpamapS | Keeping the window open just long enough to slip down to _waitForNodeSet. | 17:10 |
*** jpena is now known as jpena|off | 17:10 | |
Shrews | SpamapS: hasProviderQuota, iirc, is supposed to be "can this provider handle the nodes in the request, regardless of what is available now". hasRemainingQuota is the answer to "what is available now" | 17:12 |
SpamapS | Shrews: ok, so hasProviderQuota should *not* take running machines into account, but rather the total capacity of said provider? | 17:12 |
SpamapS | That does make sense. | 17:13 |
Shrews | SpamapS: i *think* so.... been a while | 17:13 |
Shrews | it was originally for things like "request wants 50 nodes, but provider only has 40 total" | 17:13 |
SpamapS | yeah that would make sense | 17:14 |
SpamapS | If that's the case, I think we should make the comment in the abstract class more clear. I'll take that up. | 17:14 |
Shrews | SpamapS: but that morphed with tobiash's quota changes, so that's harder to see just from looking at it to try to remember :) | 17:14 |
SpamapS | Yeah, I think in reading the openstack one I missed that there's an `estimatedNodepoolQuota` and `estimatedNodepoolQuotaUsed` ... | 17:14 |
clarkb | Shrews: semi related to this is https://review.openstack.org/#/c/651676/ | 17:14 |
SpamapS | I just saw them as the same call. | 17:14 |
Shrews | SpamapS: "Checks if a provider has enough quota to handle a list of nodes. This does not take our currently existing nodes into account." | 17:15 |
SpamapS | yep, that was it | 17:16 |
Shrews | SpamapS: i mean, that seems to say what i just said. what's unclear? | 17:16 |
SpamapS | http://paste.openstack.org/show/749204/ <-- this makes the test work the way I expected. | 17:17 |
SpamapS | Shrews:I think I missed the "does not" part. | 17:17 |
SpamapS | "psshhh... details." | 17:17 |
Shrews | SpamapS: happy to +2 any changes that make that more clearerer :) | 17:19 |
* Shrews wonders if <blink> works in rst.... | 17:19 | |
* SpamapS deserved that | 17:20 | |
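Restating the distinction in code, a hedged sketch of how a simple driver handler might split the two checks; the method names match the abstract handler methods discussed above, but the bodies (and the countOwnedNodes helper) are illustrative:

```python
class SimplePoolHandler:
    def hasProviderQuota(self, node_types):
        # "Could this provider ever satisfy the request?"  Ignores nodes
        # that are currently running, per the docstring quoted above.
        return len(node_types) <= self.pool.max_servers

    def hasRemainingQuota(self, node_types):
        # "Can it satisfy the request right now?"  Counts nodes the pool
        # already owns; when this returns False the handler should pause
        # rather than decline.
        in_use = self.countOwnedNodes()  # hypothetical helper
        return in_use + len(node_types) <= self.pool.max_servers
```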
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/nodepool master: Implement max-servers for AWS driver https://review.openstack.org/649474 | 17:21 |
SpamapS | durn pep8 | 17:23 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/nodepool master: Implement max-servers for AWS driver https://review.openstack.org/649474 | 17:23 |
SpamapS | Shrews:^ ok, this I think actually implements correctly. :) | 17:23 |
SpamapS | whoops, spotted a bug | 17:24 |
openstackgerrit | Clint 'SpamapS' Byrum proposed openstack-infra/nodepool master: Implement max-servers for AWS driver https://review.openstack.org/649474 | 17:24 |
Shrews | clarkb: what's the impetus for that change? Not saying we shouldn't do it, just curious what brought that about | 17:30 |
clarkb | Shrews: if we had that then aws driver would've worked as is for SpamapS quota issues I think | 17:31 |
clarkb | Shrews: very few of our drivers implement hasRemainingQuota so they all ignore the common max-servers directive | 17:31 |
clarkb | this should enforce that directive accurately as long as the "cloud" doesn't leak instances | 17:32 |
*** bhavikdbavishi has joined #zuul | 17:32 | |
SpamapS | clarkb:+++ I like yours. :) | 17:33 |
*** jamesmcarthur has quit IRC | 17:35 | |
Shrews | clarkb: my only concern with that, atm, is the method used to count current nodes. that may not be accurate since it includes all node states. | 17:38 |
clarkb | Shrews: I believe you need to include all node states, as errored/ready/in-use and deleting-but-not-yet-deleted nodes all consume quota | 17:39 |
Shrews | clarkb: INIT and ABORTED do not | 17:40 |
clarkb | do nodes go aborted only prior to asking $cloud to make them? | 17:40 |
Shrews | it's the latter one that concerns me the most. i know it is transient, but don't remember how long that sticks around (or how many may result due to a single launch) | 17:40 |
clarkb | it should be easy enough to filter out by state if we can nail that down to states that never represent resources in a cloud | 17:42 |
clarkb | but I thought they all basically did. Maybe init doesn't | 17:42 |
Shrews | you'll need to at least filter those two states | 17:43 |
Shrews | and maybe FAILED | 17:44 |
clarkb | ya aborted is only set when we hit a quota error? so in theory that shouldn't count against your quota | 17:45 |
clarkb | failed does count against your quota in openstack | 17:45 |
clarkb | I don't know if it will in all clouds but conservative there seems fine? | 17:45 |
Shrews | failed doesn't always count, but being conservative seems logical | 17:46 |
Shrews | failing to launch a node after X retries will result in a FAILED state | 17:46 |
Shrews | but so can launching a node and losing the ZK connection before we could record it | 17:47 |
*** jamesmcarthur has joined #zuul | 17:47 | |
*** sshnaidm_ has joined #zuul | 17:52 | |
openstackgerrit | Clark Boylan proposed openstack-infra/nodepool master: Add simple max-server sanity check to base handler class https://review.openstack.org/651676 | 17:53 |
*** sshnaidm has quit IRC | 17:53 | |
clarkb | Shrews: ^ that look better? I added DELETED as well since that should mean the node is completely gone from the cloud and is now just accounted in our db | 17:53 |
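A sketch of the state filtering being discussed, assuming string state names similar to nodepool's; the exact set of excluded states (INIT, ABORTED, DELETED, possibly FAILED) is the point under review here, so treat the set below as illustrative:

```python
# States that should not count against cloud quota: the node either never
# reached the cloud (init, aborted) or is already gone from it (deleted).
UNCOUNTED_STATES = {'init', 'aborted', 'deleted'}


def countQuotaNodes(nodes, provider_name):
    return sum(
        1 for node in nodes
        if node.provider == provider_name
        and node.state not in UNCOUNTED_STATES
    )
```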
manjeets | Hi guys, has anything changed for upstream untrusted projects? | 17:54 |
*** jamesmcarthur has quit IRC | 17:54 | |
manjeets | we have a job pipeline that works fine on ci-sandbox | 17:54 |
manjeets | but the same jobs are not getting triggered on the actual project | 17:55 |
manjeets | https://review.openstack.org/#/c/647960/2 | 17:55 |
manjeets | look at comment Intel SriovTaas CI | 17:55 |
manjeets | but when I run the same job on ci-sandbox it gets triggered | 17:56 |
*** jamesmcarthur has joined #zuul | 17:56 | |
clarkb | manjeets: the ability to merge is repo specific | 17:56 |
manjeets | clarkb, we don't have merge in the pipeline actually | 17:56 |
manjeets | using zuul's docker-compose thing | 17:57 |
clarkb | zuul has to merge your commit against the target branch to test it | 17:57 |
clarkb | that is failing according to the error message | 17:57 |
clarkb | I would check your merger logs | 17:57 |
*** jamesmcarthur has quit IRC | 17:58 | |
manjeets | clarkb, merge where? it should never merge to the repo anyway | 17:58 |
*** jamesmcarthur has joined #zuul | 17:59 | |
clarkb | manjeets: one of the fundamental design choices of zuul is that it intends to test what your code would look like if it merged to the actual repo. This means before testing a change zuul merges it locally against the target branch and tests the resulting commit. All of this is in zuul managed git repos. If tests pass then later zuul can ask gerrit to merge them to the canonical repo | 17:59 |
clarkb | this way developers don't have to manually rebase every time a new commit merges to get accurate test results. Zuul does that for you and you know when you ask zuul to merge the code that it should work to a high degree of confidence | 18:00 |
clarkb | the error message on that change indicates zuul failed to do this local merge. I would check the zuul merger process's logs | 18:00 |
Shrews | clarkb: yah. good call on DELETED too | 18:01 |
*** jamesmcarthur has quit IRC | 18:01 | |
*** jamesmcarthur has joined #zuul | 18:02 | |
*** sshnaidm_ is now known as sshnaidm|off | 18:03 | |
openstackgerrit | Fabien Boucher proposed openstack-infra/zuul master: WIP - Pagure driver - https://pagure.io/pagure/ https://review.openstack.org/604404 | 18:03 |
*** electrofelix has quit IRC | 18:19 | |
*** jamesmcarthur has quit IRC | 18:26 | |
*** jamesmcarthur has joined #zuul | 18:28 | |
pabelanger | tobiash: I cannot remember, but maybe you were discussing the ability to support all forms of merge that github supports? (eg: merge / squash / rebase) | 18:29 |
tobiash | pabelanger: yes, I was part of the discussion | 18:31 |
pabelanger | tobiash: do you happen to know when that was again, so I can find the irclogs? | 18:31 |
pabelanger | I'd like to refresh myself on that topic | 18:32 |
manjeets | clarkb, I get it, do you mean it merges the patchset into a repo cloned locally for testing? | 18:32 |
clarkb | manjeets: yes | 18:32 |
tobiash | pabelanger: hrm, no idea, could be months | 18:33 |
pabelanger | kk | 18:33 |
*** hashar has joined #zuul | 18:57 | |
SpamapS | do we not run any coverage reports for nodepool tests? | 18:59 |
clarkb | hrm we did before the zuulv3 rewrite | 19:01 |
clarkb | I don't see it now | 19:01 |
Shrews | removed that long ago | 19:01 |
Shrews | https://review.openstack.org/#/c/608688/ | 19:02 |
*** pcaruana has quit IRC | 19:02 | |
clarkb | hrm fwiw I found it really useful when improving tests and debugging races | 19:05 |
clarkb | (I added it and the functional jobs way back when to tackle nodepool's frequent breakages) | 19:05 |
Shrews | you should still be able to run it on demand | 19:05 |
clarkb | ya or run it locally. I think we ended up stabilizing to the point where it wasn't as useful as often so not really objecting. Just pointing out that it can be valuable | 19:06 |
mordred | Shrews: zuul-preview seems to be running super slow | 19:06 |
clarkb | the functional tests go a long way for sanity checking | 19:06 |
mordred | Shrews: if you check out https://review.openstack.org/#/c/651219/ and click on inaugust-build-website - it'll just sit there spinning | 19:07 |
mordred | I started looking in to it - but haven't gotten very far | 19:07 |
mordred | obviously it's not _urgent_ as it's not really a fully production thing yet - but thought I'd mention it | 19:07 |
Shrews | mordred: we didn't merge my proposed changes yet, did we? | 19:08 |
mordred | Shrews: no - I don't think so | 19:09 |
Shrews | not that it would improve anything, but wondering if i broke something | 19:09 |
Shrews | no, we didn't | 19:09 |
mordred | Shrews: although I did +2 them | 19:09 |
mordred | I think we're getting crawled | 19:11 |
mordred | docker logs --since 30m -f 3310c96209ef on the host shows a bunch of activity - none of it useful | 19:12 |
Shrews | mordred: cache overload? | 19:12 |
mordred | maybe? although I think this: | 19:12 |
mordred | [Thu Apr 11 19:09:20.617020 2019] [proxy_http:error] [pid 2916:tid 139831896164096] (70007)The timeout specified has expired: [client 174.143.130.226:55828] AH01102: error reading status line from remote server 174.143.130.226:80 | 19:13 |
mordred | oh - nevermind - I thought it was timing out remotely | 19:13 |
Shrews | mordred: i don't even know where to access zuul-preview. it's a container then? what host? | 19:13 |
mordred | yeah - zuul-proxy.opendev.org | 19:14 |
Shrews | that does not resolve for me | 19:15 |
*** bhavikdbavishi has quit IRC | 19:20 | |
Shrews | mordred: zp makes requests to http://zuul.openstack.org/api/tenant which is not doing anything for me except displaying "Fetching info" | 19:21 |
Shrews | so maybe the issue is in zuul-web | 19:22 |
Shrews | something is hammering zuul-web for change 651910 | 19:27 |
Shrews | mordred: is it normal to see so many of the same GET requests for a change? That doesn't seem right | 19:31 |
Shrews | e.g., GET /api/tenant/openstack/status/change/651910,1 | 19:32 |
Shrews | happening multiple times per second | 19:33 |
pabelanger | that is a tripleoci patch | 19:33 |
Shrews | so is 651912 which pops up a lot too, but not as much as the other one | 19:34 |
pabelanger | I wonder if that is somebody running a zuul-runner-like tool | 19:34 |
pabelanger | I know tripleo is trying to do things like scrape API to run zuul jobs locally | 19:35 |
pabelanger | rlandy|ruck: ^ by chance, are you aware of any tooling that would scrape the zuul.o.o api? | 19:35 |
Shrews | pabelanger: so maybe something of theirs is behaving badly | 19:35 |
pabelanger | Shrews: do you have an IP where the traffic is coming from? | 19:37 |
Shrews | no, only 127.0.0.1 (guessing the real ip isn't known?). the client signature is "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0" | 19:38 |
Shrews | so maybe not an automated tool | 19:38 |
Shrews | i'm very curious why we don't have the incoming IP now. some cherrypy-ism? | 19:40 |
webknjaz | ? | 19:41 |
*** weshay has joined #zuul | 19:41 | |
webknjaz | if it's behind a reverse proxy you should use a proxy tool | 19:41 |
rlandy|ruck | pabelanger: we looked into it but we never went that way | 19:41 |
rlandy|ruck | too complicated and error prone | 19:42 |
pabelanger | Oh | 19:42 |
pabelanger | there is a tripleo CI tool | 19:42 |
pabelanger | that generates reports in grafana | 19:42 |
pabelanger | for that, I think they scrape api from zuul-web | 19:42 |
pabelanger | rlandy|ruck: do you happen to know where that is done? | 19:43 |
clarkb | yes it is behind apache so need to check the apache logs | 19:44 |
rlandy|ruck | zbr: ^^ maybe asking about your work | 19:45 |
rlandy|ruck | this was not reproducer-related | 19:45 |
Shrews | clarkb: indeed. thx | 19:46 |
rlandy|ruck | pabelanger: or this: http://dashboard-ci.tripleo.org/d/cEEjGFFmz/cockpit?orgId=1? | 19:46 |
pabelanger | rlandy|ruck: yes, that is it, thank you | 19:47 |
pabelanger | talking in #tripleo now | 19:47 |
Shrews | source IP is 2a02:8010:61a9:33::1dd7 | 19:47 |
rlandy|ruck | pabelanger: k - sorry - thought you were after reproducer work | 19:47 |
pabelanger | Shrews: that isn't rdocloud, no ipv6, so that is good | 19:49 |
Shrews | we've been getting hit hard ever since that change was submitted, so maybe find the author? | 19:50 |
pabelanger | Shrews: yah, talking to him now | 19:51 |
clarkb | whois says broadband service in the uk | 19:51 |
pabelanger | tripleo is digging into it now | 19:54 |
pabelanger | Shrews: clarkb: just looking at apache logs, there might be a few tripleo script scraping the API, 38.145.34.111 is another and that is in rdo-cloud | 20:04 |
clarkb | any idea what tripleo is trying to learn? | 20:04 |
clarkb | (I wonder if this indicates some new api endpoints that we might want to add) | 20:05 |
pabelanger | clarkb: mostly monitoring the health of their jobs | 20:05 |
pabelanger | eg: http://dashboard-ci.tripleo.org/d/cEEjGFFmz/cockpit?orgId=1 is something they built | 20:05 |
Shrews | most likely indicates the need for rate limiting :) | 20:05 |
pabelanger | but weshay best to ask | 20:05 |
clarkb | hrm we have that data in graphite already? | 20:06 |
pabelanger | Shrews: ya, good idea | 20:06 |
pabelanger | IIRC, zbr also | 20:06 |
clarkb | eg we shouldn't need to scrape zuul's api and can pull that data from what should be cheaper quicker sources like graphite | 20:06 |
*** jamesmcarthur has quit IRC | 20:10 | |
weshay | clarkb ya.. monitoring all the things to keep tripleo upstream healthy | 20:11 |
clarkb | weshay: can we stop doing this https://review.openstack.org/#/c/567224/ ? | 20:11 |
clarkb | and set up a periodic pipeline instead? | 20:11 |
weshay | yup! it's on the to-do list | 20:12 |
mordred | Shrews: wow - I hop on a call for a bit and miss all the fun | 20:14 |
Shrews | mordred: how "convenient" :-P | 20:15 |
clarkb | weshay: thanks | 20:16 |
*** jamesmcarthur has joined #zuul | 22:01 | |
*** jamesmcarthur has quit IRC | 22:14 | |
*** hashar has quit IRC | 22:29 | |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul-jobs master: Add tox-py37 job https://review.openstack.org/651938 | 22:30 |
pabelanger | clarkb: fungi: mordred: tobiash: ianw: noticed we didn't have a tox-py37 job ^ | 22:30 |
pabelanger | doh | 22:30 |
pabelanger | see bug | 22:30 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul-jobs master: Add tox-py37 job https://review.openstack.org/651938 | 22:31 |
pabelanger | there | 22:31 |
pabelanger | also, spacex launch in 3mins | 22:32 |
fungi | which pad this time? | 22:33 |
clarkb | https://www.youtube.com/watch?v=TXMGu2d8c8g is the official youtube stream | 22:33 |
fungi | ahh, from kennedy | 22:34 |
fungi | wish they'd schedule more for wallops now that they've gotten the pad back together and cleaned up after the explosion | 22:35 |
clarkb | I think they need the saturn 5 pad for this particular configuration | 22:36 |
fungi | the announcer is a little enthusiastic | 22:36 |
clarkb | or maybe its the old shuttle pad? basically its huge so few options | 22:36 |
pabelanger | fungi: Yah, he is awesome | 22:36 |
fungi | rocket ballet | 22:41 |
clarkb | the size of large buildings | 22:42 |
clarkb | I love the kerbal style graphics | 22:43 |
pabelanger | Yay | 22:45 |
pabelanger | 3 for 3 | 22:45 |
fungi | payload orbital insertion and relanding all first stage boosters in 10 minutes flat | 22:46 |
clarkb | one better than last time | 22:46 |
pabelanger | indeed, so cool | 22:46 |
mordred | ++ | 22:46 |
clarkb | pabelanger: linting job doesn't like the py37 job addition (I haven't looked at why yet) | 22:46 |
pabelanger | yah, looking now | 22:46 |
fungi | my inner moon landing conspiracy theorist says the life feed cutting out for the third booster touchdown was strategic ;) | 22:47 |
fungi | er, live feed | 22:47 |
mordred | pabelanger: do we want to add it to zuul so that we run 37 tests too? | 22:47 |
clarkb | to follow up on the docker ipv6 stuff I haven't heard anything back on either the issue or the PR yet | 22:47 |
clarkb | fungi: ha | 22:47 |
fungi | "no really, we landed it, i swear" | 22:47 |
mordred | fungi: yeah. I'm torn on whether to go conspiracy or 'streaming video sucks' | 22:47 |
fungi | i'm pretty sure streaming video sucks | 22:47 |
mordred | yeah | 22:48 |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul-jobs master: Add tox-py37 job https://review.openstack.org/651938 | 22:48 |
pabelanger | think that is our fix | 22:48 |
mordred | and I can imagine that broadcasting video WHILE a rocket lands on top of you might be even more suck | 22:48 |
fungi | on an unmanned drone, even | 22:48 |
mordred | pabelanger: oh - ignore me - my brain was somehow thinking you were adding this to the zuul repo :) | 22:48 |
pabelanger | mordred: can we pull python37 from bionic? I've been using fedora-29 myself | 22:48 |
mordred | pabelanger: dunno. I install python with pyenv myself | 22:49 |
pabelanger | but, we could if people want | 22:49 |
* mordred is such a terrible distro-company employee | 22:49 | |
clarkb | fungi: you can probably look outside with binoculars to check right? | 22:49 |
SpamapS | bionic has 3.7.1 and may not see many updates since it's in universe. | 22:49 |
clarkb | pabelanger: check out the work coreyb has done for openstack py37 testing | 22:50 |
clarkb | but ya its basically install it from universe and go | 22:50 |
pabelanger | clarkb: cool | 22:50 |
clarkb | SpamapS: I get some sense that canonical/coreyb are interested in it. To be seen how up to date it gets though | 22:51 |
fungi | SpamapS: though to be fair, it released with something like 3.7.0 beta 2 | 22:51 |
fungi | so they have at least updated it once already | 22:51 |
clarkb | pabelanger: the job should also work on fedora and tumbleweed | 22:54 |
clarkb | but tumbleweed will eventually stop having a python3.7 (when it gets 3.8) and fedora will eol in a few months | 22:54 |
pabelanger | clarkb: Yup, was mostly curious if we wanted to add another distro into the mix or stick with debuntu | 22:55 |
*** rlandy|ruck is now known as rlandy|ruck|bbl | 23:04 | |
openstackgerrit | Paul Belanger proposed openstack-infra/zuul-jobs master: Don't run zuul_debug_info_enabled under python2.6 https://review.openstack.org/650880 | 23:12 |
pabelanger | dmsimard: ^update | 23:12 |
pabelanger | clarkb: mordred: ^might have thoughts too | 23:12 |
openstackgerrit | Merged openstack-infra/zuul-jobs master: Add tox-py37 job https://review.openstack.org/651938 | 23:13 |
SpamapS | if people are interested in it, 3.7 will stay up to date. | 23:13 |
clarkb | pabelanger: while ansible itself may support python2.6 I'm not sure zuul can (no testing) | 23:15 |
*** paladox has quit IRC | 23:16 | |
*** paladox has joined #zuul | 23:16 | |
clarkb | if we do that check we should log a warning when we skip the info | 23:16 |
pabelanger | clarkb: yah, agree. This is mostly on the remote side of the node from nodepool, I have network images that are still python26 :( | 23:16 |
clarkb | so that it doesn't silently disappear | 23:16 |
pabelanger | that was part of my reason for adding a switch, to avoid hiding the magic | 23:17 |
*** rlandy|ruck|bbl has quit IRC | 23:21 | |
*** tobiash has quit IRC | 23:21 | |
*** corvus has quit IRC | 23:21 | |
*** jlvillal has quit IRC | 23:21 | |
*** mgoddard has quit IRC | 23:24 | |
*** mgoddard has joined #zuul | 23:27 | |
*** rlandy|ruck|bbl has joined #zuul | 23:27 | |
*** tobiash has joined #zuul | 23:27 | |
*** corvus has joined #zuul | 23:27 | |
*** jlvillal has joined #zuul | 23:27 | |
fungi | alternative is to build a python 3.something in an alternate path in your images and tell ansible to use that for ansible things | 23:50 |