16:00:05 <gibi> #startmeeting nova
16:00:05 <openstack> Meeting started Thu May 7 16:00:05 2020 UTC and is due to finish in 60 minutes. The chair is gibi. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:06 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:09 <openstack> The meeting name has been set to 'nova'
16:00:14 <gibi> o/
16:00:38 <artom> ~o~
16:00:42 <bauzas> \o
16:00:50 <gmann> o/
16:00:51 <dansmith> .
16:01:23 <melwitt> o/
16:01:35 <gibi> #topic Last meeting
16:01:41 <gibi> #link Minutes from last meeting: http://eavesdrop.openstack.org/meetings/nova/2020/nova.2020-04-30-16.00.log.html
16:01:52 <gibi> is there anything to bring back from the last meeting?
16:02:00 <dansmith> I keep seeing that topic and getting falsely excited that *this* is the last of these meetings :)
16:02:24 <gibi> :) no it is not
16:02:37 <gibi> #topic Bugs (stuck/critical)
16:02:42 <gibi> No Critical bugs
16:02:49 <gibi> #link 31 new untriaged bugs (-7 since the last meeting): https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New
16:03:08 <bauzas> thanks gibi
16:03:08 <gibi> we are still on a downward trend but slowing down
16:03:16 <bauzas> I will help next week
16:03:26 <gibi> I want to reach 0 in the next couple of weeks if possible
16:03:33 <bauzas> we have a PTG discussion for this
16:03:35 <gibi> bauzas: thanks
16:04:07 <gibi> I'm not tracking any RC critical bug at the moment
16:04:13 <gibi> #link https://bugs.launchpad.net/nova/+bugs?field.tag=ussuri-rc-potential
16:04:28 <gibi> any bug we need to discuss today?
16:05:21 <gibi> #topic Release Planning
16:05:31 <gibi> We cut RC2 this week to include the fix https://review.opendev.org/#/q/topic:bug/1875418+(status:open+OR+status:merged)
16:05:59 <gibi> I don't see anything that is blocking a GA now so I assume RC2 will be the GA code
16:06:21 <gibi> please raise any issue with the ussuri release basically now as the RC deadline is today
16:06:53 <gibi> anything else to discuss about the release?
16:06:54 <bauzas> #link https://releases.openstack.org/ussuri/schedule.html
16:07:07 <bauzas> GA is next week
16:07:11 <gibi> yepp
16:07:21 <bauzas> so unless we have a very large regression, I think we can hold
16:07:29 <gibi> and next week there will be a community call to present Ussuri for the world
16:07:53 <gibi> I will talk 5 minutes about what we did in the last cycle, like a really mini project update
16:09:11 <gibi> #link http://lists.openstack.org/pipermail/openstack-discuss/2020-May/014676.html
16:09:21 <gibi> this is the details of the community call ^^
16:09:35 <gibi> #topic Stable Branches
16:09:50 <gibi> I did not see any major event on the stable branch
16:10:00 <gibi> lyarwood: if you are around, do you have any news?
16:11:47 <gibi> I guess he is not around
16:12:01 <gibi> #topic Sub/related team Highlights
16:12:06 <gibi> API (gmann)
16:12:24 <gmann> i have not checked the API related spec for V cycle yet
16:12:32 <gmann> one thing going on is healthcheck #link https://review.opendev.org/#/c/724684/
16:12:59 <gibi> gmann: do we need a bp for that?
16:13:00 <gmann> i have added this to discuss in PTG also, discussion going in review too.
16:13:39 <gmann> i asked for spec to have a complete things we can do now and later at least we know we want to do later so that we can design this not breaking when we add other things later
16:13:55 <gmann> like unauth, enable/disable options ect
16:13:57 <gmann> etc
16:14:02 <gibi> spec is even better especially if there are multiple steps
16:15:14 <gmann> yeah. we can ship it a minimum things for now and i am checking if adding things is possible as config option or not
16:16:04 <gmann> main concern is when we add new things, we can add it in compatible way. like on-demand deeper checks
16:16:04 <artom> "we can ship it a minimum things for now" + 1 to that
16:16:45 <artom> Are we discussing this in detail now? One idea I had was make it authenticatable from the start, but for now just return the basic 200 OK for everything, authenticated or not
16:16:59 <gmann> but i have not checked with poc yet is that work with oslo.middleware or we need to add extra filter for that.
16:17:00 <artom> And then we can spec out the "deep status" healthcheck
16:17:33 <melwitt> yeah I wanted to ask gmann if starting out unauth'ed and then upgrading to auth later, would that pose an issue from the API perspective?
16:17:53 <artom> And zigo makes a good point in the review that it needs to be fast, because haproxy will be hitting it every second
16:18:02 <dansmith> presumably this isn't going to be versioned as strictly as the rest of the API right?
16:18:13 <gmann> melwitt: it will as many load balancer use without auth and if they need token then it will break them
16:18:16 <artom> So it's probably a bad idea to try authentication on every request
16:18:34 <artom> There should be a "were authentication headers sent? No --> quick 200 OK" mechanism
16:18:40 <gibi> artom: nothing heavy on the agenda so I think it is OK to have a sneak peek of the feature to draw attention
16:19:02 <bnemec> I don't think things like haproxy are going to be able to auth, so if we add auth we still need to have a basic healthcheck that is unauth'd.
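[Editor's sketch] The "were authentication headers sent? No --> quick 200 OK" idea from artom above could look roughly like the plain WSGI app below. This is only an illustration under assumptions, not the patch under review and not oslo.middleware code: the names healthcheck_app and deep_status are hypothetical, and checking for an X-Auth-Token header (HTTP_X_AUTH_TOKEN in the WSGI environ) is assumed as the signal that a caller wants the authenticated path.

```python
# Hypothetical sketch: answer unauthenticated probes immediately and only
# take the (not yet designed) detailed path when a token header is present.

AUTH_ENVIRON_KEY = "HTTP_X_AUTH_TOKEN"  # WSGI environ key for the X-Auth-Token header


def deep_status(environ):
    """Placeholder for the "deep status" check that still needs a spec."""
    return b"deep status not implemented yet\n"


def healthcheck_app(environ, start_response):
    headers = [("Content-Type", "text/plain")]
    if AUTH_ENVIRON_KEY not in environ:
        # Load balancers such as haproxy send no token: no auth work at all,
        # just a cheap 200 OK.
        start_response("200 OK", headers)
        return [b"OK\n"]
    # A token was sent. A real implementation would validate it before
    # returning anything more detailed; that step is omitted here.
    start_response("200 OK", headers)
    return [deep_status(environ)]
```

The point of the branch order is that the common case (a probe every second from each load balancer) never reaches any authentication or backend code.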
16:19:18 <dansmith> we could pretty easily build the healthcheck data from authenticated requests
16:19:55 <gmann> true
16:19:55 <dansmith> unauth'd healthchecks include very coarse information, which may be up to date if there are auth'd requests keeping it fresh, and if not, it's no worse than a basic check
16:20:04 <zigo> Not even *one* haproxy hitting it every second, but in most case, 3, so 3 queries per second, constantly.
16:20:13 <artom> bnemec, almost like we need different URLs, one for load balancers, one for humans or other more advanced monitoring solutions
16:20:36 <gmann> zigo: yeah, default of healthcheck can be a fast responding things.
16:20:47 <dansmith> zigo: ack, yeah and if we have three cells, that's five databases per check, three mqs per check, which is a good reason to build that information in a cache and just return it from healthchecks
16:20:48 <gmann> anyways all these things to discuss so spec can be better
16:20:49 <bnemec> artom: That would probably be the simplest.
16:21:33 <gibi> feels like we have plenty of things for the spec. lets continue there
16:21:45 <gibi> gmann: any other API related thing you want to mention?
16:22:01 <gmann> that's all for today from me
16:22:04 <gibi> cool, thanks
16:22:06 <gibi> Libvirt (bauzas)
16:22:46 <zigo> Do everyone agree that the current healthcheck can still be approved, in the mean while?
16:23:18 <zigo> *does
16:23:39 <gibi> zigo: we need to know that our future plans with the healthcheck as an extension of the current simple API
16:23:49 <gibi> are viable
16:24:17 <gmann> zigo: yeah so that we do not need to change the current proposed. healthcheck usage
16:24:36 <gmann> i mean discuss in spec first and then do current proposed one
16:24:46 <dansmith> definitely discuss in spec first
16:25:26 <bnemec> For reference, there was a previous healthcheck spec with a bunch of discussion: https://review.opendev.org/#/c/531456
16:25:54 <dansmith> yeah, I remember,
16:25:54 <gmann> bnemec: thanks that will be good ref to check too
16:26:12 <dansmith> plenty of fodder there for needing a wider discussion
16:26:59 <zigo> FWIW: the same type of patch has already been approved for Neutron, Heat and Cinder, so it's kind of weird that we aren't getting things cross-project this way.
16:27:00 <bnemec> Oh, this also has a great list of previous discussions: https://storyboard.openstack.org/#!/story/2001439
16:27:28 <dansmith> zigo: omg, I'm convinced.. best argument ever
16:27:43 <zigo> :)
16:27:46 <dansmith> :P
16:27:52 <bauzas> gibi: sorry was off
16:28:03 <gibi> bauzas: no worries I call you again
16:28:04 <bauzas> nothing to say, but aarents asked for some changes
16:28:11 <bauzas> https://etherpad.opendev.org/p/nova-libvirt-subteam
16:28:16 <bauzas> will try to review them soon
16:28:33 <gibi> bauzas: cool thanks
16:28:37 <bauzas> that's it
16:28:48 <bauzas> kashyap also has a point about q35 but he's not around
16:29:13 <gibi> lets quickly finish the agenda and then we can get back to the healthcheck discussion in the Open
16:29:21 <gibi> #topic Stuck Reviews
16:29:40 <gibi> nothing on the agenda. Does anybody have a stuck review to bring up?
16:30:53 <gibi> #topic Virtual PTG planning
16:31:00 <gibi> Current nova schedule is on the top of the etherpad #link https://etherpad.opendev.org/p/nova-victoria-ptg
16:31:09 <gibi> Cyborg also wants to talk with us about SmartNic and that discussion is now scheduled for June 5 Friday 14:00 UTC - 15:00 UTC
16:31:34 <gibi> anything else about the virtual PTG ?
16:32:10 <gmann> do we want to move healthcheck topic with oslo as cross project?
16:32:29 <gmann> i added at L 179 for now
16:33:30 <artom> Not sure it's oslo cross-project... It's already merged in other projects (ex: https://review.opendev.org/#/c/724676/), so if we want cross-project uniformity (which I think is important), our hands are kinda tied in that sense
16:33:33 <gibi> gmann: If you feel bnemec or other folks from oslo would be good to join to that discussion then lets try to have some dedicated time for an oslo-nova cross session
16:34:00 <gibi> bundled with the policy discussion
16:34:01 <artom> Like, making it authenticatable and future-proof are important, but it'd be bad form to go off and do our own thing entirely.
16:34:27 <gmann> ok
16:34:50 <bnemec> I think it's important to keep in mind that there are two things here: enabling the existing simple healthcheck, and designing the next-gen fancy healthcheck
16:34:57 <bnemec> The latter should not block the former IMHO.
16:35:51 <gibi> #topic Open discussion
16:36:07 <artom> bnemec, agreed. I guess the point is, if we want to have the same on the same URL (which is debatable in my mind), we need to build in things the latter might need from the start
16:36:08 <gibi> we can continue the healthcheck discussion now in the Open
16:36:20 <artom> *have them both on the same URL
16:36:20 <gibi> (as nothing else on the agenda for Open)
16:36:56 <dansmith> artom: yeah, that's the thing I'd want to know
16:37:13 <dansmith> I don't want to have /healthcheck, /useful_healthcheck, /no_serously_this_one, etc
16:37:34 <bnemec> If having them both on the same URL blocks having any healthcheck for the next two years then I think that's a bad approach.
16:37:36 <artom> dansmith, well, yeah, but realistically how many are we going to have?
16:37:53 <bnemec> I note that https://storyboard.openstack.org/#!/story/2001439 mentioned possibly different behavior for GET vs HEAD.
16:37:58 <artom> dansmith, one simple, unauthenticated, unversioned, one "fancy", authenticated, versioned
16:38:00 <dansmith> we've already identified several levels..
16:38:12 <bnemec> I have no idea if that's an API no-no though.
16:38:29 <bauzas> honestly, I co-contributed to this change, but I'm not opinionated a single bit.
16:38:33 <gmann> i think it should be doable with same url with extra 'backends' to check for oslo? but need to try
16:38:37 <dansmith> artom: honestly, what does the simple unauth'd one tell you? that apache and mod_wsgi is working right?
16:38:59 <dansmith> artom: is there any difference between hitting that check vs just the version manifest?
16:39:07 <artom> dansmith, there isn't
16:39:07 <gmann> extra configured 'backends'
16:39:22 <artom> dansmith, the argument from operators is having every project have a common URL for that
16:39:37 <artom> And not nova with /versions, neutron with /healthcheck, cinder with /status or whatever
16:40:02 <artom> (I made up the last one)
16:40:09 <bauzas> honestly, if we have different URLs between services, we don't need the healthcheck one
16:40:23 <dansmith> can't you hit the / on everyone's api and get the same result?
16:40:38 <artom> dansmith, I dunno, can you?
16:40:40 <gmann> not all service has / (versions) url ?
16:40:46 <artom> zigo ^^ ?
16:40:54 * zigo reads the backlog
16:40:56 <dansmith> I don't really know what the oslo base bit gives us... I thought we could provide a function to generate the report or something. is that the case or not?
16:41:36 <dansmith> gmann: don't they all redirect to something like the version doc? anyway, I'm not really suggesting that as an alternative, I'm just saying a "hello world" seems pointless to me
16:41:51 <bauzas> dansmith: the only thing that would be nice for ops is that they can disable the healthcheck on their wishes
16:42:05 <dansmith> bauzas: sorry, what?
16:42:15 <zigo> dansmith: You won't get the same result, no, you get a "300 multiple choice", that's not what operators need.
16:42:21 <zigo> We need a "200 ok" ...
16:42:25 <bauzas> dansmith: the healthcheck API can return 'sorry, 503' if a file is provided
16:42:32 <gmann> dansmith: we can implement extra plugins (than default one of file existence check) to generate the report and add in oslo to check all plugins added for healthcheck app
16:43:07 <dansmith> okay I don't understand either of those fully
16:43:07 <bauzas> that's the only single bit that can help HAProxy more than just checking a port
16:43:24 <gmann> dansmith: current default plugins are file checks.
16:43:28 <bauzas> but honestly, as a support engineer years ago, I wasn't trusting healthchecks
16:43:33 <zigo> And the idea behind the file is so one can turn off the API in a nice way: tell Haproxy, I'm going to turn off the API... then really do it.
16:43:35 <gmann> and yes, port with file
16:43:41 <dansmith> bauzas: right because they tell you nothing?:)
16:43:43 <bauzas> I preferred homemade checks based on logics
16:43:52 <bauzas> for my haproxy backends
16:44:24 <dansmith> if the goal is really to have a completely pointless not-really-health-related common url across all projects then whatever
16:44:51 <zigo> dansmith: The point is having something to query for haproxy, nothing more, nothing less.
16:45:18 <gmann> exactly, it should be 'yes healthy' means your request should be success (as per general checks we did for minimum required things)
16:45:24 <zigo> If we're capable of providing more than that, great, but this shouldn't wait for spec, design, doc, test, implementation, etc.
16:46:17 <zigo> My original patch barely activated a feature we already have...
16:46:18 <gmann> zigo: and if providing more leads to change the existing used one then also fine ?
16:46:32 <zigo> Yeah, great too ! :)
16:46:38 <bnemec> Also worth noting that the /healthcheck endpoint is already enabled for some services, so even if we decide to completely redesign it we can't ignore the existing one.
16:46:40 <zigo> If it becomes more reliable, that's bonus points.
16:47:09 <dansmith> gmann: for zigo's use case, but people will write nagios plugins and other monitoring infra against this of course, so while zigo and others only care about the "200 OK" the devil is in the details, like it always is
16:47:53 <zigo> dansmith: Operators do know that this is not enough for monitoring.
16:48:03 <zigo> I could send you my scripts if you like! :)
16:48:09 <dansmith> super unfortunate that we called it /healthcheck don't you think?
16:48:16 <artom> dansmith, which why documenting what this actually is and its limitations is important, but I don't see that as a reason to not do it. It's an unobtrusive chance.
16:48:18 <artom> *change
16:48:40 <dansmith> artom: of course, the code isn't the obtrusive part :)
16:48:50 <zigo> We can call it "/my-http-api-server-is-alive-and-haproxy-can-query-it" but that's a bit long to type ...
16:48:58 <artom> What is? The time we're spending debating this? ;)
16:49:46 <artom> dansmith, plus, it means you'll get to write another massively influential blog about /healthcheck vs /ping vs /status, like your evacuate one ;)
16:50:17 <zigo> dansmith: For the monitoring, what we do with nova-api is actually querying https://${HOSTNAME}:8774/v2.1/servers and see if the monitoring instance is in the list for that project.
16:50:32 <zigo> That's much better than just checking /healthcheck of course.
16:50:35 <dansmith> I think what artom is saying is that any change that has few lines of code isn't worth discussing regardless of the actual impact
16:50:49 <gibi> my opinion consistency across services are good so I'm +1 on /healthcheck as of today returning a plain 200 OK. But have an agreement in a spec that if we want to extend that 200 OK with more information then how we extend the /healthcheck API. I'm now OK to have the unauthed vs authed switch between simple 200 OK and complex healthcheck result
16:50:51 <artom> dansmith, that's completely false and you know :P
16:50:52 <artom> *know it
16:51:35 <artom> This is *adding* and *independant* thing that operators can use or not, at their leisure
16:51:40 <artom> *an *independant*
16:51:56 <bnemec> I guess I don't understand the huge drawback of having people write monitoring checks against a /healthcheck designed for such a thing versus them writing hacky checks against / that doesn't behave the way they want.
16:52:05 <artom> Though I'll grant that the concern about evolving it is a valid one
16:54:19 <melwitt> if it's called /healthcheck, operators are going to expect it to check health to some extent. and not just be a liveness check (like checking for an open port or something)
16:54:49 <melwitt> so if that was not the intention, I agree the name choice is unfortunatel
16:54:56 <melwitt> -l
16:54:57 <zigo> melwitt: In simple words: *no* ! :)
16:55:06 <gmann> yeah and that is what i thought it was when i first saw. i was not aware of previous oslo spec discussion.
16:55:30 <gmann> or until i saw the oslo code
16:55:32 <melwitt> zigo: what are you saying "no" about?
16:55:42 <zigo> As an operator, we do all sorts of things to check if everything is up, not just checking /healthcheck. If that is your concern, then we can further document that this is not (yet?) what it is for.
16:56:24 <artom> melwitt, so put a .. warning:: in the documentation saying this is just making sure that the HTTP service is operational
16:56:43 <zigo> artom: Right.
16:56:44 <zigo> :)
16:56:52 <gibi> 3 minutes left, lets try to wrap it up here but continue it on #openstack-nova and/or in a spec
16:56:54 <melwitt> right, and so what is /healthcheck giving you beyond other checks like whether something is listening on port 8774 or that nova-api responds to http request?
16:57:23 <melwitt> well, anyway, I think my point is clear. we can wrap it
16:57:40 <artom> melwitt, the '200 OK' status - / (or /versions?) is "300 multiple choice"
16:57:41 <zigo> melwitt: If you don't give haproxy some URL to query, it's going to connect to the port, then disconnect, which is very ugly.
16:58:08 <zigo> So we got to give it an URL, and that URL must reply "200 ok".
16:58:14 <zigo> That's what the /healthcheck is for ...
16:58:30 <melwitt> I understand that, just saying if this is not a healthcheck a different name would have been more appropriate
16:58:42 <melwitt> this is implying that health is being checked, obviously
16:59:05 <artom> melwitt, fair point
16:59:08 <gmann> how about /healthcheck -> all deeper checks and /healthcheck?https-only-check -> minimum check as proposed
16:59:09 <zigo> "Thu May 7 16:58:59 2020 - SIGPIPE: writing to a closed pipe/socket/fd (probably the client disconnected) !!!"
16:59:23 <zigo> That's what I get constantly in my logs if I don't activate healthcheck stuff.
16:59:53 <artom> melwitt, but in the interest of cross-project uniformity, and because we can't go back in time and other projects have merged this (for better or worse), our hands are kinda tied
17:00:09 <gmann> and default is former one, do all deeper checks as this endpoint name suggest
17:00:16 <zigo> (in my case, that's when using uwsgi)
17:00:17 <gibi> OK. thank you folks. continue it on #openstack-nova
17:00:25 <gibi> #endmeeting
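
[Editor's sketch] The file-based behaviour bauzas and zigo describe in the log (reply "200 OK" normally, reply 503 once an operator drops a file so haproxy drains the backend before the API is really stopped) can be illustrated with a few lines of plain WSGI. This is a simplified stand-in written for this log, not the oslo.middleware implementation; make_healthcheck_app and the disable-file path argument are hypothetical names.

```python
import os


def make_healthcheck_app(disable_file):
    """Build a WSGI app that reports 503 while disable_file exists.

    disable_file is a hypothetical path chosen by the operator; creating it
    tells the load balancer to stop sending traffic before the API goes down.
    """
    def app(environ, start_response):
        headers = [("Content-Type", "text/plain")]
        if os.path.exists(disable_file):
            start_response("503 Service Unavailable", headers)
            return [b"DISABLED BY FILE\n"]
        # Nothing else is checked: this is a liveness answer for haproxy,
        # not a statement about databases, message queues or cells.
        start_response("200 OK", headers)
        return [b"OK\n"]

    return app
```

As several people note in the discussion, a check like this only shows that the HTTP service answers; monitoring of actual service health needs something deeper.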