04:01:47 <samP> #startmeeting masakari
04:01:47 <openstack> Meeting started Tue Apr 21 04:01:47 2020 UTC and is due to finish in 60 minutes. The chair is samP. Information about MeetBot at http://wiki.debian.org/MeetBot.
04:01:48 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
04:01:51 <openstack> The meeting name has been set to 'masakari'
04:02:00 <samP> Hi all
04:02:07 <samP> Let's start
04:02:36 <samP> #topic Critical patches and bugs
04:02:37 <suzhengwei> hi
04:02:47 <samP> suzhengwei: hi
04:03:26 <tpatil> LP bug: 1873736
04:03:28 <samP> Any patches that need to merge before we tag the RC for U (Ussuri)?
04:03:37 <tpatil> Patch: https://review.opendev.org/#/c/721174/1
04:04:43 <tpatil> Our FT doesn't use tls_proxy=True
04:05:00 <tpatil> Zuul job: masakari-functional-devstack-multinode
04:05:38 <suzhengwei> https://review.opendev.org/#/c/702328/
04:06:04 <suzhengwei> I will add unit tests today. Would you like to give it a review?
04:07:04 <tpatil> suzhengwei: I will review this patch
04:07:18 <suzhengwei> thank you
04:07:58 <samP> tpatil: suzhengwei: thanks
04:08:26 <suzhengwei> https://review.opendev.org/#/c/720623/
04:08:46 <samP> tpatil: about the TLS issue, should we use tls_proxy?
04:09:17 <samP> I don't understand the issue correctly
04:09:26 <suzhengwei> samP: you have given some comments. If you have time, we can talk about it later
04:09:28 <tpatil> I think we should set it to True, but we still don't have any FT that will evacuate a VM
04:10:46 <tpatil> If we set tls_proxy=True, then I expect the FT written for VM notifications to fail.
04:11:14 <tpatil> The changes made in the patch are simple, but I will need to verify them first before merging this patch
04:12:28 <suzhengwei> yes, we can talk about it with nova project developers.
04:13:57 <samP> In this patch, it passes the CA cert file to the keystone auth session. If the session goes through the TLS proxy then it will be used. Otherwise, it will be ignored.
04:14:05 <samP> I think..
04:14:33 <samP> since we don't have an FT for this, I
04:14:46 <samP> can't tell for sure,
04:15:04 <suzhengwei> we can look at how other projects deal with this common issue.
04:15:40 <samP> suzhengwei: yes, let's look around..
04:16:51 <tpatil> samP: NotificationVMTestCase->test_create_notification will invoke the changes made in the above patch.
04:18:45 <tpatil> samP: but you are right, we don't have a mechanism to set the nova_ca_certificates_file config option, but even in that case the said FT should fail.
04:20:19 <tpatil> currently, there is no way to verify the changes in the gate; we will need to verify them manually ourselves
04:20:29 <samP> Could it be that the FT passes because nova_ca_certificates_file is NULL?
04:20:31 <tpatil> but the changes look simple and pretty straightforward.
04:21:34 <tpatil> if tls_proxy is enabled, it would expect you to set nova_ca_certificates_file; otherwise I'm expecting it to raise SSLError
04:22:42 <tpatil> if tls_proxy is set to True, I expect the nova_ca_certificates_file config option should be set properly; otherwise I'm expecting it to raise SSLError
04:23:34 <samP> correct.
04:24:03 <tpatil> one thing is clear: we cannot test it in the gate, we need to verify manually. I will see if I can make changes to the gate job to verify these changes.
04:24:39 <tpatil> I mean the masakari gate job "masakari-functional-devstack-multinode".
04:25:05 <samP> tpatil: thanks. Can you please comment on this patch with your findings
04:26:03 <tpatil> Simply need to verify, no findings. The changes look good to me.
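The discussion above is about whether the CA bundle configured for masakari (the nova_ca_certificates_file option tpatil mentions) actually reaches the keystone auth session used for nova calls when devstack's tls-proxy is enabled. Below is a minimal, illustrative sketch of that wiring; the option registration and helper name are assumptions for illustration and are not taken from the patch under review (https://review.opendev.org/#/c/721174).

```python
# Illustrative sketch only -- not the code from review 721174.
# Shows how a CA bundle from a nova_ca_certificates_file-style option can be
# handed to a keystoneauth1 session so nova API calls made through a TLS
# proxy are verified against the proxy's certificate.
from keystoneauth1 import loading as ks_loading
from keystoneauth1 import session as ks_session
from oslo_config import cfg

CONF = cfg.CONF
NOVA_GROUP = 'nova'

CONF.register_opts([
    cfg.StrOpt('nova_ca_certificates_file',
               help='CA bundle used to verify the nova endpoint when it is '
                    'served through a TLS proxy.'),
], group=NOVA_GROUP)
ks_loading.register_auth_conf_options(CONF, NOVA_GROUP)


def get_nova_session():
    """Return a session whose TLS verification uses the configured CA file."""
    auth = ks_loading.load_auth_from_conf_options(CONF, NOVA_GROUP)
    # If no CA file is configured, fall back to normal system verification;
    # this is why an FT run with tls_proxy=False still passes even when the
    # option is left unset (NULL).
    verify = CONF[NOVA_GROUP].nova_ca_certificates_file or True
    return ks_session.Session(auth=auth, verify=verify)
```

With tls_proxy=True and no CA file configured, verification against the proxy's self-signed certificate would fail, which matches the SSLError tpatil expects.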
04:26:25 <samP> in the meantime, let's look at how other projects are dealing with the issue.
04:26:36 <tpatil> Are you asking me to add a comment to make changes to the devstack plugin in order to verify this change in the gate?
04:27:11 <samP> tpatil: no, I mean if you need to change something in this patch
04:27:28 <samP> if it is good, then it is ok
04:27:54 <tpatil> ok, got it.
04:28:21 <tpatil> Need patch we were talking about is: https://review.opendev.org/#/c/720623
04:29:28 <samP> tpatil: suzhengwei: thanks for bringing this up
04:29:29 <tpatil> I mean next patch....
04:29:41 <samP> got it
04:31:43 <tpatil> First question: do we recommend running masakari-engine on multiple hosts?
04:32:06 <tpatil> I don't think so..
04:32:53 <samP> masakari-engine will not work correctly in that case
04:33:11 <tpatil> samP: Correct, that's my understanding too
04:33:32 <suzhengwei> actually, we have a large-scale cloud env: three controllers with three masakari-api/engine.
04:33:53 <suzhengwei> they work correctly.
04:34:26 <samP> do you divide them into separate clusters?
04:35:29 <samP> masakari-monitors only talks to one masakari-api, and that masakari-api passes messages to one masakari-engine
04:35:37 <suzhengwei_> actually, we have a large-scale cloud env: three controllers with three masakari-api/engine. it works correctly.
04:35:39 <samP> am I correct?
04:36:20 <tpatil> there will be no issues except the periodic task
04:36:44 <tpatil> _process_unfinished_notifications
04:36:44 <suzhengwei_> yes
04:37:24 <tpatil> There is a possibility that all three engines process the same notification due to a concurrency issue
04:37:28 <samP> Then it works as 3 different masakari clusters, with nothing shared between the clusters.
04:38:16 <suzhengwei_> the messaging is shared via AMQP.
04:39:17 <samP> OK then, what tpatil is saying could happen
04:39:28 <suzhengwei_> like other projects, masakari can run with multiple workers.
04:41:22 <samP> I need to check this, but my understanding is the same as tpatil's.
04:42:08 <samP> masakari can run multiple workers, but you might face some concurrency issues
04:43:05 <suzhengwei_> I didn't face any concurrency issue.
04:43:48 <suzhengwei_> except for the reported bug.
04:46:35 <tpatil> suzhengwei_: Need time to think about this new periodic task and all possible scenarios associated with running masakari-engine on multiple machines, because the likelihood of setting FAILED status for already running notifications is high.
04:47:18 <suzhengwei_> ok.
04:47:38 <tpatil> if a notification is in RUNNING status, that means an engine is processing that notification.
04:47:51 <tpatil> if it errors out, then the status would be set to ERROR
04:48:09 <tpatil> which would then be picked up by the periodic task for a second retry
04:48:20 <suzhengwei_> to keep the risk low, the default expiry is set to 86400 seconds = one day.
04:49:19 <tpatil> but this inconsistency issue could still occur in edge cases..
04:50:57 <suzhengwei_> most times, when notifications have remained for so long, the env has changed a lot, so it is high risk to process them again.
04:52:14 <tpatil> so in such cases, you don't want these notifications to be processed, correct?
04:52:30 <suzhengwei_> yes, not processed again.
04:53:01 <tpatil> so in that case you must at least ignore notifications in RUNNING status
04:53:42 <samP> and you might need to alert the operator about them.
04:53:56 <suzhengwei_> failure is allowed. continued operation should not be blocked.
04:57:15 <tpatil> Can it be based on notification updated_at instead of notification.generated_time?
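The concern tpatil raises is that a periodic task such as _process_unfinished_notifications, or the new expiry task proposed in https://review.opendev.org/#/c/720623, runs in every masakari-engine, so with three engines the same notification could be picked up, or marked FAILED, more than once. The sketch below illustrates the shape of the problem and one common mitigation (an atomic compare-and-swap on the status column); the db helper names and callback are hypothetical and do not correspond to masakari's actual API.

```python
# Hypothetical sketch -- the db.* helpers and process_callback stand in for
# masakari's real DB API and recovery workflow driver.  It illustrates why
# several engines running the same periodic task can race, and how claiming
# a notification atomically avoids double processing.
from oslo_utils import timeutils

EXPIRED_AFTER_SECONDS = 86400  # the one-day default mentioned above


def process_unfinished_notifications(db, process_callback):
    now = timeutils.utcnow()
    # Notifications in RUNNING status are deliberately not selected: a
    # RUNNING notification is still owned by some engine.
    for notification in db.get_notifications_by_status(['new', 'error']):
        age = (now - notification.generated_time).total_seconds()
        if age > EXPIRED_AFTER_SECONDS:
            # The environment has likely changed too much since the failure;
            # mark it FAILED rather than re-running recovery.
            db.update_notification_status(notification.id, 'failed')
            continue
        # Atomic claim: succeeds only if the status is still new/error.
        # Without this, all three engines in the deployment described above
        # could start recovery for the same notification.
        claimed = db.compare_and_update_status(
            notification.id, expected=['new', 'error'], new='running')
        if claimed:
            process_callback(notification)
```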
04:58:04 <samP> that should be more appropriate.... I think
04:58:22 <samP> Anyway, I will add my comments to the patch.
04:59:02 <suzhengwei_> no. generated_time is the exact time when the failure occurred.
04:59:18 <samP> since we don't have much time, send an email to the ML if you need any patch to merge for U
05:00:17 <samP> sorry to cut off the meeting. We don't have more time.
05:00:29 <samP> Let's discuss further in gerrit.
05:00:35 <tpatil> ok
05:00:37 <suzhengwei_> If the failure hasn't been processed for a long time, I think there is no need to process it again.
05:00:55 <suzhengwei_> ok
05:01:01 <suzhengwei_> bye
05:01:03 <samP> Thank you all for coming!
05:01:09 <samP> #endmeeting
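For context on the last open question: keying expiry on generated_time bounds a notification's total lifetime since the failure happened, while keying it on updated_at measures how long it has sat untouched since the last retry. A small illustration with hypothetical timestamp values follows.

```python
# Hypothetical values only, to illustrate the generated_time vs updated_at
# choice discussed at the end of the meeting.
from datetime import datetime, timedelta

now = datetime(2020, 4, 21, 5, 0, 0)
generated_time = now - timedelta(days=2)   # when the failure actually happened
updated_at = now - timedelta(hours=1)      # last touched, e.g. by a retry

expiry = timedelta(days=1)

# Keyed on generated_time: expired -- the failure itself is old, and the
# environment may have changed too much to re-run recovery safely.
print((now - generated_time) > expiry)     # True

# Keyed on updated_at: not expired -- the notification was recently retried,
# so an engine may still be actively working on it.
print((now - updated_at) > expiry)         # False
```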