03:02:27 #startmeeting openstack-cyborg 03:02:29 Meeting started Thu Apr 16 03:02:27 2020 UTC and is due to finish in 60 minutes. The chair is Sundar. Information about MeetBot at http://wiki.debian.org/MeetBot. 03:02:30 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. 03:02:32 The meeting name has been set to 'openstack_cyborg' 03:02:38 welcome to songwenping_ 03:03:01 from beijing university? 03:03:04 Welcome, songwenping_ . Would you like to introduce yourself? 03:03:14 Sundar: Good. How's it going with your new job? 03:03:14 welcome songwenping 03:03:25 Hi all 03:03:28 welcome songwenping 03:03:45 welcome 03:03:55 I am from Inspur. 03:04:15 Good. Brin has company. :) 03:04:27 do the nova-cyborg-interaction work. 03:04:42 songwenping_ mainly work on nova-cyborg interaction, and his take some patch in this meeting 03:04:54 Sundar, Yumeng, start meeting? 03:05:04 Interesting. What aspect of Nova interaction? 03:05:34 welcome 03:05:37 brinzhang,songwenping_ : Are you planning to use cyborg in production, right? 03:05:40 GPU, and others 03:05:58 Yumeng, considering 03:05:59 brinzhang: The meeting has started. We'll get to the patches after hearing a bit more about the Nova interaction. 03:06:22 Is it about more instance ops like shelve and suspend? 03:07:32 songwenping_, brinzhang: ^ 03:08:04 Sundar: now follow you patch https://review.opendev.org/#/c/716185/6/api-guide/source/accelerator-support.rst 03:08:12 not do these operators 03:08:28 Sundar: I'd like to know what operation could be add in next release, shelve/unshelve, like what else? And is there any gap here to complete. Because I have started to investigate it and hope add more operations in Nova in next release. 03:09:11 Sundar, we also blocked these operations now, maybe we can open that after enough develop 03:09:34 hi all 03:09:55 xinranwang,brinzhang,Sundar and all: here is the etherpad for V release. https://etherpad.opendev.org/p/cyborg-victoria-goals we can add topics now. 03:09:58 Sundar: we are still in testing 03:10:40 xinranwang: The API guide that brinzhang posted above lists the unsupported operations. Many opes there are important for operators. Sean Mooney from Nova is also working on some of these ops for V release, I think. You may want to check with him. 03:11:00 brinzhang: Good to know. Thanks. 03:11:15 Yumeng: sean-k-mooney works on evacuate now 03:11:20 Yumeng: Thanks for posting the V release etherpad. 03:11:35 https://review.opendev.org/#/c/715326/ 03:12:04 Hi Sundar, do you mean sean will implement some virtual machine operations in the next version? 03:12:22 sean-k-mooney: Is there already some patches from sean mooney? I'd like to have a look. 03:12:28 chenke: Yes, he says he intends to do that 03:12:28 Sundar: 03:12:46 xinranwang: Please see https://review.opendev.org/#/c/715326/ 03:13:15 Sundar: thanks. 03:13:15 Okay, this is good news, for both cyborg and nova》 03:13:51 Yes.Shall we get started on our patches? https://review.opendev.org/#/q/status:open+project:openstack/cyborg+branch:master 03:14:03 I am reviewing https://review.opendev.org/#/c/696089/ 03:14:33 Sure. 03:14:55 Sundar https://review.opendev.org/#/c/696089/, how about this patch? need your check 03:15:25 except some nits inline, others looks good to me 03:15:36 brinzhang: Yes, that's what I am looking at. Will complete after this meeting. 03:16:13 BTW, Release candidate 1 for Ussuri is due Apr 23 https://releases.openstack.org/ussuri/schedule.html 03:16:36 So, 8 days left. What are the highest priority patches to merge before that? 03:17:11 brinzhang: Thanks for posting 696089! 03:17:22 Programming support https://review.opendev.org/698190 ? 03:18:18 link: https://etherpad.opendev.org/p/cyborg-victoria-goals 03:18:32 I have add this program API in the etherpad 03:18:42 this is really needed in product env to adopt cyborg 03:18:43 Sundar: from my side. I think Zuul check bandit failure can be merged before Ussuri is due. https://review.opendev.org/#/q/topic:fix-bandit-check-failures+(status:open+OR+status:merged) 03:18:56 Yumeng: agreed 03:19:23 But it is failing the bandit check :) 03:19:32 Yumeng, I found the bandit are not be voted, so you should enable it in zuul task 03:19:59 and that should be works fine, that we can consider merge it. 03:20:31 a questions, what is bandit to do? I am know few of it 03:20:32 Sundar, brinzhang: yes. After we fixed all failure, we can set that to voting. and at that time, bandit check will not fail again. 03:21:09 IMO, firstly, you should enable this task in zuul task 03:21:29 brinzhang: Bandit is a tool to detect security flaws. https://pypi.org/project/bandit/ 03:21:32 than fix the failure, that we can ensure that correctly 03:22:20 Sundar: ack, thanks, will see it later 03:22:23 brinzhang: bandit is now a non-voting job. We need to fix the issues there and then make it voting. Otherwise, other open patches will be affected. 03:22:48 https://review.opendev.org/#/c/696089/ The rafactor arq api is also important, I think. 03:22:48 bandit is for code security check, which can check operations like if shell operations in cyborg code is safe 03:22:49 Ye. I think that's the right way 03:23:07 So, the programming API https://review.opendev.org/698190 is not for Ussuri? 03:23:10 Sundar: I mean, we should enable it as the base patch, and then fix the issue, right? 03:24:00 Bandit will scan the python code and gen AST to check security issues 03:24:49 Sundar , shaohe_feng : As I mentioned in the program patch , as reply for the Sundar's comment (https://review.opendev.org/698190) , I would like to take discussion the interface of PATCH API or not. IMO, that effects the schedule of merge. 03:25:15 we had better let programming API in cyborg as soon as possible 03:25:34 so we should try do our best to make it ready 03:26:04 this is a very important feature 03:26:16 s_shogo thank you for post this feature. 03:26:19 brinzhang: IMHO, it is better to have a patch series that fixes the open bandit issues and then make it a voting job. We need to merge other open patches for Ussuri. They will be affected if we make it a voting job too soon. Agreed? 03:26:24 shaohe_feng:I also prefer to use update method 03:26:45 update for firmware update? 03:27:04 do you means a new update API? 03:27:52 Sundar: we just enable the vote in zuul, but cannot merged, that cannot impact any open patches, I think 03:27:59 Sundar,brinzhang : Agreed. We should fix bandit failures first, then update bandit job as voting in zuul check. otherwise, if we change voting first, other patches will get Zuul -1. 03:28:19 let's welcome a new fresher haibin-huang 03:28:38 if we just fix that bandit failures, we cannot ensure it work fine after we enabled bandit vote in zuul 03:28:40 brinzhang: I will post all patches for bandit today. 03:28:44 she will try to help verify the program API in product evn. 03:29:17 s/she/he 03:29:28 I am very glad to join cyborg family. 03:29:29 Welcome haibin-huang . Would you like to introduce yourself? 03:29:38 sorry, he :] 03:29:48 welcome haibin-huang! :) 03:29:53 haibin-huang you can help to review this new API 03:29:58 welcome haibin-huang 03:30:03 welcome haibin again. since beijing hackson. 03:30:17 welcome haibin-huang 03:30:18 welcome haibin-huang23 03:30:26 welcome haibin-huang too 03:30:41 #link https://review.opendev.org/#/c/698190/ 03:30:59 Yumeng: ok, sound good, we shuold ensure the bandit run works fine, than star merge the open fix bandit patches 03:31:04 I am from Intel. I work on ONAP in past years. I focus on multicloud in ONAP project 03:31:22 thank you everyone 03:31:55 Good to know you, haibin-huang23 03:32:24 shaohe_feng, Sundar: the program API, if we use update instead of patch, is it ok? 03:33:51 brinzhang: The HTTP verbs are PUT, POST and PATCH. When you say update, which one do you mean? 03:34:22 shaohe_feng, Sundar: ... PUT, soory 03:34:33 IMO, PATCH is the one that is closest in semantics to the programming operation. We are also using it for programming during ARQ binding. 03:35:06 brinzhang you can read feilding's paper 03:35:42 try to find which is more accurate 03:36:33 I just think it's not very popular, if you all think that ok, I will be not concern this ^ 03:36:49 https://review.opendev.org/#/c/718584/ 03:36:55 PUT should be idempotent. PATCH need not be. The programming operation is not idempotent -- it can fail once and succeed the next time 03:37:26 but the patch seems is newer method, I'm not sure feilding have discuss the patch in his paper. 03:38:09 brinzhang: I think you pointed me to some Nova oprrations. But Nova developers themselves criticized my patches when I followed their precedent. :) 03:38:20 *operations with PUT 03:39:01 Sundar: I see that, and I have not looked into 03:39:11 pls see https://review.opendev.org/#/c/718584/ 03:40:46 and https://review.opendev.org/#/c/718584, that revert device and deployable when resource provider create fail 03:41:25 Does everybody think that the right approach is to delete/revert Cyborg objects when a Placement update fails? 03:41:30 Sundar: did you review this patch? 03:41:37 brinzhang why do you prefer PUT? 03:42:18 i am sorry my network is not good 03:42:45 songwenping: Sundar said Does everybody think that the right approach is to delete/revert Cyborg objects when a Placement update fails? 03:43:36 xinranwang, Yumeng, s_shogo, shaohe_feng, all: Re. https://review.opendev.org/#/c/718584, Does everybody think that the right approach is to delete/revert Cyborg objects when a Placement update fails? 03:44:43 IMO, It is reasonable. 03:45:16 Otherwise, the conductor will not notice the diff and will now report to placement correctly. 03:45:20 Let me think about it 03:45:46 which will cause inconsistency 03:46:03 Ok. if you all decide that the best approach is to revert Cyborg db updates, then why not do Placement update first and update Cyborg db only if that succeeds? 03:46:18 xinranwang:right 03:47:02 yes, there will be disaccord between placement and cyborg DB 03:47:48 Personally, I'd prefer to update Cyborg db always, and mark the objects that failed to sync with Placement. That way, operators will know that Cyborg did its job correctly and Placement failed. 03:48:03 Sundar our design is do placement update first. 03:48:28 we have discuss this issue in Dan smith's patch. 03:49:00 Sundar: I saw you comment, I think your suggestion will be compex, but it is also an alternative way 03:49:20 s/compex/complex 03:49:30 do Placement update first and update Cyborg db only if that succeeds is the easiest way to keep consistent 03:49:46 we have discuss it before. 03:50:26 Ok, I'll support your decision. Has it been tested fully in various scenarios, including Placement service down for some time and then comes up? 03:50:39 bet another things: if placement DB successfully, but cyborg DB failed 03:50:51 then what should cyborg do? 03:51:28 Sundar: please see chenke's patch here: https://review.opendev.org/#/c/711912/2 03:51:32 Sundar:no problem 03:51:41 shaohe_feng: that will be raise > 500, I rembered chenke done this 03:52:06 i have test this scenario 03:52:10 The commit message has explain the root reason of data imconsistency 03:53:02 so I prefer make sure cyborg DB successfully first. 03:53:48 agree we do the revert to handle the exception. and Let's add reconsidering placement report as a V release goal. whether we need to decouple placement report or not. 03:55:27 If placement succeed and cyborg DB failed. The next time cyborg conductor do diff, it finds a diff, writes db and update placement with same info. I think the placment will return like "resourve provider already exists" something like this, But it will not crash the service. What do you think? 03:55:40 sundar's comment: Personally, I'd prefer to update Cyborg db always, and mark the objects that failed to sync with Placement. That way, operators will know that Cyborg did its job correctly and Placement failed. 03:56:03 this way just reduce the date inconsistence's scenario, that cannot cover all scenarios. 03:56:39 actually. no one can cover all scenriaos. 03:56:43 I also discussed the Placement-Cyborg interaction in enable/disable API , with xinranwang. 03:57:03 we have issues(same to sudar's comments) this in Dan's patch, let us check, why not we give this? 03:57:08 let me check it. 03:57:51 another things: 03:58:31 chenke's patch has fix several secenarios already. But as we said, it is difficult to cover all sceanrios. 03:58:32 we need More accelerators drivers support 03:59:03 I think we can revert like this patch does, and discuss the placement and db update decoupling in V. 03:59:20 agree. 03:59:21 if we add a flag to mark the data, I think we also need to add a period task to check this flag, and then do resource consistency 03:59:49 Our philosophy should be that Cyborg db is the source of knowledge about accelerators in any OpenStack cloud. If we let Placement take precedence, that reduces the value of Cyborg. That's just my opinion. As brinzhang said, it may be complex to make those changes. 04:00:10 brinzhang: we already have a periodc task from the agent 04:01:35 we discuss it in: https://review.opendev.org/#/c/708726/ 04:02:02 Sundar agree. 04:02:07 cyborg db should firstly. 04:03:18 but I see brinzhang agree Dan comments. 04:03:47 yeah, I agree this way, but now we should do? 04:03:59 shaohe_feng: Dan Smith's suggestion may not apply well for Cyborg. For example, when an instance terminates, Placement resources get released. ARQs are unbound and deleted, but the unbinding may take some time if we do any device cleanup in the future. There is a period of time when the device is still not free (which Cyborg knows) but Placement 04:04:00 doesn't know. 04:04:22 maybe we can make this as a plan in V release 04:04:37 yes. This is a known and controversial issue 04:04:43 I agree with you. 04:04:47 songwenping's just make this as a bug fix 04:04:52 brinzhang and all: first, is it an important issue to fix in Ussuri? 04:04:57 and I have discuss with xinranwang for it. 04:05:28 we are all same options with you. 04:06:24 make cyborg db correctly firstly? 04:06:40 brinzhang agree? 04:07:29 or you can list some pros and cons to make the conclusion 04:07:49 The root reason is that now we coupling placement and db update, and we can not avoid this dependency. IMHO, we should discuss the decoupling in V. Now, we just make sure that there is less risk to have data inconsistency. 04:08:31 +1 04:08:32 I am not sure whether is good for me, when would like to fix this issue? shaohe_feng 04:09:28 xinranwang: agree +1 04:09:47 OK, let's xinranwang to make a good design firstly 04:09:59 xinranwang: agree +1 04:10:06 xinranwang: are you voting to merge the patch now, and revisit it later? 04:10:34 xinranwang's mean is to agree to merge this patch, then in V we should do Sundar suggestion, right? 04:10:40 maybe a new spec for it firstly, and we can discuss it base on the new spec. 04:11:26 I think we can merge it to let U have less risk of data inconsistency 04:12:13 all ok with that? 04:12:32 agree +1 04:12:53 shaohe_feng, s_shogo ^ 04:13:03 agree, +1 04:13:15 some advice inline . 04:13:25 https://review.opendev.org/#/c/718584/9/cyborg/conductor/manager.py 04:13:26 Yumeng, xinranwang: please add this to V release plan 04:13:27 I have thought about the decoupling a bit, there's lots of things to discuss,several gaps as I mentioned in last meeting. We should think more and discusss more to have a solution of decoupling. 04:13:32 I give up vote. 04:13:55 stay neutral 04:14:05 Ok, great. Please mention this in commit message of this patch. 04:14:36 if we merge it, please leave details note in the patch 04:14:47 xinranwang: can you leave comments in this patch? that songwenping can update that 04:14:57 such as FIXME or NOTE in it. 04:15:14 That TODO(), I think 04:15:17 Anything else, folks? We are over the time. 04:15:19 shaohe_feng: yes, understand, it is a hard decision, we have a long way to go lol... 04:15:30 brinzhang: sure 04:15:47 xinranwang: thanks 04:16:07 it's time to launch, end meetting? 04:16:13 OK. 04:16:28 more drivers are welcome this new release right? 04:16:36 Great. Thanks a lot, everybody! Great progress this cycle. We have another week to make even more contributions. Take care and stay safe! 04:16:39 #endmeeting