#openstack-cyborg log

03:01:17 <Li_Liu> #startmeeting openstack-cyborg
03:01:18 <openstack> Meeting started Wed Mar 13 03:01:17 2019 UTC and is due to finish in 60 minutes.  The chair is Li_Liu. Information about MeetBot at http://wiki.debian.org/MeetBot.
03:01:19 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
03:01:21 <openstack> The meeting name has been set to 'openstack_cyborg'
03:01:24 <Li_Liu> Let's get started
03:01:30 <Li_Liu> #topic Roll Call
03:01:36 <Li_Liu> #info Li_Liu
03:01:48 <Coco_gao> #info Coco_gao
03:01:54 <xinranwang> #info xinranwang
03:02:18 <Li_Liu> are sundar and zhenghao here yet?
03:02:40 <Li_Liu> #topic Code Freeze Status Update
03:03:17 <Li_Liu> https://review.openstack.org/#/q/status:open%20project:openstack/cyborg
03:03:21 <Coco_gao> I will update my patch according to the comments these two days.
03:03:38 <Li_Liu> Coco_gao, thanks
03:04:08 <Li_Liu> Why is this one not merged yet? https://review.openstack.org/#/c/574075/
03:04:18 <Li_Liu> strange ><||
03:04:52 <Coco_gao> That depends on my patch
03:05:03 <Coco_gao> because my patch is not merged yet.
03:05:13 <zhipeng> Zuul not started
03:05:21 <Li_Liu> I see
03:06:33 <Li_Liu> By the hard deadline of code freeze, please add UTs to the features you own
03:06:53 <wangzhh> hi all
03:06:58 <Li_Liu> Hi wangzhh
03:07:07 <wangzhh> Sorry for being late.
03:08:26 <Coco_gao> I will add my UTs.
03:08:30 <Sundar> #info Sundar
03:08:38 <Li_Liu> Hi Sundar
03:08:43 <Sundar> Sorry for the delay
03:08:47 <Sundar> Hi Li_Liu
03:08:48 <Coco_gao> Hi Sundar
03:08:58 <Sundar> Hi Coco_gao and all
03:09:44 <Li_Liu> I think we are in good shape so far
03:10:30 <zhipeng> Any luck getting the Xilinx driver, lol?
03:10:40 <Li_Liu> no updates tho
03:11:10 <Li_Liu> I can follow up with Chuck later
03:11:42 <Li_Liu> but prob not gonna make the deadline
03:12:43 <Li_Liu> zhipeng, we can still refine our docs after the deadline right?
03:13:23 <zhipeng> Need to do it before the RC
03:14:16 <Sundar> What is needed for docs?
03:14:38 <Li_Liu> RC1 is Mar 18 - Mar 22
03:15:54 <Li_Liu> https://docs.openstack.org/cyborg/latest/#developer-documentation
03:16:14 <Li_Liu> yumeng already added quite a lot of stuff there
03:16:25 <Sundar> Li_Liu: We have feedback that we should improve our API docs. I can document the current v1 API, if nobody else volunteers.
03:16:31 <Li_Liu> we need to keep improving it
03:16:41 <Li_Liu> Sundar, sure thanks a lot
03:17:20 <Li_Liu> I will take some time to work on the python-client
03:17:41 <Li_Liu> at least make it align with our docs and APIs
03:18:30 <Coco_gao> Hi Sundar, still one thing about driver ovo. Is deployable name unique? why is that?
03:18:35 <Coco_gao> Thanks a lot.
03:19:08 <Sundar> Coco_gao: I think so, because it will be used as resource provider name in Placement, and that must be unique AFAIK
03:19:43 <Li_Liu> How are we going to guarantee the uniqueness of the deployable name?
03:20:05 <Li_Liu> are we doing the check when the resource is reported?
03:20:25 <Coco_gao> we can't if we set the name field in the drivers.
03:20:33 <Sundar> Coco_gao: I don't see explicit documentation that it must be unique. I will check and get back.
03:21:31 <Sundar> Coco_gao and all: why can't Cyborg agent construct the name from other fields like vendor, type, etc., and add a unique id?
03:21:55 <Sundar> E.g. 'INTEL_FPGA_PAC_CARD_ID1'
03:22:08 <Li_Liu> is this ID1 a uuid?
03:22:26 <Sundar> Li_Liu: I was thinking a simple integer
03:22:38 <Sundar> Oh, wait
03:22:50 <Sundar> There is a convention for naming nested RPs
03:22:59 <Sundar> It is based on compute node name
03:23:07 <Sundar> I will check and send email.
03:23:22 <xinranwang> now the deployable name is the filename in /sys/class/fpga, it's unique
03:24:01 <Sundar> xinranwang: It is unique within a compute node
03:24:10 <Sundar> The same name can repeat across nodes
03:24:16 <Coco_gao> Sundar, I agree we'd better do that in agent.
03:25:01 <Sundar> We are not reporting anything to Placement yet, right?
03:25:04 <Coco_gao> xinranwang, that's the problem: across nodes, the name may be the same, right?
03:25:07 <Li_Liu> how about when we report the deployable to placement API, we concatenate name+uuid
03:25:53 <Sundar> Li_Liu: good idea. I'll get back with the name convention for nested RPs
03:26:16 <wangzhh> xinranwang, what if different nodes have the same device? Is it unique?
03:26:33 <xinranwang> if we support NRP, we can identify which host the deployable is located on, so should it be ok to have the same deployable name on different compute nodes?
03:27:13 <Coco_gao> xinranwang, that will be ok, i think.
03:27:33 <shaohe_feng_> the fpga device name is generated by the kernel.
03:27:37 <wangzhh> xinranwang, Not really. Now it is globally unique.
03:27:43 <shaohe_feng_> the name is unique
03:28:33 <shaohe_feng_> it does not matter if different nodes have the same device
03:28:40 <Coco_gao> the reason why we need to keep the name unique from the aspect of driver ovo is that we need to identify the deployable. But driver ovos are compared within the same node, so the name only needs to be unique within one host.
03:29:06 <shaohe_feng_> Coco_gao: yes.
03:29:33 <wangzhh> Coco, so we should change the db, it is globally unique now.
03:29:33 <shaohe_feng_> for device@host is unique
03:29:39 <xinranwang> so i think it's ok to have the same deployable name on different compute nodes, on the placement side. But the name should be unique on the same compute node.
03:30:17 <shaohe_feng_> the name is not used to identify a device
03:30:26 <Coco_gao> wangzhh, Sundar and all, maybe we need to change the db constraints on the deployable table, name field.
03:30:46 <Coco_gao> do you agree if I modify that?
03:31:03 <Li_Liu> what constraint?
03:31:07 <shaohe_feng_> just a prompt for humans
03:31:21 <Coco_gao> the name field is unique in deployable table.
03:31:33 <Li_Liu> ah, ok
03:31:42 <Li_Liu> go ahead
03:31:50 <Li_Liu> no problem on my side
03:31:55 <shaohe_feng_> I agree
03:31:58 <Coco_gao> OK, thank you are for the advice.
03:32:05 <Coco_gao> all
03:32:07 <wangzhh> Of course. But how to handle device like gpu, <device_name>_<address>?
03:32:24 <Sundar> Coco_gao: I think it is ok to make it unique because: there  is some proposed convention to name nested RPs like '<hostname>_<numaNode>_<x>' and x must be unique within a node anyway for us.
03:32:25 <shaohe_feng_> just keep id/uuid unique. it is machine readable.
03:33:13 <shaohe_feng_> unique in a node is ok.
03:33:18 <wangzhh> shaohe, when the driver reports a device, it does not have a uuid.
03:33:31 <shaohe_feng_> no need to be global
03:33:49 <shaohe_feng_> wangzhh: agent gen one for it.  :)
03:34:05 <shaohe_feng_> bus is also unique.
03:34:17 <shaohe_feng_> bus is also machine  readable.
03:34:30 <Coco_gao> Sundar, the problem is how to generate x to make sure same card is using the same x when reporting.
03:34:56 <wangzhh> shaohe, agent will generate the uuid every time?
03:35:08 <shaohe_feng_> wangzhh: no. just once.
03:35:16 <shaohe_feng_> wangzhh: it needs to check the bus.
03:35:35 <shaohe_feng_> wangzhh: on a node, the bus is used for machine reading.
03:35:51 <shaohe_feng_> on a cluster, the uuid is used for machine reading
03:36:06 <Sundar> There may not be a PCI bdf in all hypervisors.
03:36:09 <wangzhh> shaohe, I suppose you mean to generate it the first time.
03:36:23 <shaohe_feng_> Coco_gao: the x can be generated from the bus.
03:36:33 <shaohe_feng_> Coco_gao: let me show you an example
03:36:37 <shaohe_feng_> wangzhh: yes.
03:36:44 <Coco_gao> thanks shaohe
03:36:51 <Li_Liu> if there's no bdf, can we use uuid?
03:37:35 <shaohe_feng_> Li_Liu:  there's another identification without bdf
03:37:36 <shaohe_feng_> for
03:37:38 <wangzhh> But agent doesn't know which time it is.
03:37:51 <wangzhh> shaohe_feng_
03:38:13 <shaohe_feng_> wangzhh: it needs to check. if the bus is not in the db, then it is the first time.
03:38:14 <Li_Liu> shaohe_feng_, sure that also works
03:38:38 <shaohe_feng_> seems mdev has a uuid.
03:38:48 <shaohe_feng_> and usb has its own bus.
03:38:53 <Sundar> The driver should report a unique id within the node for each device. It could be PCI bdf for libvirt or whatever is unique for PowerVM and others
03:38:54 <wangzhh> If so, agent should query db first. do something like diff?
03:39:08 <Sundar> Then that could be the x factor
03:39:37 <Sundar> wangzhh: No, agent should not query db. For 2 reasons: scaling, upgrades can change db schema
03:39:46 <shaohe_feng_> wangzhh: yes. wen agent starts, it should sync with the db first
03:39:47 <wangzhh> +1
03:39:50 <shaohe_feng_> when
03:40:31 <wangzhh> shaohe_feng_ agent doesn't query db now.
03:40:33 <shaohe_feng_> Sundar: no, it should sync when it starts, and can keep the info in cache.
03:40:57 <Sundar> Agent should not keep state. Even if it reads db at startup, it cannot assume that it will remain in sync, because operator can update config
03:41:25 <Sundar> No cache, please. We will hit all kinds of issues with stale caches, aging, etc.
03:41:28 <Li_Liu> shaohe_feng_, is the cache only containing the information related to the node?
03:41:39 <shaohe_feng_> yes.
03:41:41 <wangzhh> Agree with sundar at this part. :)
03:41:49 <shaohe_feng_> its own node info.
03:42:12 <shaohe_feng_> let me show you what I do.
03:42:20 <Li_Liu> Sundar, I think it should be ok if it only holds its own information in cache
03:42:53 <wangzhh> shaohe_feng_ haha  talk is cheap, show me your code. :)
03:42:54 <Sundar> Li_Liu: The operator may want to disable or enable specific devices, or do other config.
03:43:21 <shaohe_feng_> wangzhh: yes, I will show you the code
03:43:28 <Coco_gao> before diff, the agent should get  the old driver ovo, is that from db or cache?
03:43:30 <wangzhh> Cool.
03:43:34 <shaohe_feng_> wangzhh: I have implemented it.
03:43:43 <Sundar> Li_Liu: Then we have to propagate such changes to each agent, ensure that it has received it, etc. The agent doesn't need any state for discovery -- just add a unique field that driver reports.
03:43:53 <shaohe_feng_> I report to placement with device_name@host, which is unique
03:44:07 <shaohe_feng_> and I just put the device_name in the cyborg db
03:44:27 <shaohe_feng_> it works well, without any conflict,
03:44:28 <Coco_gao> I agree with shaohe.
03:44:51 <shaohe_feng_> placement uses device_name@host as the index.
03:45:02 <Sundar> Coco_gao: Again, there are some conventions proposed for nested RP names. I am still trying to find the spec/doc where I saw that.
03:45:06 <shaohe_feng_> but cyborg does not use device_name as the index.
03:46:54 <shaohe_feng_> Sundar: those are 2 things, but if you want to keep them the same, it is OK.
03:47:50 <Sundar> shaohe_feng: There's no point in making them different. The only reason why we have a deployable name is to report to placement
03:47:51 <shaohe_feng_> the big problem is not this.
03:48:20 <Li_Liu> Sundar, please help to find out the conventions. shaohe_feng_, could you share your code with us?
03:48:37 <shaohe_feng_> the big problem is enumeration.
03:48:41 <xinranwang> maybe keep deployable name unique on same compute, and add hostname like "@host"  when report to placement.
03:48:45 <Li_Liu> It seems we need some further discussion on this issue, we can discuss it in tomorrow's zoom sync
03:48:55 <xinranwang> i believe that's what shaohe_feng_  did.
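
The naming scheme discussed above (deployable name unique per compute node, with "@host" appended only when reporting to Placement) could be sketched roughly as follows. This is a minimal illustration, not Cyborg code: the function names, host names, and device names are hypothetical.

```python
from collections import Counter


def check_unique_per_host(device_names):
    """Raise ValueError if a deployable name repeats on one compute node."""
    duplicates = sorted(n for n, c in Counter(device_names).items() if c > 1)
    if duplicates:
        raise ValueError("duplicate deployable names on host: %s" % duplicates)


def placement_rp_name(device_name, host):
    """Build the name reported to Placement as device_name@host."""
    return "%s@%s" % (device_name, host)


# Hypothetical inventory: the same kernel-generated name appears on two hosts.
inventory = {
    "compute-1": ["intel-fpga-dev.0", "intel-fpga-dev.1"],
    "compute-2": ["intel-fpga-dev.0"],
}

rp_names = []
for host, names in sorted(inventory.items()):
    check_unique_per_host(names)  # uniqueness is only enforced per host
    rp_names.extend(placement_rp_name(n, host) for n in names)

# The bare names collide across hosts, but the reported names do not.
assert len(rp_names) == len(set(rp_names))
```

Under this assumption the Cyborg db would only need a per-host uniqueness constraint on the name, while the name reported to Placement stays unique across the cluster.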
03:49:37 <shaohe_feng_> Li_Liu: if the host restarts after some change on it, the enumeration may change the bus of the same device.
03:50:05 <Li_Liu> shaohe_feng_, yea, I know
03:50:15 <shaohe_feng_> I mean the cloud provider may resize the hardware on the host
03:50:24 <shaohe_feng_> so that's what we really need to care about
03:51:08 <shaohe_feng_> after all, the machine needs the bus to identify a device, not the name.
03:51:56 <shaohe_feng_> Li_Liu: yes. maybe we care about the same thing.
03:52:32 <Li_Liu> shaohe_feng_, driver should do the mapping from bus to device name/id I believe
03:53:01 <shaohe_feng_> Li_Liu: yes, that's what we need to improve.
03:54:40 <Li_Liu> ok
03:55:12 <Sundar> @all: Please look at https://git.openstack.org/cgit/openstack/nova-specs/tree/specs/stein/approved/numa-topology-with-rps.rst?h=refs/changes/24/552924/14#n163
03:55:28 <Coco_gao> shaohe_feng, that will be ok if the name changes: the conductor will delete the old device with name1 and add the new device to the db with name2. But actually, the db then exactly matches the real situation.
03:55:43 <Li_Liu> Since it's pretty late for me here. Let's close it up and discuss more detail in tomorrow's sync up
03:56:05 <shaohe_feng_> Li_Liu: OK.
03:56:33 <Coco_gao> but that name change for one device is not supposed to be frequent.
03:56:50 <shaohe_feng_> Coco_gao: yes. not frequently.
03:57:17 <shaohe_feng_> seldom resize the baremetal
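
The delete-and-re-add behaviour Coco_gao describes for a renamed device amounts to a set difference between the names stored for a host and the names the driver currently reports. A minimal sketch, assuming names are compared per host; `diff_deployables` and the device names are hypothetical and not taken from the Cyborg conductor:

```python
def diff_deployables(stored_names, reported_names):
    """Return (to_add, to_delete) for one host, comparing by name."""
    stored, reported = set(stored_names), set(reported_names)
    to_add = sorted(reported - stored)      # new or renamed devices
    to_delete = sorted(stored - reported)   # vanished or old names
    return to_add, to_delete


# Hypothetical case: after re-enumeration, "dev.1" shows up as "dev.2".
db_names = ["dev.0", "dev.1"]
driver_report = ["dev.0", "dev.2"]

to_add, to_delete = diff_deployables(db_names, driver_report)

# Applying the diff leaves the stored names equal to the reported ones.
synced = sorted((set(db_names) - set(to_delete)) | set(to_add))
assert synced == sorted(driver_report)
```

Because the comparison is by name only, a renamed device is treated as one removal plus one addition, which is acceptable as long as renames stay rare.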
03:59:41 <Li_Liu> Alright, let's call the meeting for today. Have a good night/day wherever you are
03:59:45 <Li_Liu> #endmeeting