13:01:49 <baoli> #startmeeting PCI Passthrough
13:01:50 <openstack> Meeting started Thu Jan 16 13:01:49 2014 UTC and is due to finish in 60 minutes. The chair is baoli. Information about MeetBot at http://wiki.debian.org/MeetBot.
13:01:51 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
13:01:53 <openstack> The meeting name has been set to 'pci_passthrough'
13:03:01 <heyongli> hi
13:03:07 <irenab> hi
13:03:10 <ijw> yo
13:03:40 <BrianB_> hi
13:04:53 <ijw> Had a long conversation with baoli offline yesterday - I think the summary was that the proposal I put forward is more complicated than a group-based proposal, but we should do it if we're happy that nothing simpler will do the job and if the network issue is not a problem (for which irenab proposed a solution on the ML). Are we in agreement?
13:05:33 <heyongli> one question, about baoli's concern:
13:05:54 <ijw> (Fwiw the answer is obviously the solution that's no more complicated than it absolutely has to be, as far as I'm concerned.)
13:05:58 <heyongli> does the group-as-tag approach solve baoli's live migration problem?
13:06:29 <ijw> No, it doesn't work with his networking method of attaching devices - and in fact it can't work if a device can be in two flavors.
13:06:51 <heyongli> I don't see why
13:07:32 <ijw> Because you make the networks in advance, and I'm not sure you could put a PCI device in two networks simultaneously. Also, you're offloading the choice of the device that's mapped to libvirt if you do it that way.
13:07:32 <heyongli> if you can still attach a tag to the device, say a group, why not?
13:08:02 <ijw> irenab's proposal would work fine, though, as far as I can see (I would like baoli's opinion on that though) because it treats PCI devices individually.
13:08:43 <heyongli> I hadn't seen irenab's proposal when I left the office, sorry
13:09:23 <irenab> heyongli: I suggested renaming the device to a logical name based on the neutron port UUID
13:09:25 <ijw> It renames the ethernet device to the same name as is in the libvirt.xml, so you can attach the migrated VM to the 'same' device while still getting to choose which device that is.
13:09:33 <baoli> Irena, can you put together a complete description of how it works? My guess is that you would name the interface with the same name on both the target node and the originating node.
13:09:48 <ijw> That's how I understood it
13:09:58 <irenab> baoli: yes
13:10:28 <ijw> Quite a cool idea, actually ;)
13:10:34 <irenab> baoli: it would rename the device once it is to be used as a vNIC, not in advance
13:10:59 <baoli> Irena, then we need to do a thorough study on that, because it involves nova coordinating it
13:11:25 <baoli> Have you tried it yourself?
13:11:36 <heyongli> yeah, I'm not sure I understand the problem and the solution correctly
13:11:43 <irenab> baoli: I think we can take it up later as an advanced case, since we were OK with no live migration for now :-)
13:12:12 <irenab> baoli: yes, tried that
13:12:14 <baoli> Irena, I'm afraid that we do need migration. But I agree that we can put it aside for now
13:12:27 <ijw> The issue I would see is whether we have the opportunity to do the rename before the migration begins.
13:12:41 <irenab> baoli: I can share the details of the trial as a follow-up email
13:12:53 <baoli> Irena, that sounds great
13:13:01 <baoli> and thanks for that
13:13:12 <irenab> baoli: no problem
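
(For reference, a rough sketch of the renaming idea discussed above. The interface name, the placeholder UUID, and the scheme for deriving the logical name from the neutron port UUID are illustrative assumptions, not part of any agreed design.)

    # Hypothetical example: the chosen netdev is enp4s0f1 and the instance's
    # neutron port UUID is known; derive a short, stable logical name from it.
    PORT_UUID=3a7c91d2-0000-0000-0000-000000000000   # placeholder value
    LOGICAL_NAME=pci${PORT_UUID:0:8}                 # e.g. pci3a7c91d2 (assumed scheme)
    ip link set dev enp4s0f1 down
    ip link set dev enp4s0f1 name "${LOGICAL_NAME}"
    ip link set dev "${LOGICAL_NAME}" up
    # If the libvirt XML on both source and destination hosts refers to
    # ${LOGICAL_NAME}, the migrated VM attaches to the 'same' device name
    # even though the underlying PCI device differs per host.
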
13:14:15 <ijw> So are we all in agreement that we should do it this way? If we need more in the way of documentation I can add to the spec
13:15:32 <irenab> ijw: I inserted the comments, but still didn't check the answers, sorry
13:15:38 <baoli> ijw, let's go over a practical deployment case where only a pci flavor can do the job
13:15:46 <heyongli> I would like to see the live migration problem and possible solution in a doc or mail thread.
13:15:52 <baoli> And then let's try to map it to SRIOV
13:15:58 <ijw> irenab: I did most of them this morning, there's a couple I answered rather than fixing them
13:16:18 <ijw> heyongli: good point, stick a comment on there so I don't forget
13:16:23 <irenab> ijw: thanks, will go over them, may add more
13:16:23 <baoli> Sorry that I haven't got a chance to look at your comments
13:16:37 <baoli> Irena, do you want to go over your comments?
13:16:56 <ijw> They were largely 'needs more text', so for those I've just done as instructed and resolved the comment.
13:17:15 <irenab> baoli: I prefer to go over the deployment case as you suggested
13:17:18 <ijw> The only one that's a bit awkward is explaining in more detail where the changes would go and which processes they would affect.
13:17:31 <ijw> I've done what I can but it's hard to structure.
13:17:31 <baoli> Irena, agreed
13:17:49 <irenab> ijw: I think by covering this we can see if all items have owners and then do the coding
13:17:59 <ijw> Yep, I agree
13:18:07 <heyongli> agree
13:18:22 <ijw> We'll come back to this, let's do the example
13:18:46 <ijw> baoli: I think the point you had issue with - changing a flavor after initially setting it up - is what drives most of the complexity in the solution.
13:19:12 <baoli> ijw, please address my request directly
13:19:29 <ijw> baoli: that being the case, a worked example wants to include that
13:19:34 <irenab> ijw: can we start with the case where no change is required, just initial setup?
13:20:36 <ijw> Let's start with irenab's because those are the easy cases. The ones I would work through are the current case (selection by device/vendor) and the groups case (selection by backend marking). We'll cover provider networks in a few minutes too because that's got open issues
13:21:46 <ijw> So, the current case assumes we set up 'flavors' - except they're not objects in their own right at the moment, they're matching expressions on the nova flavor - by expressing a device/vendor match
13:22:11 <ijw> In the proposal, we would use pci_information to do what the whitelist does now, and we wouldn't add any extra_attrs
13:22:48 <ijw> Where now we have extra_specs on the nova flavor with a set of matching expressions, we would instead have the administrator create a flavor with that matching expression and a friendly name, and then use the flavor name and device count in a nova flavor.
13:23:24 <ijw> The PCI attributes set for selection would be device and vendor ID.
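
(For comparison, roughly how the current 'matching expressions on the nova flavor' mechanism is expressed today, using the existing whitelist/alias options; the values are illustrative and the exact option syntax may vary by release.)

    # nova.conf on the compute node - expose matching devices for passthrough:
    pci_passthrough_whitelist = {"vendor_id":"8086","product_id":"0001"}
    # nova.conf on the controller - give the match expression a friendly name:
    pci_alias = {"vendor_id":"8086","product_id":"0001","name":"gpu"}

    # Request one such device through the nova flavor's extra_specs:
    nova flavor-key m1.gpu set "pci_passthrough:alias"="gpu:1"
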
13:23:54 <baoli> ijw, an example expressed with those things would be better than words, I would suggest
13:24:31 <ijw> OK, so say all our compute nodes contain 1 GPU with vendor:device 8086:0001
13:24:32 <irenab> ijw: is it an SR-IOV case?
13:24:48 <ijw> No, this one's not going to be yet, and we'll come to networking (which is the main SR-IOV driver) in a moment
13:26:29 <ijw> So on the compute nodes, pci_information would be { { device_id: "0x8086", vendor_id: "0x0001" }, {} }
13:26:40 <ijw> (yes that format is awful, make suggestions on the document)
13:27:09 <ijw> I forget the name of the config item for grouping, but let's say pci_attrs=device_id, vendor_id
13:27:21 <ijw> Sorry, I got the device/vendor the wrong way around, but you see the idea
13:27:46 <ijw> So, the compute node would ask the scheduler process for the groups to send and would be told 'device_id,vendor_id' from that config item
13:28:12 <ijw> The compute node would get the pci_information and run it over the PCI device tree to find matching devices and get a list of those devices
13:28:44 <ijw> The compute node would make buckets by device and vendor and file the devices into buckets (which I think it does now, but not on a variable list of attrs)
13:29:20 <ijw> The compute node would report pci_stats objects back to the scheduler - one stat: device_id=0x0001, vendor_id=0x8086, free count=1
13:29:26 <ijw> That would come in from every compute node.
13:29:29 <ijw> Good so far?
13:29:48 <heyongli> pci_attrs=device_id, vendor_id should be global, right?
13:30:26 <baoli> I think that you forgot the compute node provisioning and stats report, based on what you have described so far
13:30:40 <ijw> Yes - but I'm concerned that if we put it into both control and compute config then a cocked-up configuration would result in a really screwed-up system. The suggestion above that we tell the compute node what it is makes for better consistency.
13:30:51 <ijw> baoli: the pci_devices table, you mean?
13:31:33 <ijw> Yes, as things stand it would be pushing the PCI device list out to the conductor, just as it does now - I think that code remains the same as far as this case is concerned
13:32:02 <ijw> So at this point we have a live cloud on which we can run things.
13:32:25 <ijw> The administrator sets up a 'gpu' flavor with a matching expression { device_id -> 1, vendor_id -> 8086 }
13:32:32 <ijw> PCI flavor, even
13:33:02 <ijw> Then he sets up a nova flavor saying 'gpu:1' in the extra_specs for PCI device requirements. At this point end users can run machines with direct mapping.
13:33:21 <ijw> Finally, the user attempts to start a machine with that flavor
13:33:53 <ijw> The scheduler as it is, PCI filter excluded, attempts to locate a list of machines and filters the list of compute nodes down to a subset of the whole list satisfying other requirements (CPU, RAM etc.)
13:34:17 <ijw> Then the PCI filter kicks in. And this is the horrible bit.
13:34:37 <ijw> We take the device filter from the required PCI flavor and find all the buckets that satisfy the request
13:34:46 <ijw> For each machine, we attempt to find a bucket with available devices
13:34:58 <baoli> ijw, can you define the buckets and how they are populated?
13:35:14 <ijw> baoli: these 'buckets' are the pci_stats rows fed up from the compute node.
13:35:19 <ijw> In the first part of the explanation.
13:35:29 <heyongli> the stats pool, it's called now
13:35:35 <ijw> If we find one, then we can schedule on that machine. If we can't, the machine is dropped from the available list.
13:36:00 <ijw> Then, schedule filtering done, we have a preference list of machines (we haven't affected the preference order) and we make our scheduling attempts as usual.
13:36:20 <ijw> Finally, when we schedule the machine, the nova-compute process is RPCed with the instance object.
13:36:44 <ijw> The instance object has the pci_stats row that we plan on using. That contains the device and vendor attributes.
13:37:08 <ijw> In its own memory, it has the equivalent record, and it decreases the available count by 1
13:37:27 <ijw> And it picks a device to allocate.
13:37:42 <ijw> We spawn the machine, with the PCI device handed to the spawn command.
13:37:59 <ijw> During spawn (with libvirt) we add XML to direct-map the device into memory.
13:38:02 <ijw> And voila.
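
(Pulling the case-1 walkthrough together: a sketch of the configuration in the draft syntax used in this discussion. The option names, the JSON-ish pci_information format, the PCI-flavor CLI and the extra_specs key are all proposals still under review - the CLI call in particular is a hypothetical placeholder, not a settled interface.)

    # Compute node nova.conf (draft syntax): whitelist the GPU, no extra attrs
    pci_information = { { "vendor_id": "0x8086", "product_id": "0x0001" }, {} }
    # Attributes used to bucket devices into pci_stats (option name not final)
    pci_attrs = vendor_id, product_id

    # Admin defines a PCI flavor with a friendly name and a match expression
    # (hypothetical CLI - the spec only says such an operation exists):
    nova pci-flavor-create gpu --match vendor_id=0x8086,product_id=0x0001
    # ...and requests one 'gpu' device from a nova flavor's extra_specs
    # (exact key name still to be defined in the spec):
    nova flavor-key m1.gpu set "pci_flavor"="gpu:1"
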
13:38:52 <ijw> From here, we should look at the differing things we might be trying to do. Case 2 we'll call 'grouping'
13:39:08 <ijw> In this, pci_attrs is 'group' and we're adding a 'group' attr at the pci_information stage
13:39:48 <ijw> So, pci_information={ { device_id: 1, vendor_id: 0x8086}, { group => 'gpu' } }
13:40:13 <baoli> what is device_id?
13:40:25 <ijw> The device ID of the whitelisted PCI device
13:40:49 <ijw> Do I mean device ID?
13:41:13 <ijw> Yes, that's the right term
13:41:21 <baoli> but we don't have a device_id
13:41:36 <heyongli> product_id, I think it is
13:41:57 <heyongli> the standard pci property
13:42:03 <ijw> OK, sorry - I'm looking at a (non-Openstack) page which calls it device ID, my apologies
13:42:28 <ijw> So, the compute host gets 'group' for attrs when it requests them
13:43:06 <ijw> And makes one PCI stat per 'group' value, rather than per unique (vendor_id, product_id) tuple, and reports that to the scheduler in this case
13:43:32 <ijw> The PCI flavor is going to be name 'gpu', match expression { group: 'gpu' }
13:43:39 <ijw> Scheduling is as before
13:44:16 <ijw> Sorry, it will be 'e.group': 'gpu' per the spec about marking extra info when we use it
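
(Again a sketch in the same draft syntax: the only change from case 1 is that the whitelist entry carries a 'group' tag in its extra-info half, and devices are bucketed by that tag instead of by vendor/product. The CLI remains hypothetical.)

    # Compute node nova.conf (draft): tag the matched devices with a group
    pci_information = { { "vendor_id": "0x8086", "product_id": "0x0001" }, { "group": "gpu" } }
    # Bucket pci_stats by the group tag only
    pci_attrs = group

    # PCI flavor matches on the extra-info attribute, 'e.'-prefixed per the spec
    # (hypothetical CLI as above):
    nova pci-flavor-create gpu --match e.group=gpu
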
13:44:34 <ijw> Difficult cases.
13:44:58 <ijw> Scheduling is harder when PCI flavors are vague and match multiple pci_stats rows, and particularly when they overlap
13:45:47 <ijw> So, if I have two flavors, one of which is vendor: 8086, device: 1 or 2, and one of which is vendor: 8086, device: 1, then you have more problems, and the worst case is when you use both flavors in a single instance
13:46:12 <ijw> So, say I have the above two flavors, 'vague' and 'specific' for the sake of argument, and I want to start an instance with one device from each
13:46:18 <irenab> ijw: can we go over the networking case taking into account phy_net connectivity?
13:46:30 <ijw> OK, in a sec, I'll finish this case first
13:46:52 <irenab> ijw: afraid we'll run out of time
13:46:52 <ijw> When I try to schedule, let's assume I have one product_id of each type
13:47:40 <ijw> Then there's a case where I can't use device '1' for 'vague' because I would not be able to allocate anything for 'specific'. The problem is hard, but it's not insoluble - you just have to try multiple combinations until you succeed.
13:47:47 <ijw> physical networking cases:
13:48:21 <ijw> Important here is that I have two network devices with different physical connections, so I need to be able to distinguish them, and I can't do that by vendor/product
13:48:33 <irenab> ijw: yes
13:48:52 <irenab> ijw: assume neutron ML2 please
13:49:03 <irenab> physical networks/segments
13:49:10 <ijw> Right, and here we get into the grey area
13:49:37 <ijw> I'm not sure anything we've proposed to date solves the problem where a networking device can connect to some neutron networks but not to others.
13:50:01 <irenab> ijw: just assume provider_network is enough
13:50:26 <irenab> all neutron networks on a certain provider_network can use the same devices
13:51:17 <ijw> irenab: the only solution to that problem that we have for provider networks at present is to ignore Neutron and label the devices on the provider networks with an extra_attr corresponding to the provider network. Then you make a flavor per provider network and explicitly use the flavor. But that doesn't work well with --nic arguments.
13:51:33 <ijw> Or macvtap, for that matter.
13:52:04 <baoli> ijw, we proposed to map a pci group to a provider network. That solves the issue elegantly.
13:52:24 <irenab> ijw: so going forward with the proposal you have, we still do not have a solution for networking?
13:52:50 <ijw> baoli: elegantly-ish. If you do it that way then you would say --nic pci-flavor=provider_network_flavor,net-id=xxx, and you can't easily check that the flavor and the network match, so you can get yourself into trouble.
13:53:20 <ijw> I think this is fixable by revisiting the code after we get the other cases working; it seems like we can do something in just the Neutron plugin to deal with the checks
13:53:26 <irenab> ijw: agree, but it will do the job ...
13:53:31 <baoli> ijw, I don't want to go back to those arguments
13:53:48 <ijw> baoli: no, that's fine, just highlighting the outstanding issue. I don't want to solve it now.
13:53:59 <ijw> Networking in general - where the network device is programmable
13:54:04 <baoli> ijw, we can also associate a PCI group with a network if that would cover all the use cases
13:54:24 <ijw> baoli: yes, that's about it, it's just the issue of making it usable from the bit of code in Neutron.
13:54:55 <ijw> We have a flavor containing SRIOV network devices, defined in pci_information as (presumably) something like { pf: ... device path ... }
13:55:21 <ijw> * I don't think we have enough detail here so yes, this is vague. Criticisms on the spec document please
13:56:00 <ijw> However we set it up, we end up with a PCI flavor that contains network devices we intend to use for neutron, which I shall call 'pcinet'
13:56:20 <irenab> ijw: I am confused, what is the approach with the neutron programmable device?
13:56:35 <ijw> That's what I'm covering now
13:56:43 <ijw> So we have a PCI flavor 'pcinet'
13:57:20 <irenab> ijw: how is it defined?
13:57:32 <ijw> which 'it'?
13:57:34 <irenab> product, vendor id + ?
13:57:53 <irenab> it = PCI flavor
13:58:18 <irenab> we have 2 mins....
13:58:38 <irenab> Can we cover neutron requirements in Monday's meeting?
13:58:38 <ijw> If it were me, I would probably tag it up on the backend - but wouldn't just vendor + product be enough if you wanted?
13:58:48 <ijw> On nova boot, we specify (my memory fails me here, correct the arguments as we go) nova boot --nic vnet-type=macvtap|passthrough,net-id=mynet,pci-flavor=pcinet ...
13:59:36 <ijw> PCI device selection happens as usual (the NIC device gets added to the requirements before scheduling), but we need an extra bit of information to go to the compute node to point out that the PCI device here has been allocated in response to --nic for interface 1
13:59:50 <irenab> ijw: what do you mean by 'tag it up on the backend'?
13:59:59 <ijw> extra_info
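
(A sketch of the networking case as far as it was described here: a 'pcinet' PCI flavor over SR-IOV devices plus the proposed --nic syntax. The 'pf'/device-path whitelist form is explicitly still vague in the spec, the boot arguments were quoted from memory, and every concrete value below - the PF address, network, image and flavor names - is an illustrative placeholder.)

    # Compute node nova.conf (draft): expose the VFs behind one PF and tag them
    # for neutron use; a provider-network label could be encoded in the tag too
    pci_information = { { "pf": "0000:06:00.0" }, { "group": "pcinet" } }
    pci_attrs = group

    # PCI flavor for neutron-attached devices (hypothetical CLI as above):
    nova pci-flavor-create pcinet --match e.group=pcinet

    # Proposed boot syntax - argument names are provisional:
    nova boot --flavor m1.small --image my-image \
        --nic vnet-type=macvtap,net-id=<mynet-uuid>,pci-flavor=pcinet sriov-vm
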
14:00:29 <baoli> guys, time's up
14:00:42 <ijw> Apologies, apparently I can't quite type fast enough
14:00:56 <ijw> irenab: can we take this up on the list, I'm not sure I'm getting your concern.
14:01:06 <baoli> I can't imagine that we have a proposal that doesn't take care of most of our concerns
14:01:11 <irenab> baoli: all: can we talk about neutron requirements next meeting?
14:01:21 <irenab> ijw: sure
14:01:24 <heyongli> sure, irena
14:01:29 <baoli> I believe it's solvable, but I just don't get it
14:01:32 <baoli> Irena, sure
14:01:36 <baoli> #endmeeting