11:00:10 <oneswig> #startmeeting scientific_sig
11:00:11 <openstack> Meeting started Wed Feb 14 11:00:10 2018 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:12 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:14 <openstack> The meeting name has been set to 'scientific_sig'
11:00:21 <oneswig> hi all
11:00:30 <oneswig> #link Agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_February_14th_2018
11:00:33 <ttsiouts_> hi!
11:00:43 <belmoreira> o/
11:01:00 <oneswig> Got a few good things to cover today, let's get going!
11:01:06 <daveholland> morning
11:01:17 <oneswig> #topic CFPs and conferences
11:01:26 <b1airo> evening
11:01:36 <oneswig> I saw a couple.  Anyone else with one to announce - please do!
11:01:38 <oneswig> hi b1airo
11:01:41 <oneswig> #chair b1airo
11:01:42 <openstack> Current chairs: b1airo oneswig
11:01:56 <b1airo> a'lo oneswig
11:02:18 <oneswig> Tim Randles passed on details of the ScienceCloud workshop in Arizona, June 11th
11:02:33 <oneswig> #link ScienceCloud workshop https://sites.google.com/site/sciencecloudhpdc/
11:03:00 <oneswig> The HPCAC workshop is coming up in Lugano, 9-12 April
11:03:21 <oneswig> #link HPCAC conference https://www.cscs.ch/publications/press-releases/swiss-hpc-advisory-council-conference-2018-hpcxxl-user-group/
11:03:27 <mpasserini> welcome to Lugano!
11:03:31 <oneswig> Any others to report?
11:03:43 <zioproto> I will be in Lugano as well
11:03:44 <oneswig> Hi mpasserini - good to see you.  Ready for HPCAC?
11:03:58 <oneswig> excellent, I hope to return as well.
11:04:36 <oneswig> 1 question - does everyone also attend HPCXXL on the 4th day or is that restricted attendance?
11:04:37 <b1airo> this is all starting to sound like an excuse for why i need to go
11:04:57 <oneswig> get on it b1airo
11:05:35 <b1airo> isn't HPCXXL IBM specific?
11:05:35 <mpasserini> I'll join the XXL, I'm not sure about how the restriction works, I can ask
11:06:07 <b1airo> sounds like DellXL/DellHPC but for IBM/Lenovo
11:06:18 <oneswig> be interesting to know how the two events fit together.
11:06:35 <oneswig> b1airo: interest starting to fade...
11:06:46 <b1airo> :-)
11:07:29 <oneswig> b1airo: you can probably call in on your way home from Dell in Austin, if you're going that way!
11:08:06 <oneswig> OK, are there other events to announce? If not, let's move on
11:08:31 <oneswig> #topic Ironic for bare metal infrastructure management
11:08:56 <oneswig> OK - there's quite a bit going on here.
11:09:06 <oneswig> CERN team, would you like to describe your project?
11:09:26 * johnthetubaguy picks up his ear trumpet
11:09:31 <makowals> Hi all
11:09:39 <oneswig> Hi makowals, welcome
11:09:51 <makowals> For quite some time now we have been running Ironic for our users
11:10:07 <makowals> At the moment we have ~600 physical nodes in use
11:10:56 <makowals> They cover various use cases - we have "autonomous" users whose only request is to get physical resources, but we are also running Ironic for ourselves, which means we are deploying hypervisors using Ironic
11:11:10 <makowals> And then offer virtual machines in a standard way to the users
11:11:28 <oneswig> Is that a Bifrost service?
11:11:41 <makowals> Don't know what "bifrost" is, so I would guess no
11:12:00 <oneswig> Ah, it's basically Ironic run without OpenStack
11:12:18 <makowals> Then no, we have Ironic fully integrated with the rest of OpenStack
11:12:20 <johnthetubaguy> I guess the VM cloud is a totally separate cloud from the ironic one?
11:12:33 <priteau> Are you using TripleO?
11:12:38 <b1airo> or just different cells?
11:12:55 <makowals> We have only separate cells for the ironic nodes
11:13:05 <makowals> But it still runs as one cloud
11:13:15 <makowals> This design works well for us at this moment
11:13:25 <johnthetubaguy> are you doing anything to help users understand what images work where?
11:13:38 <zioproto> makowals: do you have a link to a blog post or to a documentation site where  who is interested can go into the details ?
11:14:17 <makowals> johnthetubaguy: Our current strategy with the images is to tell our users "you can use the same images as you are used to using with VMs, but now they also work with the physical machines"
11:14:24 <makowals> zioproto: No, no blog post yet
11:14:41 <oneswig> would be great to see something on openstack-in-production
11:14:59 <ildikov> oneswig: +1
11:15:00 <johnthetubaguy> makowals: ah, interesting, that works I guess
11:15:12 <johnthetubaguy> oneswig: +1
11:15:18 <makowals> Yes, the use cases I described work correctly, users are happy at the moment
11:15:22 <priteau> We do the same thing on Chameleon: we generate images with diskimage-builder that work both on KVM and bare-metal
11:15:36 <johnthetubaguy> priteau: GTK
11:15:40 <makowals> We also have one more use case in hand, which is containers-on-bare-metal. We don't expose it yet, but the work is in quite an advanced state
11:16:03 <b1airo> makowals: for your "undercloud" Ironic cells (those you use to provision your hypervisors), are you just relying on policy and cells scheduler restrictions to stop end-users accessing these?
11:16:09 <oneswig> makowals: You've got something like 20 models of server in the cloud, right?  Does that cause issues for creating bare metal images or are they comprehensive for all hardware?
11:16:48 <makowals> b1airo: We are using both cell and flavor separation to keep users from accessing the nodes they shouldn't touch
11:17:11 <makowals> oneswig: No, so far we did not have any issues regarding images compatibility between different hardware models
11:17:24 <johnthetubaguy> cells are quite nice for that (at least in the v1 world)
11:17:48 <makowals> However one big problem for now is the lack of software RAID support in Ironic, this is the biggest thing our users complain about now
11:18:07 <makowals> There is a RFE open for that though
11:18:07 <makowals> https://bugs.launchpad.net/ironic/+bug/1590749
11:18:09 <openstack> Launchpad bug 1590749 in Ironic "RFE: LVM/Software RAID support in ironic-python-agent" [Wishlist,Confirmed]
11:18:20 <b1airo> makowals: also curious about security - you must have quite a few users with various levels of admin permission. any issues managing that within this all-encompassing environment?
11:18:56 <makowals> b1airo: Do you mean "admin permissions" for the operators running the cloud, or user-side?
11:19:07 <johnthetubaguy> makowals: ah software raid rather than the existing clean steps that configure hardware raid? or different RAID config based on flavor?
11:19:27 <b1airo> makowals: just wait another couple of years and software raid will be effectively obsolete with everything running on NVMe
11:19:53 <makowals> johnthetubaguy: First step would be software raid instead of the current hardware raid setup, but as a next improvement it would be nice to be able to configure raid dynamically, according to the user's request
11:20:03 <b1airo> yes, admin permissions across a large set of operators
11:20:23 <makowals> b1airo: Our set of operators is quite small at this moment, so it's one team running all the Openstack services
11:20:23 <oneswig> makowals: we are very keen on that level of capability too!
11:20:42 <johnthetubaguy> makowals: I am glad mark and I are working on that with traits :)
11:20:52 <makowals> Regarding SW/HW raid, it's more political, there are constant fights between these two camps ;)
11:21:23 <johnthetubaguy> makowals: we should point you at the specs to make sure they work for your use case, sounds identical
11:21:32 <priteau> makowals: Sounds like more generally you need custom partitioning which would include LVM and SW RAID
11:21:43 <makowals> priteau: Yes, exactly this
11:21:51 <oneswig> makowals: There was also some work relating to automating the on-boarding of new hardware, right?
11:22:07 <makowals> oneswig: Ahhh yes, this is the other side of the project
11:22:26 <makowals> So at CERN we have another team responsible for the process of onboarding the hardware arriving on site
11:22:39 <makowals> They are using multiple in-house made tools and databases to handle this process
11:23:40 <makowals> Recently we started to work with them to possibly first -- integrate their tools into our deploy image, next -- move their "onboarding" scripts into Ironic's "inspection" phase, and last -- merge their database with Ironic's
11:24:17 <priteau> Nice. What does onboarding include in this context? Burn-in and hardware inventory?
11:24:18 <makowals> At this moment what we would like to see in Ironic is ability to add more states into the "node's lifecycle graph"
11:24:34 <makowals> priteau: Yes, burn-in/testing and getting the inventory
11:25:06 <oneswig> makowals: same instance of Ironic as the production environment, I guess?
11:25:13 <makowals> At this very moment we are trying to move their "stuff" into the "inspection" state in ironic
11:25:29 <makowals> oneswig: I did not get the last question, sorry
11:26:02 <oneswig> as in, you're using the same Ironic service for on-boarding new hardware as for managing bare metal deployments in the production cloud?
11:26:20 <makowals> Yes, that's the same Ironic deployment
11:26:44 <zioproto> what do you mean with "burn-in" ?
11:26:55 <oneswig> That's interesting to know, thanks
11:27:00 <makowals> zioproto: CPU, memory, disk and network tests
11:27:15 <makowals> Also running some benchmarks to check if the node's performance is as expected
11:27:27 <makowals> And to detect any discrepancies inside nodes from the same delivery
11:27:48 <oneswig> makowals: this is like an extended version of the hardware benchmarks you can enable in inspection, I guess?
11:27:58 <priteau> makowals: Adding more states to the lifecycle graph would be quite intrusive into the Ironic codebase. Have you considered adding tags to your nodes to describe the "sub-state" of the inspection?
11:28:46 <makowals> oneswig: Yes, but we don't want to run it every time we do "node-inspect", that's why we need a new state or something providing this functionality
11:29:03 <makowals> priteau: How would that work with tagging the node?
11:29:16 <johnthetubaguy> seems a bit like a clean step, would that work?
11:29:33 <johnthetubaguy> i.e. clean before it becomes available
11:29:46 <makowals> johnthetubaguy: Yes, but then we don't want it to be run every time the instance is deleted as our burn-in takes 1-2 weeks
11:29:58 <oneswig> makowals: I wonder if you could select by changing which deploy image is associated with the node?
11:30:07 <makowals> In fact we are getting into our next issue which is cleaning, repurposing and retirement
11:30:25 <priteau> If your deploy could query the Ironic API (might need to embed some credentials), it could do a `node show` to check which tags have been added to the node object. If there is e.g. a tag saying "ram:ok", don't run RAM tests
11:30:36 <makowals> oneswig: I don't think that would do the job, as it's more custom hardware driver describing what's done during the process
11:31:07 <makowals> priteau: Oh yes, something like this would work perfectly, it's just the issue with doing "ironic node" from the deploy image which is problematic in terms of credentials
11:31:36 <makowals> But thanks for the idea, this is definitely something to investigate more
11:31:46 <priteau> create a user with only read-only privileges via policy.json
11:32:39 <oneswig> makowals: what were the issues with cleaning?
11:32:42 <b1airo> you could even pass in a short-lived pre-baked token
11:32:55 <makowals> Moving forward, the issue we see with cleaning is similar. For the standard cleaning we only want to delete data from the disk to have it clean. But there are more complicated cases where for example the machine is going to be used by a different user and we need to make a thorough cleaning of the machine
11:32:58 <johnthetubaguy> the new keystone application credentials stuff should make that easier
11:33:19 <makowals> So something like different cleaning paths depending on what is going to happen with the machine afterwards
11:33:42 <oneswig> At cleaning time, who can predict the next user?
11:34:04 <b1airo> the egg oneswig
11:34:07 <makowals> Me as an operator, I request the user to delete his instances because I know I'm going to repurpose the machine
11:34:08 <priteau> Regarding cleaning, anyone ever used those disk encryption keys to make all data unreadable very quickly?
11:34:38 <makowals> At this moment we only play with erasing disk metadata and shreding
11:34:40 <johnthetubaguy> I think we are doing that on the SKA AlaSKA machine
11:34:56 <oneswig> priteau: I believe we do - I've just come across blkdiscard for the same result (on SSD)
11:35:19 <johnthetubaguy> I think its already built into the clean steps in Ironic
11:35:42 <makowals> Yes, it's just matter of configuring it to use exactly what you need, but the implementation is done
11:35:55 <b1airo> should call a 2 minute warning here for us to move on to the next topic...
11:36:07 <makowals> Ok
11:36:27 <makowals> So for our use case the current cleaning could stay as it is, but we would appreciate something to be able to execute a "clean better" procedure
11:36:28 <b1airo> i'm curious how cells v2 might shake things up (read: break the world) here?
11:36:29 <johnthetubaguy> ATA secure erase was the phrase I couldn't remember
11:36:59 <johnthetubaguy> the scheduling is global in the v2 world, i.e. placement has all hosts in its DB
11:37:00 <oneswig> makowals: an effective erase is instant on SSDs - and secure erase works for hdds - is there a need for anything quicker?
11:37:07 <makowals> I think we don't foresee bigger problems with going to cells v2, correct me belmoreira if I'm wrong
11:37:20 <belmoreira> b1airo: scheduling will be more challenging
11:37:25 <makowals> oneswig: Secure erase has to be supported by the disk
11:37:36 <oneswig> ah, ok, got it
11:37:51 <makowals> That's our problem with having such a big variety in the hardware arriving
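The erase options discussed above (blkdiscard on SSDs, ATA secure erase where the disk supports it, overwrite as a fallback) could be selected along these lines. A hedged sketch only: the hdparm invocation in particular is illustrative and real deployments must handle frozen drives and master passwords.

```python
def erase_command(device, is_ssd, supports_secure_erase):
    """Pick an erase command for a device during node cleaning.

    Preference order mirrors the discussion: discard is effectively
    instant on SSDs, ATA secure erase covers capable HDDs, and a
    single-pass overwrite is the fallback when neither applies.
    """
    if is_ssd:
        return ["blkdiscard", device]
    if supports_secure_erase:
        # Illustrative hdparm form; a real clean step must first set a
        # security password and check the drive is not frozen.
        return ["hdparm", "--user-master", "u",
                "--security-erase", "NULL", device]
    return ["shred", "-n", "1", device]
```

Ironic's built-in clean steps already implement this logic, as noted above; the sketch just shows the decision shape.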
11:37:56 <b1airo> i'm sure you are up to the challenge belmoreira!
11:38:15 <oneswig> we had a related discussion - came by on the openstack-dev list - what do people do in Ironic for managing firmware updates?
11:38:54 <oneswig> our policy has been to check during cleaning and fail cleaning if the firmware version does not match an expected version string.  Then manually upgrade from maintenance mode.
11:39:01 <makowals> On our side, we did not start to implement anything in this area yet, but we are aware this problem should be solved
11:39:35 <makowals> At the same time we also want to focus on regenerating ipmi/bmc credentials
11:39:37 <oneswig> priteau: what happens on chameleon?
11:40:17 <priteau> oneswig: on Chameleon we have a separate API and client which can check that various hardware details (including BIOS firmware version) match what's expected. But it's not integrated in the ironic workflow.
11:41:36 <oneswig> #link this is the widget for Dell BIOS upgrade checks in node cleaning https://github.com/stackhpc/stackhpc-ipa-hardware-managers
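The check-and-fail policy oneswig describes can be sketched as a minimal clean-step helper. The version string and function shape here are hypothetical; the linked stackhpc hardware managers repo is the real implementation.

```python
EXPECTED_BIOS = "2.4.3"  # hypothetical expected firmware version string

def check_firmware(reported_version, expected=EXPECTED_BIOS):
    """Fail cleaning (by raising) when the node's reported BIOS version
    does not match the expected string, so the node is left for a
    manual firmware upgrade from maintenance mode."""
    if reported_version != expected:
        raise RuntimeError("BIOS version %s != expected %s; failing clean step"
                           % (reported_version, expected))
    return True
```

In a real ironic-python-agent hardware manager this would run as a clean step, with the raised error marking the node clean-failed.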
11:42:13 <priteau> oneswig: Thanks! It looks like we could use this :)
11:42:21 <oneswig> Share and enjoy :-)
11:42:28 <oneswig> Let's move on - final thoughts on this?
11:42:55 <oneswig> #topic SWITCH - Kubernetes-as-a-Service
11:43:02 <oneswig> Hi zioproto - ready?
11:43:03 <zioproto> hello all
11:43:04 <zioproto> yes
11:43:13 <b1airo> zioproto: how much for that? :-)
11:43:27 <zioproto> so at SWITCH we are working on Kubernetes, to understand how far we can go with K8s as a service
11:43:43 <zioproto> we are running k8s on openstack instances
11:43:47 <zioproto> we deploy with this ansible playbook
11:43:52 <zioproto> #link https://github.com/zioproto/k8s-on-openstack
11:44:02 <oneswig> kargo, or your own?
11:44:03 <zioproto> the playbook will create the instances for you
11:44:11 <b1airo> watch the stars on that repo go up
11:44:16 <zioproto> oneswig: forked from an existing project of a company called infraly
11:44:21 <zioproto> see fork history on github
11:44:35 <zioproto> it uses kubeadm in the code
11:44:52 <zioproto> for the users
11:45:05 <zioproto> the idea is to give them just the ~/.kube/config file
11:45:10 <zioproto> The key idea is to reuse the existing openstack username and password to login into K8s.
11:45:11 <johnthetubaguy> is there a quick way to compare this to Magnum creating the k8s clusters?
11:45:11 <belmoreira> zioproto: is this a shared cluster for all the users?
11:45:52 <zioproto> johnthetubaguy: Magnum has a dependency on Heat; because we are always running several releases behind, we decided not to use that
11:46:05 <zioproto> belmoreira: it is possible, hold on let me explain the rest
11:46:12 <johnthetubaguy> zioproto: cool, makes sense
11:46:19 <zioproto> thanks to Dims that developed this code
11:46:27 <zioproto> #link https://github.com/dims/k8s-keystone-aut
11:46:41 <zioproto> it is possible to delegate the authentication and the authorization to keystone
11:46:49 <zioproto> but we use only the authentication part
11:47:04 <oneswig> #link I think it was https://github.com/dims/k8s-keystone-auth
11:47:24 <zioproto> so keystone just tells k8s, that I am really that user, and that my token is really scoped to a certain group
11:47:34 <zioproto> I can create a RBAC rule in k8s
11:47:41 <zioproto> where kind: Group is a keystone project
11:47:52 <zioproto> so I can use the same cluster for all the users
11:47:59 <zioproto> where every user see just a namespace
11:48:09 <zioproto> or all users of a keystone project see 1 namespace
11:48:10 <zioproto> BUT
11:48:13 <zioproto> there is a big BUT
11:48:19 <zioproto> depending on the network solution you have
11:48:27 <zioproto> you have to implement the multi-tenancy yourself at the network level
11:48:35 <zioproto> at the moment in our solution all pods are on the same network
11:48:42 <zioproto> so users are confined with actions to 1 namespace
11:48:54 <zioproto> but I can for example connect over the network to a pod of a different namespace
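The `kind: Group` mapping zioproto describes might look roughly like the RoleBinding below, built here as a Python dict for illustration. Names are assumptions: it presumes the keystone webhook presents the project name as the k8s group, and grants the stock `edit` ClusterRole within one namespace.

```python
def project_rolebinding(keystone_project, namespace):
    """Build a RoleBinding manifest giving all members of a Keystone
    project edit rights in a single namespace (illustrative names)."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": keystone_project + "-edit",
                     "namespace": namespace},
        "subjects": [{"kind": "Group",
                      "name": keystone_project,
                      "apiGroup": "rbac.authorization.k8s.io"}],
        "roleRef": {"kind": "ClusterRole",
                    "name": "edit",
                    "apiGroup": "rbac.authorization.k8s.io"},
    }
```

Serialized to YAML and applied per tenant, this is one way a shared cluster can confine each keystone project to its own namespace.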
11:49:01 <zioproto> at the moment we fixed the IPv6 support
11:49:13 <zioproto> #link https://github.com/kubernetes/kubernetes/pull/59749
11:49:21 <zioproto> we are planning to deploy our Horizon
11:49:27 <zioproto> on top of this Kubernetes
11:49:39 <zioproto> and expose horizon on IPv4 and IPv6 using the nginx-ingress-controller
11:49:50 <zioproto> that runs as a docker container with host networking on the controller
11:49:55 <zioproto> so can expose both IPv4 and IPv6
11:50:02 <zioproto> the cluster internally is IPv4
11:50:06 <zioproto> and we use the Neutron integration
11:50:11 <zioproto> so each pod is assigned a subnet
11:50:17 <zioproto> and the k8s master talks to neutron
11:50:23 <zioproto> and injects routes into the neutron router
11:50:31 <zioproto> so that each pod subnet is routed to the right VM
11:50:34 <zioproto> questions so far ?
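The route-injection step zioproto describes can be sketched as building the Neutron extraroute list (the `destination`/`nexthop` field names follow Neutron's router `routes` attribute; the subnet-to-VM mapping is an assumed input from the k8s side):

```python
def pod_subnet_routes(assignments):
    """Build the Neutron router 'routes' list that sends each pod
    subnet CIDR to the VM hosting those pods."""
    return [{"destination": cidr, "nexthop": vm_ip}
            for cidr, vm_ip in assignments.items()]
```

The k8s master would then push this list onto the tenant router (e.g. via a router update call), so pod traffic is routed natively by Neutron rather than overlaid.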
11:50:53 <oneswig> What are SWITCH users doing with it?
11:50:59 <zioproto> #link https://cloudblog.switch.ch/2017/11/15/deploy-kubernetes-v1-8-3-on-openstack-with-native-neutron-networking/
11:51:18 <zioproto> oneswig: not yet released to users. At the moment we are running our staging Horizon deployment
11:51:32 <zioproto> as soon as the IPv6 patch I linked is merged in k8s, we go to production with horizon
11:51:41 <zioproto> then we will be ready to offer this to the users
11:51:42 <b1airo> so dogfood first?
11:51:48 <zioproto> b1airo: right
11:52:35 <zioproto> we still have a problem capturing the Horizon python logs
11:52:39 <zioproto> because
11:52:41 <zioproto> our docker container
11:52:47 <zioproto> is based on Apache + WSGI
11:52:56 <zioproto> Docker intercepts STDERR and STDOUT of Apache
11:53:03 <armstrong> Why do you need docker containers on kubernetes?
11:53:04 <zioproto> but WSGI is yet another process in the container
11:53:25 <zioproto> armstrong: I mean, a pod is collection of docker containers
11:53:50 <zioproto> we wrote the Dockerfile for our Horizon deployment
11:54:14 <belmoreira> zioproto: so it's you (operators) managing the cluster size for all users
11:54:15 <b1airo> all sounds like great work that lots of others will want to replicate
11:54:18 <zioproto> armstrong: we used Kubernetes with Docker containers, I am not aware of other container platforms supported by k8s
11:54:31 <zioproto> belmoreira: right
11:54:42 <zioproto> armstrong: maybe I did not understand the question ?
11:54:43 <belmoreira> zioproto: In this case how do you charge?
11:54:48 <armstrong> @zioproto Ok I see
11:54:58 <zioproto> when you do
11:55:03 <zioproto> kubectl logs podname
11:55:12 <zioproto> it is basically a wrapper around docker logs containername
11:55:12 <b1airo> interesting that we have heard about two distinct OpenStack use-cases this evening where the control-plane is merging into the production cloud
11:55:46 <zioproto> to conclude, I hope we go to production in the next weeks with the Horizon deployment
11:55:47 <oneswig> b1airo: good point.  The stratification blurs!
11:55:54 <zioproto> we will run Horizon Pike on top of our Openstack Newton cluster
11:56:30 <b1airo> on that note, do other deployments have a notion of tiers of OpenStack services?
11:56:31 <zioproto> once we learn more about the platform, we will offer it first to other teams inside SWITCH, and then to our cloud users
11:56:53 <oneswig> Did you consider running Heat from Pike and Magnum as an alternative?
11:57:26 <zioproto> oneswig: no, and I can explain why
11:57:51 <zioproto> oneswig: we prefer keeping the OpenStack footprint small, as few projects as possible, because upgrades are difficult
11:57:56 <b1airo> i had the same question next too
11:58:03 <zioproto> oneswig: we want a solution independent from an OpenStack upgrade
11:58:21 <zioproto> when I upgrade openstack, I dont want to upgrade magnum too
11:58:51 <b1airo> fair point zioproto, though there are no longer many upgrade dependencies across services
11:59:14 <oneswig> Final comments, the hour is upon us
11:59:15 <johnthetubaguy> stable rest APIs has helped with the inter service stuff
11:59:31 <belmoreira> zioproto: do you have a way to enforce quotas to users?
11:59:42 <zioproto> belmoreira: I dont know yet
11:59:53 <zioproto> I guess it is possible if you can enforce quota to namespaces
12:00:02 <zioproto> in k8s namespaces I mean
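The per-namespace enforcement zioproto guesses at does exist in k8s as ResourceQuota. A minimal manifest, built as a dict for illustration (the quota name and resource keys here are one common choice, not the only one):

```python
def namespace_quota(namespace, cpu, memory):
    """Build a k8s ResourceQuota manifest capping a tenant namespace --
    one way to map OpenStack-style quotas onto shared-cluster users."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "tenant-quota", "namespace": namespace},
        "spec": {"hard": {"requests.cpu": cpu,
                          "requests.memory": memory}},
    }
```

Applied alongside the namespace itself, this lets the operator bound each keystone project's share of the shared cluster.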
12:00:24 <zioproto> join on slack #sig-openstack to get updates on this
12:00:37 <zioproto> I think it is slack.k8s.io
12:00:41 <zioproto> I would have to check the URL
12:00:51 <b1airo> practical example of mixed version cloud: https://trello.com/b/9fkuT1eU/nectar-openstack-versions
12:01:09 <oneswig> We are out of time, thanks everyone
12:01:24 <zioproto> thank you !
12:01:24 <armstrong> Thanks @oneswig
12:01:29 <makowals> Thanks, cheers
12:01:35 <priteau> Thanks all!
12:01:37 <armstrong> Thanks @zioproto
12:01:38 <belmoreira> thanks
12:01:38 <b1airo> thanks all! especially makowals and zioproto
12:01:41 <oneswig> #endmeeting