11:00:30 <oneswig> #startmeeting scientific-wg
11:00:31 <openstack> Meeting started Wed Jul 5 11:00:30 2017 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:32 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:34 <openstack> The meeting name has been set to 'scientific_wg'
11:00:46 <oneswig> #link Agenda for today is here https://wiki.openstack.org/wiki/Scientific_working_group#IRC_Meeting_July_5th_2017
11:01:03 <oneswig> ... Who noticed the time change?
11:01:24 <b1airo> evening
11:01:27 <oneswig> #chair b1airo
11:01:28 <openstack> Current chairs: b1airo oneswig
11:01:34 <oneswig> Hi b1airo
11:01:42 <noggin143> Hi, Tim joining
11:01:51 <priteau> Hello!
11:01:56 <oneswig> Hi noggin143 priteau
11:02:01 <oneswig> well remembered :-)
11:02:25 <oneswig> We've got a shortish agenda for today
11:02:37 <oneswig> #topic Supercomputing BoF
11:02:51 <verdurin> Good afternoon.
11:03:00 <oneswig> First item was that the submission deadline is approaching for BoFs at SC
11:03:13 <oneswig> Which is in Denver the week after the Sydney summit
11:03:24 <oneswig> b1airo: do you have the deadline to hand?
11:04:06 <oneswig> At SC2016 we did a panel session which pretty much filled a big room
11:05:28 <oneswig> The question to discuss is, is there appetite for putting something on for SC again?
11:05:32 <b1airo> end of July off the top of my head...
11:05:50 <verdurin> oneswig: was there something at ISC?
11:06:01 <oneswig> verdurin: nothing official AFAIK
11:06:10 <oneswig> My colleague John attended, I did not.
11:06:25 <b1airo> yep 31st July
11:06:44 <b1airo> i would like to do something at ISC next year
11:07:11 <oneswig> b1airo: me too
11:07:38 <oneswig> One action on SC - I think we should check with hogepodge to see if he's returning.
11:08:06 <oneswig> There's also the possibility of a second edition of the book we wrote, which was very popular at the conference
11:08:10 <b1airo> yes good idea
11:08:50 <oneswig> #action oneswig to follow up on book and foundation attendance for SC
11:08:59 <b1airo> so who, if anyone, is planning to attend SC17 ?
11:09:23 <oneswig> I might - it's on the way home from Sydney for me
11:09:30 <b1airo> hahaha
11:09:37 <oneswig> Less so for you b1airo :-)
11:09:58 <b1airo> at least the Summit will be short, so i can go home in between
11:10:36 <oneswig> re: openstack summit - I had a look at submitting papers, are the categories all changed from last time?
11:11:11 <oneswig> b1airo: I think most WG members at SC16 were from the US time zone, we should follow up with this next week
11:11:26 <b1airo> i haven't looked yet to be honest, don't really think i've got enough new material to warrant speaking
11:11:43 <verdurin> Can't say yet whether I'll be at SC.
11:11:50 <oneswig> wondered if the new format meant fewer categories
11:12:08 <b1airo> the TPC was interesting
11:12:16 <oneswig> TPC?
11:12:27 <b1airo> technical programme committee
11:12:33 <b1airo> for SC17 I mean
11:13:00 <oneswig> submissions already selected for presentations?
11:13:44 <b1airo> mostly
11:13:44 <priteau> I won't be at SC. I will ask Kate if she'll be there, she generally attends.
11:14:05 <oneswig> priteau: she's involved in running it isn't she?
11:14:12 <b1airo> there was not a huge amount of competition from great papers to be frank
11:14:34 <oneswig> b1airo: what was the category?
11:14:37 <b1airo> only a handful accepted for the clouds and distributed computing stream
11:15:11 <b1airo> i'm still working on shepherding one of them, but that's turning out ok now
11:15:15 <priteau> oneswig: she's chair of the Clouds & Distributed Computing committee
11:16:01 <priteau> So it's very likely she'll be there
11:16:10 <b1airo> yeah i expect kate would be at the conference itself given she is chair, but i suppose you never know - she did say her travel diary was pretty crazy
11:16:27 <oneswig> b1airo: after seeing a few talks last time I assumed the bar was pretty high, perhaps it's worth having a go next time round
11:16:47 <b1airo> oneswig, yes i think so
11:17:07 <oneswig> noted for 2018 :-)
11:17:18 <b1airo> probably a practice and experience paper
11:17:51 <b1airo> might be just the thing to get that Maldives meeting lined up ;-)
11:18:25 <oneswig> Oh, if it'll get us to the Maldives I'll do it!!
11:19:06 <b1airo> noggin143, what is happening at CERN? i see luminosity is surpassing expectations since the restart
11:19:21 <noggin143> yup, running great
11:19:37 <noggin143> we’re getting some more tape in stock
11:19:59 <noggin143> also just got ironic working
11:20:15 <oneswig> OK so to sum up on this, I think there's scope for some activity (including possibly a BoF) but we should follow up with the Americas session next week.
11:20:31 <oneswig> noggin143: in a cell or a separate deploy?
11:20:44 <b1airo> cool! i have been wanting to ask you guys about that as i thought i'd heard you had it going with cellsV1 and was interested to know if there were any gremlins there
11:20:47 <noggin143> in a cell, in the production cloud
11:20:59 <noggin143> we’re also just starting testing cells v2 with upstream
11:22:01 <b1airo> is there a blog or some such to follow, or maybe we can get belmiro along to talk about it..?
11:22:39 <noggin143> I’m sure belmiro will share the details, although he’s a bit busy with the new arrival at home
11:22:41 <b1airo> oh, belmoreira, are you here already?
11:22:44 <oneswig> We recently saw some oddity relating to the timing of Ironic registering hypervisors and populating a cell
11:23:31 <b1airo> did he get a new gaming rig...? :-P
11:23:58 <oneswig> b1airo: too good! :-)
11:24:15 <oneswig> OK - want to talk THP?
11:24:28 <oneswig> #topic b1airo's issues with THP
11:25:03 <b1airo> i don't have a great deal of detail on this yet sorry but maybe someone will just know what our problem is
11:25:06 <b1airo> here's hoping!
11:25:26 <oneswig> What are the symptoms?
11:25:45 <b1airo> basically we have noticed on some of our virtualised HPC nodes that certain jobs sometimes run very slowly
11:26:17 <b1airo> the guests are CentOS7 backed by hugepages (either static or THP) on Ubuntu hypervisors (4.4 kernel)
11:26:34 <b1airo> the THP issues are inside the guest itself
11:27:05 <b1airo> the host THP usage stays steady after the guest is started
11:27:12 <verdurin> b1airo: only with CentOS 7 guests, so a wide disparity in guest-host kernels?
11:28:12 <b1airo> so far we only have data for CentOS7 guests as we don't yet have a repeatable trigger and so have not been able to narrow things down or look at kernel bisections etc
11:28:39 <b1airo> in the guests, when performance goes bad we see very high system time
11:29:32 <b1airo> and when khugepaged runs periodically it uses noticeably more CPU than normal
11:29:43 <oneswig> Is there some kind of steady-state activity that can provoke this?
<oneswig> What I'm getting at is stochastic kernel profiling using perf - http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
11:30:18 <b1airo> this seems to be related to THP defrag / memory compaction, as disabling defrag makes the high system time disappear
11:30:52 <b1airo> oneswig, perf is a good suggestion
11:31:12 <oneswig> only problem is, you need to provoke the problem continuously for (say) a minute
11:31:17 <oneswig> to get a sample
11:31:37 <b1airo> as we're suspecting some kind of memory fragmentation issue we've started collecting into influx (with grafana bolted on) vmstat and buddyinfo data
11:32:00 <b1airo> for all compute guests
11:32:35 <b1airo> we also see a high rate of compact_fail in the THP counters
11:33:18 <oneswig> What are the causes of failure - any thoughts?
11:33:39 <b1airo> it seems like it could be some interaction with pagecache, as when we drop_caches the dramatic increase in compact_fail stops and the hugepage count goes up again
11:34:16 <b1airo> so far we haven't seen any obvious job/application patterns in the slurm logs
11:34:54 <b1airo> but i note that THP only applies to anonymous memory, so currently linux does not hugepage-ify pagecache
11:36:22 <b1airo> the one thing we do have is an application that runs dramatically faster with hugepage backing in the guest (as well as the hypervisor), so we do have a test-case to detect problems that should finish in 1min if working properly
11:36:37 <oneswig> Does it count as a failure if there's insufficient contiguous physical memory to create a hugepage?
11:37:18 <b1airo> oneswig, that's a great question - i have not had a lot of luck finding the murky details of the compaction algorithm
11:37:59 <b1airo> there is an LWN article from about 2010 that talks about it generally but it is not in relation to THP
11:38:11 <oneswig> Hopefully, failure is nothing more than that - no room at the inn
11:38:33 <oneswig> and might explain why evicting caches indirectly helps
11:38:52 <b1airo> in the default CentOS7 (and maybe RHEL7) guest setup /sys/kernel/mm/transparent_hugepage/defrag is enabled
11:39:53 <b1airo> which means that the vm (as in linux virtual memory management, not virtual machine) will perform defrag on-demand to allocate a hugepage in response to a page fault
11:40:54 <b1airo> i think that is why we see the system time drop off when 'echo never > /sys/kernel/mm/transparent_hugepage/defrag', as we stop slowing down page faults and just fall back to using 4k pages
11:41:04 <oneswig> I wonder if a paravirtualised guest sees some data passed through on memory mapping, similar to numa awareness
11:41:32 <b1airo> hmm no idea
11:42:05 <oneswig> I'm trying to picture these interactions between the page mapping of host and guest and not doing very well...
11:42:38 <b1airo> i'm still wondering if there is not a much simpler way to do guest to physical memory mappings than the 2-dimensional method used today
11:42:51 <oneswig> I think what you've got here is a top-drawer presentation subject for Sydney - if you get to the bottom of it
11:43:46 <oneswig> Keep us updated?
11:44:08 <b1airo> if you can assume a guest's memory is mlocked (reasonable for high-performance use-cases) then it seems like you can reduce the guest page lookup to an offset
11:44:54 <b1airo> but i'm sure that's a stupidly naive idea and if i find the right mailing list someone will suitably shoot me down :-)
11:45:14 <b1airo> oneswig, fingers crossed!
11:45:58 <oneswig> Should we move on?
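[Editorial sketch] b1airo describes watching the compact_fail and thp_* counters and feeding vmstat and buddyinfo data into InfluxDB (with Grafana on top) for all compute guests. The Python snippet below is a minimal illustration of that kind of collector, not the WG's actual tooling: the measurement names (thp_vmstat, buddyinfo), the 30-second interval, and emitting InfluxDB line protocol on stdout (to be shipped by something like telegraf's exec input or POSTed to InfluxDB's /write endpoint) are all assumptions made for this sketch.

    #!/usr/bin/env python
    # Sketch only: sample THP/compaction counters from /proc/vmstat and
    # per-order free-block counts from /proc/buddyinfo, and print InfluxDB
    # line protocol. Measurement/field names are illustrative assumptions.
    import socket
    import time


    def read_vmstat():
        """Return the thp_* and compact_* counters from /proc/vmstat."""
        counters = {}
        with open('/proc/vmstat') as f:
            for line in f:
                key, value = line.split()
                if key.startswith('thp_') or key.startswith('compact_'):
                    counters[key] = int(value)
        return counters


    def read_buddyinfo():
        """Yield (node, zone, order, free_blocks) from /proc/buddyinfo."""
        with open('/proc/buddyinfo') as f:
            for line in f:
                parts = line.split()
                node, zone = parts[1].rstrip(','), parts[3]
                for order, count in enumerate(parts[4:]):
                    yield node, zone, order, int(count)


    def main():
        host = socket.gethostname()
        while True:
            ts = int(time.time() * 1e9)  # line protocol timestamps are in ns
            fields = ','.join('%s=%di' % kv for kv in sorted(read_vmstat().items()))
            print('thp_vmstat,host=%s %s %d' % (host, fields, ts))
            for node, zone, order, count in read_buddyinfo():
                print('buddyinfo,host=%s,node=%s,zone=%s,order=%d free=%di %d'
                      % (host, node, zone, order, count, ts))
            time.sleep(30)  # assumed sample interval; tune to taste


    if __name__ == '__main__':
        main()

Plotting the higher-order buddyinfo counts (order 9 corresponds to 2 MB blocks on x86_64) alongside compact_fail should help answer oneswig's question above, i.e. whether the failures coincide with exhaustion of hugepage-sized contiguous free memory.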
11:46:04 <b1airo> we'll be trying to build some nice graphs to visualise it more and hopefully i can share something pictorial next week
11:46:10 <b1airo> yep
11:46:16 <oneswig> Thanks b1airo, be really interesting to see what you find
11:46:31 <oneswig> #topic scientific application catalogue
11:46:52 <oneswig> OK, I had an action to investigate Murano on TripleO
11:46:59 <oneswig> verdurin: you did too - any luck?
11:47:24 <oneswig> I got my TripleO deploy to the point where I can add Murano, that's all I managed alas
11:48:11 <oneswig> I did however have discussions with several others who were also interested in this kind of service
11:49:19 <oneswig> verdurin: you there Adam?
11:49:32 <noggin143> we’re looking at Murano this summer. Given that the OpenStack application catalog at openstack.org is being retired, it would be interesting to share some scientific apps
11:49:57 <b1airo> is it, totally missed that - how come?
11:50:55 <noggin143> it was competing with other repositories like bitnami and dockerhub
11:51:15 <oneswig> noggin143: is the CERN investigation being done in a way that we can participate in?
11:51:52 <noggin143> it’s a summer student project for the next few weeks and they will do a writeup once it is done.
11:52:25 <verdurin> Back now oneswig
11:52:28 <oneswig> Sounds useful, let's keep it on the agenda.
11:52:39 <oneswig> Hi verdurin - any luck with you?
11:53:10 <verdurin> oneswig: not yet. Hopefully very soon.
11:53:26 <oneswig> me too... it's a three horse race :-)
11:53:59 <oneswig> noggin143: is there an intern assigned for this work?
11:54:08 <oneswig> we could start a thread
11:54:16 <b1airo> noggin143, so was the reason to do with those other places having larger contributor communities and so the openstack catalog looking bad because it is so sparse ?
11:54:58 <noggin143> and because it was viewed as a competitive solution. Better to integrate with other communities than have our own
11:55:40 <priteau> The mailing list thread about the app catalog is http://lists.openstack.org/pipermail/openstack-dev/2017-March/113362.html
11:55:42 <b1airo> is there an alternative that would allow us to share murano apps for example?
11:55:50 <b1airo> thanks priteau
11:56:29 <noggin143> I think we’re probably best off just setting up a github repo and then syncing to it
11:56:30 <b1airo> it predates my recent decision to try and read more of openstack-dev again
11:56:50 <oneswig> b1airo: the NSF guys had a Murano catalog in draft form
11:57:12 <oneswig> and were sharing packagings on that. Don't think it went mainstream but we should pursue that.
11:57:35 <noggin143> NSF work sounds interesting.
11:58:10 <oneswig> #link presentation here https://www.openstack.org/videos/barcelona-2016/image-is-everything-dynamic-hpc-vm-repositories-using-murano
11:59:09 <b1airo> Murano on Nectar is getting a steady flow of contributions now, i haven't actually looked at our index in a while but i see the "new murano package" tickets come in for review, maybe one every fortnight at the moment
11:59:19 <oneswig> Anything more - AOB?
11:59:47 <b1airo> there is a relevant ML thread on rebranding WGs to SIGs
11:59:47 <oneswig> b1airo: sounds like it has enough critical mass on Nectar to be self-sustaining
12:00:14 <b1airo> we would certainly love to share with other science clouds
12:00:30 <verdurin> love to hear more about it, b1airo
12:00:52 <oneswig> out of time... thanks everyone
12:01:13 <b1airo> cheers all, goodnight!
12:01:20 <verdurin> Bye
12:01:24 <oneswig> #endmeeting