11:00:13 #startmeeting scientific_wg
11:00:13 Meeting started Wed Aug 2 11:00:13 2017 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:16 The meeting name has been set to 'scientific_wg'
11:00:31 o/
11:00:38 o/
11:00:40 o/
11:01:09 \o
11:02:28 Hello all :)
11:03:11 Hello \o/
11:03:11 #link agenda for today is https://wiki.openstack.org/wiki/Scientific_working_group#IRC_Meeting_August_2nd_2017
11:03:11 greetings all! Who said that nothing happens in Europe in August :-)
11:03:11 Blair is in transit currently and may / may not attend. If he makes it, he mentioned he had an update on his THP fragmentation issues
11:03:12 Martial, you there?
11:03:12 OK, let's get going
11:03:13 aha
11:03:14 #chair martial_
11:03:14 good morning Martial
11:03:15 #topic Workload tracing on Chameleon
11:03:15 Current chairs: martial_ oneswig
11:03:38 BTW I am on a train, am likely to go through some long dark tunnels ... apologies in advance
11:03:49 priteau: you have the floor
11:03:52 Thanks oneswig
11:03:57 I would like to provide an update on our cloud traces work
11:04:05 Evening
11:04:11 #chair b1airo
11:04:12 Current chairs: b1airo martial_ oneswig
11:04:13 Hi Blair you made it
11:04:14 Hi oneswig, hi priteau
11:04:23 Hi martial_
11:04:24 The idea is to provide something similar to the Parallel Workloads Archive or Grid Workloads Archive
11:04:41 Just made it. Not sure I'll last long to be honest - already in bed...
11:04:42 hi all, sorry to bother, I am new here. The reason I joined this meeting is because I am planning to support our HPC system on OpenStack and was wondering whether this was a good place to ask those questions?
11:04:45 priteau: can you describe what these do and why they are useful?
11:04:45 These focus on HPC clusters and grids rather than clouds
11:04:57 Welcome masber !
11:05:00 masber: you've come to the right place :-)
11:05:27 thank you
11:05:38 They provide data about all the jobs that have been run on some infrastructures. They can be used by researchers to simulate workloads that match reality.
11:05:53 It can be particularly useful for e.g. a researcher working on job scheduling
11:06:07 If I recall correctly those existing archives are mainly used in scheduling research
11:06:23 kind of like IMIX for network packets? (a sample of random internet frames)
11:06:49 oneswig: I didn't know IMIX, but yes it looks similar in another context
11:06:55 But I think you and Kate were looking for richer data than just instance requests?
11:07:46 except that IMIX might only be providing a distribution of packets, rather than actual headers
11:08:15 priteau: feasibly - seen but not used it. Carry on :-)
11:08:32 b1airo: eventually yes, I think it could be combined with telemetry data. We started with just instance requests for now.
11:09:01 masber: briefly, this is our group meeting time. We typically have an agenda (posted to the mailing list prior) that includes any other business and discussion time towards the end. But there is also a dedicated channel where we can chat anytime
11:09:12 priteau: is this formed from nova notifications or something higher-level?
11:09:33 We've defined a data format which is heavily based on the structure of the Nova database
11:09:43 http://scienceclouds.org/cloud-traces/cloud-trace-format/
11:10:16 It's at version 0.1 so it may change in the future depending on feedback
11:10:22 priteau: I am curious why the DB format and not the API format?
11:10:40 Impressive domain name
11:10:42 b1airo, yes I don't want to hijack the meeting, would you mind sharing the channel I can use to ask my questions?
11:11:10 johnthetubaguy: I would assume that they're quite similar, but I will check
11:11:35 priteau: the API format is a public API contract, the DB changes quite a lot, in general anyways
11:11:47 Good point
11:12:06 Basically it's using data from instance_actions_events joined with data from the instances table
11:12:07 priteau: what are you looking for in the data you're collecting?
11:12:15 masber: #scientific-wg
11:12:30 priteau: actually the notifications have versions, so maybe what b1airo said is a better option
11:12:53 oneswig: All events associated with all instances running on a cloud deployment
11:13:34 Hopefully enough for a researcher to analyze and simulate the same kind of activity
11:13:40 priteau: the idea being you can re-run them to see if your job scheduling is better, etc, etc?
11:13:46 ah, that is a yes then
11:13:50 johnthetubaguy: Yes, that's the idea
11:14:32 priteau: pretty impressive :)
11:14:40 priteau: what's interesting here is that Chameleon is bare metal. I'm interested to compare this with the usage of a virtualised resource - for scalability considerations
11:15:09 oneswig: Actually we've only provided a trace from our KVM cloud so far: http://scienceclouds.org/cloud-traces/chameleon-openstack-kvm-cloud-trace/
11:15:25 I guess also Chameleon users are a fairly unique bunch, who may be hard to characterize
11:15:35 We thought it may be more representative and would cover more instance events (such as migration)
11:15:48 priteau: I'll not download that right now or the entire train network will go down :-)
11:16:06 it's only 6 MB zipped, barely more than a modern webpage ;-)
11:16:28 yeah, on a UK train, that might take everyone out
11:16:39 priteau: What are you using to visualise your data? (please say grafana diagram :-)
11:17:23 johnthetubaguy: thinking of alternative expansions of GWR now...
11:17:37 oneswig: We haven't done visualization from these traces, but the student who worked on this is using Grafana for another project related to experiment visualization
11:18:19 From here, I wonder if it is not too many steps to visualise notification data coming from a live system
11:18:30 It's still work in progress, but since the student is leaving soon we wanted to release a trace and format to request feedback
11:18:53 johnthetubaguy: I agree it'd certainly be better if we could do this with the APIs and/or notifications
11:19:15 One question we have is whether folks are interested in high- or low-level OpenStack events. For example, an action like “migrate” is composed of four separate action events: cold_migrate, compute_prep_resize, compute_resize_instance, and compute_finish_resize.
11:19:42 Actions that we saw in our KVM cloud are: create/delete, start/stop, reboot, rebuild, migrate, live-migration, suspend/resume, resize/confirmResize, and pause/unpause.
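For readers following the trace-format discussion: a minimal sketch of the kind of extraction priteau describes, joining instance_actions_events with instance_actions and instances in the Nova database. Only the two table names mentioned in the meeting are from the source; the join columns, connection string, and CSV layout are assumptions to be checked against the Nova schema for your release, and this is not the Chameleon team's actual code.

```python
# Minimal sketch (not the Chameleon tool): dump instance action events
# from the Nova DB into a flat CSV, one row per event.
# All column names and the DSN below are assumptions; verify against
# the Nova schema before using.
import csv

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://reader:secret@nova-db/nova")  # hypothetical read-only DSN

QUERY = text("""
    SELECT i.uuid, i.user_id, i.project_id, i.vcpus, i.memory_mb,
           a.action, e.event, e.start_time, e.finish_time, e.result
    FROM instance_actions_events e
    JOIN instance_actions a ON e.action_id = a.id
    JOIN instances i ON a.instance_uuid = i.uuid
    ORDER BY e.start_time
""")

with engine.connect() as conn, open("trace.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["instance", "user_id", "project_id", "vcpus", "memory_mb",
                     "action", "event", "start_time", "finish_time", "result"])
    for row in conn.execute(QUERY):
        # One row per low-level action event, with instance metadata joined in
        writer.writerow([row.uuid, row.user_id, row.project_id, row.vcpus,
                         row.memory_mb, row.action, row.event,
                         row.start_time, row.finish_time, row.result])
```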
11:19:43 Problem with notifications is it essentially means needing a trace service for capturing, whereas I imagine most admins would be happier doing one-off dumps
11:19:58 priteau: most interesting for me is the hierarchical connection of an instigating API request and the sub-events generated. The level of detail's like a fractal, perhaps
11:20:15 there are two ways to look at this I think (1) measure how long each thing took (2) use it for replay
11:20:24 b1airo: don't ceilometer and stacktach do that?
11:20:30 yeah for replay, don't need as many events recorded
11:20:46 b1airo: Personally as an operator I wouldn't mind setting up a notification collector if it seems safe to use and easy to set up.
11:20:49 oneswig: Have you looked at os-profiler?
11:20:51 stacktach would collect all the notifications for you, I guess ceilometer should
11:20:59 Sorry, osprofiler
11:21:09 #link https://github.com/openstack/osprofiler
11:21:12 priteau: ah, knew I'd missed one. Heard the name, like the face, never used it, alas
11:21:26 works with rally, right?
11:21:29 osprofiler reads from ceilometer I think? by default
11:21:41 although that may have changed
11:21:59 I think osprofiler can be used with Rally, but not necessarily
11:22:17 johnthetubaguy: seems so: https://docs.openstack.org/osprofiler/latest/user/integration.html
11:22:45 So I have noted the comments about using the API rather than DB access
11:22:46 priteau: do you think the format you've proposed could be generated through 'distillation' of notification objects, or are they too different?
11:23:28 oneswig: I think it would be possible as long as the same data is available in the notification or via an extra Nova API query
11:23:47 you should be able to get it all from the notifications
11:23:54 priteau: I am not sure, all I know is that notifications are big.
11:24:09 might depend on what data you're doing the join for?
11:24:32 it's to have things like user_id, project_id, etc. on the same row
11:25:26 BTW we anonymized fields like user_id in case Chameleon users didn't want their username to be shared (our KVM cloud is old and still using the LDAP backend for Keystone, which means that user_id == username)
11:25:57 priteau: you mentioned that your student is nearing completion of their placement. What are the next steps for you?
11:26:08 Yes that step is going to be very important
11:26:56 oneswig: We will release the code to extract this data and would like to see if bigger clouds could share their own traces
11:27:25 I think b1airo was interested to share NeCTAR data
11:27:28 I wonder if those id fields could go through a non-reversible hash
11:27:56 Or is that what you're doing already?
11:27:56 b1airo: I think turbo-hipster from mikal tried to do some of this stuff in the past
11:28:19 I think we've used a SHA1. Obviously if you've got access to a rainbow table you might be able to reverse it.
11:28:39 Ah yeah I remember that now you mention it johnthetubaguy
11:28:57 Maybe we should use scrypt?
11:29:59 Turbo-hipster was about capturing prod DBs for migration testing right? Is that still happening at all?
11:30:14 Are you talking about https://github.com/openstack/turbo-hipster?
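On the hashing question raised above, a rough comparison of a bare SHA-1 digest (easy to brute-force or look up when user_id == username) with a salted, memory-hard scrypt digest. The salt value and scrypt parameters are illustrative assumptions, not what Chameleon actually uses.

```python
# Sketch: anonymizing user_id / project_id fields in a published trace.
# A bare SHA-1 of a short username is cheap to enumerate; a per-site
# secret salt plus a memory-hard KDF such as scrypt makes guessing
# usernames against the trace much more expensive.
import hashlib

SITE_SALT = b"replace-with-a-long-random-per-site-secret"  # hypothetical value

def weak_anonymize(user_id: str) -> str:
    # Plain SHA-1, unsalted: vulnerable to dictionary / rainbow-table attacks
    # when the input space (usernames) is small.
    return hashlib.sha1(user_id.encode()).hexdigest()

def stronger_anonymize(user_id: str) -> str:
    # Salted scrypt: deliberately slow and memory-hard. Parameters here are
    # modest (~16 MB per call) and only meant as a starting point.
    digest = hashlib.scrypt(user_id.encode(), salt=SITE_SALT,
                            n=2**14, r=8, p=1, dklen=16)
    return digest.hex()

print(weak_anonymize("alice"))
print(stronger_anonymize("alice"))
```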
11:30:54 yeah, it did a bit of anonymising data
11:31:12 interesting
11:31:14 #link http://turbo-hipster.readthedocs.io/en/latest/intro.html
11:31:18 to allow folks to contribute their datasets
11:31:33 I think Rackspace gave their pre-prod dataset at one point
11:31:53 apparently fields are anonymised with https://github.com/mrda/fuzzy-happiness
11:32:01 it's just there may be ideas in there worth borrowing
11:32:25 That's a great idea for improving the test coverage.
11:32:37 priteau: maybe worth trying it to see if the trace format can be fulfilled with a turbo-hipster dump?
11:32:49 that did catch some upgrade bugs in the past
11:33:07 hmm, not sure about using its format
11:33:30 I would try the notification or similar format
11:34:00 b1airo: good idea. Do you know if the "variety of real-world databases" that turbo-hipster uses is freely available?
11:34:51 I am not sure there is much variety in the end, maybe three DBs we ran the upgrade migrations on?
11:35:22 I suspect they are too old now, but I would have to go dig to find out
11:36:37 If they are big enough they may be interesting
11:36:37 This discussion is a good example of data gathered for one purpose taking on an entirely different one
11:36:48 Anyway priteau, this is a good start. I'm sure I can convince Nectar Directorate to share an initial set
11:37:21 b1airo: Thanks. We need to do a bit of code cleanup first but I'll be in touch.
11:37:53 +1 seems like a great start
11:37:56 priteau: thanks for sharing your work
11:38:06 Don't be too fussy about it, we can patch/fix/comment code too as needed
11:38:20 priteau: btw, any progress on the Blazar UI?
11:38:22 oneswig: Thanks for allowing me :-)
11:38:39 pierre: any trace that can be analyzed?
11:38:40 oneswig: yes, it's in review, it should be merged by the end of the week!
11:38:43 Is that in horizon?
11:39:08 martial_: what do you mean?
11:39:23 priteau: that's great, well done
11:39:50 OK - a few other items to cover, shall we move on?
11:39:50 well I looked at the website and listened to your presentation of the project, but now I am trying to see if we can draw comparables on the collected data
11:39:58 b1airo: yep, https://github.com/openstack/blazar-dashboard
11:40:41 pierre: we can take that outside of the meeting
11:40:47 stig: please go ahead
11:40:47 sure
11:41:04 #topic Update from Blair - THP
11:41:11 b1airo: why don't you fit it in here
11:41:24 I would like to look at using Blazar in Nectar too, need to pick your brains at some stage so I can get a 1-pager together on what it might look like
11:41:40 Sure oneswig
11:42:05 I would love to help you with that
11:43:03 I now have a repeatable test case that shows simply having a full pagecache before running a compute job can cause a 2-3x slowdown
11:43:17 in the hypervisor?
11:44:16 The effect seems to be on both BM and VM, but VM emphasises it
11:44:50 as in the 'cache' column in vmstat, 'Cached' in /proc/meminfo?
11:45:25 how memory-heavy is that job?
11:45:40 This particular code (SIMPLE) is a CryoEM structural refinement thing. And I'm just using a Singularity container that is set up to run SIMPLE's built-in speed test
11:45:56 What rune do you use to evict the page cache?
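On the page-cache question, a small sketch of reading 'Cached' from /proc/meminfo and dropping clean caches via /proc/sys/vm/drop_caches; these are standard kernel interfaces and need root, but the helper names are ours, not something quoted in the meeting.

```python
# Sketch: inspect the page cache and evict clean caches (run as root).
# Only standard procfs interfaces are used.

def meminfo_kb(field: str) -> int:
    """Return a /proc/meminfo field (e.g. 'Cached', 'MemFree') in kB."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)

def drop_caches() -> None:
    """Evict clean pagecache plus dentries/inodes (equivalent of
    'echo 3 > /proc/sys/vm/drop_caches'); dirty pages must be synced first."""
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

print("Cached before: %d kB" % meminfo_kb("Cached"))
drop_caches()
print("Cached after:  %d kB" % meminfo_kb("Cached"))
```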
11:46:32 And that test only ends up using about 2GB of memory, but parallelises well on SMP
11:47:20 It is very memory-heavy I believe, and has ill-defined random access patterns
11:47:58 I filled the pagecache by simply cat-ing a single file equal to the VM RAM size
11:48:00 I wonder what quantity of memory amounts to a full TLB when using 4K pages.
11:48:15 oneswig: very little
11:48:40 TLBs are only a few thousand entries iirc
11:48:55 That's actually more than I thought
11:49:03 is it possible the cache is not local (would be strange but that slowdown is impressive)?
11:49:13 So 2GB random access memory patterns would totally kill it
11:49:55 martial_: you're thinking it would involve a write back to networked storage to evict dirty pages?
11:50:02 yep
11:50:10 The problem appears to be that the Linux mm is for some reason not able to allocate THPs even when all the memory usage is cache and should be able to be dumped immediately
11:50:13 b1airo: are you using the default vm.swappiness value?
11:50:41 seems unlikely and not natural at all to me, but disk is not that slow
11:51:42 priteau: no swap in guest at this stage, but I plan to investigate that as we have seen slub allocation problems and the reading I've been doing there makes me think the kernel allocator might still like having a teeny bit of swap around even if it doesn't really use it
11:52:29 are you doing all the CPU pinning, NUMA passthrough, hugepages etc for the VM?
11:52:41 Yep
11:52:52 b1airo: I wonder if THPs are allocated in a GFP_ATOMIC context, so cannot muck about with other page mappings in fear of a fault
11:53:35 b1airo: When are you going to write all this stuff up?
11:54:22 oneswig: possibly, but I doubt it because normal behaviour is to defrag on fault in order to allocate a THP if possible
11:54:48 b1airo: ah ok, thanks
11:55:01 oneswig: once I've made some decent graphs to pictorialise it
11:55:11 did perf help there?
11:55:42 b1airo: I was thinking about the settings on the host
11:56:07 @blairo did you tweak zone_reclaim_mode at all?
11:56:30 (also the old-fashioned sysadmin in me agrees that having a bit of swap is no bad thing)
11:56:36 Actually still haven't done that as once I discovered this behaviour I figured inspecting guest vmstat was the better place to start
11:56:56 I wonder if you are switching THP too much, so smaller pages would be better, as they are quicker to read in?
11:57:32 daveholland: was just about to say that - interestingly, turning on zone reclaim makes the problem go away (at least for this little test case)
11:57:35 I would second stig here on the 2GB random access on a THP pagecache, seems prone to slowdown even if it is simply jumps
11:58:25 Time call!
11:58:25 With zone reclaim on even after filling pagecache the kernel is keeping over 2GB free on my 8GB test VM
11:58:34 blairo: we came to that via a user seeing soft lockups, https://access.redhat.com/solutions/1560893 (don't know if you need a RH account to view)
11:59:06 By the way, I've confirmed this in both CentOS 7 (3.10 kernel) and with 4.12 from EPEL
11:59:36 guest OS?
11:59:36 Thanks daveholland, will have a look (I do have a RH account)
11:59:45 CentOS 7
11:59:50 thought you ran Xenial in the hypervisor?
12:00:03 blair, take a look at that https://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases
12:00:04 Ubuntu Trusty host with 4.4 kernel
12:00:30 We are out of time, alas.
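For anyone wanting to reproduce the THP investigation above, a sketch that dumps the knobs and counters discussed (vm.swappiness, zone_reclaim_mode, THP mode, and the thp_* counters in /proc/vmstat). The paths are standard kernel interfaces; the script itself is only an illustration, not b1airo's tooling.

```python
# Sketch: print the NUMA/THP settings and counters discussed in the meeting.
# Run on the host or guest being investigated; paths are standard kernel interfaces.

def read(path: str) -> str:
    with open(path) as f:
        return f.read().strip()

# 0 = off; non-zero keeps allocations node-local (turning it on made the
# problem go away in b1airo's small test case)
print("vm.zone_reclaim_mode =", read("/proc/sys/vm/zone_reclaim_mode"))
print("vm.swappiness        =", read("/proc/sys/vm/swappiness"))

# e.g. "[always] madvise never"
print("THP enabled:", read("/sys/kernel/mm/transparent_hugepage/enabled"))
print("THP defrag: ", read("/sys/kernel/mm/transparent_hugepage/defrag"))

# thp_fault_alloc vs thp_fault_fallback shows how often the kernel failed
# to hand out a huge page on fault -- the fragmentation symptom described above
vmstat = dict(line.split() for line in read("/proc/vmstat").splitlines())
for key in ("thp_fault_alloc", "thp_fault_fallback",
            "thp_collapse_alloc", "thp_collapse_alloc_failed"):
    print(key, vmstat.get(key, "n/a"))
```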
12:00:38 Thanks b1airo, good update
12:00:55 have to close now
12:00:55 I did read that already martial_ ;-)
12:01:07 #endmeeting