16–19 Oct 2016
Copenhagen University
Europe/Copenhagen timezone

Data Analysis Infrastructure for Diamond Light Source Macromolecular & Chemical Crystallography

18 Oct 2016, 14:30
20m
Marble Hall (Copenhagen University)

Thorvaldsensvej 40
Oral Contribution Contributions 5

Speaker

Dr Markus Gerstel (Diamond Light Source Ltd)

Description

A proposal for the future data analysis infrastructure at Diamond Light Source is presented. Built on a messaging framework, a variable number of distributed servers working in parallel replaces monolithic batch jobs running on a single node. This infrastructure is scalable, can be easily extended, and even allows heavily CPU-bound tasks, such as the processing of reduced macromolecular data, to be moved off-site, e.g. to external cloud providers.

Diamond Light Source has eight MX and CX beamlines with DECTRIS PILATUS detectors, each capable of producing diffraction data at rates between 25 and 100 images per second. Upgrades to new DECTRIS EIGER detectors are planned over the forthcoming year. These offer maximum frame rates between 133 and 3,000 Hz, concomitant with increased image sizes, for compressed data rates of around 18 Gbit/s.

The current automated data analysis process consists mainly of two aspects: a very fast and embarrassingly parallel per-image analysis for timely feedback during data collection, and more involved data reduction and processing designed to give answers to the experimental questions. The existing infrastructure depends on submitting batch jobs to a high performance computing cluster. While appropriate for the current workload, this approach alone does not scale to the very high data rates anticipated in the near future. With live processing in particular, performance suffers when the workload exceeds the capacity of one cluster node. Conversely, when data rates stay significantly below a node's capacity, the cluster is not used efficiently.

In the proposed infrastructure, fine-grained tasks are submitted as messages to a central queue. Servers running on cluster nodes consume these messages and process the tasks. Results can be written to a common file system, sent to another queue for further downstream processing, sent to a dynamic number of subscribing observers, or any combination of these. This will increase the availability of high performance nodes and allow increased parallelisation of more computationally expensive tasks, thus increasing the overall efficiency of cluster usage. The resulting distributed infrastructure is resource-optimal, low-latency, fault-tolerant, and allows for highly dynamic data processing.
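
The queue-and-consumer pattern described above can be illustrated with a minimal sketch. It is not the Diamond implementation: the abstract does not name a message broker, so Python's standard-library `queue` and `threading` modules stand in for one, and threads stand in for servers on cluster nodes. All names (`analyse_image`, `server`, `run`) and the message layout are hypothetical, and the "analysis" is a placeholder that counts pixel values above a threshold.

```python
import queue
import threading


def analyse_image(pixels, threshold=10):
    """Placeholder per-image analysis: count pixel values above a threshold."""
    return sum(1 for p in pixels if p > threshold)


def server(tasks, downstream, subscribers):
    """Consume task messages from the central queue until a None sentinel arrives."""
    while True:
        message = tasks.get()
        if message is None:  # shutdown sentinel, one per server
            break
        result = {"image": message["image"],
                  "spots": analyse_image(message["pixels"])}
        downstream.put(result)            # forward for further downstream processing
        for notify in list(subscribers):  # fan out to the current set of observers
            notify(result)


def run(images, n_servers=3):
    """Submit one message per image and process them with n_servers consumers."""
    tasks, downstream = queue.Queue(), queue.Queue()
    seen, lock = [], threading.Lock()

    def observer(result):
        # Example subscriber: records which images have been processed.
        with lock:
            seen.append(result["image"])

    subscribers = [observer]
    workers = [threading.Thread(target=server, args=(tasks, downstream, subscribers))
               for _ in range(n_servers)]
    for worker in workers:
        worker.start()
    for number, pixels in images:
        tasks.put({"image": number, "pixels": pixels})
    for _ in workers:
        tasks.put(None)
    for worker in workers:
        worker.join()

    # All producers have finished, so the downstream queue can be drained safely.
    results = {}
    while not downstream.empty():
        item = downstream.get()
        results[item["image"]] = item["spots"]
    return results, sorted(seen)
```

Because each message is self-contained, the number of consumers can be changed without touching the submitting code, which is the property the proposal relies on for scaling. In a real deployment the in-process queues would be replaced by a networked broker, and failure handling (for example, redelivering unacknowledged messages) would need to be added.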

Primary author

Dr Markus Gerstel (Diamond Light Source Ltd)

Co-authors

Dr Alun Ashton (Diamond Light Source Ltd)
Dr Graeme Winter (Diamond Light Source)
Dr Richard Gildea (Diamond Light Source Ltd)
