LIVE: A Light-weight Data Workspace for Computational Science

Hasan Abbasi, Matthew Wolf, Karsten Schwan
College of Computing, Georgia Institute of Technology, Atlanta, GA 30332
habbasi@cc.gatech.edu, mwolf@cc.gatech.edu, schwan@cc.gatech.edu
Joint appointment with Oak Ridge National Laboratory, Oak Ridge, TN

ABSTRACT
We present the Lightweight Information Validation Environment (LIVE) as a solution to the high complexity and large data sizes of modern computational science applications. LIVE is a data workspace that facilitates the creation of dynamic data processing overlays we call I/Ographs. We use LIVE as a platform for the dynamic extension of scientific applications through lightweight data extraction, runtime discovery, and flexible data selection.

Categories and Subject Descriptors
C.5.1 [COMPUTER SYSTEM IMPLEMENTATION]: Large and Medium ("Mainframe") Computers

General Terms
Performance, Experimentation, Design

Keywords
LIVE, WARP, GTC, XT3, datatap

Copyright is held by the author/owner(s). HPDC'07, June 25-29, 2007, Monterey, California, USA. ACM 978-1-59593-673-8/07/0006.

1. INTRODUCTION
There has been a recent explosion in the complexity and data volumes of modern scientific applications, creating a new kind of computer science challenge: allowing legacy applications to be easily adapted and redeployed into current and future computing environments. Workflow management, component frameworks, and data sharing are solutions under consideration by the community. We introduce the Lightweight Information Validation Environment (LIVE) to address the needs of a modern computational data workspace. LIVE allows easy integration of complex legacy applications with less complex components for data analysis and visualization. Furthermore, LIVE addresses the issues associated with application and model coupling by providing the user with an innovative framework for building and discovering data overlays.

LIVE provides three key features: (1) a lightweight data tap with which a legacy code can expose its normal output, as well as selected internal data structures, for external access; (2) run-time data filtering, both at the source of data extraction and in flight; and (3) a distributed transport environment, termed I/Ographs, for the efficient, dynamic annotation, conversion, modification, and extension of this data into forms other data workspace components can consume.

LIVE has been applied to a Molecular Dynamics (MD) code used by our collaborators in the materials science domain. The legacy Fortran code used in the MD application has been heavily modified and customized during its lifetime, making it infeasible to rewrite the application from the ground up. LIVE offers the scientist an infrastructure for integrating its core computational kernel into a prototyping data workspace. In our work we have extended the MD application with efficient data snapshots, annotation tasks such as Common Neighbor and Central Symmetry analysis, and remote visualization.

2. DESIGN AND IMPLEMENTATION
LIVE is designed to provide an infrastructure for very high-volume data movement, as required by codes such as the Gyrokinetic Turbulence Code (GTC), which can produce an aggregate data volume in excess of 2 TB/hour.

Datataps
One of the constraints in large scientific applications is the overhead associated with data output, yet this overhead cannot be avoided. Low-cost data output from a scientific application can be used for remote visualization, online analysis, or application coupling. In all cases the data output has to occur frequently (especially for application coupling, where it may be as frequent as every few iterations) and should not result in unacceptable overheads.
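To make the datatap idea concrete, the following is a minimal, self-contained sketch. It is purely illustrative: LIVE itself instruments C and Fortran codes through its middleware, and every name below (`DataTap`, `emit`, the filter predicate) is invented for this example, not the LIVE API. The point it demonstrates is source-side filtering — when the selection runs where the data is produced, only the selected subset ever incurs transport cost.

```python
# Toy model of a source-side datatap (all names hypothetical; not the LIVE API).
# A legacy code's snapshot routine hands data to the tap; an optional filter
# runs at the source, so only the selected subset leaves the compute node.

class DataTap:
    def __init__(self, sink, predicate=None):
        self.sink = sink            # downstream consumer (overlay, file, socket, ...)
        self.predicate = predicate  # source-side filter; None means "emit everything"

    def emit(self, records):
        if self.predicate is not None:
            records = [r for r in records if self.predicate(r)]
        self.sink.extend(records)
        return len(records)         # volume of data actually extracted

# Simulated snapshot: atoms as (x, y, z) positions.
atoms = [(i * 0.1, 0.0, 0.0) for i in range(1000)]

unfiltered_out, filtered_out = [], []
tap_all = DataTap(unfiltered_out)
# Cutting-plane-style selection: keep only atoms near the plane x = 5.0.
tap_cut = DataTap(filtered_out, predicate=lambda a: abs(a[0] - 5.0) < 1.0)

n_all = tap_all.emit(atoms)
n_cut = tap_cut.emit(atoms)
print(n_all, n_cut)  # the filtered tap moves far less data than the plain one
```

The same predicate could instead run in flight, inside the overlay; the trade-off between filtering at the source and filtering downstream is exactly the placement choice LIVE exposes to the user.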
LIVE provides a low-cost alternative to MPI-IO, HDF5, and basic file I/O by giving the application a low-overhead mechanism for extracting generated data, based on a self-describing portable binary format. In addition, LIVE enables the user to insert filtering and transformation functionality, further reducing the overhead as well as fixing data format mismatches.

I/Ographs
LIVE does not directly connect the data workspace components to the application; instead, it inserts a data overlay, called an I/Ograph, between them. The I/Ograph is a data computation overlay that allows multiple components to each be served by a single data tap, even in cases where the input data specification differs for each component. The I/Ograph allows further customization of the exported data through either a pre-compiled transform (for example, a data redistribution function) or a dynamically compiled transform or filter (for example, a cutting plane filter). The I/Ograph provides better performance through greater overlap of computation and communication, as well as more flexibility by enabling each component in the workspace to define a local data view.

Figure 1: WARP: A configuration of the LIVE Data Workspace

Runtime dynamic filtering and morphing
The two features above take advantage of middleware support for runtime code generation, yielding better adaptability for data flows within the workspace. By giving the user options on where filtering and morphing take place, the data workspace can outperform simpler I/O libraries despite providing more flexibility.
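The fan-out behavior of an I/Ograph — one data tap serving several consumers, each through its own transform into a local data view — can be sketched as follows. This is again a toy Python model with invented names; in LIVE the transforms are native filters, possibly generated at runtime, executing inside the transport overlay.

```python
# Toy model of an I/Ograph node (all names invented for illustration): one
# datatap publishes a snapshot once, and each registered consumer receives
# its own transformed view of that single emission.

class IOGraph:
    def __init__(self):
        self.consumers = []  # list of (transform, deliver) pairs

    def register(self, transform, deliver):
        # transform: callable reshaping the data into this consumer's local view
        # deliver:   callable consuming the transformed data
        self.consumers.append((transform, deliver))

    def publish(self, data):
        # A single upstream emission fans out to every registered consumer.
        for transform, deliver in self.consumers:
            deliver(transform(data))

# Snapshot of atoms as (x, y, z) positions.
snapshot = [(float(i), float(i) * 0.5, 0.0) for i in range(100)]

views = {}
graph = IOGraph()
# A visualizer wants everything; an analysis component wants a cutting-plane
# subset; an aggregator only needs a scalar summary of the same snapshot.
graph.register(lambda d: d, lambda d: views.setdefault("visualizer", d))
graph.register(lambda d: [a for a in d if a[0] < 10.0],
               lambda d: views.setdefault("cutting_plane", d))
graph.register(len, lambda n: views.setdefault("aggregate", n))

graph.publish(snapshot)
print(len(views["visualizer"]), len(views["cutting_plane"]), views["aggregate"])
```

Because each consumer's view is derived inside the overlay rather than in the application, attaching another consumer does not require touching the simulation code — the property evaluated in Section 4.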
Dynamic discovery and extension
LIVE's I/Ographs are constructed dynamically, whenever and wherever they are needed. Toward this end, LIVE implements a runtime data discovery service based on the type of data being exported by a module (and/or by the application). Any component within the data workspace can register for specific data types and associate a number of transforms with each type.

LIVE leverages earlier work on ECho, a distributed publish/subscribe interface. The datataps use PBIO binary formats [1] for efficient description and marshalling of data; the I/Ographs use EVPath, a high-performance, low-level data transport with support for runtime code generation for filtering and transformations. By design, LIVE's datataps reside on the compute nodes of MPP computers, while the I/Ographs are designed to be dynamically constructed on the I/O nodes, the service nodes, or auxiliary clusters.

3. APPLICATION EXAMPLE
We used the molecular dynamics application WARP to provide a use case for LIVE. WARP is used by our collaborators in the materials science domain to study material properties such as crack propagation and fractures. We enhanced WARP with additional components required in typical materials research. These tools are usually executed as a post-processing step when analyzing the simulation data for specific features. Instead of waiting until the simulation completes, we enable the scientists to perform online analysis and visualization. In the future we plan to enhance this functionality with a feedback mechanism for more fine-grained control of the application itself.

To evaluate LIVE we implemented four data consumers: (1) Molecular Bonds calculation, (2) Central Symmetry Analysis, (3) Common Neighbor Analysis, and (4) Data Visualization. The data flow is shown in Figure 1. We used the built-in snapshot procedure in WARP as the entry point to our datatap, but instead of just writing the data to disk we provided input to the above components.

4. EVALUATION
We ran WARP with LIVE and a variable number of data consumers to study the overheads associated with dynamic data extraction. The results, shown in Figure 2, indicate that adding multiple consumers adds only minimal overhead to the application runtime. This overhead can be further reduced by using a data broker within the I/Ograph overlay: there is a corresponding increase in the total end-to-end time, but the overhead on the application is reduced. The flexibility offered by LIVE allows it to optimize the construction of the I/Ograph based on user-specified performance metrics.

Figure 2: Overheads incurred by attaching variable numbers of data consumers

The data filtering mechanism provided by LIVE can also yield a significant reduction in data extraction overhead. By using a simple filter, such as a cutting plane or a data selection filter, we can improve performance considerably by reducing the output data size.

5. FUTURE WORK
Our immediate future work addresses some of the specific performance issues in the current LIVE implementation. More importantly, we are completing an implementation of the LIVE data transport for Portals, in order to run LIVE on the Cray XT3 class of supercomputers. We are also working on a parallel InfiniBand Verbs implementation, which should dramatically improve our network performance compared to IPoIB. To facilitate the use of LIVE for the creation of high-performance data workspaces, we are working on a solution to allow easier and more flexible data registration. This would allow the developer of a scientific application to more easily integrate LIVE with her application.

6. REFERENCES
[1] F. E. Bustamante, G. Eisenhauer, K. Schwan, and P. Widener. Efficient wire formats for high performance computing. In Proceedings of the 2000 ACM/IEEE Conference on Supercomputing (CDROM), page 39. IEEE Computer Society, 2000.