High-Performance Computing

The success of the ATLAS Experiment will be defined by more than just the dedicated physicists who pursue research at this highest-energy frontier. It will be defined as well by the reliability and scalability of our computer systems. These massive networks of storage and processing power serve as both a nervous and circulatory system; if they are not reliable, all the brain power in the world won’t be able to analyze the data. SMU is extremely fortunate to have invested in a massive computing system, one which stands as both an independent facility and as part of a global computing grid.

Proton collisions arrive from the Large Hadron Collider at CERN, are turned into computer data by the ATLAS trigger (which tries to keep only the most interesting events), and are then exported to a massive magnetic tape system for distribution to sites across the globe. ATLAS computing is “tiered”: Tier 0 is where the data is originally collected; Tier 1 facilities have full copies of the original data; Tier 2 facilities have a limited suite of data needed by regional groups; Tier 3 facilities serve the specialized needs of a single institution. At SMU, we are very fortunate to be host to a large Tier 3 system, managed by Justin Ross.

Justin Ross, in front of the SMU High-Performance Computing System.

The SMU High-Performance Computing (HPC) system [1] consists in part of 1000 “cores” – processing units which can each be devoted to the requests of a single user. It’s a resource available to the entire University. SMU ATLAS users (researchers with a specific computing need) routinely submit requests to the system to execute data analysis or a simulation of the ATLAS detector. At any moment, a half-dozen SMU Atlanteans might be making requests to the system. Some of these requests are for just a few cores, and some for as many cores as are available. It is the role of the system, tuned by the system manager, to decide how priority is assigned based on recent usage or future needs.
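To make the idea of usage-based priority concrete, here is a minimal sketch of one possible policy. It is only an illustration: the post does not describe the scheduler SMU actually runs, so the `Request` class, the function names, and the rule itself (users with the least recent usage are served first) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """A pending request for cores (hypothetical structure)."""
    user: str
    cores: int


def order_by_fair_share(requests, recent_core_hours):
    """Sort requests so users with the least recent usage come first.

    recent_core_hours maps a user name to the core-hours that user
    consumed recently; users missing from the map count as zero.
    """
    return sorted(requests, key=lambda r: recent_core_hours.get(r.user, 0.0))


def allocate(requests, recent_core_hours, free_cores):
    """Grant cores in fair-share order until the free pool runs out."""
    grants = []
    for req in order_by_fair_share(requests, recent_core_hours):
        if free_cores == 0:
            break
        granted = min(req.cores, free_cores)
        grants.append((req.user, granted))
        free_cores -= granted
    return grants


if __name__ == "__main__":
    # Made-up users and usage figures, purely for illustration.
    pending = [Request("alice", 800), Request("bob", 400)]
    usage = {"alice": 5000.0, "bob": 100.0}
    print(allocate(pending, usage, free_cores=1000))
```

In this toy policy, a user who has recently consumed many core-hours still gets cores, but only after lighter users have been served. A real scheduler would also weigh factors such as queue time and reservations for future needs.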

This is a massive technical challenge. Installing 1000 cores requires extensive knowledge of the hardware, a recognition of the electrical demands of the system, and expertise in high-speed, high-performance networking to connect all of it together. It also requires responsiveness to the needs of users, patience with their requests, and a collaborative spirit that helps to connect users to the system’s resources. We have been very fortunate to have Justin at the technical helm of this system.

Let me give a few examples that highlight the demands on this system. Members of the SMU ATLAS group (Julia Hoffman and Ryan Rios) needed an extremely large sample of simulated data in order to develop their search for the Higgs Boson, a holy grail of the LHC effort. When they were told that such samples of simulated data could not be quickly made available by ATLAS central simulation production, they turned to the HPC system. Over the course of six months, running almost continuously for weeks at a time, they simulated 2 million proton-proton collisions spread out over 840,000 “jobs” – individual requests to the system. The massive scale of the LHC and ATLAS experiment puts tremendous technical demands on the system; each collision takes 12 minutes to simulate. Being able to run many jobs in parallel on a dedicated system was essential to the success of this effort.
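The scale of that effort is easier to appreciate with some back-of-the-envelope arithmetic. The sketch below uses only the figures quoted above (2 million collisions, 12 minutes per collision, 1000 cores). It assumes that each collision occupies one core for the full 12 minutes and, for the ideal wall-clock time, that all 1000 cores are available at once, which is not the case on a shared system. The function name is hypothetical.

```python
def estimate_campaign(collisions, minutes_per_collision, cores):
    """Estimate the computing cost of a simulation campaign.

    Assumes one core per collision and, for the ideal wall-clock
    time, that every core is busy for the entire campaign.
    """
    core_minutes = collisions * minutes_per_collision
    core_hours = core_minutes / 60
    ideal_wall_hours = core_hours / cores
    return {
        "core_hours": core_hours,
        "ideal_wall_hours": ideal_wall_hours,
        "ideal_wall_days": ideal_wall_hours / 24,
    }


if __name__ == "__main__":
    # Figures quoted in the post.
    result = estimate_campaign(
        collisions=2_000_000, minutes_per_collision=12, cores=1000
    )
    for name, value in result.items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the campaign costs about 400,000 core-hours, or roughly 17 days of the entire system running flat out. On a single core the same work would take more than 45 years, which is why running many jobs in parallel mattered so much.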

The SMU HPC system, hard at work. Each of the “blades” above contains multiple processors, and each of those is host to multiple processing cores.

SMU ATLAS physicists now routinely analyze real proton collision data. For any local analysis, this data must be copied from a Tier 1 site, such as Brookhaven National Laboratory. Recently, SMU post-docs Haleh Hadavand and David Joffe and SMU grad student Kamile Dindar engaged in a serious exercise to test the reliability of the ATLAS systems built for getting data to SMU. The tremendous storage capacity of the system and the integration of central ATLAS bookkeeping tools into the SMU environment made a full test possible, even under the stressful conditions of preparing research for conferences.

My final example is a bit closer to home and involves some of my own work this summer. In order to meet the challenges of globally distributed data, physicists and engineers have developed “The Grid” – a massive, “cloud-computing” system distributed across the globe [2]. The philosophy of The Grid is simple – there is too much data for any one host, and so rather than bringing data to the user, the user should send an analysis to the data. This philosophy works remarkably well, but even so there is tremendous value in having a local HPC system that can help us keep up with the demands of physics analysis, especially at times when The Grid is swamped. As a comparison, I recently processed a large sample of simulated ATLAS data on The Grid and at SMU; the SMU HPC system did the same job as The Grid, but in some cases in much less time. One set of jobs ran in 51 minutes at SMU; the same set ran in 56 minutes on The Grid. A second batch of jobs ran in 52 minutes at SMU; the same batch required 3.5 hours on The Grid.
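Those timings can be turned into a simple speedup ratio, the Grid time divided by the local time. The short sketch below does this for the two batches quoted above. The function name and batch labels are hypothetical, and two batches are of course far too few to say anything statistical about typical Grid performance.

```python
def speedup(grid_minutes, local_minutes):
    """Return how many times faster the local run was than the Grid run."""
    return grid_minutes / local_minutes


# (local minutes at SMU, Grid minutes), as quoted in the post.
batches = {
    "first set": (51, 56),
    "second batch": (52, 3.5 * 60),  # 3.5 hours on the Grid
}

if __name__ == "__main__":
    for label, (local, grid) in batches.items():
        saved = grid - local
        print(f"{label}: {speedup(grid, local):.2f}x faster locally, "
              f"{saved:.0f} minutes saved")
```

The first set finished only about 10% sooner at SMU, while the second finished roughly four times sooner. That matches the point above: the local system matters most when The Grid is swamped.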

SMU has invested tremendous resources in developing a local high-performance computing system. From the perspective of an ATLAS physicist like myself, the HPC system is an invaluable tool on its own and a complement to the Grid resources our experiment has developed. In a moment of crisis, when discoveries are on the line and time matters, you cannot overestimate the benefits of a reliable, well-maintained, powerful, local system that saves time and promotes spontaneous research.

[1] https://wiki.smu.edu/display/smuhpc/SMUHPC

[2] http://en.wikipedia.org/wiki/Grid_computing

About Stephen Sekula
