High-energy physics and genetic research may seem to have little in common. Yet they share one challenge: both fields have to process mountains of data. For this purpose, they both use the Grid, a powerful virtual supercomputer. It is so powerful that it can do as much in a week as an ordinary PC can do in a century.
30 million gigabytes of data per year
In recent years, physicists have been searching for the Higgs boson. To do this, they analysed particle collisions in CERN’s particle accelerator, the Large Hadron Collider (LHC). The construction and maintenance of the LHC were already monumental tasks. Very soon, it became apparent that filtering, storing and processing the generated data would also be a huge challenge. To express it in figures, the experiments in the collider produce 30 petabytes (30 million gigabytes) of data each year.
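To get a feel for that figure, the short Python sketch below converts 30 petabytes to gigabytes and works out the average data rate. It assumes decimal units (1 petabyte = 1 million gigabytes, as the figure above implies) and, purely for illustration, that the data is spread evenly over a 365-day year, which a real accelerator schedule is not. The function names are made up for this example.

```python
# Decimal (SI) units: 1 petabyte = 1,000,000 gigabytes.
GB_PER_PB = 1_000_000


def petabytes_to_gigabytes(petabytes: float) -> float:
    """Convert petabytes to gigabytes using decimal units."""
    return petabytes * GB_PER_PB


def average_rate_gb_per_second(gb_per_year: float, days_per_year: float = 365.0) -> float:
    """Average rate if the yearly volume were spread evenly over the year (an assumption)."""
    seconds_per_year = days_per_year * 24 * 3600
    return gb_per_year / seconds_per_year


yearly_gb = petabytes_to_gigabytes(30)
print(f"{yearly_gb:,.0f} GB per year")
print(f"{average_rate_gb_per_second(yearly_gb):.2f} GB per second on average")
```

Under these assumptions, the collider produces close to one gigabyte of data every second, around the clock.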
The Grid: a powerful virtual supercomputer
To solve the problem, the Worldwide LHC Computing Grid, a powerful virtual supercomputer, was devised. It is virtual because it is spread across 170 computer centres in 41 countries, with every location processing part of the data. The Grid is organised into a number of layers, or ‘Tiers’. Tier-0, located right next to the Large Hadron Collider, conducts the first processing step. There are 13 Tier-1 locations, one of which is formed jointly by Nikhef and SURFsara. The Tier-1 locations store and process the data so that scientists can start their own analysis. This analysis is largely conducted at the Tier-2 sites. To support this impressive structure, physicists have developed software and procedures and set up support teams.
Largest ALS study ever conducted
The success of the Grid during the search for the Higgs boson also generated interest in other research disciplines. One example is research into the disease ALS. More than 200,000 people around the world suffer from this deadly muscle disease. The average life expectancy following diagnosis of ALS is three years. The cause of the disease is unknown. To try to change this, the large-scale study Project MinE was set up. The initiators of this international project are two Dutch entrepreneurs who themselves suffer from ALS. Crowdfunding platforms have been set up in every participating country. In addition, funding was provided by the Dutch ALS Foundation, which received a huge number of donations following initiatives such as the Ice Bucket Challenge and the Amsterdam City Swim. Researchers at University Medical Centre Utrecht are now working with a team of international experts on the largest ALS study ever conducted.
The goal of the project is to analyse the DNA of 15,000 ALS patients and a control group of 7,500 people. By comparing the results, the researchers hope to identify the genetic cause of ALS. The solution is most likely written in our DNA. DNA analysis of 22,500 people generates a massive amount of data: roughly 75 to 100 gigabytes per person, or about 2 million gigabytes in total. Storing this amount of data would take a stack of hard drives as tall as the Dom Tower in Utrecht (112 metres).
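The total can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the numbers quoted above; the variable and function names are chosen for this example.

```python
PATIENTS = 15_000
CONTROLS = 7_500
GB_PER_PERSON_LOW = 75    # lower estimate per person
GB_PER_PERSON_HIGH = 100  # upper estimate per person


def total_gigabytes(people: int, gb_per_person: int) -> int:
    """Total data volume for a group, in gigabytes."""
    return people * gb_per_person


people = PATIENTS + CONTROLS
low = total_gigabytes(people, GB_PER_PERSON_LOW)
high = total_gigabytes(people, GB_PER_PERSON_HIGH)

print(f"{people:,} people")
print(f"between {low:,} and {high:,} GB in total")
```

With 22,500 people, the total lies between roughly 1.7 and 2.25 million gigabytes, which is where the figure of about 2 million gigabytes comes from.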
The Grid is the perfect computing infrastructure for performing DNA analysis. The data consists of fragments. The first step is to join these fragments together to create a number of chromosomes, which is equivalent to completing a jigsaw with several million pieces. And this has to be done for every single person in the study. Thankfully, every ‘jigsaw’ can be solved independently of the others. In this respect, the application resembles the LHC experiments, where every collision is also separate from the other collisions. As a result, the foundations that supported the high-energy physics research also enable the analysis of genetic material.
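Because no ‘jigsaw’ depends on any other, the work can simply be handed out to as many workers as are available. The Python sketch below illustrates only that idea. The `assemble` function is a toy stand-in invented for this example (it orders numbered text fragments and joins them), not the software Project MinE actually uses, and a local thread pool stands in for the many separate machines of the Grid.

```python
from concurrent.futures import ThreadPoolExecutor


def assemble(fragments):
    """Toy stand-in for assembly: sort (position, text) fragments and join them."""
    return "".join(text for _, text in sorted(fragments))


def assemble_all(samples, max_workers=4):
    """Assemble every person's fragments independently of all the others."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each sample is its own job; no job needs data from another one.
        results = pool.map(assemble, samples.values())
        return dict(zip(samples.keys(), results))


# Hypothetical miniature data set: two people, a handful of fragments each.
samples = {
    "person_a": [(2, "GT"), (0, "AC"), (1, "TT")],
    "person_b": [(1, "CA"), (0, "GG")],
}

print(assemble_all(samples))
```

The same pattern would apply to collisions in the LHC: since the jobs never need to talk to each other, adding more machines lets more of them run at the same time.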
Valuable data set
Once the data set is complete (it is currently at 23%), Project MinE will possess a DNA database that is unique in terms of both size and quality. In particular, the data from the 7,500 people in the control group could be of great value to a huge number of research projects into conditions such as dementia, autism or schizophrenia. Research into all kinds of common diseases requires control groups, because researchers have to look further than just family relationships, and it can benefit from this data. The data from Project MinE will be made available to these kinds of research initiatives.
Our contribution (and yours)
By providing this computing infrastructure, SURFsara and Nikhef are making a contribution to Project MinE, and we hope the researchers can make ALS a thing of the past as soon as possible. Every contribution truly makes a difference. To find out how you can contribute, visit the website of this large-scale ALS study.
More applications of the Grid
The European Grid Infrastructure (EGI) lists a number of research projects that have capitalised on the opportunities that the Grid offers. The common denominator of all of these projects is their scale: no single computer in the world would be able to handle them alone. The Grid is ready and waiting to help these projects and many others that require extra computing capacity.
See the EGI’s website.
Jan Bot is an advisor at SURFsara. He provides advice on the latest technology in the field of High Performance Computing and big data for science, predominantly to researchers in the discipline of life sciences.
You can contact him by e-mail via firstname.lastname@example.org.
This blog article was previously published on Computerworld.