With the recent announcement by CERN of the probable discovery of the Higgs Boson, the Standard Model of particle physics is essentially complete. We now understand the basic constituents of the universe and how they interact. The existence of such a particle and its associated field was hypothesized over 40 years ago by British physicist Peter Higgs, but until recently there was no direct evidence for it. The discovery was made by two teams at CERN working with the Large Hadron Collider (LHC), located in a 17-mile tunnel under the Swiss-French border. The LHC accelerates particles to within a hair of the speed of light, imparting enormous energies; when the particles collide, the resulting reactions allow scientists to observe the behavior of elementary particles.
The discovery of the Higgs Boson is not only a story of high-energy collisions in billion-dollar accelerators. It is also a story of data and computation. The Large Hadron Collider can produce 600 million collisions per second, and each of those collisions generates approximately 1 MB of data. That amounts to roughly 600 Terabytes, on the order of a Petabyte, of data each second, which is well beyond the capacity of current data collection and storage systems.
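A quick back-of-envelope calculation makes these figures concrete. The sketch below simply multiplies the quoted numbers; the actual rates vary with run conditions.

```python
# Back-of-envelope arithmetic from the figures quoted above
# (illustrative only; real rates depend on beam and run conditions).
collisions_per_second = 600_000_000   # ~600 million collisions per second
bytes_per_collision = 1_000_000       # ~1 MB of data per collision

raw_rate_bytes = collisions_per_second * bytes_per_collision
raw_rate_pb = raw_rate_bytes / 1e15   # convert bytes/s to Petabytes/s

print(f"Raw data rate: {raw_rate_pb:.2f} PB/s")  # 0.60 PB/s
```

At roughly 0.6 PB every second, even a single minute of unfiltered output would dwarf most storage systems, which is why aggressive filtering comes first.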
Scientists are interested only in certain collisions: those that produce results that might be relevant to their experiments. The vast majority of collisions are uninteresting and can be ignored. In particle accelerators, the systems that choose which data to keep are called triggers. A complex array of these triggers, including calorimeters and pattern comparators, sifts the data according to predefined criteria. Even after this filtering, the LHC produces about 15 Petabytes of raw data per year that must be stored and analyzed. Storing and analyzing all this data in one purpose-built data center would be infeasible; the cost would be enormous. Instead, the LHC parcels out chunks of data to data centers around the world.
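In essence, a trigger is a fast predicate applied to each event. The following is a minimal, hypothetical sketch of that idea; real LHC triggers are multi-stage hardware and software pipelines operating at far higher rates, and the event fields and threshold here are invented for illustration.

```python
import random

# Hypothetical trigger-style filter: score each simulated "event"
# against a predefined criterion and keep only the ones that pass.
def passes_trigger(event, energy_threshold=50.0):
    """Keep an event only if its total energy exceeds a threshold."""
    return event["energy"] > energy_threshold

# Fake events with random energies, standing in for collision data.
events = [{"id": i, "energy": random.uniform(0.0, 100.0)} for i in range(1000)]
kept = [e for e in events if passes_trigger(e)]

print(f"Kept {len(kept)} of {len(events)} events")
```

The point of the sketch is the shape of the problem: a cheap yes/no decision per event, applied before anything is written to storage.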
The LHC Computing Grid is a four-tiered network. The first level (Tier-0) is at the LHC, where a primary backup is kept on tape. After some initial processing, chunks of the data are distributed via dedicated fiber lines to 12 Tier-1 data centers around the world, each capable of receiving data at 10 Gigabits per second. After further processing and division, the data is sent to Tier-2, which is made up of over a hundred major data centers, often located in universities. The connections between Tier-1 and Tier-2 are standard network connections. These sites are where much of the data analysis is carried out, overseen by thousands of scientists accessing Tier-2 with their PCs. After the analysis, data is sent back up the tier levels.
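The tiered structure described above can be modeled as a simple hierarchy. The site counts and role descriptions below are drawn from the text, not an exact inventory of the real grid.

```python
# Toy model of the tiered distribution described above.
# Roles and counts follow the article; they are illustrative, not exhaustive.
tiers = {
    "Tier-0": {"sites": 1, "role": "LHC: primary tape backup and initial processing"},
    "Tier-1": {"sites": 12, "role": "regional centers on dedicated fiber links"},
    "Tier-2": {"sites": 100, "role": "university data centers where analysis happens"},
}

def route(chunk, path=("Tier-0", "Tier-1", "Tier-2")):
    """Trace a data chunk down the tiers, hop by hop."""
    return [f"{chunk} -> {tier}: {tiers[tier]['role']}" for tier in path]

for hop in route("collision-data-chunk"):
    print(hop)
```

Modeling the grid as a tree like this also hints at why it scales: each tier fans out to many sites below it, so no single center has to absorb the full data stream.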
It is this distributed grid system that allows scientists to handle the enormous amounts of data produced at the LHC. The distributed nature of the LHC Computing Grid is also an important part of making sure the data is open and available to those who need it. University data centers on Tier-2 can connect to any of the Tier-1 hubs, giving them access to the information they need to focus on specific areas of research.
So, is that it? Do we know all there is to know about the nature of the universe? Far from it. There are still deep mysteries at the quantum level, including the possible unification of quantum mechanics and general relativity, the verification (or not) of string theory, and supersymmetry. One thing we can be sure of is that all of these potential discoveries will rely on improving technologies in data management, storage, distribution, and analysis.