
Ready, Set, Process: Preparing for Rubin Observatory's Data Deluge
A look inside the data processing infrastructure built by the NSF–DOE Vera C. Rubin Observatory to handle the Universe’s greatest data challenge.
Profile
Name: NSF–DOE Vera C. Rubin Observatory
Location: Cerro Pachón, Chile
Optical design: Reflecting telescope
Primary mirror diameter: 8.4 meters
Operational waveband: Ultraviolet/Optical/Infrared
Altitude: 2663 meters (8737 feet)
Science goals:
- Understanding the nature of dark matter and dark energy
- Creating an inventory of the Solar System
- Mapping the Milky Way
- Exploring objects that change position or brightness over time
15 May 2025
In astrophysics, data are a precious commodity. They are scientists’ raw material, their starting point, their playground. Here, data begin with images of the sky, of distant galaxies, of stars in motion. From each image, scientists extract a trove of information such as brightness, position, velocity, and color. The principle is simple: The more sky you observe, and the more often you observe it, the greater your chances of catching the Universe’s rarest and most fleeting phenomena.
In late 2025, the NSF–DOE Vera C. Rubin Observatory, equipped with the world’s largest digital camera, built at the DOE’s SLAC National Accelerator Laboratory, will begin the Legacy Survey of Space and Time (LSST). It will be the widest, fastest, and deepest sky survey ever made. Every night, the camera will scan the sky with clockwork precision, capturing a new 3200-megapixel image every 40 seconds, each one 8 gigabytes in size. Every three nights, the telescope will revisit the same region of the sky, constructing a time lapse of the cosmos that will unfold over a decade.
NSF–DOE Vera C. Rubin Observatory is jointly funded by the U.S. National Science Foundation (NSF) and the U.S. Department of Energy’s Office of Science (DOE/SC). Rubin Observatory is a joint program of NSF NOIRLab and DOE’s SLAC National Accelerator Laboratory, which will cooperatively operate Rubin.
“I think of the Rubin Observatory as the dashcam for the sky,” says Yusra AlSayyad, the Princeton University researcher who oversees the Rubin image processing algorithms. “Wide-field imaging surveys to date have just given us snapshots. But the sky is not static — it’s alive.” As with a dashboard camera, the value of the 10-year LSST lies not just in seeing what happens but also in being able to go back. “If something strange appears — an explosion, an object vanishing — we can rewind and see what led up to it,” she states.
But with ambition comes volume. The numbers are staggering: 20 terabytes of raw images every night; 60 petabytes by the end of the survey. And that’s just the beginning. Once processed, analyzed, and cataloged, the full data volume will reach 500 petabytes, the equivalent of all the written content ever produced throughout human history.
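These headline numbers are easy to sanity-check. Here is a back-of-envelope sketch in Python; the figure of roughly 300 usable observing nights per year is an assumption for illustration, not a number from the article:

```python
# Back-of-envelope check of the LSST raw data volume, using the
# figures quoted above. NIGHTS_PER_YEAR is an assumed value.
NIGHTLY_RAW_TB = 20      # raw images per night, in terabytes
NIGHTS_PER_YEAR = 300    # assumed usable observing nights per year
SURVEY_YEARS = 10

raw_pb = NIGHTLY_RAW_TB * NIGHTS_PER_YEAR * SURVEY_YEARS / 1000
print(f"Raw survey volume: ~{raw_pb:.0f} PB")  # ~60 PB, matching the article
```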
To manage this deluge of data, Rubin requires innovations that, just two decades ago, were not yet available. “The technology simply didn’t exist twenty years ago,” AlSayyad says. “The same advancements that made services like streaming video possible also make LSST possible: advancements in more abundant storage, faster and more parallel computing, networks to move large volumes of data over long distances, and algorithms.”
Let’s take a closer look at how that works.
Within 7 seconds of each exposure, images are transferred from the mountaintop in Chile to the U.S. Data Facility (USDF) at SLAC in Menlo Park, California. There, initial processing begins: comparing each new image to a reference, flagging differences, and issuing alerts when something new appears.
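The core of that initial step, comparing each new exposure against a reference image and flagging what changed, can be illustrated with a toy difference-imaging sketch. The image size, noise level, and 5-sigma threshold below are invented for illustration; a real pipeline also aligns the images and matches their point-spread functions first:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy difference imaging: a "reference" sky and a new exposure in which
# one source has brightened. Subtracting the reference removes the static
# sky, leaving only noise plus anything that changed.
reference = rng.normal(100.0, 5.0, size=(64, 64))      # static sky
exposure = reference + rng.normal(0.0, 5.0, size=(64, 64))
exposure[30, 40] += 60.0                               # a transient brightening

diff = exposure - reference
noise = diff.std()
candidates = np.argwhere(np.abs(diff) > 5 * noise)     # 5-sigma detections

for y, x in candidates:
    print(f"alert: candidate at pixel ({y}, {x}), {diff[y, x]:.1f} counts")
```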
Each night, up to 10 million alerts are generated, each one a potential cosmic event. These alerts are routed to an ecosystem of specialized software — known as “brokers” and built on machine-learning classification algorithms — that classifies and distributes the alerts to scientists across the globe. Such rapid-response systems are essential to researchers studying transient events: a gamma-ray burst may last only seconds, while a supernova evolves over days or weeks. Detecting events in near real time is what makes follow-up observations possible, such as the early identification of potentially hazardous asteroids.
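The routing idea can be sketched with a minimal toy “broker” that classifies each incoming alert and delivers it to the subscribers for its class. Real brokers apply machine-learning classifiers to light curves and image cutouts; the rule-based logic and alert fields below are invented for the sketch:

```python
# Toy alert broker: classify each alert and route it to the matching
# science channel. The fields ("moving", "rise_days") and the simple
# rules are stand-ins for real machine-learning classifiers.
def classify(alert):
    if alert["moving"]:
        return "solar-system"    # possible asteroid: route for orbit fitting
    if alert["rise_days"] is not None and alert["rise_days"] < 1:
        return "fast-transient"  # e.g. a burst afterglow, fading in hours
    return "supernova-like"      # slower brightening, days-to-weeks timescale

channels = {"solar-system": [], "fast-transient": [], "supernova-like": []}

stream = [
    {"id": 1, "moving": True,  "rise_days": None},
    {"id": 2, "moving": False, "rise_days": 0.2},
    {"id": 3, "moving": False, "rise_days": 12.0},
]
for alert in stream:
    channels[classify(alert)].append(alert["id"])

print(channels)  # each science community receives only its own channel
```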
Within 24 hours, catalogs of detected events are published. After 80 hours, images are transferred to the servers of the France Data Facility (FrDF) at the IN2P3/CNRS Computing Center and the UK Data Facility (UKDF), where they are mirrored and stored. For this, the team draws on expertise from particle physics. “We’re using software originally developed for the ATLAS experiment at CERN, which faced similar challenges: managing huge volumes of data and tens of billions of individual objects distributed across multiple sites,” explains Wei Yang, Information Systems Specialist at SLAC, responsible for deploying the software-based data catalog for Rubin.
And then, the real magic begins.
Every year, a catalog with all the new images will be released. After being crunched by the three data centers, hundreds of snapshots will be merged into ultra-deep composite images. “You stack and stack again the images of the sky. Add them all and you create an incredibly deep picture,” explains Eli Rykoff, Staff Scientist at SLAC, in charge of Rubin image calibration. Scientists will then perform a new round of measurements, pushing the limits of detection even further and generating yet more data.
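The payoff of stacking can be shown in a few lines: averaging many noisy exposures of the same field suppresses the background noise by roughly the square root of the number of frames, so a source far too faint for any single image emerges in the coadd. All numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Coaddition sketch: a faint source buried in noise in every single
# exposure, recovered by averaging 400 frames of the same field.
n_exposures, shape = 400, (64, 64)
true_sky = np.zeros(shape)
true_sky[32, 32] = 3.0                   # far below the single-frame noise

exposures = true_sky + rng.normal(0.0, 10.0, size=(n_exposures, *shape))
coadd = exposures.mean(axis=0)           # the "stack"

print(f"single-frame noise: {exposures[0].std():.1f}")
print(f"coadd noise:        {coadd.std():.2f}")  # ~10 / sqrt(400) = ~0.5
```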
“Once you’ve got the deep image, you go back to every single frame,” explains Rykoff. “Then you ask: ‘What was the light like at this spot, at this moment?’ By repeating this process for all images, we can reconstruct light curves and track the evolution of an object's brightness over time.” This technique will enable studies that map the Milky Way and enhance the understanding of supermassive black holes. It will also lead to the discovery of millions of Type Ia supernovae, the cosmic explosions that trace the matter and energy content of the Universe and help unlock the mysteries of dark energy.
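The idea Rykoff describes, often called forced photometry, can be sketched directly: fix the position found in the deep image, then read out the flux at that same spot in every epoch to build a light curve. The frame sizes, noise level, and sinusoidal variability below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Forced-photometry sketch: once the deep coadd has pinned down a source's
# position, measure the flux at that same spot in every frame, even where
# the source was too faint to be detected on its own.
n_frames, pos = 20, (16, 16)
true_flux = 5.0 + 3.0 * np.sin(np.arange(n_frames) / 3.0)  # a variable source

frames = rng.normal(0.0, 1.0, size=(n_frames, 32, 32))     # noisy epochs
frames[:, pos[0], pos[1]] += true_flux

light_curve = frames[:, pos[0], pos[1]]  # flux at the fixed position, per epoch
for t, flux in enumerate(light_curve):
    print(f"epoch {t:2d}: flux = {flux:5.2f}")
```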
By the end of the LSST, all this processing will form an astronomical dataset without precedent: a catalog of billions of objects, each tracked across time and tagged with dozens of physical characteristics. Handling such a massive trove of information is a challenge in itself. That’s why the Rubin Science Platform was built from the ground up: it gives scientists access to this ocean of data and equips them with a rich set of tools to explore it, analyze it, and ask their most ambitious questions about the Universe, and get answers.
Right now, the data processing system is undergoing its final tests. Scientists are running real data from previous surveys and from the LSST Commissioning Camera through the full pipeline, end to end, to ensure every piece holds under pressure. “We need to move data back and forth, process it, store it, and keep track of it. At this scale, you’re constantly dealing with system reliability and efficiency challenges,” says Yang.
But it is not just about hardware and software. “This kind of distributed processing requires excellent communication — between systems, but also between people,” Yang emphasizes. “It’s a collaborative effort on a global scale.”